The post was originally written by Glenn Paulley and posted to his blog in April of 2008
My kids ask “Are we there yet?” every time we travel anywhere, even if the drive is under 15 minutes.
Lately I’ve been listening to my wife utter a similar question: “Why isn’t it finished yet?”. In this case, it’s because a Microsoft SQL Server reporting application that she is responsible for is taking far longer than anticipated (8 hours rather than an estimated 4), particularly after “test” runs were performed on a copy of the actual production database. Of course, knowing that I manage a team responsible for SQL Anywhere query processing, Leslie asks this subsequent question: “Why can’t the database server tell me how much longer a request is going to take?” My blank stare, accompanied by the response “Because it’s hard”, doesn’t help her much. But it’s a really good question.
I really enjoyed a presentation that Microsoft Research’s Surajit Chaudhuri gave to the Database Group at the University of Waterloo last year that touched on this subject. Surajit’s tongue-in-cheek remarks (not verbatim, as I’m going from memory) on this problem began with:
“Database systems are pretty good at telling the user that their SQL request has yet to begin execution, particularly if the request contains a syntax error. Moreover, database systems excel at telling the user that their SQL request has completed: either by returning a SQLCODE or a result set. However, for anything between these two extremes we really don’t have a clue.”
Why is doing this so difficult? Well, like many other query processing problems the simple problem – say an SQL query over a sequential scan of a single table – is straightforward, and uninteresting. For an application developer, the (much) more useful case is determining when an executing a complex DML statement, typically involving multiple joins, GROUP BYs, DISTINCT, and/or window functions will complete. In a nutshell, the requirement is to (1) determine the expected overall elapsed time of the request, and (2) determine the percentage of elapsed time remaining.
Simple. But the problem is that computing (1) is the job of the optimizer to estimate the overall elapsed time of the query, and that process is far from perfect. Typically, our team is reasonably happy if an estimate is within a factor of 10 of the actual elapsed time for a given request, and positively ecstatic if the estimate is within a factor of 3. At the end of the day, the absolute accuracy of the optimizer isn’t what is important, the relative accuracy is. That is to say, for the optimizer to work properly the following axiom must hold: if strategy A is estimated to be cheaper than strategy B, then the actual elapsed time of strategy A must be less than that of strategy B.
The difficulties with estimating (2) are many. First, it is difficult to estimate the impact of row locking and other concurrency control mechanisms when multiple requests are being serviced. Second, because query execution uses shared resources, such as the disk subsystem and the buffer pool, it is difficult to estimate future performance even with an accurate history of immediately prior execution behaviour. Third, it is very difficult to apportion execution time amongst individual operators within a query’s execution strategy, particularly if there is a mix of pipelined and materializing operators within the plan.
In a nutshell, accurate, robust progress indicators for complex queries remain very much an open research problem.