From the Archives: How to Give a Query Optimizer Fits
This post was originaly written by Glenn Paulley and posted to Sybase.com in September of 2008.
I am keenly interested in studying SQL constructions. Obviously it is part of my job to understand and anticipate common SQL constructions, since the SQL Anywhere server has to optimize and execute those statements. But I am also interested in studying these constructions to gain additional insight into how application developers write their applications, and whether or not SQL is succeeding (or failing) as a database sub-language.
Recent interaction with several customers has exposed some interesting tradeoffs related to particular SQL constructions. Let me illustrate what I mean with a real customer example, though transformed so as to execute over the DEMO database supplied as a sample with SQL Anywhere 11:
CREATE FUNCTION CurrentSalesRep(IN @CustomerID INTEGER, IN @FinancialCode CHAR(2), IN @Region CHAR(20), IN @dFrom DATE, IN @dTo DATE) RETURNS INTEGER BEGIN DECLARE @SalesRep INTEGER; SET @SalesRep = NULL; IF ISNULL( @CustomerID, 0 ) <> 0 then SELECT FIRST r.SalesRepresentative INTO @SalesRep FROM SalesOrders r WHERE r.CustomerID = @CustomerID AND r.FinancialCode = @FinancialCode AND TRIM(ISNULL(r.Region, '')) = TRIM(ISNULL(@Region, '')) AND ISNULL(r.OrderDate, '1900-01-01') <= ISNULL(@dTo, CURRENT DATE) AND ISNULL(r.OrderDate, '2200-01-01') >= ISNULL(@dFrom, CURRENT DATE) END IF; RETURN @SalesRep END
and the function is called with the following parameters:
SELECT CurrentSalesRep( 101, 'r1', 'Eastern', '2000-03-01', '2000-04-01') FROM DUMMY
Aside: I cheated slightly with the reformulation of this example from the original. In the customer’s function, the query refers to two different dates for comparison to @dTo and @dFrom. Unfortunately, the SalesOrder table in the Demo database contains but a single DATE column. For completeness, here is the text plan for the query in the CurrentSalesRep procedure, captured using Application Profiling tracing actual query plans:
( Plan [ Total Cost Estimate: 0.00069293, Estimated Cache Pages: 4960 ] ( RowLimit ( IndexScan ( SalesOrders r ) FK_CustomerID_ID[ r.CustomerID = 102 : 0.99869% Statistics ] [ ( r.FinancialCode IS NOT NULL : 100% Statistics | Bounded ) AND ( r.FinancialCode = 'r1' : 100% Statistics ) AND ( 'Eastern' = trim(isnull(r.Region,'')) : 5% Guess ) AND ( CAST(r.OrderDate AS timestamp) >= 2000-03-01 00:00:00.000 : 25% Guess ) AND ( CAST(r.OrderDate AS timestamp) <= 2000-04-01 00:00:00.000 : 25% Guess ) ] ) ) )
Now for some observations:
- A minor detail: the query uses
SELECT FIRST
but does not contain anORDER BY
clause, which leads to a SQLCODE +122 warning (not deterministic result). - Line 5 of the procedure points to a common data administration problem. In this case, the programmer is hedging their bet that either a NULL value, or a value of 0, represents an “unknown” customer ID. This construction is used throughout the customer’s procedures and views, and as we shall see leads to some interesting side-effects.
- Now the query itself. The first thing to note is line 8: @CustomerID is already verified, but @FinancialCode is not verified at all. Potential problem? Perhaps: all of the other arguments to the procedure are verified. The good thing from a performance standpoint is that each predicate on line 8 is sargable, which can be verified by looking at the text plan on lines 19-21.
- Line 9 is a problem, the result of the use of the ISNULL function again: either NULL or the empty string is to be treated similarly. In SQL Anywhere 11, the query optimizer is reasonably smart: it detects that
TRIM(ISNULL(@Region, ''))
can be evaluated independent of query execution, and for this specific invocation the result is the string “Eastern” (see line 22). However, since the Region column in the SalesOrder table is nullable, no such simplification can be made for the other expression. TheISNULL
function overr.Region
renders the predicate not sargable, and since this expression is not a table column the optimizer cannot utilize a column histogram to estimate this predicate’s selectivity. Rather, the optimizer assumes a “magic” value of 5% selectivity – nothing more than a wild guess. - Lines 10 and 11 result in similar problems to those on line 9, but for a slightly different reason: here the problem is that a NULL value is to be replaced on-the-fly with a default date. Perhaps this substitution is valid only for this procedure, but there are implications: we’ll use the predicate on Line 10 as our example. Again, the query optimizer is smart with the RHS of this inequality, determining (again for this invocation) that the argument @dTo is not NULL, eliminating the need to evaluate the special register
CURRENT DATE
. However, on the other side of the inequality, the literal string ‘1900-01-01’ is not recognized as aDATE
; because strings have to be parsed, and their format depends on connection option settings, this string is assumed to be a potentialTIMESTAMP
. In specific cases, SQL Anywhere recognizes this situation and will automaticallyCAST
the literal constant to aDATE
, but not in this case, again due to the presence of theISNULL
function in the query. This does represent a future optimization opportunity, but it’s not there today. The impact is that the comparison involvingr.OrderDate
takes place in theTIMESTAMP
domain (see lines 23 and 24). The semantics are correct; the problem is again with selectivity estimation because the comparison is in theTIMESTAMP
domain, and not theDATE
domain. In this situation the column histogram onOrderDate
is ignored (another optimization opportunity), and the “magic” (wild guess) selectivity estimate of 25% is used. Replacing the LHS expression withISNULL(r.OrderDate, CAST( '1900-01-01' as DATE))
would permit the comparison to take place in theDATE
domain. SinceSalesOrder.OrderDate
is not nullable, theISNULL
function is pointless anyway: once the optimizer removes it, the optimizer would then be able to treat the predicate as sargable, enabling use of the column histogram for selectivity estimation and potentially retrieving rows ofSalesOrder
through an index onOrderDate
.
At the end of the day, with this example, the lost predicate optimization opportunities due to the parameterized query aren’t that problematic, because they don’t affect the overall access plan: the predicate on @CustomerID
yields an indexed strategy to the SalesOrder table using the foreign key index (line 19). Consequently, access plan caching for this query is possible, which can further improve overall elapsed time by amortizing the cost of optimization over multiple invocations. The downside is that value-specific optimizations, such as the elimination of the ISNULL
function when its parameter is not a NULL value, cannot be performed because with a cached plan the value of the variable isn’t known at optimization time. Plan caching is performed automatically by SQL Anywhere for queries in procedures like this one.
My team here at iAnywhere have discussed possible, adaptive run-time optimization strategies that can mitigate some, though not all, of the issues that parameterized queries can cause, the two main ones being selectivity estimation of predicates and cardinality estimation of intermediate results. The optimizations are important because while the user could, potentially, eliminate the parameters by constructing the exact query required each time – using either nested IF
s within the procedure, or EXECUTE IMMEDIATE
over a constructed query string – the number of possible queries is exponential in the number of variables. Moreover, it takes some expertise to determine if the elimination of a parameter will, at the end of the day, make a different to query execution performance: that is, a tradeoff of execution plan quality versus the overhead of optimization at each and every invocation. To be truly self-managing, a database system needs to do this automatically, and that is a hard problem to solve.
The example above is pretty simple, really: there is only one table in the query. Where things can really break down is when the parameterized query contains several joins. In such cases, mistakes or poor assumptions with selectivity estimation can easily lead to cardinality estimation problems, and should that occur there is considerable risk that the join order chosen by the optimizer will be undesirable. Here is a second example that helps illustrate the problem; again, the query’s context is within a stored procedure, with arguments denoted using the “@” sign:
SELECT c.client_id, c.term_date, c.created_time FROM client c JOIN branch b ON (b.branch_id = c.branch_id) JOIN area a ON (a.area_id = b.area_id) JOIN region r ON (r.region_id = a.region_id) , employee e WHERE r.region_id = COALESCE(@region_id, r.region_id) AND a.area_id = COALESCE(@area_id, a.area_id) AND b.branch_id = COALESCE(@branch_id, b.branch_id) AND emp_id = @emp_id AND @emp_id = CASE e.emp_role WHEN 'BM' THEN b.BRANCH_MGR_ID WHEN 'AVP' THEN a.AREA_MGR_ID WHEN 'RP' THEN r.REGION_MGR_ID WHEN 'RA' THEN r.REGION_ADMIN_ID WHEN 'CA' THEN e.emp_id WHEN 'CU' THEN e.emp_id END;
The query in this example suffers from similar issues as in the first example. However, since joins are involved, cardinality estimation errors may lead to sub-optimal plans. In particular, note the CASE
expression at the end of the query’s WHERE
clause: not much can be done here, other than evaluating it at the very end of the join strategy, since it depends on almost all of the tables in the plan. Such placement in the access plan renders the predicate useless to restrict the size of any intermediate results.
My thanks to Dave Fishburn of iAnywhere’s consulting group for the second example, and Ivan Bowman for confirming the technical details with respect to predicate optimization.