In this post, originally written by Glenn Paulley and posted to sybase.com in May of 2008, Glenn talks about why using set operations can be very good for performance when working with large volumes of data.
One of the problems that the ad-hoc relational mappings offered by the Hibernate object persistence library causes is the generation of queries that contain subqueries – sometimes in a plentiful number. Subqueries can be very difficult to optimize because it remains a challenging research problem to estimate both their cost and their selectivity. SQL Anywhere contains quite sophisticated query rewrite optimizations for subqueries. However, for most commercial products, including SQL Anywhere, the optimization and execution of subqueries remains problematic. This week I’ve been doing some problem determination for a customer application, and what I want to convey in this post is that the impact of subquery execution can be pronounced, to the extent that it may affect the scalability of the application.
What I have commonly seen in complex queries from a variety of applications is a tendency to utilize subselects or user-defined functions to provide some measure of abstraction within complex queries. The complex query I was analyzing today – written by hand – is a four-way join of complex views, and the first view is itself a complex view that, even after rewrite optimizations that convert subqueries to joins, contains 48 subqueries. Most, though not all, of the subqueries are of the following form:
SELECT <select list from T>, (SELECT R.X FROM R WHERE R.PK = T.X) FROM T
The subselect is guaranteed to return a single row because the WHERE clause of the subselect covers the primary key columns of table R, and if T.X is NULL the subselect will return a NULL value. These properties mean that the nested query above is equivalent to the outer join query:
SELECT <select list from T>, R.X FROM T LEFT OUTER JOIN R ON (T.X = R.PK)
To illustrate the performance differences between the constructions, I ran queries similar in construction to the above examples using an actual customer database. In my tests, table T is 635K rows, and table R is a view over a single base table comprising 1.1 million rows. A slight difference from the example above is that table R has a 4-column composite key, hence the search condition in both the subselect and the left outer join’s ON condition contains four equality predicates. To ensure that the query’s constructions were kept intact, but to eliminate client-server transmission costs, I encapsulated the queries above in a derived table, and used aggregate functions in each case to limit the final result set to a single row. I would have used the FETCHTST utility to compare execution times, but I used graphical plans with actual statistics from DBISQL to capture detailed statistics about each execution. The results:
- As a subquery, the request completes in 15.29 seconds. Subquery memoization does not pay off with this request, because the correlation values from T are (almost) unique, and consequently previously memoized results cannot be used for subsequent invocations.
- As an outer join, with a single CPU (thread) the query executes in 7.92 seconds. The reason: the left outer join avoids nested-iteration semantics. If intra-query parallelism is enabled (the default), the query’s execution time falls to 2.16 seconds as the result of using parallel hash outer join.
- Finally, I rewrote the subselect as a PSM user-defined function, another technique often used to encapsulate logic in an SQL statement. The function looks something like this:
create function dba.foo(in @x integer, in @y integer, in @w integer, in @z integer) returns char(1) begin declare @lResult char(1); if ISNULL(@x,-1) = -1 then return 'N' end if; select R.X into @lResult from R where R.PK1 = @x and R.PK2 = @y and R.PK3 = @w and R.PK4 = @z; return(@lResult) end
and the query becomes
select max(dt.a), max(dt.y) from ( select T.a, foo(Tx, T.y, T.w, T.z ) as y from T ) as dt
The running time for this query is nearly 6 minutes (320 seconds). The reason? The conditional logic in the function’s preamble prevents inlining into the query proper, and the execution cost includes the construction/deconstruction of the execution context required for the user-defined function and, additionally, the periodic re-optimization of the UDF’s SELECT statement.
The lesson? Set-level operations aren’t critical when dealing with small volumes of data or transactions, but their power can be exploited with larger data volumes. Your mileage, of course, may vary. Could relational systems execute these constructions more efficiently? Yes, though that degree of sophistication will take significant time to implement. More on that in another post.