Why the SQL Standard does not need another way to ...

former_member204193 · ‎07-27-2016

In a recent article, Lukas Eder proposed enhancements to the SQL Standard’s GROUP BY syntax to include support for implicit grouping attributes, such as that provided by the Cypher query language. I applaud Lukas’ efforts to define useful changes to the SQL language, but I am going to try to show in this article that the proposed changes are incompatible with the existing SQL standard and commercial SQL implementations. In my examples, I am going to use my leagueDB schema that contains information about the teams and schedule for a hockey league, and is populated with NHL players from eight teams as of February, 2015:

Features of grouping and aggregation in SQL

The major issue that Lukas identifies in his article is the semantics of SQL grouped queries when the expressions in the SELECT clause do not match the expressions in the GROUP BY clause, so we consider the basics of grouped queries first. My SQL examples are executed using Oracle 11gR2, but most SQL implementations including DB2 and SAP SQL Anywhere will produce similar results.

GROUP BY fundamentals

Most readers are familiar with the ubiquitous SELECT-FROM-WHERE-GROUP BY but the order of evaluation of the clauses is important, particularly with the addition of PARTITION BY for window functions. The SQL/OLAP windowing extensions allow a user to divide the result set of a query into groups of rows called partitions. Logically, as part of the semantics of computing the result of a query specification, partitions are created after the groups defined by the GROUP BY clause are created, but before the evaluation of the final SELECT list – projection and duplicate elimination – and the query’s ORDER BY clause. Consequently, the order of evaluation of the clauses within an SQL statement becomes:

From → Where → Group by → Having → Window → Distinct → Order by.

Consider a simple query specification containing SELECT-FROM-WHERE-GROUP BY, something like the following example query over the PLAYER table in the leagueDB schema, which will return a result set consisting of the number of players of each country represented in the database that have players taller than 200 cm:

SELECT country, COUNT(*) FROM player WHERE height > 200 GROUP BY country;

The (familiar) semantics of this query are as follows:

The FROM clause is evaluated, producing a bag of tuples from the player table.

The resulting rows are filtered by the search condition in the WHERE clause, restricting the bag of tuples to only those players known to have a height greater than 200cm.

The resulting rows are then grouped by distinct values of the country attribute, as per the GROUP BY clause. Once the groups have been formed, the COUNT(*) aggregate function is computed over the rows in each group, yielding a set of result tuples containing pairs of values: the value of the country attribute, and the value of the COUNT(*) aggregate function, the number of rows in each group.

The tuples produced by the GROUP BY clause are then projected using the attribute list in the SELECT clause.

In most commercial database system products, including SQL Anywhere, Oracle, DB2, and Microsoft SQL Server, an attempt to project an attribute from the SELECT clause that is not produced by the grouping operation of the GROUP BY clause will result in an error. For example, suppose we attempt the following query:

SELECT country, state_province, COUNT(*)

FROM player

WHERE height > 200

GROUP BY country;

The inclusion of the state_province attribute in the SELECT list is problematic because that attribute is not produced by the grouping operation – hence each of the above systems will generate a syntax error for this request.

However, MySQL will not generate an error for the statement above. With MySQL, its engine will produce an arbitrary value drawn from the bag of rows combined to form each group. In many cases this value is taken from the first tuple discovered to create a group, but there is no guarantee of determinism.

(Aside: note that if the additional attribute(s) in the SELECT list are formed from subquery expressions, and if any correlation variables in the subqueries are not taken from the set of grouping expressions or aggregate functions produced by the GROUP BY clause, then these queries as well are problematic and will return a syntax error).

The astute reader will realize that if the additional attributes in the projection (the SELECT list) are functionally determined – that is, satisfy a functional dependency – by the grouping expressions, then the result of the query will be deterministic for any instance of the database. If we have

SELECT X, Y, COUNT(*) FROM T GROUP BY X;

and the functional dependency X->Y holds, then the query will return a deterministic result, since there will be a single value of Y for any group of rows produced by the GROUP BY clause. This notion is supported in the ISO SQL Standard by directly supporting the computation of derived functional dependencies with each clause in an SQL statement; but few products implement these semantics. Unfortunately, MySQL is too liberal and its non-determinism with GROUP BY semantics is, in my view, quite unfortunate, for two reasons:

First, MySQL’s non-determinism is difficult to explain to novice SQL users when explaining the semantics of SQL’s GROUP BY operator. Students struggle when formulating queries because MySQL does not return an error when the query is malformed; MySQL assumes the programmer knows what she is doing, and hence the programmer can fail to realize the non-deterministic nature of their SQL request.

Second, discovering these semantic errors is difficult in testing. I have first-hand experience with situations where a MySQL database application passes all of its tests in a test environment, but fails in production due to the non-deterministic evaluation of the additional expressions in the GROUP BY clause, often exposed by different query plans with production data.

What makes MySQL’s non-determinism of these forms of queries truly unfortunate is that, in my experience both with students and with commercial applications, often the inclusion of an extra attribute in a grouped query’s SELECT list is simply an error on the part of the programmer. Unfortunately, unless the user attempts to port the query to another DBMS, or is knowledgeable enough of SQL and their application to discover the error through static analysis, the query will go into a production version of the application with a latent defect.

Derived functional dependencies can be computed from the primary key, foreign key, uniqueness, and CHECK constraints specified for the schema, in addition to analysis of the query’s GROUP BY and WHERE clauses. For example, in this similar query:

SELECT country, state_province, COUNT(*)

FROM player

WHERE height > 200 AND country = 'Canada'

GROUP BY state_province;

the functional dependency {} -> country trivially holds, and by Armstrong’s axioms of functional dependencies we then have state_province -> country, so this query will be deterministic, even in MySQL. However, since most commercial systems fail to analyze FDs as per the SQL 2011 Standard, queries like this will still return a syntax error – in Oracle 11gR2, for example, the above query yields the common “ORA-00979: not a GROUP BY expression” error.

The GROUP BY clause isn’t optional

From the earliest days of SQL, the result of a query involving aggregation but not involving GROUP BY has been clear: the query result consists of a single row containing the results of the aggregate functions in the query’s SELECT list. For example,

SELECT COUNT(*) FROM player WHERE height > 200;

yields a single row, and with the instance of the leagueDB database I’m working with, the result of the COUNT(*) aggregate function is 1 – the only player in the database that is taller than 2m is Zdeno Chara, a defenceman with the Boston Bruins.

If the query contains a WHERE clause it is possible that the restriction of the bag of tuples from the FROM clause will be the empty set. In this case:

SELECT COUNT(*), AVG(weight)

FROM player

WHERE height > 200 AND country = 'Canada'

still yields a single row in the result, with the values of 0 and NULL, respectively, for the two aggregate functions, since there are no Canadian players in the database taller than 200cm. The distinction – an arbitrary one, from the beginnings of the SQL language – between COUNTand all other aggregate functions which yield the NULL value when computed over the empty set is known as the ‘COUNT bug’ and has been responsible for a multitude of DBMS query processing bugs in numerous systems over the past 30 years – for example, see references [1-6] below.

However, now consider adding a GROUP BY clause to the query – for example, to count the number of tall players by country, which will return a row in the result for each country:

SELECT country, COUNT(*), AVG(weight)

FROM player

WHERE height > 210

GROUP BY country;

This query returns no rows whatsoever – the empty set, as there are no players in the database taller than 210cm (Zdeno Chara is listed at 206cm tall). Note that in this case a single row is not returned. Why? What the GROUP BY operator produces is a bag of tuples, with each tuple corresponding to a group of rows in the input. If the input to GROUP BY is the empty set, then there are no groups, and hence GROUP BY produces nothing. Similarly, a HAVING clause in a grouped query might restrict all of the tuples returned by the GROUP BY clause so that, once again, no tuples generated by the GROUP BY operator will appear in the result.

Now consider the previous example but with a GROUP BY clause added:

SELECT COUNT(*), AVG(weight)

FROM player

WHERE height > 200 AND country = 'Canada'

GROUP BY ()

On the surface, the addition of the GROUP BY () appears harmless since the original statement was over the entire set of rows in the player table, restricted by the query’s WHERE clause. However, due to the existence of GROUP BY () – syntax originally added when the ISO SQL Standard added support for GROUP BY GROUPING SETS – this query will not return the same result if the input to the GROUP BY clause is empty. This is an important distinction in the semantics of the SQL language – and it makes a difference in how one would write an application program that executes such a query.

The impact of WINDOW

Because window partitioning semantically follows a GROUP BY operator, the result of any aggregate function, such as SUM(), AVG(), and VARIANCE(), is available to the computation done for a partition. Hence a window provides another opportunity to perform grouping and ordering operations in addition to a query’s GROUP BY and ORDER BY clauses. However, any computation involving a window function’s result – for example, using it in a predicate — will require a more complex statement construction. Typically that construction will require a derived table to compute the window function result(s), with the derived table’s outer SELECT block containing the desired WHERE clause predicates. A window’s partition can consist of the entire input, a single row, or something in between. The advantage of a ‘window’ construct is that the rows within the partition can then be sorted to support the additional expressive power provided by window functions.

Discussion

As we have seen, it makes a semantic difference to a query’s result if the query contains a GROUP BY clause or not, and MySQL’s ‘flexibility’ in handling of grouped queries can be argued as both a feature or as a bug, depending on your point of view. One can argue that SQL’s overall semantics are ill-formed: many would argue that the arbitrary distinction between COUNT(*) and other aggregate functions is unfortunate. Moreover, many would argue that GROUP BY‘s treatment of NULL values – as equivalent values for grouping, set operations, and duplicate elimination – is nothing short of bizarre, given the three-valued logic used in the remaining portions of the language. While the ISO-IEC committee has not historically been overly concerned with backwards compatibility or precisely matching the behaviour of existing DBMS implementations, “tinkering” with the syntax of the language given the above is not, I believe, in the best interests of commercial products or customers.

Lukas’ proposed extensions to the SQL language are well-intentioned but they unfortunately conflict with behaviour already specified in the SQL Standard. That said, what I would enjoy seeing is the holistic design of a language that encompasses the range of relational operators described above – projection, duplicate elimination, restriction, grouping, and partitioning, coupled with the treatment of NULL values – and puts the semantics of each on a firmer footing. In my view we as a database community should be looking at better overall replacements for a query language that, despite its visionary and unprecedented success, is now 40 years old. Surely we have learned some things about language design in the past four decades.

References

[1] Werner Kießling (August, 1985). On Semantic Reefs and Efficient Processing of Correlation Queries with Aggregates. In Proceedings of the 11th International Conference on Very Large Data Bases, August 21-23, 1985, Stockholm, Sweden, pp. 241-250.

[2] Günter von Bültzingsloewen (September, 1987). Translating and Optimizing SQL Queries Having Aggregates. In Proceedings of 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pp. 235-243.

[3] Richard A. Ganski and Harry K. T. Wong (May 1987). Optimization of Nested Queries Revisited. In Proceedings of the ACM SIGMOD Conference, San Francisco, California, May 1987, pp. 23-33.

[4] Umeshwar Dayal (August 1987). Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers. In Proceedings of 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pp. 197-208.

[5] M. Negri and G. Pelagatti and L. Sbattella (September 1991). Formal Semantics of SQL Queries. ACM Transactions on Database Systems 16(3), pp. 513-534.

[6] M. Muralikrishna (August 1992). Improved Unnesting Algorithms for Join Aggregate SQL Queries. In Proceedings of the 8th International Conference on Very Large Data Bases (VLDB), Vancouver, British Columbia, pp. 91-102.