From the Archives: The First Deadly Sin
In this post, originally written by Glenn Paulley and posted to sybase.com in April of 2011, Glenn talks about some of the critical components of database design and how they can impact overall application performance.
I previously introduced the Seven Deadly Sins of Database Application Performance. Our first deadly sin concerns physical database design, which includes schema design issues, table column order, indexing, and the choice of database page size, to name but a few of the factors that can adversely impact performance. In this article, I want to explore a particular point about database design in SQL Anywhere: the concept of domains.
What are domains?
Domains are part of the original relational model of data as first developed by E. F. Codd. In his seminal text [1, p. 81], Chris Date defines a domain as follows:
Next, we define a domain to be a named set of scalar values, all of the same type. For example, the domain of supplier numbers is the set of all possible supplier numbers, the domain of shipment quantities is the set of all integers greater than zero and less than 10,000 (say). Thus domains are pools of values, from which actual attribute values are drawn.
In a nutshell, Codd defined domains in the relational model much as we might define strong typing in a programming language. As in programming languages, one intention behind the use of domains is to prevent application developers from making mistakes, such as comparing invoice numbers to customer numbers. While both sets of values may be (say) integers, comparing an invoice number to a customer number would – normally – not make a great deal of sense.
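The strong-typing analogy can be made concrete outside of SQL. The following Python sketch (illustrative only; the class names are made up for this example) emulates two domains over the same underlying integer type, so that a cross-domain comparison is rejected:

```python
class DomainValue:
    """Base class for domain-typed scalars: equality is defined only
    within the same domain, mirroring Codd's intent for domains."""
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Comparing values from different domains is a type error,
        # even though both wrap plain integers.
        if type(other) is not type(self):
            raise TypeError(
                f"cannot compare {type(self).__name__} "
                f"to {type(other).__name__}")
        return self.value == other.value

class InvoiceNumber(DomainValue):
    pass

class CustomerNumber(DomainValue):
    pass

# Same-domain comparison works as expected:
assert InvoiceNumber(42) == InvoiceNumber(42)

# Cross-domain comparison is rejected, although both values are "integers":
raised = False
try:
    InvoiceNumber(42) == CustomerNumber(42)
except TypeError:
    raised = True
assert raised
```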
While that’s the theory, support for domains in commercial relational database products has not gained significant traction over the past forty years. Sure, virtually every commercial database system, including SQL Anywhere, supports the definition of DOMAINs – sometimes referred to as user-defined data types – but the implementation of these DOMAINs and the intent of domains in the relational model are two very different things. All commercial RDBMS products permit loose typing; SQL Anywhere, for example, will happily execute an SQL statement that contains a comparison between a numeric column and a string. Ivan Bowman has written a whitepaper that outlines the semantics of the implicit type conversions that take place in SQL Anywhere when a mismatched comparison must be evaluated. (Aside: commercial products each have their own implicit conversion rules, and the semantics of such comparisons are implementation-defined.)
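The implementation-defined nature of these conversions is easy to demonstrate. The sketch below uses SQLite (whose "type affinity" rules differ from SQL Anywhere's – it is an illustration of the general phenomenon, not of SQL Anywhere's specific semantics): the very same mismatched comparison succeeds or fails depending on whether a column's declared type is involved.

```python
import sqlite3

# Illustration only: SQLite's implicit-conversion rules, which, like
# every product's, are implementation-defined.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_id INTEGER)")
conn.execute("INSERT INTO invoices VALUES (100)")

# Comparing the INTEGER column to the string '100': SQLite applies the
# column's numeric affinity to the literal, so the row matches.
n_match = conn.execute(
    "SELECT COUNT(*) FROM invoices WHERE invoice_id = '100'").fetchone()[0]
print(n_match)  # 1

# The same mismatched literals with no column affinity involved do NOT
# match: an integer and a text value compare as unequal in SQLite.
literal = conn.execute("SELECT 100 = '100'").fetchone()[0]
print(literal)  # 0
```

A different engine can legitimately give different answers to both queries, which is exactly the author's point about implementation-defined semantics.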
Such flexibility may seem advantageous to application developers, but it’s not. It’s a sin. Let me explain why.
SQL is not “C” – or any other programming language
It is exceedingly important that application developers remember that SQL is a data sub-language based on tuple relational calculus. In a nutshell, that means that SQL is based on first-order predicate logic, where the query specifies what is to be computed, but not how. (Aside – I well realize that the SQL language now embodied in the SQL/2008 standard contains a mix of relational calculus and relational algebra – set operators such as UNION are algebraic – but in the main an SQL SELECT block is still calculus-based.)
By basing its semantics on predicate logic, SQL permits an implementation to translate the original calculus-based query into an access plan of algebraic operations – hopefully an efficient one. Different algebraic rewritings are of course possible, so long as the ordering of those operations yields an equivalent result. As an example, suppose we have the following nested query:
SELECT * FROM CORPORATE_SUPPLIERS AS CS
WHERE CS.SUPPLIER_ID IN
    ( SELECT S.SUPPLIER_ID FROM LOCAL_SUPPLIERS AS S
      WHERE S.ADDRESS LIKE '%Ontario%' AND S.REGION = 'Eastern Canada' )
Here, we are in search of suppliers who are listed as Eastern Canadian suppliers in the LOCAL_SUPPLIERS table, and who are in turn extant in the CORPORATE_SUPPLIERS table.
So far, so good. We have written this query as a nested query, where the semantics seem clear: first, find those suppliers in the LOCAL_SUPPLIERS table who are from Ontario, and then use that intermediate result as a filter so that only the suppliers from the CORPORATE_SUPPLIERS table with the same key are returned.
With a small amount of background knowledge, it should be clear that the above query should be equivalent to this SQL query, written as a join:
SELECT DISTINCT CS.*
FROM CORPORATE_SUPPLIERS AS CS, LOCAL_SUPPLIERS AS S
WHERE CS.SUPPLIER_ID = S.SUPPLIER_ID
  AND S.ADDRESS LIKE '%Ontario%' AND S.REGION = 'Eastern Canada'
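The equivalence of the two formulations can be checked mechanically. This sketch runs both queries against SQLite with a few made-up rows (the data is invented for illustration; SQLite stands in for any SQL engine here) and confirms they produce the same result set:

```python
import sqlite3

# Build the two tables from the example with some invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CORPORATE_SUPPLIERS (SUPPLIER_ID INTEGER, NAME TEXT);
    CREATE TABLE LOCAL_SUPPLIERS (SUPPLIER_ID INTEGER, ADDRESS TEXT, REGION TEXT);
    INSERT INTO CORPORATE_SUPPLIERS VALUES (1, 'Acme'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO LOCAL_SUPPLIERS VALUES
        (1, 'Waterloo, Ontario',    'Eastern Canada'),
        (2, 'Halifax, Nova Scotia', 'Eastern Canada'),
        (3, 'Ottawa, Ontario',      'Eastern Canada');
""")

# The original nested (IN-subquery) formulation.
nested = conn.execute("""
    SELECT * FROM CORPORATE_SUPPLIERS AS CS
    WHERE CS.SUPPLIER_ID IN
        ( SELECT S.SUPPLIER_ID FROM LOCAL_SUPPLIERS AS S
          WHERE S.ADDRESS LIKE '%Ontario%' AND S.REGION = 'Eastern Canada' )
""").fetchall()

# The join rewriting; DISTINCT guards against duplicate CS rows.
join = conn.execute("""
    SELECT DISTINCT CS.*
    FROM CORPORATE_SUPPLIERS AS CS, LOCAL_SUPPLIERS AS S
    WHERE CS.SUPPLIER_ID = S.SUPPLIER_ID
      AND S.ADDRESS LIKE '%Ontario%' AND S.REGION = 'Eastern Canada'
""").fetchall()

# Both formulations return the Ontario suppliers 1 and 3.
assert sorted(nested) == sorted(join)
print(sorted(nested))  # [(1, 'Acme'), (3, 'Gamma')]
```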
This sort of transformation is at the very root of query optimization. The query optimizer is free to re-arrange the query and evaluate the predicates in any order as long as the original semantics are maintained. In the main, the sophistication of a query optimizer lies in its ability to perform these kinds of rewritings, amongst other things, to determine the least expensive access plan for this SQL request – or, more correctly, the least expensive plan it can find in a relatively short, finite amount of time.
The fly in the ointment, however, is weak typing.
What if the supplier numbers in the CORPORATE_SUPPLIERS table were numeric, but the supplier numbers in the LOCAL_SUPPLIERS table were a mix of values, some numeric and some alphanumeric? Suppose the application developer knows that suppliers in Eastern Canada are guaranteed to have numeric identifiers. Then, if the query is executed using the semantics of nested iteration, all will be well. However, if the query is executed as a join, the optimizer may choose to evaluate the query’s predicates in an order that does not guarantee that only Eastern Canadian suppliers are joined to the CORPORATE_SUPPLIERS table – and, because SQL Anywhere uses a numeric domain to compare a number and a string, and an alphanumeric string will not convert properly, a data exception may result.
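The plan-dependence of the failure can be seen in a pure-Python sketch (made-up data; Python's `int()` stands in for the database's string-to-number conversion). The same logical query either succeeds or raises, depending solely on the order in which the "predicates" are evaluated:

```python
corporate = {100, 200}                      # numeric supplier ids
local = [                                   # some ids are alphanumeric
    ("100",  "Eastern Canada"),
    ("200",  "Eastern Canada"),
    ("A300", "Western Canada"),             # non-numeric id, other region
]

# "Nested iteration" order: apply the region filter first, so the
# alphanumeric id is eliminated before any conversion is attempted.
safe = [int(sid) for sid, region in local
        if region == "Eastern Canada" and int(sid) in corporate]
assert safe == [100, 200]                   # no error: 'A300' never converted

# A join-first plan: the conversion runs before the region filter,
# so the non-numeric id triggers a conversion failure.
caught = None
try:
    risky = [int(sid) for sid, region in local if int(sid) in corporate]
except ValueError as exc:
    caught = exc                            # int('A300') fails here
print("data exception:", caught)
```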
I chose this example deliberately, because entity-type hierarchies often lead to these sorts of problems, which are extremely difficult to discover during testing because the generation of the error is access-plan specific. The real issue – or sin, if you will – is the poor choice of data types for the identifiers of the two tables. There are many reasons to utilize surrogate keys in a schema – I have described some of these issues elsewhere in this blog – but standardizing domains (or data types, at a bare minimum) is important to ensure that queries return expected results, and that the query optimizer can use the appropriate indexes, if they exist, to speed retrieval of rows that match columns or values of the same type (i.e. the same domain). Otherwise, the application developer is forced to resort to “SQL gymnastics”, such as the use of IF-expressions, to try to prevent predicate re-orderings that can lead to such anomalies. Unfortunately, IF-expressions and similar constructions are often complex to write, difficult to maintain, expensive to evaluate, and are not sargable.
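To make the "SQL gymnastics" concrete, here is one such guard, sketched in SQLite via Python (SQLite has no IF-expression, so CASE plays the same role; the table and values are invented for the example). Non-numeric identifiers are never handed to the conversion, but the price is a predicate that is opaque and not sargable:

```python
import sqlite3

# A mixed-identifier column, as in the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LOCAL_SUPPLIERS (SUPPLIER_ID TEXT);
    INSERT INTO LOCAL_SUPPLIERS VALUES ('100'), ('A300');
""")

# The guard: only ids that look numeric are converted and compared;
# everything else falls through to FALSE. No index on SUPPLIER_ID can
# be used to evaluate this predicate.
rows = conn.execute("""
    SELECT SUPPLIER_ID FROM LOCAL_SUPPLIERS
    WHERE CASE WHEN SUPPLIER_ID GLOB '[0-9]*'
               THEN CAST(SUPPLIER_ID AS INTEGER) = 100
               ELSE 0 END
""").fetchall()
print(rows)  # [('100',)]
```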
[1] C. J. Date (1995). An Introduction to Database Systems, Sixth Edition. Addison-Wesley.