From The Archives: Surrogate Identifiers
In this post, originally written by Glenn Paulley and posted to sybase.com in December of 2008, Glenn provides some background and insight about the use of surrogate keys in your database schema.
It is almost always a good idea to define surrogate identifiers for business entities such that the identifiers are under your control. What I mean by “your control” is that your business defines the format, value ranges, and other properties of the identifier, and not a third-party (such as a government agency or another business). Why? There are several reasons, which I’ll go into a bit later, but certainly one reason is to eliminate the possibility of duplicate, changed, or simply incorrect identifiers when they are specified by external sources. By and large I think this idea is well-understood: most of us would not use a person’s telephone number as an identifier for an individual, because not only can people share a phone but also people can and do change their phone numbers.
So if surrogate identifiers are a good idea, the subsequent question is how to create them, a question I see asked frequently.
One possibility is to use built-in surrogate key generation mechanisms, which include [GLOBAL] AUTOINCREMENT
(or IDENTITY
fields), GUIDs, sequences (if you’re using Oracle), and the like. In SQL Anywhere, autoincrement columns are simply integer values, stored in a numeric type. With GLOBAL AUTOINCREMENT
the generated values are guaranteed to be unique across a set of distributed databases because the value is partitioned into two halves, the higher-order value being the “database” value and the lower-order value being the increment value for that table at that specific site.
Autoincrement values and GUID generation both yield unique values, and both mechanisms are supported by several ORM packages such as Hibernate. If using Mobilink, GLOBAL AUTOINCREMENT
or GUID identifiers are encouraged because they eliminate distributed key conflicts.
In a nutshell, the two technical issues that autoincrement and GUID values solve are (1) elimination of the possibility of key conflicts, and (2) be able to do so efficiently by using the built-in facilities in the server to generate the key values, rather than conjuring a mechanism within the application. The point I wish to make, however–my apologies to those of you who have seen me speak at Sybase Techwave, where I have made similar remarks in the past–is that while autoincrement and GUID values solve those two technical problems, there are other issues to consider. Here’s a list of them:
- In my experience it is very difficult to “hide” surrogate identifiers. Despite perhaps the best of intentions by the application developer, eventually identifiers become exposed to either users within the business, external people or organizations (such as customers), or both. Because of that, and for ease of debugging both system and business processing, it is a good idea to differentiate surrogate identifiers by their format for important business objects. One might have a customer identifier such XYZ456, a product code of 7877-876, and a supplier identifier of 879BC. Differentiating them by format (but not embedding information into their value) enables instant recognition of a class of objects, which can save time and confusion when interacting with third parties such as customers (Aside: imagine the confusion if Bell Canada used 7-digit or 10-digit customer account numbers with the same format as telephone numbers.) When using alphabetic values, one should be careful to avoid the use of vowels to eliminate forming (possibly undesirable) words as part of the key value.
- Autoincrement and GUID identifiers are not self-checking. In Canada (and unlike the US), social insurance (social security) numbers (SIN) are self-checking: they are 8 digits with a 9th check digit at the end. The check digit is formed by an algorithm (published by Revenue Canada) that eliminates the possibility of transposing two digits of a SIN but still yielding the same check digit. One can use a similar technique for surrogate identifiers, making it possible to distinguish between simply invalid identifiers and valid ones that cannot be found–a real advantage in environments like a call center.
- With autoincrement (integer) values, transposition errors are a real concern, particularly if the numeric value is longer than 3 or 4 digits. GUIDs suffer from a similar problem, only it’s worse because a GUID value can be virtually any binary string. Moreover, the length of a GUID value makes it difficult to type, introducing a higher probability of transposition or other errors. GUID values are also relatively inefficient as key values in an index, because of their length.
- One can reduce the frequency of error by making identifiers more difficult to type: this was the impetus behind the introduction in Canada several decades ago of Canadian Postal Codes, which are six characters in two groups of three. For example, the Postal Code for the iAnywhere head office in Waterloo is N2L 6R2. The alternating letter-number pattern is difficult for all but the most skilled touch-typists; by slowing data entry, errors are reduced. Moreover, the pattern, which is fixed, makes it impossible to mistake an “S” for a “5” or a “1” for an “L”.
I was introduced to these ideas as a Database Administrator twenty years ago. These are still good ideas: however, to generate one’s own surrogate keys requires application software that must be coded very carefully to ensure that it will scale.
I was prompted to write this up by a customer issue that I’m currently working on. This customer uses an autoincrement value to identify individuals; when doing online lookups by identifier value, the application generates a SELECT TOP
query with a FASTFIRSTROW
hint to display a list of all of the individuals who have identifiers that begin with the digits typed by the user thus far.
Because autoincrement was used, however, if the user types a “1” there may exist a valid individual whose identifier is “1”, between 10 and 19, between 100 and 199, between 1000 and 1999, and so on. This makes the initial query, based on the first digit, both inefficient (lots of matching rows) and, likely, worthless. Instead, if the system specified a specific pattern for the identifier (even leaving it completely numeric), such as seven digits (ie 9999999), then searching for by identifier value could be deferred until the user had typed enough digits to make the search worthwhile (say 4 or 5), reducing the number of queries sent to the server and permitting a much more refined result because a BETWEEN
predicate could be used.