In this A Deeper Look At Sybase, we’ll now take a look at Sybase IQ, Sybase’s analytics database.
Somewhere around 1994 Sybase acquired a company named Expressway that had taken a novel approach towards handling analytics-type workload (although the word ‘analytics’ did not really exist in today’s sense; ‘data warehousing’ was more like the buzzword of that time. And “business intelligence” was something you had as a person rather than as an IT system).
Their principal innovation was to physically store data not by row, but by column — despite fully supporting the conceptually row-oriented ANSI SQL standard.
Sybase renamed the product “IQ” (for “Intelligent Query”, should you wonder) and developed it into the powerful analytics engine it is today.
While column-oriented databases are all the rage in the last few years, it was far from a common database concept in those days. Sybase recognized the potential for analytics and IQ became the first commercially available column database.
The underlying idea of a column database is that analytics-type queries rarely touch all columns of a table, but typically only a few, but do so for many rows in a table. By storing all values of a particular column together, the disk I/O required for, say, an aggregation on a column, becomes much more efficient than when the data were stored as rows.
Once the data is stored column-wise, various optimizations become possible that help to boost performance further. For example, column values can be encoded in a small number of bits, with a lookup table holding the actual values. Imagine a column holding the color of an item (“red”, “blue”, “yellow” etc.), and that no more than 14 different colors occur in the actual data. Instead of storing the individual color values, IQ creates a lookup table of those 14 color values and encodes each value in 4 bits (which is enough for 16 values). Since there are many duplicate color values, using this encoding saves space, and therefore improves efficiency of disk I/O and memory usage. Also, that series of 4-bit values can often be compressed quite well itself, bringing further efficiency. When running a query involving the color column, the encoded 4-bit value is used in processing rather than the actual color value itself; only when data is returned to the client is the actual color value substituted back into the result set.
In a nutshell, this is the basic idea of a column-oriented database.
Despite storing data in a column-oriented manner, from the outside it looks as if IQ operates on rows: IQ supports ANSI SQL with full transactional semantics. When needed, a row is reconstructed on-the-fly from the individual column values.
Breaking up a row into columns and reconstructing it takes CPU cycles. The compression and decompression of data is also CPU-intensive. Indeed, IQ was designed to be hungry for CPU and memory. This is different from OLTP-oriented architectures like Sybase ASE, which are essentially optimized for disk I/O bandwidth.
A consequence of the column-oriented nature of IQ is that bulk operations are far more efficient than operations on individual rows. Simply put: updating 1 million rows in a single update statement is very fast in IQ, but when you want to update those rows one by one with an individual update each, you could go off and play a game of golf while it’s running — with time left for lunch. Performing many repeated single row operations should be basically be considered a mistake in IQ.
Another area where Sybase IQ really shines is in bulk-loading of data: as long as you load a lot of data in one operation, it will be very fast. At Sybase we’ve seen load speeds of up to 500GB/hour; loading 1 million rows (or more) per second is typical. Moreover, you can bulk-load data into IQ while your queries continue running — that may sound obvious, but judging by how some competitors peform, it is not.
IQ’s column storage and its additional compression techniques bring some interesting advantages. One phenomenon that often baffles new IQ customers is that the volume of the raw data loaded into IQ actually shrinks once its in IQ — just by loading it, compression will kick in. Compression rates vary, but it is not uncommon to see a reduction to 20-50% of the original raw data size.
A great example is the world record for the world largest data warehouse which Sybase achieved in partnership with Sun Microsystems (when the were still their own). In this case, one Petabyte of raw data was loaded into IQ and occupied only 260 Terabyte of actual storage.
This compression aspect is where column databases in general and IQ in particular have an edge over row-oriented databases. If you want to run Business Intelligence-type applications on a row-oriented database, you will inevitably need to add indexes and pre-aggregate data (creating the ‘cube’). These structures take additional storage space — so when you compare such a row-oriented data warehouse-with-aggegates-and-indexes with the same database in Sybase IQ, the difference in storage size can easily get to a factor of 10 or more. By the time we’re talking about Terabytes of data, such storage space savings are not just convenient, but they also represent real money. As a bonus, they have allowed the Sybase marketing department to push the notion that IQ is “green” technology (‘cos you need less disk storage and less electricity, thus, supposedly, melting less of the polar icecaps. Who would not want to feel good about their data warehouse?)
By default, IQ is a stand-alone single-process database server running on a single host. For scaling to larger numbers of users, IQ also comes in a clustered version known as “IQ Multiplex” (as opposed to “Simplex” for the default single-process server). In Multiplex, multiple IQ server processes run on clustered hardware (like a set of Linux blades on a backplane).
Architecturally, IQ Multiplex is a hybrid cluster. On the one hand it is a shared-disk architecture in the sense that all cluster nodes can access all data. On the level of the data caches of the Multiplex nodes, it is a shared-nothing cluster in that the cluster nodes do not coordinate modifications to cached data directly between each other (a different mechanism is used, which is out of scope for this article).
IQ has some great success stories. One of the better-known customers is Nielsen Media (the folks doing the TV viewer ratings in the USA), who have appeared on stage at past Sybase Techwave conferences. The amount of data they must crunch is huge, and to very short timescales — and for many years, they’ve been doing that with IQ. Another interesting factoid is that *all* insurance companies in South Korea use IQ – the competitive edge provided by IQ must have been irresistable. And there’s a story about a major airline who started using IQ for fraud detection, and discovered entirely new types of fraud they didn’t yet know about. For more customer success stories with IQ, see http://www.sybase.com/detail_list?id=73623
It is worth noting that Sybase IQ is proven technology with a long track record. Even though every database startup these days seems to be doing a column database, there are typically big practical differences with IQ. A particularly telling test is to try running some complex queries concurrently while loading data into the tables being queried (that’s what you do when you refresh your data warehouse while applications continue working — real-time BI, in other words). Many new arrivals on the column database market will have a hard time with such a test, but IQ manages very well.
IQ tends to be especially cost-effective on Linux — but it is of course available on all major platforms.
The current latest version of IQ is version 15.2. The next version -15.3- has been announced to be released in the summer of 2011.
More information about IQ, as well as a downloadable evaluation version, can be found at www.sybase.com/iq.
Next topic: A Deeper Look At Sybase: Replication Server.