Vertical, sure – but how lean can it be?
The vertical or columnar storage architecture for RDBMS used in Data Warehouse / Business Analysis scenarios is widely accepted as state of the art. But lining up data values by column rather than per row is not the end of the story. There are ways to further slice these values, and this article is intended to show the impact of these data structures on the performance and – more important – the scalability of a certain pattern of queries.
The data structures and mechanisms I describe here are from SAP Sybase IQ (all versions). IQ insiders will recognize the Low Fast (LF) and High Non-Group (HNG) index types. For readers new to IQ with a background in algorithms and data structures, I hope to be able to demonstrate the concepts implemented there – and of course tickle their interest.
The vertical storage architecture has an implicit side effect. Unlike in traditional (OLTP style) RDBMS, a data row does not exist as a (potentially structured or fragmented) memory item. It merely exists as a row number used to locate the various column values inside the column vectors. This side effect can be used for specific data structures, especially to support evaluations on columns of low cardinality (cardinality = number of distinct column values). Here, we can store bitmaps for column values where each row is represented by a single bit:
Row # |
Column Value |
R |
G |
B |
0 |
Blue |
0 |
0 |
1 |
1 |
Red |
1 |
0 |
0 |
2 |
Blue |
0 |
0 |
1 |
3 |
Red |
1 |
0 |
0 |
4 |
Blue |
0 |
0 |
1 |
5 |
Green |
0 |
1 |
0 |
6 |
Blue |
0 |
0 |
1 |
7 |
Red |
1 |
0 |
0 |
8 |
Blue |
0 |
0 |
1 |
9 |
Red |
1 |
0 |
0 |
10 |
Blue |
0 |
0 |
1 |
11 |
Green |
0 |
1 |
0 |
There is one bitmap per distinct column value (this number is further referred to as c). The raw size of each bitmap is (number of table rows, further referred to as n) bits. Each position is set in exactly one of the bitmaps, which implies that it is unset in c-1 bitmaps. This typically allows for highly efficient compression of these sparsely populated bitmaps. Otherwise they would occupy n*c/8 bytes which would be quite unpleasant for large values of n (aka Big Data).
Whenever two or more bitmap- indexed columns are used in a query as restrictions (in the WHERE clause) or grouped aggregations (GROUP BY clause), the semantic operations can be mapped to AND or OR combinations of these bitmaps – an operation so elementary in the digital world that any hardware can be expected to execute this highly efficient. And the various permutations of bitmap pairs offer parallel execution of subtasks with low overhead. In the end, we’ll see resulting bitmaps where the bitwise total represents the number of hits, representing the COUNT (…) SQL aggregate.
But this is only the first stage. Common use cases are based upon the calculations of total or average (which is total divided by number) aggregates of numeric values. These cannot be expected to be low cardinality, BUT:
they are made up of bits where each bit represents a fixed value (when using an exact numeric type; it’s not true for the floating point types).
Decimal |
Binary Digits |
|||||
2 |
0 |
0 |
0 |
0 |
1 |
0 |
3 |
0 |
0 |
0 |
0 |
1 |
1 |
5 |
0 |
0 |
0 |
1 |
0 |
1 |
8 |
0 |
0 |
1 |
0 |
0 |
0 |
13 |
0 |
0 |
1 |
1 |
0 |
1 |
21 |
0 |
1 |
0 |
1 |
0 |
1 |
34 |
1 |
0 |
0 |
0 |
1 |
0 |
55 |
1 |
1 |
0 |
1 |
1 |
1 |
5 |
0 |
0 |
0 |
1 |
0 |
1 |
7 |
0 |
0 |
0 |
1 |
1 |
1 |
12 |
0 |
0 |
1 |
1 |
0 |
0 |
19 |
0 |
1 |
0 |
0 |
1 |
1 |
31 |
0 |
1 |
1 |
1 |
1 |
1 |
42 |
1 |
0 |
1 |
0 |
1 |
0 |
Σ: 257 |
3 |
4 |
5 |
8 |
8 |
9 |
Just like we did with data rows, where we aligned the data by column rather than by row, we can do with the single bit positions of the values. These bitmaps are different from those shown above since their number does not depend on column cardinality but on column data type, and of course the same position can be set in multiple bitmaps. But the cool thing about this is that we can calculate the total by counting the bits in each slice and multiplying the result with the weight of the bit position (2 + 3 + 5 + 8 + 13 + 21 + 34 + 55 + 5 + 7 + 12 + 19 + 31 + 42 = 257 = 3*32 + 4*16 + 5*8 + 8*4 + 8*2 + 9*1). Not only can the bit slice operations be executed using simple technical data types, they also can be executed in parallel with close to zero overhead since each slice can be evaluated completely independent of all others.
To give you an idea of the possible scale- out: Let’s assume we need the grouped average (total / count) for the number of client interactions, stored in a 16bit unsigned integer. We’re restricting to status ‘current’ or ‘historic’ (2 out of 20 or 200 possible status values, making up 98+% of all rows) and aggregate by year (assuming our database covers 15 of them). How can this be evaluated:
- The bitmaps representing ‘current’ and ‘historic’ are ORed to a resulting bitmap (I’ll call it CH). This can be expected to be a single thread, but the resulting bitmap can be streamed, meaning the next steps can be started before completion of this step.
- The CH bitmap stream is ANDed with each bitmap of a year value, resulting in 15 bitmaps (I’ll call them Yi). These can be evaluated in 15 independent and potentially concurrent threads, and they can be streamed and pipelined again. As a side task, each of these threads counts the set bits in the resulting bitmap to calculate the count of values for the aggregation group.
- Each Yi bitmap stream is ANDed with each of the 16 bit slices of the interaction counter. These can be evaluated in 240 (15*16) independent and potentially concurrent threads. The resulting bitmaps are not further processed, it’s simply required to count the set bits and discard. As a result, there will be one unsigned integer value per thread (I’ll call them Vij)
- For each year, the 16 Vij values calculated in step 3 are multiplied with the weight of the related bit. The result is totaled up to the group sum and divided by the group member count calculated in step 2 to compute the final average. This final step of course can only start after all previous threads have completed.
The above sketch should have explained how this task can be executed in up to 256 (1 + 15 + 15*16) potentially concurrent threads (when maximizing the pipeline effect, which particularly works for a very large number of rows aka Big Data) which means that this set of data structures facilitates a massive scale-out to utilize the capabilities of the most powerful hardware today. Think of a second dimension to group by – the potential scale-out will be multiplied by the cardinality of that extra dimension.
So, we’re able to keep the big rigs busy – but will it be efficient? Take a look at the steps. Much of the workload is the most basic digital operation: Binary AND. Any digital hardware more recent than the abacus does that efficiently. To calculate the Vij subtotals and the Yi bit counters, a simple 32bit register can be reliably used for up to 2^32-1 rows. A 64bit register can be reliably used for up to 2^64-1 rows, a value that I consider fair to think of as unlimited. The closest thing to real mathematics happens in the final step where the number of rows involved before has become completely irrelevant.
Summary: The article shows how bitmap data structures as the leanest possible vertical representation of data allow completely different evaluation approaches for analytic queries. These approaches are efficient (which I hope to have outlined well; a more detailed evaluation would exceed the scope of this document) and extremely scalable. It is not necessary to have a high number of CPU cores available but if they exist and are not completely utilized, they can contribute to the faster execution of the query. Since the elementary execution steps are almost completely independent from each other, the overhead coming with parallel execution is almost completely avoided.
Meet me at #sapteched Las Vegas (October 21 -25, 2013)
Be inspired by the #HANA #Database #Technology featuring #SAPonHANA and #BigData
Discuss this subject in my Expert Networking session EXP10849 SAP Sybase IQ beyond NLS
Join my presentation RDP140 Extend the Range of Your Business Warehouse with SAP Sybase IQ and NLS
Get introduced to your personal database server in EXP10850 SAP Sybase SQL Anywhere – Instant Database To-Go
Learn more about how to leverage leading-edge innovation with SAP Database Services
Stay in the conversation by following SAP Services on #SCN.
Follow our news on Twitter @SAPServices.
Clarification:
There was a question about the statement "We’re restricting to status ‘current’ or ‘historic’ (2 out of 20 or 200 possible status values, making up 98+% of all rows)".
The "98+%" (like the "20 or 200 possible status values") is a plain assumption, which I consider realistic. The intention was to showcase the efficiency of the filter mechanism for restrictions where almost all rows are qualified, a situation which cuts off any kind of improvement through traditional (B- / B*- Tree) indexes.