Okay, so data compression isn’t new to ASE 16 – SAP first supported data compression in ASE 15.7.  Index compression is new in ASE 16, but perhaps it takes a bit of background in data compression as a bit of a primer before we can really understand the full considerations.

Data compression in ASE 15.7 was implemented at 3 basic levels:

  • Row compression – which tried to compress the storage of each row through basic compression techniques such as reducing duplicate characters, etc.
  • Page compression – which used the industry standard dictionary compression techniques to replace data on the page with tokens.
  • LOB compression – which used standard industry ZLib type compression on large LOB objects, which would not lend themselves well to page compression.

Before we get into index compression, let’s discuss page compression a bit more because it is the style used in index compression and there is a lot of FUD about it.  First of all, dictionary compression is extremely common in analytic systems.  As noted, SAP ASE simply looks at each column value, determines if there is already an encoded dictionary token, and if so, simply replaces that data element in the storage row with the encoded token.  If the value is not in the dictionary, a new token is assigned to the new value and then it is replaced in the row.

In the DBMS industry, there is a lot of noise about page vs. table level dictionary compression and who has better compression, etc.  Some vendors argue that table level dictionaries are more effective as you don’t have to store the dictionary of encoded values on every page.  This nets them slightly better compression ratios.  However, what they don’t explain is the trade-offs – such as the fact that page level compression is more likely to resolve any data value to a low cardinality value which can be encoded using a single byte (or less) whereas table level is more likely to consider data elements as high cardinality and need more bytes for encoding.  Consider for example, the data element “city”.  In a typical database page, there likely will be at most 10’s to low 100’s of distinct values – easily represented in a single byte.  However, in the full table, there could be 10’s of thousands, needing 2 bytes for encoding – possibly 3.

The other consideration is that the dictionary table has to be maintained.  This can be a source of potential bottleneck.  For example, at a table level, multiple concurrent writers would need access to the dictionary table and likely would need to be sequenced.  A similar problem pops up with other compression techniques in which adjacent rows are included and the same set of values for a set of columns is replaced with a single data encoded value for the multiple rows.  Yes, it improves compression.  But at an interesting cost for subsequent DML operations that need to modify one of those values as now that single value for multiple rows needs to be replaced with an encoded value for each row…a bit of an escalation from attempting to modify a single row and ending up having to modify several others.

As a consequence, ASE’s page compression is a trade-off in which maximum compression is not the goal, but rather a balance between data compression and user concurrency.  One aspect to keep in mind is that data is compressed both on disk as well as in-memory.  This has benefits as well as trade-offs.  For example, one benefit is that it is likely that more data will fit into memory, thus reducing the number of physical reads.  However, it also means that logical reads for previously cached data will be slow as each logical read will have to re-uncompress the data.

That last point is a bit of a concern as the overhead can be considerable.  While SAP is looking at ways to reduce this overhead, any reduction would come at a cost of increased memory consumption – which could lead to other problems.  As a result, the most effective solution is to minimize the amount of logical reads that any query needs to perform.

I will put it in plain English:  You need better indexing.

But doesn’t adding indexes consume even more space??  …the very thing we are trying to reduce???

Yes, but, let’s face it – any frequently executed query (other than an aggregate) that does 100’s or 1000’s of LIO’s – even if all cached/in memory – per row returned is a good candidate for better indexing.    Any penalty from compression is a side issue at that point and just serves to magnify an already bad situation.  For example, on an early SAP system migration, we found a query that ran every 2 seconds that did a table scan of 200,000 pages in ASE.  Yep, completely in memory.  Yep, noticeably slower than when not compressed.  But was compression the real issue??  Nope – the real issue was the in-memory tablescan of 200,000 pages that with a proper index might have been only 10’s of pages of LIO.  Adding the index to  support the query would not only fix the penalty of compression – but also run orders of magnitude faster than it did previously.

Now, that doesn’t mean you will be adding an index or 10 to every table.  However, it does mean that you will need to carefully monitor for queries that are running slower and then consider adding better indexing.  Probably it will mean 1-2 additional indexes on a few tables.  Do the 1-2 additional indexes add extra overhead for DML operations?  Yes.  But not enough to be really measurable, unless you don’t have any indexes to begin with.  However, of course, it will add space – as well as perhaps extra maintenance time with respect to dbcc’s, reorgs, updatendex statistics and other operations.

But let’s talk about space for a minute.  First of all, this discussion will focus on DOL tables.  Yeah, I know, there are tons of legacy systems out there still using APL tables.  Sorry.  Too many variances with that to get into.

Now, then DOL indexes from the very beginning did two really interesting tricks to try to reduce space consumption:

  • Index suffix compression on intermediate nodes – Intermediate nodes only contained enough of the index prefix to be able todetermine B-tree traversal
  • Non-unique index duplicate RID lists.  In a non-unique index, at the leaf level, if a key value had multiple rows associated with it, rather than storing the key value(s) multiple times (once for each leaf row), the key value(s) would be only stored once with a list of RIDs that it pointed to.

While this reduced the unnecessary space consumption, the more indexes you have and the more distinct the keys, the more space is necessary to store them.

Now, then, along comes ASE 16 and its support for index compression.  ASE 16 supports index compression for leaf pages only (non-leaf intermediate nodes still have suffix compression) and uses a technique of index prefix compression.  The simple solution would be to compress each index key column individually as a whole unit – and this would work well for multi-key indexes.  For example, in a 4 column index key, the first 3 are likely highly repetitive within any given index leaf page.   However, this might not reduce space at all for many single column indexes – or for multi-key indexes in which the last index key had lengthy prefixes.   For example, an index on city name – with the names sorted, often the first multiple characters of the index key values are identical – e.g. Jackson (as in Jackson, MS) and Jacksonville (as in FL).  This doesn’t only affect character data – consider datetime columns in which the date values – and perhaps the hours, minutes and possibly even seconds are identical with the only difference in milliseconds.  Of course, the datetime isn’t stored as character – it is a structure of two integers – a date component and time component.  But likely the date component and the first multiple bytes of the time component may be identical and could be compressed.

There are a couple of problems with that….consider a numbering sequence with numbers 100001, 100002, 100003, 100004, … and so forth.  It would appear that looking at a prefix of a partial column, I might be able to compress out the first 5 digits.  Not so fast.  Remember, we are looking at the numeric representation.  What is stored is the binary representation.  On big endian systems (e.g. IBM Power series), this still works out as 100001=0x000186a1, 100002=0x000186a2, 100003=0x000186a3, 100004=0x000186a4,….    However, on little endian systems (such as Intel x86), the LSB ordering of the bytes results instead in 100001=0xa1860100, 100002=0xa2860100, 100003=0xa3860100, 100004=0xa4860100,….  Ugh!  This also impacts datatypes that are derived from numeric types as ’04 Aug 2014’ = 0x7da30000 and ’05 Aug 2014’ = 0x7ea30000 on x86 LSB platforms where IBM Power and other MSB chips would store them as 0x0000a37d and 0x0000a37e respectively – which lends itself much better to prefix compression.

The second problem is that most DBMS’s internally store data in fixed length columns first internal structure and then variable length columns.  This reduces the amount of overhead for each row and thus saves some space.  For example, when the data is stored in column order and there are variable length columns, each row would need a column offset table specifying where each column began.  By storing the fixed length columns first, the table structure can simply have single offset table that is used for every row for the fixed length columns and a much smaller column offset table for the variable length columns.  Consider for example an index on columns {col1, col2, col3, col4} in which columns col1, col2 and col4 are fixed length.  If the internal structure was {col1, col2, col4, col3}, then I would only need a column offset table for a single column. This is fixed length first column reorganization internally is why some DBMS’s often suggestion specifying fixed length columns first for a table anyhow.

This impacts prefix compression as I would need to make sure that the columns are re-ordered into the correct column sequence as specified in the index key.  Unfortunately, this means that now the larger column offset table takes up more space in the index row than it previously did – which means the index format for index prefix compression could be even larger than it is today.  However, once compression kicks in, the prefix compression is extremely efficient at reducing index size as the typical leading columns of a multi-key index are identical and can be replaced with a single token.

A second aspect is that not all indexes may compress very well.  Consider a single column index on a monotonically increasing key – such as an identity column, order number, trade dates, etc.  If the index is unique (e.g. an identity or pseudo-key column), then the prefix might not be very repetitious at all – especially on x86 platforms in which the leading bytes are byte swapped to the end.  If we then arbitrarily apply page compression, each distinctive value would have an entry in the dictionary, plus the token – which means that that we would still need the same amount of space for the dictionary values – but we would also have double the space for the tokens.

This leads to the following observations about index compression:

  • Multi-column indexes will likely have much greater compression ratios than single column indexes.
  • Indexes with distinctive leading columns (e.g. timestamp) may not compress well.
  • Single column indexes on monotonic increasing values (e.g. order number) may not compress well – especially on little endian (LSB) platforms such as x86.
  • Big endian (MSB) platforms such as IBM Power series might have better compression ratios.

With those kind of considerations, how do you decide whether to compress a table and/or its indexes or not??   This is where SAP Control Center 3.3 tries to help with the new Compression Advisor feature – which is pretty neat.  What it does is:

  • Allow you to specify the data, LOB and index compression you wish
  • Override the compression on individual columns or indexes as desired
  • Specify a sampling size (default 10%)

Then it makes a copy of the table structure, loads it with the sample size data volume, then applies the compression attributes you’ve selected and finally runs a reorg rebuild to compress the existing data in the table.  The reason I like this wizard is that it does two things:

  • Produces a fairly reliable compression ratio for space savings
  • Gives a good indication of how long the actual compression will take

The first version simply reported an overall space savings and compression ratio for the entire table.   As noted, this might include indexes that shouldn’t be compressed – or rather might not compress well.  There is an enhancement coming in the near future that I have been playing with that actually breaks down the compression by index and provides a result similar to:

compression advisor.jpg

If you want to play around with index compression, you can download the developer’s edition of ASE 16 from http://scn.sap.com/community/developer-center/oltp-db.  In addition, there was a webcast about compression and other features that you can watch the replay of at http://event.on24.com/r.htm?e=775360&s=1&k=D1FE610794ECF6A138793A50967D48D4.  Finally, there is a demo of the SCC Compression Advisor at https://www.youtube.com/watch?v=XiQW_Xl6bVU&feature=youtu.be.

To report this post you need to login first.

Be the first to leave a comment

You must be Logged on to comment or reply to a post.

Leave a Reply