HANA Data Compression Algorithms: Dictionary Encoding
Hi All !!!
In this post we’re going talk about something very important that is usually overlooked, but in my view is a great resource, it makes investing in HANA worthwhile, and it increases ROI while decreasing TCO: data compression.
And sure, anyone with a little computer knowledge has heard of this and knows that it saves disk space. But in the case of HANA it saves RAM, which can bring a better cost-benefit ratio because RAM is still far more expensive than disk memory despite recent drops in price, and this has allowed for the development of in-memory platforms like SAP HANA.
The logic is simple: let’s suppose your company is near their storage limit with 9TB for a 10TB DW capacity, which is basically a red light for IT managers. Correct me if I’m wrong, but doubling this capacity with additional hard drives would require about $100,000, and you could almost double that cost if your company is more aggressive. Meanwhile, SAP HANA could compress the data stored in the DW by at least 3x. So, if your company chose to invest in HANA, your 9TB would be reduced to 3TB, and could even reach less than 1TB. And with zero investment in new hard drives. Great, right? You would only need to invest in a 6TB server to double your capacity, and any data HANA compressed beyond 3x would be contingency.
Of course, no well-managed company likes to waste money and the ideal would be to invest with no waste at all, if it were possible. (Actually, there is a way to do this, and I will talk about it later.)
Database compression is not exactly new and other databases already use it. What makes HANA unique are its compression algorithms which are all performed in-memory, unlike any other compression system on the market!!
But how does it work?
In this post I’ll only cover the technique that supports all the other techniques because, roughly speaking, other compression algorithms are based on it and it’s relatively simple. The technique is called Dictionary Encoding.
Let’s take as an example a table containing data on world population. In the image below we have 2 objects that are the basis of this technique, the Dictionary and the Attribute Vector:
Here the table name column and the Dictionary storage column contain the only the non-repeated values, and this creates an index for each value in the Value ID column.
The Vector Attribute makes a one-to-one relationship between the data in the original table and the Dictionary, and it also creates a Position Index.
Now everything gets easier because it just converts the data from the original table into bits. Storing data in bits is much more economical than storing it in its original format, and the compression is even greater if the field data is CHAR type.
Let’s suppose that our table has 8 billion lines and one data set is 200 bytes. We have:
8.000.000.000 x 200 = 1.6TB
Let’s start with the “first name” field. To be able to replace its data with bits, we must know how many bits are needed for 49 bytes of data listed there. For this we use a binary logarithm function:
Log2(5.000.000) = 23
According to this calculation, we need 23 bits to store the same 49 bytes. Now let’s make a comparison:
8.000.000.000 x 49 Bytes = 392.000.000.000 Bytes / 1024 / 1024 / 1024 = 365.1GB
8.000.000.000 x 23 Bits = 184.000.000.000 Bits / 8 / 1024 / 1024 / 1024 = 21.4GB
WOW!!! A 94% reduction in size. Fantastic, right? Now what is recorded in the “first name” field is a sequence of 23 bits instead of “William”, for example. For those who know BW, this is similar to using Surrogate IDs (SIDs) in cube’s dimensions.
But when we do a Select function, why does the correct name come up instead of the bits? Ah-ha, this happens because of another technique called “Materialization”, but this is the subject of another post J !
We can’t forget to add the space occupied by the dictionary
49 bytes x 5.000.000 / 1024 / 1024 / 1024 = 0.23GB
So finally, we have:
In our scenario, we had a reduction factor of approximately 17 times, and the new size of the “first name” column consumes only 6% of the original space … excellent !!!
Just repeat this for the other columns and we get the reduction factor of the entire table. So, we can conclude that compression level depends not only on the data type but also on how many times we have repeated values in a given column, or rather, how many distinct values we have, as well as the number of data sets.
This we call Entropy:
So, less entropy provides higher compression ratios, and we happily store corporate data in tables that used to have lower entropy when stored in columns. In most cases it’s possible to obtain higher compression ratios than the minimum 3x suggested by SAP.
Other compression techniques such as Prefix Encoding, Run-Length Encoding, Cluster Encoding, Indirect Encoding and Delta Encoding are applied to these 3 objects (the table, the Dictionary and the Attribute Vector) usually after Dictionary Encoding has already been applied, so it is the most important step while the others have the role of making HANA compression even more efficient and thus making this database quite special and wonderful.
As I noted at the beginning of this post regarding the decision to reduce sizing investment by implementing HANA is that that this maximizes efficiency because the main factor in the price of HANA is the amount of RAM. A program can be run for analyzing any company´s spreadsheets by applying this routine:
- First, define the number of distinct values in each column of a given table
- Determine the length of each column
- Finally, calculate the results using the logarithmic function
At the end, add the results in GB from each column and compare this with the original.
Done! After this, the program gives you a report and you can help your director or IT manager buy exactly the size HANA server needed for your company, of course taking into account any contingency regarding annual growth prospects for company data or the company Sizing Plan.
If your company’s system is SAP, this will be even easier. Just refer to table DD02 and DD03 to find the names of all the tables and their fields that exist in the system and optimize a search for only tables that contain data, and, voilá!
In a future post I’ll try to gift my readers (as inspired by Christmas) with a compression program in ABAP and/or SQLScript.
See you next post!
Thanks for making this so easy to understand ! can you please also post a blog on materialization technique ? the article craves for the next logical step !!
For anyone interested in details of the compression algorithms and their use in SAP HANA the following links point to great resources:
2112604 - FAQ: SAP HANA Compression - the SAP note covering compression in the SAP HANA product in detail, including how to manage the feature(s).
In-Memory Data Management - openHPI course by Prof. Hasso Plattner - this course (recorded lectures) provides an in-depth course on the different technologies involved that make SAP HANA the high-performing in-memory DBMS it is.
As for the "compression" program, SAP offers ABAP reports (/SDF/HDB_SIZING, see SAP note 1872170 - ABAP on HANA sizing report (S/4HANA, Suite on HANA...) for details) that not only collect relevant information of the source database but also consider whether or not tables will get stored in the row- or in the column-store after a migration to SAP HANA.