Effects of changing data in HANA’s column store tables
I was curious about the inner workings of HANA. When you change data in column store tables, what is the effect? How much is written to the data files, how much change happens to the backup dumps, how much is written to the log files? With a real SAP system, you can measure the activity of the HANA database, but it is hard to measure the actual amount of changed data in some kind of controlled way. As usual, SAP provides some documentation, but not in the area I am interested in (with the usual disclaimer “as far as I can see”). The real fun with SAP always starts with reverse engineering, so I decided to create my own lab experiment.
My lab consists of:
– HANA 1.0 revision 70
– 7.4 GB or 49.108.572 rows of real-life ASCII data
– resulting in a 960 MB column store table
I wanted to measure in a controlled way the reaction of the database to changes in this table. First I delete some data, then I insert the same amount (but different) data. These two steps should simulate an update. I had a look at the in-memory table size. I created a backup dump and measured how much of the data file and backup dump file have changed. I tried to identify the written log files. This table summarizes my measurements:
||Insert #4||Delete #5
|delta size (raw ASCII Bytes)||276.500.919||288.948.984||247.715.063||246.271.066||258.70.592||249.046.220||499.568.790||591.892.341||1.038.293.459||1.039.845.787|
|delta size (column store Bytes)||9.232.860||51.012.984||8.647.776||29.823.960||4.183.096||11.260.656||3.177.416||27.956.080||6.444.088||22.006.248|
|changed datafile (Bytes)||135.090.176||1.318.801.408||1.112.358.912||1.369.174.016||1.119.375.360||1.366.949.888||149.934.080||3.299.258.368||91.615.232||3.882.795.008|
|ratio of changed datafile||0,036||0,351||0,296||0,364||0,298||0,364||0,040||0,819||0,023||0,851|
|delta backupfile (Bytes)||48.971.776||1.066.991.616||1.060.003.840||1.117.069.312||1.073.582.080||1.098.489.856||62.418.944||1.202.388.992||90.517.504||1.284.259.840|
|ratio of changed backupfile||0,037||0,780||0,786||0,790||0,793||0,767||0,043||0,835||0,061||0,844|
|logfiles written (Bytes)||16.314.368||239.538.176||14.798.848||204.611.584||14.266.368||206.983.168||25.640.960||486.739.968||55.025.664||856.743.936|
|Annotations||3,2% changes||delta merge||3,2% changes||delta merge||3,2% changes||6,5% changes||13% changes|
The table contains a lot of information, let me summarize my most interesting findings:
- After the delete #2 and delete #3 I manually performed a delta merge on the table, which is of course not necessary or outright useless. Surprisingly, this delta merge has got a huge effect on the data file and the backup dump file. Delete #1 was comparable in size, but without a manual delta merge and shows only very small changes to the data file and the backup dump file.
- Delta merges after inserts have always significant effects on the data file and backup dump file. The changes to the data file are typically several times higher than the actual amount of changed data. This is no issue for HANA since these writes happen asynchronously, but it is important to know that the amount of change to the datafiles is not representative for how much change actually happened to the data.
- Even though the delta merge causes significant changes to the table representation both in the data file and in the backup dump file, it is still possible for most real-life SAP systems to use backup deduplication because only a tiny percentage of the overall data is changed per day. (I verified that on an actual BW on HANA system.) However, I predict that there is some critical threshold, if too many table (partitions) are changed and therefore reorganized via a delta merge then it won’t be possible to benefit from a backup deduplication any more.
- Only delete #1 and Delete #2 showed a reduction in the memory consumption of the table. Delete #3, Delete #4 and Delete #5 increased the memory consumption of my sample table! Also the data file and the backup dump file size increased over the time by ~20%. Contrary to that, the amount of data was the same at the beginning and at the end of my experiment.
- The amount of log files written by HANA correlates very well with the amount of (raw ASCII) data inserted into the database. So as a rule of thumb, if you want to know how much data has been added to or updated in your HANA instance, have a look at the log files.
- Deletes are very efficient in terms of log files, only few data needs to be written to the logs. Don’t expect any significant reduction of the in-memory space requirements, however.
In case of questions or if something was unclear, please ask.