I was curious about the inner workings of HANA: when you change data in column store tables, what is the effect? How much is written to the data files, how much does the backup dump change, how much is written to the log files? With a real SAP system you can measure the activity of the HANA database, but it is hard to measure the actual amount of changed data in a controlled way. As usual, SAP provides some documentation, but (as far as I can see) not in the area I am interested in. The real fun with SAP always starts with reverse engineering, so I decided to create my own lab experiment.

My lab consists of:

– HANA 1.0 revision 70

– 7.4 GB or 49.108.572 rows of real-life ASCII data

– resulting in a 960 MB column store table

I wanted to measure, in a controlled way, how the database reacts to changes in this table. First I deleted some data, then I inserted the same amount of (different) data; together these two steps simulate an update. After each step I looked at the in-memory table size, created a backup dump, measured how much of the data file and of the backup dump file had changed, and tried to identify the log files that were written. (A sketch of one possible way to measure the changed bytes follows after the table.) This table summarizes my measurements:

| Description | Delete #1 | Insert #1 | Delete #2 | Insert #2 | Delete #3 | Insert #3 | Delete #4 | Insert #4 | Delete #5 | Insert #5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MEMORY_SIZE_IN_MAIN (Bytes) | 998.832.085 | 1.049.845.069 | 1.041.197.293 | 1.071.021.253 | 1.075.204.349 | 1.086.465.005 | 1.089.642.421 | 1.117.598.501 | 1.124.042.589 | 1.146.048.837 |
| RECORD_COUNT | 47.359.893 | 49.187.726 | 47.620.728 | 49.179.120 | 47.542.719 | 49.117.838 | 45.957.846 | 49.701.911 | 43.134.825 | 49.712.183 |
| delta rows | 1.748.679 | 1.827.833 | 1.566.998 | 1.558.392 | 1.636.401 | 1.575.119 | 3.159.992 | 3.744.065 | 6.567.086 | 6.577.358 |
| delta size (raw ASCII Bytes) | 276.500.919 | 288.948.984 | 247.715.063 | 246.271.066 | 258.70.592 | 249.046.220 | 499.568.790 | 591.892.341 | 1.038.293.459 | 1.039.845.787 |
| delta size (column store Bytes) | 9.232.860 | 51.012.984 | 8.647.776 | 29.823.960 | 4.183.096 | 11.260.656 | 3.177.416 | 27.956.080 | 6.444.088 | 22.006.248 |
| changed datafile (Bytes) | 135.090.176 | 1.318.801.408 | 1.112.358.912 | 1.369.174.016 | 1.119.375.360 | 1.366.949.888 | 149.934.080 | 3.299.258.368 | 91.615.232 | 3.882.795.008 |
| ratio of changed datafile | 0,036 | 0,351 | 0,296 | 0,364 | 0,298 | 0,364 | 0,040 | 0,819 | 0,023 | 0,851 |
| delta backupfile (Bytes) | 48.971.776 | 1.066.991.616 | 1.060.003.840 | 1.117.069.312 | 1.073.582.080 | 1.098.489.856 | 62.418.944 | 1.202.388.992 | 90.517.504 | 1.284.259.840 |
| ratio of changed backupfile | 0,037 | 0,780 | 0,786 | 0,790 | 0,793 | 0,767 | 0,043 | 0,835 | 0,061 | 0,844 |
| logfiles written (Bytes) | 16.314.368 | 239.538.176 | 14.798.848 | 204.611.584 | 14.266.368 | 206.983.168 | 25.640.960 | 486.739.968 | 55.025.664 | 856.743.936 |
| Annotations | 3,2% changes | | 3,2% changes, delta merge | | 3,2% changes, delta merge | | 6,5% changes | | 13% changes | |

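For readers who want to reproduce the "changed datafile (Bytes)" and "delta backupfile (Bytes)" rows, here is a minimal sketch of one possible approach: take a snapshot of the file before and after an operation and compare the two snapshots in fixed-size blocks, summing the bytes of all blocks that differ. The 4 KB block size and the snapshot paths are purely illustrative assumptions, not a description of the exact tooling behind the numbers above.

```python
import os

BLOCK_SIZE = 4096  # illustrative block size; pick whatever granularity you want to measure at


def changed_bytes(snapshot_a: str, snapshot_b: str, block_size: int = BLOCK_SIZE) -> int:
    """Compare two snapshots of the same file block by block and return
    the number of bytes that fall into blocks which differ."""
    changed = 0
    with open(snapshot_a, "rb") as fa, open(snapshot_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if not block_a and not block_b:
                break  # both files fully read
            if block_a != block_b:
                # count the larger block so that file growth is included
                changed += max(len(block_a), len(block_b))
    return changed


if __name__ == "__main__":
    # hypothetical snapshot paths taken before and after a delete/insert step
    before = "/backup/snapshots/datavolume_before.img"
    after = "/backup/snapshots/datavolume_after.img"
    total = os.path.getsize(after)
    diff = changed_bytes(before, after)
    print(f"changed bytes: {diff:,} of {total:,} ({diff / total:.3f})")
```

Because the comparison works at block granularity, the result is an upper bound on the logically changed data, which is also roughly how a block-based deduplication appliance would see the file.
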
The table contains a lot of information, so let me summarize my most interesting findings:

  1. After delete #2 and delete #3 I manually performed a delta merge on the table, which is of course unnecessary, if not outright useless, after a delete. Surprisingly, this delta merge had a huge effect on the data file and the backup dump file. Delete #1 was comparable in size, but without a manual delta merge it shows only very small changes to the data file and the backup dump file.
  2. Delta merges after inserts always have a significant effect on the data file and the backup dump file. The changes to the data file are typically several times larger than the actual amount of changed data. This is not an issue for HANA, since these writes happen asynchronously, but it is important to know that the amount of change to the data files is not representative of how much the data actually changed.
  3. Even though the delta merge causes significant changes to the table representation in both the data file and the backup dump file, most real-life SAP systems can still use backup deduplication, because only a tiny percentage of the overall data is changed per day. (I verified that on an actual BW on HANA system.) However, I predict that there is some critical threshold: if too many tables (or partitions) are changed and therefore reorganized via a delta merge, it will no longer be possible to benefit from backup deduplication.
  4. Only delete #1 and delete #2 reduced the memory consumption of the table; delete #3, delete #4 and delete #5 actually increased the memory consumption of my sample table! The data file and the backup dump file also grew by ~20% over time, even though the amount of data was the same at the beginning and at the end of my experiment. (A query sketch for reading these per-table numbers follows after this list.)
  5. The volume of log data written by HANA correlates very well with the amount of (raw ASCII) data inserted into the database. So as a rule of thumb, if you want to know how much data has been added to or updated in your HANA instance, have a look at the log files.
  6. Deletes are very efficient in terms of log volume; only a small amount of data needs to be written to the logs. Don't expect any significant reduction of the in-memory space requirements, however.
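
The MEMORY_SIZE_IN_MAIN and RECORD_COUNT rows match the column names of the monitoring view M_CS_TABLES, so one way to collect them is to query that view before and after each delete/insert step (and again after a delta merge). The sketch below uses the hdbcli Python client; host, port, credentials and the schema/table names are placeholders for your own system, and MEMORY_SIZE_IN_DELTA is included because it makes the effect of a delta merge directly visible.

```python
from hdbcli import dbapi  # SAP HANA Python client

# Connection details are placeholders; adjust them for your own instance.
conn = dbapi.connect(address="hanahost", port=30015, user="MONITOR_USER", password="secret")

SQL = """
SELECT TABLE_NAME, RECORD_COUNT, MEMORY_SIZE_IN_MAIN, MEMORY_SIZE_IN_DELTA
FROM M_CS_TABLES
WHERE SCHEMA_NAME = ? AND TABLE_NAME = ?
"""


def table_stats(schema: str, table: str):
    """Return record count and main/delta memory size for one column store table
    (one row per partition)."""
    cursor = conn.cursor()
    try:
        cursor.execute(SQL, (schema, table))
        return cursor.fetchall()
    finally:
        cursor.close()


# Run before and after each step and compare the numbers.
for row in table_stats("MYSCHEMA", "MYTABLE"):
    print(row)
```

Comparing two such readings gives the first two rows of the table above; the changed-bytes and log-volume rows still have to be measured on the file system.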

In case of questions or if something was unclear, please ask.


3 Comments


  1. Krishna Tangudu

    Nice effort Mark. Thanks for sharing your thoughts, which started as a question and then turned into a blog.

    Can you also share the stats on “Update” and “Update” + “Insert” against “Upsert”? It would be interesting to see…

    Regards,

    Krishna Tangudu

    1. Mark Förster Post author

      Hi Krishna,

      I haven’t done any research on updates because that would have been more difficult for me to perform and would have required more time. I am not sure whether it is actually the case, but Hasso Plattner stated that in-memory databases process an update as a delete followed by an insert. If that is true, an update test should yield no additional insight, but of course we cannot be sure before having tested it in reality. That might be a topic for a further blog.

      Regards,
      Mark

      1. Krishna Tangudu

        Yes, and there was one case where we went for (Delete + Insert) instead of Update, which gave better performance. Since I saw you testing with considerable volumes of data, I suggested you could try that… (Oh yes, one more topic… one more blog) 🙂

        Regards,

        Krishna Tangudu

