
In Quick notes during a fiddle-session I played around with a table that contained all possible values for a date column.

To fill it I used a FOR LOOP with an INSERT and a commit every 10000 records.

That took 9:37 minutes for 3.652.061 records.

That’s terrible performance and nothing that needs to be accepted with SAP HANA!

I found a way to do the same much quicker:

Statement 'insert into rmanydates (select * from all_days)'

successfully executed in 6.120 seconds  (server processing time: 4.630 seconds) – Rows Affected: 3652060 ;

Question 1:

The dates inserted above have not been pre-computed in this example.

How did I do this?

——–

Answer 1


The 6.120 seconds to create every possible date once and insert it into the row table is a lot faster than the one-by-one looping approach I took in the procedure (mainly for clarity purposes).

But, as Fernando Da Ros demonstrated, there are other options to do that.

The one I chose was part of a rather new feature in SAP HANA: SAP HANA Series


SELECT to_date(GENERATED_PERIOD_START)
    FROM SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999')
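For readers without a series-enabled HANA system, what SERIES_GENERATE_TIMESTAMP does at 1-day granularity can be sketched in Python (a rough stand-in for illustration only; the function name below is mine, not HANA's):

```python
from datetime import date, timedelta

def series_generate_day(start, end):
    """Yield every date from start to end (inclusive), one day apart --
    a rough stand-in for SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', ...)."""
    current = start
    step = timedelta(days=1)
    while current <= end:
        yield current
        current += step

# Small demonstration range; the statement above runs over 01.01.0001 .. 31.12.9999.
days = list(series_generate_day(date(2015, 1, 1), date(2015, 1, 5)))
print(len(days), days[0], days[-1])   # 5 2015-01-01 2015-01-05
```

Note that over the full range Python's proleptic Gregorian calendar counts 3.652.059 days, slightly fewer than the row counts above – calendars differ in how they treat historic dates.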

——–

I also looked at the memory consumption for the stored date tuples.

In the blog post I just used a row store table, because I thought: well, there’s not going to be any column store compression anyhow.

(you know, because column store compression mainly builds on compressing duplicate column values. But with every possible date once, there are no duplicates, so no compression – or is there?)

However, I managed to get the data loaded into a column store table while using a lot less memory.

Question 2:

How to get from


BEFORE

— General —

Total Memory Consumption (KB): 25.719

Number of Entries: 3.652.061

Size on Disk (KB): 23.152

Memory Consumption in Main Storage (KB): 25.700

Memory Consumption in Delta Storage (KB): 19

Estimated Maximum Memory Consumption (KB): 25.719

to

AFTER

— General —

Total Memory Consumption (KB): 1.645

Number of Entries: 3.652.061

Size on Disk (KB): 12.912

Memory Consumption in Main Storage (KB): 1.626

Memory Consumption in Delta Storage (KB): 19

Estimated Maximum Memory Consumption (KB): 1.645

Both tables were fully loaded when the space consumption was analyzed.

If you know how that works, put your answers into the comments section! 😀

——–

Answer 2

This one is again a feature from SAP HANA Series: the Series Table,



create column table smanydates (dday date)
SERIES (EQUIDISTANT INCREMENT BY INTERVAL 1 DAY
    MINVALUE '01.01.0001'
    MAXVALUE '31.12.9999'
    PERIOD FOR SERIES (dday));

The above CREATE TABLE statement specifies that the timestamp information in column DDAY is not explicitly stored.

Instead, the internal storage is merely a calculation formula that “knows” how every date between MINVALUE and MAXVALUE with a granularity of 1 DAY can be computed.

Instead of storing a full date, only a number uniquely identifying a day needs to be stored.

Given the starting date ‘01.01.0001’ the current day would then be stored just as the integer 735.832 (20 bits).
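Incidentally, Python's date.toordinal() uses the very same absolute day numbering (day 1 = 01.01.0001, proleptic Gregorian), so the storage principle can be sketched like this (an illustration of the offset idea, not HANA's actual internal encoding):

```python
from datetime import date

base = date(1, 1, 1)              # corresponds to MINVALUE '01.01.0001'
d = date(2015, 8, 22)             # day number 735832 in this numbering

# Encoding: a date becomes its 1-based day offset from the base date.
offset = d.toordinal() - base.toordinal() + 1
print(offset, offset.bit_length())   # 735832 20

# Decoding is the inverse formula; no date value is materialized in storage.
restored = date.fromordinal(base.toordinal() + offset - 1)
print(restored)                      # 2015-08-22
```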

Of course there are also limitations to this type of data storage, but for ever-increasing, regular date or timestamp information this is really an option to save a lot of memory.

For more information around SAP HANA series, please check the SAP HANA documentation.

——–

– Lars


10 Comments


  1. Fernando Da Ros

    Hi Lars,

    I realized that I had never played with dates before the last century, so it was a nice trip through history, too. I was trying some “sophisticated” solutions yesterday and, despite the runtime being around 8 seconds, it was neither elegant nor readable.

    There’s nothing like a night with jobs running in the background… rsss

    I woke up today with a very simple idea in mind: take a table which has your 3 million lines and use window functions to number it. Simple and readable. Now I got curious about yours.

    drop table manydates_r;

    create table manydates_r (dday date);

    drop table manydates_c;

    create column table manydates_c (dday date);

    select add_days('0001-01-01',row_number() over ()-1)

    from (select top 3652061 1 from _sys_repo.PACKAGE_CATALOG,_sys_repo.PACKAGE_CATALOG)

    into manydates_r;

    select add_days('0001-01-01',row_number() over ()-1)

    from (select top 3652061 1 from _sys_repo.PACKAGE_CATALOG,_sys_repo.PACKAGE_CATALOG)

    into manydates_c;

    merge delta of manydates_c;

    unload manydates_c;

    load manydates_c all;

    I saw that for the row table the server processing time sometimes jumps to 5 or 7 seconds (no idea why), but the insert into the column table seems consistently faster:

    — row table

    Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from …'

    successfully executed in 5.259 seconds  (server processing time: 3.626 seconds) – Rows Affected: 3652061

    — column table

    Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from …'

    successfully executed in 3.895 seconds  (server processing time: 3.672 seconds) – Rows Affected: 3652061

    Question 2 – I have no clue what’s going on, but during the investigation I found a question 3.

    I ran it many times, inserting different numbers of records… I could see that the column table has a memory penalty just to start (around 256k) compared to the row table (around 16k)… but what I really have no explanation for is the 2 million rows sample, which consumes more memory than the 3.6 million rows one ?!?!

    Question 3: Why does the column store use more memory for 2 million rows than for 3.6 million?

    BTW: I also have no clue why.

    Background: these tests were made on SP10, but to check question 2 I also reproduced them on SP8, and the difference pointed out in question 3 doesn’t exist there… 3.6 million rows consume 55 MB… with a progressive curve without disruption. See the attached sheet with samples.

    Other findings:

    1) The DO BEGIN END has a “good” memory, probably a bug… during development it stopped “compiling” my new changes and kept running an old version… it just ignored my changes…

    2) SP8 ran this sample faster:

    Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from …'

    successfully executed in 4.102 seconds  (server processing time: 2.399 seconds) – Rows Affected: 3652061

    Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from …'

    successfully executed in 2.996 seconds  (server processing time: 2.748 seconds) – Rows Affected: 3652061

    3) The runs of around 8 to 10 seconds with more complexity that I’ve mentioned go like this:

    – create view from dummy producing digits

    – produce an result set with cross join over digits view for Year, Month and Day

    – produce the output filling the rules for 1532, leap year, difference < 1500….

    Regards, Fernando Da Rós

    1. Lars Breddemann Post author

      Hi Fernando

      glad you picked up this little quiz!

      Ad 1): yes, this is a quick way to insert lots of data.

      However, it’s not the way I used. Let’s see if someone comes up with the right solution here before I reveal it 🙂

      Ad 2)  I leave this open till the final closing of this quiz, too…

      Ad 3) I tried that and couldn’t reproduce this.

      My 2.000.000 rows table takes 13.259.764 bytes in total

      My 3.652.061 rows table takes 24.661.176 bytes in total

      and these numbers somewhat make sense, I’d say.

      COUNT   MEM_SIZE_TOTAL  MEM_SIZE_MAIN MAIN_MEM_SIZE_DATA MAIN_MEM_SIZE_DICT

      2000000 13259764        13251064      5250008            8000000

      3652061 24661176        24652476      10043176           14608244

      Some data from M_CS_ALL_COLUMNS

      2 Mio 

      MAIN_MEMORY_SIZE_IN_DICT/entry = 4

      MAIN_MEMORY_SIZE_IN_DATA/entry = 2,625004

      3,6 Mio

      MAIN_MEMORY_SIZE_IN_DICT/entry = 4

      MAIN_MEMORY_SIZE_IN_DATA/entry = 2,750002

      So, the number of bytes per entry in the dictionary is 4.

      For the main “vector” I’d say that the dictionary reference encoding always uses just as many bits as are required to address all items.

      Only when a really large number of unique entries in the dictionary must be addressed will all 4 bytes be used.

      And 3.6 Mio records simply contain more references that require more bits than 2 Mio do.
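The bits-per-reference argument can be checked against the measured MAIN_MEM_SIZE_DATA values above (a back-of-the-envelope sketch; since every date is unique here, the number of distinct values equals the row count):

```python
import math

def data_vector_size(n_rows, n_distinct):
    """Size of the dictionary-reference vector: each row stores an index into
    the dictionary, using just enough bits to address all distinct entries."""
    bits = max(1, math.ceil(math.log2(n_distinct)))
    return bits, n_rows * bits / 8

for n in (2_000_000, 3_652_061):
    bits, size = data_vector_size(n, n)
    print(n, bits, size, round(size / n, 6))
# 2.000.000 rows -> 21 bits/reference, ~5.250.000 bytes (measured: 5.250.008)
# 3.652.061 rows -> 22 bits/reference, ~10.043.168 bytes (measured: 10.043.176)
```

The sketch lands within a few bytes of the measured sizes, which supports the bits-per-reference explanation.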

      Thanks again for contributing!

      1. Fernando Da Ros

        Hi Colleagues,

        Forget question 3 – I repeated it more times and couldn’t reproduce it on any SP10 system.

        Here follows the other solution (the complex one). You don’t need to look for a table with 3 million rows on the system, as it is based on the dummy table.

        Regards, Fernando Da Rós

        drop view digits10;

        create view digits10 as (

          select '0' as d from dummy union all

          select '1' as d from dummy union all

          select '2' as d from dummy union all

          select '3' as d from dummy union all

          select '4' as d from dummy union all

          select '5' as d from dummy union all

          select '6' as d from dummy union all

          select '7' as d from dummy union all

          select '8' as d from dummy union all

          select '9' as d from dummy);

        drop view anos;

        create view anos as (

          select ano,to_number(ano) as ano_i

          from (select d1.d||d2.d||d3.d||d4.d as ano

                from digits10 as d1,digits10 as d2,digits10 as d3,digits10 as d4)

                where ano > '0000');

        drop view meses;

        create view meses as (

          select mes,to_number(mes) as mes_i

          from (select d1.d||d2.d as mes

                from digits10 as d1,digits10 as d2

                where d1.d in ('0','1'))

          where mes between '01' and '12');

        drop view dias;

        create view dias as (

        select dia,to_number(dia) as dia_i

        from (select d1.d||d2.d as dia

              from digits10 as d1,digits10 as d2

              where d1.d between '0' and '3')

        where dia between '01' and '31');

        drop view all_possible_dates;

        create view all_possible_dates as (

          select to_date(sday,'yyyymmdd') as dday from (

          select ano || mes || dia as sday from anos,meses,dias

          where ((dia_i <= 28) and not (ano_i=1532 and mes_i=10 and dia_i between 5 and 14))

          union all

          select ano || mes || dia as sday from anos,meses,dias

          where (dia_i in (29,30,31) and mes_i in (01,03,05,07,08,10,12))

          union all

          select ano || mes || dia as sday from anos,meses,dias

          where (dia_i in (29,30) and mes_i in (04,06,09,11))

          union all

          select ano || '0229' as sday from anos

          where (   (mod(ano_i,4)=0 and mod(ano_i,100)<>0)

                       or (mod(ano_i,400)=0)

                       or (ano_i <= 1500 and mod(ano_i,100)=0))

        ));

        drop table manydates_r;

        create table manydates_r (dday date);

        drop table manydates_c;

        create column table manydates_c (dday date);

        select * from all_possible_dates

        into manydates_r;

        select * from all_possible_dates

        into manydates_c;

        merge delta of manydates_c;

        unload manydates_c;

        load manydates_c all;

        Statement ‘select * from all_possible_dates into manydates_r’

        successfully executed in 7.933 seconds  (server processing time: 5.965 seconds) – Rows Affected: 3652061

        Statement ‘select * from all_possible_dates into manydates_c’

        successfully executed in 4.972 seconds  (server processing time: 4.724 seconds) – Rows Affected: 3652061

        1. Lars Breddemann Post author

          Great work and good effort Fernando!

          If there were a prize for this little quiz, you would have gotten it, even though the original solution has not been found so far.

          I’ll update the blog post now, so that the riddle is solved.

          Thanks again for playing along. Looking forward to the next riddle 😉

          Cheers,

          Lars

          1. Fernando Da Ros

            Hi Lars,

            The answers remind me that I am still stuck on SP8 – need to go back to school again.

            Thanks for bringing this up. HANA DB gets more clever with each SP.

            Cheers, Fernando Da Rós

  2. Anil R

    Hi Lars,

    I have a few questions:

    1. If it’s just a formula and not the actual values that are stored when using a series data definition for timestamp (or whatever columns are part of the series), don’t you think retrieval would be slower compared to normal tables, as the values have to be computed on the fly?

    2. I believe linear RLE would be used as the advanced compression technique for storing series data, and as far as my research goes I couldn’t fit the concept of storing a formula into the linear RLE methodology.

    3. How are missing values handled? E.g. in a series of timestamp data at a 1-day interval, if a few values are missing in the middle, how does the formula cope with them?

    Thanks in Advance

    -Anil

    1. Lars Breddemann Post author

      ad 1) The formula really is just an offset calculation that can easily be vectorized on CPU level. I don’t see why this would necessarily be much slower than the “normal” access.

      ad 2) Not sure what you’re asking here. Consider the function as a filter. Instead of dates, say ‘12.02.1996’, ‘13.02.1996’, ‘14.02.1996’, you might just store 43, 44 and 45 as integers (if e.g. the base date had been the 01.01.1996).

      So the integer values are what gets stored in the column store.

      The RLE compression (or any other column store compression technique) mainly works on the value vector. So if you have multiple occurrences of the same date, say 1000 times 44 (representing the 13.02.1996), then these can be compressed well via RLE, just like usual columns.
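A minimal sketch of that run-length encoding idea on the value vector (a generic illustration, not HANA's actual implementation):

```python
def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# 1000 rows that all hold day number 44 (the 13.02.1996 above) collapse into a
# single pair, while a run of all-unique day numbers does not compress at all.
print(rle_encode([44] * 1000))    # [(44, 1000)]
print(rle_encode([43, 44, 45]))   # [(43, 1), (44, 1), (45, 1)]
```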

      ad 3) Why would the formula need to cope here? It’s an absolute offset formula – not a relative one. Technically missing values are possible.

