Fiddle riddle – Amazing: stuff gets faster and takes less space!
In "Quick notes during a fiddle-session" I played around with a table that contained all possible values for a date column.
To fill it I used a FOR LOOP with an INSERT and a commit every 10000 records.
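For reference, a rough sketch of what such a looping load can look like in SQLScript; the procedure name, the table name ALL_DAYS and the exact loop bound are illustrative placeholders, not the verbatim procedure from the original post:

create procedure fill_all_days language sqlscript as
begin
    declare d date;
    -- 3.652.061 days between 01.01.0001 and 31.12.9999
    for i in 0 .. 3652060 do
        d := add_days(to_date('0001-01-01'), :i);
        insert into all_days values (:d);
        if mod(:i, 10000) = 0 then
            commit;   -- commit every 10.000 records
        end if;
    end for;
    commit;
end;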
That took 9:37 minutes for 3.652.061 records.
That’s terrible performance and nothing that needs to be accepted with SAP HANA!
I found a way to do the same much quicker:
Statement ‘insert into rmanydates (select * from all_days)’
successfully executed in 6.120 seconds (server processing time: 4.630 seconds) – Rows Affected: 3652060 ;
Question 1:
The dates inserted above have not been pre-computed in this example.
How do I do this?
——–
Answer 1
The 6.120 seconds to create every possible date once and insert it into the row table is a lot faster than the one-by-one looping approach I took in the procedure (mainly for clarity purposes).
But, as Fernando Da Ros demonstrated, there are other options to do that.
The one I chose was part of a rather new feature in SAP HANA: SAP HANA Series
SELECT to_date(GENERATED_PERIOD_START)
FROM SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999')
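Plugged into the insert statement from above, the whole load then boils down to a single statement along these lines (a sketch using the same table name as in the example):

insert into rmanydates
    (select to_date(GENERATED_PERIOD_START)
       from SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999'));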
——–
I also looked at the memory consumption for the stored date tuples.
In the blog post I just used a row store table, because I thought: well, there’s not going to be any column store compression anyhow.
(You know, because column store compression mainly builds on compressing duplicate column values. But with every possible date occurring exactly once, there are no duplicates, so no compression – or is there?)
However, I managed to get the data loaded into a column store table and use a lot less memory.
Question 2:
How to get from
BEFORE
— General —
Total Memory Consumption (KB): 25.719
Number of Entries: 3.652.061
Size on Disk (KB): 23.152
Memory Consumption in Main Storage (KB): 25.700
Memory Consumption in Delta Storage (KB): 19
Estimated Maximum Memory Consumption (KB): 25.719
to
AFTER
— General —
Total Memory Consumption (KB): 1.645
Number of Entries: 3.652.061
Size on Disk (KB): 12.912
Memory Consumption in Main Storage (KB): 1.626
Memory Consumption in Delta Storage (KB): 19
Estimated Maximum Memory Consumption (KB): 1.645
Both tables were fully loaded when the space consumption was analyzed.
If you know how that works, put your answers into the comments section! 😀
——–
Answer 2
This one is again a feature from SAP HANA Series: the series table.
create column table smanydates (dday date)
SERIES ( EQUIDISTANT INCREMENT BY INTERVAL 1 DAY
MINVALUE '01.01.0001'
MAXVALUE '31.12.9999'
PERIOD FOR SERIES (dday));
The above CREATE TABLE statement specifies that the timestamp information in column DDAY is not explicitly stored.
Instead, the internal storage is merely a calculation formula that "knows" how every date between MINVALUE and MAXVALUE at a granularity of 1 DAY can be computed.
Instead of storing a full date now, only a number uniquely identifying a day needs to be stored.
Given the starting date '01.01.0001', the current day would then be stored as just the integer 735.832 (20 bits).
Of course there are also limitations to this type of data storage, but for ever-increasing, regular date or timestamp information this really is an option to save a lot of memory.
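Just to make the offset idea tangible, here is a plain-SQL illustration of the same arithmetic (this is not the internal series storage, just the offset calculation expressed with DAYS_BETWEEN):

select days_between(to_date('01.01.0001', 'DD.MM.YYYY'), current_date) as day_offset
from dummy;
-- returns the number of days since 01.01.0001, i.e. a value in the region of the 735.832 mentioned above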
For more information about SAP HANA Series, please check
- SAP HANA SPS10- Series Data/ TimeSeries
- SAP HANA SPS 09 – What’s New? Time Series
- http://help.sap.com/hana/SAP_HANA_Series_Data_Developer_Guide_en.pdf
——–
– Lars
Hi Lars,
I realized that I had never played with dates before the last century, so it was a nice trip through history as well. I was trying some "sophisticated" solutions yesterday, and although the runtime was around 8 seconds, they were neither elegant nor readable.
There's nothing like a night with jobs running in the background... rsss
I woke up today with a very simple idea in mind: take a table which has your 3 million lines and use window functions to number it. Simple and readable. Now I'm curious about yours.
drop table manydates_r;
create table manydates_r (dday date);
drop table manydates_c;
create column table manydates_c (dday date);
select add_days('0001-01-01',row_number() over ()-1)
from (select top 3652061 1 from _sys_repo.PACKAGE_CATALOG,_sys_repo.PACKAGE_CATALOG)
into manydates_r;
select add_days('0001-01-01',row_number() over ()-1)
from (select top 3652061 1 from _sys_repo.PACKAGE_CATALOG,_sys_repo.PACKAGE_CATALOG)
into manydates_c;
merge delta of manydates_c;
unload manydates_c;
load manydates_c all;
I saw that for the row table the server processing time sometimes jumps to 5 or 7 seconds (no idea why), but the insert into the column table seems consistently lower:
-- row table
Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from ...'
successfully executed in 5.259 seconds (server processing time: 3.626 seconds) - Rows Affected: 3652061
-- column table
Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from ...'
successfully executed in 3.895 seconds (server processing time: 3.672 seconds) - Rows Affected: 3652061
Question 2 - I have no clue what's going on, but during the investigation I found a question 3.
I ran the inserts many times with different numbers of records... I could see that the column table has a memory penalty just to get started (around 256k, compared to around 16k for the row table)... but what I really have no explanation for is the 2 million row sample, which consumes more memory than the 3.6 million rows?!
Question 3: Why does the column table waste more memory for 2 million rows than for 3.5 million?
BTW: I also have no clue why.
Background: these tests were made on SP10, but to check question 2 I also reproduced them on SP8, and the difference pointed out in question 3 doesn't exist there... 3.6 million rows consume 55 MB... with a progressive curve without disruption. See the attached sheet with samples.
---
Other findings:
1) The DO BEGIN ... END block has a "good" memory, probably a bug... during development it stopped "compiling" my new changes and was always running an old version... it just ignored my changes...
2) SP8 ran this sample faster
Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from ...'
successfully executed in 4.102 seconds (server processing time: 2.399 seconds) - Rows Affected: 3652061
Statement 'select add_days('0001-01-01',row_number() over ()-1) from (select top 3652061 1 from ...'
successfully executed in 2.996 seconds (server processing time: 2.748 seconds) - Rows Affected: 3652061
3) The more complex runs of around 8 to 10 seconds that I mentioned work like this:
- create a view over DUMMY producing the digits
- produce a result set with a cross join over the digits view for year, month and day
- produce the output applying the rules for 1532, leap years, the special century rule for years up to 1500...
---
Regards, Fernando Da Rós
No attachment... so pictures...
Hi Fernando
glad you picked up this little quiz!
Ad 1): yes, this is a quick way to insert lots of data.
However, it's not the way I used. Let's see if someone comes up with the right solution here before I reveal it 🙂
Ad 2) I leave this open till the final closing of this quiz, too...
Ad 3) I tried that and couldn't reproduce this.
My 2.000.000 rows table takes 13.259.764 bytes in total
My 3.652.061 rows table takes 24.661.176 bytes in total
and these numbers somewhat make sense, I'd say.
COUNT MEM_SIZE_TOTAL MEM_SIZE_MAIN MAIN_MEM_SIZE_DATA MAIN_MEM_SIZE_DICT
2000000 13259764 13251064 5250008 8000000
3652061 24661176 24652476 10043176 14608244
Some data from M_CS_ALL_COLUMNS
2 Mio
MAIN_MEMORY_SIZE_IN_DICT/entry = 4
MAIN_MEMORY_SIZE_IN_DATA/entry = 2,625004
3,6 Mio
MAIN_MEMORY_SIZE_IN_DICT/entry = 4
MAIN_MEMORY_SIZE_IN_DATA/entry = 2,750002
So, the number of bytes per entry in the dictionary is 4.
For the main "vector" I'd say that the dictionary reference encoding always uses just as many bits as are required to address all items.
Only when a really large number of unique dictionary entries must be addressed will all 4 bytes be used.
And 3.6 Mio records simply contain references that require more bits than 2 Mio records do.
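A quick back-of-the-envelope check of that idea against the numbers above (an approximation, not an official formula): with n distinct dictionary entries the value vector needs about ceil(log2(n)) bits per row.

select ceil( ln(2000000) / ln(2) ) / 8 as bytes_per_entry_2mio,   -- 21 bits / 8 = 2,625
       ceil( ln(3652061) / ln(2) ) / 8 as bytes_per_entry_3_6mio  -- 22 bits / 8 = 2,75
from dummy;

That lines up rather nicely with the 2,625004 and 2,750002 bytes per entry reported from M_CS_ALL_COLUMNS.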
Thanks again for contributing!
Strange... I'll run the 2 million and 3.5 million test on another SP10 instance... back to this at night.
Hi Colleagues,
Forget question 3; I repeated it more times and couldn't reproduce it on any SP10 system.
Here is the other solution (the complex one). You don't need to look for a table with 3 million rows on the system, as it is based on the DUMMY table.
Regards, Fernando Da Rós
drop view digits10;
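-- the ten digits 0-9, each generated from DUMMY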
create view digits10 as (
select '0' as d from dummy union all
select '1' as d from dummy union all
select '2' as d from dummy union all
select '3' as d from dummy union all
select '4' as d from dummy union all
select '5' as d from dummy union all
select '6' as d from dummy union all
select '7' as d from dummy union all
select '8' as d from dummy union all
select '9' as d from dummy);
drop view anos;
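-- all years 0001 to 9999 as four-digit strings, plus their numeric value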
create view anos as (
select ano,to_number(ano) as ano_i
from (select d1.d||d2.d||d3.d||d4.d as ano
from digits10 as d1,digits10 as d2,digits10 as d3,digits10 as d4)
where ano > '0000');
drop view meses;
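-- all months 01 to 12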
create view meses as (
select mes,to_number(mes) as mes_i
from (select d1.d||d2.d as mes
from digits10 as d1,digits10 as d2
where d1.d in ('0','1'))
where mes between '01' and '12');
drop view dias;
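-- all day numbers 01 to 31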
create view dias as (
select dia,to_number(dia) as dia_i
from (select d1.d||d2.d as dia
from digits10 as d1,digits10 as d2
where d1.d between '0' and '3')
where dia between '01' and '31');
drop view all_possible_dates;
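-- every valid date, assembled from year/month/day combinations: month lengths, leap years (including the extra century leap years up to 1500) and the ten dropped calendar-switch days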
create view all_possible_dates as (
select to_date(sday,'yyyymmdd') as dday from (
select ano || mes || dia as sday from anos,meses,dias
where ((dia_i <= 28) and not (ano_i=1532 and mes_i=10 and dia_i between 5 and 14))
union all
select ano || mes || dia as sday from anos,meses,dias
where (dia_i in (29,30,31) and mes_i in (01,03,05,07,08,10,12))
union all
select ano || mes || dia as sday from anos,meses,dias
where (dia_i in (29,30) and mes_i in (04,06,09,11))
union all
select ano || '0229' as sday from anos
where ( (mod(ano_i,4)=0 and mod(ano_i,100)<>0)
or (mod(ano_i,400)=0)
or (ano_i <= 1500 and mod(ano_i,100)=0))
));
drop table manydates_r;
create table manydates_r (dday date);
drop table manydates_c;
create column table manydates_c (dday date);
select * from all_possible_dates
into manydates_r;
select * from all_possible_dates
into manydates_c;
merge delta of manydates_c;
unload manydates_c;
load manydates_c all;
Statement 'select * from all_possible_dates into manydates_r'
successfully executed in 7.933 seconds (server processing time: 5.965 seconds) - Rows Affected: 3652061
Statement 'select * from all_possible_dates into manydates_c'
successfully executed in 4.972 seconds (server processing time: 4.724 seconds) - Rows Affected: 3652061
Great work and good effort Fernando!
If there were a prize for this little quiz, you would have gotten it, even though the original solution has not been found so far.
I'll update the blog post now, so that the riddle is solved.
Thanks again for playing along. Looking forward to the next riddle 😉
Cheers,
Lars
Hi Lars,
The answers remind me that I am still stuck on SP8; I need to go back to school again.
Thanks for bringing this up. The HANA DB gets more clever with each SP.
Cheers, Fernando Da Rós
- Question 2:
Reorg table & partition table?
Hi Lars,
I have few questions:
1. If it's just a formula and not the actual values that are stored when using the series data definition for timestamp (or whatever columns are used as part of the series), don't you think retrieval would be slower compared to normal tables, since it has to compute the values on the fly and return them?
2. I believe linear RLE would be used as the advanced compression technique for storing series data, and as far as my research goes, I couldn't fit the concept of storing the formula with the linear RLE methodology.
3. How are missing values handled? E.g. in a series of timestamp data at a 1 day interval, if a few values are missing in the middle, how does the formula cope with these missing values?
Thanks in Advance
-Anil
ad 1) The formula really is just an offset calculation that can easily be vectorized on CPU level. I don't see why this would necessarily be much slower than the "normal" access.
ad 2) Not sure what you're asking here. Consider the function as a filter. Instead of dates, say '12.02.1996', '13.02.1996', '14.02.1996', you might just store 43, 44 and 45 as integers (if e.g. the base date had been the 01.01.1996).
So the integer values are what gets stored in the column store.
The RLE compression (or any other column store compression technique) mainly works on the value vector. So if you have multiple occurrences of the same date, say 1000 times 44 (representing the 13.02.1996), then these can be compressed well via RLE, just as with regular columns.
ad 3) Why would the formula need to cope here? It's an absolute offset formula - not a relative one. Technically missing values are possible.