Inside SAP HANA – optimising data load performance and tuning
So late last year, I had the opportunity to load some reasonable data volumes into SAP HANA. We got about 16GB of data or 160m records, and ran some tests.
Well now I have closer to 256GB of the same data to load, and I only have a 512GB SAP HANA appliance to spare, which already has a SAP BW system and a bunch of other things on it! So I thought it was time to try to optimise the data model and make it slicker.
We cut a few corners last time around because some of the data was a bit corrupt and so we had some text fields where there could otherwise be dates – and in addition, we were not judicious with the use of things like smaller integer data types, where there were only small numbers of distinct values.
I’m not sure how much value this has in SAP HANA because of the way it creates indexes, but text fields certainly take up a lot of unneccessary space. Today I’m going to do a bunch of testing with a 42m row test data set, and then use this for the basis of the final model – loading the full 2bn+ rows in-memory. And we’ll see how it performs!
Step 1: Data Model optimisation
I now have no text values in my main fact table – only TINYINT, SMALLINT, INT, REAL, DATE and TIME. I’m going to load this file into the old fact table which isn’t optimised, and compare for space to see how much we have saved. The model is already well normalised so I’m not expecting to be able to reduce this any further.
So we are moving from:
create column table “POC”.”FACT”( “Col1” INTEGER, “Col2” INTEGER, “Col3” INTEGER, “Col4” INTEGER, “Col5” INTEGER, “Col6” INTEGER, “Col7” INTEGER, “Col8” INTEGER, “Col9” REAL, “Col10” INTEGER, “Col11” INTEGER, “Col12” INTEGER, “Col13″ VARCHAR (32) default ”, “Col14″ VARCHAR (16) default ”, “Col15” INTEGER, “Col16” INTEGER, “Col17” INTEGER, “Col18” INTEGER, “Col19” INTEGER);
To:
create column table “POC”.”FACT2″( “Col1” INTEGER, “Col2” DATE, “Col3” TINYINT, “Col4” TINYINT, “Col5” SMALLINT, “Col6” SMALLINT, “Col7” TINYINT, “Col8” TINYINT, “Col9” REAL, “Col10” INTEGER, “Col11” TINYINT, “Col12” TINYINT, “Col13” REAL, “Col14” TIME, “Col15” SMALLINT, “Col16” TINYINT, “Col17” TINYINT, “Col18” SMALLINT, “Col19” DATE);
Now I’ve loaded the same data into both fact tables.
Fact Table 1: 1,327,747kb
Fact Table 2: 778,737kb
Wow – I’m really surprised by this. I’m guessing it’s the text fields which were the killer, but this has halved the table size. This will make a big difference! I checked this and of course the column store has limited data types so things like TINYINT don’t have an impact as integer datatypes are already compressed.
What’s even more interesting is that we have reduced the load time from 88 seconds down to 26 seconds. I guess those text inserts were expensive in the column store.
Step 2: Load optimisation
HANA has two main load optimisation steps – the number of parallel processes and the amount of data consumed before commit. This SAP HANA box is a Medium, which means 40 CPUs and 512GB RAM. Theory should mean that you will get an improvement up to 40 threads, which will tail off.
This is really important right now because I know that the original 160m row test case takes 9 minutes to load. So we’re looking at at least 2 hours to load the full set. Do you think we can make this much less?
|
Default |
10k |
50k |
200k |
500k |
1 Thread |
7m11s |
|
|
|
|
2 Threads |
3m33s |
|
|
|
|
4 Threads |
1m57s |
|
|
|
|
8 Threads |
1m13s |
|
|
38s |
|
16 Threads |
44s |
34s |
28s |
25s |
26s |
32 Threads |
47s |
32s |
27s |
25s |
27s |
40 Threads |
47s |
32s |
43s |
25s |
42s |
The learning here is really clear – it’s really easy to get down from 7m to about 30 seconds without being very clever and then another 10-15% performance can be had by fine-tuning. It also looks like there is no benefit of moving past about 16 load threads or a Batch of more than 200k, at least in this system. What’s more when you move to 40 threads, the results start becoming inconsistent. I have no idea why.
Step 3: HANA Database Version
So I’m interested to see what happens if we update our HANA Database version. I’m told that there improvements both in compression and performance as new releases come out. What do you think?
Well I looked into it and unfortunately the current version of HANA SP03 is Patch 25, which has a problem with data load performance. Guess this is going to have to wait until another day.
Update: well we were on HANA SP03 Patch 20 and it wasn’t as stable as I’d have liked, so Lloyd upgraded us to Patch 25. This seems to be so much better and I’d really recommend keeping up to date. I’m not sure it actually fixed any of our problems but it stopped a whole load of errors from appearing in the logs.
Step 4: The main load
OK – now we’re ready to do the main load. I don’t think this system can be optimised any further and by my calculations of 40m records in 25 seconds, we should be able to load the full 2bn in 20 minutes or so. Are you ready? Let’s see.
The load is split up into 17 files – details for the load are below.
|
Size |
Rows |
Time |
File 1 |
3925MB |
42m |
25s |
File 2 |
1510MB |
16m |
11s |
File 3 |
14304MB |
152m |
120s |
File 4 |
13196MB |
140m |
144s |
File 5 |
14239MB |
152m |
164s |
File 6 |
13049MB |
139m |
107s |
File 7 |
13569MB |
144m |
105s |
File 8 |
15511MB |
166m |
177s |
File 9 |
14156MB |
151m |
144s |
File 10 |
14960MB |
160m |
156s |
File 11 |
13449MB |
146m |
218s |
File 12 |
15316MB |
166m |
419s |
File 13 |
19843MB |
214m |
163s |
File 14 |
17037MB |
176m |
415s |
File 15 |
16399MB |
183m |
247s |
File 16 |
15275MB |
163m |
198s |
File 17 |
11959MB |
128m |
178s |
Total |
227697MB |
2312m |
2991s |
As you can see, the performance is great initially and then starts to tail off in the later files. This leads to a fairly disappointing load performance (still massively faster than any other database remember!!!) of 2991s or 50 minutes to load 2.3bn rows into memory. I was hoping for over double this.
The reason for this seems to be that HANA puts the rows initially into a delta store, and after that has loaded, it then automaticaly merges them into the main column store. This happens concurrently with the load and seems to kill performance. So what I did was to join all 17 files into one big file, and then try loading it. Let’s see what happens:
|
Size |
Rows |
Time |
File 1 |
227690 |
2312m |
51m |
Curiously it’s not any faster, and this turns out it’s because it does the merge delta during the load. You can disable this with the following statement:
ALTER TABLE “POC”.”FACT2″ WITH PARAMETERS (‘AUTO_MERGE’ = ‘OFF);
Note that whilst auto merge is off (turn it on when you’re done loading) you can do manual merges doing:
MERGE DELTA OF “POC”.”FACT2″;
Step 5: Learning about partitions
It’s at this point that i learnt there is a 2bn row limit in a single SAP HANA database partition. Curiously it allows you to load more than 2bn rows but then fails on the subsequent merge delta. This sounds like a pain but it’s actually a blessing in disguise for two reasons. First, it turns out that as the amount of data in the main store increases, so does the cost of a merge delta and second, using partitions allows you to get the best out of HANA. If you turn automatic merge off and do it manually, look what happens:
An early merge takes just a minute:
Statement ‘IMPORT FROM ‘/hana/FACT_05_2010.ctl’ WITH THREADS 16 BATCH 200000′ successfully executed in 1:50.816 minutes – Rows Affected: 0
Statement ‘MERGE DELTA OF “POC”.”FACT2″‘ successfully executed in 1:04.592 minutes – Rows Affected: 0
Wait a while and you see that the merge time has increased by a factor of 4 whilst the load time into the merge store is more or less linear. This makes sense of course because the merge process has to insert the records into a compressed store, which is computationally expensive. It appears to increase at O(m.log(n)) where n is the size of the main store and m is the size of the delta store, which more or less makes sense based on my knowledge of search and sort algorithms.
Statement ‘IMPORT FROM ‘/hana/FACT_09_2011.ctl’ WITH THREADS 16 BATCH 200000′ successfully executed in 2:23.475 minutes – Rows Affected: 0
Statement ‘MERGE DELTA OF “POC”.”FACT2″‘ successfully executed in 4:11.611 minutes – Rows Affected: 0
And since it turns out that you can partition easily by month, I emptied the table and decided to repartition it like this. I now have 26 partitions, one for each month, plus an extra partition for anything else that doesn’t fit.
ALTER TABLE “POC”.”FACT2″ PARTITION BY RANGE (DATE)
(PARTITION ‘2010-01-01’ <= VALUES < ‘2010-02-01’,
……
PARTITION ‘2012-02-01’ <= VALUES < ‘2012-03-01’,
PARTITION OTHERS);
Note that HANA makes it easy to add new partitions and move data between partitions. Data management even in large volumes won’t be a problem, you will be glad to know. And look at the load times – it is completely linear for the delta merge right up until the last partition. This is a major improvement compared to a legacy RDBMS where you get very slow batch load times unless you drop indexes – massively slowing concurrent read performance.
Statement ‘IMPORT FROM ‘/hana/FACT_02_2012.ctl’ WITH THREADS 16 BATCH 200000′ successfully executed in 1:46.066 minutes – Rows Affected: 0
Statement ‘MERGE DELTA OF “POC”.”FACT2″‘ successfully executed in 53.618 seconds – Rows Affected: 0
What’s more it loads a total of 3.7bn rows in 73 minutes – including the merge delta exercise, which I wasn’t even counting before.
Step 6 – Using Table locks
Another HANA tip is to spend some time reading the SQL Reference Manual. It has lots of stuff in it, much of which you won’t find documented anywhere else. I found a little function called TABLE lock which should allow you to load data faster. Let’s try it, the SQL Syntax looks like this:
IMPORT FROM ‘/hana/FACT_02_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
Note that you only want to do this on initial batch loads because it locks the entire table, but it’s unlikely you will want to load billions of rows in one go after the initial load. What’s really interesting here is that this feature puts the data directly into the main table – bypassing the delta table and the need to do a merge delta – but it is some slower 16% than a regular load followed by a merge delta. Why? Who knows!
Step 7 – Conclusions
Well I’m sure you have figured out a lot of this as you read this blog, but here’s the takeaways that I got out of this exercise. But first and foremost, despite being fast, it is definitely worth your time to optimise your SAP HANA scenario.`
1) Spend time on optimising your HANA Data Model. It will reduce the size of your database for the big tables, improve performance and also reduce cost, because HANA is licensed by appliance memory size.
2) Test to optimise your load parameters. But don’t spend too much time here. It’s not that sensitive to detail changes so get it nearly right and move on.
3) Choose your partitioning scheme carefully. And then load partitions into individual files and do manual merges in-between partitions, if you are loading a lot of data. You don’t have to but it will speed end-end load performance and allow for speedier issue resolution.
SAP HANA is pretty amazing technology and if you have worked with any other kind of RDBMS you will know that even the slowest of these times is far faster than anything else.
But first and foremost two things are persistent for me. First, tuning SAP HANA is just as important as with any other system – with a bit of work you can achieve performance that you wouldn’t have believed before. And second, performance tuning SAP HANA is different to other systems – you tune for parallelisation and memory usage rather than for I/O. But remember this: performance tuning is about finding the performance envelope of a system and working around the weakest point. In that sense SAP HANA is no different to any other computer system in the world.
And in case you are in any way confused, here is the final SQL I used to create and load the table optimally:
drop table “POC”.”FACT2″;
create column table “POC”.”FACT2″(
“Col1” INTEGER,
“Col2” DATE,
“Col3” TINYINT,
“Col4” TINYINT,
“Col5” SMALLINT,
“Col6” SMALLINT,
“Col7” TINYINT,
“Col8” TINYINT,
“Col9” REAL,
“Col10” INTEGER,
“Col11” TINYINT,
“Col12” TINYINT,
“Col13” REAL,
“Col14” TIME,
“Col15” SMALLINT,
“Col16” TINYINT,
“Col17” TINYINT,
“Col18” SMALLINT,
“Col19” DATE) NO AUTO MERGE;
ALTER TABLE “POC”.”FACT2″ PARTITION BY RANGE (DATE)
(PARTITION ‘2010-01-01’ <= VALUES < ‘2010-02-01’,
PARTITION ‘2010-02-01’ <= VALUES < ‘2010-03-01’,
PARTITION ‘2010-03-01’ <= VALUES < ‘2010-04-01’,
PARTITION ‘2010-04-01’ <= VALUES < ‘2010-05-01’,
PARTITION ‘2010-05-01’ <= VALUES < ‘2010-06-01’,
PARTITION ‘2010-06-01’ <= VALUES < ‘2010-07-01’,
PARTITION ‘2010-07-01’ <= VALUES < ‘2010-08-01’,
PARTITION ‘2010-08-01’ <= VALUES < ‘2010-09-01’,
PARTITION ‘2010-09-01’ <= VALUES < ‘2010-10-01’,
PARTITION ‘2010-10-01’ <= VALUES < ‘2010-11-01’,
PARTITION ‘2010-11-01’ <= VALUES < ‘2010-12-01’,
PARTITION ‘2010-12-01’ <= VALUES < ‘2011-01-01’,
PARTITION ‘2011-01-01’ <= VALUES < ‘2011-02-01’,
PARTITION ‘2011-02-01’ <= VALUES < ‘2011-03-01’,
PARTITION ‘2011-03-01’ <= VALUES < ‘2011-04-01’,
PARTITION ‘2011-04-01’ <= VALUES < ‘2011-05-01’,
PARTITION ‘2011-05-01’ <= VALUES < ‘2011-06-01’,
PARTITION ‘2011-06-01’ <= VALUES < ‘2011-07-01’,
PARTITION ‘2011-07-01’ <= VALUES < ‘2011-08-01’,
PARTITION ‘2011-08-01’ <= VALUES < ‘2011-09-01’,
PARTITION ‘2011-09-01’ <= VALUES < ‘2011-10-01’,
PARTITION ‘2011-10-01’ <= VALUES < ‘2011-11-01’,
PARTITION ‘2011-11-01’ <= VALUES < ‘2011-12-01’,
PARTITION ‘2011-12-01’ <= VALUES < ‘2012-01-01’,
PARTITION ‘2012-01-01’ <= VALUES < ‘2012-02-01’,
PARTITION ‘2012-02-01’ <= VALUES < ‘2012-03-01’,
partition others);
IMPORT FROM ‘/hana/FACT_02_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_03_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_05_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_06_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_07_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_08_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_09_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_10_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_11_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_12_2010.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_01_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_02_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_03_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_04_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_05_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_06_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_07_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_08_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_09_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_10_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_11_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_12_2011.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_01_2012.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
IMPORT FROM ‘/hana/FACT_02_2012.ctl’ WITH THREADS 16 BATCH 200000 WITH TABLE LOCK;
MERGE DELTA OF “POC”.”FACT2″;
ALTER TABLE “POC”.”FACT2″ WITH PARAMETERS (‘AUTO MERGE’ = ON);
Next Steps
Well what’s next? I think you know – I’ve optimised this data model for load performance, but how did I do for query performance? I’m already wondering about this because HANA is a parallel processing engine. If you put 27 months of partitions and then do a query over a wide date range then it should kick off a bunch of parallel processes. If you query on a single month you will hit less data but with fewer parallel processes?
What’s the optimal data model for query performance and how does how you setup your joins, CE Functions and SQL Script affect this?
Acknowledgements and thank yous.
As is often the case with this sort of post, there are thanks to dole around. Lloyd Palfrey from Bluefin for being the HANA Ops guru and fixing the database every time I filled up disks, corrupted it and tuning the box. Vijay Vijaysankar, Vitaliy Rudnytskiy and David Hull from IBM, HP and SAP for bouncing ideas. Margaret Anderson and Serge Mutts from SAP CSA for helping with issue resolution. Markus Friedel from SAP AGS for helping resolve errors – mostly between my keyboard and chair.
And let’s not forget Aleks Barilko, Marilyn Pratt, Gail Kling Schneider and the rest of the SCN crew for recovering most of this blog, which got lost in the move from the old to the new SCN whilst I was on vacation
Great post John! Thank you.
A suggestion: you could use "advance editor" to edit this post
Read this: /community/support/blog/2012/03/14/code-samples-in-new-scn
Regards,
Osvaldo
Thanks for the tip, will do next time!
John,
Thanks for sharing your experiences with SAP-HANA. I hope you don't mind me asking a few questions as my thoughts on SAP-HANA are not in line with the details you provided in this blog.
Is loading data directly into FACT table valid in SAP-HANA world? Or this is just POC to measure data load performance? I thought SAP-HANA data model is supposed to contain only views(logical entities) which are designed to aggregate values on the fly using data in physical tables.
Secondly if the source system is ERP/SAP-BW/non-SAP(including flat file), would we better off - from performance and ETL standpoint - loading data using data services?
Thirdly- this may or may not be related to your blog - since you discuss about SAP-HANA data model, assuming this is BW on SAP-HANA, what happens during DB migration to SAP-HANA? What gets migrated? Do we need to design fact tables or build content of views or nothing?
On partitioning: I assume - based on 512GB size - you used one-blade SAP-HANA. From performance standpoint, is there any advantage(other than 2b records limit) of partitioning tables on one-blade SAP-HANA appliance? Wouldn't this be counter-productive in some cases as CPUs now need to spend time splitting the job/scheduling/monitoring etc?
Thank you,
Bala
Hi Bala,
That's fine, no problems.
1) It's fine to write directly into a Fact table on HANA provided data transformation is being done somewhere. In this case the data is being transformed by Data Services outside of HANA into a normal form. This is a bulk load for the initial load; future loads would be only 5-10m rows a day so load performance becomes insignificant.
2) Yes in the real world this would use DS as a feed. But, for the purposes of testing different data models, CSV files are very useful!
3) If you migrate a BW database from Oracle to SAP HANA, you get all your custom and standard SAP BW content just as it worked before. Then you can migrate it to HANA optimised star schemas at the cube level for better performance and space saving using an ABAP based wizard. BW on HANA takes care of all the database level tuning, partitions etc.
4) Yes a single 512gb 40-core system is what this is on. If you have >2bn rows you have no choice but to partition. Does this reduce query performance? It's a great question that I can't answer yet but in theory you could be correct. What I can say is that queries do use 100% of all 40 CPUs when running against 27 partitions. Next steps are to play with the data model a bit and then see what happens when we vary the number of partitons.
Would it perhaps perform better with just 2 or 4 partitions rather than monthly? That's for another blog!
Regards,
John
Thank you John.
Hi John, nice blog post as usual. Good job Lloyd was on hand! 🙂
I know, and he paid for his own 10" meat feast pizza last night too!
Great post - and it sounds like a lot of fun. Not bad for a "Dreamy Gateway Developer". 😉
Cheers
Graham Robbo
Well it was a few evenings of fun 🙂
Unfortunately some of the things in this post mess up performance so I will have to tear up this rule book once I have figured out how to optimise query performance!
I'm nothing but a dreamer.
John:
Thanks for all your efforts in sharing your knowledge with the SAP HANA community developers like me.
Please continue dreaming. Dreams are free…all we need is little PASSION to fuel our dreams and most of have it and you seem to have it in surplus for SAP HANA….smiling… 😉 😎 …
I feel that you are becoming the un-official faceplate or go-to person for SAP HANA. Like I used to refer to Ingo as for SAP BOBJ few years ago when I was learning SAP BOBJ!
Thanks, once again.
Cheers!
Rama Shankar
Thanks Rama you are too kind. I am really just trying to get some good information about tuning HANA out into the public domain. There are people out there with much more knowledge than I! Glad you are enjoying it anyhow.
You are very welcome.
Absolutely, I am enjoying your write-ups. Thanks.
Excellent John, This document is completely useful..
I am planning to partition for four 4 fact tables total size of 2.4 Billions of rows and daily of 3 million loads appx.
What type of partitioning you suggest please?
what are the parameters should be kept in mind?
Really appreciate your help and support.
Regards,
Ganesh
John: Great Discussion - Thanks.
You're welcome, I'm glad you enjoyed it!
Fantastic post John. You have been busy.
Cheers.
Cheers Doug. Just a few hours here and there in the evenings, don't have time to detract from the day job. Performance optimisation has long been a hobby of mine!
this is really awesome , also had a question on the query model used by Hana , ie. calculation model instantiation and the L nodes etc.
I came accross a diagaram and the document \\\SAP HANA Database – SQLScript Guide - Document version 1.0 – 2011/12/22 - Pages 9 -12.
Hey Joseph,
I'm glad you enjoyed it, what exactly was your question? I'll bear this in mind when I do the investigation on the query model.
John
Hi John
Now as per the diagram it is shown different processes during compilation and execution.
1. L language is also used as a wrapper FOR Predictive analysis which SAP just released to select customers , is this the same L language which is the intermediate one used to define the imperative logic(if & while loops are imperative which changes state after every execution , I am assuming) of sqlscript/
2. How does calculation models internally help in parallelizing the query load/ run
3."When the calculation models are instantiated they can be optimized based on concrete input provided." How would the inputs optimize?
Basically I am trying to understand what code will run poorly so that I can avoid such cases.
Thanks
Joseph
Joseph: Here are my thoughts on your questions. John, please feel free to comment on my response:
Q #1: I have no idea; will have to research...on L language details – please can you let us know what kind of application scenario that you are trying to build?
Q#2: The calculation model is similar to an SAP BW multi-provider. It follows similar rules with respect to unionizing two table sources or data sets. The calculation model definition internally generates SQL if you use the designer mode or you write SQL to create the calculation model manually when you use the script mode. The SQL script will have to be written efficiently to leverage on the best practices (refer to SQL Script developer guide - it has good tips on building SQL scripts). We will have to do a trial and error to fine tune the SQL based on different options available in the manual to take advantage the SAP HANA MPP( massive parallel processing) and in-memory columnar data set ).
Q#3: The inputs are like where clause. If you use the correct where clause then, you will have fewer data sets to execute for the overall calculation models…
Hope this helps. Cheers!
Rama
For the readers this guide is found here:
http://help.sap.com/hana_appliance/ - links tohttps://service.sap.com/~sapidb/011000358700000604932011
1) The wrapper for predictives is called R. It wasn't created by SAP and the details of it are on Wikipedia, here: http://en.wikipedia.org/wiki/R_(programming_language). L is a lower level language created by SAP for procedural statements within HANA - a developer language.
2) As I understand it the way HANA works means that it makes a parallel query choice wherever possible. You can facilitate this with data model design e.g. partitioning, query design or design of CE functions and SQL Script as described in the guide. Getting MPP to run optimally remains a bit of a black art.
3) Mechanisms for optimising execution are detailed in the guide - keep reading!
John,
Please can you share with us the type of table corruptions issues that you have encountered and which type of corruption errors are recoverable and which type of corruption errors are not worth spending lot of time trying to resolve?
Thanks,
Rama
Well I got problems with merge deltas on early revisions of HANA but as of R25 I don't get any of that.
If you load more than 2bn rows into a partition then the merge blows up though and you have to reload the table.
In general with HANA though if you get table corruption problems, you need to reload that table. The good news is that unless you fill up one of the data drives, it has (in my experience) been localised to that table and hasn't caused database corruption.
If you fill up your data drives in HANA then be prepared to restore from backup. This is bad in any database of course not just HANA!
Thanks John.
Does this 2bn row limit per partition strike anyone else as strange? Do any other databases have logical limits like this? Is it on the road-map to get rid of this limit?
It isn't what I was expecting from HANA. I thought HANA was all about getting rid of the need to add arbitrary extra structure to your data to get performance. Chuck it all in memory and it is so fast you don't have to think about it any more. Right?
The suspicious side of me notices that 2bn is the biggest signed 32-bit integer. It can't be that, can it?
Not sure I agree with you Steve. You can definitely just throw stuff in HANA and get great performance, though the 2bn limit is a bit frustrating it is not a problem because you can determine some arbitrary partitioning structure.
But like any computer system, if you spend the time to make it sing, it will perform at its best. HANA is no different in this respect but its performance envelope is different to a regular RDBMS.
I do fear the 2^32 limit is exactly that and TREX/BWA has the same limit. In BWA, BA just looks after it for you.
It's a throwback from TREX. I hope it is fixed in SP04 but I've seen nothing to confirm this yet. I hope it's not deeply architectural.
In the context of the size of datasets always mentioned in the HANA Hype, 2bn rows is really not a lot. I'd be very surprised if it is an architectural limit, but if it is that's worrying. And if it isn't, then somebody somewhere thought a 32-bit int was a good size to use and that's equally worrying.
This could all be nothing, of course, but I love digging about in the lower levels of systems trying to understand what's going on under the covers, and limits of any sort usually get me worried. If there's a good reason for it, then maybe I'll stop worrying when I know what that is. Until then...
Well the crashing problem is fixed in Revision 27, it will give you an error message.
I'm not sure. I don't know how this works well enough to be certain, but we do know that HANA is based on TREX and TREX was a 32-bit application with a 2GB RAM limitation until Version 7.1 which is quite recent.
What's more if you increase your number of rows beyond 2bn you need a 64-bit data type to represent it. Is it possible that HANA is highly optimised for a 32-bit data type here?
That seems plausible, if a little unfortunate. 32-bit limits are a little last decade and I'm surprised to see them anywhere in HANA. Hopefully they're not deeply entrenched, and they'll disappear soon and without causing any major grief to early adopters.
yes, me suprised too, and also hope that it's fixible.
Hi John,
Thanks for the nice document. I need a clarification. I attended the ramp up session from SAP and we were told NOT to go for user defined partitioning for the tables. The SAP ramp up coach specifically mentioned that user defined Partitioning will degrade the performance and it is advisable to keep the data for one table on one single node and let HANA internally handle the partitioning across multiple CPU cores.
Can you please help me with this as I understand that you mentioned partitioning will increase the loading and reporting performance.
Thanks for your help.
Regards,
Ravi
Hi,
No problem and yes that is what they are teaching. But it makes no sense because it means that a single-node system has a limit of 2bn rows.
The point is clear - you need to be careful when partitioning on HANA because you can degrade the performance of the MPP engine or cause a single query to slow the system down.
Multiple partitions improves data loads because each partition has its own delta store. So you can do a full load into 10x1bn row delta stores = 10bn rows and still have good performance of the merge delta.
So yes - partition if you need to but do it with care.
John
Hi John,
I just came across this article on database sharding for performance and scalability: http://highscalability.com/blog/2009/8/6/an-unorthodox-approach-to-database-design-the-coming-of-the.html. The article is an overview really, but contains many links to articles including analyses of web-scale architectures. Might be relevant to this discussion, particularly in light of the 2bn record limit maybe not being such a bad thing after all...
Sascha
Hi John,
This is great benchmarking....
However do you have any more details on the following point of yours:
3) If you migrate a BW database from Oracle to SAP HANA, you get all your custom and standard SAP BW content just as it worked before. Then you can migrate it to HANA optimised star schemas at the cube level for better performance and space saving using an ABAP based wizard. BW on HANA takes care of all the database level tuning, partitions etc.
SAP BW Infocubes and DSOs have a certain behaviour which is different from a normal Database table like a FACT Table. Data gets aggregated or overwritten and that is tied very closely to the extraction method in SAP, AIM/PM (After Image/Pre-image). How does HANA account for that ?
Prateek: checkout this blog for additional points on SAP BW DB migration HANA DB.
Regards,
Rama
John - where are you? You are missing in Action!! 😉
Come back sir!
Regards,
Rama
Hi John,
Thanks for this informative blog.
Best,
Gopal
Update: You can't use > when defining a partition.....that results in a syntax error. So, I guess it must use XOR logic.... if both tests have same value, it doesn't match, if both tests have differing values, it matches?
Hello,
I know this is a long post, but I think it is very important:
Do you know how to run a query that shows the row count by partitions, something like "select partition, count(*) from table group by partition"
I think the data is getting mis-partitioned....and the definitions of the partitions maybe causing a performance hit.
From your post:
...
(PARTITION '2010-01-01' <= VALUES < '2010-02-01',
……
PARTITION '2012-02-01' <= VALUES < '2012-03-01',
PARTITION OTHERS);
..
Consider date '2010-01-15'.
Upper Bound TEST of first partition results in TRUE: Is '2010-01-15' < '2010-02-01? YES
Lower Bound TEST of first partition results in FALSE: Is '2010-01-15' <= '2010-01-01' ? NO!
Only records with the date equal to a first of the month get assigned to the intended partition.
Consider date '2010-01-01'.
Upper Bound TEST of first partition results in TRUE: Is '2010-01-01' < '2010-02-01'? YES
Lower Bound TEST of first partition results in TRUE: Is '2010-01-01' <= '2010-01-01'? YES
But, look consider '2010-06-15'.... and partitions for June(6) and July(7):
JUNE
Upper Bound TEST of partition 6 results in TRUE: Is '2010-06-15' < '2010-07-01'? YES
Lower Bound TEST of partition 6 results in FALSE: Is '2010-06-15' < '2010-06-01'? NO
JULY
Upper Bound TEST of partition 7 results in TRUE: Is '2010-06-15' < '2010-08-01'? YES
Lower Bound TEST of partition 7 results in FALSE: Is '2010-06-15' < '2010-07-01'? YES
A record for june will be assigned to July....or anoth month after July.......
I don't know how the engine works when assigning partitions, but if it compares each record the same way a case statement works, it will inspect it 54 times before assigning it to other. Once for lower bound and once for upper bound on each of the 27 partitions(months) as other is the same as default in a case statment.
I think that if you define the partitions this way, you will get better performance.
(PARTITION '2010-01-01' >= VALUES < '2010-02-01',
……
PARTITION '2012-02-01' >= VALUES < '2012-03-01',
PARTITION OTHERS);
Consider date '2010-01-15'.
Upper Bound TEST of first partition results in TRUE: Is '2010-01-15' < '2010-02-01? YES
Lower Bound TEST of first partition results in FALSE: Is '2010-01-15' >= '2010-01-01' ?YES!
Consider date '2010-01-01'.
Upper Bound TEST of first partition results in TRUE: Is '2010-01-01' < '2010-02-01'? YES
Lower Bound TEST of first partition results in TRUE: Is '2010-01-01' >= '2010-01-01'? YES
Finally, for '2010-06-15'.... and partitions for June(6) and July(7):
JUNE
Upper Bound TEST of partition 6 results in TRUE: Is '2010-06-15' < '2010-07-01'? YES
Lower Bound TEST of partition 6 results in FALSE: Is '2010-06-15' >= '2010-06-01'? YES
JULY
Upper Bound TEST of partition 7 results in TRUE: Is '2010-06-15' < '2010-08-01'? YES
Lower Bound TEST of partition 7 results in FALSE: Is '2010-06-15' >= '2010-07-01'? NO
Please let me know if my undesrtanding is incorrect.
SAP Hana might be a good tool and even SAP says so, am not denying what you have told here buddy, but clients are not warming up to it,
Guess we have to agree to disagree on this one. All the best!
Thanks for this article.