
JOINs in an Imperfect World

In the following blog I compare the run-time performance of the Join Engine, Calculation Engine and SQL Engine as data grows from 1 million to 99 million rows across several tables.


Below you will also find links to sample procedures for generating data and recording query run-time stats.

I skipped the OLAP Engine, as the background below may explain.

In a Perfect World:

– All your main data is on 1 table and requires very few joins with small master data tables.

– You only need to use a single field in JOIN rules.

– You can safely use INNER JOINS.

– All your data sits on 1 Node and isn’t partitioned across Nodes.

– No one else is running other jobs on your HANA box 😉

In a perfect world HANA returns aggregated results of millions/billions of records blisteringly fast.

If you can design your reports to run off a single table, or off a single Fact Table with limited joins (e.g. with an Analytic View – OLAP Engine), then you will have some very, very happy users.

We always strive to live in the perfect world, but sometimes user requirements get in the way. 😯

This may be particularly true when you don’t have control of the table structures (e.g. if using ECC tables) and as your data volume grows.

An Imperfect World:

Imagine reports that need to:

– join Financial documents with Material Movement Documents

– join Sales Order Contracts with Billing Documents and Accounts Receivable Open items

In this murky world:

–  Values / Quantities are in more than one Fact Table.

–  JOIN rules are usually on more than one field.

–  INNER JOIN can rarely be used reliably

–  Data may be partitioned across Nodes

–  User selection Criteria is unlikely to exist on all tables being joined

     (e.g. Company Code, Year, Period, Profit Center, Account, etc.)

      Why is that important?  For performance it’s typically best to push your user selections down to the table level before a join, as the sketch below illustrates.
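As a plain-SQL illustration (the table and column names below are made up for this sketch, not the actual test objects), pushing the selection down means the filter on XARG is applied to TABLE_X before the join, so only the matching rows take part in the join, instead of joining the full tables and filtering afterwards. In a graphical model the equivalent is applying the filter at the projection node on the source table rather than on the final output node.

     -- Hypothetical tables and columns, for illustration only:
     -- the filter on XARG is applied to TABLE_X before the join,
     -- so only the matching rows are joined to TABLE_Y.
     SELECT SUM(y.YVAL) AS total_yval
     FROM ( SELECT YKEY
            FROM   TABLE_X
            WHERE  XARG = 'X050' ) x
          LEFT OUTER JOIN TABLE_Y y
                       ON y.YKEY = x.YKEY;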

I don’t think there is one simple answer for how best to do complex joins. I think each requirement needs careful analysis and testing.

I’ve written this blog simply to share some testing I’ve done on a very limited scenario.

I welcome people to give comments, tips, tricks and enhancement suggestions. Constructive criticism is most definitely welcome.  🙂

Testing Joins on Three Tables

To test out the implications of more complex joins on an increasing data set, I’ve created 3 test tables, 3 different views joining the data, and procedures to generate test data, execute test queries and store results.

I then used the procedures to gradually load 99 million records (in the 2 main tables X & Y) and periodically execute the queries and store results as data volumes increased.  All the SQL code for you to replicate this test is shared in links at the end.

3 Tables:

Values are stored in Table X and Table Y, and Table Z has additional information users may need in selection criteria. A minimal sketch of tables in this shape follows.
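As a rough sketch only (every object name and data type here is an assumption; the actual definitions are in the SQL document linked at the end), tables of this shape might be created as follows:

     -- Illustrative sketch only; the real DDL is in the linked SQL document.
     CREATE COLUMN TABLE TABLE_X (
        XKEY  INTEGER       NOT NULL,   -- primary key
        XARG  NVARCHAR(10)  NOT NULL,   -- ~100 possible values, used in the WHERE tests
        YKEY  INTEGER       NOT NULL,   -- non-key field joined to TABLE_Y's key
        ZKEY  INTEGER       NOT NULL,   -- non-key field joined to TABLE_Z's key
        XVAL  DECIMAL(15,2),            -- value aggregated in the test queries
        PRIMARY KEY (XKEY)
     );

     CREATE COLUMN TABLE TABLE_Y (
        YKEY  INTEGER       NOT NULL,   -- primary key
        YVAL  DECIMAL(15,2),            -- value aggregated in the test queries
        PRIMARY KEY (YKEY)
     );

     CREATE COLUMN TABLE TABLE_Z (
        ZKEY  INTEGER       NOT NULL,   -- primary key
        ZARG  NVARCHAR(10)  NOT NULL,   -- ~10 possible values, used in the WHERE tests
        PRIMARY KEY (ZKEY)
     );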

3 Views:

1) AT_PERF_001             Attribute View (JOIN Engine) joining all 3 tables  [Heresy some people will say, but HANA didn’t stop me 🙂 ]

2) CA_PERF_GRA_001   Calculation View (Calculation Engine) joining all 3 tables

3) CA_PERF_GRA_002   Calculation View joining all 3 tables, set to execute in the SQL Engine

NOTE: An Analytic View can’t be used because values/measures are on more than one table. There are quite a few other ways I could have done these joins, but I thought this was a good start; if you think you can build a faster model then look at my example below, try it out and provide feedback.

3 Procedures:

1) PERF_GEN_CHUNK   randomly generates test data for the 3 tables  (e.g. 100k rows in X, 100k in Y, 10k in Z)

2) PERF_GET_REPORT_STATS executes the 3 queries and captures their runtime results in table PERF_RESULTS

     The executed query aggregates SUM(XVAL) & SUM(YVAL) and is run 3 times with different WHERE clauses (a sketch of the query follows this list):

          a) NO WHERE clause

          b) WHERE Table Z.ZARG = 'Z01'  (of ~10 possible values)  [restriction is in the last join table]

          c) WHERE Table X.XARG = 'X050'  (of ~100 possible values)  [restriction is in the first join table]

3) PERF_SCALE_TEST, which takes input parameters to choose how much data to generate, then loops, executing 1) and 2)
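As an illustration only (the exact statements are in the SQL document linked at the end; the package, view and column names below are placeholders), test query b) against one of the views looks roughly like this:

     -- Sketch of test query b); package and view name are placeholders.
     SELECT SUM(XVAL) AS total_xval,
            SUM(YVAL) AS total_yval
     FROM   "_SYS_BIC"."<package>/CA_PERF_GRA_001"
     WHERE  ZARG = 'Z01';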

Run the Test:

I then ran PERF_SCALE_TEST 99 times to generate 99 million rows in Table X and Table Y, and 10 million in Table Z.

I ran as follows:

     call PERF_SCALE_TEST(10,10,'YES',1);
     -- Parameters: 10 = loop 10 times, 10 = run reports on the 10th iteration,
     -- 'YES' = delete the tables first, 1 = starting iteration number
     -- Loads the first million

     call PERF_SCALE_TEST(10,10,'NO',11);    -- Loads the second million

     call PERF_SCALE_TEST(10,10,'NO',21);    -- Loads the third million

     .....

     call PERF_SCALE_TEST(9,9,'NO',991);    -- Loads the 99th million

NOTE: HANA Studio auto-commit was switched off, and a merge delta was executed between each CALL statement (along the lines of the sketch below). Doing the commit & merge delta inside the procedure seemed to have no positive effect and blew out run times.
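Between the CALL statements the merge was issued manually; a minimal sketch, using the placeholder table names from above, would be:

     -- Run between the CALL statements above (table names are placeholders).
     COMMIT;
     MERGE DELTA OF TABLE_X;
     MERGE DELTA OF TABLE_Y;
     MERGE DELTA OF TABLE_Z;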

Query Test Results [HANA REV 67 – Tables all on same node / NO PARTITIONING]:

The following graphs show the impact of different query run-times as data grows in the tables.



[Graph]  NO Where Restriction (Run-time in Seconds)

[Graph]  Where ZARG = 'Z050'  (Run-time in Seconds)

[Graph]  Where XARG = 'X01'  (Run-time in Seconds)

NOTE: Loading the 99 million records with the procedure took approx. 7 hours. During the load of the first 55 million records the node was being actively used by others. I’ve re-run the test during a quiet period and you can see a much more uniform line up to 55 million rows, but I thought including the graphs with heavy usage during the first half of the load was more dramatic, and it shows how system load can impact run-time as well.

If you want to see the graphs with minimal other processes running then see

https://docs.google.com/document/d/1SjMxxPsx-v6RAHMj_M_X9c0e6OqdshIeCsHC-_8WIus/edit?usp=sharing

All the SQL Code for creating this data and executing the queries is supplied here:

https://docs.google.com/document/d/1Ep89BtWhdgUagTLiPkAFd66wgKVtwcy5_Pg8xe_i31I/edit?usp=sharing

The detailed test results for the above graphs are available here:

https://docs.google.com/spreadsheet/ccc?key=0Ahydpie8Km_fdEpBOERmZXpDNGtTZFRfSmtNZXhOTnc&usp=sharing

As an additional observation, the memory usage of the tables is:

         [Chart]  Memory usage (MB) of 100 million rows in Table X & Table Y, and 10 million rows in Table Z

These size stats are taken from M_CS_ALL_COLUMNS, with the columns grouped as follows (a sketch of such a query is shown after the list):

CORE FIELDS (internal_attribute_type is null)

PRIMARY KEY (internal_attribute_type in ('TREX_EXTERNAL_KEY','TREX_UDIV','ROWID'))

SECONDARY KEY (internal_attribute_type = 'ROWID')
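A sketch of the kind of query behind these numbers (an illustration only: the table names are the placeholders used earlier, and MEMORY_SIZE_IN_TOTAL is assumed to be the relevant size column of M_CS_ALL_COLUMNS):

     -- Sketch only: memory per internal attribute type, in MB.
     SELECT table_name,
            IFNULL(internal_attribute_type, 'CORE FIELD')    AS attribute_type,
            ROUND( SUM(memory_size_in_total) / 1024 / 1024 ) AS memory_mb
     FROM   m_cs_all_columns
     WHERE  schema_name = CURRENT_SCHEMA
       AND  table_name IN ('TABLE_X', 'TABLE_Y', 'TABLE_Z')  -- placeholder names
     GROUP  BY table_name, internal_attribute_type
     ORDER  BY table_name, memory_mb DESC;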

It’s interesting that the Primary Key (plus a few other HANA system fields) and the Secondary Key (created by HANA automatically to support the join between non-key fields in Table X and key fields in Table Y) take up more than half of the memory used by Table X & Y.


For a more detailed explanation of M_CS_ALL_COLUMNS and how joins on non-key fields impact memory, see Lars Breddemann’s excellent blog “Playing with SAP HANA”: http://scn.sap.com/docs/DOC-46329

Summary:

So if you are lucky and the tables you need to join just happen to look exactly like Tables X, Y & Z, and your data set is similar, then you now know which way to model your joins: Attribute Views won.  [But wait, isn’t that just meant for master data joins? It felt so wrong to use, but it worked so well.]

In the more likely event that the tables you need to join in the real world are different from my examples, just consider that there is always more than one way to do the joins, and your choice may need to change as data volumes increase.

Comments:
      Justin Molenaur

      Aron, this is fantastic test work - I can really appreciate the attention to detail and methodology.

      So of course the conclusion is that for heavy joins, AT's do a great job. Of course you can wrap these in a CV in order to expose to users also.

      Question is - have you implemented productive solutions using this approach? It really does turn the "normal" modeling suggestions upside down. But once again, I am not surprised in the least, as in HANA things aren't clear until you take the time to really test it out.

      I have a solution right now that really is nothing more than some heavy joins. I have a CV based solution that doesn't quite perform, and based on your post here I am going to rework it using a series of AT's to do the heavy joining. I'll post the results back here.

      Regards,

      Justin

      Former Member (Blog Post Author)

      Thanks. And yes, I have found some instances where the AT view (with transactional tables) definitely helped in Prod, but it's used only if all else has failed.

      The SQL Engine can be comparable if you are applying suitable "where" restrictions, so it might be something to try first.

      I do often wonder with each Revision whether these performance stats may change, as SAP continues to enhance the different engines.