In the following blog I compare the run-time performance of the Join Engine, Calculation Engine and SQL Engine as data grows from 1 million to 99 million rows across several tables.

Below you will also find links to sample procedures for generating data and recording query run-time stats.

I skipped the OLAP Engine, for reasons the following background may explain.

In a Perfect World:

- All your main data is on 1 table and requires very few joins with small master data tables.

- You only need to use a single field in JOIN rules.

- You can safely use INNER JOINS.

- All your data sits on 1 Node and isn't partitioned across Nodes.

- No one else is running other jobs on your HANA box :wink:

In a perfect world HANA returns aggregated results over millions/billions of records blisteringly fast.

If you can design your reports to run off a single table, or off a single Fact Table with limited joins (e.g. with an Analytic View - OLAP Engine), then you will have some very, very happy users.

We always strive to live in the perfect world, but sometimes user requirements get in the way. :shock:

This may be particularly true when you don't have control of the table structures (e.g. if using ECC tables) and as your data volume grows.

An Imperfect World:

Imagine reports that need to:

- join Financial documents with Material Movement Documents

- join Sales Order Contracts with Billing Documents and Accounts Receivable Open items

In this murky world:

-  Values / Quantities are in more than one Fact Table.

-  JOIN rules are usually on more than one field.

-  INNER JOIN can rarely be used reliably

-  Data may be partitioned across Nodes

-  User selection criteria are unlikely to exist on all tables being joined

     (e.g. Company Code, Year, Period, Profit Center, Account etc.)

      Why is that important?  For performance it's typically best to push your user selections down to the table level before a join (see the sketch below).
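To illustrate the idea (a minimal sketch only; the table and column names PERF_X, PERF_Z and ZKEY are placeholders for the test tables described later, not real objects): in plain SQL the two shapes look like this, and in a graphical Calculation View the equivalent is placing the filter in a projection node below the join rather than on top of the join result.

     -- Filter applied on top of the join: the full join result is built first
     SELECT SUM(x.XVAL)
       FROM PERF_X AS x
       INNER JOIN PERF_Z AS z ON x.ZKEY = z.ZKEY
      WHERE z.ZARG = 'Z01';

     -- Filter pushed down to the table level before the join:
     -- only the matching Z rows take part in the join
     SELECT SUM(x.XVAL)
       FROM PERF_X AS x
       INNER JOIN ( SELECT ZKEY FROM PERF_Z WHERE ZARG = 'Z01' ) AS z
         ON x.ZKEY = z.ZKEY;

(For plain SQL the optimizer will often do this pushdown itself; in modelled views, where you control the node layout, it is worth doing explicitly.)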

I don't think there is one simple answer for how best to do complex joins.  I think each requirement needs careful analysis and testing.

I've written this blog simply to share some testing I've done on a very limited scenario.

I welcome comments, tips, tricks and enhancement suggestions. Constructive criticism is most definitely welcome.  :smile:

Testing Joins on Three Tables

To test out the implications of more complex joins on an increasing data set I've created 3 test tables, 3 different views joining the data, and procedures to generate test data, execute test queries and store the results.

I then used the procedures to gradually load 99 million records (in the 2 main tables X & Y) and periodically execute the queries and store results as data volumes increased.  All the SQL code for you to replicate this test is shared in links at the end.

3 Tables:

Values are stored in Table X and Table Y, and Table Z has additional information users may need in their selection criteria.
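The full DDL is in the SQL document linked at the end; purely as a rough sketch (column names beyond XVAL, YVAL, XARG and ZARG are hypothetical placeholders, and the real tables are wider with multi-field join rules), the shape is roughly:

     -- Hypothetical sketch -- the actual CREATE TABLE statements are in the linked SQL document
     CREATE COLUMN TABLE PERF_X ( XKEY INTEGER,          -- join field to Table Y
                                  ZKEY INTEGER,          -- join field to Table Z
                                  XARG NVARCHAR(10),     -- ~100 distinct values, used in the WHERE clause tests
                                  XVAL DECIMAL(15,2) );  -- measure aggregated in the test query

     CREATE COLUMN TABLE PERF_Y ( YKEY INTEGER,          -- join field to Table X
                                  YVAL DECIMAL(15,2) );  -- second measure (forces a multi-fact join)

     CREATE COLUMN TABLE PERF_Z ( ZKEY INTEGER,          -- join field to Table X
                                  ZARG NVARCHAR(10) );   -- ~10 distinct values, used in the WHERE clause tests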

3 Views:

1) AT_PERF_001             Attribute View (JOIN Engine) joining all 3 tables  [Heresy some people will say, but HANA didn't stop me :smile: ]

2) CA_PERF_GRA_001   Calculation View (Calculation Engine) joining all 3 tables

3) CA_PERF_GRA_002   Calculation View (SQL Engine) joining all 3 tables, with the view set to execute in the SQL Engine:

NOTE: An Analytic View can't be used because values/measures are on more than 1 table. There are quite a few other ways I could have done these joins, but I thought this was a good start. If you think you can build a faster model then look at my example below, try it out and provide feedback.
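In plain SQL, the aggregation all three views are modelling is roughly the following (a sketch with the same placeholder names as above; the real join conditions use more than one field, as per the set-up in the linked document):

     SELECT SUM(x.XVAL) AS XVAL_TOTAL,
            SUM(y.YVAL) AS YVAL_TOTAL
       FROM PERF_X AS x
       LEFT OUTER JOIN PERF_Y AS y ON x.XKEY = y.YKEY   -- measures sit on both X and Y
       LEFT OUTER JOIN PERF_Z AS z ON x.ZKEY = z.ZKEY;  -- Z only contributes selection fields (e.g. ZARG)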

3 Procedures:

1) PERF_GEN_CHUNK   randomly generates test data for the 3 tables  (e.g. 100k rows in X, 100k in Y, 10k in Z)

2) PERF_GET_REPORT_STATS executes the 3 queries and captures their runtime results in table PERF_RESULTS

     The executed query aggregates SUM(XVAL) & SUM(YVAL) and is run 3 times with different WHERE clauses (a sketch of the timing logic follows this list):

          a) NO Where Clause

          b) WHERE ZARG = 'Z01' on Table Z  (1 of ~10 possible values) [Restriction is on the last Join Table]

          c) WHERE XARG = 'X050' on Table X  (1 of ~100 possible values) [Restriction is on the first Join Table]

3) PERF_SCALE_TEST takes input parameters to choose how much data to generate, then loops, executing 1) & 2)
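The real procedures are in the linked SQL document; purely as a hypothetical sketch of the timing logic in 2), something along these lines records one run-time row (the package path 'perf', the PERF_RESULTS column layout and the variable names are all assumptions):

     CREATE PROCEDURE PERF_GET_REPORT_STATS_SKETCH( IN iv_iteration INTEGER )
     LANGUAGE SQLSCRIPT AS
     BEGIN
       DECLARE lv_start     TIMESTAMP;
       DECLARE lv_runtime_s DECIMAL(10,3);
       DECLARE lv_xval      DECIMAL(17,2);
       DECLARE lv_yval      DECIMAL(17,2);

       -- query a) no WHERE clause, against the Attribute View (Join Engine)
       lv_start := CURRENT_TIMESTAMP;
       SELECT SUM(XVAL), SUM(YVAL) INTO lv_xval, lv_yval
         FROM "_SYS_BIC"."perf/AT_PERF_001";                       -- package name 'perf' is an assumption
       SELECT NANO100_BETWEEN(:lv_start, CURRENT_TIMESTAMP) / 10000000
         INTO lv_runtime_s FROM DUMMY;                             -- elapsed time in seconds
       INSERT INTO PERF_RESULTS                                    -- illustrative column layout
         VALUES ( :iv_iteration, 'AT_PERF_001', 'NO WHERE', :lv_runtime_s );

       -- ... repeated for CA_PERF_GRA_001 / CA_PERF_GRA_002 and for the
       --     WHERE ZARG = 'Z01' and WHERE XARG = 'X050' variants
     END;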

Run the Test:

I then ran PERF_SCALE_TEST 99 times to generate 99 million rows in Table X and Table Y, and roughly 10 million in Table Z.

I ran as follows:

     call PERF_SCALE_TEST(10,10,'YES',1);     

     -- Parameters explained (10: loop 10 times, 10: run reports on the 10th iteration, 'YES': delete tables, 1: iteration number)  -- Loads first Million

     call PERF_SCALE_TEST(10,10,'NO',11);    -- Loads second Million

     call PERF_SCALE_TEST(10,10,'NO',21);    -- Loads third Million

     .....

     call PERF_SCALE_TEST(9,9,'NO',991);    -- Loads 99th Million

NOTE: HANA Studio auto-commit was switched off, and a merge delta was executed between each CALL statement.  Commit & Merge Delta inside the procedure seemed to have NO effect and blew out run times.
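Between the CALL statements the commit and merge were issued manually, i.e. something like the following (table names are the placeholders used above):

     COMMIT;
     MERGE DELTA OF PERF_X;
     MERGE DELTA OF PERF_Y;
     MERGE DELTA OF PERF_Z;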

Query Test Results [HANA REV 67 - Tables all on same node / NO PARTITIONING]:

The following graphs show how the run-times of the different queries change as data grows in the tables.



Graph 1: NO Where Restriction (Run-time in Seconds)

Graph 2: Where ZARG = 'Z01' (Run-time in Seconds)

Graph 3: Where XARG = 'X050' (Run-time in Seconds)

NOTE: Loading the 99 million records with the procedure took approx. 7 hours.  During the load of the first 55 million records the Node used was actively being used by others.  I've re-run the queries during a quiet period and you see a much more uniform line up to 55 million rows, but I thought including the graphs with heavy usage during the 1st half of the load was more dramatic and shows how system load can impact run-time as well.

If you want to see the graphs with minimal other processes running then see

https://docs.google.com/document/d/1SjMxxPsx-v6RAHMj_M_X9c0e6OqdshIeCsHC-_8WIus/edit?usp=sharing

All the SQL Code for creating this data and executing the queries is supplied here:

https://docs.google.com/document/d/1Ep89BtWhdgUagTLiPkAFd66wgKVtwcy5_Pg8xe_i31I/edit?usp=sharing

The detailed test results for the above graphs are available here:

https://docs.google.com/spreadsheet/ccc?key=0Ahydpie8Km_fdEpBOERmZXpDNGtTZFRfSmtNZXhOTnc&usp=sharing

As an additional observation, the memory usage (MB) of the tables was captured at the end of the load:

          Memory usage (MB) with 100 million rows in Table X & Y and 10 million rows in Table Z

                  Table X (100m rows)          Table Y (100m rows)          Table Z (10m rows)

These size stats are taken from M_CS_ALL_COLUMNS, grouped as follows (a sketch of the query is shown after the list):

CORE FIELDS   (internal_attribute_type is null)

PRIMARY KEY   (internal_attribute_type in ('TREX_EXTERNAL_KEY','TREX_UDIV','ROWID'))

SECONDARY KEY (internal_attribute_type = 'ROWID')
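A query along these lines pulls the underlying numbers (a sketch only; placeholder table names again, and the bucketing above is applied on top of the raw internal_attribute_type values):

     SELECT table_name,
            IFNULL(internal_attribute_type, 'CORE FIELD')        AS field_group,
            ROUND( SUM(memory_size_in_total) / 1024 / 1024, 1 )  AS memory_mb
       FROM M_CS_ALL_COLUMNS
      WHERE schema_name = CURRENT_SCHEMA
        AND table_name IN ('PERF_X', 'PERF_Y', 'PERF_Z')         -- placeholder table names
      GROUP BY table_name, internal_attribute_type
      ORDER BY table_name, memory_mb DESC;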

It's interesting that the Primary Key (plus a few other HANA system fields) and the Secondary Key (created by HANA automatically to support the join between NON-KEY fields in Table X and key fields in Table Y) take up more than half of the memory used by Table X & Y.


For a more detailed explanation of M_CS_ALL_COLUMNS and how joins on non-key fields impact memory, see Lars Breddemann's excellent blog "Playing with SAP HANA": http://scn.sap.com/docs/DOC-46329

Summary:

So if you are lucky and the tables you need to join just happen to look exactly like Tables X, Y & Z, and your data set is similar, then you now know which way to model your joins: Attribute Views won.  [But wait, isn't that just meant for master data joins?  It felt so wrong to use, but it worked so well.]

In the more likely event that the real-world tables you need to join are different from my examples, just consider that there is always more than one way to do the joins, and your choice may need to change as data volumes increase.
