Many SAP HANA customers use SAP HANA smart data integration to simplify their data integration landscape and run real-time applications, as announced in a previously published blog. Starting with SAP HANA Rev 122.04, SAP HANA smart data integration introduced task partitioning in the Flowgraph Editor. Task partitioning helps customers load large initial data sets from the various supported sources into SAP HANA faster while staying within the available memory. SAP HANA Rev 122.04 introduced full partitioning support in the Flowgraph Editor, SAP HANA Rev 122.05 added single-level (single-column) partitioning in the replication task, and SAP HANA Rev 122.06 adds multi-level (multi-column) partitioning in replication tasks.

The goals of these enhancements are:

  1. Optimize initial loading of large volumes of data from various supported sources into SAP HANA, in terms of loading time, memory utilization in SAP HANA, and resource utilization at the data sources

  2. Support a partitioned SAP HANA column table with more than 2 billion rows as input (with fewer than 2 billion rows in each partition), as illustrated in the sketch after this list
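Goal 2 relies on SAP HANA's limit of 2 billion rows per column-table partition. The following is a minimal sketch, using the Python hdbcli driver, of how a column table holding more than 2 billion rows could be range-partitioned so that each partition stays under that limit; the host, credentials, schema, table, and column names are invented for illustration and are not part of the test scenarios described in this post.

```python
# Minimal sketch: create a range-partitioned HANA column table so that no single
# partition exceeds the 2-billion-row limit. Connection details, schema, table,
# and column names are illustrative placeholders.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-host", port=30015, user="LOADER", password="***")
cursor = conn.cursor()

cursor.execute("""
    CREATE COLUMN TABLE TARGET_SCHEMA.ORDERS_LARGE (
        ORDER_ID   BIGINT       NOT NULL,
        CUSTOMER   NVARCHAR(40),
        AMOUNT     DECIMAL(15,2),
        CREATED_AT TIMESTAMP
    )
    PARTITION BY RANGE (ORDER_ID) (
        PARTITION          1 <= VALUES < 1000000000,
        PARTITION 1000000000 <= VALUES < 2000000000,
        PARTITION 2000000000 <= VALUES < 3000000000,
        PARTITION OTHERS
    )
""")
cursor.close()
conn.close()
```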


Using this newly introduced feature, internal testing teams and early customer adopters have seen 2-10 times better performance when completing a large initial load with task partitioning and running the partitioned tasks in parallel. If initial loading time is not critical, customers can partition the data and run the partitions sequentially instead. This reduces memory consumption on the target SAP HANA system and avoids out-of-memory errors when the available memory is not sufficient.
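Conceptually, each task partition loads one slice of the source, and the same set of slices can be executed one after another (lower peak memory on the target) or concurrently (higher throughput). The sketch below imitates that trade-off in plain Python with the hdbcli driver by issuing one INSERT ... SELECT per range slice of a virtual table; the table names, key column, and range boundaries are illustrative assumptions, and in practice the flowgraph or replication task runtime generates and runs the partitioned loads for you.

```python
# Sketch only: per-partition loads run sequentially or in parallel.
# SRC_VIRTUAL (an SDI virtual table), ORDERS_LARGE, and the ID ranges are
# hypothetical; the SDI task framework normally generates these slices itself.
from concurrent.futures import ThreadPoolExecutor
from hdbcli import dbapi

RANGES = [(1, 1_000_000_000), (1_000_000_000, 2_000_000_000),
          (2_000_000_000, 3_000_000_000), (3_000_000_000, 4_000_000_000)]

def load_partition(low, high):
    # One connection per slice so the statements can truly run concurrently.
    conn = dbapi.connect(address="hana-host", port=30015, user="LOADER", password="***")
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO TARGET_SCHEMA.ORDERS_LARGE "
            "SELECT * FROM SRC_SCHEMA.SRC_VIRTUAL WHERE ORDER_ID >= ? AND ORDER_ID < ?",
            (low, high))
        conn.commit()
    finally:
        conn.close()

def run_sequentially():
    # Lowest peak memory on the target: only one slice is in flight at a time.
    for low, high in RANGES:
        load_partition(low, high)

def run_in_parallel(workers=4):
    # Highest throughput: all slices load concurrently, at the cost of memory.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda r: load_partition(*r), RANGES))
```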

To illustrate this task partitioning feature, let’s use two sample internal test scenarios. In the first scenario, we use a narrow table (few columns) and in the second scenario a wide table (many columns).

Details for Scenario 1

  • Number of rows = 3.5 billion

  • Number of columns = 14

  • Data size at source = 500 GB

  • Number of partitions = 12


For this scenario, we partitioned the source data by range, with the value ranges split equally across all twelve partitions, and executed the tasks both sequentially and in parallel. A sketch of how such range boundaries can be derived follows, and the results are presented in the table below.
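Deriving equal range boundaries is simple arithmetic; the helper below is a small sketch of how twelve contiguous slices could be computed from a key's minimum and maximum values. The surrogate key and its bounds are assumptions for illustration; in the actual test the boundaries were entered in the partition definition of the editor.

```python
# Sketch: derive N contiguous, roughly equal range boundaries for a numeric key.
# The key range 1..3_500_000_000 is an assumed surrogate key, not measured data.
def range_boundaries(min_key, max_key, partitions):
    width = (max_key - min_key + 1) // partitions
    bounds = []
    low = min_key
    for i in range(partitions):
        # The last slice absorbs any remainder so the full key range is covered.
        high = max_key + 1 if i == partitions - 1 else low + width
        bounds.append((low, high))   # half-open interval [low, high)
        low = high
    return bounds

# Twelve slices over an assumed 3.5-billion-value key, as in Scenario 1.
for low, high in range_boundaries(1, 3_500_000_000, 12):
    print(f"{low:>13,} <= key < {high:>13,}")
```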

| Mode | Throughput (GB/hr) | Peak Memory at Target HANA (GB) |
| --- | --- | --- |
| Source partitioned and executed sequentially | 38 | 183 |
| Source partitioned and executed in parallel | 136 | 650 |

Without source partitioning, this scenario would have failed with an out-of-memory error on the test HANA server.

Details for Scenario 2

  • Number of rows = 66 million

  • Number of columns = 227

  • Data size at source = 500 GB

  • Number of partitions = 8


For this case, the source data is partitioned in the same way as in Scenario 1. The corresponding loading throughput and peak memory consumption are summarized here.

| Mode | Throughput (GB/hr) | Peak Memory at Target HANA (GB) |
| --- | --- | --- |
| No source partition | 76 | 385 |
| Source partitioned and executed sequentially | 77 | 51 |
| Source partitioned and executed in parallel | 476 | 383 |

These two sample results show how the task partitioning feature improves performance when loading large source data sets. The first scenario shows a throughput improvement from 38 GB/hr to 136 GB/hr, and the second scenario shows throughput increasing from 77 GB/hr to 476 GB/hr. Task partitioning allows SAP HANA to read, process, and commit the partitioned virtual table inputs in parallel. Notice in the second scenario that running the partitioned replication task sequentially keeps peak memory at the target down to only 51 GB for the same amount of data. The second scenario also shows that partitioning and executing in parallel delivers much higher throughput than loading without task partitioning (476 GB/hr vs 76 GB/hr) while consuming roughly the same memory on the HANA side (383 GB vs 385 GB).

You can define task partitions in the Partitions tab within the Replication Editor. Two partition types are available: range partitions and list partitions.
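The difference between the two types comes down to the filter each partition applies to the source: a range partition covers a contiguous interval of a column's values, while a list partition covers an explicit set of values. The snippet below is only a conceptual illustration of such predicates (the column names and values are made up); in the Replication Editor you define the partitions declaratively rather than writing SQL.

```python
# Conceptual illustration of the two partition types; column names and values
# are invented, and the Replication Editor generates the real filters itself.

# Range partitions: each slice covers a contiguous interval of a column.
range_partitions = [
    "ORDER_ID >=          1 AND ORDER_ID < 1000000000",
    "ORDER_ID >= 1000000000 AND ORDER_ID < 2000000000",
]

# List partitions: each slice covers an explicit set of values.
list_partitions = [
    "REGION IN ('EMEA', 'APJ')",
    "REGION IN ('AMER')",
]

for predicate in range_partitions + list_partitions:
    print("SELECT ... FROM <source> WHERE", predicate)
```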

This feature is described in sections 6.1.3 and 6.1.4 of Best Practices for SAP HANA Smart Data Integration and SAP HANA Smart Data Quality.

With this enhancement, we believe all our customers will benefit from optimized HANA memory utilization when loading large initial data sets, and will be able to address HANA partitioned table scenarios with more than 2 billion records.