Many SAP HANA customers use HANA smart data integration to simplify their data integration landscape and run real-time applications, as announced in a previously published blog. Starting with SAP HANA Rev 122.04, HANA smart data integration introduced task partitioning in the Flowgraph Editor. Task partitioning helps customers load large initial data sets from various supported sources into SAP HANA faster while making better use of available memory. While SAP HANA Rev 122.04 introduced full partitioning support in the Flowgraph Editor, SAP HANA Rev 122.05 introduced single-level (single-column) partitioning in replication tasks, and SAP HANA Rev 122.06 adds multi-level (multi-column) partitioning in replication tasks.

The goals of these enhancements are:

  1. Optimize the initial load of large volumes of data from various supported sources into SAP HANA in terms of loading time, memory utilization in SAP HANA, and resource utilization at the data sources
  2. Support a partitioned HANA column table as input with more than 2 billion rows in total (but fewer than 2 billion rows in each partition)

Using this newly introduced feature, internal testing teams and early customer adopters have seen large initial loads complete 2 to 10 times faster when the partitioned tasks run in parallel. If initial loading time is not critical, customers can partition the data and run the partitions sequentially. This reduces memory consumption at the target SAP HANA and avoids out-of-memory errors when the available memory is insufficient.
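The trade-off between the two execution modes can be sketched in a few lines of Python. This is only an illustration of the idea, not how smart data integration schedules tasks internally: `load_partition` is a hypothetical stand-in for loading one partition, and the row counts it returns are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def load_partition(partition_id):
    """Hypothetical stand-in for loading one partition; returns rows loaded."""
    return 1000 * (partition_id + 1)


def run_sequentially(partition_ids):
    # One partition at a time: only one partition's data is in flight,
    # which keeps peak memory low at the cost of a longer total load.
    return [load_partition(p) for p in partition_ids]


def run_in_parallel(partition_ids, max_workers):
    # Several partitions at once: shorter total load, but each running
    # partition holds its own working set, so peak memory is higher.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_partition, partition_ids))


partition_ids = list(range(4))
sequential_rows = run_sequentially(partition_ids)
parallel_rows = run_in_parallel(partition_ids, max_workers=4)
print(sum(sequential_rows), sum(parallel_rows))
```

Both modes load the same rows; they differ only in how many partitions are being processed at any moment.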

To illustrate this task partitioning feature, let’s use two sample internal test scenarios. In the first scenario, we use a narrow table (few columns) and in the second scenario a wide table (many columns).

Details for Scenario 1

  • Number of rows = 3.5 Billion
  • Number of columns = 14
  • Data Size at source = 500GB
  • Number of partitions = 12

For this scenario, we partitioned the source data based on range values equally across all partitions and executed the tasks both sequentially and in parallel. The results are presented here.

Mode | Throughput (GB/hr) | Peak Memory at Target HANA (GB)
Source partitioned and executed sequentially | 38 | 183
Source partitioned and executed in parallel | 136 | 650

 

Without source partitioning, this scenario would have failed with an out-of-memory error on the test HANA server.
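As a rough sketch of what an equal range split looks like for Scenario 1, the following Python snippet divides a key range into 12 contiguous ranges and checks that each one stays under the 2 billion row limit per partition. It assumes a dense integer key running from 1 to 3.5 billion, which is a simplification for illustration; the actual partition column and boundaries used in the test are not given here.

```python
ROW_LIMIT_PER_PARTITION = 2_000_000_000


def equal_ranges(min_key, max_key, num_partitions):
    """Split the inclusive range [min_key, max_key] into contiguous,
    near-equal inclusive ranges (sizes differ by at most one key)."""
    total = max_key - min_key + 1
    base, extra = divmod(total, num_partitions)
    ranges = []
    start = min_key
    for i in range(num_partitions):
        # The first `extra` partitions absorb the remainder, one key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Assumed dense key: one key value per row, 3.5 billion rows in total.
ranges = equal_ranges(1, 3_500_000_000, 12)
sizes = [end - start + 1 for start, end in ranges]

for index, (start, end) in enumerate(ranges, start=1):
    print(f"partition {index}: {start} <= key <= {end}")
print("largest partition:", max(sizes), "rows")
```

With 12 partitions, each range holds roughly 292 million rows, well below the per-partition limit.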

Details for Scenario 2

  • Number of rows = 66M
  • Number of columns = 227
  • Data Size at source = 500GB
  • Number of partitions = 8

For this case, the corresponding loading throughput and peak memory consumption are summarized here. The source data is partitioned in the same way as in Scenario 1.

Mode | Throughput (GB/hr) | Peak Memory at Target HANA (GB)
No source partition | 76 | 385
Source partitioned and executed sequentially | 77 | 51
Source partitioned and executed in parallel | 476 | 383

 

These two sample results show how task partitioning improves performance when loading large source data. In the first scenario, throughput rose from 38 GB/hr (sequential) to 136 GB/hr (parallel); in the second, it rose from 77 GB/hr to 476 GB/hr. Task partitioning allows SAP HANA to read, process, and commit the partitioned virtual table input sources in parallel. Notice in the second scenario that running the partitioned replication task sequentially on the same amount of data uses only 51 GB of memory in the target. The second scenario also shows that partitioning and executing in parallel delivers much higher throughput than loading without task partitioning (476 GB/hr vs. 76 GB/hr) while consuming about the same memory on the HANA side (383 GB vs. 385 GB).
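The ratios behind these numbers are easy to recompute. The Python snippet below uses only the throughput figures from the two tables; the estimated load times additionally assume that throughput stays constant over the whole 500 GB load, which the tests did not state.

```python
DATA_SIZE_GB = 500  # source data size in both scenarios


def speedup(baseline_gb_per_hr, improved_gb_per_hr):
    """Throughput ratio of the improved mode over the baseline mode."""
    return improved_gb_per_hr / baseline_gb_per_hr


def load_hours(throughput_gb_per_hr):
    """Estimated load duration, assuming constant throughput."""
    return DATA_SIZE_GB / throughput_gb_per_hr


# Scenario 1: sequential vs. parallel
s1_speedup = speedup(38, 136)
# Scenario 2: sequential vs. parallel, and unpartitioned vs. parallel
s2_speedup_vs_sequential = speedup(77, 476)
s2_speedup_vs_unpartitioned = speedup(76, 476)

print(f"Scenario 1 parallel speedup: {s1_speedup:.2f}x")
print(f"Scenario 2 parallel speedup vs sequential: {s2_speedup_vs_sequential:.2f}x")
print(f"Scenario 2 parallel speedup vs unpartitioned: {s2_speedup_vs_unpartitioned:.2f}x")
print(f"Scenario 1 estimated hours: {load_hours(38):.1f} sequential, {load_hours(136):.1f} parallel")
```

Both speedups, roughly 3.6x and 6.2x, fall inside the 2 to 10 times range reported by internal testing teams and early adopters.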

You can define task partitions in the Partitions tab within the Replication Editor. Two partition types are available: range partitions and list partitions.
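To make the difference between the two partition types concrete, here is a minimal Python sketch of how a row's partition column value could be mapped to a partition. The partition names, columns, and boundary semantics (lower bound inclusive, upper bound exclusive) are assumptions chosen for illustration and are not taken from the Replication Editor.

```python
def assign_range_partition(value, range_partitions):
    """Return the name of the range partition whose bounds contain value.

    range_partitions: list of (name, low, high) tuples, with low inclusive
    and high exclusive (an assumed convention for this sketch).
    """
    for name, low, high in range_partitions:
        if low <= value < high:
            return name
    return None  # value falls outside every defined range


def assign_list_partition(value, list_partitions):
    """Return the name of the list partition that enumerates value.

    list_partitions: dict mapping partition name to a set of values.
    """
    for name, members in list_partitions.items():
        if value in members:
            return name
    return None  # value is not listed in any partition


# Hypothetical range partitions on a numeric order ID
order_id_ranges = [("P1", 0, 1000), ("P2", 1000, 2000), ("P3", 2000, 3000)]
# Hypothetical list partitions on a region code
region_lists = {"EMEA": {"DE", "FR", "UK"}, "AMER": {"US", "CA", "BR"}}

print(assign_range_partition(1500, order_id_ranges))
print(assign_list_partition("FR", region_lists))
```

Range partitions suit ordered columns such as IDs or dates, while list partitions suit columns with a small set of discrete values.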

This feature is described in sections 6.1.3 and 6.1.4 of Best Practices for SAP HANA Smart Data Integration and SAP HANA Smart Data Quality.

With this enhancement, we believe all our customers will benefit from optimized HANA memory utilization when loading large initial data sets, and will be able to address partitioned HANA table scenarios with more than 2 billion records.
