Ankit Soni

Integrate Large Data Sets from Flat File to SAP HANA Cloud using SAP IS JDBC Adapter

Introduction:

This blog post details a scenario in which data from a demand and supply planning system needs to be loaded from flat files into SAP HANA Cloud DB to build BI reports. The integration process should achieve the following high-level requirements:

  1. A tab-delimited file with close to 2 million rows will be received via SFTP from the source system.
  2. The staging table in SAP HANA Cloud DB should be truncated before loading the data.
  3. The data should be loaded from the tab-delimited flat file into the HANA table.
  4. After the data is loaded, a stored procedure call should be initiated to take a snapshot of the data into another table that will be used for reporting.

While designing this integration process, there are a few things to consider. First, because the incoming file contains a high number of rows, it is essential to split them into batches and process the batches in parallel to prevent the DB insert through the JDBC receiver from timing out or taking too long to respond. Second, the procedure call must be serialized and triggered only after all the data has been loaded into the table. Finally, if the process takes longer than 15 minutes, the default lock timeout on the sender SFTP channel must be increased to avoid a duplicate run.

Pre-requisites:

  1. Create a JDBC user in SAP HANA Cloud through SAP HANA Cockpit -> User & Role Management -> User Management
  2. Grant the required roles for the schema under which the tables exist, and EXECUTE object privileges for the stored procedure, through SAP HANA Cockpit -> User & Role Management -> Privilege Management (a sample SQL sketch follows this list).
  3. Create a JDBC data source using the JDBC user created above in SAP Integration Suite -> Manage Security -> JDBC Material. JDBC URL: jdbc:sap://<SAP HANA Cloud host>:443/?encrypt=true. Refer to JDBC for SAP HANA Cloud | SAP Help Portal (do not pass the databaseName property; the connection to SAP HANA Cloud fails if it is specified). You can refer to JDBC Connection Properties | SAP Help Portal for the full set of JDBC connection properties.
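
For reference, the user and privilege setup from steps 1 and 2 can also be scripted in SQL. The sketch below uses hypothetical names (CPI_JDBC_USER, schema DEMAND_PLANNING, procedure TAKE_FORECAST_SNAPSHOT); adjust them to your own landscape and role concept.

    -- Minimal sketch of the setup behind steps 1 and 2 (hypothetical names)
    CREATE USER CPI_JDBC_USER PASSWORD "<initial-password>" NO FORCE_FIRST_PASSWORD_CHANGE;
    -- INSERT for the data load, DELETE so the staging table can be truncated
    GRANT SELECT, INSERT, DELETE ON SCHEMA DEMAND_PLANNING TO CPI_JDBC_USER;
    -- EXECUTE so the snapshot procedure can be called after the load
    GRANT EXECUTE ON DEMAND_PLANNING.TAKE_FORECAST_SNAPSHOT TO CPI_JDBC_USER;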

Integration Process Flow:

Integration Process

  1. SFTP sender channel polls for the file and picks it up for processing when available.
  2. Content Modifier: backs up the incoming file name to be retrieved later in case of any exception.
  3. Parallel Multicast: used to archive the incoming file on the internal server in one branch and to load the data to HANA in another branch.
  4. Content Modifier: generates the SQL for truncating the staging table (see the sketch after this list) and backs up the incoming dataset in a property to be retrieved later.
  5. Request-Reply: calls SAP HANA Cloud using JDBC receiver adapter for executing the SQL query constructed in the Content Modifier.
  6. Content Modifier: sets a dummy payload to create two serialized processes once a success message is received from the JDBC call.
  7. Iterating Splitter: configured with Line Break and parallel processing turned OFF to create two serialized sub-processes.
  8. Content Modifier: stores the body of the split dummy payload (DBInsert or ProcedureCall) in a property. This is needed because the Router does not allow the ${in.body} expression in a routing condition, and a condition is required to direct the two serialized processes to different branches.
  9. Router: branches between the DBInsert and ProcedureCall sub-processes based on the property set above.
  10. Content Modifier: restores the original incoming file data (tab-delimited) from the property that was backed up in an earlier step.
  11. General Splitter: splits the rows of the tab-delimited file into groups of 10k records, which can be processed in parallel using the maximum number of concurrent processes supported (50).
  12. CSV to XML Converter: converts the data in each split process into XML, which is then mapped to the insert structure for the database.
  13. Message Mapping: maps the incoming data into the target structure required on the receiver side for the data insert (see the batch payload sketch after this list). Refer to Batch Payload and Operation | SAP Help Portal.
  14. JDBC receiver: configured in batch mode to insert multiple records in a single call. Once all the records from all the concurrent split processes have been inserted into the table, the second branch for the procedure call starts.
  15. Content Modifier in branch 2 (Procedure Call): sets the body to the stored procedure call that must be executed after the table load is complete (see the sketch after this list).
  16. Request-Reply: JDBC receiver adapter sends the procedure call to SAP HANA Cloud.
  17. Send (branch 2 of the Parallel Multicast): archives the incoming file on an SFTP server using an SFTP receiver adapter.
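
For illustration, the Content Modifier in step 4 can set the message body to a plain SQL statement, which the Request-Reply in step 5 then executes through the JDBC receiver. The schema and table names here are placeholders, not the ones from the original scenario:

    -- Hypothetical staging table; emptied before every new load
    TRUNCATE TABLE "DEMAND_PLANNING"."FORECAST_STAGING";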
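
The target structure produced by the message mapping in step 13 follows the JDBC adapter's XML format described in Batch Payload and Operation | SAP Help Portal; with batch mode enabled on the receiver, repeating the access element lets a single call insert many rows. A minimal sketch with placeholder table and column names (StatementName is an arbitrary wrapper element):

    <root>
      <StatementName>
        <dbTableName action="INSERT">
          <table>FORECAST_STAGING</table>
          <!-- one access element per row; each splitter group from step 11 yields many of these -->
          <access>
            <MATERIAL>100001</MATERIAL>
            <PLANT>DE01</PLANT>
            <FORECAST_QTY>250</FORECAST_QTY>
          </access>
          <access>
            <MATERIAL>100002</MATERIAL>
            <PLANT>DE01</PLANT>
            <FORECAST_QTY>410</FORECAST_QTY>
          </access>
        </dbTableName>
      </StatementName>
    </root>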
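
Likewise, the Content Modifier in step 15 can set the body to the procedure call that the Request-Reply in step 16 sends over JDBC; the procedure name below is a placeholder for the actual snapshot procedure:

    -- Hypothetical procedure that snapshots the staging data into the reporting table
    CALL "DEMAND_PLANNING"."TAKE_FORECAST_SNAPSHOT"();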

Conclusion:

I opted to use a Splitter and Router combination instead of sequential multicast to ensure that the split process call is completed before the next call, instead of just ensuring that the step is executed first without waiting for it to complete.

In my testing on our SAP IS tenant, the integration flow takes about 5 minutes to finish for an incoming file with approximately 1.7 million rows, which I consider to be decent performance.

Hopefully, this blog will help SAP IS developers & architects in designing integration solutions for loading data to SAP HANA Cloud, even for larger data sets.

For further learning on using the JDBC adapter with MS SQL server, refer to the blog post at https://blogs.sap.com/2021/09/30/sap-cloud-integration-cpi-jdbc-adapter-step-by-step-guide-pre-requisites-and-configuration-jdbc-batch-insert-cpi/.

Additionally, another blog post on SAP HANA as a service is available for reference, although please note that the database name is no longer required in the JDBC connection property for SAP HANA Cloud, and the connection will fail if it is specified. This post can be found at https://blogs.sap.com/2021/03/02/hana-as-a-service-integration-with-cpi/

Furthermore, there are alternative approaches to loading data from files into SAP HANA Cloud, including the SAP SDI File adapter. Refer to SDI Configuration: Connect Data Provision Agent (SDI) to Cloud Foundry environment and enable the Adapters (File Adapter, ABAP Adapter and HANA Adapter) | SAP Blogs.


      11 Comments
      Gautam Purohit

      that's an excellent write-up Ankit Soni

      will definitely be useful for many integration colleagues working with high volume JDBC scenarios.


      Regards,

      Gautam

      Ankit Soni
      Blog Post Author

      Hi Gautam,

      Thank you for taking the time to share your positive feedback. I'm glad to know that you found the write-up to be useful for colleagues working with high-volume JDBC scenarios.

      Thanks,

      Ankit

      Saurabh Kabra

      Hi Ankit,


      Nice blog with a very precise explanation about each step in iflow.

      I have a question though... Why did you take a snapshot of the table inserted in the last step (branch 2 --> procedure call) into another table? Why can't you report based on the inserted records in the source table itself in HANA Cloud? Any limitation on HANA Cloud, or was it a business requirement not to expose the source table for reporting?


      Thanks

      Saurabh

      Ankit Soni
      Blog Post Author

      Hi Saurabh,

      Thanks for the feedback. I don't think there is any restriction to using the same table for reporting. However, as this is more of demand planning/forecast data, the procedure also provides flexibility to add additional aggregate measures required to report how the forecast changes every week.

      Additionally, in our case, there are existing Power BI reports, and this approach gave us the flexibility to switch from Excel-based forecast models to the better forecast data from the new planning tool without impacting the current reports.

      Thanks,

      Ankit

      Nicolas Binet

      Hi Ankit

      Thank you for this interesting article.

      One thing I don't fully get:

      What is the advantage of the splitter vs. a sequential multicast in this case? You write:

      "I opted to use a Splitter and Router combination instead of sequential multicast to ensure that the split process call is completed before the next call, instead of just ensuring that the step is executed first without waiting for it to complete."

      But using a sequential multicast, it would also wait for the first branch to complete.

      Am I missing something?


      Thank you

      Nicolas

      Ankit Soni
      Blog Post Author

      Hi Nicolas,

      Thanks for taking the time to read and provide feedback. In my experience, I have noticed that sequential multicast doesn't wait for a branch to finish completely if the branch takes a long time.

      I am not entirely sure about the reason behind it, but my reckoning is that the Apache Camel Multicast component has a timeout option as well. You can read about it here: Multicast :: Apache Camel

      Sets a total timeout specified in millis. If the Multicast hasn't been able to send and process all replies within the given timeframe, then the timeout triggers and the Multicast breaks out and continues. Notice if you provide a TimeoutAwareAggregationStrategy then the timeout method is invoked before breaking out.

      Currently, the sequential multicast in CPI doesn't offer any option to control the timeout settings. Also, I couldn't find a reference to the default timeout for sequential multicast in the CPI documentation. Hence, I opted for the above approach to avoid issues if the data load branch takes longer as the record count increases further.

      Thanks,

      Ankit

      Mustafa Bensan

      Hi Ankit,

      That's a very helpful and comprehensive description of a real-world use case.  A couple of questions:

      1)  In the context of managing BTP consumption costs for such large dataset uploads, with the above iFlow setup does one record in the file correspond to one message in SAP Integration Suite?

      2)  What was the reason for choosing SAP HANA Cloud as your BI data warehouse instead of SAP Datasphere (formerly Data Warehouse Cloud)?

      Thanks,

      Mustafa.

      Ankit Soni
      Blog Post Author

      Hi Mustafa Bensan,

      Thanks for taking the time to read and provide feedback. Please see my response to your questions.

      1. The SAP IS message metric definition isn't straightforward; this SAP note covers the different cases - 2942344. As we are using a splitter, each split outbound message is counted as a separate message. So a million rows in a single file with a splitter grouping of 50K corresponds to ~20 messages.
      2. We have an on-premise BW system and SAP HANA service was onboarded as an Analytics HANA sidecar quite a few years back when DWC wasn't available. We migrated last year to SAP HANA Cloud as SAP plans to phase out SAP HANA service. Our future roadmap is undecided with respect to the future of SAP HANA Cloud given SAP Datasphere is a parallel offering with overlapping capabilities. Does SAP have published guidance around use cases for SAP HANA Cloud vs SAP Datasphere and preferred reference architecture with SAP S/4?

      Thanks,

      Ankit

      Mustafa Bensan

      Hi Ankit Soni,

      Thanks for your responses.  Please see my comments below:

      1.  The SAP note reference is very helpful.  In your example of the splitter grouping of 50K, I assume that means groupings of 50K rows in the file?  In that case, when determining that that corresponds to ~20 messages, for metering purposes doesn't the aggregated message size also come into play?  Meaning as per the note, if the aggregated size of the messages is greater than 250KB then we could end up with significantly more than 20 messages as the count used for metering purposes.  Have I understood correctly?

      2.  Regarding choice between SAP HANA Cloud and SAP Datasphere for data warehousing, I think the main difference is SAP HANA Cloud would be a more technical SQL data warehouse approach, whereas Datasphere provides a more abstracted semantic layer approach geared towards business users.  I think the discussion in "SAP DATAWAREHOUSE CLOUD" VS "SAP HANA CLOUD" for MODELLING addresses your question well.  You can also have the best of both worlds with SAP BTP Showcase – Access the SAP HANA Cloud database underneath SAP Data Warehouse Cloud or Access HANA Cloud Database Service (BTP) as source in SAP DWC.

      Regards,

      Mustafa.

      Ankit Soni
      Blog Post Author

      Hi Mustafa Bensan,

      Thank you for sharing the discussions regarding the comparison between SAP HANA Cloud and DWC.

      Just to confirm, you are correct that if the message size exceeds 250KB, it may result in multiple message counts. Fortunately, SAP IS message blocks are not costly. However, as per the same note, message size measurement is not yet available, so for the time being each message is counted as one regardless of its size.

      Thanks,
      Ankit

      Mustafa Bensan

      Hi Ankit Soni,

      Thanks for the clarification.  That's a bonus for SAP customers that at least for now message size metering is not implemented 🙂

      It's been a very insightful discussion.

      Regards,

      Mustafa.