Ankit Soni

Integrate Large Data Sets from Flat File to SAP HANA Cloud using SAP IS JDBC Adapter

Introduction:

This blog post details a scenario in which data from a demand and supply planning system needs to be loaded from flat files into SAP HANA Cloud DB to build BI reports. The integration process should achieve the following high-level requirements:

  1. A tab-delimited file with close to 2 million rows will be received via SFTP from the source system.
  2. The staging table in SAP HANA Cloud DB should be truncated before loading the data.
  3. The data should be loaded from the tab-delimited flat file into the HANA table.
  4. After the data is loaded, a stored procedure call should be initiated to take a snapshot of the data into another table that will be used for reporting.

While designing this integration process, there are a few things to consider. First, because the incoming file contains a high number of rows, it is essential to split them into batches and process the batches in parallel to prevent the DB insert through the JDBC receiver from timing out or taking too long to respond. Second, the procedure call must be serialized and triggered only after all the data has been loaded into the table. Finally, if the process takes longer than 15 minutes, the default lock timeout on the sender SFTP channel must be increased to avoid a duplicate run.

Pre-requisites:

  1. Create a JDBC user in SAP HANA Cloud through SAP HANA Cockpit -> User & Role Management -> User Management
  2. Grant the required roles for the schema under which the tables exist, and EXECUTE object privileges for the stored procedure, through SAP HANA Cockpit -> User & Role Management -> Privilege Management (a sample SQL sketch follows this list).
  3. Create a JDBC data source using the JDBC user created above in SAP Integration Suite -> Manage Security -> JDBC Material. JDBC URL: jdbc:sap://<SAP HANA Cloud host>:443/?encrypt=true. Refer to JDBC for SAP HANA Cloud | SAP Help Portal (do not pass the databaseName property; the connection to SAP HANA Cloud fails if it is specified). You can refer to JDBC Connection Properties | SAP Help Portal for the full set of JDBC connection properties.
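
For reference, the user and privilege setup from steps 1 and 2 can also be scripted in SQL. The sketch below uses hypothetical names (CPI_JDBC_USER, schema DEMAND_PLANNING, procedure TAKE_FORECAST_SNAPSHOT); adjust them to your own landscape and role concept.

    -- Minimal sketch of the setup behind steps 1 and 2 (hypothetical names)
    CREATE USER CPI_JDBC_USER PASSWORD "<initial-password>" NO FORCE_FIRST_PASSWORD_CHANGE;
    -- INSERT for the data load, DELETE so the staging table can be truncated
    GRANT SELECT, INSERT, DELETE ON SCHEMA DEMAND_PLANNING TO CPI_JDBC_USER;
    -- EXECUTE so the snapshot procedure can be called after the load
    GRANT EXECUTE ON DEMAND_PLANNING.TAKE_FORECAST_SNAPSHOT TO CPI_JDBC_USER;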

Integration Process Flow:

Integration Process

  1. SFTP sender channel polls for the file and picks it up for processing when available.
  2. Content Modifier: backs up the incoming file name to be retrieved later in case of any exception.
  3. Parallel Multicast: used to archive the incoming file on the internal server in one branch and to load the data to HANA in another branch.
  4. Content Modifier: generates the SQL for truncating the staging table (see the sketch after this list) and backs up the incoming dataset in a property to be retrieved later.
  5. Request-Reply: calls SAP HANA Cloud using JDBC receiver adapter for executing the SQL query constructed in the Content Modifier.
  6. Content Modifier: sets a dummy payload to create two serialized processes once a success message is received from the JDBC call.
  7. Iterating Splitter: configured with Line Break and parallel processing turned OFF to create two serialized sub-processes.
  8. Content Modifier: stores the body of the split dummy payload (DBInsert or ProcedureCall) in a property. This is needed because the Router does not allow the ${in.body} expression in a routing condition, and a condition is required to direct the two serialized processes to different branches.
  9. Router: branches between the DBInsert and ProcedureCall sub-processes based on the property set above.
  10. Content Modifier: restores the original incoming file data (tab-delimited) from the property that was backed up in an earlier step.
  11. General Splitter: splits the rows of the tab-delimited file into groups of 10k records, which can be processed in parallel using the maximum number of concurrent processes supported (50).
  12. CSV to XML Converter: converts the data in each split process into XML, which is then mapped to the insert structure for the database.
  13. Message Mapping: maps the incoming data into the target structure required on the receiver side for the data insert (see the batch payload sketch after this list). Refer to Batch Payload and Operation | SAP Help Portal.
  14. JDBC receiver: configured in batch mode to insert multiple records in a single call. Once all the records from all the concurrent split processes have been inserted into the table, the second branch for the procedure call starts.
  15. Content Modifier in branch 2 (Procedure Call): sets the body to the stored procedure call that must be executed after the table load is complete (see the sketch after this list).
  16. Request-Reply: JDBC receiver adapter sends the procedure call to SAP HANA Cloud.
  17. Send (branch 2 of the Parallel Multicast): archives the incoming file on an SFTP server using an SFTP receiver adapter.
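
For illustration, the Content Modifier in step 4 can set the message body to a plain SQL statement, which the Request-Reply in step 5 then executes through the JDBC receiver. The schema and table names here are placeholders, not the ones from the original scenario:

    -- Hypothetical staging table; emptied before every new load
    TRUNCATE TABLE "DEMAND_PLANNING"."FORECAST_STAGING";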
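
The target structure produced by the message mapping in step 13 follows the JDBC adapter's XML format described in Batch Payload and Operation | SAP Help Portal; with batch mode enabled on the receiver, repeating the access element lets a single call insert many rows. A minimal sketch with placeholder table and column names (StatementName is an arbitrary wrapper element):

    <root>
      <StatementName>
        <dbTableName action="INSERT">
          <table>FORECAST_STAGING</table>
          <!-- one access element per row; each splitter group from step 11 yields many of these -->
          <access>
            <MATERIAL>100001</MATERIAL>
            <PLANT>DE01</PLANT>
            <FORECAST_QTY>250</FORECAST_QTY>
          </access>
          <access>
            <MATERIAL>100002</MATERIAL>
            <PLANT>DE01</PLANT>
            <FORECAST_QTY>410</FORECAST_QTY>
          </access>
        </dbTableName>
      </StatementName>
    </root>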
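
Likewise, the Content Modifier in step 15 can set the body to the procedure call that the Request-Reply in step 16 sends over JDBC; the procedure name below is a placeholder for the actual snapshot procedure:

    -- Hypothetical procedure that snapshots the staging data into the reporting table
    CALL "DEMAND_PLANNING"."TAKE_FORECAST_SNAPSHOT"();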

Conclusion:

I opted to use a Splitter and Router combination instead of sequential multicast to ensure that the split process call is completed before the next call, instead of just ensuring that the step is executed first without waiting for it to complete.

In my testing on our SAP IS tenant, the integration flow takes about 5 minutes to finish for an incoming file with approximately 1.7 million rows, which I consider to be decent performance.

Hopefully, this blog will help SAP IS developers & architects in designing integration solutions for loading data to SAP HANA Cloud, even for larger data sets.

For further learning on using the JDBC adapter with MS SQL server, refer to the blog post at https://blogs.sap.com/2021/09/30/sap-cloud-integration-cpi-jdbc-adapter-step-by-step-guide-pre-requisites-and-configuration-jdbc-batch-insert-cpi/.

Additionally, another blog post on SAP HANA as a service is available for reference, although please note that the database name is no longer required in the JDBC connection property for SAP HANA Cloud, and the connection will fail if it is specified. This post can be found at https://blogs.sap.com/2021/03/02/hana-as-a-service-integration-with-cpi/

Furthermore, there are alternative approaches to loading data from files into SAP HANA Cloud, including the SAP SDI File adapter. Refer to SDI Configuration: Connect Data Provision Agent (SDI) to Cloud Foundry environment and enable the Adapters (File Adapter, ABAP Adapter and HANA Adapter) | SAP Blogs.


      11 Comments
      Gautam Purohit

      that's an excellent write-up Ankit Soni

      will definitely be useful for many integration colleagues working with high volume JDBC scenarios.


      Regards,

      Gautam

      Ankit Soni
      Blog Post Author

      Hi Gautam,

      Thank you for taking the time to share your positive feedback. I'm glad to know that you found the write-up to be useful for colleagues working with high-volume JDBC scenarios.

      Thanks,

      Ankit

      Saurabh Kabra

      Hi Ankit,


      Nice blog with a very precise explanation about each step in iflow.

      I have a question though... Why did you take a snapshot of the table inserted in the last step (branch 2 --> procedure call) into another table? Why can't you report based on the inserted records in the source table itself in HANA Cloud? Any limitation on HANA Cloud, or was it a business requirement not to expose the source table for reporting?


      Thanks

      Saurabh

      Ankit Soni
      Blog Post Author

      Hi Saurabh,

      Thanks for the feedback. I don't think there is any restriction to using the same table for reporting. However, as this is more of demand planning/forecast data, the procedure also provides flexibility to add additional aggregate measures required to report how the forecast changes every week.

      Additionally, in our case, there are existing Power BI reports, and this approach gave us the flexibility to switch from Excel-based forecast models to the better forecast data from the new planning tool without impacting the current reports.

      Thanks,

      Ankit

      Nicolas Binet

      Hi Ankit

      Thank you for this interesting article.

      One thing I don't fully get:

      What is the advantage of the splitter vs. a sequential multicast in this case? You write:

      "I opted to use a Splitter and Router combination instead of sequential multicast to ensure that the split process call is completed before the next call, instead of just ensuring that the step is executed first without waiting for it to complete."

      But using a sequential multicast, it would also wait for the first branch to complete.

      Am I missing something?


      Thank you

      Nicolas

      Ankit Soni
      Blog Post Author

      Hi Nicolas,

      Thanks for taking the time to read and provide feedback. In my experience, I have noticed that sequential multicast doesn't wait for a branch to finish completely if the branch takes a long time.

      I am not entirely sure about the reason behind it, but my reckoning is that the Apache Camel Multicast component has a timeout option as well. You can read about it here: Multicast :: Apache Camel

      Sets a total timeout specified in millis. If the Multicast hasn't been able to send and process all replies within the given timeframe, then the timeout triggers and the Multicast breaks out and continues. Notice if you provide a TimeoutAwareAggregationStrategy then the timeout method is invoked before breaking out.

      Currently, the sequential multicast in CPI doesn't offer any option to control the timeout settings. Also, I couldn't find a reference to the default timeout for sequential multicast in the CPI documentation. Hence, I opted for the above approach to avoid issues if the data load branch takes longer as the record count increases further.

      Thanks,

      Ankit

      Mustafa Bensan

      Hi Ankit,

      That's a very helpful and comprehensive description of a real-world use case.  A couple of questions:

      1)  In the context of managing BTP consumption costs for such large dataset uploads, with the above iFlow setup does one record in the file correspond to one message in SAP Integration Suite?

      2)  What was the reason for choosing SAP HANA Cloud as your BI data warehouse instead of SAP Datasphere (formerly Data Warehouse Cloud)?

      Thanks,

      Mustafa.

      Ankit Soni
      Blog Post Author

      Hi Mustafa Bensan,

      Thanks for taking the time to read and provide feedback. Please see my response to your questions.

      1. The SAP IS message metric definition isn't straightforward; this SAP note covers the different cases - 2942344. As we are using a splitter, each split outbound message is counted as a separate message. So a million rows in a single file with a splitter grouping of 50K corresponds to ~20 messages.
      2. We have an on-premise BW system and SAP HANA service was onboarded as an Analytics HANA sidecar quite a few years back when DWC wasn't available. We migrated last year to SAP HANA Cloud as SAP plans to phase out SAP HANA service. Our future roadmap is undecided with respect to the future of SAP HANA Cloud given SAP Datasphere is a parallel offering with overlapping capabilities. Does SAP have published guidance around use cases for SAP HANA Cloud vs SAP Datasphere and preferred reference architecture with SAP S/4?

      Thanks,

      Ankit

      Mustafa Bensan

      Hi Ankit Soni,

      Thanks for your responses.  Please see my comments below:

      1.  The SAP note reference is very helpful.  In your example of the splitter grouping of 50K, I assume that means groupings of 50K rows in the file?  In that case, when determining that that corresponds to ~20 messages, for metering purposes doesn't the aggregated message size also come into play?  Meaning as per the note, if the aggregated size of the messages is greater than 250KB then we could end up with significantly more than 20 messages as the count used for metering purposes.  Have I understood correctly?

      2.  Regarding choice between SAP HANA Cloud and SAP Datasphere for data warehousing, I think the main difference is SAP HANA Cloud would be a more technical SQL data warehouse approach, whereas Datasphere provides a more abstracted semantic layer approach geared towards business users.  I think the discussion in "SAP DATAWAREHOUSE CLOUD" VS "SAP HANA CLOUD" for MODELLING addresses your question well.  You can also have the best of both worlds with SAP BTP Showcase – Access the SAP HANA Cloud database underneath SAP Data Warehouse Cloud or Access HANA Cloud Database Service (BTP) as source in SAP DWC.

      Regards,

      Mustafa.

      Ankit Soni
      Blog Post Author

      Hi Mustafa Bensan,

      Thank you for sharing the discussions regarding the comparison between SAP HANA Cloud and DWC.

      Just to confirm, you are correct that if the message size exceeds 250KB, it may result in multiple message counts. Fortunately, SAP IS message blocks are not costly. However, as per the same note, message size measurement is not yet available, so for the time being each message is counted as one regardless of its size.

      Thanks,
      Ankit

      Mustafa Bensan

      Hi Ankit Soni,

      Thanks for the clarification.  That's a bonus for SAP customers that at least for now message size metering is not implemented 🙂

      It's been a very insightful discussion.

      Regards,

      Mustafa.