Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
thomas_jung
Developer Advocate
Developer Advocate
Recently I was working on a project and hit a rather interesting challenge. I needed to insert large datasets from multiple Excel spreadsheets into SAP HANA. That alone wouldn't be too difficult, but my situation had a few wrinkles.

First the data I needed to process would contain duplicate entries even within an a single input spreadsheet. This was simply an issue with the source dataset and I would need to discard those duplicate records that occur within a single input sheet programmatically.

Second I needed to load multiple spreadsheets and there would be duplicate records between the data already in the target table and the input spreadsheet. When these duplicate key matches occur, I need to add the quantities to the existing records.

And the final problem; the size of the datasets. I need to automate this loading but I also needed a way to mass load the data.  Looping over the dataset and calling HANA once per record would not result in acceptable performance. I needed to find a way to send the whole dataset to the server and do all of the duplicate handling there in a batch sort of way.

Ultimately I found a solution to the problem, learned a fair amount along the way and wanted to share as both an explanatory blog post and some sample code the reader can study in detail.

Recreating the Scenario


I can't share the whole project nor would everything about it be interesting.  Therefore I've recreated a simplified version of the project that has the same core issues. You can fine the complete project here: SAP-samples/sap-tech-bytes at 2021-03-17-hana-mass-input-with-dups (github.com)

I did create this as an SAP Cloud Application Programming model project, although we won't use a whole lot of CAP specific capabilities here. I just want to model a single table and setup the project to use HANA even during testing.  That way I can use HANA specific logic for the loading of data.

I setup the project to target SAP HANA Cloud but it should all work with recent versions of HANA on premise (including HANA, express edition).  I did all the development using SAP HANA Cloud free trial and with VS Code as the development tooling.

If you clone my project it should be nearly ready to use.  You would need to create the HDI container service instance, service key and then you can use hana-cli to write those technical details to a default-env.json file for local testing and deployment. Just be sure to put the default-env.json that results in this command in the root of the project (for CAP) and make a copy in the /db folder to perform HDI deployment.


Create HDI Container and Save Details in default-env.json


The terminal commands for this step can all be found here: sap-tech-bytes/create-services.md at 2021-03-17-hana-mass-input-with-dups · SAP-samples/sap-tech-byt...

The data model itself is reduced in simplicity. I cut everything down to a single table.  It has key fields to identify a organization and repository but then also the date as a key.  That way per repository we are accepting views by date and its's these views we might need to add to an existing record in the database for the same repository and date.
entity Stats {
key org : String;
key repository : String;
key date : Date;
views : Integer;
}

CDS Schema Definition here: https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/db/schema.cd...

If you build and deploy this model into SAP HANA, you will get a single table with the following structure:


STATS Table Deployed to HANA


Likewise the sample project contains a small version of the Excel data but still contains all the problems I described above.  This sample data is in the project here: https://github.com/SAP-samples/sap-tech-bytes/tree/2021-03-17-hana-mass-input-with-dups/data

For easy testing and development the sample data only has 6 rows in it.


Sample Data in Excel


But this still contains the first problem I described. Rows 2 and Rows 6 (TEST/Site1 on March 3rd) are duplicate records.

The Solution


The complete source code of the loading script can be found here: https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData...

To make the solution easy to test I've altered the loading script to run from the shell, interactively and let you select the Excel file you want to load.  That's not really the core of what we want to study here but if you are interested in the logic and modules used to both query the user for the file and parse the Excel, that logic is here: https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData...

What I want to focus on here is how to get this data into HANA efficiently. For that purpose we will pickup with the processing once we have it in a JSON object and ready to send to HANA.

The first problem was how to get the data into HANA in a batch. Any operation that worked just one record at a time had too much back and forth overhead between the Node.js and database.  I needed to be able to use a Batch operation. But I have the duplicates issue to deal with as well.

My first thought was to use an UPSERT but that isn't supported via the HANA Client executeBatch.  My next thought was to instead create a temporary table that matches the structure of my target table but doesn't have the primary key index or constraints on it. As it turns out, this is easy as the CREATE TABLE statement allows you to use the LIKE <Table> option as well as modifiers to remove features from the copy. With one line of SQL I can create a duplicate of my target table that will allow duplicates to be inserted.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData...

 
    await dbConn.execSQL(`CREATE LOCAL TEMPORARY TABLE #TEMP_STATS LIKE STATS WITHOUT INDEX WITHOUT CONSTRAINT`)

Now that I have the temporary table I can use the execBatch of the HANA client to insert all the records at once with minimal back and forth between the Node.js and HANA layers.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData...
    const statementStats = await dbConn.preparePromisified(`INSERT INTO #TEMP_STATS (ORG, REPOSITORY, DATE, VIEWS) VALUES (?, ?, TO_DATE(?, 'MM/DD/YYYY'), ?)`)

await dbConn.statementExecBatchPromisified(statementStats, stats)

So first problem solved. I have the entire input dataset on the server but still with duplicates and not yet aggregated into my target table.

For that, my next thought was back to using the UPSERT command. UPSERT has a subquery option that will allow the use of the temporary table as the source. But here I hit an unexpected issue. The UPSERT (and later the MERGE) command needs the source sub query to return unique results. When I performed the operation with my temporary table as the subquery source, I received an unique constraint violation. I'm guessing that the operation actually creates a temporary table for the subquery results but matches the target table exactly with its temporary table. Therefore my duplicate records in the source dataset cause problems again.

The solution was that I needed to de-dupe the records in the temporary table. I was able to do this by using the ROW_NUMBER function over a WINDOW of data created by a PARTITION BY over my actual keys. This way any duplicate records are numbered and I can safely delete any record with a row number greater than 1 knowing that it is a duplicate.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData...
    await dbConn.execSQL(
`DELETE FROM #TEMP_STATS WHERE "$rowid$" IN
( SELECT ROW_ID FROM (SELECT ROW_NUMBER() OVER (PARTITION BY ORG, REPOSITORY, "DATE") as RN,
"$rowid$" as ROW_ID, ORG, REPOSITORY, "DATE" FROM #TEMP_STATS ORDER BY 3, 2, 1)
WHERE RN>1)`)

My final issue was the way I wanted to handle those records that matched existing records in the database. With the subquery and an UPSERT statement, I didn't have enough control over what happens when I find a matching record. This lead me to switch from the UPSERT command to the MERGE INTO command. This more powerful command allows you as the developer to code both the WHEN MATCHED and WHEN NOT MATCHED logic to control exactly what happens in each scenario.

This worked wonderfully; allowing me to add the current record value to the input value when I find matching keys.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData...
    await dbConn.execSQL(`MERGE INTO STATS AS T1 USING #TEMP_STATS AS T2
ON T1.ORG = T2.ORG and T1.REPOSITORY = T2.REPOSITORY and T1.DATE = T2.DATE
WHEN MATCHED THEN UPDATE SET T1.VIEWS = (T1.VIEWS + T2.VIEWS)
WHEN NOT MATCHED THEN INSERT (ORG, REPOSITORY, DATE, VIEWS) VALUES(T2.ORG, T2.REPOSITORY, T2.DATE, T2.VIEWS)`)

And finally it's time for testing.  I load my small Excel sample and upon the first load it correctly drops the one duplicate record but the rest are inserted (in a mass operation without any constraint errors).


First Test Results (HANA Table on top and Excel on bottom)


And if you now load the exact same sample Excel again, you will that the duplicate from the source is removed first and then all the views are aggregated (doubled in this simple test case).


Second Test


If you are following along with the execution on your own project, you can clear the data easily from the HANA table by running the clearData.js script.  That allows you to run various tests in different scenarios and orders. https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/clearDat...

Closing


You are of course welcome to try all of this out yourself. That's why we created the simplified version of the data model, project and small dataset. There’s a branch for this SAP Tech Bytes post in the accompanying SAP Tech Bytes repository: SAP-samples/sap-tech-bytes at 2021-03-17-hana-mass-input-with-dups (github.com)


SAP Tech Bytes is an initiative to bring you bite-sized information on all manner of topics, in video and written format. Enjoy!
6 Comments