Skip to Content
Technical Articles

SAP Tech Bytes: SAP HANA Mass Input with Duplicates

Recently I was working on a project and hit a rather interesting challenge. I needed to insert large datasets from multiple Excel spreadsheets into SAP HANA. That alone wouldn’t be too difficult, but my situation had a few wrinkles.

First the data I needed to process would contain duplicate entries even within an a single input spreadsheet. This was simply an issue with the source dataset and I would need to discard those duplicate records that occur within a single input sheet programmatically.

Second I needed to load multiple spreadsheets and there would be duplicate records between the data already in the target table and the input spreadsheet. When these duplicate key matches occur, I need to add the quantities to the existing records.

And the final problem; the size of the datasets. I need to automate this loading but I also needed a way to mass load the data.  Looping over the dataset and calling HANA once per record would not result in acceptable performance. I needed to find a way to send the whole dataset to the server and do all of the duplicate handling there in a batch sort of way.

Ultimately I found a solution to the problem, learned a fair amount along the way and wanted to share as both an explanatory blog post and some sample code the reader can study in detail.

Recreating the Scenario

I can’t share the whole project nor would everything about it be interesting.  Therefore I’ve recreated a simplified version of the project that has the same core issues. You can fine the complete project here: SAP-samples/sap-tech-bytes at 2021-03-17-hana-mass-input-with-dups (github.com)

I did create this as an SAP Cloud Application Programming model project, although we won’t use a whole lot of CAP specific capabilities here. I just want to model a single table and setup the project to use HANA even during testing.  That way I can use HANA specific logic for the loading of data.

I setup the project to target SAP HANA Cloud but it should all work with recent versions of HANA on premise (including HANA, express edition).  I did all the development using SAP HANA Cloud free trial and with VS Code as the development tooling.

If you clone my project it should be nearly ready to use.  You would need to create the HDI container service instance, service key and then you can use hana-cli to write those technical details to a default-env.json file for local testing and deployment. Just be sure to put the default-env.json that results in this command in the root of the project (for CAP) and make a copy in the /db folder to perform HDI deployment.

Create%20HDI%20Container%20and%20Save%20Details%20in%20default-env.json

Create HDI Container and Save Details in default-env.json

The terminal commands for this step can all be found here: sap-tech-bytes/create-services.md at 2021-03-17-hana-mass-input-with-dups · SAP-samples/sap-tech-bytes (github.com)

The data model itself is reduced in simplicity. I cut everything down to a single table.  It has key fields to identify a organization and repository but then also the date as a key.  That way per repository we are accepting views by date and its’s these views we might need to add to an existing record in the database for the same repository and date.

entity Stats {
    key org        : String;
    key repository : String;
    key date       : Date;
        views      : Integer;
}

CDS Schema Definition here: https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/db/schema.cds

If you build and deploy this model into SAP HANA, you will get a single table with the following structure:

STATS%20Table%20Deployed%20to%20HANA

STATS Table Deployed to HANA

Likewise the sample project contains a small version of the Excel data but still contains all the problems I described above.  This sample data is in the project here: https://github.com/SAP-samples/sap-tech-bytes/tree/2021-03-17-hana-mass-input-with-dups/data

For easy testing and development the sample data only has 6 rows in it.

Sample%20Data%20in%20Excel

Sample Data in Excel

But this still contains the first problem I described. Rows 2 and Rows 6 (TEST/Site1 on March 3rd) are duplicate records.

The Solution

The complete source code of the loading script can be found here: https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData.js

To make the solution easy to test I’ve altered the loading script to run from the shell, interactively and let you select the Excel file you want to load.  That’s not really the core of what we want to study here but if you are interested in the logic and modules used to both query the user for the file and parse the Excel, that logic is here: https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData.js#L10-L47

What I want to focus on here is how to get this data into HANA efficiently. For that purpose we will pickup with the processing once we have it in a JSON object and ready to send to HANA.

The first problem was how to get the data into HANA in a batch. Any operation that worked just one record at a time had too much back and forth overhead between the Node.js and database.  I needed to be able to use a Batch operation. But I have the duplicates issue to deal with as well.

My first thought was to use an UPSERT but that isn’t supported via the HANA Client executeBatch.  My next thought was to instead create a temporary table that matches the structure of my target table but doesn’t have the primary key index or constraints on it. As it turns out, this is easy as the CREATE TABLE statement allows you to use the LIKE <Table> option as well as modifiers to remove features from the copy. With one line of SQL I can create a duplicate of my target table that will allow duplicates to be inserted.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData.js#L64

 

    await dbConn.execSQL(`CREATE LOCAL TEMPORARY TABLE #TEMP_STATS LIKE STATS WITHOUT INDEX WITHOUT CONSTRAINT`)

Now that I have the temporary table I can use the execBatch of the HANA client to insert all the records at once with minimal back and forth between the Node.js and HANA layers.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData.js#L65-L67

    const statementStats = await dbConn.preparePromisified(`INSERT INTO #TEMP_STATS (ORG, REPOSITORY, DATE, VIEWS) VALUES (?, ?, TO_DATE(?, 'MM/DD/YYYY'), ?)`)

    await dbConn.statementExecBatchPromisified(statementStats, stats)

So first problem solved. I have the entire input dataset on the server but still with duplicates and not yet aggregated into my target table.

For that, my next thought was back to using the UPSERT command. UPSERT has a subquery option that will allow the use of the temporary table as the source. But here I hit an unexpected issue. The UPSERT (and later the MERGE) command needs the source sub query to return unique results. When I performed the operation with my temporary table as the subquery source, I received an unique constraint violation. I’m guessing that the operation actually creates a temporary table for the subquery results but matches the target table exactly with its temporary table. Therefore my duplicate records in the source dataset cause problems again.

The solution was that I needed to de-dupe the records in the temporary table. I was able to do this by using the ROW_NUMBER function over a WINDOW of data created by a PARTITION BY over my actual keys. This way any duplicate records are numbered and I can safely delete any record with a row number greater than 1 knowing that it is a duplicate.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData.js#L68-L72

    await dbConn.execSQL(
      `DELETE FROM #TEMP_STATS WHERE "$rowid$" IN
       ( SELECT ROW_ID FROM (SELECT ROW_NUMBER() OVER (PARTITION BY ORG, REPOSITORY, "DATE") as RN,    
         "$rowid$" as ROW_ID, ORG, REPOSITORY, "DATE" FROM #TEMP_STATS ORDER BY 3, 2, 1)    
        WHERE RN>1)`)

My final issue was the way I wanted to handle those records that matched existing records in the database. With the subquery and an UPSERT statement, I didn’t have enough control over what happens when I find a matching record. This lead me to switch from the UPSERT command to the MERGE INTO command. This more powerful command allows you as the developer to code both the WHEN MATCHED and WHEN NOT MATCHED logic to control exactly what happens in each scenario.

This worked wonderfully; allowing me to add the current record value to the input value when I find matching keys.

https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/loadData.js#L74-L77

    await dbConn.execSQL(`MERGE INTO STATS AS T1 USING #TEMP_STATS AS T2
        ON T1.ORG = T2.ORG and T1.REPOSITORY = T2.REPOSITORY and T1.DATE = T2.DATE
        WHEN MATCHED THEN UPDATE SET T1.VIEWS = (T1.VIEWS + T2.VIEWS)
        WHEN NOT MATCHED THEN INSERT (ORG, REPOSITORY, DATE, VIEWS) VALUES(T2.ORG, T2.REPOSITORY, T2.DATE, T2.VIEWS)`)

And finally it’s time for testing.  I load my small Excel sample and upon the first load it correctly drops the one duplicate record but the rest are inserted (in a mass operation without any constraint errors).

First%20Test%20Results%20%28HANA%20Table%20on%20top%20and%20Excel%20on%20bottom%29

First Test Results (HANA Table on top and Excel on bottom)

And if you now load the exact same sample Excel again, you will that the duplicate from the source is removed first and then all the views are aggregated (doubled in this simple test case).

Second%20Test

Second Test

If you are following along with the execution on your own project, you can clear the data easily from the HANA table by running the clearData.js script.  That allows you to run various tests in different scenarios and orders. https://github.com/SAP-samples/sap-tech-bytes/blob/2021-03-17-hana-mass-input-with-dups/srv/clearData.js

Closing

You are of course welcome to try all of this out yourself. That’s why we created the simplified version of the data model, project and small dataset. There’s a branch for this SAP Tech Bytes post in the accompanying SAP Tech Bytes repository: SAP-samples/sap-tech-bytes at 2021-03-17-hana-mass-input-with-dups (github.com)

SAP Tech Bytes is an initiative to bring you bite-sized information on all manner of topics, in video and written format. Enjoy!

2 Comments
You must be Logged on to comment or reply to a post.
  • This post deals with a common problem that both have been solved many times over and at the same time is non-obvious enough to require considerable effort when encountered for the first time.

    It's a good idea to post about this sort of problem and show how it can be solved!

    The presented solution solves the problem, but leaves a few "rough edges".

    The first of those "rough edges" is the use of the HANA internal column "$rowid$" to create a unique set of records. Not only is this column not supported by SAP to be used in your application, moreover it should absolutely ring a tiny alarm bell if it is required to use a low-level concept like "$rowid$" to solve a problem that is clearly in the domain of the SQL language.

    Looking further into the presented solution the following becomes clear:

    - the per-input-file de-duping works based on the ORG, REPOSITORY, DATE key.
    That means if there are multiple entries with the same key but with different VIEW counts only one of those VIEW count values will be taken further. And it is not specified *which* one it will be, as the ROW_NUMBER() expression lacks the ORDER BY clause.

    - the ORDER BY clause in the DELETE FROM subquery is superfluous AND obscure at the same time as instead of writing out the column names to order by, the column indexes of the result set are used.

    - running over the temporary table and deleting records in place first and using the remaining data for the MERGE surely adds system workload but does not add value to the solution.

    Instead, this could be done with a single command without looping over the source table twice, without using an unsupported low-level construct, with a less obscure and unambiguous logic.

    MERGE INTO STATS_t AS trgt
    USING(
            SELECT ORG, REPOSITORY, DDATE, VIEWS 
            FROM (
                SELECT 
                    ROW_NUMBER() OVER 
                        (PARTITION BY ORG, REPOSITORY, DDATE
                         ORDER BY ORG, REPOSITORY, "DDATE", VIEWS DESC ) as RN
                         , ORG
                         , REPOSITORY
                         , DDATE 
                         , VIEWS 
                FROM STATS)    
            WHERE RN = 1
            ) AS src
            ON (trgt.ORG, trgt.REPOSITORY, trgt.DDATE) 
             = (src.ORG , src.REPOSITORY , src.DDATE)
    
    WHEN MATCHED THEN 
        UPDATE SET trgt.VIEWS = (trgt.VIEWS + src.VIEWS)
    
    WHEN NOT MATCHED THEN 
        INSERT (ORG    , REPOSITORY    , DDATE    , VIEWS) 
        VALUES (src.ORG, src.REPOSITORY, src.DDATE, src.VIEWS);  

    Going through this one change at a time:
    - I renamed STATS to STATS_t and #STATS to STATS, as well as "DATE" to DDATE.
    That doesn't change the logic but made it more convenient for me.

    - Instead of modifying data in the temp table, I simply put the subquery that is responsible to generate unique records into the MERGE statement.

    - Instead of T1 and T2 I use trgt (TARGET) and src (SOURCE) as table aliases to make the whole statement easier to grok - especially since the merge SYNTAX is not commonly the best-known SQL command.

    - Instead of a chain of `AND t1.column = t2.column AND ...` I opt for the "tuple"-syntax that allows for two-line formatting that makes it very clear which fields are to be matched and that this is the complete key to be used.

    - Similarly, the INSERT branch formatting matches target fields with values.
    - Speaking of branches, by formatting the MATCHED and NOT MATCHED branch properly, the "logic" of the MERGE command becomes more apparent.

    - Coming to the "heart" of the de-duplication logic.
    Using ROW_NUMBER to identify duplicates when not all columns of the set should be considered is a common approach. And filtering on the first occurrence is part of this approach, so that's perfectly fine and what I do here, too.
    My statement does away with the final ORDER BY for the SELECT, as it really does not matter here.

    Instead, the ROW_NUMBER clause gets a lot more specific by using an ORDER BY clause to define WHICH of the duplicates will "survive". Here I specified that the ROW NUMBER will be assigned to the records by VIEWs in descending order.
    That means, the record with the HIGHEST VIEW value in the key-group will get ROW NUMBER 1 and thus will be the "survivor" of the de-duplication process.

    Using the MERGE statement this way leaves uns with a single statement that gets executed by HANA. No two-step processing required here. No internal column required here. The statement itself is more readable (both in source code and in the HANA monitoring).
    And by spelling out which records are to be included, the whole solution is a lot less ambiguous.