Skip to Content
Technical Articles

Import of multiple files into SAP HANA Cloud from a cloud storage (Amazon S3) using SAP Data Intelligence 3.1

I hope you haven’t missed the announcement of SAP Data Intelligence, trial edition 3.1 posted by Dimitri Vorobiev last week! Please do not miss as well exercises from SAP TechEd hands-on sessions that were published too. One of them describes steps to read and load data into SAP HANA.

In my post, I would like to share a slightly different approach: with no data, but control flowing between operators. I will use the same setup and scenario from my last post where I was scripting the import of multiple files into SAP HANA Cloud from S3 cloud storage.

In a nutshell, I want to do automate a load of multiple files stored in a single Amazon S3 bucket into the corresponding tables in the SAP HANA Cloud database. In my exercise, I work with 25 files generated for TPC-DS. Some of these files have significant sizes, but further optimization of their load is not in the scope for now.

The setup…

…is taken from the previous post, assuming the files are generated and stored in the S3 bucket, the PSE in the database is all configured, and we can use the same S3Reader user credentials (but do not try to copy/paste this key/secret, as I recreated them after the publication 🙂 ).

The graph…

… has at least 3 operators:

  1. The one to list all required files from the S3 bucket,
  2. A custom operator to build SQL statements to import data files + truncate, in case of reloads of this initial step to populate tables,
  3. HANA Client to execute SQL statements.

But I included two more operators to detect and terminate execution when the last IMPORT is done.

Btw, I used two different programming languages for custom operators — JavaScript and Python — not for any other reason than showing possibilities. The logic could (and in normal situations as well should) be done without multiplying variations in one solution.

List files

This operator works only with connections defined in the Connection Manager, so I had to configure the connection to the S3 bucket there first.

Please note:

  • RegEx-based patter in the filter; it would allow you to reload only selected files, e.g. /sf0001/data/init/\S*_dim\S* would reload only dimension files.
  • We are loading from S3, where “catalog structure” is a prefix of objects stored in the flat structure. Therefore /sf0001/data/init/ is a part of the name.

Convert the list into SQL statements

Just remember to replace the AWS region, the bucket name, and user credentials with your own 😉

$.setPortCallback("input",onInput);

function onInput(ctx,s) {
    var msg = {};

    var inbody = s.Body;
    var inattributes = s.Attributes;

    msg.Attributes = {};
    msg.Attributes["message.lastBatch"]=inattributes["message.lastBatch"];
    
    var theS3Obj = inattributes.file.path;
    var tableName = (theS3Obj.substring(theS3Obj.lastIndexOf('/')+1).split('.'))[0];
    msg.Body = "TRUNCATE TABLE TPCDS." + tableName + "; ";
    msg.Body+= "IMPORT FROM CSV FILE 's3-eu-central-1://AKIAUJZDQQSWQ3BAMYWJ:QSlFbJGngCdQyp+Ft3fUw6S8g9iCMvmCCVFK0Jb6@tpcds4hdbc" + theS3Obj + "' INTO TPCDS." + tableName + " WITH FIELD DELIMITED BY '|' THREADS 2;"
    msg.Attributes["sql_stmnt"]=msg.Body;
    
    $.output(msg);
}

SAP HANA Client

Make sure you set a long enough time-out, so that IMPORT has enough time to finish without being terminated by the client. The value 0 means no time-out, which should not be used in the real situation.

The graph termination

Because we send two SQL statements (TRUNCATE and IMPORT) with every incoming message the client sends two outgoing messages — once for each completed statement — even if both are treated as a database transaction.

lastBatchesReceived=0

def on_input(msg):
    global lastBatchesReceived
    if 'message.lastBatch' in msg.attributes and msg.attributes['message.lastBatch'] == True: lastBatchesReceived+= 1
    if lastBatchesReceived > 1: api.send("stopsig", lastBatchesReceived)
    
api.set_port_callback("sqlresult", on_input)

So, we need to have a counter two collect two messages with the 'message.lastBatch' equal to True before sending a signal to the Graph Termination operator.


It was an example — kind of a prototype — quick and dirty development. I will appreciate your comments on how you would improve this graph.

Enjoy the exploration of these trial editions!
-Vitaliy aka @Sygyzmundovych

 

2 Comments
You must be Logged on to comment or reply to a post.