While working on an interesting use case for my client where we were designing the solution around a DataLake offered by SAP DI ,which in our case was the S3 bucket of AWS, I felt the need of performing SQL like operations on a CSV file storage. We were extracting ABAP based tables and storing the data in the DI DataLake. Since the S4 Hana or ECC specific tables have changes which occur very frequently we were dealing mainly with Delta Load. Since the DataLake which we created was mainly based on CSVs and we were missing an actual DB, hence, we were mainly left pondering over the question of ‘How to Merge the changes which the Delta Loads bring into the DI DataLake’.
While designing this whole Architecture, I primarily focused on first finding a solution and then later enhancing the solution with more performant features. The most viable solution which we could come up with was using the Pandas Library and leverage the open source Python features. Below is a high level Architecture design of our solution
To be fairly honest, the fastest way to interact with the Kernel of SAP DI is when we use ‘GO’. The only downside is the implementation effort and the cost to maintain the overall solution when the knowledge is gone from the project.
Pandas is a relatively stable library in the market with over 13 years of experience which primarily was created for topics on data manipulation and analysis. In the recent era of topics dominated by ML, Pandas has been a core to many operations.
Of course we cant relatively be closer to what a Database would offer from a performance perspective but we can optimize the solution in tandem with the concepts of a Traditional DB.
Objective of this Blog
The primary objective of writing this blog is to cover
- How we used these various operations to construct some of the core functionalities of our Architecture
- How these operations can be reused in other operation like Data Cleansing , Preprocessing etc.
SQL Like Operations with Pandas
I recall when I learned some core concepts of SAP ABAP almost 10 years ago, almost every example had reference to SFLIGHT. I somewhat have this table drilled in my DNA and its always my go to table for testing. It probably was my ‘Hello World’ in ABAP.
Before I digress in the memory lane lets get back to the Technical aspect. We would use the SFLIGHT table for reference throughout this blog
In a traditional DI setup the data would be stored somewhere either in the DI data Lake , or fetched from a source system and then further operations can be performed based on the pipeline flow.
Lets take the first use case where the Data is read from DI DataLake (SDL).In our architecture , the data is pushed from the ABAP system via an SLT operators and lets say we store this data into a CSV file in the DataLake. Below is a screenshot of how the table would look like in the SDL
Described below is a Fact Sheet of the file where the file location is shown as well as the table structure is defined in columns.
Below is a quick data preview of the table in the SDL
Incase you want to take a deeper look into the file (which I would personally advice) , before you go ahead please check here .
Please Note : Incase you noticed a field called IUUC_OPERATION and incase you are not aware of it, this field is used for SLT Delta Operations and more details can be found on this blog on SAP ABAP Integration by Britta (Link here).
SAP DI comes prepacked with a lot of predefined operators which can be used to read and write files. However I really love to have control over the code and prefer to load the file directly into the same python operator and build on top of it. Below is a quick look at a graph which we would use to read the file from the SDL and further perform SQL operations later on it.
I am using 2 output operators , The output of wiretap is of type message and the output of the Terminal operator is of type string. The way to send response to both the operator is different. Please check line 13 and 14 of the code snippet.
The code snippet from the python operator is like below:
We would use the HDFS library to read the CSV file stored in the SDL. Line number 4 in above code snippet is used to instantiate the client details. The SDL in DI is hosted over the host:client as mentioned in line 4.
Line 10 is where we add the path of file in the SDL and Line 11 and 12 are used as a combination to read the data into a Data Frame.
When working with Pandas Dataframe , incase we want to see the output of how the Dataframe looks like , we can either use wiretap operator or terminal. However please keep in mind the Dataframe has to be converted to a string to show a proper output or else it would show metadata specific info about the DF. Line 13 and 14 shows how such an output can be shown. Below mentioned is an output of the Terminal which shows the whole dataframe.
Incase you noticed this is basically the whole table output or the Select * operation on the table. From a core Python perspective , this would be equivalent to a print (dataframe) or df.head() statement.
Select Top N *
The select top operation of SQL can be emulated with a head statement from Pandas. The command would look like (where df is your dataframe) :
df.head(5) -> this would return the top 5 result of the dataframe as below :
However this data is raw as present in the table and maybe lack objective/meaning. Lets look at how to sort data
The sort operations can be performed in Pandas using the sort_values function. A statement considering our example would look like below :
df = df.sort_values(‘CARRID’,ascending = True)
In this case we are sorting the data based on the CARRID field and the output should look like below:
Group By Statement
The group by example in our dataset wont fit since each entry is kind of unique in its own way and an aggregation wont work , however the Group By statement in pandas would be like below :
The aggregation is for each column using the above statement based on a min value.
Select Specific columns
The classic Select a,b,c from table T which is more like selecting a subset from the main data set can be performed in Pandas very easily. The below statement will select the fields CARRID , CONNID and FLDATE from the big dataset and display the output
df = df[[‘CARRID’ , ‘CONNID’ , ‘FLDATE’]]
Now lets add a little more flavor to this. The group by statement didn’t work to the finest in our previous example because the column FLDATE has unique value for each row. What we if select unique columns and then group by to get an aggregated result? All we need to do is to combine our above 2 statements either as 2 separate statements or as 1 combined statement.
Lets say we select CARRID , CONNID and PAYMENTSUM and want to find the sum of amount based on a group by for CARRID and CONNID(i.e Sum of amount for a unique combination of CARRID and CONNID).The below statement can achieve this result
df = df[[‘CARRID’ , ‘CONNID’ , ‘PAYMENTSUM’]]
df = df.groupby([‘CARRID’,’CONNID’]).agg(‘sum’)
Here is a screenshot of the result
Select count *
This operation basically gives a count of total number of rows. In pandas the details relevant to row / column can be obtained with the statement df.shape[0 or 1 for specific details] or just df.shape to view the n*m output of the table structure
df.shape -> This gives the row count
df.shape -> This gives the column count
Now lets move towards bit more complicated cases.
Select distinct values
Now lets discuss operations where we need distinct values for a particular field or group of fields. Lets say we want all the unique value from our dataset for CARRID.
Our first step would be to select logically would be to select just CARRID dataset from the whole table and then determine the unique values. We can achieve this with the below statement
LT_CARRID = df[‘CARRID’].unique()
The above statement would give as a response a NumPY array of unique values. Below is an output and if you would want a closer look at the code please check here.
Select * or Select Subset with Where conditions
We have already discussed the various combinations of selecting either the whole table , dataset or even selecting distinct values. Lets combine this knowledge so far to build ahead from here.
Lets say we want to select the whole table where CARRID = ‘AC’ and CONNID = ‘0820’.
In situations where we are dealing with more than 1 selection , we do so as defined below
Step 1 ( Create a combination for basic skeleton for a where condition)
df_carrid = df[‘CARRID’] == ‘AC’
df_connid = df[‘CONNID’] == ‘0820’
The above statement would go through the whole data set and try to map each row with the defined value .If the value is found it would be set to true or else False. Below is a screenshot of the response of these 2 statements.
I hope this could clarify a little but let me be honest, I did lie a bit. The second statement would not return the correct results. Any guesses why?
df_connid = df[‘CONNID’] == ‘0820’
The reason is that CONNID is of type integer however our selection in the above statement is of type string. Of course we can do a quick workaround like below to resolve this ambiguity which would eventually mean the same as the above defined statement, however, with the right data type assigned and a bit of dynamic flavor added(however we still have the hardcoding).
LT_val = [‘AC’ ,820]
df_carrid = df[‘CARRID’] == LT_Val
df_connid = df[‘CONNID’] == LT_Val
Now Lets move ahead to the second step which is to create a data frame based on the union of our above 2 conditions. This can be achieved with the below statement
df_result = df[df_carrid & df_connid ]
The result of this statement would look like below which matched our desired result :
Incase we have an or condition the statement would change like below :
df_result = df[df_carrid | df_connid ]
This statement would give a subset where CARRID = ‘AC’ or CONNID = 0820
Select based on Like condition.
Well to be honest this dataset of SFLIGHT table is horrible for this example. But still lets make a random invalid example to build a use case.
Lets say we want to extract all the records where Currency contains the letter ‘U’.
The below statement would do the trick
however this in itself wont be sufficient for our dataset. If you take a closer look at the data there are few currency rows which are empty. These empty values are interpreted as NAN by Data frames. Our first step would be to convert these Nan to blank
The result of this would give us what we are looking for , below is the screenshot of the response :
Incase you noticed the fillna function there is an argument passed with inplace = TRUE.
With inplace = TRUE setting , the dataframe will be updated with the changes performed in that statement so to simplify my complex explanation a statement like
df.fillna(”, inplace=True) is equal to df = df.fillna(”)
Well to be honest we still need to cover the most fundamental operations of Insert, Update and Delete.
We still need to cover Select statements based on a time range like
select from sflight where fldate > 01.01.2021 & fldate< 01.01.1989
There is type casting, Union operations and much more.
As I can see in the preview , this blog is already roughly 10 minutes of read and I honestly would prefer to break this blog into another part where these core concepts are covered in more details.
As I mentioned in the beginning of the blog, handling the delta records is a core concept of our Architecture, I would love to discuss it in my next blog. Here is the link to it :
Hope this blog could be of some use to someone, somewhere, someday 🙂 ! Have a nice time ahead.