
HANA ML DataFrame: End-to-end methods and their usage

A small write-up on the HANA ML dataframe. Writing about what you learn alongside your day-to-day job is a great way to share knowledge, so, holding my passion for technology in both hands, here comes my first post of 2020, and the topic is interesting enough to be everyone's favorite: SAP HANA Machine Learning, and this time it's about the dataframe.

Dataframe methods, their meaning, and their Python implementation.

Let’s go step by step.

HANA ML Dataframe – A Skeleton for data

  • Represents a frame backed by a database SQL statement; it can also be created from a database table.
  • The SAP HANA dataframe provides a set of methods for analyzing data in SAP HANA without bringing the data to the client.

This module represents a database query as a dataframe. Most operations are designed to not bring data back from the database unless explicitly requested.

DataFrame Methods –

Import dataframe library from hana_ml package.

import hana_ml
from hana_ml import dataframe   ##import dataframe from HANA ML

Before diving deep into dataframes, let's first connect to the SAP HANA system and load some data into a dataframe for manipulation, filtering, slicing, etc.

from data_load_utils import DataSets , Settings
url , port , usr , pwd = Settings.load_config(r"C:\Users\abc\config\e2edata.ini")

 

ConnectionContext – This represents a connection to the SAP HANA system. Let's code this:

connection_context = dataframe.ConnectionContext(url, port, usr, pwd) ## pass host, port, user id and password for the connection

connection_context.connection.isconnected()     ## returns True in the console if the connection is established

 

Load Data – Create a dataframe from a table. If you choose the table method, pass the table name of the corresponding schema.

dataset1 = connection_context.table("ADULT_DATA_FULL_TBL") ##Table Name of Schema 

##Another way of achieving it 

dataset1 = connection_context.sql("SELECT * FROM ADULT_DATA_FULL_TBL") ##SQL Query for dataframe 

This returns a dataframe which is just a skeleton; it does not hold any data yet. Most operations on dataframes are optimized to execute inside SAP HANA on the fly, which results in a huge performance benefit.

dataset1.select_statement   ## the select statement used to create the dataframe; its meaning is:
Output
'SELECT * FROM "ADULT_DATA_FULL_TBL"'

 

Execute the dataset1 Jupyter cell and the output is the HANA ML dataframe; it doesn't contain data as of now. We need to call the .collect() method, which executes the select statement and gives the result back to the client (in our case a Jupyter notebook or Python environment).

count() –  Return the number of records in the dataframe.

dataset1.count()

Output
48822

collect() – This triggers the select statement on the dataframe and fetches the data from SAP HANA. Use it when you need the data in the Python environment for display, validation, or any other kind of requirement; the design goal is to perform calculations on the fly and give only the result back to the Python environment for further processing.

example:

dataset1.head(10).collect() ## use collect() to load data from SAP HANA into the client; this fetches the first 10 records

head(10) means top 10 records

Output: the first 10 records of the dataset.

columns – List the columns of the dataset.

dataset1.columns

Output
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']

 

Limit rows during display – once the data is collected into a pandas dataframe on the client, just pass the lower and upper limits in square brackets; this is useful to get a quick overview of the data, like sampling.
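A minimal sketch of this, assuming the data has first been collected into a pandas dataframe on the client:

## collect a sample into pandas first, then slice it by row position
sample = dataset1.head(50).collect()  ## pandas dataframe on the client
sample[10:20]                         ## rows 10 to 19 only, just for a quick look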

Copy of a dataframe into a new dataframe – this is for the case when you don't want to change/alter the original dataframe; note that collect() hands back a pandas dataframe on the client.

dataset3 = dataset1.head(100).collect()

dataset3 ## displays the 100 records copied from dataset1

 

Drop duplicates – drop duplicate rows from the dataset; this is required when you are feeding the data into algorithms which require unique entries only (data cleansing).

dataset4 = dataset1.drop_duplicates() 

##Check the select statement for dataset4 

dataset4.select_statement 
'SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

When removing duplicates from the dataset, the select statement generated on the fly selects only the unique records from the SAP HANA database table; hence your original table in the SAP HANA database remains intact.
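A quick sanity check of that claim, as a sketch using the dataframes defined above:

dataset1.count()  ## row count of the original dataframe, unchanged
dataset4.count()  ## row count after drop_duplicates(); the DISTINCT lives only in the generated SQL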

Drop columns from the dataset – Drop the columns which are not required. This is used to remove useless columns so that algorithms perform better and produce more accurate regression/classification results.

dataset5 = dataset1.drop(["AGE"])

dataset5.select_statement      ## there is no AGE column here; this is also called slicing of data

'SELECT "ID", "WORKCLASS", "FNLWGT", "EDUCATION", "EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", "NATIVECOUNTRY", "INCOME" FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

 

So every time you perform an operation on a dataset using any method, it prepares the corresponding select statement, executes it on the fly, and gives the result back to the client.

Select TOP 10 records from dataset but for two columns only – 

The select statement on the dataframe contains only two columns, and only 10 records are selected once the collect method is called.

dataset6 = dataset1.select('ID', 'WORKCLASS').head(10)  ## two columns, top 10 records

dataset6.select_statement  ##Dynamic select statement
'SELECT TOP 10 * FROM (SELECT "ID", "WORKCLASS" FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") dt'

dataset6.collect()

	ID	WORKCLASS
0	28	Private
1	75	Private
2	196	Private
3	264	Private
4	303	Private
5	324	Private
6	390	Private
7	442	Private
8	447	Private
9	472	Private

 

NAs in the dataset – Check for NULL values in the dataset. This is required before feeding the data into any algorithm: either drop the NULLs or fill them with some value (e.g. an average).

dataset1.hasna('CAPITALGAIN') ## check whether there is any NULL value in the CAPITALGAIN column; returns True or False

Output
True

Fill NAs with a value of your choice:

dataset1 = dataset1.fillna(10 , ["CAPITALGAIN"])
dataset1.hasna('CAPITALGAIN')

Output
False
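The section above mentions dropping NULLs or filling them with an average; here is a minimal sketch of both alternatives, assuming dropna() and whole-table agg() are available in your hana_ml version (check the DataFrame API linked at the end):

## alternative 1: drop rows that have a NULL in CAPITALGAIN
dataset_no_na = dataset1.dropna(subset=['CAPITALGAIN'])

## alternative 2: compute the column average inside HANA, then fill NULLs with it
avg_cg = dataset1.agg([('AVG', 'CAPITALGAIN', 'AVG_CG')]).collect()['AVG_CG'][0]
dataset1 = dataset1.fillna(float(avg_cg), ['CAPITALGAIN'])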

 

Operate on the dataframe and do some manipulation on fields –

A new column TWICE_AGE has been added, holding twice the original AGE.

dsp = dataset1.select('ID', 'AGE', ('"AGE"*2', "TWICE_AGE"))     ## computed column: AGE multiplied by 2, aliased as TWICE_AGE

dsp.head(5).collect()

Output:
	ID	AGE	TWICE_AGE
0	1	71	142
1	2	44	88
2	3	39	78
3	4	27	54
4	5	44	88

 

Check out the select statement of the above dataframe (dsp):

dsp.select_statement

Output

'SELECT "ID", "AGE", "AGE"*2 AS "TWICE_AGE" FROM 
(SELECT "ID", "AGE", "WORKCLASS", "FNLWGT", "EDUCATION", 
"EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION",
"RELATIONSHIP", "RACE", COALESCE("CAPITALGAIN", 10) 
AS "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", 
"NATIVECOUNTRY", "INCOME" 
FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") dt) AS "DT_18"'

 

Data Filtration methods – 

dataset4.head(5).collect()  ## Select 5 Sample records from the dataset 

Output

	ID	AGE	WORKCLASS	FNLWGT	EDUCATION	EDUCATIONNUM	MARITALSTATUS	OCCUPATION	RELATIONSHIP	RACE	GENDER	CAPITALGAIN	CAPITALLOSS	HOURSPERWEEK	NATIVECOUNTRY	INCOME
0	41959	39	Private	286789	Doctorate	16	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	>50K
1	33040	22	Private	349212	Some-college	10	Never-married	Other-service	Own-child	White	Female	0	0	20	United-States	<=50K
2	46142	53	Private	191389	HS-grad	9	Divorced	Exec-managerial	Not-in-family	White	Female	0	0	40	United-States	<=50K
3	1319	37	Private	162424	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	45	United-States	<=50K
4	35823	56	Private	200316	Some-college	10	Married-civ-spouse	Adm-clerical	Husband	White	Male	0	0

 

Query – We want the records having AGE greater than 60.

dataset_age = dataset4.filter('AGE > 60') ## records AGE greater than 60

dataset_age.collect()

dataset_age.count()
Output:
3606

dataset_age.select_statement
Output:
'SELECT * FROM (SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") 
AS "DT_6") AS "DT_20" WHERE AGE > 60'

Output: 3606 records displayed.

Add Multiple conditions for multiple columns – 

dataset4_res = dataset4.filter('AGE > 70 AND EDUCATIONNUM = 8')
dataset4_res.select_statement

Output:
'SELECT * FROM (SELECT DISTINCT * FROM 
(SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") 
AS "DT_20" 
WHERE AGE > 70 AND EDUCATIONNUM = 8'

Multiple conditions have been added here; see the WHERE clause of this dynamic select statement.

dataset4_res.count()

Output:
7

Only 7 records satisfy AGE > 70 AND EDUCATIONNUM = 8.

## Suppose we want those records where EDUCATIONNUM = 9 

dataset4.filter('EDUCATIONNUM = 9').head(5).collect()


## Suppose we want those records where AGE = 27

dataset4.filter('AGE = 27').head(5).collect()

 

Select statement for filter – Filter condition plus selection of three columns; let's see what select statement gets built dynamically.

Python Script –

df = connection_context.table('ADULT_DATA_FULL_TBL', schema='DM_PAL').filter('AGE>70').select('ID', 'AGE' , 'WORKCLASS')

df.select_statement
Output:
'SELECT "ID", "AGE", "WORKCLASS" FROM 
(SELECT * FROM (SELECT * FROM "DM_PAL"."ADULT_DATA_FULL_TBL") 
AS "DT_16" WHERE AGE>70) AS "DT_17"'

See the dynamically prepared select statement.

df.collect()
Output:
	ID	AGE	WORKCLASS
0	9526	90	Private
1	29221	90	Private
2	15578	71	Private
3	25061	74	Private
4	6922	76	Private
...	...	...	...
863	35161	71	Self-emp-not-inc
864	35964	75	Local-gov
865	37604	72	?
866	39221	73	Self-emp-not-inc
867	48214	74	Private
868 rows × 3 columns

 

Multiple filter conditions on Pandas Dataframe – 

dataset3[(dataset3['AGE'] > 27) & ( dataset3['OCCUPATION'] == 'Craft-repair')]

If you have a pandas dataframe at the client and you want to filter records at the client end only, the above filter syntax will suffice.

 

Slicing of Dataframe – 

Slice a few columns from the dataset:

## Copy the first 5 records of the original dataset into a new (pandas) dataframe
new_dataset1 = dataset1.head(5).collect()

new_dataset1[['ID' , 'AGE']]

Output:
	ID	AGE
0	28	27
1	75	57
2	196	48
3	264	52
4	303	35

Copy the sliced data into a third dataframe:
new_dataset2 = new_dataset1[['ID' , 'AGE']]

new_dataset2
Output:
	ID	AGE
0	28	27
1	75	57
2	196	48
3	264	52
4	303	35

Create a dataframe for a few columns only –

dataset_id_age = connection_context.sql('SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL')

dataset_id_age.collect()

Output:
	ID	AGE
0	28	27
1	75	57
2	196	48
3	264	52
4	303	35
5	324	24
6	390	34
7	442	49
8	447	35
9	472	42

dataset_id_age.select_statement
'SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL'

In short, a dataframe is a select statement that just waits to be triggered until it is called.
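A minimal sketch of that laziness, chaining several methods before anything runs (using the dataframes from above):

lazy_df = dataset1.filter('AGE > 40').select('ID', 'AGE').head(5)
lazy_df.select_statement   ## only shows the SQL built so far; nothing has executed in HANA
lazy_df.collect()          ## now the statement actually runs in SAP HANA and the rows come back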

 

describe() – Describe the dataframe with summary statistics:

dataset4.describe().collect() 
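If your hana_ml version supports it (the DataFrame API linked at the end documents an optional column list), describe() can also be restricted to specific columns; a hedged sketch:

dataset4.describe(['AGE', 'HOURSPERWEEK']).collect()  ## statistics for two columns only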

empty() – Returns True if this DataFrame has 0 rows

dataset2.empty()

Output:
False

has() – Returns True if a column is in the DataFrame.

dataset1.columns

Output:
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'GENDER',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']


dataset1.has('ID')

Output:
True

 

join() – Join two datasets based on a condition, e.g. ID = ID.

condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)

## Display Condition

condition 

Output:
'"DT_7"."ID"="DT_3"."ID"'

## Let's join dataset4 and dataset2 and copy the result into a new dataset7
dataset7 = dataset4.join(dataset2, condition)

dataset7.head(5).collect()  ## output: the first 5 rows of the joined dataframe

save() – Save the dataframe into a database table:

dataset4.save('#SAVEDATA')

## retrieve the saved dataframe into a new dataframe 
new_dataset4 = connection_context.table('#SAVEDATA')

##Fetch first 10 records and display 
new_dataset4.head(10).collect()

Output: 10 records displayed.


new_dataset4.select_statement   ## the data now comes from the saved table; in SAP HANA, a table name starting with # denotes a local temporary table

'SELECT * FROM "#SAVEDATA"'

 

AGGREGATE FUNCTIONS – one of the most important topics among the dataframe methods. Calculating an average age, summing up prices, counting items sold: requirements like these are what such functions were invented for. Let's code and see how they work.

## Suppose we need to find the average age of employees in each work class

agg_list = [('AVG', 'AGE', 'AVG_AGE')]  ## aggregation function, source column, new column name

dataset_avg_age = dataset4.agg(agg_list=agg_list , group_by='WORKCLASS').collect()

## Display the results 

dataset_avg_age

The output shows the average age of employees grouped by WORKCLASS.

## Count the number of people of the same age

dataset_age_count = dataset4.agg(agg_list=[('COUNT' , 'ID' , 'ID_COUNT')] , group_by='AGE').collect()

dataset_age_count 

## Render the first rows as an HTML table in the notebook
from IPython.display import HTML
HTML(dataset_age_count.head(5).to_html())

## Maximum AGE of employees in each group

dataset4.agg([('max', 'AGE', 'MAX_AGE')], group_by='WORKCLASS').collect()


## MIN Age of employees in each workclass

dataset4.agg([('min', 'AGE', 'MIN_AGE')], group_by='WORKCLASS').collect()


 

corr() – Find the correlation between two columns of a dataset:

dataset4.corr('ID' , 'AGE').collect()

 

Create a dataframe from pandas – creates an SAP HANA DataFrame containing the data of the given pandas dataframe (pandas_df).

import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}  

df = pd.DataFrame(data=d)

df_new = hana_ml.dataframe.create_dataframe_from_pandas(connection_context, df, 'pd_df', force=False, replace=True)

df   ##Pandas dataframe 

Output:
	col1	col2
0	1	3
1	2	4

df_new.collect()  ## HANA ML dataframe

Output:
	col1	col2
0	1	3
1	2	4

 

Rename column:

dataset1.columns
Output:
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'GENDER',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']

Rename Column ID :
dataset1 = dataset1.rename_columns({'ID' :'ID_NEW'})

dataset1.columns
Output:
['ID_NEW',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'GENDER',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']

Dataframe API – https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.04/en-US/hana_ml.dataframe.html

I have covered almost all of the dataframe methods, but a few are still left to explore; I will keep sharing other methods in my next blog.

 

Thank you for reading 🙂

 

PS: I am continuously updating the blog by adding new methods and any new learning 🙂

      7 Comments
      Emmanuel Joaquim Da Costa

      Hi

      Thank you for the post

      Regards

      Emmanuel

       

      J Deepa

      Thanks for the post...

      Rubane s

      Hi Shivam,

       

      Getting an error while trying to execute the below code:

       

      from data_load_utils import DataSets , Settings
      url , port , usr , pwd = Settings.load_config(r"C:\Users\abc\config\e2edata.ini")

      Error:

      ModuleNotFoundError: No module named 'data_load_utils'

       

      Just wondering, do we need to do something before that which I'm missing?

      And also, I created the e2edata.ini file on my own. I have all the information but don't have a URL. What does URL stand for? I have HANA express edition on GCP.

       

      Thanks

      Shivam Shukla (Blog Post Author)

      Hi ,

       

      e2edata.ini => it just specifies which HANA system you are trying to connect to; alternatively, pass the values of url, user, password etc. directly.

      Second, data_load_utils: download this file from the link below and store it in the local working directory of your Jupyter notebook. If the error is still there, please let me know.

      https://github.com/SAP-samples/hana-ml-samples/tree/main/Python-API/pal/notebooks

       

      Thanks,

      Shivam

      Nico Wong

      Hello,

       

      I was wondering if there's a method to load prediction results back into the HANA database? So say the results are in a pandas df, we would convert it to a SAP df, but how would you go about loading it back into HANA as a table?

      Mukesh Sharma

      Hi Nico Wong ,

       

      Did you get the answer to your query? I have the same query.

       

      Thanks and Regards,

      Mukesh Sharma

      Paulo Castro

      Where can I get an SAP connection to test your code? I don't have this data. Thanks for your attention.

      dataframe.ConnectionContext(url,port,usr,pwd)