
HANA ML DataFrame: End-to-End Methods and Their Usage

A small write-up on the HANA ML dataframe. Writing about what you learn along with your day-to-day job is a real learning, an exposure, and a knowledge-sharing process, so, holding the passion for technology in both hands, here comes my first post of 2020. The topic is interesting enough, everyone's favorite: SAP HANA machine learning, and this time it's about the dataframe.

DataFrame methods, their meaning, and their Python implementation.

Let’s go step by step.

HANA ML Dataframe – A Skeleton for data

  • Represents a frame backed by a database SQL statement; it can also be created from a table.
  • The SAP HANA dataframe provides a set of methods for analyzing the data in SAP HANA without bringing it to the client.

This module represents a database query as a dataframe. Most operations are designed not to bring data back from the database unless explicitly requested.

DataFrame Methods –

Import the dataframe module from the hana_ml package.

import hana_ml
from hana_ml import dataframe   ##import dataframe from HANA ML

Before diving deep into dataframes, let's first connect to an SAP HANA system and load some data into a dataframe for manipulation, filtering, slicing, etc.

from data_load_utils import DataSets, Settings   ## helper utilities; load credentials from an ini file
url, port, usr, pwd = Settings.load_config(r"C:\Users\abc\config\e2edata.ini")

 

ConnectionContext – this represents a connection to the SAP HANA system. Let's code this:

connection_context = dataframe.ConnectionContext(url, port, usr, pwd) ## pass host, port, user id, and password for the connection

connection_context.connection.isconnected()   ## if connected, you will see True in the console as output

 

Load Data – create a dataframe from a table. If you choose the table method, pass the table name from the corresponding schema.

dataset1 = connection_context.table("ADULT_DATA_FULL_TBL") ## table name in the corresponding schema

## Another way of achieving it

dataset1 = connection_context.sql("SELECT * FROM ADULT_DATA_FULL_TBL") ## SQL query for the dataframe

This returns a dataframe that is just a skeleton; it does not hold any data yet. Most operations on dataframes are optimized to execute inside SAP HANA on the fly, which results in a huge performance benefit.

dataset1.select_statement   ## the select statement used to create the dataframe; its meaning is:
Output
'SELECT * FROM "ADULT_DATA_FULL_TBL"'

 

Executing the dataset1 Jupyter cell returns the HANA ML dataframe itself; it does not contain data as of now. We need to call the .collect() method, which executes the select statement and gives the result back to the client (in our case a Jupyter notebook or Python environment).

count() –  Return the number of records in the dataframe.

dataset1.count()

Output
48822

collect() – this triggers the select statement of the dataframe and fetches the data from SAP HANA. Use it when you actually need the data in the Python environment, e.g. to display or validate it. The design goal is to perform calculations on the fly and hand only the result back to the Python environment for further processing.

example:

dataset1.head(10).collect() ## use collect() to load data from SAP HANA into the client; this fetches the first 10 records

head(10) means the top 10 records.

result :

columns – list the columns in the dataset.

dataset1.columns

Output
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']

 

Limit rows during display – just pass the lower and upper limits in square brackets on the collected result. This is handy to get a quick overview of the data, like sampling.
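Since collect() hands back a pandas DataFrame, the [lower:upper] slice works on the collected result. A minimal client-side sketch with hypothetical data (a live HANA connection would be needed for the real dataset1.head(20).collect() call):

```python
import pandas as pd

# Stand-in for dataset1.head(20).collect() -- a plain pandas DataFrame
local_df = pd.DataFrame({'ID': range(1, 21), 'AGE': range(21, 41)})

# Slice rows 5..9 by position, just for a quick look at the data
sample = local_df[5:10]
print(len(sample))  # 5 rows
```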

Copy a dataframe into a new one – for the case when you don't want to change/alter the original dataframe.

dataset3 = dataset1.head(100).collect()   ## collect() returns a pandas DataFrame at the client

dataset3 ## displays the 100 records copied from dataset1
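To make the independence of the copy concrete, here is a client-side sketch with made-up data; dataset1_local stands in for the collected result:

```python
import pandas as pd

# Stand-in for dataset1.head(100).collect(); real data needs a HANA connection
dataset1_local = pd.DataFrame({'ID': [1, 2, 3], 'AGE': [30, 40, 50]})

dataset3 = dataset1_local.copy()   # independent client-side copy
dataset3.loc[0, 'AGE'] = 99        # mutate the copy only

print(dataset1_local.loc[0, 'AGE'])  # the original still holds 30
```

Mutating the collected copy never touches the table in SAP HANA, because the data was fetched to the client first.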

 

Drop duplicates – drop duplicate rows from the dataset. This is required when feeding algorithms that expect unique entries; a typical data-cleansing step.

dataset4 = dataset1.drop_duplicates() 

##Check the select statement for dataset4 

dataset4.select_statement 
'SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

When removing duplicates, the generated select statement picks only the unique records from the SAP HANA table on the fly, so your original table in the SAP HANA database remains intact.
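For comparison, the same de-duplication on an already-collected (pandas) frame; the rows here are made up:

```python
import pandas as pd

# Two fully identical rows (ID 1, AGE 30) and one distinct row
df = pd.DataFrame({'ID': [1, 1, 2], 'AGE': [30, 30, 40]})

unique = df.drop_duplicates()  # pandas equivalent of SELECT DISTINCT
print(len(unique))  # 2
```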

Drop columns from the dataset – drop columns that are not required. This is used to remove useless columns so that algorithms perform better and regression/classification results are more accurate.

dataset5 = dataset1.drop(["AGE"])

dataset5.select_statement      ## there is no AGE column here; this is also called slicing the data

'SELECT "ID", "WORKCLASS", "FNLWGT", "EDUCATION", "EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", "NATIVECOUNTRY", "INCOME" FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

 

So every time you perform an operation on a dataset using any method, it prepares the corresponding select statement, executes it on the fly, and gives the result back to the client.

Select the TOP 10 records from the dataset, but for two columns only –

The select statement of the dataframe contains only two columns and fetches only 10 records once collect() is called.

dataset6 = dataset1.select('ID', 'WORKCLASS').head(10)

dataset6.select_statement  ##Dynamic select statement
'SELECT TOP 10 * FROM (SELECT "ID", "WORKCLASS" FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") dt'

dataset6.collect()

	ID	WORKCLASS
0	28	Private
1	75	Private
2	196	Private
3	264	Private
4	303	Private
5	324	Private
6	390	Private
7	442	Private
8	447	Private
9	472	Private

 

NAs in the dataset – check for NULL values in the dataset. This is required before feeding the data into any algorithm: drop the nulls or impute, e.g., an average value.

dataset1.hasna('CAPITALGAIN') ## check whether the CAPITALGAIN column contains any NULL value; returns True or False

Output
True

Fill NAs with a chosen value

dataset1 = dataset1.fillna(10 , ["CAPITALGAIN"])
dataset1.hasna('CAPITALGAIN')

Output
False
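The generated SQL wraps the column in COALESCE("CAPITALGAIN", 10), as the select statement further below shows. The client-side pandas equivalent, on hypothetical values, looks like this:

```python
import pandas as pd

# Hypothetical CAPITALGAIN values with one NULL
df = pd.DataFrame({'CAPITALGAIN': [2500.0, None, 0.0]})

# Pandas counterpart of fillna(10, ["CAPITALGAIN"]) / COALESCE(col, 10)
df['CAPITALGAIN'] = df['CAPITALGAIN'].fillna(10)

print(df['CAPITALGAIN'].isna().any())  # False
```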

 

Operations on the dataframe – manipulating fields –

A new column TWICE_AGE is added, containing the original age multiplied by 2.

dsp = dataset1.select('ID' , 'AGE'  , ('"AGE"*2' , "TWICE_AGE"))     ##Age Column has been added with multiplication

dsp.head(5).collect()

Output -
	ID	AGE	TWICE_AGE
0	1	71	142
1	2	44	88
2	3	39	78
3	4	27	54
4	5	44	88

 

Check out the select statement of the above dataframe (dsp):

dsp.select_statement

Output

'SELECT "ID", "AGE", "AGE"*2 AS "TWICE_AGE" FROM 
(SELECT "ID", "AGE", "WORKCLASS", "FNLWGT", "EDUCATION", 
"EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION",
"RELATIONSHIP", "RACE", COALESCE("CAPITALGAIN", 10) 
AS "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", 
"NATIVECOUNTRY", "INCOME" 
FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") dt) AS "DT_18"'

 

Data Filtration methods – 

dataset4.head(5).collect()  ## Select 5 Sample records from the dataset 

Output

	ID	AGE	WORKCLASS	FNLWGT	EDUCATION	EDUCATIONNUM	MARITALSTATUS	OCCUPATION	RELATIONSHIP	RACE	GENDER	CAPITALGAIN	CAPITALLOSS	HOURSPERWEEK	NATIVECOUNTRY	INCOME
0	41959	39	Private	286789	Doctorate	16	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	>50K
1	33040	22	Private	349212	Some-college	10	Never-married	Other-service	Own-child	White	Female	0	0	20	United-States	<=50K
2	46142	53	Private	191389	HS-grad	9	Divorced	Exec-managerial	Not-in-family	White	Female	0	0	40	United-States	<=50K
3	1319	37	Private	162424	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	45	United-States	<=50K
4	35823	56	Private	200316	Some-college	10	Married-civ-spouse	Adm-clerical	Husband	White	Male	0	0

 

Query – we want the records whose AGE is greater than 60.

dataset_age = dataset4.filter('AGE > 60') ## records AGE greater than 60

dataset_age.collect()
Output:

dataset_age.count()
Output:
3606

dataset_age.select_statement
Output:
'SELECT * FROM (SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") 
AS "DT_6") AS "DT_20" WHERE AGE > 60'

Output: 3606 records displayed.

Add multiple conditions on multiple columns –

dataset4_res = dataset4.filter('AGE > 70 AND EDUCATIONNUM = 8')
dataset4_res.select_statement

Output:
'SELECT * FROM (SELECT DISTINCT * FROM 
(SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") 
AS "DT_20" 
WHERE AGE > 70 AND EDUCATIONNUM = 8'

Multiple conditions have been added here; note the WHERE clause of this dynamic select statement.

dataset4_res.count()

Output:
7

Only 7 records match AGE > 70 & EDUCATIONNUM = 8.

## Suppose we want those records where EDUCATIONNUM = 9 

dataset4.filter('EDUCATIONNUM = 9').head(5).collect()


## Suppose we want those records where AGE = 27

dataset4.filter('AGE = 27').head(5).collect()

 

Multiple filter conditions on a pandas dataframe –

dataset3[(dataset3['AGE'] > 27) & ( dataset3['OCCUPATION'] == 'Craft-repair')]

If you have a pandas dataframe at the client and you want to filter records purely on the client side, the above filter syntax suffices.
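A self-contained version of the same boolean-mask filter, with made-up rows so it runs without a HANA connection:

```python
import pandas as pd

# Stand-in for the collected dataset3
dataset3 = pd.DataFrame({
    'AGE': [25, 35, 45],
    'OCCUPATION': ['Craft-repair', 'Craft-repair', 'Sales'],
})

# Combine two conditions with & ; each side needs its own parentheses
mask = (dataset3['AGE'] > 27) & (dataset3['OCCUPATION'] == 'Craft-repair')
result = dataset3[mask]
print(len(result))  # 1 matching row
```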

Output :

 

Slicing of Dataframe – 

Slice a few columns from the dataset:

## Copy the original dataset into a new one 
new_dataset1 = dataset1.head(5).collect()

new_dataset1[['ID' , 'AGE']]

Output:
	ID	AGE
0	28	27
1	75	57
2	196	48
3	264	52
4	303	35

Copy the sliced data into a third one:
new_dataset2 = new_dataset1[['ID' , 'AGE']]

new_dataset2
Output:
	ID	AGE
0	28	27
1	75	57
2	196	48
3	264	52
4	303	35

Create dataframe for few columns only – 

dataset_id_age = connection_context.sql('SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL')

dataset_id_age.collect()

Output:
	ID	AGE
0	28	27
1	75	57
2	196	48
3	264	52
4	303	35
5	324	24
6	390	34
7	442	49
8	447	35
9	472	42

dataset_id_age.select_statement
'SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL'

In short, a dataframe is a select statement that just waits to be triggered until it is called.
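That "select statement waiting to be triggered" idea can be sketched in a few lines. This is a toy illustration of the composition pattern, not the real hana_ml implementation:

```python
class LazyFrame:
    """Toy sketch of a lazy dataframe: each method only wraps the
    previous SELECT statement; no data moves until you fetch."""

    def __init__(self, select_statement):
        self.select_statement = select_statement

    def filter(self, condition):
        # Compose a new query around the old one instead of executing it
        return LazyFrame(
            f'SELECT * FROM ({self.select_statement}) AS dt WHERE {condition}')

    def head(self, n):
        return LazyFrame(
            f'SELECT TOP {n} * FROM ({self.select_statement}) AS dt')


df = LazyFrame('SELECT * FROM "ADULT_DATA_FULL_TBL"')
q = df.filter('AGE > 60').head(5).select_statement
print(q)
```

Chaining filter() and head() only nests SELECTs, exactly like the dynamic select statements shown throughout this post; a real collect() would be the point where the composed SQL finally runs on the server.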

 

describe() – describe the dataframe with all the statistical information.

dataset4.describe().collect() 

empty() – Returns True if this DataFrame has 0 rows

dataset2.empty()

Output:
False

has() – Returns true if a column is in the DataFrame.

dataset1.columns

Output:
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT', 'EDUCATION','EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'GENDER',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']


dataset1.has('ID')

Output:
True

 

join() – join two datasets based on a condition, e.g. ID = ID.

condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)

## Display Condition

condition 

Output:
'"DT_7"."ID"="DT_3"."ID"'

## Let's join dataset4 and dataset2 and copy the result into a new dataset7
dataset7 = dataset4.join(dataset2, condition)

dataset7.head(5).collect()
Output:

dataset7 
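The ON-clause condition above is plain string formatting over the dataframes' quoted aliases. A sketch with hypothetical alias values standing in for quoted_name:

```python
# Hypothetical aliases standing in for dataset4.quoted_name / dataset2.quoted_name
left_alias, right_alias = '"DT_7"', '"DT_3"'

# Stitch the join condition together exactly as in the snippet above
condition = '{}."ID"={}."ID"'.format(left_alias, right_alias)
print(condition)  # "DT_7"."ID"="DT_3"."ID"
```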

save() – save the dataframe –

dataset4.save('#SAVEDATA')

## retrieve the saved dataframe into a new dataframe 
new_dataset4 = connection_context.table('#SAVEDATA')

##Fetch first 10 records and display 
new_dataset4.head(10).collect()

Output:

10 records displayed


new_dataset4.select_statement   ## the data now comes from the temp database table (or possibly a pointer to the original table)

'SELECT * FROM "#SAVEDATA"'

 

AGGREGATE FUNCTIONS – one of the most important topics among the dataframe methods. Calculating the average age, summing up prices, counting items sold: requirements like these are why such functions were invented. Let's code and see how they work.

## Suppose we need to find out the average age of employees working in a specific department

agg_list = [('AVG' , 'AGE' , 'AVG_AGE')] ## aggregation function, column, and new column name

dataset_avg_age = dataset4.agg(agg_list=agg_list , group_by='WORKCLASS').collect()

## Display the results 

dataset_avg_age

The average age of employees, grouped by WORKCLASS, is displayed here.

## Count the number of people of same age

dataset_age_count = dataset4.agg(agg_list=[('COUNT' , 'ID' , 'ID_COUNT')] , group_by='AGE').collect()

dataset_age_count 

## Render the first 5 rows as an HTML table in the notebook
from IPython.display import HTML
HTML(dataset_age_count.head(5).to_html())

## Maximum AGE of employees in each group

dataset4.agg([('max', 'AGE', 'MAX_AGE')], group_by='WORKCLASS').collect()

Output:

## MIN Age of employees in each workclass

dataset4.agg([('min', 'AGE', 'MIN_AGE')], group_by='WORKCLASS').collect()

Output:
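The aggregation calls above map naturally to a pandas groupby on collected data. A client-side sketch of the AVG case with made-up rows:

```python
import pandas as pd

# Hypothetical collected data: workclass and age per employee
df = pd.DataFrame({'WORKCLASS': ['Private', 'Private', 'State-gov'],
                   'AGE': [30, 50, 40]})

# Pandas counterpart of agg([('AVG','AGE','AVG_AGE')], group_by='WORKCLASS')
avg_age = (df.groupby('WORKCLASS', as_index=False)['AGE']
             .mean()
             .rename(columns={'AGE': 'AVG_AGE'}))
print(avg_age)
```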

 

corr() – find the correlation between two columns of a dataset.

dataset4.corr('ID' , 'AGE').collect()
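corr() computes the correlation server-side; on collected data the pandas equivalent is Series.corr. A sketch with hypothetical, perfectly linear values:

```python
import pandas as pd

# Perfectly linear columns, so Pearson correlation is exactly 1.0
df = pd.DataFrame({'ID': [1, 2, 3, 4], 'AGE': [10, 20, 30, 40]})

r = df['ID'].corr(df['AGE'])
print(round(r, 4))  # 1.0
```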

 

Create a dataframe from pandas – an SAP HANA DataFrame that contains the data of the pandas dataframe pandas_df.

import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}  

df = pd.DataFrame(data=d)

df_new = hana_ml.dataframe.create_dataframe_from_pandas(connection_context, df, 'pd_df', force=False, replace=True)

df   ##Pandas dataframe 

Output:
	col1	col2
0	1	3
1	2	4

df_new.collect()  ##HANA Ml dataframe 

Output:
	col1	col2
0	1	3
1	2	4

 

Dataframe API – https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.04/en-US/hana_ml.dataframe.html

I have covered almost all the dataframe methods, but a few are still left to explore; I will keep sharing the other methods in my next blog.

 

Thank you for reading 🙂
