Diving into the HANA DataFrame: Python Integration...

former_member224878 · ‎12-17-2018

>>> from math import pi

>>> π = pi

>>> area = π * r**2



>>> résumé = "knows Python"

>>> "Python" in résumé

True

Python is a powerful tool in the Data Scientist's toolbox. A quick search on LinkedIn will support this hypothesis (if you had any doubts). At SAP we appreciate that Python is not just a nice-to-have, but an expectation for modern technology platforms and from SAP HANA 2.0 (Express Edition Support Package 03, Revision 033), a Python Client API will provide integration for Data Scientist's to accelerate and enhance their work.

The API makes use of the HANA Python Driver (hdbcli) as shown by this diagram:

For further details on the architecture of the API, see this blog. The main two components that a Data Scientist will use are:

The SAP HANA DataFrame

The Machine Learning API

For the first part of this blog series, I will cover the HANA DataFrame and the second part will cover the Machine Learning API.

The SAP HANA DataFrame

Picture this... you've recently set up a team of highly skilled Data Scientists.

5+ years experience in IT? Check.

Strong research background? Check.

PhD in Astrophysics, Mathematics, Neuroscience etc...? Check. *sigh*.

Great. You're surely on the path to success. You assign the team a task of building a fraud detection system using a database of 50 billion rows across 10 tables. A robust prototype algorithm must be presented in a months time.

What could go wrong? You have the skills, you have the data... but all too often Data Science is seen as more of an analysis task than a process.

When Data Science is appreciated as a process, performance and deployment are paramount and this is the difference between a "Data Science Project", and building industrial machine learning.

The biggest performance hurdle comes from the data, and when dealing with the plethora of channels and data sources available... this equates to a LOT of data. When using Python IDE's such as Jupyter, the data is persisted to the client and this means more processing time, which directly converts into a drop of productivity for your prized unicorns/Data Scientists.

"So how can we avoid this seemingly unavoidable issue? My Data Scientists don't want to use proprietary tools, they like Python!"

This is where the SAP HANA DataFrame can add real value to a Data Scientist's work.

What is it?

An SAP HANA DataFrame provides a way to view the data stored in SAP HANA without containing any of the physical data. The Machine Learning API's make use of an SAP HANA DataFrame as input for training and scoring purposes. An SAP HANA DataFrame hides the underlying SQL statement, providing users with a Python interface to SAP HANA data.

What's the value?

By reducing data transfer, this massively improves productivity and speed. This means faster development cycles, rapid prototyping and more flexibility.

No need to re-train or upskill Data Scientist's with new tools. Use Pythonic syntax that they are comfortable and familiar with.

Open-ness. By combining open source packages with the Python API, we are not restricting Data Scientists to use specific tools. We are opening our technology to the Data Science world.

How can I use it?

For detailed instructions on how to set up the Python Client API, see this tutorial. Now, I will give you some details on how to use the API.

To use an SAP HANA DataFrame you must create the "ConnectionContext" object, then use the methods provided in the library for creating an SAP HANA dataframe.

The DataFrame is only usable while the "ConnectionContext" is open, and is inaccessible once the connection is closed.

Let's open the "ConnectionContext" and invoke a table() method for creating a DataFrame:

# First instantiate the Connection Object (conn)

conn = dataframe.ConnectionContext("IP ADDRESS", "PORT", "USER", "PASSWORD")



# Then create the HANA dataframe (df) and point to the table.

df = conn.table("TABLE", schema="SCHEMA")

df.head(5).collect()

ITEM_IDENTIFIER	ITEM_WEIGHT	ITEM_FAT_CONTENT	ITEM_VISIBILITY	ITEM_TYPE	ITEM_MRP	OUTLET_IDENTIFIER	OUTLET_ESTABLISHMENT_YEAR	OUTLET_SIZE	OUTLET_LOCATION_TYPE	OUTLET_TYPE	ITEM_OUTLET_SALES
FDA15	9.3	Low Fat	0.016047	Dairy	249.8092	OUT049	1999	Medium	Tier 1	Supermarket Type1	3735.138
DRC01	5.92	Regular	0.019278	Soft Drinks	48.2692	OUT018	2009	Medium	Tier 3	Supermarket Type2	443.4228
FDN15	17.5	Low Fat	0.01676	Meat	141.618	OUT049	1999	Medium	Tier 1	Supermarket Type1	2097.27
FDX07	19.2	Regular	0	Fruits and Vegetables	182.095	OUT010	1998		Tier 3	Grocery Store	732.38
NCD19	8.93	Low Fat	0	Household	53.8614	OUT013	1987	High	Tier 3	Supermarket Type1	994.7052

To understand what is happening behind the scenes, let's take a look at the SQL that is generated and pushed down to HANA:

-- The first method, head() generates this SQL

SELECT TOP 5 * FROM (SELECT * FROM "SCHEMA"."TABLE") dt

# The second method collect() returns the results in a Pandas DataFrame

return pd.DataFrame(results, columns=self.columns)

The first select statement is generated by the head() method.

This query is then run by a cursor() and returned with a fetchall() method.

The returned results are handled by the collect() method and copied into a Pandas DataFrame, along with the names of the columns.

Now that we have our DataFrame, we have a collection of methods that we can invoke covering 3 capabilities:

DataFrame Manipulations: casting columns into a new type, dropping columns, filling null values, joining dataframes, sorting dataframes, renaming columns, etc.

Descriptive Functions: statistics relating to the data, showing distinct values, creating DataFrame with top n values, etc.

DataFrame Transformations: copying an SAP HANA DataFrame to a Pandas DataFrame and materialize a DataFrame to a table.

Here is an example of how to use a descriptive function on the DataFrame:

# Describe all columns in a DataFrame

df.describe().collect()

column	count	unique	nulls	mean	std	min	max	median	25_percent_cont	25_percent_disc	50_percent_cont	50_percent_disc	75_percent_cont	75_percent_disc
ITEM_WEIGHT	8523	416	0	10.65059	6.431899	0	21.35	11	6.65	6.65	11	11	16	16
ITEM_VISIBILITY	8523	7880	0	0.066132	0.051598	0	0.328391	0.053931	0.026989	0.026986	0.053931	0.053931	0.094585	0.094603
ITEM_MRP	8523	5938	0	140.9928	62.27507	31.29	266.8884	143.0128	93.8265	93.8094	143.0128	143.0128	185.6437	185.6608
OUTLET_ESTABLISHMENT_YEAR	8523	9	0	1997.832	8.37176	1985	2009	1999	1987	1987	1999	1999	2004	2004
ITEM_OUTLET_SALES	8523	3493	0	2181.289	1706.5	33.29	13086.96	1794.331	834.2474	833.5816	1794.331	1794.331	3101.296	3101.296

Once again, let's take a look at the SQL that is generated and pushed down to HANA (this example shows the SQL just for numeric columns):

-- First some simple statistics are calculated

SELECT 'ITEM_WEIGHT' AS "column", 

	COUNT("ITEM_WEIGHT") AS "count", 

	COUNT(DISTINCT "ITEM_WEIGHT") AS "unique", 

	SUM(CASE WHEN "ITEM_WEIGHT" IS NULL THEN 1 ELSE 0 END) AS "nulls", 

	AVG("ITEM_WEIGHT") AS "mean", 

	STDDEV("ITEM_WEIGHT") AS "std", 

	MIN("ITEM_WEIGHT") AS "min", 

	MAX("ITEM_WEIGHT") AS "max", 

	MEDIAN("ITEM_WEIGHT") AS "median" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0" 

UNION ALL SELECT 'ITEM_VISIBILITY' AS "column", 

	COUNT("ITEM_VISIBILITY") AS "count", 

	COUNT(DISTINCT "ITEM_VISIBILITY") AS "unique", 

	SUM(CASE WHEN "ITEM_VISIBILITY" IS NULL THEN 1 ELSE 0 END) AS "nulls", 

	AVG("ITEM_VISIBILITY") AS "mean", 

	STDDEV("ITEM_VISIBILITY") AS "std", 

	MIN("ITEM_VISIBILITY") AS "min", 

	MAX("ITEM_VISIBILITY") AS "max", 

	MEDIAN("ITEM_VISIBILITY") AS "median" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0" 

UNION ALL SELECT 'ITEM_MRP' AS "column", 

	COUNT("ITEM_MRP") AS "count", 

	COUNT(DISTINCT "ITEM_MRP") AS "unique", 

	SUM(CASE WHEN "ITEM_MRP" IS NULL THEN 1 ELSE 0 END) AS "nulls", 

	AVG("ITEM_MRP") AS "mean", 

	STDDEV("ITEM_MRP") AS "std", 

	MIN("ITEM_MRP") AS "min", 

	MAX("ITEM_MRP") AS "max", 

	MEDIAN("ITEM_MRP") AS "median" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0" 

UNION ALL SELECT 'OUTLET_ESTABLISHMENT_YEAR' AS "column", 

	COUNT("OUTLET_ESTABLISHMENT_YEAR") AS "count", 

	COUNT(DISTINCT "OUTLET_ESTABLISHMENT_YEAR") AS "unique", 

	SUM(CASE WHEN "OUTLET_ESTABLISHMENT_YEAR" IS NULL THEN 1 ELSE 0 END) AS "nulls", 

	AVG("OUTLET_ESTABLISHMENT_YEAR") AS "mean", 

	STDDEV("OUTLET_ESTABLISHMENT_YEAR") AS "std", 

	MIN("OUTLET_ESTABLISHMENT_YEAR") AS "min", 

	MAX("OUTLET_ESTABLISHMENT_YEAR") AS "max", 

	MEDIAN("OUTLET_ESTABLISHMENT_YEAR") AS "median" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0" 

UNION ALL SELECT 'ITEM_OUTLET_SALES' AS "column", 

	COUNT("ITEM_OUTLET_SALES") AS "count", 

	COUNT(DISTINCT "ITEM_OUTLET_SALES") AS "unique", 

	SUM(CASE WHEN "ITEM_OUTLET_SALES" IS NULL THEN 1 ELSE 0 END) AS "nulls", 

	AVG("ITEM_OUTLET_SALES") AS "mean", 

	STDDEV("ITEM_OUTLET_SALES") AS "std", 

	MIN("ITEM_OUTLET_SALES") AS "min", 

	MAX("ITEM_OUTLET_SALES") AS "max", 

	MEDIAN("ITEM_OUTLET_SALES") AS "median" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0";

	

-- Second, percentiles are calculated

SELECT 'ITEM_WEIGHT' as "column", 

	* FROM (SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY "ITEM_WEIGHT") AS "25_percent_cont", 

		percentile_disc(0.25) WITHIN GROUP (ORDER BY "ITEM_WEIGHT") AS "25_percent_disc", 

		percentile_cont(0.50) WITHIN GROUP (ORDER BY "ITEM_WEIGHT") AS "50_percent_cont", 

		percentile_disc(0.50) WITHIN GROUP (ORDER BY "ITEM_WEIGHT") AS "50_percent_disc", 

		percentile_cont(0.75) WITHIN GROUP (ORDER BY "ITEM_WEIGHT") AS "75_percent_cont", 

		percentile_disc(0.75) WITHIN GROUP (ORDER BY "ITEM_WEIGHT") AS "75_percent_disc" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0") 

UNION ALL SELECT 'ITEM_VISIBILITY' as "column", 

	* FROM (SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY "ITEM_VISIBILITY") AS "25_percent_cont", 

	percentile_disc(0.25) WITHIN GROUP (ORDER BY "ITEM_VISIBILITY") AS "25_percent_disc", 

	percentile_cont(0.50) WITHIN GROUP (ORDER BY "ITEM_VISIBILITY") AS "50_percent_cont", 

	percentile_disc(0.50) WITHIN GROUP (ORDER BY "ITEM_VISIBILITY") AS "50_percent_disc", 

	percentile_cont(0.75) WITHIN GROUP (ORDER BY "ITEM_VISIBILITY") AS "75_percent_cont", 

	percentile_disc(0.75) WITHIN GROUP (ORDER BY "ITEM_VISIBILITY") AS "75_percent_disc" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0") 

UNION ALL SELECT 'ITEM_MRP' as "column", 

	* FROM (SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY "ITEM_MRP") AS "25_percent_cont", 

	percentile_disc(0.25) WITHIN GROUP (ORDER BY "ITEM_MRP") AS "25_percent_disc", 

	percentile_cont(0.50) WITHIN GROUP (ORDER BY "ITEM_MRP") AS "50_percent_cont", 

	percentile_disc(0.50) WITHIN GROUP (ORDER BY "ITEM_MRP") AS "50_percent_disc", 

	percentile_cont(0.75) WITHIN GROUP (ORDER BY "ITEM_MRP") AS "75_percent_cont", 

	percentile_disc(0.75) WITHIN GROUP (ORDER BY "ITEM_MRP") AS "75_percent_disc" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0") 

UNION ALL SELECT 'OUTLET_ESTABLISHMENT_YEAR' as "column", 

	* FROM (SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY "OUTLET_ESTABLISHMENT_YEAR") AS "25_percent_cont", 

	percentile_disc(0.25) WITHIN GROUP (ORDER BY "OUTLET_ESTABLISHMENT_YEAR") AS "25_percent_disc", 

	percentile_cont(0.50) WITHIN GROUP (ORDER BY "OUTLET_ESTABLISHMENT_YEAR") AS "50_percent_cont", 

	percentile_disc(0.50) WITHIN GROUP (ORDER BY "OUTLET_ESTABLISHMENT_YEAR") AS "50_percent_disc", 

	percentile_cont(0.75) WITHIN GROUP (ORDER BY "OUTLET_ESTABLISHMENT_YEAR") AS "75_percent_cont", 

	percentile_disc(0.75) WITHIN GROUP (ORDER BY "OUTLET_ESTABLISHMENT_YEAR") AS "75_percent_disc" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0") 

UNION ALL SELECT 'ITEM_OUTLET_SALES' as "column", 

	* FROM (SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY "ITEM_OUTLET_SALES") AS "25_percent_cont", 

	percentile_disc(0.25) WITHIN GROUP (ORDER BY "ITEM_OUTLET_SALES") AS "25_percent_disc", 

	percentile_cont(0.50) WITHIN GROUP (ORDER BY "ITEM_OUTLET_SALES") AS "50_percent_cont", 

	percentile_disc(0.50) WITHIN GROUP (ORDER BY "ITEM_OUTLET_SALES") AS "50_percent_disc", 

	percentile_cont(0.75) WITHIN GROUP (ORDER BY "ITEM_OUTLET_SALES") AS "75_percent_cont", 

	percentile_disc(0.75) WITHIN GROUP (ORDER BY "ITEM_OUTLET_SALES") AS "75_percent_disc" FROM (SELECT * FROM "SCHEMA"."TABLE") AS "DT_0");

The first statement generates basic statistics for the columns.

The second statement generates 25, 50 and 75 percentile calculations for each column.

Finally, both statements are assigned an alias and called in a final select statement, bringing the results together into one table.

If you wish to view the generated SQL without using the collect method, you can issue the following statement:

# This will return the SQL statement for the describe method

df.describe().select_statement

For a full list of DataFrame methods, see the documentation site here.

Summary

The SAP HANA DataFrame is a great way to improve productivity of Data Scientists by drastically reducing data transfer. The DataFrame has a collection of methods that can further enhance a Data Scientists work and they can use this in combination with their favourite Python packages. When training machine learning models, by shifting the focus from analysis to process, the Python Client API can help to convert a "Data Science Project" into an industrial machine learning project.

Download HANA Express Edition today and get started!

What do you think of the SAP HANA DataFrame? Please feel free to share, comment, like and follow!

Arun Godwin Patel

Appendix

Python Client API for Machine Learning in SAP HANA 2.0, Express Edition SPS 03, Revision 33 - https://blogs.sap.com/2018/10/29/python-client-api-for-machine-learning-in-sap-hana-2.0-express-edit...

Install the SAP HANA Python Client API for Machine Learning Algorithms - https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html

Documentation site - https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/index.html

Download link for HANA Express - https://developers.sap.com/topics/sap-hana-express.html

How to set up a HANA Express Python Machine Learning API Demo VM - https://blogs.sap.com/2018/11/03/setting-up-a-hana-express-python-machine-learning-api-demo-vm/

Diving into the HANA DataFrame: Python Integration - Part 1

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win