Technical Articles
Multi-model in hana_ml 2.6 for Python (part 01): Dataframes
In the previous post, I showed how I have done the setup of the environment I am planning to use for this series to explain the new multi-model functionality of hana_ml
2.6 in the context of my demo used in this year’s SAP TechEd’s DAT108 session.
Dataframes — a bit of the background
The SAP HANA DataFrame is the central concept we need to understand in hana_ml
.
If you worked intensively with data wrangling and analysis using programming languages, then you might have worked with dataframes already in R, or in Python thanks to Pandas library, or in Julia thanks to DataFrames.jl package.
In simple terms, these dataframes are two-in-one: (1) a structure storing the data in a client, (2) a rich number of sometimes very sophisticated methods that can be applied to this structure to understand and to manipulate this data.
SAP HANA DataFrame…
Now, when it comes to the HANA DataFrame in hana_ml
:
- a dataframe object does not store physical data, but it does store the HANA SQL
SELECT
statement backing the dataframe, - most methods are designed to not bring data from the database to a client unless explicitly requested, e.g. via
collect()
method, - HANA DataFrame API is tightly coupled with Pandas: data from Pandas dataframe can be persisted as an SAP HANA database object, and — on the other hand — results of running operations on SAP HANA data are usually returned in a format of a Pandas dataframe.
It all means that all the heavy lifting, like data processing or ML algorithms, are executed in the SAP HANA database: right where the data is stored. Data is not moved between a DB server and a client where Python code is executed. It follows the SAP HANA principle of bringing a code to the data, not data to the code.
…and its relationship with Python Pandas
It means as well, that as a developer using hana_ml
you should have some basic understanding of the Pandas module.
I would recommend checking official Pandas getting started tutorials: https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html, but if anyone knows the better place for starters please share in the comments below. Personally, I found “10 Minutes to pandas” the most confusing 60(!) minutes, when I went through this 😀
But what I would certainly recommend to everyone coming from an SQL background is checking: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html.
Some preparation work…
…before we move on.
Upgrade hana_ml
if needed
The new update of the hana_ml
2.6 has been released since my previous post was published last week. I can see it using:
pip search hana
which I can run:
- from my laptops command line
docker exec hmlsandbox01 pip search hana
- from Jupyter notebook’s cell
!pip search hana
, or - simply using a terminal within Jupyter.
pip install --upgrade hana-ml
in any of the ways mentioned above, e.g. in a Jupyter’s terminal.
Install Python shapely
module
Shapely is a Python package for the manipulation and analysis of planar geometric objects. It is used by hana_ml
to support geospatial data manipulation, but must be separately installed manually to avoid errors like “name ‘wkb’ is not defined” or “ModuleNotFoundError: No module named ‘shapely’.” It is a known limitation and should be fixed in the next patch of hana_ml
.
To install shapely
please follow: https://shapely.readthedocs.io/en/stable/project.html#installing-shapely.
In my environment, I just executed installation using Jupyter’s terminal.
pip install shapely
Ok, it is time for coding!
JupyterLab is a client I use for this exercise, so I will create a new notebook 01 Dataframes.ipynb
.
Read the files with the data using Pandas
In most production scenarios hana_ml
will be used against some large volumes of data already stored in SAP HANA on-prem or in SAP HANA Cloud. But in our case of starting with the empty trial instance of SAP HANA Cloud, we need to load some data first. Actually, I showed already how to quickly load CSV files into SAP HANA in my post Quickly load data with hana_ml….
For my SAP TechEd demo, I wanted data that would be good to show multi-model processing — spatial and graph — so I used a dataset provided by Kelvin Lawrence at https://github.com/krlawrence/graph/raw/master/sample-data/ representing a network of connections between airports. At the time of writing this post, the data represents the status from January 2020 — before the situation changed dramatically for airlines and airports. But let us focus on the data and processing here.
import pandas
pandas.__version__
I will use dfp_
notation for Pandas dataframes.
dfp_nodes=pandas.read_csv('https://github.com/krlawrence/graph/raw/master/sample-data/air-routes-latest-nodes.csv')
dfp_edges=pandas.read_csv('https://github.com/krlawrence/graph/raw/master/sample-data/air-routes-latest-edges.csv')
Let’s check the shape (number of columns and rows) of these Pandas dataframes.
print('Size of nodes dataframe: {}'.format(dfp_nodes.shape))
print('Size of edges dataframe: {}'.format(dfp_edges.shape))
Nodes dataframe has 3742 rows with 16 columns, while edges dataframe has 57574 records with 5 columns.
Data analysis and wrangling with Pandas
Let’s reshape this data before persisting in SAP HANA Cloud and for that, we first need to understand our dataset.
What are the columns and their types in the dfp_nodes
dataframes?
dfp_nodes.dtypes
Ok, there is a column ~label
, so what are the node labels?
dfp_nodes.groupby('~label').size()
As you can see there are some columns (like type:string
) as well as some rows (like those labeled continet
or version
) that we do not need. Additionally, all columns have either some special characters (like ~
) or data types (like :object
) as part of their names that we do not need. Plus some of the columns have some data types too generic for their real content. And ideally, we need column names in all capitals for SAP HANA.
Let’s fix all these by creating a new Pandas dataframe dfp_ports
and check it!
dfp_ports=(
dfp_nodes[dfp_nodes['~label'].isin(['airport'])]
.drop(['~label','type:string','author:string','date:string'], axis=1)
.convert_dtypes()
)
dfp_ports.columns=(dfp_ports.columns
.str.replace('~','')
.str.replace(':.*','')
.str.upper()
)
And let’s do the same for the edges dataset by saving required columns, rows, and data types into a new dataframe dfp_edges
.
dfp_edges.dtypes
dfp_edges.groupby('~label').size()
dfp_routes=dfp_edges[dfp_edges['~label'].isin(['route'])].drop(['~label'], axis=1).copy()
dfp_routes.columns=dfp_routes.columns.str.replace('~','').str.replace(':.*','').str.upper()
Switch to using HANAML
database user
With the new database user HANAML
created in the previous post let’s switch to using it for further exercises.
import hana_ml
hana_ml.__version__
hana_cloud_endpoint="<uuid>.hana.trial-<region>.hanacloud.ondemand.com:443"
hana_cloud_host, hana_cloud_port=hana_cloud_endpoint.split(":")
cchc=hana_ml.dataframe.ConnectionContext(port=hana_cloud_port,
address=hana_cloud_host,
user='HANAML',
password='Super$ecr3t!', #Should be your user's password ;)
encrypt=True
)
print(cchc.sql("SELECT SCHEMA_NAME, TABLE_NAME FROM TABLES WHERE SCHEMA_NAME='{schema_name}'"
.format(schema_name=cchc.get_current_schema()))
.collect()
)
The last statement shows that the schema of HANAML
does not have any tables yet. So, let’s save the data from Pandas dataframes to SAP HANA tables using hana_ml
.
dfh_ports=hana_ml.dataframe.create_dataframe_from_pandas(cchc,
dfp_ports, "PORTS",
force=True
)
dfh_routes=hana_ml.dataframe.create_dataframe_from_pandas(cchc,
dfp_routes, 'ROUTES',
force=True)
I will use dfh_
notation for HANA DataFrame variables.
print(cchc.sql("SELECT SCHEMA_NAME, TABLE_NAME FROM TABLES WHERE SCHEMA_NAME='{schema_name}'"
.format(schema_name=cchc.get_current_schema()))
.collect()
)
We can see that two tables have been created in db user’s schema in the SAP HANA.
And we can check data in these tables using collect()
method of the HANA dataframe.
print(dfh_ports.collect())
I did a few things (intentionally) that should hurt the eyes of programmers using Python and Pandas for a while. If you are one of those, please leave your improvement suggestions in the comments below!
For everyone else, please stay tuned as we will look at dataframes a bit closer in the next post.
Stay healthy ❤️
-Vitaliy (aka @Sygyzmundovych)
This is so interesting .... wonder how many clients actually use Python in SAP.
I do not work directly with customers, but as far as I know from my colleagues working in consulting or CoE -- pretty much every customer's project that involves Machine Learning or Data Intelligence has Python in it. Plus pretty much all of our SAP colleagues working in SAP AI organization (like those behind SAP AI Business Services (https://developers.sap.com/topics/artificial-intelligence.html) are programming using Python.
Anyone else would like to comment on this?
We are using the HANA ML python library on some of our customers that use HANA and require predictive models.
In general, for SAP projects, we are using python is HANA ML, Data Intelligence and SAP Data Services.
Recent public reference from customer , deploying a predictive maintenance Pilot with SAP HANA Cloud and its Python ML library can be found here:
https://www.sap.com/documents/2021/07/56bacfeb-ec7d-0010-bca6-c68f7e60039b.html
Nice Post! Thanks for sharing such an amazing article, really informative, it helps me a lot.
Thanks for the great blog! I don't think many customers know all the great features in PAL/APL and the Python/R wrappers are a big step forward. It's the best of both worlds: the ability to apply powerful algos without having to move the data out and to utilize familiar Python/R skills on SAP data.