Python Client API for machine learning in SAP HANA 2.0, Express Edition SPS 03, Revision 33
In this blog, I provide an overview of the Python Client API for machine learning, which was recently released in SAP HANA 2.0, Express Edition Support Package 03, Revision 033.
What’s New for Machine Learning in SAP HANA, express edition
SAP HANA, express edition now supports a set of client-side Python functions which can be used for developing machine learning models, thereby making it easy for Python users to use SAP HANA, express edition for machine learning purposes.
It consists of two parts: the ML API, a set of machine learning APIs for different algorithms, and the SAP HANA dataframe, a set of functions for accessing and manipulating SAP HANA data. I will provide an overview of the key capabilities and a sample application, along with pointers on how you can get started.
This capability is complementary to the Predictive Analysis Library (PAL) in SAP HANA, express edition which enables users to build machine learning models using a SQL interface.
The Python Client API for ML makes use of the HANA Python driver, as shown in the diagram below. Users first need to install the Python driver (hdbcli) and then the Python Client API for machine learning algorithms. Instructions are available in the “How to Get Started” section below.
SAP HANA Dataframe
The SAP HANA dataframe provides a way to view and manipulate the data stored in SAP HANA without physically moving any data between HANA and the Python client application. SAP HANA dataframe hides the underlying SQL statement, providing users with a Python interface to SAP HANA data. Once the HANA dataframe is created, it can be used in the Python Client ML APIs (which contain various machine learning algorithms) as input for training and scoring purposes.
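To make the idea concrete, here is a toy sketch of how a dataframe object can wrap a SQL statement and compose new statements without fetching any rows. The LazyFrame class, its methods, and the table() helper are hypothetical and written only for illustration; this is not how the hana_ml library is implemented, and the code never connects to a database.

```python
class LazyFrame:
    """Toy stand-in for a SQL-backed dataframe (hypothetical, not hana_ml)."""

    def __init__(self, select_statement):
        # Only the SQL text is stored; no rows are held on the client.
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery and add a WHERE clause.
        return LazyFrame(
            f"SELECT * FROM ({self.select_statement}) AS T WHERE {condition}"
        )

    def head(self, n):
        # Wrap again and limit the number of rows the statement would return.
        return LazyFrame(
            f"SELECT * FROM ({self.select_statement}) AS T LIMIT {n}"
        )


def table(name):
    # Entry point: a frame that refers to a whole table.
    return LazyFrame(f'SELECT * FROM "{name}"')


# Each call returns a new frame holding a longer statement; nothing is executed.
frame = table("DATA_TBL_RFT").filter("TEMP > 70").head(4)
print(frame.select_statement)
```

Each operation only rewrites the SQL text, which is why a chain of dataframe calls can stay on the server side until the rows are explicitly requested.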
The SAP HANA dataframe functions cover a variety of operations for creating and manipulating dataframes. Sample functions include adding an ID column, casting columns to a new type, dropping columns, filling null values, joining dataframes, sorting dataframes, renaming columns, computing statistics on the data, showing distinct values, creating dataframes with the top n values, copying the HANA dataframe to a Pandas dataframe, and creating a table from a HANA dataframe. For details, refer to the documentation listed under “Links to references”.
To create an SAP HANA dataframe, first create the “ConnectionContext” object and then use the methods provided in the library for creating an SAP HANA dataframe.
The ML API is a set of Python APIs for the SAP HANA machine learning algorithms.
These algorithms cover classification, clustering, decomposition, metric functions, neural networks, pre-processing, regression, statistics functions, neighborhood-based algorithms, decision trees and support vector machines. For details on which specific algorithms are available in this release, please refer to the documentation listed under “Links to references”.
These client-side Python functions take a HANA dataframe and parameters as inputs in order to train the model. The model training executes in HANA, and no data is brought back to the client side for training the model. The Python machine learning APIs invoke SQL functions within SAP HANA to perform the training and scoring, using data within HANA. Hence, data movement from the server to the client, or vice versa, is avoided, which keeps performance high.
This approach is open, allowing users to not only use the ML API functions provided, but also use Python libraries of their choice, in conjunction with what is available in SAP HANA, express edition. Users can convert a HANA dataframe to a Pandas dataframe, and after that use their Python library of choice for their applications.
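As a small sketch of that hand-off, the helper below pulls a bounded number of rows to the client. The function name fetch_sample is hypothetical and not part of the library; head() and collect() are the dataframe calls used in the end-to-end example in this post, and collect() is, as far as I understand the library, the call that returns the Pandas dataframe.

```python
def fetch_sample(hana_df, n_rows):
    """Bring at most n_rows rows of a HANA dataframe to the client.

    fetch_sample is a hypothetical helper, not part of the library.
    hana_df is expected to offer head() and collect(), as used in the
    end-to-end example in this post.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    # head() narrows the statement on the server; collect() transfers the rows.
    return hana_df.head(n_rows).collect()
```

With a live connection, fetch_sample(df1, 100) would give a local copy of at most 100 rows that any Python library can consume, while the full table stays in HANA.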
An end-to-end example using the HANA dataframe and ML API is provided in the section “End to End Example: Using The Python Client API for ML”.
How to Get Started
- The first step is to install the SAP HANA Python Client API for Machine Learning Algorithms. This API requires the SAP HANA Python driver as a prerequisite. For detailed steps on how to install the Python Client API for ML (and the Python driver), refer to the installation tutorial listed under “Links to references”.
- If you would like to do your development in a Jupyter notebook (or JupyterLab) and use the ML API in the notebook environment, you can do so. Instructions on how to use JupyterLab with SAP HANA can be found in the JupyterLab tutorial listed under “Links to references”.
- Now, you are ready to start using the SAP HANA Python Client API for Machine Learning Algorithms. The example below shows a sample end-to-end scenario.
Note: If you need to install SAP HANA, express edition (or update your existing installation), please follow the steps below:
- Installing SAP HANA, express edition: https://developers.sap.com/topics/sap-hana-express.html
- Upgrading existing SAP HANA, express edition: https://developers.sap.com/tutorials/hxe-ua-updating-vm.html
End to End Example: Using The Python Client API for ML
Shown below is an example where a RandomForestClassifier model is trained using data in a HANA table.
In the use case below, the user wants to predict whether a game will be played on a given day, using the weather conditions for that day. For training purposes, the user has access to historical data where the weather conditions and the outcome are known. The historical data contains the outlook (sunny, overcast, rain), temperature, humidity, windy (yes, no) and the actual outcome on that day, namely whether the game was played or not. The historical outcome is captured in the column named LABEL (Play, Do not Play).
Using the features outlook, temperature, humidity and windy, with the label as the outcome, the user will train a RandomForestClassifier model. Once trained, this model will be used to predict whether a game will be played on a given day, based on the outlook, temperature, humidity and wind conditions for that specific day.
Note: In the example below, the user connects to a HANA database and uses two tables, DATA_TBL_RFT (for training) and DATA_TBL_RFTPredict (for prediction).
Step 1: Import the Python Client API Library and Dataframe Library (dataframe, trees)

    from hana_ml import dataframe
    from hana_ml.algorithms import trees

Step 2: Instantiate the Connection Object (conn)

    conn = dataframe.ConnectionContext('<address>', <port>, '<user>', '<password>')

Step 3: Create the HANA Dataframe (df1) which references the "DATA_TBL_RFT" Table

    df1 = conn.table("DATA_TBL_RFT")

Step 4: Inspect the Data

    df1.head(4).collect()

    OUTLOOK  TEMP  HUMIDITY  WINDY  LABEL
    Sunny    75.0  70.0      Yes    Play
    Sunny    60.0  90.0      Yes    Do not Play
    Sunny    85.0  75.0      No     Do not Play
    Sunny    72.0  95.0      No     Do not Play

Step 5: Create the RandomForestClassifier instance and specify the parameters

    # The class lives in the imported trees module, so it is qualified with "trees."
    rfc = trees.RandomForestClassifier(conn_context=conn,
                                       n_estimators=3,
                                       max_features=3,
                                       random_state=2,
                                       split_threshold=0.00001,
                                       calculate_oob=True,
                                       min_samples_leaf=1,
                                       thread_ratio=1.0)

Step 6: Store the necessary features in a List and Invoke the fit method

    rfc.fit(df1,
            features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
            label='LABEL')

Step 7: Create the HANA dataframe (df2) which references the "DATA_TBL_RFTPredict" Table

    df2 = conn.table("DATA_TBL_RFTPredict")

Step 8: Preview the dataframe before predicting

    df2.collect()

    ID  OUTLOOK   TEMP  HUMIDITY  WINDY
    0   Overcast  75.0  100.0     Yes
    1   Rain      78.0  70.0      Yes

Step 9: Invoke the Prediction Method ("predict()") and inspect the result

    result = rfc.predict(df2, key='ID', verbose=False)
    result.collect()

    ID  SCORE  CONFIDENCE
    0   Play   0.666667
    1   Play   0.666667
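A note on the CONFIDENCE value of 0.666667 in the prediction result: one plausible reading, assuming confidence is the share of trees that voted for the predicted class, is that two of the three trees (n_estimators=3) voted "Play". The sketch below only illustrates that arithmetic. The majority_vote function and the list of votes are hypothetical; they are not taken from the library or from the actual model.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label and the share of votes it received."""
    counts = Counter(votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(votes)


# Hypothetical votes from three trees; not output from the real model.
label, confidence = majority_vote(["Play", "Play", "Do not Play"])
print(label, round(confidence, 6))  # prints: Play 0.666667
```

Under this assumption, a forest of three trees can only report confidences of 1/3, 2/3 or 1, so a larger n_estimators would give finer-grained values.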
Summary and Next Steps
The SAP HANA Python Client API for Machine Learning Algorithms provides a set of Python APIs for creating and manipulating HANA dataframes and for training and scoring machine learning models.
These functions run model training and prediction inside SAP HANA, so no data is transferred between the client and the server and execution occurs close to the data, which provides high performance. The HANA dataframe can be converted to a Pandas dataframe in situations where users want to download the data (or a subset of it) and use it with other Python libraries, alongside what SAP HANA, express edition provides for machine learning.
So, give this new capability a try by downloading SAP HANA, express edition and getting started!
Links to references
- SAP HANA, express edition documentation: https://help.sap.com/viewer/product/SAP_HANA,_EXPRESS_EDITION/2.0.03/en-US
- SAP HANA, express edition Python Client API for machine learning algorithms documentation: https://help.sap.com/http.svc/rc/3f0dbe754b194c42a6bf3405697b711f/2.0.03/en-US/html/index.html
- Tutorial on using JupyterLab with HANA: https://blogs.sap.com/2018/10/01/machine-learning-in-a-box-part-10-jupyterlab/
- Tutorial on installing the Python Client API for machine learning algorithms: https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html
I've created a post that explains how to set up a VM that demos the ML library described in this post.
Great blog entry! Even though I imported the trees package, I had to prepend "trees." to RandomForestClassifier:
rfc = trees.RandomForestClassifier(conn_context=conn, n_estimators=3,
I have a couple of questions on HANA PAL and thought you would be the right person to ask.
We are planning to use SAP HANA PAL capabilities on our Business Suite on SAP HANA 1.0 system. The version is SP 12, and the PAL functions are already activated.
Do we need a separate license to use PAL? Could you please share some details?
For licensing questions, please check with your sales rep.
Thanks for the post, I went through Hana dataframe APIs at https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/hana_ml.dataframe.html#
My first impression is that the HANA dataframe isn't well suited to data wrangling, as the APIs are limited.
Suppose I want to replace a column value based on a condition. With a pandas dataframe this is simple using numpy, for example: np.where(df['col']>100,'High','Low')
I assumed the same could be applied to a HANA dataframe, but it fails.