
Getting Started With “SAP HANA Python Client API For Machine Learning Algorithms” SAP HANA 2.0, Express Edition Support Package 03, Revision 035

Python API For Machine Learning

SAP HANA, express edition supports a set of client-side Python functions which can be used for developing machine learning models, thereby making it easy for Python users to use SAP HANA, express edition for machine learning purposes.

This is in addition to the Predictive Analysis Library (PAL) in SAP HANA, express edition which enables users to build machine learning models using a SQL interface.

In this blog, I would like to introduce you to the Python Client API for Machine Learning (ML). It consists of the ML API, a set of machine learning APIs for different algorithms, and the SAP HANA dataframe, a set of functions for accessing and manipulating SAP HANA data. I will provide an overview of the key capabilities and a sample application, along with pointers on how you can get started.

Note that this capability has been available since SAP HANA 2.0 Express Edition Support Package 03, Rev 33; the section titled “What’s New In This Release” summarizes the new features. Users who are already familiar with this capability can go directly to that section to review the changes.

 

What’s New In This Release

There are 3 areas which have updates and additions:

  • Dataframe Functions:
    • New multi-column renaming function added
    • Enhanced save functionality allows saving a dataframe as a view (in addition to a table). This allows for processing of a dataframe, saving it as a view and using it in any of the machine learning functions.
    • New function for counting rows in a dataframe has been added
    • Enhanced join function (between data frames) which allows a subset of the columns to be returned
    • Added a new dataframe function for processing the union of two dataframes
  • Machine Learning Functions
    • Several new algorithms have been added: Chi-square tests, Naive Bayes, DBSCAN, Analysis of Variance, Generalized Linear Model
    • Support for the thread ratio parameter, which specifies the percentage of available threads the algorithm may use, has been enhanced. If the value is out of range, it now defaults to a pre-defined value.
  • Package Structure Changes
    • To use the dataframe functions in your Python application, the package structure has not changed, and hence the import statement is still “from hana_ml import dataframe”.
    • The package structure for the ML algorithms has changed in this release, so the import statements differ. For example, to use clustering you now need to specify “from hana_ml.algorithms.pal import clustering” in your application.
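
The thread ratio behaviour described in the list above can be pictured with a small sketch. It is only an illustration: the function name, the assumed valid range of 0.0 to 1.0, and the fallback value are hypothetical, since this post does not state the actual range or the pre-defined default.

```python
def effective_thread_ratio(requested, default=0.0):
    """Return `requested` if it is inside the assumed range, else `default`.

    Assumptions (not taken from the SAP documentation): the valid range is
    0.0 to 1.0 inclusive, and `default` stands in for the pre-defined value.
    """
    if 0.0 <= requested <= 1.0:
        return float(requested)
    return default


print(effective_thread_ratio(0.5))   # in range, kept as requested
print(effective_thread_ratio(1.7))   # out of range, falls back to default
```

The point is simply that an out-of-range value no longer needs special handling by the caller; the fallback is applied for you.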

 

Architecture

The Python Client API for ML makes use of the HANA Python driver, as shown below in the diagram. Users first need to install the Python driver (hdbcli) and then the Python Client API for machine learning algorithms. Instructions are available in the “How to Get Started” section below.

 

 

SAP HANA Dataframe

The SAP HANA dataframe provides a way to view and manipulate the data stored in SAP HANA without physically moving any data between HANA and the Python client application. SAP HANA dataframe hides the underlying SQL statement, providing users with a Python interface to SAP HANA data. Once the HANA dataframe is created, it can be used in the Python Client ML APIs (which contain various machine learning algorithms) as input for training and scoring purposes.
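
To make the idea of a dataframe that "hides the underlying SQL statement" more concrete, here is a minimal sketch. It is not the hana_ml implementation: the LazyFrame class and its methods are hypothetical, and an in-memory SQLite database stands in for SAP HANA so the example is self-contained.

```python
import sqlite3


class LazyFrame:
    """Hypothetical stand-in for a SQL-backed dataframe (not the hana_ml class)."""

    def __init__(self, connection, select_statement):
        self.connection = connection
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery; nothing is executed yet.
        sql = f"SELECT * FROM ({self.select_statement}) AS T WHERE {condition}"
        return LazyFrame(self.connection, sql)

    def head(self, n):
        # Limit the number of rows, again only by rewriting the SQL text.
        sql = f"SELECT * FROM ({self.select_statement}) AS T LIMIT {int(n)}"
        return LazyFrame(self.connection, sql)

    def collect(self):
        # Only here is the statement run and data returned to the client.
        return self.connection.execute(self.select_statement).fetchall()


# SQLite is used purely as a local stand-in for the database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WEATHER (OUTLOOK TEXT, TEMP REAL)")
conn.executemany(
    "INSERT INTO WEATHER VALUES (?, ?)",
    [("Sunny", 75.0), ("Rain", 60.0), ("Sunny", 85.0)],
)

sunny = LazyFrame(conn, "SELECT * FROM WEATHER").filter("OUTLOOK = 'Sunny'").head(1)
print(sunny.select_statement)  # the composed SQL; no rows fetched so far
rows = sunny.collect()         # the query runs in the database at this point
print(rows)
```

Each operation returns a new object holding a longer SELECT statement, so the data stays in the database until collect() is called.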

The SAP HANA dataframe functions cover a variety of operations for creating and manipulating dataframes. Some sample functions include: adding an ID column, casting columns to a new type, dropping columns, filling null values, joining dataframes, sorting dataframes, renaming columns, computing statistics on the data, showing distinct values, creating dataframes with the top n values, copying the HANA dataframe to a Pandas dataframe, creating a table or view from a HANA dataframe, etc. For details, refer to the documentation.

To create a SAP HANA dataframe, first create the “ConnectionContext” object and then use the methods provided in the library for creating a SAP HANA dataframe.

 

ML API

The ML API is a set of APIs to the SAP HANA Machine Learning algorithms.

These algorithms cover classification, clustering, decomposition, metric functions, neural networks, preprocessing, linear models, naïve Bayes, statistics functions, neighborhood-based algorithms, decision trees and support vector machines. For details on which specific algorithms are available in this release, please refer to the documentation.

These client-side Python functions require a HANA dataframe and parameters as inputs in order to train the model. The model training executes in HANA, and hence no data is brought back to the client side for training the model. The Python machine learning APIs invoke SQL functions within SAP HANA for performing the training and scoring, using data within HANA. Hence, data movement from the server to the client, or vice-versa is avoided, resulting in high performance. All these operations occur as part of the execution process.

This approach is open, allowing users to not only use the ML API functions provided, but also use Python libraries of their choice, in conjunction with what is available in SAP HANA, express edition. Users can convert a HANA dataframe to a Pandas dataframe, and after that use their Python library of choice for their applications.

An end-to-end example using the HANA dataframe and ML API is provided in the section “End to End Example: Using The Python Client API for ML”.

 

How to Get Started

  • The first step is to install the SAP HANA Python Client API for Machine Learning Algorithms. This API requires the SAP HANA Python driver as a prerequisite. For detailed steps on how to install the Python Client API for ML (and the Python driver), refer to the tutorial.
  • If you would like to do your development in a Jupyter notebook (or JupyterLab), and use the ML API in the notebook environment, you can do so. Instructions on how to use JupyterLab with SAP HANA can be found here.
  • Now, you are ready to start using the SAP HANA Python Client API for Machine Learning Algorithms. The example below shows a sample end-to-end scenario.

 

End to End Example: Using The Python Client API for ML  

Shown below is an example where the RandomForestClassifier model is being trained, using data in a HANA table.

In the use case below, the user wants to predict whether a game will be played on a given day, using the weather conditions of that day. For training purposes, the user has access to historical data where the weather conditions and outcome are known. The historical data contains outlook (sunny, overcast, rain, etc.), temperature, humidity, windy (yes, no) and the actual outcome on that day, namely whether the game was played or not. The outcome is captured in the column named LABEL (Play, Do not Play).

Using outlook, temperature, humidity and windy as features, and label as the outcome, the user will train a RandomForestClassifier model. Once trained, this model will be used to predict whether a game will be played on a given day, based on the outlook, temperature, humidity and windy conditions for that specific day.

Also, in the example below, the user is connecting to a HANA database, and using a table named DATA_TBL_RFT (for training), and DATA_TBL_RFTPredict (for prediction).

Step 1: Import the Python Client API Library and Dataframe Library (dataframe, trees)
from hana_ml import dataframe
from hana_ml.algorithms.pal import trees

Step 2: Instantiate the Connection Object (conn)
conn = dataframe.ConnectionContext('<address>', <port>, '<user>',  '<password>')

Step 3: Create the HANA Dataframe (df1) which references the "DATA_TBL_RFT" Table.
df1 = conn.table("DATA_TBL_RFT")

Step 4: Inspect the Data
df1.head(4).collect()

OUTLOOK  TEMP    HUMIDITY   WINDY  LABEL
Sunny    75.0    70.0       Yes    Play
Sunny    60.0    90.0       Yes    Do not Play
Sunny    85.0    75.0       No     Do not Play
Sunny    72.0    95.0       No     Do not Play

Step 5: Create the RandomForestClassifier instance and specify the parameters
rfc = trees.RandomForestClassifier(conn_context=conn, n_estimators=3,
                                   max_features=3, random_state=2,
                                   split_threshold=0.00001, calculate_oob=True,
                                   min_samples_leaf=1, thread_ratio=1.0)

Step 6: Store the necessary features in a List and Invoke the fit method
rfc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'], label='LABEL')

Step 7: Create the HANA dataframe (df2) which references the "DATA_TBL_RFTPredict" Table.
df2 = conn.table("DATA_TBL_RFTPredict")

Step 8: Preview the dataframe before predicting
df2.collect()

ID  OUTLOOK   TEMP   HUMIDITY  WINDY
0   Overcast  75.0   100.0     Yes
1   Rain      78.0   70.0      Yes

Step 9: Invoke the Prediction Method ("predict()") and inspect the result
result = rfc.predict(df2, key='ID', verbose=False)

result.collect()

ID  SCORE  CONFIDENCE
0   Play   0.666667
1   Play   0.666667
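
One plausible way to read the CONFIDENCE value of 0.666667 is as the share of trees that voted for the predicted class: with n_estimators=3, two of the three trees voting "Play" would give 2/3. This interpretation is an assumption and is not confirmed by this post. The sketch below only illustrates majority voting in plain Python; the function name and the list of votes are made up for the illustration.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the most common label and the fraction of votes it received.

    Illustrative only: assumes confidence is the vote share of the winning
    class, which is not stated in the SAP documentation quoted here.
    """
    counts = Counter(tree_votes)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_votes)


# Hypothetical votes from a forest of three trees.
label, confidence = majority_vote(["Play", "Play", "Do not Play"])
print(label, round(confidence, 6))  # prints: Play 0.666667
```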

 

Summary and Next Steps

SAP HANA Python Client API For Machine Learning Algorithms provides a set of Python APIs for creating and manipulating HANA dataframes, training and scoring machine learning models and a set of functions for data preprocessing.

While ensuring there is no data transfer between the client and server, these functions ensure that the model training/prediction executes in SAP HANA, thereby providing high performance and execution close to the data.  The HANA dataframe allows conversion to a Pandas dataframe in situations where users may want to download the data (or subset of it), and use it with other Python libraries, thereby ensuring that they can use their favorite Python libraries along with what’s provided in SAP HANA, express edition for machine learning.

So give this new capability a try by downloading SAP HANA, express edition and getting started!

 

Links to references

  1. SAP HANA, express edition documentation: https://help.sap.com/viewer/product/SAP_HANA,_EXPRESS_EDITION/2.0.03/en-US
  2. SAP HANA, express edition Python Client API for machine learning algorithms documentation: https://help.sap.com/http.svc/rc/869ecfddc30a45868cf47b95760ff5c1/2.0.03/en-US/html/index.html
  3. Using JupyterLab with HANA: https://blogs.sap.com/2018/10/01/machine-learning-in-a-box-part-10-jupyterlab/
  4. Installing the Python Client API for machine learning algorithms tutorial: https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html