Getting Started With “SAP HANA Python Client API For Machine Learning Algorithms” SAP HANA 2.0, Express Edition Support Package 03, Revision 035
Python API For Machine Learning
SAP HANA, express edition supports a set of client-side Python functions which can be used for developing machine learning models, thereby making it easy for Python users to use SAP HANA, express edition for machine learning purposes.
This is in addition to the Predictive Analysis Library (PAL) in SAP HANA, express edition which enables users to build machine learning models using a SQL interface.
In this blog, I would like to introduce you to the Python Client API for Machine Learning (ML). It consists of two parts: the ML API, a set of machine learning APIs for different algorithms, and the SAP HANA dataframe, a set of functions for accessing and manipulating SAP HANA data. I will provide an overview of the key capabilities and a sample application, along with pointers on how you can get started.
Note that this capability has been available since SAP HANA 2.0 Express Edition Support Package 03, Rev 33; the section titled “What’s New In This Release” summarizes the new features. Users who are already familiar with the capability can go directly to that section to review the changes.
What’s New In This Release
There are 3 areas which have updates and additions:
- Dataframe Functions
  - A new multi-column renaming function has been added.
  - Enhanced save functionality allows saving a dataframe as a view (in addition to a table). This makes it possible to process a dataframe, save it as a view and use it in any of the machine learning functions.
  - A new function for counting the rows in a dataframe has been added.
  - The join function (between dataframes) has been enhanced so that a subset of the columns can be returned.
  - A new dataframe function for computing the union of two dataframes has been added.
- Machine Learning Functions
  - Several new algorithms have been added: Chi-square tests, Naive Bayes, DBSCAN, Analysis of Variance, Generalized Linear Model.
  - Support for thread ratio, which specifies the percentage of available threads the algorithm may use, has been enhanced. If the value is out of range, it now defaults to a pre-defined value.
- Package Structure Changes
  - The package structure for the dataframe functions has not changed, so the import statement is still “from hana_ml import dataframe”.
  - The package structure for the ML algorithms has changed. For example, to use clustering you now need to specify “from hana_ml.algorithms.pal import clustering” in your application.
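As a rough sketch of how some of the dataframe additions listed above might be combined, the helper below renames columns, unions two dataframes and counts the resulting rows. It assumes that both arguments are hana_ml dataframes (for example, created with conn.table(...)) and that rename_columns, union and count behave as described in the API documentation for your release. The function name and the column mapping are made up for illustration.

```python
def combine_and_count(df_a, df_b, new_names):
    """Rename columns on both dataframes, union them and count the rows.

    df_a, df_b: hana_ml dataframes with the same column layout (assumption).
    new_names:  dict mapping old column names to new ones (hypothetical).
    """
    # rename_columns returns a new dataframe; the originals are left untouched.
    renamed_a = df_a.rename_columns(new_names)
    renamed_b = df_b.rename_columns(new_names)

    # The union is still only a dataframe reference on the HANA side.
    combined = renamed_a.union(renamed_b)

    # count() is the call that returns a plain number to the client.
    return combined, combined.count()
```

A call could look like combine_and_count(df1, df2, {"TEMP": "TEMPERATURE"}), where df1 and df2 are dataframes over tables with identical columns.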
The Python Client API for ML makes use of the HANA Python driver, as shown below in the diagram. Users first need to install the Python driver (hdbcli) and then the Python Client API for machine learning algorithms. Instructions are available in the “How to Get Started” section below.
SAP HANA Dataframe
The SAP HANA dataframe provides a way to view and manipulate the data stored in SAP HANA without physically moving any data between HANA and the Python client application. SAP HANA dataframe hides the underlying SQL statement, providing users with a Python interface to SAP HANA data. Once the HANA dataframe is created, it can be used in the Python Client ML APIs (which contain various machine learning algorithms) as input for training and scoring purposes.
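To make the idea of "hiding the underlying SQL statement" more concrete, here is a toy sketch in plain Python. It is not how hana_ml is implemented; the LazyFrame class and the SQL it generates are assumptions for illustration only. The point is that each operation produces a new object wrapping a larger SELECT statement, and no rows are fetched along the way.

```python
class LazyFrame:
    """Toy stand-in for a server-side dataframe: it only stores a SELECT statement."""

    def __init__(self, select_statement):
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery instead of fetching any rows.
        return LazyFrame(
            'SELECT * FROM ({}) AS "T" WHERE {}'.format(self.select_statement, condition)
        )

    def head(self, n):
        # Limit the result on the database side, again without executing anything.
        return LazyFrame(
            'SELECT TOP {} * FROM ({}) AS "T"'.format(int(n), self.select_statement)
        )


warm_days = LazyFrame('SELECT * FROM "DATA_TBL_RFT"').filter('"TEMP" > 70').head(4)
print(warm_days.select_statement)
```

Only when a result is actually requested would such a statement be sent to the database, which is why chaining operations on the client stays cheap.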
The SAP HANA dataframe functions cover a variety of operations for creating and manipulating dataframes. Sample functions include adding an ID column, casting columns to a new type, dropping columns, filling null values, joining dataframes, sorting dataframes, renaming columns, computing statistics on the data, showing distinct values, creating dataframes with the top n values, copying the HANA dataframe to a Pandas dataframe, and creating a table or view from a HANA dataframe. For details, refer to the documentation.
To create a SAP HANA dataframe, first create the “ConnectionContext” object and then use the methods provided in the library for creating a SAP HANA dataframe.
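A minimal sketch of these two steps is shown below. The helper name open_table and its dataframe_module parameter are hypothetical; the parameter exists only so that the sketch can be exercised without a running HANA instance. The ConnectionContext and table calls are the same ones used in the end-to-end example later in this blog.

```python
def open_table(address, port, user, password, table_name, dataframe_module=None):
    """Connect to SAP HANA and return (connection, dataframe) for one table."""
    if dataframe_module is None:
        # Imported lazily so the sketch can be read (and tested) without a HANA client.
        from hana_ml import dataframe as dataframe_module

    # Step 1: create the ConnectionContext object.
    conn = dataframe_module.ConnectionContext(address, port, user, password)

    # Step 2: create a dataframe that references the table; no rows are fetched here.
    return conn, conn.table(table_name)
```

With real connection details this would be called as conn, df1 = open_table('<address>', <port>, '<user>', '<password>', 'DATA_TBL_RFT').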
The ML API is a set of APIs for the SAP HANA machine learning algorithms.
These algorithms cover classification, clustering, decomposition, metric functions, neural networks, preprocessing, linear models, Naive Bayes, statistics functions, neighborhood-based algorithms, decision trees and support vector machines. For details on which specific algorithms are available in this release, please refer to the documentation.
These client-side Python functions take a HANA dataframe and parameters as inputs in order to train the model. The model training executes in HANA, so no data is brought back to the client side for training the model. The Python machine learning APIs invoke SQL functions within SAP HANA to perform the training and scoring, using data within HANA. Data movement from the server to the client, or vice versa, is therefore avoided, resulting in high performance. All of this happens as part of executing the API call.
This approach is open, allowing users to not only use the ML API functions provided, but also use Python libraries of their choice, in conjunction with what is available in SAP HANA, express edition. Users can convert a HANA dataframe to a Pandas dataframe, and after that use their Python library of choice for their applications.
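As a hedged sketch of that conversion, the helper below copies a bounded number of rows to the client. It relies on the head() and collect() calls that also appear in the end-to-end example; the helper name fetch_sample and the max_rows default are made up for illustration.

```python
def fetch_sample(hana_df, max_rows=1000):
    """Copy at most max_rows rows of a HANA dataframe into a Pandas dataframe."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")

    # head() narrows the result on the server; collect() transfers it to the client.
    return hana_df.head(max_rows).collect()
```

The returned Pandas dataframe can then be handed to any other Python library. Limiting the row count first keeps the transfer small, since this is the one place where data does leave SAP HANA.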
An end-to-end example using the HANA dataframe and ML API is provided in the section “End to End Example: Using The Python Client API for ML”.
How to Get Started
- The first step is to install the SAP HANA Python Client API for Machine Learning Algorithms. This API requires the SAP HANA Python driver as a prerequisite. For detailed steps on how to install the Python Client API for ML (and the Python driver), refer to the installation tutorial listed in the references.
- If you would like to do your development in a Jupyter notebook (or JupyterLab) and use the ML API in the notebook environment, you can do so. Instructions on how to use JupyterLab with SAP HANA can be found in the “Using JupyterLab with HANA” blog post listed in the references.
- Now, you are ready to start using the SAP HANA Python Client API for Machine Learning Algorithms. The example below shows a sample end-to-end scenario.
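Before moving on, it can help to confirm that both packages are visible to your Python interpreter. The check below is a small sketch using only the standard library. The module names are the ones used in this blog (hdbcli for the driver, hana_ml for the client API); the helper name is_importable is made up.

```python
import importlib.util


def is_importable(module_name):
    """Return True if module_name can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package of a dotted name is missing.
        return False


for name in ("hdbcli.dbapi", "hana_ml.dataframe", "hana_ml.algorithms.pal.trees"):
    print(name, "OK" if is_importable(name) else "missing")
```

If any line reports "missing", revisit the installation steps for the driver or the client API before trying the example.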
End to End Example: Using The Python Client API for ML
Shown below is an example in which a RandomForestClassifier model is trained using data in a HANA table.
In this use case, the user wants to predict whether or not a game will be played on a given day, based on the weather conditions of that day. For training purposes, the user has access to historical data in which the weather conditions and the outcome are known. The historical data contains the outlook (sunny, overcast, rain, etc.), temperature, humidity, windy (yes, no) and the actual outcome on that day, namely, whether the game was played or not. The historical outcome is captured in the column named LABEL (Play, Do not Play).
Using the features outlook, temperature, humidity and windy, with the label as the outcome, the user will train a RandomForestClassifier model. Once trained, this model will be used to predict whether or not a game will be played on a given day, based on the outlook, temperature, humidity and windy conditions for that specific day.
In the example below, the user connects to a HANA database and uses the tables DATA_TBL_RFT (for training) and DATA_TBL_RFTPredict (for prediction).
Step 1: Import the Python Client API Library and Dataframe Library (dataframe, trees)

    from hana_ml import dataframe
    from hana_ml.algorithms.pal import trees

Step 2: Instantiate the Connection Object (conn)

    conn = dataframe.ConnectionContext('<address>', <port>, '<user>', '<password>')

Step 3: Create the HANA Dataframe (df1) which references the "DATA_TBL_RFT" Table

    df1 = conn.table("DATA_TBL_RFT")

Step 4: Inspect the Data

    df1.head(4).collect()

    OUTLOOK  TEMP  HUMIDITY  WINDY  LABEL
    Sunny    75.0  70.0      Yes    Play
    Sunny    60.0  90.0      Yes    Do not Play
    Sunny    85.0  75.0      No     Do not Play
    Sunny    72.0  95.0      No     Do not Play

Step 5: Create the RandomForestClassifier instance and specify the parameters

    # The class lives in the imported "trees" module.
    rfc = trees.RandomForestClassifier(conn_context=conn, n_estimators=3,
                                       max_features=3, random_state=2,
                                       split_threshold=0.00001, calculate_oob=True,
                                       min_samples_leaf=1, thread_ratio=1.0)

Step 6: Store the necessary features in a List and Invoke the fit method

    rfc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'], label='LABEL')

Step 7: Create the HANA dataframe (df2) which references the "DATA_TBL_RFTPredict" Table

    df2 = conn.table("DATA_TBL_RFTPredict")

Step 8: Preview the dataframe before predicting

    df2.collect()

    ID  OUTLOOK   TEMP  HUMIDITY  WINDY
    0   Overcast  75.0  100.0     Yes
    1   Rain      78.0  70.0      Yes

Step 9: Invoke the Prediction Method ("predict()") and inspect the result

    result = rfc.predict(df2, key='ID', verbose=False)
    result.collect()

    ID  SCORE  CONFIDENCE
    0   Play   0.666667
    1   Play   0.666667
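One plausible reading of the CONFIDENCE column in the prediction output above is that it reports the share of trees that voted for the predicted class. That interpretation is an assumption here and is not confirmed in this blog, so please check the documentation for the exact definition. With n_estimators=3, a value of 0.666667 would then correspond to two of the three trees voting "Play". The sketch below reproduces that arithmetic in plain Python; the individual tree votes are hypothetical.

```python
from collections import Counter


def vote_confidence(tree_votes):
    """Return (majority_label, share_of_trees_voting_for_it)."""
    if not tree_votes:
        raise ValueError("need at least one vote")
    # most_common(1) yields the label with the highest vote count.
    label, count = Counter(tree_votes).most_common(1)[0]
    return label, round(count / len(tree_votes), 6)


# Hypothetical votes from a forest of three trees.
print(vote_confidence(["Play", "Play", "Do not Play"]))  # ('Play', 0.666667)
```

Under this reading, a forest with more trees would produce finer-grained confidence values than the three-tree forest used in the example.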
Summary and Next Steps
SAP HANA Python Client API For Machine Learning Algorithms provides a set of Python APIs for creating and manipulating HANA dataframes, for training and scoring machine learning models, and for data preprocessing.
These functions ensure that model training and prediction execute in SAP HANA without transferring the training data between the client and the server, thereby providing high performance and execution close to the data. The HANA dataframe can also be converted to a Pandas dataframe when users want to download the data (or a subset of it) and use it with other Python libraries, so they can combine their favorite Python libraries with what’s provided in SAP HANA, express edition for machine learning.
So give this new capability a try by downloading SAP HANA, express edition and getting started!
Links to references
- SAP HANA, express edition documentation: https://help.sap.com/viewer/product/SAP_HANA,_EXPRESS_EDITION/2.0.03/en-US
- SAP HANA, express edition Python Client API for machine learning algorithms documentation: https://help.sap.com/http.svc/rc/869ecfddc30a45868cf47b95760ff5c1/2.0.03/en-US/html/index.html
- Using JupyterLab with HANA: https://blogs.sap.com/2018/10/01/machine-learning-in-a-box-part-10-jupyterlab/
- Installing the Python Client API for machine learning algorithms tutorial: https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html