
Hands-On Tutorial: Machine Learning push-down to SAP HANA with Python

In this tutorial you will learn how to train Machine Learning (ML) models in SAP HANA through Python code. You can trigger the predictive algorithms either from local Jupyter Notebooks or, even better, from Jupyter Notebooks within SAP Data Intelligence.

If you are using SAP HANA, you probably have valuable business data in that system. This data can also be a very valuable asset for Machine Learning tasks. Since SAP HANA contains predictive algorithms, you can train ML models within SAP HANA on the existing information – without having to extract and duplicate the data! I like to call this the “push-down”.
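To make the push-down idea concrete: the wrapper you will use later exposes a dataframe object that merely composes SQL. The data stays in the database, and only the result travels to Python when you call collect(). Here is a toy sketch of that pattern, using Python's built-in sqlite3 as a stand-in database (illustration only – this is not the hana_ml API):

```python
import sqlite3

class PushdownFrame:
    """Toy dataframe: composes SQL lazily, executes only on collect()."""
    def __init__(self, conn, sql):
        self.conn, self.sql = conn, sql

    def filter(self, condition):
        # Filtering only wraps the SQL in a WHERE clause; no data moves yet
        return PushdownFrame(self.conn, f"SELECT * FROM ({self.sql}) WHERE {condition}")

    def collect(self):
        # Only now does the database run the query and return rows to Python
        return self.conn.execute(self.sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (model TEXT, price INTEGER)")
conn.executemany("INSERT INTO cars VALUES (?, ?)",
                 [("golf", 8000), ("a4", 15000), ("polo", 4000)])

df = PushdownFrame(conn, "SELECT * FROM cars").filter("price > 5000")
print(df.collect())  # [('golf', 8000), ('a4', 15000)]
```

The hana_ml dataframe works on the same principle: each transformation extends a SELECT statement that runs inside SAP HANA.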

In case you are not familiar with Machine Learning or Python, this project can be a starting point. If you are already experienced with Machine Learning, you might be curious how to train ML models directly in SAP HANA from your preferred Python environment. That’s right, leverage the power of SAP HANA without leaving your existing Python framework!

You can implement the scenario yourself using your own SAP HANA instance. Or, if you just want to get an idea of the concept without getting hands-on, you can simply scroll through the Notebooks that are shared.

To get hands-on you need to:

  • have access to a SAP HANA system (version 2.0 SPS 03 or higher)
  • have a Python development environment, preferably JupyterLab
  • install the libraries in your Python environment that are needed to connect to SAP HANA and to push down calculation and training logic
  • download a set of Jupyter Notebooks that have been prepared for you

If you have access to SAP Data Intelligence, you can get started quicker as SAP Data Intelligence already has JupyterLab integrated. Those who will work with SAP Data Intelligence can jump to the SAP Data Intelligence chapter in this blog after having read through the remainder of this chapter.

The notebooks will implement a typical Machine Learning scenario in which a regression model is trained using the Predictive Analysis Library (PAL). You will estimate the price of a used vehicle based on the car model, the year in which it was built, the car’s mileage and other parameters.

For this scenario we are using a dataset of cars that were offered for sale at the end of 2016. This dataset was compiled by scraping offers on eBay and shared as “Ebay Used Car Sales Data” on Kaggle.

As the data is from 2016, any analysis or prediction refers to that time. By now, the used cars’ values will have dropped further – unless you are looking at a classic car, whose price might rise over time…

Needless to say, this blog, its code and any advice that comes with it are not part of an official SAP product. I hope you find the information useful to learn and create your own analysis on your own data, but there is no support for any of the content.

The official documentation for the components used in this blog is available on the SAP Help Portal.

A big “Thank you” goes to Thomas Bitterle, who was the first to test out this blog before publication! His feedback rounded off a number of areas, making it easier for everyone.

Install Python environment

You should be able to use your own Python environment, in case you already have one. In case you do not have Python installed yet, I suggest using the Anaconda Installer. Anaconda is a free and open-source distribution that installs everything that is typically needed to get started.

After the installation you can easily open the JupyterLab environment from the Anaconda Navigator.

 

Alternatively, you can start the JupyterLab environment from the “Anaconda Prompt” application with the command: jupyter lab

 

JupyterLab provides a browser-based interface for working with multiple Jupyter Notebooks. A Jupyter Notebook allows you to write and execute Python code from your browser. You can add nicely formatted comments to describe the code, and the Notebook can display the output of the code, e.g. text or charts. Having all this information in one place makes it easier to keep track of your project.

Don’t worry if this sounds complex. It doesn’t take long to pick up and it is good fun to use! If you haven’t worked with Jupyter Notebooks so far, this collection of very brief introductory videos is a good start.

 

SAP HANA access and configuration

The Python wrapper, which facilitates the push-down, is supported as of SAP HANA 2.0 SPS 03. Should you not have such a system ready for testing, a quick way to get access is to start SAP HANA Express on a hyperscaler, e.g. on Amazon Web Services, Microsoft Azure or Google Cloud Platform.

For this blog I chose a 32 GB instance of SAP HANA Express 2.0 SPS 04 on AWS, as outlined in this guide. Please keep a close eye on the hosting costs. Do check daily to avoid surprises! As far as I understand, SAP HANA Express on AWS is not covered by the AWS free tier.

Once you have an appropriate SAP HANA available, you need to run some SQL statements to configure it to be used by the Python wrapper.

SQL syntax can be executed in different environments. In this blog I am using the SAP Web IDE for SAP HANA. If you are also using this interface, you must first add your SAP HANA instance to the Web IDE. In the Web IDE’s “Database Explorer” on the left-hand side, click the “+” sign and choose:

  • Database Type: SAP HANA Database (Multitenant)
  • Host: Your SAP HANA’s IP address or server name (hxehost if you set up SAP HANA Express on AWS)
  • Identifier: Instance Number: 90
  • Tenant database: HXE
  • User: SYSTEM
  • Password: [the password you specified]

Open the SQL Console for that connection and execute the following statements.

 

Create a user named ML, who will access SAP HANA to upload data, to analyse the data and to train Machine Learning models. At the risk of stating the obvious, please replace ‘YOURPASSWORD’ with a password of your own choosing.

CREATE USER ML PASSWORD "YOURPASSWORD";

 

Optionally, you can ensure that the user will never be prompted to change the password:

ALTER USER ML DISABLE PASSWORD LIFETIME;

 

Assign the user the necessary rights to trigger the Predictive Analysis Library (PAL):

GRANT AFLPM_CREATOR_ERASER_EXECUTE TO ML;
GRANT AFL__SYS_AFL_AFLPAL_EXECUTE TO ML;

 

The Predictive Analysis Library (PAL) requires the script server to be running on the tenant database. The script server can be activated through the system database. Therefore, add the system database as an additional connection to the Database Explorer in the SAP Web IDE for SAP HANA:

  • Database Type: SAP HANA Database (Multitenant)
  • Host: Your SAP HANA’s IP address or server name (hxehost if you set up SAP HANA Express on AWS)
  • Identifier: Instance Number: 90
  • Database: System database
  • User: SYSTEM
  • Password: [the password you specified]

 

Now use this connection to add the script server to the HXE tenant database with this statement:

ALTER DATABASE HXE ADD 'scriptserver';

 

Later on we will need to know the SQL port of the HXE tenant. The port can be retrieved from the System database. Use the same connection to the System database to execute this SQL statement. Note down the SQL_PORT that is shown for our tenant HXE. Credit for that clever SQL statement goes to this developer tutorial!

SELECT DATABASE_NAME, SERVICE_NAME, PORT, SQL_PORT, (PORT + 2) HTTP_PORT 
FROM SYS_DATABASES.M_SERVICES
WHERE (SERVICE_NAME = 'indexserver' and COORDINATOR_TYPE = 'MASTER' )
OR SERVICE_NAME = 'xsengine';

 

Install the Python libraries for SAP HANA push-down

By now you have JupyterLab installed and access to a SAP HANA system. Next you need to install the wrapper that allows Python to connect to SAP HANA and to push down data calculations and the training of ML models to SAP HANA.

Start JupyterLab as explained above, either through the Anaconda Navigator or from the Anaconda Prompt. Then create a new Jupyter Notebook and install these two libraries:

  1. The SAP HANA Python Client, which is the underlying connectivity from Python to SAP HANA:
    !pip install hdbcli​
  2. The Python wrapper, which facilitates the push-down to SAP HANA, is currently (October 2019) not available through pip. You need to download and install a recent version of the SAP HANA 2.0 client (at least SAP HANA 2.0 SPS 04 Revision 42). After the installation you will find the file hana_ml-1.0.7.tar.gz in “C:\Program Files\SAP\hdbclient”. Install this library in the Jupyter Notebook with:
    !pip install "C:\Program Files\SAP\hdbclient\hana_ml-1.0.7.tar.gz"​

    Test the installation with the following code, which should print the version of the hana_ml package. This hands-on guide requires you to have at least version 1.0.7.

    import hana_ml
    print(hana_ml.__version__)

 

Test the connection from JupyterLab to SAP HANA

Run a quick test of whether the hana_ml package can indeed connect to your SAP HANA system. To keep things simple for now, log on with user name and password. The code connects to SAP HANA, executes a very simple SELECT statement, and retrieves and displays the result in Python. You may need to change the server name of your SAP HANA system; “hxehost” is the name given in the above AWS guide.

import hana_ml.dataframe as dataframe

# Instantiate connection object
conn = dataframe.ConnectionContext("hxehost", 39015, "ML", "YOURPASSWORD")

# Send basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_pushdown = conn.sql(sql)
print(df_pushdown.collect())

# Close connection
conn.close()

 

Running the cell should display the value 12345.

 

Should you receive an error, scroll to the end of the error message. Typically the last line of the error is the most helpful one.
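If you want to surface only that last line programmatically, a small stdlib-only helper (my own illustration, not part of hana_ml) does the trick. The sample error text below is just an example of the kind of message a failed connection produces:

```python
import traceback

def last_error_line(exc: BaseException) -> str:
    """Return the final line of an exception's formatted traceback."""
    lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return lines[-1].strip()

try:
    # Stand-in for a failed database call, e.g. a wrong host or password
    raise ConnectionError("Connection failed (wrong host or port?)")
except ConnectionError as err:
    print(last_error_line(err))
    # → ConnectionError: Connection failed (wrong host or port?)
```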

 

Connect with secure password

In the previous cell it was convenient to write the SAP HANA password directly into the Python code. This is obviously not very secure, and you may want to take a more sophisticated approach. It would be better to save the logon parameters securely with the hdbuserstore application, which is part of the SAP HANA client.

Navigate in a command prompt (cmd.exe) to the folder that contains the hdbuserstore, i.e.
C:\Program Files\SAP\hdbclient

Then store the logon parameters in the hdbuserstore. In this example the parameters are saved under a key called hana_hxe. You are free to choose your own name, but if you stick with hana_hxe you can execute the Jupyter Notebooks as they are.

C:\Program Files\SAP\hdbclient>hdbuserstore -i SET hana_hxe "SERVER:SQL_PORT" YOURUSER

The password is not specified in the above command as you will be prompted for it.

 

Now that the logon credentials are securely saved, they can be leveraged by the hana_ml wrapper to logon to SAP HANA. Create a new cell in the Jupyter Notebook and repeat the earlier test, but now use the securely stored credentials.

import hana_ml.dataframe as dataframe

# Instantiate connection object
conn = dataframe.ConnectionContext(key="hana_hxe")

# Send basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_pushdown = conn.sql(sql)
print(df_pushdown.collect())

# Close connection
conn.close()

 

You should see the familiar output of 12345.

 

Run the notebooks to trigger Machine Learning within SAP HANA

Everything is in place now to run the Notebooks that are shared with this blog. Download these Jupyter Notebooks and the data file from the hana_ml samples Github repository.

 

Save these files to a local folder. Now open JupyterLab as described in the first chapter above. In the File Browser on the left, navigate to that folder; you should see the five Notebooks and the data file.

 

We will now go through the notebooks in the order of their numbering. If you have implemented the above steps, you should be able to run the notebooks without modifications.

In case you have not saved the SAP HANA logon credentials in hdbuserstore, you need to change the ConnectionContext to the hardcoded logon approach as shown earlier in this blog.

All notebooks that are offered for download here are saved with the cell output that was produced when they were executed. Before running the notebooks yourself, you can remove that output, so that you know for sure that all output you see was produced by your own run. To clear the previous output, right-click on the notebook and select “Clear All Outputs”.

Within these notebooks you will find additional comments and explanations. This blog therefore gives only a high-level summary of each notebook.

Data upload

The notebook “00 Preparation” loads the data from autos.csv into a local pandas data frame and does some data preparation before using the hana_ml wrapper to save the data to SAP HANA. This data will be used to train ML models. The notebook also creates a second table in SAP HANA containing additional cars, on which we will later apply the trained model to predict prices.
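The preparation step follows the usual pattern: read the CSV, drop implausible records, cast types. A minimal stdlib sketch of that pattern is below. The column names come from the Kaggle dataset, but the threshold and the dropped rows are purely illustrative, not the notebook's exact logic:

```python
import csv
import io

# Tiny inline stand-in for autos.csv
raw = io.StringIO(
    "model,yearOfRegistration,kilometer,price\n"
    "golf,2008,150000,4500\n"
    "a4,2010,125000,0\n"        # implausible price, will be dropped
    "polo,2012,90000,6200\n"
)

rows = []
for rec in csv.DictReader(raw):
    price = int(rec["price"])
    if price <= 100:            # drop offers with implausible prices
        continue
    rows.append({"model": rec["model"],
                 "year": int(rec["yearOfRegistration"]),
                 "km": int(rec["kilometer"]),
                 "price": price})

print(len(rows))  # 2
```

In the notebook itself, the cleaned pandas frame is then written to SAP HANA through the hana_ml wrapper.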

 

Introduction to Python wrapper

Before going into a longer and more realistic project, run the notebook “05 Introduction” to train a very simple model through the hana_ml wrapper. If you are comfortable with the steps in this notebook, you have already got the hang of it!

 

Exploratory Data Analysis

With notebook “10 Exploratory Data Analysis” you start a more comprehensive project. You will explore and filter the data. The transformed data is saved as a view in SAP HANA, which will be used in the following notebook. As the transformation is saved as a view, no data is duplicated!
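The “no duplication” point deserves a tiny demonstration. A view stores only the SELECT statement, not a copy of the rows – the same principle in any SQL database, shown here with sqlite3 as a stand-in for SAP HANA:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (model TEXT, year INTEGER, price INTEGER)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)",
                 [("golf", 2008, 4500), ("a4", 1985, 900), ("polo", 2012, 6200)])

# The filtered 'dataset' is a view: only the query text is stored,
# the rows themselves stay in the base table
conn.execute("CREATE VIEW v_cars_filtered AS "
             "SELECT * FROM cars WHERE year >= 2000")

print(conn.execute("SELECT COUNT(*) FROM v_cars_filtered").fetchone()[0])  # 2
```

Because the view holds no data of its own, any change to the base table is immediately reflected in the view.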

 

Imputation and Model Training

In “20 Imputation and model training” the data is transformed further and missing values are imputed. The data is then split into training and test sets, which are used to train different decision trees and to assess each model’s quality.

The best model is chosen and an error analysis is carried out to see how the model performed in different areas. The model is then saved to SAP HANA.
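The two preparation steps in this notebook – imputation and a train/test split – can be sketched without a database at all. A stdlib-only illustration (median imputation and a reproducible 80/20 split; the actual notebook performs these steps in SAP HANA):

```python
import random
import statistics

cars = [{"km": 150000}, {"km": None}, {"km": 90000},
        {"km": 125000}, {"km": None}]

# Impute missing mileage with the median of the known values
known = [c["km"] for c in cars if c["km"] is not None]
median_km = statistics.median(known)
for c in cars:
    if c["km"] is None:
        c["km"] = median_km

# Reproducible 80/20 train/test split
random.seed(42)
random.shuffle(cars)
cut = int(len(cars) * 0.8)
train, test = cars[:cut], cars[cut:]
print(len(train), len(test))  # 4 1
```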

 

Apply the trained model / predict

In the first notebook (“00 Preparation”) we created a table in SAP HANA with cars whose price we want to predict. The moment has come!

Run “30 Apply saved model” to load the model that was created in the previous notebook (“20 Imputation and model training”). Then apply the model on these cars and see how the difference in mileage affects the price of the cars.
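Conceptually, applying a saved model just means scoring new records with a fitted function. As a stand-in for the PAL decision tree, here is a toy closed-form linear fit of price against mileage, applied to two unseen cars (the numbers are purely illustrative):

```python
# Toy training data: (mileage, price) pairs
data = [(50_000, 12_000), (100_000, 9_000), (150_000, 6_000)]

# Least-squares fit of price = a * km + b (closed form for one feature)
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in data)
     / sum((x - mean_x) ** 2 for x, _ in data))
b = mean_y - a * mean_x

# "Apply the model": predict prices for unseen cars
for km in (75_000, 125_000):
    print(km, round(a * km + b))  # 75000 → 10500, 125000 → 7500
```

The prediction drops with mileage, which is exactly the effect you will observe when applying the real model to the prepared cars.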

 

Tidy up

Optionally, if you want to delete the tables and views that have been created you can run the Notebook “40 Tidy up”.

 

SAP Data Intelligence

This chapter is only relevant for those who have access to SAP Data Intelligence. The above notebooks can also be executed within the interface of SAP Data Intelligence. Ideally you should already have some familiarity with SAP Data Intelligence, e.g. by having read through or even implemented your first ML Scenario.

You need to follow these steps to run the Notebooks in SAP Data Intelligence:

  • Prepare a SAP HANA system as described in the chapter “SAP HANA access and configuration”.
  • In SAP Data Intelligence create a new connection of type HANA_DB to that SAP HANA instance. Name the connection “di_hana_hxe”. The “Host” is the SAP HANA’s IP address or server name. As “Port” enter the SQL port (see above for SQL_PORT). Specify the ML user and the corresponding password.
  • In the “ML Scenario Manager” create a new “ML Scenario” named “Car price prediction blog”
  • Create a Notebook called “dummy”, just to open the JupyterLab interface. If you get prompted for a kernel, select “Python 3”.
  • Import the five Notebooks and the data file autos.csv (in the “File Browser” on the left).
  • In each Notebook change the command that logs on to SAP HANA, since the connection “di_hana_hxe” now needs to be used. Replace this line:
    conn = dataframe.ConnectionContext(key = 'hana_hxe')​

    with

    from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
    conn = NotebookConnectionContext(connectionId = 'di_hana_hxe')

    Only for Notebook “40 Tidy up” replace the existing command with

    from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
    conn = NotebookConnectionContext(connectionId = 'di_hana_hxe').connection
  • And finally, you may need to update the hana_ml package. Currently (October 2019) the version that comes with SAP Data Intelligence is not at the version that is required for this blog’s notebooks. Uninstall the current version with:
    !pip uninstall hana_ml --yes

    Upload the file hana_ml-1.0.7.tar.gz (see above) into the File Browser together with the Notebooks. Install the package with:

    !pip install hana_ml-1.0.7.tar.gz

    Restart the Python kernel through the JupyterLab menu (In the “Kernel” menu select “Restart Kernel…”)

Now you should be good to run all Notebooks as explained above.

 

Deployment into production

Now we have a model that we can work with manually. To bring the predictions into an ongoing business process, you can leverage SAP Data Intelligence to retrain and apply the model as needed.

With SAP Data Intelligence you can script in Jupyter Notebooks without having to install any components on your own computer. The code you want to bring into the productive process can be deployed through graphical pipelines, which help IT to run the code in a governed framework.

This sounds like a topic for another day and another blog.

Conclusion

Well done! If you have read the blog this far, you have an understanding of how Machine Learning can be carried out within SAP HANA. If you have followed this guide hands-on, you now even have experience with Machine Learning in SAP HANA and are ready to experiment with your own data!
