Technical Articles

Andreas Forster

SAP Data Intelligence: Deploy your first HANA ML pipelines

Find out how to create and deploy SAP HANA Machine Learning pipelines in SAP Data Intelligence. By estimating the price of a used vehicle, we will see how a Data Scientist can work with Python in their familiar Jupyter notebooks with a high degree of agility and how the logic can be deployed into an operational process under the governance of IT.

Begin with exploring and preparing SAP HANA data before deciding on a Machine Learning model. Then move from the creation phase into deployment by leveraging graphical HANA ML operators and templates. You go from raw data to real-time predictions, all without having to extract any data from SAP HANA.

If you have access to an SAP Data Intelligence system, you can implement this project yourself. Alternatively, just read through the article to get an overview.

This blog does not assume that you have a lot of experience with SAP Data Intelligence. However, this example builds on two related blogs, which introduce the two topics that are combined in this scenario: SAP Data Intelligence’s ML Scenario Manager and SAP HANA Machine Learning triggered through Python.

Also, the official documentation will be useful:

This blog was updated on 14 January 2021 for SAP Data Intelligence Cloud version 2010.29.22 and hana_ml version 2.6.21011300.

In this exercise we will estimate the price of a used vehicle. We will train a regression model on the “Ebay Used Car Sales Data“, which contains vehicles that were offered for sale at the end of 2016. The true prices, for which the cars were sold, are not known.

Therefore, to be more precise: we will estimate the price in Euros for vehicles that were offered for sale at the end of 2016. Today’s prices should be lower, with just a few exceptions. Oldtimers tend to get more valuable over time.

System Preparation

For implementing this scenario, you require:

  • Access to SAP Data Intelligence
  • A SAP HANA system to which SAP Data Intelligence can connect
  • A user for that SAP HANA system, which has the rights to call the Predictive Algorithm Library (PAL)

With the above components in place, upload the historic car sales data to SAP HANA. There are many ways to upload a CSV file. My favourite option is to load the data from Python directly into SAP HANA. To follow this blog, please use the same approach, as this ensures that your data types are in sync with this description.

Log on to SAP Data Intelligence and create a connection, which will be used by the notebooks for creating content and by the pipelines for deployment. Select the “Connection Management” tile.


Create a new connection of connection type “HANA_DB”. Name the connection by setting its ID to “MYHANA”. Enter the details for your connection. If you don’t know the SQL port of your system, see the hints in this blog.


Back on the Launchpad, click into the “ML Scenario Manager”, from where we will now upload the historic data into SAP HANA. Later on, we will use the same ML Scenario to implement the car price prediction.

In your own future projects, you may not need to upload any data into SAP HANA. You might be able to work with data that is already stored in your existing system.


Create a new ML Scenario named “Car price prediction (HANA ML)”.


The ML Scenario opens up. Create a new Jupyter notebook.


Name the notebook “10 Data upload”. The Jupyter Notebook environment opens up. When prompted for the version of the kernel, select “Python 3”.

Keep the Jupyter window open. Separately, download the file that contains the historic sales data (autos.csv). Unzip the file and upload it into the Notebook environment using the “Upload Files” icon on the left.


Now we can start scripting. We will be using the hana_ml library to upload the data. Later on, we will use the same library for the Machine Learning tasks. The library’s official name is “SAP HANA Python Client API for Machine Learning Algorithms“. There is also a similar hana_ml library for R.

The Jupyter Notebooks in SAP Data Intelligence might not always have the very latest version of this library installed. Enter this Python code in the cell on the right-hand side and execute that cell, to see which version is available.

import hana_ml
print(hana_ml.__version__)



When the above screenshot was taken, version 1.0.5 was installed. Upgrade the library to the latest version.

!pip install hana_ml --upgrade


Restart the Python kernel, so that your Python session uses the newly installed version.


After the restart, verify that the hana_ml library has been upgraded.

import hana_ml
print(hana_ml.__version__)


At the time of writing this blog, the latest version is 2.6.21011300.


Now load the data from the autos.csv file into a Pandas dataframe. Once loaded, it will display the first 5 rows of the data set.

import pandas as pd
df_data = pd.read_csv('autos.csv', encoding = 'Windows-1252')
df_data.head(5)


Continue by adjusting the data slightly before uploading it. Turn the column names into upper case.

df_data.columns = map(str.upper, df_data.columns)


Filter the data on Mercedes Benz cars that are offered for sale by private individuals. I just picked that brand by chance. This exercise is not affiliated with or endorsed by Mercedes Benz in any way.

df_data = df_data[df_data['BRAND'] == 'mercedes_benz']    # Keep only Mercedes Benz
df_data = df_data[df_data['OFFERTYPE'] == 'Angebot']      # Keep only cars for sale (excluding adverts for purchasing a car)
df_data = df_data[df_data['SELLER'] == 'privat']          # Keep only sales by private people (excluding commercial offers)
df_data = df_data[df_data['NOTREPAIREDDAMAGE'] == 'nein'] # Keep only cars that have no unrepaired damage


Remove a few columns for simplicity.

# Drop the columns used in the filters above (now constant) and other
# columns that are not needed for this exercise
df_data = df_data.drop(['NOTREPAIREDDAMAGE', 'DATECRAWLED', 'NAME', 'SELLER',
                        'OFFERTYPE', 'ABTEST', 'MONTHOFREGISTRATION', 'BRAND',
                        'DATECREATED', 'NROFPICTURES', 'POSTALCODE', 'LASTSEEN'],
                       axis = 1)


Rename some columns for readability.

df_data = df_data.rename(index = str, columns = {'YEAROFREGISTRATION': 'YEAR',
                                                 'POWERPS': 'HP'})


Translate column content from German into English.

df_data['MODEL'] = df_data['MODEL'].replace({'a_klasse': 'A-Class',
                                             'b_klasse': 'B-Class',
                                             'c_klasse': 'C-Class',
                                             'e_klasse': 'E-Class',
                                             'g_klasse': 'G-Class',
                                             'm_klasse': 'M-Class',
                                             's_klasse': 'S-Class',                                     
                                             'v_klasse': 'V-Class',                                       
                                             'cl': 'CL',  
                                             'sl': 'SL', 
                                             'gl': 'GL', 
                                             'clk': 'CLK',   
                                             'slk': 'SLK',
                                             'glk': 'GLK',  
                                             'sprinter': 'Sprinter',  
                                             'viano': 'Viano',  
                                             'vito': 'Vito',                                        
                                             'andere': 'Other'})
df_data['GEARBOX']  = df_data['GEARBOX'].replace({'manuell': 'manual',
                                                  'automatik': 'automatic'})
df_data['FUELTYPE'] = df_data['FUELTYPE'].replace({'benzin': 'petrol'})


Add an ID column, which will be needed for training a Machine Learning model.

df_data.insert(0, 'CAR_ID', df_data.reset_index().index)


And move our target column, the price, towards the end. Some algorithms require the target to be the last column of the data. Moving it early on makes it a little easier later on.

df_data = df_data[pd.Index.append(df_data.columns.drop("PRICE"), pd.Index(['PRICE']))]

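To see what this pd.Index.append idiom does, here is a tiny standalone illustration with made-up values (not the car data itself):

```python
import pandas as pd

# Made-up miniature dataframe, just to show the column reordering
df = pd.DataFrame({'CAR_ID': [0], 'PRICE': [9500], 'YEAR': [2010]})

# Same idiom as above: drop PRICE from the column index, then append it at the end
df = df[pd.Index.append(df.columns.drop('PRICE'), pd.Index(['PRICE']))]
print(list(df.columns))  # ['CAR_ID', 'YEAR', 'PRICE']
```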

The data is prepared. Now establish a connection to SAP HANA, using the credentials from the MYHANA connection that was created earlier.

import hana_ml.dataframe as dataframe
from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
conn = NotebookConnectionContext(connectionId = 'MYHANA')


Upload the data to a table called USEDCARPRICES.

df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                   pandas_df = df_data, 
                                                   table_name = 'USEDCARPRICES',
                                                   force = True,
                                                   replace = False)


Verify that the data has indeed been uploaded. You could check through your usual SAP HANA frontends, or you can have a look from within the Jupyter Notebook.

Select the “Data Browser” icon on the very left, then select the “Connections” icon on top and double-click on the “MYHANA” connection.


Double-click into your database schema and find the “USEDCARPRICES” table. Now right-click and select “View in Metadata Explorer”.


The Metadata Explorer opens up for our new table. Notice that we have 4 columns of type character and 5 numeric columns. Select the “Data Preview” tab and the data should show up!


Back in the notebook, close the connection to SAP HANA with conn.close(). Then hit the “Save” icon.



Explore and prepare data

The system is ready, the data is loaded. We can start working with the data! On the main page of the ML Scenario create a new notebook called “20 EDA and data preparation”. EDA is a common abbreviation for Exploratory Data Analysis. Once we have explored and understood the data, we prepare it further for Machine Learning.


When the notebook opens, select the “Python 3” kernel as before. Begin by establishing a connection to SAP HANA, leveraging the connection that was centrally defined beforehand.

import hana_ml.dataframe as dataframe
from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
conn = NotebookConnectionContext(connectionId = 'MYHANA')


Create a SAP HANA dataframe which points to the USEDCARPRICES table.

df_remote = conn.table(table = 'USEDCARPRICES')


No data gets extracted here from SAP HANA. The object df_remote is a pointer to the table, where the data remains. Try calling the object on its own.

df_remote


No data is returned, which is good, since we want to avoid moving data out of SAP HANA. You just see the object type.


However, we can of course extract data if we want to. Python’s autocompletion lets you see the object’s functions. Add a dot after the object name and hit “TAB” on your keyboard.


Knowing which functions are available and how to use them will come in handy during any project. Select the head() function, which restricts the output to a small number of rows. Place the cursor inside the function’s round brackets, then hit SHIFT + TAB and the help for this function appears, i.e. what it does and which parameters can be passed into it.


Retrieve 5 rows from the table. The collect function is needed to transfer the data into Python. It is important to restrict the data first (with the head function) before extracting it with the collect function.

df_remote.head(n = 5).collect()


Continue by exploring the data. Count the number of rows for instance.

df_remote.count()

Find the price of the most expensive car.

df_remote.agg([('max', 'PRICE', 'MOSTEXPENSIVE')]).collect()


Use the describe function to get an overview of the content of all columns.

df_remote.describe().collect()

Each cell was calculated by SAP HANA. Only the results were then moved to Python. In that table you will also find the values that you had requested earlier, i.e. the price of the most expensive car.

To prove the point that SAP HANA did the work here, ask for the underlying SELECT statement that was created by the hana_ml library.

df_remote.describe().select_statement

The SQL statement that is returned is much too long to reproduce here. However, seeing that SQL code helps to understand how the hana_ml library works, i.e. how it translates the requests into SQL code for SAP HANA.

Continue the exploration by finding the most frequent car models in the data set.

top_n = 5
df_remote_col_frequency = df_remote.agg([('count', 'MODEL', 'COUNT')],  group_by = 'MODEL')
df_col_frequency = df_remote_col_frequency.sort('COUNT', desc = True).head(top_n).collect()


Plot the data of the above Pandas dataframe.

%matplotlib inline
df_col_frequency.plot.bar(x = 'MODEL',
                          y = 'COUNT',
                          title = 'Top ' + str(top_n));


Inspect the price of a car against the year in which it was built. First install the seaborn library for additional charting capabilities.

!pip install seaborn


For a small sample of all cars retrieve only the year and the price.

col_name_1 = 'PRICE'
col_name_2 = 'YEAR'

from hana_ml.algorithms.pal import partition
df_remote_sample, df_remote_ignore1, df_remote_ignore2 = partition.train_test_val_split(
                               data = df_remote, 
                               random_seed = 1972,
                               training_percentage = 0.01, 
                               testing_percentage = 0.99, 
                               validation_percentage = 0)
df_sample =, col_name_2).collect()


Visualise the data in a scatter plot.

%matplotlib inline
import seaborn as sns
sns.jointplot(x=col_name_1, y=col_name_2, data=df_sample);


For newer cars there is a trend that they are more expensive. However, the most expensive cars are the oldest ones, the oldtimers. Therefore, filter the dataset to focus on the more common cars: only consider cars that were built in the year 2000 or later and that cost under 50,000 Euros. To keep things a little simple for now, all rows with missing values are also dropped. 18066 cars remain.

df_remote = df_remote.filter('YEAR >= 2000')
df_remote = df_remote.filter('PRICE < 50000')
df_remote = df_remote.dropna()


Continue by analysing the correlations between the numerical variables.

import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
ax, corr_data = eda.correlation_plot(data = df_remote.drop(['CAR_ID']),
                                     cmap = 'coolwarm')


The strongest positive correlation exists between the year in which the car was built and the price, with a coefficient of 0.75. That means there is a fairly strong trend that newer cars come with a higher price; or vice versa, older cars tend to be cheaper. Similarly, the negative correlation of -0.55 between the price and kilometer means that a higher mileage tends to come with a lower price. Both make perfect sense of course.
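As a quick sanity check of what such a coefficient expresses, here is a tiny standalone example with made-up numbers (not the actual car data): four cars where newer build years go hand in hand with higher prices, yielding a coefficient close to +1.

```python
import numpy as np

# Made-up values: price rises with the build year
year  = np.array([2000, 2005, 2010, 2015])
price = np.array([3000, 7000, 12000, 20000])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(year, price)[0, 1]
print(round(r, 2))  # close to +1, a strong positive correlation
```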


We have a reasonable understanding of the data now. Before making the prepared data available for Machine Learning, convert the target column of our prediction, the price, into a DOUBLE type. This will be needed later on by the algorithm that trains the model. Converting it early makes things a little easier later on.

df_remote = df_remote.cast('PRICE', 'DOUBLE')


Now that the data is fully prepared, save it as a view to SAP HANA. This function returns a SAP HANA dataframe, which points to the newly created view. Hence Python will display the object type hana_ml.dataframe.DataFrame. This is not an error; you have already seen this behaviour at the beginning of this notebook. = 'USEDCARPRICES_PREPVIEW', table_type = 'VIEW', force = True)


Close the connection and save the notebook.



The original data in the USEDCARPRICES table remains unchanged. No data has been duplicated, but the view USEDCARPRICES_PREPVIEW now provides suitable data for the purpose of Machine Learning.

Train ML model

For the Machine Learning part, I prefer to use a separate notebook. Create a new notebook named “30 Train ML”. Select the “Python 3” kernel when prompted.


Begin by establishing a connection to SAP HANA again.

import hana_ml.dataframe as dataframe
from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
conn = NotebookConnectionContext(connectionId = 'MYHANA')


Create a SAP HANA dataframe that now points to the view that represents the prepared and filtered data.

df_remote = conn.table(table = 'USEDCARPRICES_PREPVIEW')


Peek at the data by retrieving 5 rows.

df_remote.head(n = 5).collect()

Before training any Machine Learning model, split the data into two separate parts. One part (70%) will be used to train different ML models. The other part (30%) will be used to test the model’s quality.

from hana_ml.algorithms.pal import partition
df_remote_train, df_remote_test, df_remote_ignore = partition.train_test_val_split(random_seed = 1972,
                                                                                   data = df_remote, 
                                                                                   training_percentage = 0.7, 
                                                                                   testing_percentage = 0.3,
                                                                                   validation_percentage = 0)


See how many rows are in each subset.

print('Size of training subset: ' + str(df_remote_train.count()))
print('Size of test subset: ' + str(df_remote_test.count()))


Use the training data to train a very first Hybrid Gradient Boosting regression. The code specifies the regression’s target (PRICE), the predictor columns (i.e. YEAR and HP) as well as the hyperparameters to use for training the algorithm. Hyperparameters like n_estimators or max_depth are described in the documentation. For now, we are not too concerned about the model quality. This step primarily shows the concept of how such a model can be trained. Later on we will try to improve the model.

from hana_ml.algorithms.pal import trees
hgb_reg = trees.HybridGradientBoostingRegressor(random_state = 42,
                                                n_estimators = 10,
                                                max_depth = 5,
                                                learning_rate = 0.3,
                                                min_sample_weight_leaf = 1,
                                                min_samples_leaf = 1,
                                                lamb = 1,
                                                alpha = 1)

# Specify the model's predictors
features = ['VEHICLETYPE', 'YEAR', 'GEARBOX', 'HP', 'MODEL', 'KILOMETER', 'FUELTYPE']

# Train the model = df_remote_train, 
            features = features, 
            key = "CAR_ID", 
            label = 'PRICE')


With the trained model, we can test its performance on the test data. First compare the actual / true price of the car from the test data with the prediction of the regression model.

df_remote_act_pred = df_remote_test.alias('L').join(
    hgb_reg.predict(data = df_remote_test, features = features, key = 'CAR_ID').alias('R'),
    'L.CAR_ID = R.CAR_ID', 
    select=[('L.CAR_ID', 'CAR_ID'), 'MODEL', ('PRICE','ACTUALS'), ('SCORE', 'PREDICTIONS')])


Some predictions are closer to the actual value than others. Now calculate the Root Mean Squared Error (RMSE), which gives an indication of the model’s overall performance for all cars in the test dataset.

import numpy as np
df_remote_se ='CAR_ID',
                                        ('(ACTUALS - PREDICTIONS) * (ACTUALS - PREDICTIONS) ', 'ERRORSQ'))
df_mse = df_remote_se.agg([('avg', 'ERRORSQ', 'MSE')]).collect()
rmse = np.sqrt(float(df_mse.iloc[0, 0]))
print('RMSE: ' + str(round(rmse, 2)))


The RMSE will probably be about 3100. Slightly simplified, this means that by this definition the model was on average 3100 Euros away from the true value. Whether that is good or bad depends on the circumstances.
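In case the metric itself is new to you, this small standalone sketch (with made-up prices, not the car data) shows how the RMSE is computed: square the individual errors, average them, take the square root.

```python
import numpy as np

# Made-up actual and predicted prices in Euros
actuals     = np.array([10000.0, 15000.0,  8000.0])
predictions = np.array([11000.0, 13500.0,  8500.0])

# Root Mean Squared Error over errors of 1000, 1500 and 500 Euros
rmse = np.sqrt(np.mean((actuals - predictions) ** 2))
print(round(rmse, 2))  # 1080.12
```

Because the errors are squared before averaging, large individual errors weigh in more heavily than small ones.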

Now however, we want to improve the quality of the model. We stick with the same algorithm, but we try out different hyperparameter values (i.e. for n_estimator or max_depth) to achieve a lower RMSE on the test data.

These hyperparameter values will be tested.

n_estimators = [10, 20, 30, 40, 50]
max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10]
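
Every combination of the two lists gets tried, so the full grid spans 5 × 9 = 45 models:

```python
# The same two hyperparameter lists as above
n_estimators = [10, 20, 30, 40, 50]
max_depth = [2, 3, 4, 5, 6, 7, 8, 9, 10]

# Size of the full grid: one model per combination
print(len(n_estimators) * len(max_depth))  # 45
```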


In this blog, all hyperparameter combinations are tested through an explicit loop. A model is trained on the training data, the RMSE is calculated on the test data. The results are stored both in Python as well as in the ML Scenario. There are alternatives to such looping, see for example this blog by Raymond Yao, which uses a random search.

Running this cell will take a few minutes…

import pandas as pd
from math import sqrt
from sapdi import tracking

# Dataframe to store hyperparameters with model quality
df_hyper_quality = pd.DataFrame(columns=['N_ESTIMATORS', 'MAX_DEPTH', 'RMSE'])

# Iterate through all parameter combinations
from hana_ml.algorithms.pal import trees
for aa in n_estimators:
    for bb in max_depth:
        hgb_reg = trees.HybridGradientBoostingRegressor(random_state = 42,
                                                        min_samples_leaf = 1,
                                                        n_estimators = aa,
                                                        max_depth = bb)

        # Train the regression with the current parameters = df_remote_train, 
                       features = features, 
                       key = "CAR_ID", 
                       label = 'PRICE')
        # Evaluate the model on the test data
        df_remote_act_pred = df_remote_test.alias('L').join(
                                hgb_reg.predict(data = df_remote_test, features = features, key = 'CAR_ID').alias('R'),
                                'L.CAR_ID = R.CAR_ID', 
                                select=[('L.CAR_ID', 'CAR_ID'), ('PRICE','ACTUALS'), ('"SCORE"', 'PREDICTIONS')])
        df_remote_se ='CAR_ID',
                                                     ('(ACTUALS - PREDICTIONS) * (ACTUALS - PREDICTIONS) ', 'ERRORSQ'))
        df_mse = df_remote_se.agg([('avg', 'ERRORSQ', 'MSE')]).collect()
        rmse = np.sqrt(float(df_mse.iloc[0, 0]))
        # Print a status update
        print('n_estimators: ', aa, ' | max_depth: ', bb)
        print('RMSE on test data: ' + str(round(rmse, 2)))
        # Add the parameters and the RMSE to the collection in Python
        df_hyper_quality = df_hyper_quality.append({'N_ESTIMATORS': aa,
                                                    'MAX_DEPTH': bb,
                                                    'RMSE': rmse},
                                                    ignore_index = True)
        # Add the parameters and RMSE to the ML Scenario
        run = tracking.start_run(run_collection_name="Car price experiments")
        tracking.log_parameters({"n_estimators": aa,
                                 "max_depth": bb})
        tracking.log_metrics({"RMSE": rmse})


Once the cell has completed, see which model performed best.

df_hyper_quality.sort_values(by = 'RMSE', ascending = True).head(5)


The model that was trained with n_estimators = 30 and max_depth = 6 resulted in the lowest RMSE. We could already deploy this model in a graphical pipeline, but let’s have a closer look at the model beforehand. Train the model once more, now with those identified hyperparameter values. Please note that the process has some randomness, so your best hyperparameter values might be different.

hgb_reg = trees.HybridGradientBoostingRegressor(random_state = 42,
                                                n_estimators = 30,
                                                max_depth = 6,
<!-- -->
                                                min_samples_leaf = 1) = df_remote_train, 
            features = features, 
            key = "CAR_ID", 
            label = 'PRICE')


Compare the actuals with the predicted values.

df_remote_act_pred = df_remote_test.alias('L').join(
    hgb_reg.predict(data = df_remote_test, features = features, key = 'CAR_ID').alias('R'),
    'L.CAR_ID = R.CAR_ID', 
    select=[('L.CAR_ID', 'CAR_ID'), 'MODEL', ('PRICE','ACTUALS'), ('SCORE', 'PREDICTIONS')])


As a test, calculate the RMSE. This should result in the same value as before when this model was identified as the strongest.

import numpy as np
df_remote_se ='CAR_ID', 'MODEL',
                                             ('(ACTUALS - PREDICTIONS) * (ACTUALS - PREDICTIONS) ', 'ERRORSQ'))
df_mse = df_remote_se.agg([('avg', 'ERRORSQ', 'MSE')]).collect()
rmse = np.sqrt(float(df_mse.iloc[0, 0]))
print('RMSE: ' + str(round(rmse, 2)))


Finally, investigate how well the model performs by car model.

df_mse = df_remote_se.agg([('avg', 'ERRORSQ', 'MSE')], group_by = ['MODEL']).collect()
df_mse['RMSE'] = np.sqrt(df_mse['MSE'])
df_mse.sort_values(by = ['RMSE'], ascending = True)

The model performed best on the A-Class. This is not very surprising, as it is a lower-priced model. Since we measure the quality of the prediction in absolute Euros (and not in percentage, for instance), the prediction errors in Euros should not be very high for the A-Class. The G-Class however, for which the model has a much larger RMSE, is a more expensive model. One could consider training a separate model for the G-Class, but for now we will go with the model as it is.
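To illustrate the effect with made-up numbers (a sketch, not the real evaluation): cheaper cars naturally produce smaller absolute errors, so their per-group RMSE comes out lower.

```python
import numpy as np
import pandas as pd

# Made-up squared errors in Euros^2 for two car models
df = pd.DataFrame({'MODEL':   ['A-Class', 'A-Class', 'G-Class', 'G-Class'],
                   'ERRORSQ': [250000.0, 490000.0, 9000000.0, 16000000.0]})

# Average the squared errors per model, then take the square root
df_rmse = df.groupby('MODEL', as_index = False)['ERRORSQ'].mean()
df_rmse['RMSE'] = np.sqrt(df_rmse['ERRORSQ'])
print(df_rmse[['MODEL', 'RMSE']])
```

The A-Class errors of 500 and 700 Euros lead to an RMSE of about 608 Euros, while the G-Class errors of 3000 and 4000 Euros give about 3536 Euros.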


Close the connection and save the notebook.



Earlier on, when the different parameter combinations for the model were tried out, the parameters and the resulting RMSE of each model were saved into the ML Scenario (see the “tracking” commands). Those values can be helpful for transparency: in case you need to recall at some point in the future which models were considered, they are persisted here.

Back in the ML Scenario, click “Open in Metrics Explorer”.


You will first see a tabular display of the information, but the data can also be charted. It’s fairly easy to find your way around, just give it a try if you are interested. (In the Metrics Explorer select the collection of metrics on the left → Select all runs → click “Open Visual Board” → “Canvas” → “Add Chart” → “Plot Metric against Parameter” → select “RMSE” in combination with one of the two hyperparameters)


Training pipeline

Through the above work in the notebooks we know which algorithm and which hyperparameters we want to use for the ML model in production. The first step for the deployment is to train and save the model through a graphical training pipeline. A second graphical pipeline will then use this saved model for inference.

On the ML Scenario’s main page create a new pipeline.


Name the pipeline “10 Train” and select the “HANA ML Training” template. Click “Create”.


The pipeline opens in the Modeler interface. All operators that we need are already there. The model will be trained through the “HANA ML Training” operator. The operator has two outputs, as it saves the model into the ML Scenario through the “Artifact Producer” and it can also save model metrics (i.e. quality indicators) into the same scenario. Once both branches have completed, the pipeline terminates.


Only the “HANA ML Training” operator must be configured with our requirements. The other operators can remain unchanged. Open the operator’s configuration.


First specify the connection. Set the “Configuration Type” to “Configuration Manager”. Then select the “MYHANA” connection from the “Connection ID” drop-down and save.


Still in the configurations of the “HANA ML Training” operator, select as “Dataset” the “USEDCARPRICES_PREPVIEW” in your schema, which provides the data in the format needed to train the model. In the Features section, remove the “CAR_ID” and “PRICE” columns. The ID and target should always be explicitly excluded.


Specify the algorithm by setting the “Task” drop-down to “Regression”, then select “Hybrid Gradient Boosting (PAL)” as algorithm. Set “Split Dataset for Testing” to “False”. Manually enter “CAR_ID” as the “Key Column” and “PRICE” as the “Target Column”.


Now specify the algorithm’s hyperparameters, which we identified earlier as producing the strongest model. Edit the “Hyper Parameters” and enter the JSON syntax below.

"random_state": 42,
"n_estimators": 30,
"max_depth" : 9,
"min_samples_leaf" : 1


The pipeline is configured. Hit the “Save” icon on top.


Optionally, in the ML Scenario’s main page you can click “Create Version” at the top of the screen. This creates a new version, which includes all the changes made to the project since the scenario was created. This step used to be compulsory in earlier versions of SAP Data Intelligence, now it is optional.

Whether you have created a new version or not, select the “10 Train” pipeline and press the “Execute” button to run the pipeline.


Skip through the settings that appear until the “Pipeline Parameters” are prompted. Set “newArtifactName” to “xgb_model”. The trained model will be saved with this name in the ML Scenario. Then click outside the parameter box, so that the “Save” option appears, which you click.


After a short while the execution’s summary screen shows that the pipeline completed successfully. No metric is shown, as we hadn’t specified a test set in the “HANA ML Training” operator.


The model is saved in the ML Scenario and can be used for predictions through an inference pipeline.

Inference pipeline

Create this second and final pipeline in the ML Scenario’s main page.


Name the pipeline “20 Real-time prediction” and choose the “HANA ML Inference on API Requests” template. In case you are working with a slightly older version of SAP Data Intelligence and that template is not available, you can continue with the “HANA ML Inference” template.


This inference pipeline looks rather different from the training pipeline, as it serves a very different purpose. This pipeline is supposed to run permanently. The “OpenAPI Servlow” operator provides a REST-API, which can be called to obtain a prediction. That prediction is carried out through the model that was saved by the “10 Train” pipeline.

Quickly configure this pipeline. You only need to set the “Connection” to “MYHANA” as before. And set the “Key column” to “CAR_ID”. Then hit the “Save” icon at the top.


Select the new pipeline on the ML Scenario’s main page and click the “Deploy” icon.


Click through the options that come up until you can select the trained model from a drop-down. This model appears here, because it had been saved into the ML Scenario by our training pipeline. Continue with “Save”.


After a short while the pipeline should be “Running”. The “Deployment URL” shows the base path under which the pipeline can be called. Click the icon to the right of the URL to copy the address.

The pipeline is running and ready to provide predictions in real-time.


Real-time prediction

Let’s predict the price of the car!

Keep the Deployment URL from the previous step, you will need it soon. First, however, install a program to test the REST-API. You can use any tool you like that can call a REST-API. I will be using Postman. Just follow the link, then download and install it. This should all be possible without having to register anywhere.

Open Postman and create a new “Request” with the little “+” icon.


Copy the “Deployment URL” into the address bar that opened up. Extend this address by adding “v1/inference/” at the end. Also change the drop-down of the request type from “GET” to “POST”.


Go into the “Authorization” tab and set the “TYPE” to “Basic Auth”. Enter the username, which is the name of the tenant of SAP Data Intelligence, followed by “\” and your username. If you are working in the default tenant with user DS, you need to enter default\DS. Don’t forget to enter the password.


On the “Headers” tab, add the following entry:

  • Key: X-Requested-With
  • Value: XMLHttpRequest


And finally, in the “Body” tab specify the car for which you want a price prediction. Select “Raw” and enter this JSON syntax.

{
  "CAR_ID": [1],
  "VEHICLETYPE": ["coupe"],
  "YEAR": [2010],
  "GEARBOX": ["manual"],
  "HP": [120],
  "MODEL": ["C-Class"],
  "KILOMETER": [40000],
  "FUELTYPE": ["petrol"]
}


All that’s left is to click the “Send” button.


Lo and behold, we have a prediction! The score is the price the model estimates for our imaginary car. Try out different specifications, e.g. different values for the kilometers, and see how the predictions change.


With the REST-API now providing the predictions, business applications can make use of them in real-time. This video shows an example of a chatbot, which uses a similar REST-API from SAP Data Intelligence to provide car price estimates. The model behind this chatbot is a little simpler: it only requires the car’s model, the year in which it was built, and the car’s mileage.


Well done if you have read this far, and particularly if you have implemented this scenario!

Machine Learning requires data and algorithms coming together. In SAP HANA you already have both in the same place. With the hana_ml wrapper it’s easy (and I find it’s real fun) to work with the data and algorithms. The pipelines and templates of SAP Data Intelligence help bring the logic into production.

If you are keen to learn more, here is some further material:

Comments
      Alexander Römpler


      I get this error for the upload to HANA:

      DataError                                 Traceback (most recent call last)
      <ipython-input-14-2a3305273d5f> in <module>
            4                                                    table_name = 'USEDCARPRICES',
            5                                                    force = True,
      ----> 6                                                    replace = False)
      /opt/conda/lib/python3.7/site-packages/hana_ml/ in create_dataframe_from_pandas(connection_context, pandas_df, table_name, schema, force, replace, object_type_as_bin, table_structure, drop_exist_tab, allow_bigint)
         3346         parms = ("?," * len(rows[0]))[:-1]
         3347         sql = 'INSERT INTO {} VALUES ({})'.format(table_reference, parms)
      -> 3348         cursor.executemany(sql, rows)
         3349     cursor.close()
         3350     connection_context.connection.commit()
      DataError: (307, 'numeric value out of range: nan: type_code=7, index=8')


      Does anyone have an idea what’s wrong here?


      Andreas Forster
      Blog Post Author

      Hello Alex, what version is your hana_ml package? If it is below 1.0.8, please upgrade it before uploading the data to HANA.

      Show the version:

      import hana_ml
      print(hana_ml.__version__)

      Upgrade the package:

      !pip install hana_ml --upgrade


      In case upgrading the version does not help, try uploading the data into a new table, just in case there may be a problem dropping the existing/old table and its column types, i.e.:

      df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                         pandas_df = df_data, 
                                                         table_name = 'USEDCARPRICES2',
                                                         force = True,
                                                         replace = False)



      Maxime Simon

      I got the same error on a CAL instance.

      Not a Number (NaN) values cannot be inserted into numeric columns.


      To solve it, just set replace = True.


      As explained here, HANA will handle null values.
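      Under the hood, the fix comes down to NaN values reaching HANA as SQL NULLs rather than as out-of-range numerics. As a plain-Python illustration of the idea (not hana_ml's actual implementation), the transformation looks like this:

```python
import math

def nan_to_none(rows):
    """Replace float NaN cells with None, which database drivers insert as SQL NULL."""
    return [
        [None if isinstance(cell, float) and math.isnan(cell) else cell
         for cell in row]
        for row in rows
    ]

rows = [
    [1, "coupe", 18300.0],
    [2, "suv", float("nan")],  # price missing for this car
]
cleaned = nan_to_none(rows)
print(cleaned)  # [[1, 'coupe', 18300.0], [2, 'suv', None]]
```

      According to the comment above, setting replace = True lets hana_ml handle this null conversion for you during the upload.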


      Alexander Römpler

      Hi everyone,

      Andreas Forster:

      I ran all commands in the given order and noticed that the upgrade via “pip install” installs the new library somewhere, but the current session does not use it afterwards.
      Only after a kernel restart was the new version used.

      This has to be done after each restart of Data Intelligence.

      Maxime Simon:

      Thanks, that solved this problem.

      @all: Unfortunately I have another problem now. In chapter “Explore and prepare data” at “For a small sample of all cars retrieve only the year and the price.” I get this:

      TypeError                                 Traceback (most recent call last)
      <ipython-input-19-81108ebf347e> in <module>
           10                                training_percentage = 0.01,
           11                                testing_percentage = 0.99,
      ---> 12                                validation_percentage = 0)
           14 df_sample =, col_name_2).collect()
      TypeError: train_test_val_split() got an unexpected keyword argument 'conn_context'


      By the way, the function help says nothing about a conn_context parameter.

      The hana_ml lib-loading etc. was previously added to the second notebook and the version is:

      Andreas Forster
      Blog Post Author

      Hello Alexander, a new hana_ml version was released recently, which deals differently with the connection context and possibly with the data upload.

      Please uninstall your current hana_ml version in Jupyter and install version 1.0.8.post11

      !pip uninstall hana_ml -y
      !pip install hana-ml==1.0.8.post11

      I hope to update the blog soon.


      Andreas Forster
      Blog Post Author

      I reproduced the "numeric value out of range" error with hana_ml version 2.5. With 1.0.8.post11 the code still works for me as expected. The blog is now updated to state that this version has to be installed. I will let the product group know of this behaviour.

      Alexander Römpler

      Now everything works as desired.

      Thank you very much for this great article and the quick help!

      Shishupalreddy Ramreddy

      Good blog to know SAP DI - ML flow. Thank you for sharing.

      Can you please let me know if the programming paradigm below is supported by SAP DI:

      Importing user-defined .py or Jupyter notebook files into other notebooks, similar to the import of standard Python libraries or HANA wrappers.


      Thank you


      Andreas Forster
      Blog Post Author

      Hi Shishupal,
      Thank you for the feedback. The following hands-on explains how to leverage additional libraries (e.g. sklearn) in the Notebooks as well as in the pipelining. Custom Python files could be added to the Docker environment to make them available to a pipeline operator.

      Shishupalreddy Ramreddy

      Thank you Andreas for the helpful information. I will try it out.

      Marco Zangrilli

      A very useful article

      Tobias Gorhan

      Very detailed and informative article. Thank you very much, Andreas.


      I tried a similar approach with the Scenario Manager, with a training and an inference pipeline for classification on data stored in HANA.

      Executing the training works fine, and the deployment of the inference pipeline is running fine as well, until I send the POST request from Postman.

      Error in Postman:

          "error": "Error loading model. Check logs from the graph for details."

      Error in Modeler:

      Graph failure: Error while executing callback registered on port(s) ['inferenceData']: (-10807, "System call 'recv' failed, rc=104:Connection reset by peer {XX.XXX.XXX.XXX:30253}") [file '/vrep/vflow/subengines/com/sap/python36/operators/com/sap/ml/hana/inference/', line 47]

      (X inserted to hide IP)


      We also tried other OpenAPI applications with the same result; it seems like an error from the HANA side. (Currently using a MultiDB BW/HANA 2.0 SP02 Version –

      Several Notes linking to this error have been tested, without success.


      Do you have any idea how to solve this issue?


      Kind Regards




      Andreas Forster
      Blog Post Author

      Hello Tobias,
      The issue could be related to the fact that hana_ml is only supported from HANA 2.0 SPS04 onwards.
      If you have a support contract, can you log a ticket with technical support?

      Raevanth Kumar

      Hi Andreas,

      My team is using SAP Data Intelligence (trial edition), we are working on house price prediction using HANA ML Training operator. We followed this blog.

      We are getting an error in the HANA_ML Training operator stating "Graph failure:
      Error while executing callback registered on port(s) ['trigger']: PAL is not installed". We have installed "hana-ml==1.0.8" but it still states PAL is not installed.

      Please let me know how to fix this error.


      FYI : We have deployed DI Trial edition in Google Cloud Platform.

      I have attached screenshots of error message, HANA_ML Configuration and HANA_ML Script.


      Error Message : (Output ports and respective input ports are properly connected)



      HANA_ML operator configuration:


      HANA_ML - Script Editor :
      Kind regards,
      Raevanth Kumar


      Andreas Forster
      Blog Post Author

      Hi Raevanth Kumar,
      The Predictive Analysis Library (PAL) has to be installed on SAP HANA. The hana_ml library on its own is not sufficient; it is an interface to use the PAL.

      I am not sure which house price data you are using. Were you able to execute the following code in a Jupyter Notebook for car price prediction from an earlier step in the blog? = df_remote_train,
               features = features,
               key = "CAR_ID",
               label = 'PRICE')

      That code is also triggering the PAL. Only once this succeeds is there a chance that the pipeline will work.

      Raevanth Kumar

      Thanks for the clarification. We followed this blog for using the hana_ml pipeline, but we tried to work on a different scenario (house price prediction) with different datasets.

      Rodrigo Silveira

      Hello Guys,

      I'm facing this problem trying to reproduce this blog, with the function partition.train_test_val_split:

      from hana_ml.algorithms.pal import partition
      df_remote_sample, df_remote_ignore1, df_remote_ignore2 = partition.train_test_val_split(
                                     data = df_VBAK_remote, 
                                     random_seed = 1972,
                                     training_percentage = 0.8, 
                                     testing_percentage = 0.2, 
                                     validation_percentage = 0)


      (423, 'AFL error: AFL DESCRIBE for nested call failed - invalid table(s) for ANY-procedure call (Input table 0: column 16 (starting with 0) has invalid SQL type.): line 29 col 1 (at pos 819)')
      My environment:
      • SAP DI 3.1
      • HANA (fa/hana2sp04)
      • HANA FULL Version -
      • AFL release 2.0
      • AFL rev-number 048
      • AFL rev-patchlevel 0100


      Does anyone have any idea what may be happening?



      Rodrigo Silveira

      Andreas Forster
      Blog Post Author

      Hi Rodrigo Silveira, that error typically means that an unsupported column type is used.
      You are working with your own df_VBAK_remote; did the code succeed with the blog's df_remote pointing to USEDCARPRICES_PREPVIEW?

      To make the code work on your own data, please verify that the column types are supported by the underlying PAL algorithm

      If a column is not supported, you can

      • either cast the DataFrame column to a supported format (this does not change the physical column)
      • or drop the unsupported column from the DataFrame (this is also purely semantic, the physical table is not changed)
      • or create the DataFrame with just supported columns, this would be an easy way to test. Instead of pointing to the full table, you can base the DataFrame on a SELECT.
        df_remote = conn.sql('SELECT COL1, COL2 FROM "ML"."BANKMARKETING" ORDER BY "CUSTOMER_ID" ASC')


      Rodrigo Silveira

      Thanks Andreas. You are right; in my case the NETWR field (Net Value of the Sales Order in Document Currency) of VBAK is not supported. Unfortunately, that was the field I was supposed to predict. Do you have any workaround or tip for that?


      Best regards,

      Rodrigo Silveira.

      Andreas Forster
      Blog Post Author

      Hello Rodrigo Silveira, you could cast the column type in the HANA DataFrame, just before the partitioning. Something like:

      df_VBAK_remote = df_VBAK_remote.cast('NETWR', 'DOUBLE')
      Rodrigo Silveira

      Cool! Thanks Andreas!

      Indrit Asllani

      Hi Andreas,

      I'm facing this error

      I'm using the SAP Data Intelligence 3.0 trial version, which uses a preconfigured local HANA database connection for training purposes. I have tried to uninstall the hana_ml library and install the latest version.

      Can you help me solve this problem?



      Andreas Forster
      Blog Post Author

      Hello Indrit Asllani, this error means that the PAL is not installed on HANA itself.
      The hana_ml wrapper is a separate installation within Python; it does not install PAL on HANA.
      I am not sure which HANA environment ships with the trial. Could you reach out to the DI trial team for guidance on installing the PAL (Predictive Analysis Library)?



      Kevin Geiger

      Hi Andreas, at this point I'm not using SAP Data Intelligence. Can I still upload a pandas dataframe to a HANA dataframe? I have version 2.5.20062600 of the hana_ml library installed, which sounds like that is a problem.

      Also, can I connect to an on-prem rev. 52 instance of HANA for this script? I do connect for some other Python testing and have no issues with basic reads from HANA tables and calculation view models, and with performing basic dataframe operations.

      Best Regards

      Kevin Geiger

      Andreas Forster
      Blog Post Author

      Hi Kevin Geiger, yes absolutely, you can use the hana_ml wrapper also without SAP Data Intelligence, and also with an on-prem HANA. We just updated this sample code, which uses local Jupyter Notebooks. Other Python environments should also work.

      Just check with your SAP Account Executive please whether this access is covered by your HANA license.

      The latest version of hana_ml supports SPS05 and HANA Cloud.

      I understand your revision 52 is HANA 2 SPS05, so this would also be supported.


      Kevin Geiger

      Thanks Andreas, sorry to have to ask this. Where do I pull this latest version of the hana_ml library from? I believe the one I currently run was part of my HANA Studio install.

      I installed a standalone version of Python and used pip install hana_ml to get to the version I'm currently running: Python 3.8.7 and, as mentioned, hana_ml 2.5.20062600.


      Best regards

      Kevin Geiger

      Andreas Forster
      Blog Post Author

      Hello Kevin Geiger ,
      no worries at all. hana_ml can be upgraded with pip as well.
      The different versions are documented on

      Upgrade to the latest hana_ml version with

      !pip install --upgrade hana_ml

      Then restart the kernel in Jupyter

      You can also upgrade to a specific hana_ml version, for instance

      !pip uninstall hana_ml --yes
      !pip install hana-ml==2.6.21011300
      Kevin Geiger

      Thanks Andreas, I was able to upgrade to 2.6... and that did get me past the issue I ran into.  Very nice blog to help learn how this works.

      Kevin Geiger

      Hi Andreas, I ran into some issues today with sapdi. Can I use that with my basic Python / on-premise HANA system? It looks like I may need to do some research on SAP BTP and Data Attribute Recommendation before I can proceed any further with this.

      Best Regards

      Kevin G

      David Alzate

      Hi Kevin,


      I am in the same situation. I had to perform feature engineering before training the predictive model and, due to the complexity of the calculation, I used Python. Thus I used the hana-ml library, left it on the server, and I am automating the process from an ABAP program.

      I would recommend this GitHub repository to you:

      hana-ml-samples/Python-API at main · SAP-samples/hana-ml-samples · GitHub

      It contains good examples of using the SAP hana_ml Python library. Last but not least, Andreas has created several blogs that explain the process of using the library, and the authorization process for using it.


      PS: I found a survey from SAP this week and I pointed out the difficulty for us, the developers, in using these new technologies in internal PoCs. I hope they can hear us.

      Here is the survey:



      David Alzate