SAP Data Intelligence: Create your first ML Scenario

You may know that SAP Data Intelligence is SAP’s new platform to create and deploy Machine Learning (ML) projects. Whether your data is structured or unstructured, in SAP HANA or outside, SAP Data Intelligence is your new friend.

This blog shows how to create a very basic ML project in SAP Data Intelligence. We will train a linear regression model in Python with just a single predictor column. Based on how many minutes it takes a person to run a half-marathon, the model will estimate the person’s marathon time. The trained model will then be exposed as a REST-API for inference, so that your business applications can leverage these predictions in your day-to-day processes. After this introductory tutorial you will be familiar with the core components of an ML scenario in SAP Data Intelligence, and the implementation of more advanced requirements is likely to follow a similar pattern.

Should you have access to a system, you can follow along hands-on. You can currently request a trial or talk to your Account Executive anytime.

Training data

Our dataset is a small CSV file, which contains the running times (in minutes) of 117 people, who ran both the Zurich Marathon as well as a local Half-Marathon in the same region, the Greifenseelauf. The first rows show the fastest runners:

ID  HALFMARATHON_MINUTES  MARATHON_MINUTES
1   73                    149
2   74                    154
3   78                    158
4   73                    165
5   74                    172

 

SAP Data Intelligence can connect to a range of sources. However, to follow this tutorial hands-on, please place this file into your own Amazon S3 bucket.

Alternatively, you can also store the file in SAP Data Intelligence’s internal data lake. Just follow the advice of Philipp Zaltenbach in this comment on which adjustments are needed should you prefer that option.

In case you have placed the file into an S3 bucket, you need to know:

  • the endpoint URL of the S3 server, without your bucket name, e.g. s3.amazonaws.com
  • the authentication region, e.g. us-east-1
  • the access key and secret key

Data connection

Log on to SAP Data Intelligence and you will see the different application tiles. Click into “Connection Management”, where you can centrally define a connection to the S3 bucket that holds the data.

 

You may see a number of connections that have already been set up. Click “Create”, set the id to “s3_files” and set the “Connection Type” to “S3”. Enter the bucket’s endpoint, the region and the access key as well as the secret key.

Verify the settings with “Test Connection” and click “Create”.

 

Data exploration and free-style Data Science

Now you are ready to work with the data in SAP Data Intelligence. Go back to the main page of SAP Data Intelligence and click into the “ML Scenario Manager”, where you will carry out all Machine Learning related activities.

Click the little “+”-sign on the top right to create a new scenario. Name it “Marathon time prediction”; you can enter further details into the “Business Question” section. Click “Create”.

 

You see the empty scenario. You will use the Notebooks to explore the data and to script the regression model in Python. Pipelines bring the code into production. Executions of these pipelines will create Machine Learning models, which are then deployed as a REST-API for inference.

 

But one step at a time. Next, you load and explore the data in a Notebook. Select the “Notebooks” tab, then click the “+”-sign.  Name the Notebook “10 Data exploration and model training”. Click “Create” and the Notebook opens up. You will be prompted for the kernel, keep the default of “Python 3”.

 

Now you are free to script in Python, to explore the data and train the regression model. The central connection to your S3 bucket can be leveraged by accessing the keys.

import notebook_hana_connector.notebook_hana_connector
di_connection = notebook_hana_connector.notebook_hana_connector.get_datahub_connection(id_="s3_files")
access_key = di_connection["contentData"]["accessKey"]
secret_key = di_connection["contentData"]["secretKey"]

 

Install the library that is required for Python to connect to the bucket.

!pip install boto3

 

The CSV file can then be loaded into a pandas DataFrame. Just make sure that the value assigned to the bucket corresponds with your own bucket name.

import boto3
import pandas as pd
import io
client = boto3.client('s3', aws_access_key_id=access_key, 
                            aws_secret_access_key=secret_key)
bucket = 'i056450-datasets'
object_key = 'RunningTimes.csv'
csv_obj = client.get_object(Bucket=bucket, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')
df_data = pd.read_csv(io.StringIO(csv_string), sep=";")
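To try the parsing step without S3 access, the same semicolon-separated layout can be read from an in-memory string. This is only a minimal sketch: it uses the standard library's csv module instead of pandas, the sample rows are the five shown in the "Training data" section, the file layout is assumed from the sep=";" argument, and the name parse_running_times is purely illustrative.

```python
import csv
import io

# The five sample rows shown earlier, in the semicolon-separated layout
# that the sep=";" argument implies (assumed file layout).
SAMPLE_CSV = """ID;HALFMARATHON_MINUTES;MARATHON_MINUTES
1;73;149
2;74;154
3;78;158
4;73;165
5;74;172
"""

def parse_running_times(csv_string):
    """Return a list of (half_marathon_minutes, marathon_minutes) tuples."""
    reader = csv.DictReader(io.StringIO(csv_string), delimiter=";")
    return [
        (int(row["HALFMARATHON_MINUTES"]), int(row["MARATHON_MINUTES"]))
        for row in reader
    ]

rows = parse_running_times(SAMPLE_CSV)
print(len(rows), rows[0])
```

If the column names in your file differ, the dictionary keys need to be adjusted accordingly.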

 

Use Python to explore the data. Display the first few rows for example.

df_data.head(5)

 

Or plot the runners’ half-marathon time against the full marathon time.

x = df_data[["HALFMARATHON_MINUTES"]]
y_true = df_data["MARATHON_MINUTES"]

%matplotlib inline
import matplotlib.pyplot as plot
plot.scatter(x, y_true, color = 'darkblue');
plot.xlabel("Minutes Half-Marathon");
plot.ylabel("Minutes Marathon");

 

There is clearly a linear relationship. No surprise: if you can run a marathon fast, you are also likely to be one of the faster half-marathon runners.

 

Train regression model

Begin by installing the scikit-learn library, which is very popular for Machine Learning in Python on tabular data such as ours.

!pip install scikit-learn

 

Then continue by training the linear regression model with the newly installed scikit-learn library. In case you are not familiar with a linear regression, you might enjoy the openSAP course “Introduction to Statistics for Data Science“.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x, y_true)
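With a single predictor, the least-squares line has a simple closed-form solution. The following plain-Python sketch shows the idea; it is not how scikit-learn implements the fit internally, the function name fit_line is hypothetical, and the data points are made up so that the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

After calling fit() on a scikit-learn LinearRegression, the corresponding values are available as lm.coef_ and lm.intercept_.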

 

Plot the linear relationship that was estimated.

plot.scatter(x, y_true, color = 'darkblue');
plot.plot(x, lm.predict(x), color = 'red');
plot.xlabel("Actual Minutes Half-Marathon");
plot.ylabel("Actual Minutes Marathon");

 

Calculate the Root Mean Squared Error (RMSE) as quality indicator of the model’s performance on the training data. This indicator will be shown later on in the ML Scenario Manager.

In this introductory example the RMSE is calculated rather manually. This way, I hope, it is easier to understand its meaning, in case you are not yet familiar with it. Most Data Scientists would probably leverage a Python package to shorten this bit.

import numpy as np
y_pred = lm.predict(x)
mse = np.mean((y_pred - y_true)**2)
rmse = np.sqrt(mse)
rmse = round(rmse, 2)
print("RMSE: " , str(rmse))
print("n: ", str(len(x)))
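To make the meaning of the RMSE concrete, here is a small worked example in plain Python with made-up numbers. The errors are squared, averaged, and the square root brings the result back to the unit of the target (minutes). The function name rmse and the values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the root."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: the predictions are off by +3 and -4 minutes.
value = rmse([150, 160], [153, 156])
print(round(value, 2))
```

Here the squared errors are 9 and 16, so the mean squared error is 12.5 and the RMSE is its square root.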

 

A statistician may want to run further tests on the regression, e.g. on the distribution of the errors. We skip these steps in this example.

 

Save regression model

You could apply the model immediately on a new observation. However, to deploy the model later on, we will be using one graphical pipeline to train and save the model and a second pipeline for inference.

For clarity, we also spread the Python code across two separate Notebooks at this point, so we finish this Notebook by saving the model. The production pipeline will save the model as a pickled object, hence the model is saved as a pickled object here in the Notebook as well.

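import pickle
# Use a context manager so the file is closed once the model is written.
with open("marathon_lm.pickle.dat", "wb") as model_file:
    pickle.dump(lm, model_file)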

 

Don’t forget to save the Notebook itself as well.

 

Load and apply regression model

To create the second Notebook, go back to the overview page of your ML Scenario and click the “+”-sign in the Notebooks section.

 

Name the new Notebook “20 Apply model”, confirm “Python 3” as kernel and the empty Notebook opens up. Use the following code to load the model that has just been saved.

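import pickle
# Use a context manager so the file is closed once the model is loaded.
with open("marathon_lm.pickle.dat", "rb") as model_file:
    lm_loaded = pickle.load(model_file)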

 

It’s time to apply the model. Predict a runner’s marathon time if the person runs a half-marathon in 2 hours.

x_new = 120
predictions = lm_loaded.predict([[x_new]])
round(predictions[0], 2)

 

The model estimates a time of just under 4 hours and 24 minutes.
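Since the model returns minutes, a small helper can turn a prediction into hours and minutes for display. This is just a sketch: the function name format_minutes is hypothetical and the input value below is made up, not the model's actual output.

```python
def format_minutes(total_minutes):
    """Convert a duration in minutes to an 'H h MM min' string."""
    hours, minutes = divmod(round(total_minutes), 60)
    return f"{hours} h {minutes:02d} min"

# 263.9 is a made-up value, not the model's actual output.
print(format_minutes(263.9))
```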

 

Deployment

Now everything is in place to start deploying the model in two graphical pipelines.

  • One pipeline to train the model and save it into the ML Scenario.
  • Another pipeline to surface the model as a REST-API for inference.

 

Training pipeline

To create the graphical pipeline to retrain the model, go to your ML Scenario’s main page, select the “Pipelines” tab and click the “+”-sign.

 

Name the pipeline “10 Train” and select the “Python Producer”-template.

 

You should see the following pipeline, which we just need to adjust. In principle, the pipeline loads data with the “Read File”-operator. That data is passed to a Python-operator, in which the ML model is trained. The same Python-operator stores the model in the ML Scenario through the “Artifact Producer”. The Python-operator’s second output can pass a quality metric of the model to the same ML Scenario. Once both model and metric are saved, the pipeline’s execution is ended with the “Graph Terminator”.

 

Now adjust the template to our scenario. Begin with the data connection. Select the “Read File”-operator and click the “Open Configuration” option.

 

In the Configuration panel on the right-hand side set “Service” to “S3”. Then open the “Connection”-configuration, set “Configuration Type” to “Configuration Manager” and set the “Connection Id” to the connection you created earlier in the Connection Management. In this example the connection is called “s3_files”.

 

Save these settings. Now enter the name of your bucket into the “bucket” property. The path property will store the name of the file in the S3-bucket. The template however uses a parameter for this value. Hence when running the pipeline you will be prompted to enter the file name. Your settings should now be similar to:

 

Next we adjust the Python code that trains the regression model. Select the “Python 3”-operator and click the Script-option.

 

The template code opens up. It shows how to pass the model and metrics into the ML Scenario. Replace the whole code with the following. That code receives the data from the “Read File”-operator, uses the code from the Notebook to train the model and passes the trained model as well as its quality indicator (RMSE) to the ML Scenario.

# Example Python script to perform training on input data & generate Metrics & Model Blob
def on_input(data):
    
    # Obtain data
    import pandas as pd
    import io
    df_data = pd.read_csv(io.StringIO(data), sep=";")
    
    # Get predictor and target
    x = df_data[["HALFMARATHON_MINUTES"]]
    y_true = df_data["MARATHON_MINUTES"]
    
    # Train regression
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(x, y_true)
    
    # Model quality
    import numpy as np
    y_pred = lm.predict(x)
    mse = np.mean((y_pred - y_true)**2)
    rmse = np.sqrt(mse)
    rmse = round(rmse, 2)
    
    # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
    metrics_dict = {"RMSE": str(rmse), "n": str(len(df_data))}
    
    # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics 
    api.send("metrics", api.Message(metrics_dict))

    # create & send the model blob to the output port - Artifact Producer operator will use this to persist the model and create an artifact ID
    import pickle
    model_blob = pickle.dumps(lm)
    api.send("modelBlob", model_blob)
    
api.set_port_callback("input", on_input)
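The api object is injected by the pipeline runtime, so the script above cannot be run as-is in a Notebook. One way to exercise the callback logic locally is a small stand-in object. The sketch below is an assumption-laden illustration: FakeApi is a hypothetical helper that only mimics the three calls used in this tutorial (Message, send, set_port_callback), and the on_input body is simplified to count rows instead of training a model.

```python
class FakeApi:
    """Hypothetical stand-in for the `api` object the pipeline injects."""

    class Message:
        def __init__(self, body=None, attributes=None):
            self.body = body
            self.attributes = attributes or {}

    def __init__(self):
        self.sent = []        # (port, payload) pairs recorded by send()
        self.callbacks = {}   # port name -> registered callback

    def send(self, port, payload):
        self.sent.append((port, payload))

    def set_port_callback(self, port, callback):
        self.callbacks[port] = callback

api = FakeApi()

def on_input(data):
    # Simplified body: count data rows instead of training a model.
    n_rows = len(data.strip().splitlines()) - 1  # minus the header line
    api.send("metrics", api.Message({"n": str(n_rows)}))

api.set_port_callback("input", on_input)

# Simulate the "Read File" operator delivering the file content.
api.callbacks["input"]("ID;HALFMARATHON_MINUTES;MARATHON_MINUTES\n1;73;149\n2;74;154\n")
print(api.sent[0][0], api.sent[0][1].body)
```

The same pattern could wrap the full training code, provided pandas and scikit-learn are installed in the local environment.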

 

Close the Script-window, then hit “Save” in the menu bar.

 

Before running the pipeline, we just need to create a Docker image for the Python operator. This gives you the flexibility to leverage virtually any Python library; you just need to provide the Docker file that installs the necessary libraries. To create one, click into the “Repository”-tab on the left, then right-click the “Dockerfiles” folder and select “Create Docker File”.

 

Name the file python36marathon.

 

Enter this code into the Docker File window. This code leverages a base image that comes with SAP Data Intelligence and installs the necessary libraries on it. It’s advised to specify the versions of the libraries to ensure that new versions of these libraries do not impact your environment.

FROM $com.sap.opensuse.python36
RUN python3.6 -m pip install numpy==1.16.4
RUN python3.6 -m pip install pandas==0.24.0
RUN python3.6 -m pip install scikit-learn

 

SAP Data Intelligence uses tags to indicate the content of the Docker File. These tags are used in a graphical pipeline to specify in which Docker image an Operator should run. Open the Configuration panel for the Docker File with the icon on the top-right hand corner.

 

You have to add the existing tags “opensuse”, “python36” and “tornado”. For “tornado” also enter the version “5.0.2”. Now add a custom tag to be able to specify that this Docker File is for the Marathon case. You can add a tag named “marathontimes”. The “Configuration” should look like:

 

Now save the Docker file and click the “Build”-icon to start building the Docker image.

 

Wait a few minutes and you should receive a confirmation that the build completed successfully.

 

Now you just need to configure the Python operator, which trains the model, to use this Docker image. Go back to the graphical pipeline “10 Train”. Right-click the “Python 3”-operator and select “Group”.

 

Such a group can contain one or more graphical operators. And on this group level you can specify which Docker image should be used. Select the group, which surrounds the “Python 3” Operator. Now in the group’s Configuration select the tag “marathontimes”. Save the graph.

 

The pipeline is now complete and we can run it. Go back to the ML Scenario. On the top left you notice a little orange icon, which indicates that the scenario was changed after the current version was created.

 

Hence, create a new version, so that the latest version contains the latest changes. Click “Create Version”. Enter a description, if you wish.

 

Version 2 was successfully created. It contains the latest changes, and the orange symbol has disappeared.

 

You can now execute this graph, to train and save the model! Select the pipeline in the ML Scenario and click the “Execute” button on the right.

 

Skip the optional steps until you get to the “Pipeline Parameters”. Set “newArtifactName” to lm_model. The trained regression model will be saved under this name. And set the inputFilePath to RunningTimes.csv. This is the file in your S3 bucket, which the pipeline will load.

 

Click “Save”. Wait a few seconds until the pipeline executes and completes. Just refresh the page once in a while and you should see the following screen. The metrics section shows the trained model’s quality indicator (RMSE = 16.96) as well as the number of records that were used to train the model (n = 116). The model itself was saved successfully under the name “lm_model”.

 

If you scroll down on the page, you see how the model’s metrics as well as the model itself have become part of the ML scenario.

 

We have our model, now we want to use it for real-time inference.

Prediction / Inference with REST-API

Go back to the main page of your ML Scenario and create a second pipeline. This pipeline will provide the REST-API to obtain predictions in real-time. Hence call the pipeline “20 Apply REST-API”. Now select the template “Python Consumer”. This template contains a pipeline that provides a REST-API. We just need to configure it for our model.

 

The “OpenAPI Servlow” operator provides the REST-API. The “Artifact Consumer” loads the trained model from our ML scenario and the “Python36 – Inference” operator ties the two operators together. It receives the input from the REST-API call (here the half-marathon time) and uses the loaded model to create the prediction, which is then returned by the “OpenAPI Servlow” to the client, which had called the REST-API.

 

We only need to change the “Submit Artifact Name” and “Python36 – Inference”-operators. Select the “Submit Artifact Name” icon and in the configuration set the value for “Content” to ${modelTechnicalIdentifier}. This change will enable us to pass the model’s technical identifier to the pipeline.

All other changes are for the “Python36 – Inference”-operator. At the top of the code, in the on_model() function, replace the single line “model = model_blob” with

import pickle
model = pickle.loads(model_blob)
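The training pipeline serializes the model with pickle.dumps() into a bytes object, and pickle.loads() reverses that step here. The round trip can be sketched in isolation; a plain dictionary with made-up values stands in for the trained model.

```python
import pickle

# A plain dict stands in for the trained model (hypothetical values).
model = {"coef": 2.0, "intercept": 10.0}

model_blob = pickle.dumps(model)     # bytes, like the payload sent to "modelBlob"
restored = pickle.loads(model_blob)  # what on_model() reconstructs

print(type(model_blob).__name__, restored == model)
```

Keep in mind that pickled data should only be loaded from sources you trust, since unpickling can execute arbitrary code.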

 

Add a variable to store the prediction. Look for the on_input() function and add the line “prediction = None” as shown:

def on_input(msg):
    error_message = ""
    success = False
    prediction = None # Added by I056450

 

Below the existing comment “# obtain your results” add the syntax to extract the input data (half-marathon time) and to carry out the prediction:

# obtain your results
hm_minutes = json.loads(user_data)['half_marathon_minutes']
prediction = model.predict([[hm_minutes]])

 

Further at the bottom of the page, change the “if success:” section to:

if success:
    # apply carried out successfully, send a response to the user
    msg.body = json.dumps({'marathon_minutes_prediction': round(prediction[0], 1)})
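The snippets above assume that the request body always contains a valid half_marathon_minutes value. A more defensive variant could validate the input first; the sketch below is one possible approach, not part of the template. The function name parse_request and both bounds are hypothetical placeholders; in practice the bounds would be taken from the training data, in line with the advice not to extrapolate.

```python
import json

# Hypothetical bounds: placeholders that should be replaced by the
# minimum and maximum half-marathon times found in the training data.
MIN_MINUTES = 73
MAX_MINUTES = 180

def parse_request(user_data):
    """Return (minutes, error_message); exactly one of them is None."""
    try:
        minutes = json.loads(user_data)["half_marathon_minutes"]
    except (ValueError, KeyError, TypeError):
        return None, 'Expected JSON like {"half_marathon_minutes": 120}'
    if isinstance(minutes, bool) or not isinstance(minutes, (int, float)):
        return None, "half_marathon_minutes must be a number"
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        return None, "half_marathon_minutes is outside the training range"
    return minutes, None

print(parse_request('{"half_marathon_minutes": 120}'))
print(parse_request('{"minutes": 120}'))
```

The returned error message could be assigned to the error_message variable that the template already provides.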

 

Close the editor window. Finally, you just need to assign the Docker image to the “Python36 – Inference” operator. As before, right-click the operator and select “Group”. Add the tag “marathontimes”.

 

Save the changes and go back to the ML Scenario. Here create a new version as before.

 

To deploy the pipeline, copy the technical identifier of the trained model.

 

Now select the pipeline “20 Apply REST-API” and click the “Deploy” icon.

 

Click through the screens until you can enter the technical identifier, that you had just copied to the clipboard. Click “Save”.

 

Wait a few seconds and the pipeline is running!

 

As long as that pipeline is running, you have the REST-API for inference. So let’s use it! There are plenty of applications you could use to test the REST-API. My personal preference is Postman, hence the following steps use Postman. You are free to use any other tool of course.

Copy the deployment URL from the above screen. Do not worry, should you receive a message along the lines of “No service at path XXX does not exist”. This URL is not yet complete, hence the error might come up should you try to call the URL.

Now open Postman and enter the Deployment URL as request URL. Extend the URL with v1/uploadjson/. Change the request type from “GET” to “POST”.

 

Go to the “Authorization”-tab, select “Basic Auth” and enter your user name and password for SAP Data Intelligence. The user name starts with your tenant’s name, followed by a backslash and your actual user name.

 

Go to the “Headers”-tab and enter the key “X-Requested-With” with value “XMLHttpRequest”.

 

Finally, pass the input data to the REST-API. Select the “Body”-tab, choose “raw” and enter this JSON syntax:

{
    "half_marathon_minutes": 120
}

 

Press “Send” and you should see the prediction that comes from SAP Data Intelligence! Should you run a half-marathon in 2 hours, the model estimates a marathon time of just under 4 hours 24 minutes. Try the REST-API with different values to see how the predictions change. Just don’t extrapolate; stay within the range of half-marathon times in the training data.
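The same request can also be assembled in Python instead of Postman. The sketch below only builds the request with the standard library and does not send it. All concrete values (URL, tenant, user, password) are placeholders, the function name build_prediction_request is hypothetical, and the Content-Type header is an assumption that is not mentioned in the Postman steps.

```python
import base64
import json
import urllib.request

def build_prediction_request(deployment_url, tenant, user, password, minutes):
    """Build (but do not send) the POST request described above."""
    url = deployment_url.rstrip("/") + "/v1/uploadjson/"
    # Basic Auth user name: tenant name, backslash, actual user name.
    credentials = f"{tenant}\\{user}:{password}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/json",  # assumption, see lead-in
    }
    body = json.dumps({"half_marathon_minutes": minutes}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# All values below are placeholders, not a real deployment.
request = build_prediction_request(
    "https://example.invalid/deployment/abc", "default", "myuser", "secret", 120
)
print(request.get_method(), request.full_url)
# To send it against a real deployment: urllib.request.urlopen(request).read()
```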

 

Summary

You have used Python in Jupyter Notebooks to train a Machine Learning model, and the code has been deployed in a graphical pipeline for productive use. Your business processes and applications can now leverage these predictions. The business users who benefit from these predictions are typically not exposed to the underlying technologies. They should just receive the information they need, where and when they need it.

Let me give an example that is more business related than running times. Here is a screenshot of a Conversational AI chatbot, which uses such a REST-API from SAP Data Intelligence to estimate the price of a vehicle. The end user is having a chat with a bot and just needs to provide the relevant information about a car to receive a price estimate.

Happy predicting!

 

 

 


            
23 Comments
  • Thank you so much Andrea, a very nice end-to-end tutorial!

    I’m sharing a couple of notes for issues that I’ve noticed. They’re all small things, to keep the tutorial smooth to follow!

    1. Regarding the data exploration notebook
      1. get_datahub_connection(id_="s3_files")

        should be capitalized, to become:

        get_datahub_connection(id_="S3_files")

        For consistency with the Connection Manager configurations.

      2. To allow inline visualization (e.g. the plots) we should add:
        %matplotlib inline

        Before running any plot function.

      3. Similar to the boto libraries, scikit-learn might also need to be installed:
        !pip install scikit-learn
    2. Regarding the Train pipeline
      1. The group configurations for the python operator should have the tag “marathontimes” instead of “python3marathon”

     

    Thank you so much for your time and work!

  • SAP Data Intelligence also comes with a built-in Data Lake. If you would like to use that embedded Data Lake instead of AWS S3 you need to do the following changes:

    Section „Data Connection“: Instead of defining the connection to S3, use the pre-defined connection for the DI Data Lake. In the Metadata Explorer go via “Browse Connections” to “DI_DATA_LAKE” and directly upload the CSV file (e.g. in folder “shared”).

    Section “Data exploration and free-style Data Science”: In the Jupyter Notebook omit the code for connecting to S3, but instead use the following code to connect to the DI Data Lake and read the CSV file.

    !pip install hdfs
    import pandas as pd
    from hdfs import InsecureClient
    client = InsecureClient('http://datalake:50070')
    with client.read('/shared/RunningTimes.csv', encoding = 'utf-8') as reader:
        df = pd.read_csv(reader, sep=";")

    The access to the built-in data lake from within Jupyter is also described in the documentation.

    Section “Training Pipeline”: Instead of configuring the “Read File” operator for S3, configure the parameter “Service” and set it to “SDL”. Furthermore, adjust the input file path to the execution step in the “Training Pipeline” section, i.e. inputFilePath = shared/RunningTimes.csv

    That’s all you need to do.

     

  • Thnx Andreas for this great tutorial.

    I got error 502 “bad gateway” when sending the POST request.

    It could be fixed by extending the Python Consumer template:
    when returning the response, a new instance of api.Message is created.

    request_id = msg.attributes['message.request.id']
    response = api.Message(attributes={'message.request.id': request_id}, body=msg.body)
    api.send('output', response)

  • What needs to be changed to train a different type of model using the pipeline interface? I trained a regression tree model (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) successfully in the Jupyter Notebook interface, then transferred the code to the Python3 operator in the pipeline interface. That operator seems to run fine, but the pipeline gets stuck on the “Artifact Producer” operator and the model is not produced (however, the metrics are produced). The pipeline stays in “running” status and never reaches “completed”.

    • Working with SAP support we found that there is a 10MB size limit on objects to fit through the pipeline between the Python3 operator and the Artifact Producer operator. In my case the model was larger and was getting stuck. We solved this by grouping those two operators together, tagged with the dockerfile below. Note: some of these settings are specific to my instance of DI

       

      FROM §/com.sap.datahub.linuxx86_64/vsolution-golang:2.7.44
      
      RUN python3.6 -m pip --no-cache-dir install \
      requests==2.18.4 \
      tornado==5.0.2 \
      minio==3.0.3 \
      hdfs==2.5.0 \
      Pillow==6.0.0 \
      tensorflow==1.13.1 \
      numpy==1.16.4 \
      pypng==0.0.19 \
      pandas==0.24.0 \
      scikit-learn
      • You don’t need to have all operators running the same image/dockerfile to be able to run them in the same pod. A pod can have several containers based on different images. In DH, the way you tell it to run all operators in the same pod is by adding them all to the same group.

  • Very interesting and complete blog Andreas Forster, thank you!

    I am trying to do the same having loaded the marathon times file on a HANA table.

    In the Jupyter notebook all good, I have used the new python APIs, but then? How should I change the training pipeline?

    Thanks

    Elisa

     

    • Hello Elisa, thank you for the feedback! To follow this tutorial with data that is held in SAP HANA you need to replace the “Read File” operator in the training pipeline with a “HANA Client”. The select statement that retrieves the history can be passed to the HANA Client with a “Constant Operator”. And a “Format Converter” after the HANA Client might be needed to get the CSV format this tutorial is based on.
      However, with data in HANA, there is an alternative that I would prefer. Since HANA has predictive algorithms, you can train the model directly in SAP HANA without data extraction. We have new Python and R wrappers that make this pretty easy. Here are some links on those wrappers:
      https://blogs.sap.com/2019/08/21/conquer-your-data-with-the-hana-dataframe-exploratory-data-analysis/
      https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/latest/en-US/index.html
      There are also hands-on sessions on that wrapper at this year’s TechEd season that is just starting, e.g. DAT260, SAP Data Intelligence: Machine Learning Push-Down to SAP HANA with Python

  • Hello,

    When I call the rest-API via Postman I get

    Is there something wrong or do I just have to wait – actually I’ve been waiting already a rather long time? The inference pipeline in the modeller is running.

    Thx, Ingo

    • Hi,

      I think the reason might be that the Semantic Data Lake (SDL) does not really work in the CAL environment where I tried to deploy the solution. In particular it is possible to save the model in the SDL, which can be checked in the Metadata Explorer. However, trying to access the model when using it in the inference operator seems to fail. You can also find the error message in the else-branch of the inference operator.

      Instead I used the local file system of DH to store the model which worked fine.

      Regards, Ingo

  • Hello Andreas,

     

    Great Blog. I have one further question.

     

    Is it possible to access Vora from JupyterLab Notebooks to access multiple Parquet files and leverage SQL on files?

     

    Thanks, Prem

     

    • Hi Prem, I haven’t tried accessing Vora from Python. But you can read Parquet files in Jupyter Notebooks and in the graphical pipelines, the latter of which can also write to Vora, if that’s of any help. Greetings, Andreas