Alessandro Parolin

Dancing with SAP Data Intelligence

In the blog post written by Ingo Peter, we can see how SAP Data Intelligence can leverage SAP HANA PAL to create predictive models through custom code in Python operators. This article shows a newer way, available as of DI 1911, of working with PAL models using only out-of-the-box operators.

Introduction

As the end-of-the-year party approaches, your manager gives you, an important data scientist in the company, a critical job: to make sure the music playlist is as fun as possible. This is of extreme importance so that all employees enjoy the party. You think to yourself: how on earth will you do that? After all, you are no DJ. So, you decide to put your machine learning skills to work and create a model to figure that out for you. But that brings up another question: where will you do that? It would be good if this model could be integrated with your in-house software that manages events. You then remember reading about a platform called SAP Data Intelligence in an SAP blog post and decide to use it to tackle this problem.

The dataset

As usual, you need to gather some data to train a model. Luckily, you find on the Internet a dataset containing a list of the top 50 Spotify songs of 2019 and some features associated with each of them. You upload the CSV file to your company data lake on AWS S3.

Loading the data into SAP HANA DB

You open up the Modeler app to create a pipeline that will load the data from the CSV file into the HANA database. You also take the opportunity to clean it up a bit. As you would like to keep things simple (at least for now), you decide to use only some of the features available in the dataset.

The pipeline for that is also simple. Just an operator to read the file, a Python script to pre-process the data, and ultimately a HANA client operator to insert the data into the database. It looks like this:

For the Read File operator, the configuration is quite simple:

 

For the Python operator, you wrap it in a group by right-clicking it and selecting “Add group”. In the settings of that group, you add the tag “pandas”. That allows you to import the pandas package, which makes it much easier to deal with CSV files. You also create an input port on the operator called “file” of type “message.file” and an output port called “data” of type “message”. The Python code then looks like this:

from io import StringIO

import pandas as pd

# Songs with a danceability score at or above this value are labelled 'Fun'.
FUN_THRESHOLD = 70

def on_file(message):
    # The file body arrives as bytes; decode it before handing it to pandas.
    content = StringIO(message.body.decode('iso-8859-1'))
    df = pd.read_csv(content, sep=",")

    # Keep only a few features; .copy() avoids pandas' SettingWithCopyWarning.
    dataset = df[['Genre', 'Beats.Per.Minute', 'Popularity', 'Length.', 'Danceability']].copy()
    # Encode the genre names as integer category codes.
    dataset['Genre'] = dataset['Genre'].astype('category').cat.codes

    # Turn the numeric danceability score into the target label.
    is_fun = dataset['Danceability'] >= FUN_THRESHOLD
    dataset['Danceability'] = is_fun.map({True: 'Fun', False: 'No Fun'})

    # The DataFrame index is written as the first column and acts as the row ID.
    api.send('data', api.Message(body=dataset.to_csv(header=False)))

# 'api' is provided by the Python operator at runtime.
api.set_port_callback('file', on_file)
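
To try the transformation before running the pipeline, the same labelling rule can be applied to a small in-memory sample outside the Modeler. This is only a sketch: the `preprocess` helper and the three sample rows are made up for illustration, and the operator's `api` object is left out on purpose.

```python
from io import StringIO

import pandas as pd

FUN_THRESHOLD = 70  # same cut-off as in the operator script

# Made-up sample rows; only the column names mirror the dataset.
SAMPLE_CSV = """Track.Name,Genre,Beats.Per.Minute,Popularity,Length.,Danceability
Song A,pop,117,79,191,76
Song B,dance pop,105,92,302,55
Song C,pop,190,85,186,70
"""

def preprocess(csv_text):
    """Return the header-less CSV text that would be sent on to HANA."""
    df = pd.read_csv(StringIO(csv_text), sep=",")
    dataset = df[['Genre', 'Beats.Per.Minute', 'Popularity', 'Length.', 'Danceability']].copy()
    # Genres become integer codes, assigned in alphabetical order.
    dataset['Genre'] = dataset['Genre'].astype('category').cat.codes
    # Scores at or above the threshold are 'Fun', everything else 'No Fun'.
    is_fun = dataset['Danceability'] >= FUN_THRESHOLD
    dataset['Danceability'] = is_fun.map({True: 'Fun', False: 'No Fun'})
    return dataset.to_csv(header=False)

result = preprocess(SAMPLE_CSV)
print(result)
```

Each output line starts with the DataFrame index, followed by the genre code, the three numeric features and the label, which is the column order the table in HANA receives.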

You then configure the SAP HANA client operator, which is also simple:

Last, you connect the operators and run the pipeline. Success! The data is in HANA now. You are off to a great start in this endeavor!

Training the model

Well, SAP Data Intelligence has an app for that. Without second guessing, you open up ML Scenario Manager (MLSM), and create a new scenario:

In the scenario, you click the + sign in the Pipelines section, choose the HANA ML Training template and give the pipeline a name.

When created, you are taken to the Modeler UI.

Here you configure the pipeline. In this case, you configure the HANA ML Training operator by selecting a connection, specifying the training and test datasets, selecting the task and algorithm, and specifying the key and target columns. Additionally, hyperparameters can be provided in JSON format to fine-tune the algorithm. They are algorithm-specific and can be found in the HANA ML documentation page. For example, these are the hyperparameters for a neural network.
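
As an illustration of the JSON format, the snippet below builds a hyperparameter string in Python. The keys `n_estimators`, `max_depth` and `learning_rate` are parameter names of the `HybridGradientBoostingClassifier` in the hana_ml library. Whether the operator accepts exactly these keys is an assumption, and the values are arbitrary, so check both against the documentation for your DI version.

```python
import json

# Hypothetical values; the key names follow hana_ml's
# HybridGradientBoostingClassifier and are an assumption for the operator.
hyperparameters = {
    "n_estimators": 20,
    "max_depth": 4,
    "learning_rate": 0.1,
}

# Serialize to the JSON text that would be pasted into the operator configuration.
hyperparameters_json = json.dumps(hyperparameters)
print(hyperparameters_json)
```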

When ready, the configurations look like the following:

That’s it! Going back to MLSM, you select that pipeline and click on execute. Since there were changes to the pipeline, MLSM asks you to create a new version:

After that, you execute the pipeline once again. This time, MLSM takes you through some steps in which you can provide a description of the current execution and set global pipeline configurations. In this case, the only one is the name of the artifact (model) that is going to be created. You give it a name and click Save.

Then, the training begins:

… and in a few moments, you have yourself a shiny new model. Excellent!

Deploying an inference pipeline

Once again in MLSM, you create a new pipeline, but this time you select the HANA ML Inference template.

Just like before, you are then taken to the Modeler:

This time you configure the HANA ML Inference operator to connect to a HANA system. That’s right, any HANA system! Not necessarily the one in which the model was trained. The configuration looks like the following:

Back to MLSM, you select the pipeline and deploy it:

Once again, you create a new version, since you modified the pipeline.

The wizard takes you to the point where you need to specify a model that will be used. Here, you select the recently trained model and continue.

At the end, you get a URL that you can use as a REST endpoint exposing your model.

Consuming the REST endpoint

Time to see what this thing can do! To keep things simple, you send a plain curl request with the features of a song to find out whether people will enjoy it:

curl --location --request POST 'https://<host>/app/pipeline-modeler/openapi/service/<deploy id>/v1/inference' \
--header 'Content-Type: application/json' \
--header 'If-Match: *' \
--header 'X-Requested-With: Fetch' \
--header 'Authorization: Basic 00000000000000000' \
--data-raw '{
        "ID": [1],
        "GENRE": ["6"],
        "BEATSPERMINUTE": [117],
        "POPULARITY": [79],
        "LENGTH":[121]
}'

And the response for that one is:

{"ID":{"0":1},"SCORE":{"0":"Fun"},"CONFIDENCE":{"0":0.5990922485}}
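
The same call can be made from Python, for example from the event-management software. The sketch below uses only the standard library and mirrors the headers of the curl command. The helper names `build_payload`, `parse_response` and `predict` are hypothetical, and the assumption that a multi-row response keeps the same column-oriented shape ("0", "1", ...) is not confirmed by the article. `predict` is defined but not called, since it needs a live deployment.

```python
import json
import urllib.request

def build_payload(rows):
    """Turn a list of row dicts into the column-oriented request body."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def parse_response(text):
    """Turn the column-oriented response into one dict per scored row."""
    data = json.loads(text)
    return [
        {
            "ID": data["ID"][key],
            "SCORE": data["SCORE"][key],
            "CONFIDENCE": data["CONFIDENCE"][key],
        }
        for key in data["ID"]
    ]

def predict(url, basic_token, rows):
    """POST the rows to the inference endpoint and return the parsed scores."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(rows)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "If-Match": "*",
            "X-Requested-With": "Fetch",
            "Authorization": "Basic " + basic_token,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read().decode("utf-8"))

# Offline check using the request and response shown above.
song = {"ID": 1, "GENRE": "6", "BEATSPERMINUTE": 117, "POPULARITY": 79, "LENGTH": 121}
payload = build_payload([song])
sample = '{"ID":{"0":1},"SCORE":{"0":"Fun"},"CONFIDENCE":{"0":0.5990922485}}'
parsed = parse_response(sample)
print(payload)
print(parsed)
```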

Great! That is it! You did it! Now, you just have to let your boss know the job is done, enjoy the party and get the well-deserved promotion! 🙂


      5 Comments
      Marco Furlanetto

      Well done!

      Joseph Yeruva

      Good one. Thank you. What algorithm is used here? How can we choose the algorithm in the HANA ML Training operator?

      Alessandro Parolin
      Blog Post Author

      Hi Joseph,

      In this case, we used a Hybrid Gradient Boosting Classifier.

      The algorithm can be selected in the training operator configuration (6th screenshot from the “Training the model” section above). Depending on the task selected, different algorithms are available.

      Best Regards!

      Rolf Hoven

      We have several schemas on the HANA database.

      When using the “SAP HANA Client” operator, where can I decide which schema to create my Spotify table in?

      Indu Khurana

      Hello Alessandro,

       

      What a helpful blog!

      I tried to implement the same graph in DI, but while inserting into HANA, it throws an error: "failed to create schema '\"DATAHUB\"': SQL Error 258 - insufficient privilege: Detailed info for this error can be found with guid"

      Where can I check my authorization, or how should I fix it?

      Could you please suggest!

       

      Thanks,

      Indu Khurana.