Technical Articles
Andreas Forster

SAP Data Intelligence: Create your first ML Scenario

SAP Data Intelligence is SAP’s platform for data management, data orchestration and Machine Learning (ML) deployment. Whether your data is structured or unstructured, in SAP HANA or outside, SAP Data Intelligence can be very useful.

This blog shows how to create a very basic ML project in SAP Data Intelligence. We will train a linear regression model in Python with just a single predictor column. Based on how many minutes it takes a person to run a half-marathon, the model will estimate the person’s marathon time. That trained model will then be exposed as a REST-API for inference, so your business applications can leverage the predictions in your day-to-day processes. After this introductory tutorial you will be familiar with the core components of an ML scenario in SAP Data Intelligence, and the implementation of more advanced requirements is likely to follow a similar pattern.

For those who prefer R instead of Python, Ingo Peter has published a blog which implements this first ML scenario with R.

Should you have access to a system, you can follow along hands-on. You can currently request a trial or talk to your Account Executive anytime.

Update on 3 April 2023:

  • Please note that SAP Data Intelligence is not intended as a one-stop-shop Data Science platform.
  • Recently an issue caused the “Python Producer” and “Python Consumer” template pipelines to disappear. See SAP Note 3316646 to rectify this.
  • This blog was created with Python 3.6, which isn’t supported anymore. See the comment below by Tarek Abou-Warda on using Python 3.9 instead.
  • Examples on using the embedded Machine Learning in SAP HANA, SAP HANA Cloud and SAP Datasphere from SAP Data Intelligence are given in the book “Data Science mit SAP HANA” (the book is only available in German; there is no English edition).


Training data

Our dataset is a small CSV file containing the running times (in minutes) of 117 people who ran both the Zurich Marathon and a local half-marathon in the same region, the Greifenseelauf. The first rows show the fastest runners:

ID HALFMARATHON_MINUTES MARATHON_MINUTES
1 73 149
2 74 154
3 78 158
4 73 165
5 74 172

 

SAP Data Intelligence can connect to a range of sources. However, to follow this tutorial hands-on, please place the file into SAP Data Intelligence’s internal DI_DATA_LAKE:

  • Open the “Metadata Explorer”
  • Select “Browse Connections”
  • Click “DI_DATA_LAKE”
  • Select “shared” and upload the file

 

 

 

Data connection

Since we are using the built-in DI_DATA_LAKE you do not need any additional connection. If the file was located outside SAP Data Intelligence, then you would create an additional connection in the Connection Management.

 

If the file was located in an S3 bucket for instance, you would need to create a new connection and set the “Connection Type” to “S3”. Enter the bucket’s endpoint, the region, the access key and the secret key. Also, set the “Root Path” to the name of your bucket.

 

Data exploration and free-style Data Science

Now start working on the Machine Learning project. On the main page of SAP Data Intelligence open the “ML Scenario Manager”, where you will carry out all Machine Learning related activities.

Click the little “+”-sign on the top right to create a new scenario. Name it “Marathon time predictions”; you can enter further details into the “Business Question” section. Click “Create”.

 

You see the empty scenario. You will use the Notebooks to explore the data and to script the regression model in Python. Pipelines bring the code into production. Executions of these pipelines will create Machine Learning models, which are then deployed as REST-API for inference.

 

But one step at a time. Next, you load and explore the data in a Notebook. Select the “Notebooks” tab, then click the “+”-sign. Name the Notebook “10 Data exploration and model training”. Click “Create” and the Notebook opens up. You will be prompted for the kernel; keep the default of “Python 3”.

 

Now you are free to script in Python to explore the data and train the regression model. The central connection to your DI_DATA_LAKE can be leveraged from Python.

from hdfs import InsecureClient
import pandas as pd
client = InsecureClient('http://datalake:50070')
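If the import above fails because the hdfs package is not available in your notebook environment, you can install it first (as also mentioned in the comments below) and then rerun the cell:

!pip install hdfs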

 

The CSV file can then be loaded into a pandas DataFrame. Just make sure that the path corresponds to where you uploaded the file in the DI_DATA_LAKE.

with client.read('/shared/i056450/RunningTimes.csv') as reader:
    df_data = pd.read_csv(reader, delimiter=';')

 

Use Python to explore the data. Display the first few rows for example.

df_data.head(5)
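If you like, pandas can also give a quick statistical summary of the numerical columns:

df_data.describe()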

 

Or plot the runners’ half-marathon time against the full marathon time.

x = df_data[["HALFMARATHON_MINUTES"]]
y_true = df_data["MARATHON_MINUTES"]

%matplotlib inline
import matplotlib.pyplot as plot
plot.scatter(x, y_true, color = 'darkblue');
plot.xlabel("Minutes Half-Marathon");
plot.ylabel("Minutes Marathon");

 

There is clearly a linear relationship. No surprise: if you can run a marathon fast, you are also likely to be one of the faster half-marathon runners.

 

Train regression model

Begin by installing the scikit-learn library, which is very popular for Machine Learning in Python on tabular data such as ours.

!pip install sklearn

 

Then continue by training the linear regression model with the newly installed scikit-learn library. In case you are not familiar with linear regression, you might enjoy the openSAP course “Introduction to Statistics for Data Science”.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x, y_true)
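If you are curious about the estimated parameters, you can print the model’s intercept and slope. The attribute names below are the standard ones of scikit-learn’s LinearRegression.

# Intercept and slope of the fitted regression line
print(lm.intercept_, lm.coef_)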

 

Plot the linear relationship that was estimated.

plot.scatter(x, y_true, color = 'darkblue');
plot.plot(x, lm.predict(x), color = 'red');
plot.xlabel("Actual Minutes Half-Marathon");
plot.ylabel("Actual Minutes Marathon");

 

Calculate the Root Mean Squared Error (RMSE) as quality indicator of the model’s performance on the training data. This indicator will be shown later on in the ML Scenario Manager.

In this introductory example the RMSE is calculated rather manually. This way, I hope, it is easier to understand its meaning in case you are not yet familiar with it. Most Data Scientists would probably leverage a Python package to shorten this bit.

import numpy as np
y_pred = lm.predict(x)
mse = np.mean((y_pred - y_true)**2)
rmse = np.sqrt(mse)
rmse = round(rmse, 2)
print("RMSE: " , str(rmse))
print("n: ", str(len(x)))
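As a side note, the same value can be computed more compactly with scikit-learn’s metrics module. This is just a sketch for reference, assuming the lm, x and y_true objects from above.

from sklearn.metrics import mean_squared_error
import numpy as np

# RMSE in one line: mean_squared_error returns the MSE, np.sqrt turns it into the RMSE
rmse = round(np.sqrt(mean_squared_error(y_true, lm.predict(x))), 2)
print("RMSE:", rmse)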

 

A statistician may want to run further tests on the regression, e.g. on the distribution of the errors. We skip these steps in this example.

 

Save regression model

You could apply the model immediately on a new observation. However, to deploy the model later on, we will use one graphical pipeline to train and save the model and a second pipeline for inference.

Hence, at this point we also split the Python code across two separate Notebooks for clarity and finish this Notebook by saving the model. Since the production pipeline will save the model as a pickled object, the model is saved as a pickle object here in the Notebook as well.

import pickle
pickle.dump(lm, open("marathon_lm.pickle.dat", "wb"))

 

Don’t forget to save the Notebook itself as well.

 

Load and apply regression model

To create the second Notebook, go back to the overview page of your ML Scenario and click the “+”-sign in the Notebooks section.

 

Name the new Notebook “20 Apply model”, confirm “Python 3” as kernel and the empty Notebook opens up. Use the following code to load the model that has just been saved.

import pickle
lm_loaded = pickle.load(open("marathon_lm.pickle.dat", "rb"))

 

It’s time to apply the model. Predict a runner’s marathon time if the person runs a half-marathon in 2 hours.

x_new = 120
predictions = lm_loaded.predict([[x_new]])
round(predictions[0], 2)
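As a quick sanity check, the same value can be reproduced manually from the model’s intercept and slope (attributes as provided by scikit-learn).

# Manually apply the regression equation: intercept + slope * half-marathon time
print(lm_loaded.intercept_ + lm_loaded.coef_[0] * x_new)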

 

The model estimates a time of just under 4 hours and 24 minutes.

 

Deployment

Now everything is in place to start deploying the model in two graphical pipelines.

  • One pipeline to train the model and save it into the ML Scenario.
  • Another pipeline to surface the model as a REST-API for inference.

 

Training pipeline

To create the graphical pipeline to retrain the model, go to your ML Scenario’s main page, select the “Pipelines” tab and click the “+”-sign.

 

Name the pipeline “10 Train” and select the “Python Producer”-template.

 

You should see the following pipeline, which we just need to adjust. In principle, the pipeline loads data with the “Read File”-operator. That data is passed to a Python-operator, in which the ML model is trained. The same Python-operator stores the model in the ML Scenario through the “Artifact Producer”. The Python-operator’s second output can pass a quality metric of the model to the same ML Scenario. Once both model and metric are saved, the pipeline’s execution is ended with the “Graph Terminator”.

 

Now adjust the template to our scenario. Begin with the data connection. Select the “Read File”-operator and click the “Open Configuration” option.

 

In the Configuration panel on the right-hand side open the “Connection”-configuration, set “Configuration type” to “Connection Management” and set the “Connection ID” to “DI_DATA_LAKE”.

 

Save this configuration. Now open the “Path” setting and select the RunningTimes.csv file you had uploaded.

 

Next we adjust the Python code that trains the regression model. Select the “Python 3”-operator and click the Script-option.

 

The template code opens up. It shows how to pass the model and metrics into the ML Scenario. Replace the whole code with the following. That code receives the data from the “Read File”-operator, uses the code from the Notebook to train the model and passes the trained model as well as its quality indicator (RMSE) to the ML Scenario.

# Example Python script to perform training on input data & generate Metrics & Model Blob
def on_input(data):
    
    # Obtain data
    import pandas as pd
    import io
    df_data = pd.read_csv(io.StringIO(data), sep=";")
    
    # Get predictor and target
    x = df_data[["HALFMARATHON_MINUTES"]]
    y_true = df_data["MARATHON_MINUTES"]
    
    # Train regression
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(x, y_true)
    
    # Model quality
    import numpy as np
    y_pred = lm.predict(x)
    mse = np.mean((y_pred - y_true)**2)
    rmse = np.sqrt(mse)
    rmse = round(rmse, 2)
    
    # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
    metrics_dict = {"RMSE": str(rmse), "n": str(len(df_data))}
    
    # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics 
    api.send("metrics", api.Message(metrics_dict))

    # create & send the model blob to the output port - Artifact Producer operator will use this to persist the model and create an artifact ID
    import pickle
    model_blob = pickle.dumps(lm)
    api.send("modelBlob", model_blob)
    
api.set_port_callback("input", on_input)

 

Close the Script-window, then hit “Save” in the menu bar.

 

Before running the pipeline, we just need to create a Docker image for the Python operator. This gives you the flexibility to leverage virtually any Python library; you just need to provide a Docker File which installs the necessary libraries. You find the Docker Files by clicking the “Repository”-tab on the left, then right-clicking the “Dockerfiles” folder and selecting “Create Docker File”.

 

Name the file python36marathon.

 

Enter this code into the Docker File window. It leverages a base image that comes with SAP Data Intelligence and installs the necessary libraries on top of it. It is advisable to specify the versions of the libraries to ensure that new releases of these libraries do not impact your environment.

FROM $com.sap.sles.base
RUN pip3.6 install --user numpy==1.16.4
RUN pip3.6 install --user pandas==0.24.0
RUN pip3.6 install --user sklearn
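If you prefer to pin scikit-learn as well, the Docker File could look like the following. The scikit-learn version below is only an example; pick one that is available for the Python version of the base image.

FROM $com.sap.sles.base
RUN pip3.6 install --user numpy==1.16.4
RUN pip3.6 install --user pandas==0.24.0
RUN pip3.6 install --user scikit-learn==0.22.1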

 

SAP Data Intelligence uses tags to indicate the content of the Docker File. These tags are used in a graphical pipeline to specify in which Docker image an Operator should run. Open the Configuration panel for the Docker File with the icon on the top-right hand corner.

 

Add a custom tag to be able to specify that this Docker File is for the Marathon case. Name it  “marathontimes”. No further tags need to be added, as the base image com.sap.sles.base already contains all other tags that are required (“sles”, “python36”, “tornado”).

The “Configuration” should look like:

 

Now save the Docker file and click the “Build”-icon to start building the Docker image.

 

Wait a few minutes and you should receive a confirmation that the build completed successfully.

 

Now you just need to configure the Python operator, which trains the model, to use this Docker image. Go back to the graphical pipeline “10 Train”. Right-click the “Python 3”-operator and select “Group”.

 

Such a group can contain one or more graphical operators. And on this group level you can specify which Docker image should be used. Select the group, which surrounds the “Python 3” Operator. Now in the group’s Configuration select the tag “marathontimes”. Save the graph.

 

The pipeline is now complete and we can run it. Go back to the ML Scenario. On the top left you notice a little orange icon, which indicates that the scenario was changed after the current version was created.

 

Earlier versions of SAP Data Intelligence were only able to execute pipelines that had no changes made after the last version was taken. This requirement has been removed, hence there is no need to create a new version at the moment. However, when implementing a productive solution, I strongly suggest deploying only versioned content.

You can immediately execute this graph, to train and save the model! Select the pipeline in the ML Scenario and click the “Execute” button on the right.

 

Skip the optional steps until you get to the “Pipeline Parameters”. Set “newArtifactName” to lm_model. The trained regression model will be saved under this name.

 

Click “Save”. Wait a few seconds until the pipeline executes and completes. Just refresh the page once in a while and you should see the following screen. The metrics section shows the trained model’s quality indicator (RMSE = 16.96) as well as the number of records that were used to train the model (n = 116). The model itself was saved successfully under the name “lm_model”.

 

If you scroll down on the page, you see how the model’s metrics as well as the model itself have become part of the ML scenario.

 

We have our model, now we want to use it for real-time inference.

Prediction / Inference with REST-API

Go back to the main page of your ML Scenario and create a second pipeline. This pipeline will provide the REST-API to obtain predictions in real-time. Hence call the pipeline “20 Apply REST-API”. Now select the template “Python Consumer”. This template contains a pipeline that provides a REST-API. We just need to configure it for our model.

 

The “OpenAPI Servlow” operator provides the REST-API. The “Artifact Consumer” loads the trained model from our ML scenario, and the “Python36 – Inference” operator ties the two together. It receives the input from the REST-API call (here the half-marathon time) and uses the loaded model to create the prediction, which is then returned by the “OpenAPI Servlow” to the client that called the REST-API.

 

We only need to change the “Python36 – Inference”-operator. Open its “Script” window. At the top of the code, in the on_model() function, replace the single line “model = model_blob” with

import pickle
model = pickle.loads(model_blob)

 

Add a variable to store the prediction. Look for the on_input() function and add the line “prediction = None” as shown:

def on_input(msg):
    error_message = ""
    success = False
    prediction = None # This line needs to be added

 

Below the existing comment “# obtain your results” add the syntax to extract the input data (half-marathon time) and to carry out the prediction:

# obtain your results
hm_minutes = json.loads(user_data)['half_marathon_minutes']
prediction = model.predict([[hm_minutes]])

 

Further down, at the bottom of the page, change the “if success:” section to:

if success:
    # apply carried out successfully, send a response to the user
    msg.body = json.dumps({'marathon_minutes_prediction': round(prediction[0], 1)})

 

Close the editor window. Finally, you just need to assign the Docker image to the “Python36 – Inference” operator. As before, right-click the operator and select “Group”. Add the tag “marathontimes”.

 

Save the changes and go back to the ML Scenario. Now deploy the new pipeline. Select the pipeline “20 Apply REST-API” and click the “Deploy” icon.

 

Click through the screens until you can select the trained model from a drop-down. Click “Save”.

 

Wait a few seconds and the pipeline is running!

 

As long as that pipeline is running, you have the REST-API for inference. So let’s use it! There are plenty of applications you could use to test the REST-API. My personal preference is Postman, hence the following steps use Postman. You are free to use any other tool of course.

Copy the deployment URL from the above screen. Do not worry should you receive a message along the lines of “No service at path XYZ does not exist”. This URL is not yet complete, hence the error might come up should you try to call it as it is.

Now open Postman and enter the Deployment URL as request URL. Extend the URL with v1/uploadjson/. Change the request type from “GET” to “POST”.

 

Go to the “Authorization”-tab, select “Basic Auth” and enter your user name and password for SAP Data Intelligence. The user name starts with your tenant’s name, followed by a backslash and your actual user name.

 

Go to the “Headers”-tab and enter the key “X-Requested-With” with value “XMLHttpRequest”.

 

Finally, pass the input data to the REST-API. Select the “Body”-tab, choose “raw” and enter this JSON syntax:

{
    "half_marathon_minutes": 120
}

 

Press “Send” and you should see the prediction that comes from SAP Data Intelligence! Should you run a half-marathon in 2 hours, the model estimates a marathon time of just under 4 hours 24 minutes. Try the REST-API with different values to see how the predictions change. Just don’t extrapolate; stay within the half-marathon times of the training data.
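If you prefer calling the REST-API programmatically rather than with Postman, a minimal sketch in Python using the requests library could look like the following. The deployment URL, tenant, user and password are placeholders that you need to replace with your own values.

import requests

# Placeholders: replace with your deployment URL and SAP Data Intelligence credentials
url = "<deployment URL>" + "v1/uploadjson/"
response = requests.post(url,
                         auth=("<tenant>\\<user>", "<password>"),  # tenant\user, backslash escaped in Python
                         headers={"X-Requested-With": "XMLHttpRequest"},
                         json={"half_marathon_minutes": 120})
print(response.status_code, response.text)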

 

Should you prefer to authenticate against the REST-API with a certificate, instead of a password, then see Certificate Authentication to call an OpenAPI Servlow pipeline on SAP Data Intelligence.

 

Summary

You have used Python in Jupyter Notebooks to train a Machine Learning model, and the code has been deployed in a graphical pipeline for productive use. Your business processes and applications can now leverage these predictions. The business users who benefit from these predictions are typically not exposed to the underlying technologies. They should just receive the information they need, where and when they need it.

Let me give an example that is more business-related than running times. Here is a screenshot of a Conversational AI chatbot which uses such a REST-API from SAP Data Intelligence to estimate the price of a vehicle. The end user has a chat with the bot and just needs to provide the relevant information about a car to receive a price estimate.

Happy predicting!

 

 

Looking for more?

If you liked this exercise and want to keep going, you can follow the next blog, which uses SAP Data Intelligence to train a Machine Learning model inside SAP HANA, using the Machine Learning that is embedded in SAP HANA:
SAP Data Intelligence: Deploy your first HANA ML pipelines

In case you want to stick with native Python, you may wonder how to apply the concept from this blog to your own, more realistic dataset that contains more than a single predictor. Below are the most important differences for using three predictors instead of just one. In this new example, we predict the price of a used vehicle using the car’s mileage, horsepower, and the year in which the car was built.

Training the model in Jupyter:

from sklearn.linear_model import LinearRegression
x = df_data[['YEAR', 'HP', 'KILOMETER']]
y_true = df_data[['PRICE']]
lm = LinearRegression()
lm.fit(x, y_true)

Applying the model in Jupyter:

x_new = [[2005, 150, 50000]]
predictions = lm_loaded.predict(x_new)
round(predictions[0][0], 2)

Training the model in the pipeline. Just as in Jupyter:

    x = df_data[['YEAR', 'HP', 'KILOMETER']]
    y_true = df_data[['PRICE']]

Inference pipeline, retrieving the new predictor values and creating the new prediction:

                input_year = json.loads(user_data)['YEAR']
                input_hp = json.loads(user_data)['HP']
                input_kilometer = json.loads(user_data)['KILOMETER']
                prediction = model.predict([[input_year, input_hp, input_kilometer]])

Inference pipeline, returning the prediction:

        msg.body = json.dumps({'car_price': round(prediction[0][0], 2)})

Passing the values from Postman:

{
    "YEAR": 2005,
    "HP": 200,
    "KILOMETER": 50000
}


      116 Comments
      Ginger Gatling

      Excellent - thank you, Andreas!!

      Philipp Zaltenbach

      SAP Data Intelligence also comes with a built-in Data Lake. If you would like to use that embedded Data Lake instead of AWS S3 you need to do the following changes:

      Section „Data Connection“: Instead of defining the connection to S3, use the pre-defined connection for the DI Data Lake. In the Metadata Explorer go via “Browse Connections” to “DI_DATA_LAKE” and directly upload the CSV file (e.g. in folder “shared”).

      Section “Data exploration and free-style Data Science”: In the Juypter Notebook omit the code for connecting to S3, but instead use the following code to connect to the DI Data Lake and read the CSV file.

      !pip install hdfs
      from hdfs import InsecureClient
      client = InsecureClient('http://datalake:50070')
      with client.read('/shared/RunningTimes.csv', encoding = 'utf-8') as reader:
          df = pd.read_csv(reader, sep=";")

      The access to the built-in data lake from within Jupyter is also described in the documentation.

      Section “Training Pipeline”: Instead of configuring the “Read File” operator for S3, configure the parameter “Service” and set it to “SDL”. Furthermore, adjust the input file path to the execution step in the “Training Pipeline” section, i.e. inputFilePath = shared/RunningTimes.csv

      That’s all you need to do.

       

      David Bertsche

      For the Semantic Data Lake solution I also had to add the complete input file path to the Execution step in the "Training Pipeline" section.

      i.e. inputFilePath = shared/RunningTimes.csv

       

      Thanks for the great tutorial and comments!

      Philipp Zaltenbach

      Correct and thanks for spotting this. I updated my comment accordingly.

      Snehal Wankhade

      Hi Philip,

      Do we need to do any settings if we wish to use DI_DATA_LAKE for the first time? It's a newly created instance.

      I am getting following error while accessing folders in Data lake for uploading the file. Can you please advise.

      Folder cannot be displayed Error displaying Connection: DI_DATA_LAKE, Path: /. Select a connection or go back to previous location

      Module
      dh-app-metadata
      Code
      91322
      Message
      Could not browse the connection or folder
      Detail
      Could not browse the connection. Invalid input.
      GUID
      5f68badf-83d0-4e25-857e-ee00d936606a
      Hostname
      data-application-59d7l-bf46cd9d8-xgk27
      Causes
      Module
      dh-app-connection
      Code
      10102
      Message
      Error at Type Extension ‘onBeforeSendResponse’: Error at exchange with Storage Gateway: 404 – “<html>\n<head>\n<meta http-equiv=\”Content-Type\” content=\”text/html;charset=ISO-8859-1\”/>\n<title>Error 404 </title>\n</head>\n<body>\n<h2>HTTP ERROR: 404</h2>\n<p>Problem accessing /webhdfs/v1/. Reason:\n<pre> Not Found</pre></p>\n<hr />\n</body>\n</html>\n”
      GUID
      7d3f7ff4-7dc1-4080-89ef-a1ffc0e448dd
      Hostname
      flowagent-whcwh-784854d86c-pk9cn
      Erick David Santillan Perez

      I would add that if you can't upload the file using the Metadata Explorer (the upload button is greyed out), it could be because you don't have sufficient permissions. Your user must have the policy sap.dh.metadata assigned.

      In order to assign it you can go to System Management > Users then click on the plus sign and add that policy to the user.

       

      Thanks for the tutorial!

       

      Prashant Jayaraman

      It’s worth noting that the Data Intelligence virtual machines (DI VMs) do not have the DI_DATA_LAKE connection as of this writing; therefore I had to use an S3 bucket.

      Within the “10 Train” pipeline I also had to make the following changes to the Artifact Producer:

      • Right-click the Artifact Producer operator, and click “Open Operator Editor”.  Click the “Configuration” tab, and change the “service” parameter’s value from “SDL” to “File”.
      • Open the script for Artifact Producer, and change the value of the “flagUsingFileStorage” variable to “true”.

      On a related note, I found “kubectl get pods -w” and “kubectl log {pod_id}” to be invaluable for diagnosing why my pipeline execution kept failing.  These logs can also be downloaded from the Status tab in the pipeline modeler.

      EDIT:  Actually, if you just click the name of the item in Status, you get something more readable than the log, and you can see exactly where in the pipeline the execution failed. 

      Phillip Parkinson

      Thanks Prashant!

      You saved me a lot of time trying to figure out if the Artifact Producer can be used without the SDL.

      Raevanth Kumar

      Hi Prashant,

      I also had the problem in DI_DATA_LAKE, so I followed the steps mentioned above. I am getting an error in the Artifact Producer stating “runtimeStorage via $VSYSTEM_RUNTIME_STORE_MOUNT_PATH not found”.

       

      PS : While debugging I found the error caused due to the code (if condition) in artifact producer  :

      var ( //deprecated
      	flagCheckRuntimeStorage = true
      	flagUsingFileStorage    = true
      	gRuntimeStoragePath     = "/tenant/runtime"
      )
      
      
      if flagCheckRuntimeStorage && flagUsingFileStorage {
      		runtimeStoragePath, ok := os.LookupEnv("VSYSTEM_RUNTIME_STORE_MOUNT_PATH")
      		if !ok {
      			ProcessErrorSetup("runtimeStorage", errors.New("runtimeStorage via $VSYSTEM_RUNTIME_STORE_MOUNT_PATH not found"))
      			return
      		}
      		if _, err := os.Stat(runtimeStoragePath); os.IsNotExist(err) {
      			ProcessErrorSetup("runtimeStoragePath", err)
      			return
      		}
      		Logf("ArtifactProducer: using the following runtimeStorage: %q", runtimeStoragePath)
      	}

       

      Please help me to move forward!

       

      Prashant Jayaraman

      Hi Raevanth,

      I actually haven't done anything further with Data Intelligence since March, so it's quite likely you're on a newer version.  Maybe my solution no longer works.  Are you using an S3 bucket?

      I'm not actually on the DI team, so take my suggestion with a grain of salt, but I suppose one "hacky" fix would be to just change flagCheckRuntimeStorage to false.  Maybe this will break something else though.

      If you'd like I can inquire further.  I have a colleague who has done more with DI than I have.  However you may want to try a more "official" DI support channel first.  I'm sure when you started your DI trial there was some support information provided.

      By the way, was your DI_DATA_LAKE problem this by any chance?  Just happened to stumble onto it.

       

      Raevanth Kumar

      Yes, I used S3 bucket. My DI_DATA_LAKE connection was not stable so I was getting error something like this “DataLakeEndpoint  “502 Bad Gateway while building the artifact producer”.  So I thought of using S3 connection.

       

      Thanks for the clarification, will look some alternative.

      Prashant Jayaraman

      Good luck!

      Kavish Rastogi

      Well, there is a new change in Artifact Producer V2. If you want to use any storage other than DI_DATA_LAKE, the configurations have changed.

      I had to make the following changes to Artifact Producer to save my model in S3 Bucket

      Right-click the Artifact Producer operator, and click “Open Operator Editor”.  Click the “script” tab, and Ctrl + F  "GetStorageConnectionID" change the return value to your Connection Id. And Ctrl + F "createArtifactURI" look for function and change fixedPath variable to your desired path in the connection.

      Nina Sun

      Hi Philipp,

      Could you please also help me with how to export a CSV file to the data lake? My code is like below:

      result = pd.merge(dataset1, dataset, left_on = 'Income', right_on = [0])
      
      !pip install hdfs
      from hdfs import InsecureClient
      client = InsecureClient('http://datalake:50070')
      with client.write('/shared/DEMO_15761/CustomerSegmentation_Results.csv', encoding = 'utf-8') as writer:
          result.to_csv(writer)

      I got an error saying that “Invalid file path or buffer object type: <class ‘hdfs.util.AsyncWriter’>”

      Thanks for any hint.

      Nina

      Dubravko Bulic

      Thanks Andreas for sharing  - its perfect tutorial !!! 🙂

      Phillip Parkinson

      Great blog post Andreas, I was able to follow along in my own tenant.

      Karim Mohraz

      Thnx Andreas for this great tutorial.

      I got error 502 “bad gateway” when sending the POST request.

      It could be fixed by extending the Python Consumer template:
      when returning the response a new instance of api.message was created.

      request_id = msg.attributes['message.request.id']
      response = api.Message(attributes={'message.request.id': request_id}, body=msg.body)
      api.send('output', response)

      David Bertsche

      What all needs to be changed to train a different type of model using the pipeline interface? I trained a regression tree model (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) successfully in the Jupyter Notebook interface, then transferred the code to the Python3 operator in the pipeline interface. That operator seems to run fine, but the pipeline gets stuck on the "Artifact Producer" operator and the model is not produced (however, the metrics are produced). The pipeline stays in "running" status and never reaches "completed".

      David Bertsche

      Working with SAP support we found that there is a 10MB size limit on objects to fit through the pipeline between the Python3 operator and the Artifact Producer operator. In my case the model was larger and was getting stuck. We solved this by grouping those two operators together, tagged with the dockerfile below. Note: some of these settings are specific to my instance of DI

       

      FROM §/com.sap.datahub.linuxx86_64/vsolution-golang:2.7.44
      
      RUN python3.6 -m pip --no-cache-dir install \
      requests==2.18.4 \
      tornado==5.0.2 \
      minio==3.0.3 \
      hdfs==2.5.0 \
      Pillow==6.0.0 \
      tensorflow==1.13.1 \
      numpy==1.16.4 \
      pypng==0.0.19 \
      pandas==0.24.0 \
      sklearn
      Andreas Forster
      Blog Post Author

      Thank you for sharing David, great to hear that you got it working and how you made it!

      Henrique Pinto

      You don't need to have all operators running the same image/dockerfile to be able to run them in the same pod. A pod can have several containers based on different images. In DH, the way you tell it to run all operators in the same pod is by adding them all to the same group.

      Thorsten Schneider

      Very cool blog, Andreas Forster!

      Thanks

      Thorsten

      Manikandan Rajasekar

      Andreas Forster , Super Blog ! Informative and step by step - explained well ! Able to get hands on with SAP DI and created my first ML scenario with in build Data Lake. :).

      Karim Mohraz i ran into the same 502 error , your suggestion helped :).

      Philipp Zaltenbach Your steps to read from SDL connection helped a lot :).
      ELISA MASETTI

      Very interesting and complete blog Andreas Forster, thank you!

      I am trying to do the same having loaded the marathon times file on a HANA table.

      In the Jupyter notebook all good, I have used the new python APIs, but then? How should I change the training pipeline?

      Thanks

      Elisa

       

      Andreas Forster
      Blog Post Author

      Hello Elisa, thank you for the feedback! To follow this tutorial with data that is held in SAP HANA you need to replace the "Read File" operator in the training pipeline with a "HANA Client". The select statement that retrieves the history can be passed to the HANA Client with a "Constant Operator". And a "Format Converter" after the HANA Client might be needed to get the CSV format this tutorial is based on.
      However, with data in HANA, there is an alternative that I would prefer. Since HANA has predictive algorithms built in, you can train the model directly in SAP HANA without data extraction. We have new Python and R wrappers that make this pretty easy. Here are some links on that wrapper:
      https://blogs.sap.com/2019/08/21/conquer-your-data-with-the-hana-dataframe-exploratory-data-analysis/
      https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/latest/en-US/index.html
      There are also hands-on sessions on that wrapper at this year's Teched season that is just starting, ie DAT260, SAP Data Intelligence: Machine Learning Push-Down to SAP HANA with Python

      ELISA MASETTI

      Thank you Andreas!
      Is there a documentation page about the library hana_notebook_connector?

      Andreas Forster
      Blog Post Author

      The only documentation I am aware of is the popup within Jupyter. Have the cursor within the round brackets of NotebookConnectionContext(connectionId='YOURCONNECTION') and hit SHIFT+TAB.

      Rudolf Wenzler

      Hi Andreas,

       

      also thanks for your great blog - I'm also actually trying to access HANA from a canary service instance but cannot connect (ESOCKETTIMEDOUT). Is there something specific to configure?

       

      Thanks,

      Rudolf

      Andreas Forster
      Blog Post Author

      Hi Rudolf, I have seen the ESOCKETTIMEDOUT error when
      - the HANA system is either down
      - or if the HANA system cannot be found (ie if it has no external IP address)
      - or if the wrong port was specified

      In case this doesn't help, please ping me directly so we keep the comments on this blog close to the Marathon case 🙂

      Many Greetings, Andreas

      Achin Kimtee

      Hello Andreas,

      It's not very clear how to use Constant Operator in combination with Format Converter, HANA Client operator and Artifact Producer. Will you be able to share how the pipeline would look if we use HANA Client Operator to read the HANA table instead of the Read File operator?

      Regards

      Achin Kimtee

       

      Andreas Forster
      Blog Post Author

      Hi Achin, To change the data source from CSV to SAP HANA in the training pipeline, the python code just needs to be adjusted slightly as SAP HANA is providing the data in JSON format (not CSV).

      import json
      df_data =  pd.read_json(io.StringIO(data))

      The full code is then:

      import json
      import io
      import pandas as pd
      
      def on_input(data):
          
          # Obtain data
          df_data =  pd.read_json(io.StringIO(data))
          
          # Get predictor and target
          x = df_data[["HALFMARATHON_MINUTES"]]
          y_true = df_data["MARATHON_MINUTES"]
          
          # Train regression
          from sklearn.linear_model import LinearRegression
          lm = LinearRegression()
          lm.fit(x, y_true)
          
          # Model quality
          import numpy as np
          y_pred = lm.predict(x)
          mse = np.mean((y_pred - y_true)**2)
          rmse = np.sqrt(mse)
          rmse = round(rmse, 2)
          
          # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
          metrics_dict = {"RMSE": str(rmse), "n": str(len(df_data))}
          
          # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics 
          api.send("metrics", api.Message(metrics_dict))
      
          # create & send the model blob to the output port - Artifact Producer operator will use this to persist the model and create an artifact ID
          import pickle
          model_blob = pickle.dumps(lm)
          api.send("modelBlob", model_blob)
          
      api.set_port_callback("input", on_input)

       

      That's the pipeline.

       

      The Constant Generator is just sending the SELECT to the HANA Client, ie
      SELECT * FROM ML.RUNNINGTIMES

      Just to mention, that if the data is in SAP HANA, I suggest to try using the ML algorithms in HANA to avoid the data extraction (Predictive Algorithm Library, Automated Predictive Algorithm Library). Here is an example

      https://blogs.sap.com/2020/04/21/sap-data-intelligence-deploy-your-first-hana-ml-pipelines/

      Andreas

       

       

      Achin Kimtee

      Hi Andreas,

      Thanks for your quick and detailed response. It helped me to move in the right direction. There are 2 other issues which I am facing when I design the pipeline as you suggested.

      1. I used constant generator and put select query in the content but in the output of first SAP HANA Client operator, ordering of columns is randomly changing. It’s not the same sequence as present in the HANA table. As ordering changes randomly, python operator can’t accept the same and consequently, pipeline fails.
      2. Float values are coming as 52456/1000 instead of 52.456.

      Did you come across any such issues in your pipeline? Any inputs are welcome.


      Training Pipeline using HANA Table as Source

       

      Regards

      Achin Kimtee

      Andreas Forster
      Blog Post Author

      Hi Achin, You can control the column sequence by replacing the * with the column names.
      SELECT * FROM ML.RUNNINGTIMES

      I am surprised though that Python is expecting the columns in a certain order, since the code is retrieving the columns by name, not by index?
      y_true = df_data["MARATHON_MINUTES"]

      Achin Kimtee

      Hi Andreas Forster ,

      Even though I give column names instead of select *, the columns does not come in order.

      But you are right. Even though the columns are not in order,the code accepted it and ran successfully. I could successfully generate model using HANA table instead of file.

      Regards

      Achin Kimtee

      Achin Kimtee

      Hi Elisa,

      I am trying to design a similar training pipeline where input data is coming from HANA table. I get an error if I try to use HANA Client Operator with Artifact Producer. If I use Read File Operator, it works fine. Did it work for you? Can you share how did you achieve this scenario?

       

      Thanks

      Achin Kimtee

      Ingo Peter

      Hello,

      When I call the rest-API via Postman I get

      Is there something wrong or do I just have to wait - actually I've been waiting already a rather long time? The inference pipeline in the modeller is running.

      Thx, Ingo

      Ingo Peter

      Hi,

      I think the reason might be that the Semantic Data Lake (SDL) does not really work in the CAL environment where I tried to deploy the solution. In particular it is possible to save the model in the SDL which can be checked in the Metadata Explorer. However trying to access the model when using it in the inference operator seems to fail. You can also find the error message in the else-branch of the interference operator.

      Instead I used the local file system of DH to store the model which worked fine.

      Regards, Ingo

      Premchand Nutakki

      Hello Andreas,

       

      Great Blog. I have one further question.

       

      Is it possible to access Vora from JupyterLab Notebooks to access multiple Parquet files and leverage SQL on files.

       

      Thanks, Prem

       

      Andreas Forster
      Blog Post Author

      Hi Prem, I haven't tried accessing Vora from Python. But you can read Parquet files in Jupyter Notebooks and in the graphical pipelines, the latter of which can also write to Vora if that's any helpful. Greetings, Andreas

      Joseph Yeruva

      Thank you so much Andrea for the end-to-end tutorial. Waiting to build this.

      DaPeng Wang

      Hi, Andreas,

       

      thanks for the great hands-on tutorial.

       

      But somehow I am stuck with the docker build step. Using your sample code, I get the error message:

      Error reading Dockerfile open /vrep/vflow/dockerfiles/com/sap/opensuse/python36/Dockerfile: no such file or directory

       

      Trying to switch to other base images such as §/com.sap.datahub.linuxx86_64/vsolution-golang:2.7.44 or  python-3.6.3-slim-strech, I am able to build the docker image. Nevertheless, the execution of the pipeline always failed with no further message than "Dead".

       

      -- Is there any documentation about  base images available for data hub? Is there any way to take a look at the repository?

      -- How to get more detailed error messages?

      Andreas Forster
      Blog Post Author

      Hi DaPeng,
      The newly release DI 3.0 introduced some changes, there is a new base image for instance.
      https://help.sap.com/viewer/1c1341f6911f4da5a35b191b40b426c8/Cloud/en-US/d49a07c5d66c413ab14731adcfc4f6dd.html
      I hope to test this out soon, in the meantime please try this syntax

      FROM $com.sap.sles.base
      RUN python3.6 -m pip install numpy==1.16.4 --user
      RUN python3.6 -m pip install pandas==0.24.0 --user
      RUN python3.6 -m pip install sklearn --user

      The Monitoring tile on the DI launchpad shows a more detailed error for a failed pipeline.

      James Yao

      Hi Andreas,

      I have the same issue with Dapeng, after trying your new script. it gives me an error :

      "Unable to start docker file build: /service/v1/dockerenv/deploy/python36marathon"

      or

      "build failed for image: 589274667656.dkr.ecr.ap-southeast-1.amazonaws.com/di_30/vora/vflow-node-691ab04f0613a196010aa19647e61e943a7ffafc:3.0.21-python36marathon-20200425-210636"

      I checked also the guide in help.sap, couldn't figure out how to correct the script.

      Can you please help?

      Thank you!

      Andreas Forster
      Blog Post Author

      Hi DaPeng, hi James,
      I have just updated the whole blog for DI 3.0.
      There were plenty of changes, the product is moving fast 🙂
      For instance the connection, the docker file, the read file operator the versioning.
      The docker syntax is updated in the blog. And the tags that need to be assigned have also changed.
      You may want to skim through the blog from the beginning, for instance the bucket name is now specified in the connection.
      Please let me know if you get stuck somewhere or if you make it to the end.
      Andreas

      DaPeng Wang

      Hi,

      I am able to build the docker image with following DOCKERFILE.

      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy==1.16.4
      RUN pip3.6 install --user pandas==0.24.0
      RUN pip3.6 install --user sklearn

       

      Nevertheless, when running the pipeline, I got the following error:

      container has runAsNonRoot and image will run as root

       

      Any idea, how to configure the security or runAs context for the container?

       


      Mohamed Abdelrasoul

      Hi Andreas,

      many thanks for this interesting post,

      I did install SAP DI trial version and I'm using AWS for the as a host,

      I used this post to simulate my ML model of predicting diabetes (classification problem), however, I stuck while creating the environment of python through the docker file.

      can you please have a look into the below error?

      many thanks for your help

       

      Unable to start docker file build: /service/v1/dockerenv/deploy/dockerized

      Andreas Forster
      Blog Post Author

      Hello Mohamed,
      I wonder if your environment might not be configured correctly, were you able to build any Docker File?
      Does this single line build successfully for example?

      FROM $com.sap.sles.base

      Andreas

      Mohamed Abdelrasoul

      Hi Andreas,

      I tried this line as well and it runs successfully,

      FYI, I used another dataset, however, I'm using only Pandas, Sklearn, and NumPy to import

      thanks

      Mohamed

      Andreas Forster
      Blog Post Author

      Hi Mohamed, If the single FROM line builds successfully it seems to find the base image. Can you also build the following code?

      In case this fails, I assume there is a problem with the Docker configuration, maybe it cannot connect to the web to download the packages.

      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy==1.16.4
      RUN pip3.6 install --user pandas==0.24.0
      RUN pip3.6 install --user sklearn
      Mohamed Abdelrasoul

      Hi Andreas,

      I tried the above but it gives me the below error

       

      error building docker image. Docker daemon error: The command '/bin/sh -c pip3.6 install --user numpy==1.16.4' returned a non-zero code: 1
      a few seconds ago
      Andreas Forster
      Blog Post Author

      Hello Mohamed, I assume your Docker environment might not be configured correctly.
      Maybe it doesn't have Internet access to download packages from PyPI.
      Can you log a ticket with support for this? Or the wider SAP community might be able to help http://answers.sap.com/
      Andreas

      Sergio Peña

      Hi Andreas/Mohamed,

       

      I am the same error:

       

      The command ‘/bin/sh -c pip3.6 install –user numpy==1.16.4’ returned a non-zero code: 1

       

      With this script

      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy==1.16.4
      RUN pip3.6 install --user pandas==0.24.0
      RUN pip3.6 install --user sklearn
      RUN pip3.6 install --user tensorflow

      I tried to do it like this but the docker is never created

      FROM §/com.sap.datahub.linuxx86_64/vsolution-golang:2.7.44
      RUN python3.6 -m pip --no-cache-dir install \
      tornado==5.0.2 \
      hdfs==2.5.0 \
      tensorflow==1.13.1 \
      numpy==1.16.4 \
      pandas==0.24.0

       

      I think that is not posible create docker file into Sap Cal DI 3.0. I tried with Azure and i am triying it with Aws.

       

      Therofere when i assign tags to groups i get the error:

      failed to prepare graph description: failed to select image: no matching dockerfile found for group 'group1' with runtime tags: {"pandas": "", "sklearn": "", "numpy": "", "Pillow": "", "test2": "", "hdfs": "", "tornado": "", "minio": "", "tensorflow": "", "pypng": "", "requests": "", "python36": ""}; Alternatives are: {com.sap.dh.scheduler {"dh-app-base": "2002.1.10", "node": "vflow-sub-node"}} {com.sap.opensuse.flowagent-codegen {"flowagent-codegen": "2002.1.12", "opensuse": "", "spark": "2.4.0", "hadoop": "2.9.0", "python36": "", "tornado": "5.0.2", "deprecated": ""}} {com.sap.opensuse.golang.zypper {"opensuse": "", "python36": "", "tornado": "5.0.2", "sapgolang": "1.13.5-bin", "zypper": "", "deprecated": ""}} {com.sap.sles.ml.python {"ml-python": "", "numpy36": "1.13.1", "scipy": "1.1.0", "tornado": "5.0.2", "sapgolang": "1.13.5-bin", "python36": "", "opencv36": "3.4.2", "pykalman36": "", "textblob36": "0.12.0", "tweepy36": "3.7.0", "automated-analytics": "3.2.0.9", "sles": ""}} {test2 {"test2": "", "sles": "", "python36": "", "tornado": "5.0.2", "node": "10.16.0"}} {com.sap.opensuse.flowagent-operator {"opensuse": "", "deprecated": "", "flowagent": "2002.1.12"}} {com.sap.sles.sapjvm {"sles": "", "sapjvm": "", "python36": "", "tornado": "5.0.2"}} {com.sap.sles.textanalysis {"vflow_textanalysis": "", "python36": "", "tornado": "5.0.2", "sles": ""}} {org.opensuse {"tornado": "5.0.2", "sapgolang": "1.13.5-bin", "zypper": "", "opensuse": "", "python36": ""}} {phyt {"requests": "", "tornado": "", "tensorflow": "", "sklearn": "", "pypng": "", "pandas": "", "minio": "", "hdfs": "", "Pillow": "", "numpy": ""}} {com.sap.scenariotemplates.customdataprocessing.pandas {"python36": "", "pandas": "0.20.1", "tornado": "5.0.2", "node": "10.16.0", "sles": ""}} {com.sap.sles.dq {"sles": "", "vflow_dh_dq": "2002.1.2"}} {com.sap.sles.base {"default": "", "sles": "", "python36": "", "tornado": "5.0.2", "node": "10.16.0"}} {com.sap.dh.workflow {"dh-app-data": "2002.1.9", "node": "vflow-sub-node"}} {com.sap.sles.ml.functional-services {"sapgolang": "1.13.5-bin", "deprecated": "", "node": "10.16.0", "tornado": "5.0.2", "python36": "", "requests": "2.22.0", "sles": ""}} {com.sap.sles.streaming {"sles": "", "sapjvm": "", "streaming_lite": "", "python36": "", "tornado": "5.0.2"}} {com.sap.sles.flowagent-operator {"sles": "", "flowagent": "2002.1.12"}} {python36marathon {"marathontimes12": "", "sles": "", "python36": "", "tornado": "5.0.2", "node": "10.16.0"}} {trey {"trt": ""}} {Test7 {"sles": "", "python36": "", "tornado": "5.0.2", "node": "10.16.0", "tes7": ""}} {com.sap.dsp.dsp-core-operators {"hdfs": "2.5.0", "aiohttp": "3.5.4", "sapdi": "0.3.23", "hana_ml": "1.0.8.post5", "sapgolang": "1.13.5-bin", "hdbcli": "2.4.167", "pandas": "0.24.2", "tornado": "5.0.2", "python36": "", "sles": "", "requests": "2.22.0", "backoff": "1.8.0", "uvloop": "0.12.2"}} {com.sap.scenariotemplates.customdataprocessing.papaparse {"node": "", "papaparse": "4.1.2", "sles": "", "python36": "", "tornado": "5.0.2"}} {com.sap.dsp.sapautoml {"sapautoml": "2.0.0", "tornado": "5.0.2", "python36": "", "sles": ""}} {com.sap.sles.flowagent-codegen {"hadoop": "2.9.0", "python36": "", "tornado": "5.0.2", "flowagent-codegen": "2002.1.12", "sles": "", "spark": "2.4.0"}} {com.sap.sles.golang {"sles": "", "sapgolang": "1.13.5-bin", "python36": "", "tornado": "5.0.2"}} {com.sap.sles.hana_replication {"sles": "", "sapjvm": "", "python36": "", "tornado": "5.0.2", "hanareplication": "0.0.101"}} {test1 {"sles": "", "python36": "", "tornado": 
"5.0.2", "node": "10.16.0"}} {Tes6 {"Test6": ""}} {com.sap.opensuse.dq {"opensuse": "", "deprecated": "", "vflow_dh_dq": "2002.1.2"}} {com.sap.opensuse.ml.rbase {"rjsonlite": "", "tree": "", "sles": "", "python36": "", "opensuse": "", "rserve": "", "tornado": "5.0.2", "r": "3.5.0", "rmsgpack": ""}} {com.sap.sles.node {"tornado": "5.0.2", "sles": "", "node": "", "python36": ""}}
      But i assign the tags. I assign the tags to the operator and the group.

      Great blog Andreas.

      Andres Levano

      I'm having the same issue Sergio, and I don't know where to search, seems like there is no documentation to solve it.

      Andres Levano

      Hi Andreas,

      I'm having exactly the same problem as Sergio. I was checking other blogs but I think there is no workaround to solve this issue. Could you please help us?

      Truly thanks in advance.

      Andreas Forster
      Blog Post Author

      Hi Andres Levano , hi Sergio Peña , hi Mohamed ,
      I am bit puzzled that the Docker code doesn't build for you.
      Maybe your systems have something in common, that prevents this.
      I wouldnt be surprised for example, if your Docker cannot connect to the web, and therefore cannot download packages.

      Can you please test whether you can build the following code from the documentation.

      DI Cloud
      https://help.sap.com/viewer/1c1341f6911f4da5a35b191b40b426c8/Cloud/en-US/781938a8d99944d099c94ac813962c34.html
      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy=="1.16.1"

      DI on premise
      https://help.sap.com/viewer/aff95eebc2e04c44816e6ff0d21c3c88/3.0.latest/en-US/d49a07c5d66c413ab14731adcfc4f6dd.html?q=sles
      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy

      Andreas

      Mohamed Abdelrasoul

      Hi Andreas Forster ,

      tried again the above, but still facing the below error.

      can you please advise if there is a way to reinitialize the DI to have a fresh copy or do I need to reinstall anything again? as I think there might be an issue with the installation.

       

      ------

      error building docker image. Docker daemon error: The command '/bin/sh -c pip3.6 install –user numpy==”1.16.1″' returned a non-zero code: 1

      ------
      Andreas Forster
      Blog Post Author

      Hi Mohamed , Andres Levano , Sergio Peña , Klaus-Dieter Steffen
      To test this out I have created my on DI trial instance and can now reproduce the error. The Docker code from the documentation, which builds fine in the full DI version,

      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy

      is giving this error in the trial:
      error building docker image. Docker daemon error: The command '/bin/sh -c pip3.6 install --user numpy' returned a non-zero code: 1

      The following Docker build succeeds in the DI Trial. However, I am not sure if that image is intended to be used this way. At least so far I have not managed to also install pandas and sklearn on top of it.

      FROM $com.sap.sles.ml.python
      RUN python3.6 -m pip install numpy

      Let me check with the colleagues that look after the DI trial.
      Andreas

      Author's profile photo Sergio Peña
      Sergio Peña

      Hi Andreas,

       

      I tested it and I get another error. I checked my connection, but it is fine.

       

      [object Object],DEBUG,"[stream] [91mNo matching distribution found for tensorflow
      [0m",vflow,container,35605,func1
      [object Object],DEBUG,"[stream] Removing intermediate container 35cdef30f81e",vflow,container,35605,func1
      [object Object],DEBUG,"[error] The command '/bin/sh -c python3.6 -m pip install tensorflow' returned a non-zero code: 1",vflow,container,35605,func1
      [object Object],ERROR,"Error building docker image: The command '/bin/sh -c python3.6 -m pip install tensorflow' returned a non-zero code: 1",vflow,container,35269,buildImageCoreDocker
      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Mohamed  , Andres Levano  Sergio Peña , Klaus-Dieter Steffen ,
      Dimitri Vorobiev was able to analyse this behaviour further and identified that the issue is Docker on AWS not being able to connect to the web. Hence Docker cannot download / install the required Python packages. He is also proposing a workaround, please see
      https://answers.sap.com/questions/13080892/error-building-docker-image.html?childToView=13084906#comment-13084906

      Andreas

      Author's profile photo Martin Donadio
      Martin Donadio

      Hi Andreas !

       

      This is a great post !! Thanks for sharing all details and steps.

       

      I created an ML Scenario; in the Jupyter Notebook I installed opencv-python

       

      !pip install opencv-python

       

      And everything looks fine

       

      Collecting opencv-python
        Downloading https://files.pythonhosted.org/packages/d0/f0/cfe88d262c67825b20d396c778beca21829da061717c7aaa8b421ae5132e/opencv_python-4.2.0.34-cp37-cp37m-manylinux1_x86_64.whl (28.2MB)
           |████████████████████████████████| 28.2MB 12.6MB/s eta 0:00:01    |████████████                    | 10.5MB 12.6MB/s eta 0:00:02
      Requirement already satisfied: numpy>=1.14.5 in /opt/conda/lib/python3.7/site-packages (from opencv-python) (1.18.1)
      Installing collected packages: opencv-python
      Successfully installed opencv-python-4.2.0.34
      
      When I try to import cv2 I get the error below
      
      
      import cv2
      
      ---------------------------------------------------------------------------
      ImportError                               Traceback (most recent call last)
      <ipython-input-3-c8ec22b3e787> in <module>
      ----> 1 import cv2
      
      /opt/conda/lib/python3.7/site-packages/cv2/__init__.py in <module>
            3 import sys
            4 
      ----> 5 from .cv2 import *
            6 from .data import *
            7 
      
      ImportError: libSM.so.6: cannot open shared object file: No such file or directory

      It looks to me like an OS library is missing in the container.

      If I try to run zypper, I get the error below:

      ERROR: zypper was removed due to licensing reasons.
      It depends on 'rpm' module, which is licensed under sleepycat license.
      
      
      Do you have any hints to run opencv in ML Scenarios?
      
      
      Best regards,
      
      Martin

       

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Martin, Glad to hear you find the blog helpful!
      But I haven't tried opencv myself, I'm afraid.

      Author's profile photo Sumin Lee
      Sumin Lee

      Hi, @Martin Donadio

      Can you share how you solved the opencv package error in DI?

      Thank you,

      Author's profile photo Mychajlo Chodorev
      Mychajlo Chodorev

      Hi Andreas,

      thank you for the comprehensive article. However, I've run into an issue on the last step. I sent a POST request using curl and got an error:

      ralf@home ~> curl -X POST https://vsystem.ingress.dh-y23a0vea.dhaas-live.shoot.live.k8s-hana.ondemand.com/app/pipeline-modeler/openapi/service/f29ab031-5823-479a-b69f-8076c46921fe/v1/uploadjson --data "{\"half_marathon_minutes\": 120}" --user 'default\XXXXXX:XXXXXXXX' -H "X-Requested-With: XMLHTTPRequest"

      Illegal URL request path '/openapi/service/f29ab031-5823-479a-b69f-8076c46921fe/v1/uploadjson'. No executing host found

      ralf@home ~>

      The URL is copied from the deployment page. Google didn't help. What could it be?

       

      Mychajlo

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Mychajlo,
      Does the inference pipeline still show as "Running" in DI? Maybe it terminated with an error.
      If it is still running, I wonder if your corporate network or security settings might be blocking some communication.
      Can you please try from your laptop using a hotspot from your phone, or from your personal laptop on your home Wi-Fi?
      Since I haven't called the inference URL from curl myself: do you get the same result in Postman? Just in case curl might need some additional config / parameter.
      Andreas

      Author's profile photo Jude Regy
      Jude Regy

      I am following this tutorial and executing the code below in the Jupyter notebook, with my own CSV file located as referenced in the code:

      import boto3
      import pandas as pd
      import io
      client = boto3.client('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
      bucket = 'bucket1'
      object_key = 'bucket1/shared/JR/RunningTimes.csv'
      csv_obj = client.get_object(Bucket=bucket, Key=object_key)
      body = csv_obj['Body']
      csv_string = body.read().decode('utf-8')
      df_data = pd.read_csv(io.StringIO(csv_string), sep=";")

       

      I am getting the following error: Can someone let me know what I am doing wrong?

      ---------------------------------------------------------------------------
      ClientError                               Traceback (most recent call last)
      <ipython-input-30-e77a0f1f08d7> in <module>
            5 bucket = 'bucket1'
            6 object_key = 'RunningTimes.csv'
      ----> 7 csv_obj = client.get_object(Bucket=bucket, Key=object_key)
            8 body = csv_obj['Body']
            9 csv_string = body.read().decode('utf-8')
      
      /opt/conda/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
          314                     "%s() only accepts keyword arguments." % py_operation_name)
          315             # The "self" in this scope is referring to the BaseClient.
      --> 316             return self._make_api_call(operation_name, kwargs)
          317 
          318         _api_call.__name__ = str(py_operation_name)
      
      /opt/conda/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
          633             error_code = parsed_response.get("Error", {}).get("Code")
          634             error_class = self.exceptions.from_code(error_code)
      --> 635             raise error_class(parsed_response, operation_name)
          636         else:
          637             return parsed_response
      
      ClientError: An error occurred (InvalidAccessKeyId) when calling the GetObject operation: The AWS Access Key Id you provided does not exist in our records.
      
      
      Thanks you for your help.
      Jude
      
      
      
      
      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Jude, You are getting this error because you are working with an SAP-internal system that does not have a true S3 bucket connected.
      Your boto3 library connects to Amazon, which does not know about the S3 simulator in your system. Hence the message from Amazon: "The AWS Access Key Id you provided does not exist in our records."
      To proceed with the Python code, you can create an S3 bucket on Amazon, and use its credentials in DI.
      Andreas

      Author's profile photo Jude Regy
      Jude Regy

      Hi Andreas, that's great information. I had no idea about that. Thank you very much for letting me know.

      -Jude

      Author's profile photo Vinieth Kuruppath
      Vinieth Kuruppath

      Hi Andreas

      First of all, thanks for this blog, very little information is available on DI and this helps a lot

      In my DI, I have set up an S3 connection and it shows "OK" during the connection check. In fact, I am able to connect to AWS S3, get the data and run the ML scenario in the Jupyter notebook in the ML Scenario Manager. However, when I try to train the model, the Read File operator shows a blank screen when selecting a file; manually keying in the file name does not help, and as a result the training job fails.

       

       

      Any pointers would be really helpful

      Regards

      Vinieth

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Vinieth, You can get a more detailed error message in the Monitoring tile on the DI launchpad.
      There might be a tidy-up process, so in case your dead pipeline does not show anymore, just produce the error once more and it should show up.
      Andreas

      Author's profile photo Alexander Croll
      Alexander Croll

      Hi Andreas,

      very interesting and well-written post!

      I was able to create the Python Rest API and tested with Postman successfully.

      However, if I try to access the same API from inside a SAPUI5 application (with authorization included), I get an access failure (403). My research points to CORS restricting access to the remote server (Data Intelligence).

      Did you run into issues like this or have any thought on how to overcome it?

      SAPUI5 app and DI tenant are not on the same server/origin.

      thanks!

      Alex

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Alex, I love that you want to extend this and put a UI5 front-end on top! That would make for a great hands-on tutorial. I haven't worked much with UI5 though, but because of CORS Postman requires the X-Requested-With header with value XMLHttpRequest.
      Maybe this can also be added to the UI5 app?

      Author's profile photo Alexander Croll
      Alexander Croll

      Hi Andreas,

      I have tried this, but I am still getting a 403 (Forbidden) error in the console when accessing the API via a button.

      onPress: function(oEvent) {
          var oView = this.getView();
          var oResourceBundle = oView.getModel("i18n").getResourceBundle();
          var url = oResourceBundle.getText("url").toString().trim();

          $.ajax({
              type: "POST",
              url: url,
              dataType: "json",
              // beforeSend: function(xhr) {
              //     xhr.setRequestHeader("Authorization", "Basic Z******=");
              // },
              headers: {
                  "X-Requested-With": "XMLHttpRequest",
                  "Authorization": "Basic Z******="
              },
              success: function(response) {

              },
              error: function(response) {

              }
          });
      }

      This is the ajax call I am making from my SAPUI5 app. As far as I could find, it's how it should be done, but I am open to suggestions of course 🙂

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Alexander, maybe a question on the SAP Community could reach the UI5 experts who have done something similar before?

      Author's profile photo EMANUELE FUMEO
      EMANUELE FUMEO

      Hi Alexander, Hi Andreas,

      have you further investigated this topic?

      I was trying to create a very simple web UI with some JavaScript to consume the REST API of the OpenAPI block directly, but I stumbled over the same error.

      The only solution that worked for me was to completely bypass the CORS problem with a proxy in between... but this is not a "clean" solution.

      I would really appreciate it if you could share any thoughts or even a breakthrough 🙂

      Thanks and Kind Regards,

      Emanuele

      Author's profile photo Alexander Croll
      Alexander Croll

      Hi Emanuele,

      I have not investigated much further, but have received a response from SAP DI Team that it might have been a bug that prevented the API call up to SAPUI5 version 1.6 or so...

      Could not test this because our gateway currently has a SAPUI5 version below this...

      Can you share your solution with the proxy? It may not be clean, but maybe worth a try for prototypes/showcases 🙂

      Thanks and best,

      Alex

      Author's profile photo Viren Pravinchandra Devi
      Viren Pravinchandra Devi

      Brilliant blog..thank you..very helpful

      Author's profile photo Miguel Angel Meza Martinez
      Miguel Angel Meza Martinez

      Hi Andreas.

       

      I've been following your tutorial and got to a point where i get an error and can not continue.

       

      I was creating the Dockerfile with the content and the tag that you specified. However, when I click "save", the save and build icons are disabled, and when I hover over them I get a "prohibited" red icon. In the logs I can see a "warning" entry, but the message only says "1". Are you familiar with this error?

       

      I appreciate your help

       

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Miguel, It seems the latest DI Cloud release introduced an issue.
      Our colleagues are working on a fix. Felix Bartler posted a workaround
      https://answers.sap.com/questions/13127511/sap-di-cloud-cant-add-tags-to-dockerfile.html
      Greetings, Andreas


      Author's profile photo Rahul Pant
      Rahul Pant

      Can the SAP DI ML Model APIs be authenticated using a token as well?

      If yes, where can we find these tokens in the SAP DI application?

      User/password is not the best way for API authentication.

      I use the code below to call the API, but it throws an error about invalid authentication.

      In that case I want to understand whether the user needs to have any bare-minimum privileges.

      response = requests.request("POST", url, data=payload, headers=headers, auth=HTTPBasicAuth('user', 'password'))

      Thank you!

       

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Rahul,
      The Python Consumer template is designed for basic auth with user name and password
      https://help.sap.com/viewer/97fce0b6d93e490fadec7e7021e9016e/Cloud/en-US/2a351907e4764e3b8224d3cd0cbaddf3.html
      You could ask on the wider forum whether there are alternatives?
      https://answers.sap.com/index.html

      If you get an invalid authentication error, please ensure your user is prefixed by the tenant, i.e.

      default\yourusername
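
      For illustration, here is a minimal Python sketch of such a call with basic authentication. The URL is a placeholder that you would replace with your own deployment URL; the payload and the X-Requested-With header are the ones used earlier in this blog:

      import requests
      from requests.auth import HTTPBasicAuth

      # Placeholder URL - copy the actual URL from your deployment page
      url = "https://<your-di-host>/app/pipeline-modeler/openapi/service/<deployment-id>/v1/uploadjson"

      payload = {"half_marathon_minutes": 120}

      response = requests.post(
          url,
          json=payload,
          # The user name must be prefixed by the tenant, e.g. default\yourusername
          auth=HTTPBasicAuth("default\\yourusername", "yourpassword"),
          headers={"X-Requested-With": "XMLHttpRequest"},
      )
      print(response.status_code, response.text)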

      Andreas

      Author's profile photo Rachit Agarwal
      Rachit Agarwal

      Hi Andreas,

      Thanks for blog. As I am following the steps in the blog but still I am not able to build the docker file.

       

      https://answers.sap.com/questions/13177961/couldnt-create-docker-image-on-di-30.html?childToView=13187131#comment-13187131

      Posted this query quite a while ago & proposed solution still have similar error. I am working on Sap DI 3.0

       

      Looking forward for the solution.

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Rachit,
      Frank Schuler posted some answers to your question on the community.
      Could you please try those steps and post the outcome on that forum?
      I will subscribe to the question and will try to help if I have any further idea.
      Andreas

      Author's profile photo Rachit Agarwal
      Rachit Agarwal

      Hi Andreas,

       

      Thanks for the response. I have followed Frank Schuler's suggestion, but I am still facing the same issue. Here is the error:

      failed to prepare graph description: failed to prepare image: error building docker image. Docker daemon error: The command '/bin/sh -c python3.6 -m pip --no-cache-dir install --user pandas' returned a non-zero code: 1

       

      I am using the SAP DI trial 3.0.

      It would be really helpful if you could help me resolve this.

       

      Regards,

      Rachit Agarwal

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Rachit, Let's try to narrow it down. It was working successfully with this code and just one tag of your choosing:

      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy==1.16.4
      RUN pip3.6 install --user pandas==0.24.0
      RUN pip3.6 install --user sklearn

      If the build gives the same error, please try building this single line
      FROM $com.sap.sles.base

      If this succeeds, we know that Docker is working in principle at least.
      Then build
      FROM $com.sap.sles.base
      RUN pip3.6 install --user numpy==1.16.4

      Should this fail, there is a problem adding additional libraries.
      This could indicate that DI cannot download libraries from the web. Please see this post from  Dimitri Vorobiev
      https://answers.sap.com/questions/13080892/error-building-docker-image.html?childToView=13084906#comment-13084906

      Author's profile photo Rachit Agarwal
      Rachit Agarwal

      Thanks Andreas for lending a helping hand. I think Docker is working fine, as I am able to build by running

      FROM $com.sap.sles.base

      Adding libraries causes a similar error.

      I also tried to follow Dimitri's suggested steps, but they are not helping me at the moment. After deleting the pipeline I am at least able to build the Docker image, but as soon as I tag it while executing the pipeline, it fails with the same error.

      Can you please let me know a way around it?

       

      Regards,

      Rachit Agarwal

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Rachit,
      The test shows that there is some problem with your Docker environment downloading packages.
      I don't know any steps to troubleshoot this beyond Dimitri's suggestion.
      Since that trial issue is not specific to this blog, I suggest continuing the discussion with the whole community through the separate question you had posted.
      Andreas

      Author's profile photo Achin Kimtee
      Achin Kimtee

      Hi Andreas Forster ,

      As suggested in this post, I did the API call from Postman and it worked 2-3 times. But after that it started throwing the error below.

      Illegal URL request path '/openapi/service/1381f941-8c02-477b-8679-57a07dcd9752/v1/uploadjson/'. No executing host found

      I have stopped the pipeline and re-deployed it, but it is still throwing the same error. Can you please help with this error?

      Postman API POST Error

       

      Regards

      Achin Kimtee

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Achin Kimtee, Every time the pipeline is started, the URL of the REST-API changes.
      Do you get the predictions when using the URL of the latest deployment?
      Andreas

      Author's profile photo Achin Kimtee
      Achin Kimtee

      Hi Andreas Forster ,

      After re-deployment of the pipeline, I am using the updated URL, but the issue is still there. What's weird is that it works sometimes and then suddenly stops with the error I shared in the previous comment.

       

      Regards

      Achin Kimtee

       

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Achin Kimtee, If the pipeline is still shown as "Running", I am not sure what could be causing this message. Maybe network connectivity, but I am guessing. Can you please ask the wider community on https://answers.sap.com/questions/ask.html
      In case the pipeline stopped with an error, the "Monitoring" tile could show further details on why it stopped.

      Author's profile photo Andrea Scalvini
      Andrea Scalvini

      Hi Andreas Forster

      thanks for the comprehensive and interesting tutorial.

      I am running the Trial Version 3.0 on the GCP hyperscaler and I have an issue in the training pipeline: I'm able to see the final model score, but the Artifact Producer runs into the following error.

      Graph failure: ArtifactProducer: InArtifact: ProduceArtifact: response status from DataLakeEndpoint "502 Bad Gateway": "{\n \"message\": \"Error while executing type extension: Error during exchange with Storage Gateway: 400 - undefined\",\n \"code\": \"10102\",\n \"module\": \"dh-app-connection\",\n \"guid\": \"e9c0e17d-c96a-4da2-8dfe-788a56a6bc89\",\n \"statusCode\": 500,\n \"causes\": [],\n \"hostname\": \"datahub-app-core-f1fc5a8fb5566f4e7e2284-6577c59bf7-txhh5\"\n}"

      Could you please indicate what might be the problem?

      Thanks in Advance,

      Andrea

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello  Andrea Scalvini , Last month Dimitri Vorobiev made a comment that there "is a bug in our automated deployment of DI Trial Edition when deploying on Google Cloud".

      He is giving some suggestions in his post, which might be relevant for you as well.

      https://answers.sap.com/questions/13204754/runtime-storage-error-in-artifact-producer-sap-di.html

      In case that doesn't resolve it, can you please post this behaviour as a question to the wider community:

      https://answers.sap.com/index.html

      Author's profile photo Andrea Scalvini
      Andrea Scalvini

      It worked, thank you very much (and thanks to Dimitri Vorobiev for the support).

      I'll add some further information that could be useful for someone else: in the pipeline "20 Apply REST-API", inside the "Python36 - Inference" operator there is this line of code:

      model_ready = False

      To have response from the API, it needs to be changed to:

      model_ready = True

      otherwise it would respond with "Model has not yet reached the input port - try again"
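
      For context, here is a rough sketch of how that flag is normally flipped in the inference operator once the model artifact arrives, assuming the standard api.set_port_callback mechanism of the Python operator (the api object is provided by the Data Intelligence runtime; the callback name on_model and the pickle format are illustrative, not necessarily the exact template code):

      import pickle

      model = None
      model_ready = False

      def on_model(model_blob):
          # Once the trained model arrives on the "model" input port,
          # load it and mark it as ready for inference requests
          global model, model_ready
          model = pickle.loads(model_blob)
          model_ready = True

      # "api" is injected by the Data Intelligence Python operator runtime
      api.set_port_callback("model", on_model)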

      Author's profile photo Petar Aleksandrov
      Petar Aleksandrov

      Hello everyone!

       

      First of all, great tutorial! I am having problems with the execution of the graph. I am using an S3 bucket, but I get this message when executing the graph.

      I would appreciate some help here to complete this example.

      Thanks in Advance,

      Petar

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Petar, The s3_files connection is just needed to load the CSV file for training the model.
      The Artifact Producer should only use the built-in DI_DATA_LAKE connection.
      Maybe you changed some configuration of the Artifact Producer, or (if you are using DI on-prem) the DI_DATA_LAKE might not be configured correctly?
      Andreas

      Author's profile photo Petar Aleksandrov
      Petar Aleksandrov

      Thank you for the quick response. If I don't have the DI_DATA_LAKE, will it not work? Is there any way to make this work without the DI_DATA_LAKE? In our on-premise system we don't have the DI_DATA_LAKE installed.

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Yes, you need the DI_DATA_LAKE connection to make this work.
      This can still be set up fully on-prem though. Just passing on a comment from a colleague that helped another on-prem customer:
      You can use an S3 interface such as rook-Ceph. It is essentially an S3 API wrapper around a block storage that is hosted on-premise. You can either deploy rook directly in Kubernetes, which would then leverage the storage provisioner for Kubernetes, or set up a dedicated Ceph filesystem server. From the SAP Data Intelligence perspective it will be the same as connecting to an Amazon S3 storage bucket, except the hostname will be different.

      I believe that such a setup with rook-Ceph is supported. But please contact support in case you need an official statement for this.
      Andreas

      Author's profile photo Michał Gromiec
      Michał Gromiec

      First of all, thank you, Andreas Forster, for this great blog post, it helps a lot!

      I am facing issues with using the `get_datahub_connection` method inside JupyterLab. It looks like this method doesn't exist in my notebook. The only things I've found inside are a few wrappers (JSON, logging, requests) and two connection contexts, but those are for HANA connections.

      Is it possible to connect to a predefined S3 connection? What about other connections? Is it possible to use them directly inside JupyterLab?

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Michał Gromiec, Great to hear you find the blog useful!
      get_datahub_connection was deprecated in a recent release. I just haven't updated the blog yet, as I understand that the DI trial is still providing it. Maybe you are in a productive DI cloud system?
      If so, the following code should obtain the credentials from the central Connection Management:

      import requests
      connectionId = 's3_files'
      connection_content = requests.get("http://vsystem-internal:8796/app/datahub-app-connection/connectionsFull/%s" % connectionId).content
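
      As a hypothetical follow-up, assuming the endpoint returns the connection details as JSON, you could inspect the structure like this to locate the stored credentials:

      import json

      # Parse the response body and list the available keys
      conn = json.loads(connection_content)
      print(list(conn.keys()))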
      Author's profile photo Michał Gromiec
      Michał Gromiec

      Andreas Forster In the meantime I've found this in NotebookConnectionContext 🙂 I have only one doubt, correct me if I'm wrong, but does SAP DI respond with the credentials in plain text? Is this endpoint available to all DI users?

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Michał Gromiec , Yes it is clear text at the moment. I understand that this is currently being changed though. Christian Tietz

      Author's profile photo Snehit S Shanbhag
      Snehit S Shanbhag

      Hello Andreas,

       

      Firstly, thanks a lot for such a detailed blog.

      We see that the URL for the deployed ML scenario is consumed in Postman in this blog. I was curious:

      1. How can I use the URL in other SAP tools, so that a user can pass the data and see the output? Which tool can we use?
      2. Can we use this URL in an SAC story?
      3. Can we use this URL in a custom widget in SAC AD (for the end user to communicate with the ML scenario)?
      4. Can we build a UI5 app to consume this URL and enable the end user to communicate with the ML scenario?

       

      I will be interested to know your perspective, and also any other approaches to consume this URL (apart from Postman).

      Thanks in advance.

       

      Best Regards,

      Snehit Shanbhag

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hello Snehit S Shanbhag, The REST-API created in this blog is a generic interface that can be called from many other applications or languages; they just need to be able to call a REST-API. For example, at the end of the following blog is a video of a chatbot that retrieves predictions from the REST-API. https://blogs.sap.com/2020/04/21/sap-data-intelligence-deploy-your-first-hana-ml-pipelines/

      But there are also other options, on how Data Intelligence can provide predictions, depending on the requirements. Not every use case might require individual real-time predictions. Depending on the scenario, you can also schedule the scoring of a large number of predictions, and pass these to SAP Analytics Cloud. (for instance the "HANA ML Inference on Dataset" template)

      Or alternatively, if you need predictions at high speed at scale, Data Intelligence can listen to Kafka or MQTT to provide individual predictions on the fly.
      https://blogs.sap.com/2020/12/08/mlops-in-practice-applying-and-updating-machine-learning-models-in-real-time-and-at-scale/

      All the best,
      Andreas

      Author's profile photo Alecsandra Dimofte
      Alecsandra Dimofte

      Hi Andreas Forster

      Does DI offer code generation based on the model/process?

      Thank you

      Alecs

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Alecsandra Dimofte, Do you want to turn the trained model into a piece of code to create the inference / prediction in a separate program? That's possible with the Automated Predictive Library of HANA. https://blogs.sap.com/2020/12/07/hands-on-tutorial-score-your-apl-model-in-stand-alone-javascript/

      Data Intelligence could trigger the model training in HANA and obtain the latest scoring equation, or even deploy the code directly in Data Intelligence https://blogs.sap.com/2020/12/08/mlops-in-practice-applying-and-updating-machine-learning-models-in-real-time-and-at-scale/

      Author's profile photo Erick Rangel
      Erick Rangel

      Hello everyone,

      I want to know three things:

      1. How can I read the Parquet format using hdfs?

      2. I have other HANA database servers; how do I find out the port or host?

      3. I can't use the sapdi command. I think it's because I'm only connected to the data lake port. Could you give me more information about this?

      Thanks!

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Erick Rangel

      1) Reading a parquet file from the data lake
      from hdfs import InsecureClient
      import pandas as pd
      import io

      client = InsecureClient('http://datalake:50070')
      with client.read('/shared/i056450/RunningTimes.parquet') as reader:
          content = reader.read()
      df_data = pd.read_parquet(io.BytesIO(content))

      2) This blog explains how to get the SQL port for HANA On-Premise. For HANA Cloud and DWC that's always 443
      https://blogs.sap.com/2017/12/04/hey-sap-hana-express-edition-any-idea-whats-your-sql-port-number/

      3) Which sapdi command are you referring to?

      Greetings,

      Andreas

      Author's profile photo Erick Rangel
      Erick Rangel

      Thanks for replying Andreas Forster

      I've been trying to read the Parquet files with various functions, but unfortunately I couldn't, so I'm asking for help again.

      Sorry for the inconvenience,

       

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      The error shows that it cannot find the file. You are using "external" storage, which adds some complexity (i.e. user rights). To narrow this down, test loading the file from the "shared" folder.

      Author's profile photo Tarek Abou-Warda
      Tarek Abou-Warda

      If anyone else is getting the error message below when trying to build the Dockerfile:

      build failed for image: 539423227167.dkr.ecr.eu-central-1.amazonaws.com/dh-koq2mtyi/vora/vflow-node-51154c7fd66e6c6de7a9f93e2e1bd41096ca68c2:2209.4.7-python36marathon_TAW-20220707-150502

      It seems like the Python subengine got updated, and therefore it helped to change the Python version from 3.6 to 3.9:

      FROM $com.sap.sles.base
      RUN pip3.9 install --user numpy
      RUN pip3.9 install --user pandas
      RUN pip3.9 install --user sklearn
      RUN pip3.9 install --user scikit-learn-intelex

      See also: https://help.sap.com/docs/SAP_DATA_INTELLIGENCE/1c1341f6911f4da5a35b191b40b426c8/781938a8d99944d099c94ac813962c34.html

      Best regards

      Tarek

      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Thanks for the hint Tarek Abou-Warda !

      Author's profile photo Anandhu Sudheer
      Anandhu Sudheer

      Hi Andreas Froster,

      First off, the blog is fantastic and it was incredibly helpful when I tried to create an ML scenario in DIC almost a month ago. However, when I attempt to repeat the process, I am unable to create the Dockerfile. The status never changes to "completed" when I try to build the Docker image.

      Additionally, I've observed that DIC no longer offers the Python Producer or Consumer pipelines.

      What is preventing me from building the Docker image, and why?

      I've attached the screenshots for your reference.

       

       

       

      This is the error that I've encountered while trying to deploy the pipeline

      Best Regards,

      Anandhu

      Author's profile photo Tom Hu
      Tom Hu

      Hi Anandhu Sudheer

      It's possible that the image-building workload is waiting for allocation while there are no available resources on your worker node. You can try releasing some resources by stopping running graphs and applications. The image build will start once resources become available for allocation.

      The Python Consumer and Producer pipelines were removed along with some obsolete content from DI Cloud. For a workaround please refer to this SAP Note: https://i7p.wdf.sap.corp/sap/support/notes/3316646

      Besides, I noticed that pip3.6 is used in your Dockerfile. Please change that to pip3.9, as DI pipelines now run on Python 3.9.

      Document: https://help.sap.com/docs/SAP_DATA_INTELLIGENCE/1c1341f6911f4da5a35b191b40b426c8/781938a8d99944d099c94ac813962c34.html

       

       

      Author's profile photo Rafael Bortolon Paulovic
      Rafael Bortolon Paulovic
      In addition to Tom's comment, one can also set the resources for their specific image builds.
      For that, a resourceDefinition.json can be created under the folder where the Dockerfile is located.
      Example:
      {
        "requests" : {
          "cpu": "500m",
          "memory": "1Gi"
        },
        "limits" : {
          "memory": "2Gi"
        }
      }
      If not defined, the default is a request of 1 CPU and 2 Gi of memory, and it seems that your cluster setup does not have enough resources to execute the build.
      Author's profile photo Andreas Forster
      Andreas Forster
      Blog Post Author

      Hi Anandhu Sudheer , Thanks for the feedback!

      Very recently an issue slipped into Data Intelligence, which causes the Python templates to disappear. To implement this blog's example, please follow the steps described in SAP Note 3316646 - Python Consumer and Python Producer operators are no longer available

      I will add a reference to that Note at the top of the blog