nidhi_sawhney
Advisor
In this blog post I will discuss some of the basics needed to get things done in SAP Data Intelligence. Think of it as a cheat-sheet for some of the frequent operations; please refer to the Product Manuals for complete information on the released versions.

With SAP Data Intelligence we often need to perform similar tasks in both the Jupyter Lab environment and the Modeler Pipeline. For the most part, I will describe how to do this in both environments. Here is a list of what we will go through in this blog post:

  1. How to access data from the native Semantic Data Lake (SDL) in Data Intelligence

  2. How to use different versions of the hana_ml libraries

  3. How to track model metrics

  4. How to "peek" into SAP Data Intelligence Pipelines


How to access data from the native Semantic Data Lake (SDL) in Data Intelligence


Let's say we have the following files in the DI DATA LAKE, which can be uploaded or accessed via the Data Intelligence Metadata Explorer. To have access to the Metadata Explorer functionality, the DI user must have the policy sap.dh.metadata assigned to them.



These CSV files all have the same format, and say we want to create a single data frame containing the data from all the separate files. The code snippet below can be used to access the files and create a combined data frame from them.
In Jupyter Notebook

!pip install hdfs

from hdfs import InsecureClient

# Connect to the DI data lake via its WebHDFS endpoint
client = InsecureClient('http://datalake:50070')

client.status("/")
fnames = client.list('/shared/MY_CSV_FILES')

import pandas as pd

# Read each CSV file and append it to one combined data frame
data = pd.DataFrame()
for f in fnames:
    with client.read('/shared/MY_CSV_FILES/' + f, encoding='utf-8') as reader:
        data_file = pd.read_csv(reader)
        data = pd.concat([data_file, data])

 
In Modeler Pipeline

To access SDL files in a pipeline you can use the Read File operator, which allows direct access to SDL: choose SDL as the Service and specify the path to your file.
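
As a rough sketch (an assumption on my part, not taken from the product documentation): once the Read File operator is connected to a Python3 operator, the file content arrives in the message body and can be turned into a pandas data frame. The port names input/output, the exact body type (bytes versus string), and the availability of pandas in the operator's Docker image are all assumptions here:

import io
import pandas as pd

def on_input(msg):
    # msg.body is assumed to carry the raw CSV content delivered by the Read File operator
    body = msg.body
    if isinstance(body, (bytes, bytearray)):
        df = pd.read_csv(io.BytesIO(body))
    else:
        df = pd.read_csv(io.StringIO(str(body)))
    # Forward a small confirmation downstream, e.g. the shape of the data frame
    api.send("output", str(df.shape))

api.set_port_callback("input", on_input)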



 

How to use different versions of the hana_ml libraries


SAP Data Intelligence has the hana_ml libraries already installed in the Jupyter Lab environment. However, in case this version is not the one you would like to use, you can install the required version just as you would normally install it in JupyterLab.

For example, the current release of SAP Data Intelligence has version 1.0.5 installed in the Jupyter Lab environment. With hana_ml 1.0.7 you can create a hana_ml DataFrame from a pandas dataframe, which is what I want to do with the data loaded from the SDL in the previous step. To do this, follow these steps:
In Jupyter Notebook


  1. Upload the required hana_ml library, in this example hana_ml 1.0.7, to JupyterLab

  2. Run !pip install hana_ml-1.0.7.tar.gz in a Jupyter Notebook

  3. Create a HANA dataframe from the pandas dataframe (see the sketch after this list for a quick check of the result)
    from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
    import hana_ml.dataframe as dataframe
    conn = NotebookConnectionContext(connectionId='CONNECTION')

    dataframe.create_dataframe_from_pandas(conn, data, "TABLE_NAME", force=True, replace=False)
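
To quickly check the result, here is a minimal sketch, assuming NotebookConnectionContext behaves like a regular hana_ml ConnectionContext and that "TABLE_NAME" was created under your default schema:

# Read the uploaded table back as a hana_ml DataFrame and inspect it
hana_df = conn.table("TABLE_NAME")
print(hana_df.count())            # number of rows uploaded
print(hana_df.head(5).collect())  # pull a small sample back into pandas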



In Modeler Pipeline

The steps for creating a Docker image for running hana_ml are already described in this excellent blog post from Stojan Maleschlijski.

For Python 3.6 and hana_ml 1.0.7 the Dockerfile looks like this:
FROM $com.sap.opensuse.python36

# Install the hana_ml package
RUN mkdir /tmp/SAP_HANA_ML
# Copy the local tar file into the Docker image
COPY hana_ml-1.0.7.tar.gz /tmp/SAP_HANA_ML/
# Install from the copied archive
RUN python3 -m pip install /tmp/SAP_HANA_ML/hana_ml-1.0.7.tar.gz

 

How to track model metrics in SAP Data Intelligence


SAP Data Intelligence provides the functionality to train and apply your models. As part of training and model management you can easily track the model metrics of your choice when executing training runs, in both the Jupyter Lab and Pipeline environments.
In Jupyter Notebook via SAP Data Intelligence SDK

from hana_ml.algorithms.pal import metrics

### Test set: quality metrics (R squared and accuracy)
r2 = metrics.r2_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

accuracy = metrics.accuracy_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

from sapdi import tracking

run_id = tracking.start_run()
metric1 = {
    'name': 'r_2',
    'type': 'float',
    'value': r2,
}
metric2 = {
    'name': 'accuracy',
    'type': 'float',
    'value': accuracy,
}

# Use a separate name so we do not shadow the imported hana_ml metrics module
run_metrics = [metric1, metric2]
tracking.log_metrics(run_metrics)

params = {
    'input_size': df_test.count(),
    'model': 'cart'
}

tracking.log_parameters(params)

# Mark the end of tracking
tracking.end_run()  # persists the metrics at the end of the run

When the notebook is run, these metrics and parameters are available with the notebook via the ML Scenario Manager.


In Modeler Pipeline

Similar functionality can be embedded in the pipeline. This is described in great detail in the blog post from Andreas Forster.

The key step is to send the desired metrics to the Metrics API in the Python Producer pipeline:
df_act_pred = df_act_pred.cast('PREDICTIONS', 'DOUBLE')
# r2_score returns the R squared of the predictions
r2 = metrics.r2_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

# to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
metrics_dict = {"R2": str(r2), "Validation Data Set Size": str(df_test.count())}

# send the metrics to the output port - the Submit Metrics operator will use this to persist the metrics
api.send("metrics", api.Message(metrics_dict))

The model metrics and model artifacts are created when the pipeline is executed via the ML Scenario Manager. They can then be accessed in the Executions section of the ML Scenario Manager.



 

That's all for now, folks. As I come across more commonly used operations I will add them here, and I hope they come in handy for your work.

 

 

How to "peek" into the SAP Data Intelligence Pipelines


When developing pipelines, there is often a need to find out what's going on with the code and with the container environment where the pipeline is executing. Below I describe how to do this for Python operators.

Code Tracing (credit to andreas.forster)


For the first problem, tracking and tracing code running in the pipelines, we can use the logging mechanism provided for the Python operator.

In the code of the Python operator, add the following code snippet to log at different levels such as INFO, DEBUG and ERROR:
import logging

# Configure the log level and format for the operator's log output
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

api.logger.info("Your message")

Then, once the pipeline runs, these messages can be traced in the TRACE section of the Modeler:



 

Container Access (credit to stojanm)


When developing, we often need to check the environment and setup of the container where the pipeline is running. To enable this, one can attach a Terminal operator to the pipeline that feeds the user input back to the Python operator, so we can emulate accessing the Docker container from our local environment.

Create a new graph, for example PythonOperatorWithLogs, add a Python operator, and add two ports to it, one for input and one for output. Connect the Python operator to a Terminal operator and the Terminal back to the Python operator as follows:



The following code snippet is added to the Python operator to get access to the container.
import os
import subprocess
import logging
import time
import sys

# This enables logging in addition to accessing the container
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

def on_input(data):

    # Run the command received from the Terminal operator inside the container
    process = subprocess.Popen(data, shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, universal_newlines=True)
    out, err = process.communicate()
    # Log the message you want
    api.logger.info("User input " + str(data))

    # Send the command output (and any error output) back to the Terminal operator
    api.send("log", out)
    api.send("log", err)

# Call the above function on user input
api.set_port_callback("input", on_input)

 

Now when the pipeline is running, right-click on the Terminal operator and click Open UI.



This will open terminal access in the browser:



We can then type commands in the Input window below, for example "ls" here, and see the output in the window above. Commands like python3 --version or python3 -m pip list are similarly useful for inspecting the Python environment inside the container.