SAP Data Intelligence Tips & Tricks
In this blog post I will discuss some of the basics needed to get things done in SAP Data Intelligence. Think of it as a cheat sheet for some of the frequent operations; please refer to the Product Manuals for complete information on the released versions.
With SAP Data Intelligence we often need to perform similar tasks in both the Jupyter Lab environment and the Modeler pipelines. For most of the topics below I will describe how to do this in both environments. Here is what we will go through in this blog post:
- How to access data from the native Semantic Data Lake (SDL) in Data Intelligence
- How to use a different version of the hana_ml library
- How to track model metrics
- How to “peek” into SAP Data Intelligence Pipelines
How to access data from the native Semantic Data Lake (SDL) in Data Intelligence
Let's say we have the following files in the DI DATA LAKE, which can be uploaded or accessed via the Data Intelligence Metadata Explorer. To have access to the Metadata Explorer functionality, the DI user must have the policy sap.dh.metadata assigned to them.
These CSV files all have the same format, and say we want to create a single data frame containing the data from the separate files. The code snippet below accesses the files and concatenates them into a combined pandas data frame.
In Jupyter Notebook
!pip install hdfs
from hdfs import InsecureClient
import pandas as pd

# Connect to the DI Data Lake via WebHDFS
client = InsecureClient('http://datalake:50070')
client.status("/")

# List the CSV files stored under /shared/MY_CSV_FILES
fnames = client.list('/shared/MY_CSV_FILES')

# Read each file and concatenate everything into one pandas data frame
data = pd.DataFrame()
for f in fnames:
    with client.read('/shared/MY_CSV_FILES/' + f, encoding='utf-8') as reader:
        data_file = pd.read_csv(reader)
    data = pd.concat([data_file, data])
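After the loop, data holds all rows from the individual files. A quick sanity check (plain pandas, nothing DI-specific is assumed here):

# Reset the index after concatenation and take a quick look at the result
data = data.reset_index(drop=True)
print(data.shape)
data.head()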
In Modeler Pipeline
To access SDL files in a pipeline you can use the Read File operator, which can read from SDL directly: choose the service SDL and specify the path to your file.
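If you prefer to handle the read in Python inside the pipeline instead, a minimal sketch of the same hdfs-based access in a Python operator could look like the following. This assumes the operator's Docker image has the hdfs package installed, that the datalake WebHDFS endpoint is reachable from the pipeline container, and that the port names "input" and "output" are just examples:

import pandas as pd
from hdfs import InsecureClient

def on_input(path):
    # Connect to the DI Data Lake via WebHDFS, as in the Jupyter example above
    client = InsecureClient('http://datalake:50070')
    # 'path' is expected to be the SDL file path sent to the input port
    with client.read(path, encoding='utf-8') as reader:
        df = pd.read_csv(reader)
    # Forward the content as CSV to the next operator
    api.send("output", df.to_csv(index=False))

api.set_port_callback("input", on_input)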
How to use a different version of the hana_ml library
SAP Data Intelligence has the hana_ml library already installed in the Jupyter Lab environment. However, in case this version is not the one you would like to use, you can install the required version just as you would in any JupyterLab.
For example, the current release of SAP Data Intelligence has version 1.0.5 installed in the Jupyter Lab environment. With hana_ml 1.0.7 you can create a hana_ml DataFrame from a pandas dataframe, which is what I want to do with the data loaded from the SDL in the step before. To do this, follow these steps:
In Jupyter Notebook
- Upload the required hana_ml library (in this example hana_ml 1.0.7) to JupyterLab
- Run !pip install hana_ml-1.0.7.tar.gz in a Jupyter Notebook
- Create a hana_ml DataFrame from the pandas dataframe:
from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
import hana_ml.dataframe as dataframe

conn = NotebookConnectionContext(connectionId='CONNECTION')
dataframe.create_dataframe_from_pandas(conn, data, "TABLE_NAME", force=True, replace=False)
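To check that the table was created, you can read it back as a hana_ml DataFrame. A small sketch, assuming NotebookConnectionContext exposes the standard hana_ml ConnectionContext methods and using the same table name as above:

# Read the freshly created table back as a hana_ml DataFrame
hana_df = conn.table("TABLE_NAME")
print(hana_df.count())

# collect() pulls the (small) preview back into pandas for inspection
hana_df.head(5).collect()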
In Modeler Pipeline
The steps for creating a docker for running hana_ml are already described in this excellent blog post from Stojan Maleschlijski.
For Python version 3.6 and hana_ml 1.0.7 the Dockerfile looks like this:
FROM $com.sap.opensuse.python36
# Install the HANA_ML package
RUN mkdir /tmp/SAP_HANA_ML
# Copy the local tar file to the docker container
COPY hana_ml-1.0.7.tar.gz /tmp/SAP_HANA_ML/
# Install from the docker container
RUN python3 -m pip install /tmp/SAP_HANA_ML/hana_ml-1.0.7.tar.gz
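Once the image is built and tagged so that the Python operator in your graph runs on it (via operator/group tags, as described in the referenced blog post), the operator script can import hana_ml directly. A minimal sketch, with host, port and credentials as placeholders rather than real values, and with example port names "input" and "output":

import hana_ml.dataframe as dataframe

# Placeholder connection details - replace with your own HANA connection
conn = dataframe.ConnectionContext('<hana_host>', 30015, '<user>', '<password>')

def on_input(data):
    # Simple check that hana_ml works inside the container: query the DUMMY table
    result = conn.sql('SELECT * FROM DUMMY').collect()
    api.send("output", str(result))

api.set_port_callback("input", on_input)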
How to track model metrics in SAP Data Intelligence
SAP Data Intelligence provides the functionality to train and apply your models. As part of training and model management you can easily track the model metrics of your choice when executing training runs in both the Jupyter Lab and pipeline environments.
In Jupyter Notebook via the SAP Data Intelligence SDK
from hana_ml.algorithms.pal import metrics

### Test set: quality metrics (R-squared and accuracy)
r2 = metrics.r2_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')
accuracy = metrics.accuracy_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

from sapdi import tracking

# Start a tracked run
run_id = tracking.start_run()

metric1 = {
    'name': 'r_2',
    'type': 'float',
    'value': r2,
}
metric2 = {
    'name': 'accuracy',
    'type': 'float',
    'value': accuracy,
}
# Use a separate name so we do not shadow the imported metrics module
metrics_list = [metric1, metric2]
tracking.log_metrics(metrics_list)

params = {
    'input_size': df_test.count(),
    'model': 'cart'
}
tracking.log_parameters(params)

# Mark the end of tracking
tracking.end_run()  # persists the metrics at the end of the run
When the notebook is run, these metrics and parameters are available with the notebook via the ML Scenario Manager.
In Modeler Pipeline
Similar functionality can be embedded in the pipeline. This is described in great detail in the blog post from Andreas Forster.
The key step is to send the desired metrics to the Metrics API in the Python Producer pipeline:
df_act_pred = df_act_pred.cast('PREDICTIONS', 'DOUBLE')
r2 = metrics.r2_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

# to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
metrics_dict = {"R2": str(r2), "Validation Data Set Size": str(df_test.count())}

# send the metrics to the output port - the Submit Metrics operator will use this to persist them
api.send("metrics", api.Message(metrics_dict))
The model metrics and model artifacts are created when the pipeline is executed via the ML Scenario Manager. They can then be accessed in the ML Scenario Manager's Executions section.
How to “peek” into the SAP Data Intelligence Pipelines
When developing pipelines there is often a need to find out what's going on with the code and with the container environment where the pipeline is executing. Below I describe how to do this for Python operators.
Code Tracing (credit to Andreas Forster)
For the first need, tracing code running in the pipelines, we can use the logging mechanism provided for the Python operator.
In the code of the Python operator, add the following snippet to log at different levels (INFO, DEBUG, ERROR, etc.):
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
api.logger.info("Your message")
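The same logger can be used at other severities as well, for example:

# Messages at each level show up in the trace with their severity
api.logger.debug("Detailed debugging information")
api.logger.warning("Something looks off, but the graph keeps running")
api.logger.error("Something went wrong")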
Then, once the pipeline runs, these messages can be traced in the TRACE section of the Modeler.
Container Access (credit to Stojan Maleschlijski)
When developing, we often need to check the environment and setup of the container where the pipeline is running. To enable this, one can attach a Terminal operator to the pipeline that feeds user input back to the Python operator, so we can emulate accessing the docker container as we would in a local environment.
Create a new graph, for example PythonOperatorWithLogs, and add a Python operator with two ports, one for input and one for output. Attach the Python operator to a Terminal operator and the Terminal back to the Python operator as follows:
The following code snippet is added to the Python operator to get access to the container.
import os
import subprocess
import logging
import time
import sys

# This enables logging in addition to accessing the container
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

def on_input(data):
    # Run the command received from the Terminal operator inside the container
    process = subprocess.Popen(data, shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, universal_newlines=True)
    out, err = process.communicate()
    # Log the message you want
    api.logger.info("User input " + str(data))
    # Send the command output (and any error) back to the Terminal operator
    api.send("log", out)
    api.send("log", err)

# Call the above function on user input
api.set_port_callback("input", on_input)
Now, when the pipeline is running, right-click the Terminal operator and click Open UI.
This will open terminal access in the browser.
We can then type commands in the input window, for example “ls”, and see the output in the window above.
That's all for now, folks. As I find more things that are commonly used I will attempt to add them here, and I hope they come in handy for your work.
Thank you Nidhi for starting this treasure trove!!
Thanks Nidhi, already looking forward to part 2! 😉
Thank you very much Nidhi.
Can you please let me know if there exists a test platform, maybe in the cloud, to get hands-on with SAP Data Intelligence and the HANA ML libraries to work on PoCs?
Thank you
Shishupal
Hi Shishupalreddy Ramreddy,
SAP Data Intelligence is provisioned via SCP (SAP Cloud Platform) and you would need to enable it from an SCP account.
For hana_ml you can pip install the tar.gz which comes with the HANA Client:
All the best,
Nidhi
Hi Nidhi! I have one question.
Is it possible to read partitioned .parquet files from the DI DATA LAKE connection using Jupyter Lab?
Hi Barbara Souza,
The DI DATA LAKE connection should give you access to the parquet files if they are in the DI Data Lake. To read them you would need the right library; in my case I show how to read CSV with pandas, for example.
Hope this helps,
Nidhi