SAP Data Intelligence Tips & Tricks
In this blog post I will discuss some of the basics needed to get things done in SAP Data Intelligence. Think of it as a cheat sheet for some of the frequent operations; please refer to the Product Manuals for complete information on the released versions.
With SAP Data Intelligence we often need to perform similar tasks in both the Jupyter Lab environment and the Modeler pipelines. For most of the topics below I will describe how to do this in both environments. Here is what we will go through in this blog post:
- How to access data from the native Semantic Data Lake (SDL) in Data Intelligence
- How to use a different version of the hana_ml library
- How to track model metrics
- How to “peek” into SAP Data Intelligence Pipelines
How to access data from the native Semantic Data Lake (SDL) in Data Intelligence
Let's say we have the following files in the DI DATA LAKE, which can be uploaded or accessed via the Data Intelligence Metadata Explorer. To have access to the Metadata Explorer functionality, the DI user must have the policy sap.dh.metadata assigned to them.
These CSV files all have the same format, and say we want to create a single data frame containing the data from the separate files. The code snippet below accesses the files and concatenates them into a combined pandas data frame.
In Jupyter Notebook
!pip install hdfs
from hdfs import InsecureClient
import pandas as pd

# Connect to the DI Data Lake via WebHDFS
client = InsecureClient('http://datalake:50070')
client.status("/")

# List the CSV files stored under /shared/MY_CSV_FILES
fnames = client.list('/shared/MY_CSV_FILES')

# Read each file and concatenate everything into one pandas data frame
data = pd.DataFrame()
for f in fnames:
    with client.read('/shared/MY_CSV_FILES/' + f, encoding='utf-8') as reader:
        data_file = pd.read_csv(reader)
    data = pd.concat([data_file, data])
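After the loop, data holds all rows from the individual files. A quick sanity check (plain pandas, nothing DI-specific is assumed here):

# Reset the index after concatenation and take a quick look at the result
data = data.reset_index(drop=True)
print(data.shape)
data.head()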
In Modeler Pipeline
To access SDL files in a pipeline you can use the Read File operator, which can read from SDL directly: choose the service SDL and specify the path to your file.
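If you prefer to handle the read in Python inside the pipeline instead, a minimal sketch of the same hdfs-based access in a Python operator could look like the following. This assumes the operator's Docker image has the hdfs package installed, that the datalake WebHDFS endpoint is reachable from the pipeline container, and that the port names "input" and "output" are just examples:

import pandas as pd
from hdfs import InsecureClient

def on_input(path):
    # Connect to the DI Data Lake via WebHDFS, as in the Jupyter example above
    client = InsecureClient('http://datalake:50070')
    # 'path' is expected to be the SDL file path sent to the input port
    with client.read(path, encoding='utf-8') as reader:
        df = pd.read_csv(reader)
    # Forward the content as CSV to the next operator
    api.send("output", df.to_csv(index=False))

api.set_port_callback("input", on_input)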
How to use a different version of the hana_ml library
SAP Data Intelligence has the hana_ml library already installed in the Jupyter Lab environment. However, in case this version is not the one you would like to use, you can install the required version just as you would in any JupyterLab.
For example, the current release of SAP Data Intelligence has version 1.0.5 installed in the Jupyter Lab environment. With hana_ml 1.0.7 you can create a hana_ml DataFrame from a pandas dataframe, which is what I want to do with the data loaded from the SDL in the step before. To do this, follow these steps:
In Jupyter Notebook
- Upload the required hana_ml library (in this example hana_ml 1.0.7) to JupyterLab
- Run !pip install hana_ml-1.0.7.tar.gz in a Jupyter Notebook
- Create a hana_ml DataFrame from the pandas dataframe:
from notebook_hana_connector.notebook_hana_connector import NotebookConnectionContext
import hana_ml.dataframe as dataframe

conn = NotebookConnectionContext(connectionId='CONNECTION')
dataframe.create_dataframe_from_pandas(conn, data, "TABLE_NAME", force=True, replace=False)
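To check that the table was created, you can read it back as a hana_ml DataFrame. A small sketch, assuming NotebookConnectionContext exposes the standard hana_ml ConnectionContext methods and using the same table name as above:

# Read the freshly created table back as a hana_ml DataFrame
hana_df = conn.table("TABLE_NAME")
print(hana_df.count())

# collect() pulls the (small) preview back into pandas for inspection
hana_df.head(5).collect()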
In Modeler Pipeline
The steps for creating a docker for running hana_ml are already described in this excellent blog post from Stojan Maleschlijski.
For Python version 3.6 and hana_ml 1.0.7 the Dockerfile looks like this:
FROM $com.sap.opensuse.python36
# Install the HANA_ML package
RUN mkdir /tmp/SAP_HANA_ML
# Copy the local tar file to the docker container
COPY hana_ml-1.0.7.tar.gz /tmp/SAP_HANA_ML/
# Install from the docker container
RUN python3 -m pip install /tmp/SAP_HANA_ML/hana_ml-1.0.7.tar.gz
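Once the image is built and tagged so that the Python operator in your graph runs on it (via operator/group tags, as described in the referenced blog post), the operator script can import hana_ml directly. A minimal sketch, with host, port and credentials as placeholders rather than real values, and with example port names "input" and "output":

import hana_ml.dataframe as dataframe

# Placeholder connection details - replace with your own HANA connection
conn = dataframe.ConnectionContext('<hana_host>', 30015, '<user>', '<password>')

def on_input(data):
    # Simple check that hana_ml works inside the container: query the DUMMY table
    result = conn.sql('SELECT * FROM DUMMY').collect()
    api.send("output", str(result))

api.set_port_callback("input", on_input)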
How to track model metrics in SAP Data Intelligence
SAP Data Intelligence provides the functionality to train and apply your models. As part of training and model management you can easily track the model metrics of your choice when executing training runs in both the Jupyter Lab and pipeline environments.
In Jupyter Notebook via the SAP Data Intelligence SDK
from hana_ml.algorithms.pal import metrics

### Test set: quality metrics (R-squared and accuracy)
r2 = metrics.r2_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')
accuracy = metrics.accuracy_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

from sapdi import tracking

# Start a tracked run
run_id = tracking.start_run()

metric1 = {
    'name': 'r_2',
    'type': 'float',
    'value': r2,
}
metric2 = {
    'name': 'accuracy',
    'type': 'float',
    'value': accuracy,
}
# Use a separate name so we do not shadow the imported metrics module
metrics_list = [metric1, metric2]
tracking.log_metrics(metrics_list)

params = {
    'input_size': df_test.count(),
    'model': 'cart'
}
tracking.log_parameters(params)

# Mark the end of tracking
tracking.end_run()  # persists the metrics at the end of the run
When the notebook is run, these metrics and parameters are available with the notebook via the ML Scenario Manager.
In Modeler Pipeline
Similar functionality can be embedded in the pipeline. This is described in great detail in the blog post from Andreas Forster.
The key step is to send the desired metrics to the Metrics API in the Python Producer pipeline:
df_act_pred = df_act_pred.cast('PREDICTIONS', 'DOUBLE')
r2 = metrics.r2_score(conn, df_act_pred, label_true='ACTUALS', label_pred='PREDICTIONS')

# to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
metrics_dict = {"R2": str(r2), "Validation Data Set Size": str(df_test.count())}

# send the metrics to the output port - the Submit Metrics operator will use this to persist them
api.send("metrics", api.Message(metrics_dict))
The model metrics and model artifacts are created when the pipeline is executed via the ML Scenario Manager. They can then be accessed in the ML Scenario Manager's Executions section.
How to “peek” into the SAP Data Intelligence Pipelines
When developing pipelines there is often a need to find out what's going on with the code and with the container environment where the pipeline is executing. Below I describe how to do this for Python operators.
Code Tracing (credit to Andreas Forster)
For the first need, tracing code running in the pipelines, we can use the logging mechanism provided for the Python operator.
In the code of the Python operator, add the following snippet to log at different levels (INFO, DEBUG, ERROR, etc.):
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
api.logger.info("Your message")
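The same logger can be used at other severities as well, for example:

# Messages at each level show up in the trace with their severity
api.logger.debug("Detailed debugging information")
api.logger.warning("Something looks off, but the graph keeps running")
api.logger.error("Something went wrong")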
Then, once the pipeline runs, these messages can be traced in the TRACE section of the Modeler.
Container Access (credit to Stojan Maleschlijski)
When developing, we often need to check the environment and setup of the container where the pipeline is running. To enable this, one can attach a Terminal operator to the pipeline that feeds user input back to the Python operator, so we can emulate accessing the docker container as we would in a local environment.
Create a new graph, for example PythonOperatorWithLogs, and add a Python operator with two ports, one for input and one for output. Attach the Python operator to a Terminal operator and the Terminal back to the Python operator as follows:
The following code snippet is added to the Python operator to get access to the container.
import os
import subprocess
import logging
import time
import sys

# This enables logging in addition to accessing the container
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

def on_input(data):
    # Run the command received from the Terminal operator inside the container
    process = subprocess.Popen(data, shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, universal_newlines=True)
    out, err = process.communicate()
    # Log the message you want
    api.logger.info("User input " + str(data))
    # Send the command output (and any error) back to the Terminal operator
    api.send("log", out)
    api.send("log", err)

# Call the above function on user input
api.set_port_callback("input", on_input)
Now, when the pipeline is running, right-click the Terminal operator and click Open UI.
This will open terminal access in the browser.
We can then type commands in the input window, for example “ls”, and see the output in the window above.
That's all for now, folks. As I find more things that are commonly used I will attempt to add them here, and I hope they come in handy for your work.
Thank you Nidhi for starting this treasure trove!!
Thanks Nidhi, already looking forward to part 2! 😉
Thank you very much Nidhi.
Can you please let me know if there exists a test platform, maybe in the cloud, to get hands-on with SAP Data Intelligence and the HANA ML libraries to work on PoCs?
Thank you
Shishupal
Hi Shishupalreddy Ramreddy,
SAP Data Intelligence is provisioned via SCP (SAP Cloud Platform) and you would need to enable it from an SCP account.
For hana_ml you can pip install the tar.gz which comes with the HANA Client:
All the best,
Nidhi
Hi Nidhi! I have one question.
Is it possible to read partitioned .parquet files from the DI DATA LAKE connection using Jupyter Lab?
Hi Barbara Souza,
The DI DATA LAKE connection should give you access to the parquet files if they are in the DI Data Lake. To read them you would need the right library; in my case I show how to read CSV with pandas, for example.
Hope this helps,
Nidhi