Machine Learning Tracking SDK in SAP Data Intelligence

In this blog, we will see how to use the Machine Learning Tracking Service in SAP Data Intelligence.

To know more about SAP Data Intelligence, refer here.

Why do we need a tracking service? Building machine learning models is an iterative process involving multiple steps. Data scientists often run multiple experiments in a Jupyter notebook, tweaking the underlying models, optimizing model parameters, tuning hyperparameters, changing sampling methods, or trying different libraries and versions, and then compare the trained models using print statements (most of the time). This becomes confusing and difficult to manage as the number of models and parameters grows. As a result, data scientists struggle to reproduce an experiment, track the performance of a model, or work out what worked in the past and what did not.

The Tracking service in SAP Data Intelligence addresses this problem. It provides a framework for data scientists to configure, capture, organize, reproduce, and visualize machine learning experiments.

With tracking, data scientists can define what needs to be captured, such as model parameters, model metrics, model hyperparameters, and other configuration like random seeds. The captured information can then be viewed in Metric Explorer (see preview below) within ML Scenario Manager.

You can also see Metric Explorer in action in the video here and Tracking SDK in the video here.

So, let's get started.

Launch SAP Data Intelligence and select ML Scenario Manager from the Launchpad.

 

Inside ML Scenario Manager, click + to create a new scenario.

In the Create Scenario dialog box, provide a name and the business question details, and click Create.

 

The scenario is created and you are redirected to its details page.

Under the Notebooks section, click + to create a new notebook.

In the Create Notebook dialog box, provide the name and description of your notebook and click Create.

 

The Jupyter notebook is created and you are redirected to the notebook page to select a kernel.

Select Python 3.

 

Once in the notebook, import tracking from the DI SDK using the command below:

from sapdi import tracking

The main entity of Tracking is a Run.

Runs are instances of a machine learning experiment consisting of metrics and parameters from one iteration along with the associated tags.

 

The recommended way is to surround the ML experiment code with start_run() at its beginning and end_run() at its end.

When the run is initialized via start_run(), it automatically records metadata such as the scenario details, source details, and start time.

Runs can be grouped logically using a Run Collection. When initializing a run, you can optionally specify the run collection with which to associate it. If none is specified, the system creates a default run collection for each source.
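One way to make sure end_run() is always reached, even when the experiment code raises an exception, is to wrap the pair in a context manager. The following is a minimal sketch and not part of the SDK: tracked_run is a hypothetical helper name, and the tracking module is passed in as an argument so the snippet has no hard dependency on sapdi. It relies only on the start_run(run_collection_name=...) and end_run() calls shown in this blog.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(tracking, run_collection_name=None):
    """Hypothetical helper: start a run and always end it.

    `tracking` is expected to be the module imported via
    `from sapdi import tracking`.
    """
    if run_collection_name is None:
        # No collection given: the system falls back to its default run collection.
        run = tracking.start_run()
    else:
        run = tracking.start_run(run_collection_name=run_collection_name)
    try:
        yield run
    finally:
        # Executed even if the experiment code raised an exception.
        tracking.end_run()
```

In a notebook this could be used as `with tracked_run(tracking, 'democollection') as run:` followed by the indented training and logging code.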

Metrics are logged using the log_metric function. It requires the parameters name and value, where value must be numeric.

You can use the log_metrics function to pass a list of metric items to be captured. Each item in the list is a dictionary that has the same parameters as the log_metric function.
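If metrics are kept in a flat dictionary during training, they can be reshaped into the list-of-items form described above before logging. This is a minimal sketch under the assumption that each item only needs the name and value keys; to_metric_items is a hypothetical helper, not an SDK function.

```python
from numbers import Real


def to_metric_items(metrics):
    """Hypothetical helper: turn {'accuracy': 0.89} into [{'name': 'accuracy', 'value': 0.89}]."""
    items = []
    for name, value in metrics.items():
        # Metric values must be numeric. bool is rejected explicitly
        # because Python treats it as a subclass of int.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(
                f"metric {name!r} must be numeric, got {type(value).__name__}"
            )
        items.append({"name": name, "value": value})
    return items


items = to_metric_items({"accuracy": 0.89, "avg_loss": 0.59})
print(items)
```

Validating before logging keeps a non-numeric value from surfacing only when the tracking call is made.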

 

Parameters are logged using the log_parameters function. It requires the name and value of each parameter. Each parameter value must be a string (any other type is automatically converted to a string).
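Because parameters end up as strings, numeric values read back from a run need to be cast before they are reused. The sketch below mimics the conversion locally with Python's str(); this is an assumption for illustration only, and the SDK's own conversion may format some values differently.

```python
params = {"hidden_size": 500, "learning_rate": 0.001}

# Assumed stand-in for the automatic conversion: every value becomes a string.
logged = {name: str(value) for name, value in params.items()}
print(logged)  # {'hidden_size': '500', 'learning_rate': '0.001'}

# Cast back explicitly when reusing the stored values.
hidden_size = int(logged["hidden_size"])
learning_rate = float(logged["learning_rate"])
```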

 

The set_tags function can be used to set tags on the current run. Tags are user-defined properties associated with a run. Each tag is a key-value pair in which both the key and the value are strings.

 

Use the get_runs function to fetch the run object that contains the metrics, parameters, tags, and so on. It requires one or more of the parameters scenario, scenario_version, pipeline, execution, notebook, and run.

The SDK creates an ID for the run and returns it as a run object when start_run() is called. You can use this object to retrieve the metrics and parameters recorded in this run.

Tags can also be used to filter runs.
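Once runs have been fetched, their metrics can be pivoted so that each metric name maps to its values across runs, which makes runs easier to compare. This is a rough sketch: it assumes the parsed payload is a list of runs, each with a metrics list of name/value items (the shape that the plotting script at the end of this blog indexes into). The sample payload is hand-written for illustration, not real SDK output, and metrics_by_name is a hypothetical helper.

```python
from collections import defaultdict


def metrics_by_name(runs):
    """Hypothetical helper: collect each metric's values across runs."""
    table = defaultdict(list)
    for run in runs:
        for item in run.get("metrics", []):
            table[item["name"]].append(item["value"])
    return dict(table)


# Hand-written stand-in for a parsed get_runs payload; not real SDK output.
sample_runs = [
    {"metrics": [{"name": "accuracy", "value": 0.89},
                 {"name": "avg_loss", "value": 0.59}]},
    {"metrics": [{"name": "accuracy", "value": 0.91}]},
]
print(metrics_by_name(sample_runs))
```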

If you want to log a metric after end_run(), call the log_metric function with the associated run object as an additional parameter.

 

To conclude, the Tracking SDK can be used to capture and organize experiments. The diagram below reflects the concepts of Runs, Run Collections, and the various objects of Tracking.

Note: It is not representative of the technical infrastructure.

 

The sample code is below:

# Import Tracking from DI SDK
from sapdi import tracking

# Initialize Tracking Run, under the provided Run Collection democollection
run = tracking.start_run(run_collection_name='democollection')

params = {
    'input_size': 784,
    'hidden_size': 500,
    'num_classes': 10,
    'num_epochs': 2,
    'batch_size': 100,
    'learning_rate': 0.001,
}

# Log Parameters for the run
tracking.log_parameters(params)

# Log Metrics for the run
tracking.log_metric('test', 10)

metrics = {
    'accuracy': 0.89,
    'cross_entropy': 3.2,
    'avg_loss': 0.59,
}

# Log multiple metrics for the run
tracking.log_metrics(metrics)

# Set tags on the current run
tracking.set_tags({
    'tag_key': 'tag_value',
    'tag_key_n': 'tag_value_n',
})

# Mark the end of tracking
tracking.end_run()

# Fetch the payload from Tracking for the current run
runs = tracking.get_runs(run=run)

# Fetch runs from Tracking filtered by tags (in addition to other parameters)
runs = tracking.get_runs(run=run, tags={'tag_key': 'tag_value'})

# Persist a metric after the run has ended
tracking.log_metric('test_metric', 0.4, run=run)

After the experiment has ended, you can view the runs and visualize the metrics via Metric Explorer.

Alternatively, you can query the runs and metrics from the Tracking SDK and plot them in the notebook using a Python library.

For the metrics captured by the sample tracking code above, you can plot them using the sample script below:

 

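import json
import matplotlib.pyplot as plt
import pandas as pd
import sapdi
from sapdi import tracking

# Look up the current scenario and the notebook whose runs should be fetched
sc = sapdi.get_current_scenario()
notebook = sapdi.scenario.Notebook.get(notebook_id="Demo Notebook.ipynb")
run_data = tracking.get_runs(scenario=sc, notebook=notebook)

# Parse the payload and load the metrics of the first returned run into a DataFrame
run_json = json.loads(str(run_data))
df = pd.DataFrame.from_dict(run_json[0]['metrics'])

# Plot each metric value against its name as red dots
plt.plot(df['name'], df['value'], 'ro')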

Plot as follows:

 
