Background: 


 

Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.   

There has been a rise in the number and variety of ML platforms providing machine learning and modeling capabilities, along with data storage and processing. Businesses can now seamlessly utilize these platforms for efficient training and deployment of machine learning models.

Data scientists training ML models in Databricks face the challenge of accessing and working with SAP data. A data scientist has to rely on a data engineer to build a pipeline that extracts data from SAP source systems and prepares it for use in ML experimentation. Extracting and migrating data out of the source systems is both expensive and time-consuming. Moreover, the data scientist may need additional non-SAP data modeled together with SAP data for use in ML experimentation.

Proposed Solution:


 

FedML Databricks is a library built to address these issues. The library leverages the data federation architecture of SAP Datasphere and provides functions that enable businesses and data scientists to build, train and deploy machine learning models on ML platforms, thereby eliminating the need to replicate or migrate data out of its original source.

By abstracting the data connection, data load, model deployment and model inference on these ML platforms, the FedML Databricks library provides end-to-end integration with just a few lines of code. 
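Before diving into the details, here is a minimal sketch of the end-to-end flow covered in this blog (all values in angle brackets are placeholders, and each step is explained in detail below):

from fedml_databricks import DbConnection, deploy_to_kyma, predict

dsp = DbConnection(dict_obj=config)                               # connect to SAP Datasphere (config holds the connection JSON)
df = dsp.execute_query('SELECT * FROM "<schema>"."<view>"')       # federate the data into a DataFrame
# ... train and register an MLflow model on the federated data ...
endpoint_url = deploy_to_kyma(databricks_config_path='<config-json-path>')   # deploy to SAP BTP, Kyma runtime
result = predict(endpoint_url=endpoint_url, content_type='<content-type>', data='<test-data>')   # inference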


Solution Diagram


In this blog, we use the FedML Databricks library to train an ML model with data from SAP Datasphere and deploy the model to Databricks and to SAP BTP, Kyma runtime. We also run inference on the deployed model and store the inference results back in SAP Datasphere for further analysis.

Data can be federated to SAP Datasphere from numerous SAP and non-SAP data sources. Data from different sources can also be merged into a view, which can then be used for the FedML experiment. Please ensure that the view used for the FedML experiment is exposed for consumption.

Train and deploy the model using the FedML Databricks library: 


Pre-requisites: 


1. Create a Databricks workspace in any of the three supported hyperscalers (AWS, Azure, GCP). 

2. Create a cluster in the Databricks Workspace by referring to the guide.

3. Create a notebook in the Databricks Workspace by referring to the guide. 

4. Whitelist the Databricks cluster IP in SAP Datasphere as follows:

Note: You will need a non-community Databricks account to perform the steps below. For a trial Databricks account, you can whitelist "0.0.0.0/0" in SAP Datasphere by referring to this guide and skip the steps below.

a. For Azure Databricks:

  • Ensure that you have created an Azure Databricks Workspace with secure cluster connectivity, as listed in the pre-requisites section. If not already created, create it by referring to the article.

  • In the overview page of the created Azure Databricks Workspace, navigate to "Managed Resource Group". Search for "NAT gateway" in the overview page of the Managed Resource Group and navigate to the NAT gateway.

  • In the overview page of the NAT Gateway, click on "Outbound IP" under “Settings” and take a note of the IP address under “Public IP addresses”. Whitelist this IP address in SAP Datasphere by referring to the guide.


b. For AWS Databricks:

  • Ensure that you have created a Databricks workspace in AWS as listed in the pre-requisites section. If not already created, create it by referring to the article.

  • Navigate to VPC Dashboard in the same region as the Databricks Workspace.

  • On the VPC Dashboard, navigate to "NAT Gateways". Select the NAT Gateway associated with the Databricks VPC and copy the IP address listed under "Primary public IPv4 address". Whitelist this IP address in SAP Datasphere by referring to the guide.


Using the FedML Databricks Library: 

1. Install the FedML Databricks library.



%pip install fedml-databricks --no-cache-dir --upgrade --force-reinstall

Import the necessary libraries: 



from fedml_databricks import DbConnection, predict

It may also be useful to import the following libraries if you are using them in your notebook:



import numpy as np
import pandas as pd
import json

 

2. Create a secure connection to SAP Datasphere and retrieve the data. 


Create a Databricks secret scope by referring to the article Create a Databricks-backed secret scope on the Databricks website. Then, create a Databricks secret containing the SAP Datasphere connection details in JSON form, as described in the article. The SAP Datasphere JSON connection credentials can be obtained using the method described in the GitHub documentation of the DbConnection class.
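The secret value is the SAP Datasphere connection JSON. Once parsed, it yields a Python dictionary roughly of the following shape; the key names below are illustrative assumptions, so please consult the DbConnection documentation linked above for the exact keys and obtain the values from your SAP Datasphere database user:

# Illustrative structure only - the exact keys are defined in the DbConnection documentation
config_example = {
    "address": "<datasphere-hostname>",
    "port": "<port>",
    "user": "<database-user>",
    "password": "<password>",
    "schema": "<schema>"
}

In the Databricks notebook, retrieve and parse the secret: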



config_str=dbutils.secrets.get('<secret-scope>','<secret-key>')
config=json.loads(config_str)

Now, create a DbConnection instance to connect to SAP Datasphere: 



dsp = DbConnection(dict_obj=config) 

We can now retrieve the data. There are multiple ways of retrieving data from SAP Datasphere. The following code gets the data from SAP Datasphere as a pandas DataFrame. Enter the appropriate schema and view name below:



df=dsp.execute_query('SELECT * FROM \"<schema>\".\"<view>\"') 
df
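If only a subset of the data is needed, the SQL passed to execute_query can be adapted accordingly. A small illustrative example with placeholder column names and filter:

df = dsp.execute_query('SELECT "<column1>", "<column2>" FROM "<schema>"."<view>" WHERE "<column1>" IS NOT NULL')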

 

3. Train the ML model using MLflow. 


You can train an ML model using the Databricks-managed MLflow library. Follow this MLflow guide to get started.

Import the MLflow library:



import mlflow

Here is a sample linear regression model being trained using MLflow: 



from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_model(x_train, x_test, y_train, y_test, experiment_name, model_name):
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        model = LinearRegression().fit(x_train, y_train)
        score = model.score(x_test, y_test)
        mlflow.log_param("score", score)
        mlflow.sklearn.log_model(model, model_name,
                                 registered_model_name=model_name)

        run_id = run.info.run_id
    return run_id

x_train, x_test, y_train, y_test = train_test_split(dataframe, y, test_size=0.3)
experiment_name, model_name = '/Users/<user>/<experiment-name>', '<model_name>'
run_id = train_model(x_train, x_test, y_train, y_test, experiment_name, model_name)
model_uri = f"runs:/{run_id}/{model_name}"
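Note that dataframe and y in the snippet above stand for the feature matrix and the target column derived from the data retrieved from SAP Datasphere. A minimal sketch of this preparation step, assuming a hypothetical label column:

# '<target-column>' is a placeholder for the label column in your view
y = df['<target-column>']
dataframe = df.drop(columns=['<target-column>'])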

 

4. Deploy the ML model as a web service endpoint and perform inference on the deployed model.


Option 1: Deploy the trained MLflow model to Databricks:  


You can log, register and deploy MLflow models using the Databricks-managed MLflow library. More information on Databricks machine learning capabilities can be found in this guide.

Executing the notebook inside the Databricks workspace registers the model in the managed MLflow model registry. If you trained the model outside of Databricks, you can register it in the MLflow model registry as follows:



import time
model_version = mlflow.register_model(model_uri=model_uri,name=model_name)

# Registering the model takes a few seconds, so add a small delay
time.sleep(15)

 Transition the model to Production: 


You can do this either in the managed MLflow UI on Databricks or inside the notebook:

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production",
)

You can use MLflow to deploy models for batch or streaming inference, or to set up a REST endpoint to serve the model. To batch inference the MLflow model deployed in Databricks:



model = mlflow.pyfunc.load_model(f"models:/{model_name}/production")
inference_result = model.predict(<test_data>)
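If you instead expose the model through Databricks model serving, you can query the REST endpoint over HTTP. A minimal sketch, assuming legacy MLflow Model Serving is enabled for the registered model and a Databricks personal access token is available; the exact invocation URL and payload format depend on your Databricks serving setup:

import requests

# Assumptions: <databricks-instance> is the workspace URL noted earlier and <access-token>
# is a Databricks personal access token authorized for the serving endpoint.
serving_url = f"https://<databricks-instance>/model/{model_name}/Production/invocations"
headers = {"Authorization": "Bearer <access-token>", "Content-Type": "application/json"}

# Send the test data as split-oriented JSON (one common format accepted by MLflow scoring servers)
response = requests.post(serving_url, headers=headers, data=x_test.to_json(orient="split"))
print(response.json())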

 

Option 2: Deploy the MLflow model to SAP BTP, Kyma runtime:


The MLflow model trained in Databricks can be deployed to the SAP BTP, Kubernetes environment using the hyperscaler container registry. Currently, deployment of MLflow models to the SAP BTP, Kubernetes environment is supported on AWS and Azure, with support for GCP in the pipeline.

You can deploy the MLflow model using the same hyperscaler infrastructure used by Databricks. For example, if you use Azure Databricks, you can use Azure to deploy the MLflow model trained in Azure Databricks to SAP BTP, Kyma runtime. 

4.2.1. Complete the pre-requisite steps for SAP BTP, Kyma runtime by referring to the guide. 

4.2.2. Take note of the ‘DATABRICKS_URL’ and ‘MODEL_URI’ by running the below cell in the Databricks notebook:

print("The DATABRICKS_URL is 'https://{}'".format(spark.conf.get("spark.databricks.workspaceUrl")))
print("The MODEL_URI is '{}'".format(model_uri))

For ease of use, you can perform steps 4.2.3 and 4.2.4 in the hyperscaler Jupyter notebook (Azure ML notebook or SageMaker notebook):


4.2.3. Create a configuration file with the necessary details for SAP BTP, Kyma runtime deployment for AWS or Azure using the AWS template or Azure template. The values for the configuration file can be obtained by completing the above two steps. 

4.2.4. Deploy the Databricks MLflow model to the SAP BTP, Kubernetes environment using the method below. The ‘databricks_config_path’ refers to the path of the configuration file created in the previous step:



from fedml_databricks import deploy_to_kyma 
endpoint_url=deploy_to_kyma(databricks_config_path='<databricks-config-json-file-path>')
print("The kyma endpoint url is '{}'".format(endpoint_url))

Take note of the SAP BTP, Kubernetes environment endpoint. 


Inference the MLflow model deployed in the SAP BTP, Kubernetes environment from within the Databricks notebook as follows:



inference_dataframe = predict(endpoint_url=<kyma-endpoint>, content_type=<content-type>, data=<test-data>)

 

5. The FedML Databricks library allows bi-directional data access. You can store the inference results in SAP Datasphere for further use and analysis.


5.1 Create a table in SAP Datasphere: 



dsp.create_table("CREATE TABLE <table_name> (ID INTEGER PRIMARY KEY, <column_name> <data_type>,..)") 
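For instance, a table to hold the inference results of the regression model trained above could be created as follows (the table and column names are illustrative):

dsp.create_table("CREATE TABLE INFERENCE_RESULTS (ID INTEGER PRIMARY KEY, PREDICTION FLOAT)")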

5.2 You can now restructure the data to write back to SAP Datasphere in your desired format and insert the data into the table:



dsp.insert_into_table('<table_name>',<pandas_dataframe_containing_datasphere_data>) 
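As an illustration, the batch inference result from Option 1 could be restructured to match the illustrative table created above and written back like this:

# Column names match the illustrative CREATE TABLE example above
result_df = pd.DataFrame({
    'ID': range(1, len(inference_result) + 1),
    'PREDICTION': inference_result
})
dsp.insert_into_table('INFERENCE_RESULTS', result_df)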

Now that the data is inserted into the local table in SAP Datasphere, you can create a view on top of it and deploy the view in SAP Datasphere. You can then use the view to perform further analysis using SAP Analytics Cloud.


More information on the use of the library and end-to-end sample notebooks can be found in our GitHub repo here.


In summary, the FedML Databricks library provides an effective and convenient way to federate data from multiple SAP and non-SAP source systems, without the overhead of data migration or replication. It enables data scientists to effectively model SAP and non-SAP data in real time for use in ML experimentation. It also provides the capabilities to deploy models to SAP BTP, Kyma runtime, perform inference on the deployed web service, and store the inference data back in SAP Datasphere for further use and analysis.


Please read our blog here to learn how external data from Databricks Delta tables can be federated live and combined with data from SAP applications via SAP Datasphere unified models for real-time analytics using SAP Analytics Cloud.

Credits


Many thanks to the Databricks team for their support and collaboration in validating this architecture – Itai Weiss, Awez Syed, Qi Su, Felix Mutzl and Catherine Fan. Thanks to the SAP team members for their contribution towards this architecture – Sangeetha Krishnamoorthy, Karishma Kapur, Ran Bian, Sandesh Shinde, and to Sivakumar N and Anirban Majumdar for their support and guidance.

If you have any questions, please leave a comment below or contact us at paa@sap.com. 
