Pankti Jayesh Kansara

Federated Machine Learning using SAP Data Warehouse Cloud and Amazon SageMaker

Background

Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.

There has been a rise in the number and variety of hyperscaler platforms that provide machine learning and modeling capabilities alongside data storage and processing. Businesses that use these platforms for data storage can also use them to train and deploy machine learning models.

Training machine learning models on most of these platforms is smoother when the training data resides in the platform's native data store. This tight coupling between the training features and native storage creates a new challenge when the data lives elsewhere: extracting and migrating data from one source to another is both expensive and time-consuming.

Proposed Solution

SAP Federated-ML, or FedML, is a library built to address this issue. The library applies the Data Federation architecture of SAP Data Warehouse Cloud and provides functions that enable businesses and data scientists to build, train and deploy machine learning models on hyperscalers, eliminating the need to replicate or migrate data out of its original source.

By abstracting the data connection, data load and model training on these hyperscalers, the FedML library provides end-to-end integration with just a few lines of code.

This blog post focuses on training a machine learning model on Amazon SageMaker with data from Google BigQuery.

Note: This post assumes that the training data is already present in BigQuery and accessible through SAP Data Warehouse Cloud (SAP DWC). Refer to this blog post for steps on how to integrate BigQuery with SAP DWC.

 

1. Create an Amazon SageMaker notebook instance

Follow Step 1 of this guide for creating a notebook instance on SageMaker, creating an IAM role and adding required permissions.

 

2. Download Federated-ML for AWS

Download the library using the link below. It is saved to your local system as a .whl (Python wheel) file.

Download library

3. Install the library on your SageMaker notebook instance using the following command

pip install fedml_aws-1.0.0-py3-none-any.whl --force-reinstall
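
To confirm that the wheel landed in the notebook's active environment, a small check like the following may help. It is only a sketch: `installed_version` is a hypothetical helper, and the distribution name `fedml_aws` is inferred from the wheel file name above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the prefix of the wheel file name.
version = installed_version("fedml_aws")
print(version if version else "fedml_aws is not installed in this environment")
```

If the check reports that the package is missing, the pip command was most likely run against a different kernel or environment than the one the notebook uses.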

 

4. Use the following imports to utilize library functionalities

from fedml_aws import DbConnection
from fedml_aws import DwcSagemaker

 

5. Read BigQuery data from SAP DWC and load it into SageMaker notebook

import pandas as pd

# Open a connection to SAP DWC through the FedML library
db = DbConnection()
# The query should fetch only the data needed to train the model;
# extracting and loading the entire view is not required
train_data = db.execute_query('<your_query_to_fetch_train_data>')

# Index 0 holds the rows and index 1 holds the column names
train_data = pd.DataFrame(train_data[0], columns=train_data[1])
train_data.head()

Fetch only the rows and features needed to train the model; the full view does not have to be loaded into the SageMaker notebook.
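
One way to keep the fetched data small is to build the query from an explicit column list and filter. The helper below is a minimal sketch, not part of FedML: `build_training_query`, the schema, view and column names are all hypothetical, and it assumes the view is exposed through a SQL interface that accepts double-quoted identifiers and a `LIMIT` clause.

```python
def build_training_query(schema, view, columns, where=None, limit=None):
    """Build a SELECT that projects only the listed columns.

    Identifiers are quoted but not escaped, so pass trusted names only.
    """
    if not columns:
        raise ValueError("list at least one column to select")
    quoted_cols = ", ".join(f'"{col}"' for col in columns)
    query = f'SELECT {quoted_cols} FROM "{schema}"."{view}"'
    if where:
        # Row filter: restricts the result to the rows needed for training
        query += f" WHERE {where}"
    if limit is not None:
        # Optional cap on the number of rows, useful while prototyping
        query += f" LIMIT {int(limit)}"
    return query


# Hypothetical schema, view and columns, for illustration only
query = build_training_query(
    "SALES", "ORDERS_VIEW", ["AMOUNT", "REGION"],
    where='"YEAR" >= 2020', limit=1000,
)
print(query)
```

The resulting string could then be passed to `db.execute_query` in place of the placeholder query above.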

 

6. Train a Scikit-learn model on SageMaker using the extracted data

dwcs = DwcSagemaker(prefix='fedml-sample', bucket_name='temp')
clf = dwcs.train_sklearn_model(train_data=train_data,
                               content_type='text/csv',
                               train_script='fedml-sample-train.py',
                               instance_count=1,
                               instance_type='ml.c4.xlarge',
                               wait=True
                              )

Details about train_script and some example notebooks with their corresponding training scripts can be found here.
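
The training script itself is not shown in this post. As a rough sketch, a script for SageMaker's Scikit-learn container typically reads its input and output locations from environment variables, fits a model, and saves it to the model directory. The following makes several assumptions that are not confirmed by this post: the data arrives on a channel named `train` (hence `SM_CHANNEL_TRAIN`), the CSV files have a header row with the label in the first column, and the `--n-estimators` hyperparameter and `RandomForestClassifier` are illustrative choices only.

```python
import argparse
import os
from pathlib import Path


def parse_args(argv=None):
    """Read hyperparameters and the directories provided by SageMaker."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter, for illustration only
    parser.add_argument("--n-estimators", type=int, default=100)
    # SageMaker exposes these locations as environment variables
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data"))
    return parser.parse_args(argv)


def train(args):
    """Fit a classifier on every CSV file in the training channel and save it."""
    # Imported here so the argument handling can be checked without these packages
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    csv_files = sorted(Path(args.train).glob("*.csv"))
    data = pd.concat((pd.read_csv(path) for path in csv_files), ignore_index=True)

    # Assumption: label in the first column, features in the remaining columns
    labels = data.iloc[:, 0]
    features = data.iloc[:, 1:]

    model = RandomForestClassifier(n_estimators=args.n_estimators)
    model.fit(features, labels)

    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Load the saved model; the Scikit-learn serving container calls this."""
    import joblib

    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

A complete script would end with an `if __name__ == "__main__":` guard that calls `train(parse_args())`. The example scripts linked above remain the authoritative reference for what FedML expects.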

 

FedML makes it convenient for data scientists and developers to perform cross-platform ETL and train machine learning models on hyperscalers without the hassle of data replication and migration.

      3 Comments
      Peter Baumann

      Hello Pankti,

      thank you for this very interesting blog. I really would like to learn more about SAP Federated-ML or FedML.

      For me, the GitHub link does not seem to be accessible. Can you provide further information or give hints on what is necessary to access the GitHub information?

      Kind regards,

      Peter

      Pankti Jayesh Kansara
      Blog Post Author

      Hi Peter,

      Thanks for reaching out!

      GitHub links are now fixed in the article and you should be able to access them now.

      Best,

      Pankti

      Peter Baumann

      Perfect, thank you!