Federated Machine Learning using SAP Datasphere and Amazon SageMaker 2.0
Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.
There has been a rise in the number and variety of hyperscaler platforms that provide machine learning and modeling capabilities alongside data storage and processing. Businesses that use these platforms for data storage can now seamlessly use them to train and deploy machine learning models efficiently.
Training machine learning models on most of these platforms is relatively smooth if the training data resides in the platform-native data stores. This creates a new challenge because of the tight coupling of these features with the native data storage: extracting and migrating data from one data source to another is both expensive and time-consuming.
SAP Federated-ML or FedML is a library built to address this issue. The library applies the Data Federation architecture of SAP Datasphere and provides functions that enable businesses and data scientists to build, train and deploy machine learning models on hyperscalers, thereby eliminating the need for replicating or migrating data out from its original source.
By abstracting the data connection, data loading, and model training on these hyperscalers, the FedML library provides end-to-end integration with just a few lines of code.
Training a Model on AWS SageMaker with FedML AWS
In this blog post, we will build and deploy a machine learning model using FedML AWS with data in local tables from SAP Datasphere. Once deployed, we will run predictions on our model. Please note that the local tables can be swapped out for data stored in SAP and non-SAP sources.
If you choose to run this experiment using data from SAP and non-SAP sources, you will have to federate those data sources. You can then either join the federated tables to create a view in SAP Datasphere, or join them in your notebook into a single DataFrame to run your FedML experiment with.
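A minimal sketch of the notebook-side join, using two hypothetical DataFrames standing in for federated views (the column names here are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data pulled from two separate federated views
sap_df = pd.DataFrame({"CUSTOMER_ID": [1, 2, 3], "REVENUE": [100.0, 250.0, 80.0]})
aws_df = pd.DataFrame({"CUSTOMER_ID": [1, 2, 3], "REGION": ["EMEA", "APJ", "AMER"]})

# Join on the shared key to build a single training DataFrame
joined = sap_df.merge(aws_df, on="CUSTOMER_ID", how="inner")
print(joined.shape)  # (3, 3)
```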
Please ensure any views created in SAP Datasphere for this experiment have consumption turned on.
You must have your environment set up. Please refer here for the steps to complete this requirement.
- You must have an AWS account.
- You must have an Amazon SageMaker Notebook Instance with the proper IAM roles and permissions set.
Using FedML AWS
To learn about all the possible functions and parameters, please refer to the FedML AWS README here.
1. Installing fedml_aws
pip install fedml-aws --force-reinstall
2. Import the libraries needed
from fedml_aws import DwcSagemaker
from fedml_aws import DbConnection
It may also be useful to import the following if you are using them in your notebook:
import numpy as np
import pandas as pd
3. Create a DwcSagemaker instance to access the class's functions
dwcs = DwcSagemaker(prefix='<insert your bucket prefix here>', bucket_name='<insert your bucket name here>')
4. Create a DbConnection instance and get data from SAP Datasphere.
This step requires you to have a config.json inside your AWS notebook instance. The config.json provides the credentials that DbConnection needs to access your views in SAP Datasphere. Please refer to the DbConnection documentation to learn how to set this up.
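As an illustration only, the config.json could be generated with a snippet like the following; the field names shown are assumptions, so confirm the exact keys against the DbConnection documentation for your library version:

```python
import json

# Illustrative only -- these field names are assumptions; confirm them
# against the DbConnection documentation before use.
config = {
    "address": "<your SAP Datasphere host>",
    "port": "443",
    "user": "<database user>",
    "password": "<password>",
    "schema": "<space schema>",
}

# Write the file to the notebook's working directory
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```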
Once you have the config.json set up, you can then run the following snippet to get data from SAP Datasphere.
db = DbConnection()
res, column_headers = db.get_data_with_headers(table_name='<VIEW_NAME>', size=1)
data = pd.DataFrame(res, columns=column_headers)
data
You can also use the following snippet for more flexibility on your query to SAP Datasphere.
db = DbConnection()
res, column_headers = db.execute_query('SELECT * FROM <SCHEMA_NAME>.<VIEW_NAME>')
data = pd.DataFrame(res, columns=column_headers)
data
5. If you want to do any manipulation to the data outside of your training script, you can do so now. This also includes splitting your data into train and test data sets if you would like.
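For example, a simple split using scikit-learn; the DataFrame here is a toy stand-in for the data retrieved from SAP Datasphere:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the DataFrame retrieved from SAP Datasphere
data = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# Hold out 20% of rows for evaluating the trained model
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_data), len(test_data))  # 8 2
```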
6. Next, we will send our training data to our training script, which will start a training job in AWS SageMaker.
If you do not have test data to pass, please omit the test_data parameter.
clf = dwcs.train_sklearn_model(train_data=train_data,
                               test_data=test_data,
                               content_type='text/csv',
                               train_script='<training script .py file>',
                               instance_count=1,
                               instance_type='ml.c4.xlarge',
                               wait=True,
                               base_job_name='<optional name of job>')
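The train_script parameter points to a SageMaker-style scikit-learn entry point. A minimal sketch of such a script follows; the file name train.csv, the target column, and the channel defaults are illustrative assumptions:

```python
# train.py -- minimal SageMaker-style scikit-learn entry point (sketch).
import argparse
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train(df, label_col="target"):
    # Split features from the label column and fit a simple classifier
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return LogisticRegression().fit(X, y)


# model_fn is required by the SageMaker scikit-learn container at serving time
def model_fn(model_dir):
    return joblib.load(os.path.join(model_dir, "model.joblib"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    args, _ = parser.parse_known_args()

    # The exists-guard lets the sketch run locally without a train channel
    train_path = os.path.join(args.train, "train.csv")
    if os.path.exists(train_path):
        df = pd.read_csv(train_path)
        model = train(df)
        joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
```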
7. Now that we have fit a model using FedML AWS, we can deploy it.
FedML AWS provides two options for deploying: you can deploy either to the AWS SageMaker environment or to your SAP BTP Kyma environment. We will walk through both in this blog.
Option 1: Deploy to AWS SageMaker Environment
This option is straightforward and only requires the following snippet to be run:
predictor = dwcs.deploy(clf, initial_instance_count=1, instance_type="ml.c4.xlarge", endpoint_name='<endpoint name>')
Option 2: Deploy to SAP BTP Kyma Environment
This option requires some information from you regarding your Kyma Environment and some IAM permissions. This ensures the library has access to your Amazon ECR and can deploy the model to your Kyma Environment.
First, you must have an IAM user that does not have MFA enabled and has EC2 container registry access permissions for pushing and pulling to Amazon ECR. Refer here for information on how to create an IAM user.
Next, you must create a profile using the AWS CLI in the AWS SageMaker notebook that connects to the IAM user with EC2 container registry access permissions. Please ensure the region of this profile is the same as the region of the AWS SageMaker notebook Jupyter instance.
!aws configure set aws_access_key_id '<aws_access_key_id>' --profile '<name of profile>'
!aws configure set aws_secret_access_key '<aws_secret_access_key>' --profile '<name of profile>'
!aws configure set region '<region>' --profile '<name of profile>'
Now, you also need a kubeconfig.yml that specifies the credentials for your Kyma account.
- Follow the steps in this tutorial; however, for steps 2.1 and 4.1, also apply the changes below:
- In step 2.1 of the tutorial, add the following fields:
- Add ‘namespaces’ under ‘rules -> resources’ section of yaml file
- Add ‘watch’ under ‘rules -> verbs’ section of yaml file.
- In step 4.1 of the tutorial, replace the following:
- Replace the value of ‘name’ under ‘clusters’ section with the cluster name of the Kyma Kubernetes cluster.
- Replace the value of ‘name’ under ‘users’ section and ‘user’ under ‘contexts’-> ‘context’ with ‘OIDCUser’ user.
- Replace the value of ‘name’, ‘context -> cluster’ under ‘contexts’ section with the cluster name of the Kyma Kubernetes cluster.
- Replace the value of ‘current-context’ with the cluster name of the Kyma Kubernetes cluster.
- Please note that if you are using Windows and step 2 in the tutorial doesn’t create the proper kubeconfig, you may have to run each command manually to get the values and then replace them in the kubeconfig yourself.
Finally, we can now run the deploy_to_kyma() function. Please note that the name of the AWS CLI profile we created earlier is passed to the function as profile_name. The kubeconfig.yml must be located at the root of your notebook (not in any subfolders); the full path in the terminal is ‘/home/ec2-user/SageMaker’.
dwcs.deploy_to_kyma(clf, initial_instance_count=1, profile_name='<name of profile>')
To provide greater flexibility in case you didn’t recently run the train function, you can also pass the name of a previous training job as the clf parameter.
Please note that since we are using the training job from AWS SageMaker, the required function (model_fn) and the functions that may be overridden (input_fn, predict_fn, output_fn) follow the same rules. If model_fn is not provided, the model will fail to deploy. If input_fn, predict_fn, and/or output_fn are provided in the training script passed to train_sklearn_model(), the model will use the functions you provided.
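As a hedged sketch of what these optional overrides might look like inside the training script, following the SageMaker scikit-learn container conventions (the payload formats handled here are illustrative):

```python
# Optional serving overrides for the training script (sketch).
import io
import json

import pandas as pd


def input_fn(request_body, content_type):
    # Deserialize the request payload into a DataFrame
    if content_type == "text/csv":
        return pd.read_csv(io.StringIO(request_body), header=None)
    if content_type == "application/json":
        return pd.read_json(io.StringIO(request_body))
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(input_data, model):
    # Run inference with the model returned by model_fn
    return model.predict(input_data)


def output_fn(prediction, accept):
    # Serialize predictions back to the caller as JSON
    return json.dumps([float(p) for p in prediction])
```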
This cell uses the prebuilt SageMaker SKLearn image to build a Docker image of your model and pushes the image to Amazon ECR. Kyma then pulls this image and uses it for deployment. Once deployed, an endpoint is provided with /invocations for predictions and /ping for availability checks.
8. Now that we have deployed our model to either AWS SageMaker or SAP BTP Kyma, we can run predictions against our endpoint.
The function used to run predictions depends on where you deployed the model. If you deployed to AWS SageMaker, please follow these steps:
result = dwcs.predict(endpoint_name=predictor,
                      body=df.to_csv(header=False, index=False).encode('utf-8'),
                      content_type='text/csv')
If you deployed the model to SAP BTP Kyma, please follow these steps:
result = dwcs.invoke_kyma_endpoint(api='<endpoint with /invocations as printed from deploy_to_kyma() console logs>',
                                   payload=X.to_json(),
                                   content_type='application/json')
result = result.content.decode()
9. Finally, now that we have a working model and can run predictions, we can write our prediction results back to SAP Datasphere for further use and analysis.
First, you’ll need to create a table:
db.create_table("CREATE TABLE <table_name> (ID INTEGER PRIMARY KEY, <column_name> <datatype>, …)")
You’ll then want to prepare your prediction results to follow the format of your create table statement above, ensuring the proper data types and column names.
Once your data set is ready, you can start inserting it into your table. Based on the size of your dataset, insertion might take some time.
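A minimal sketch of shaping the results, assuming a table with illustrative ID and PREDICTION columns; the insert_into_table call is shown commented out because it requires a live DbConnection, and its name should be confirmed against the FedML documentation:

```python
import pandas as pd

# Illustrative predictions from the deployed endpoint
predictions = [0.1, 0.7, 0.3]

# Shape the results to match the CREATE TABLE statement above;
# the column names and types here are illustrative assumptions
results = pd.DataFrame({
    "ID": range(1, len(predictions) + 1),
    "PREDICTION": predictions,
})

# Insert the rows into the SAP Datasphere table (requires a live
# DbConnection instance; method name as used in the FedML samples)
# db.insert_into_table("<table_name>", results)
```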
You can also drop the table as follows:
db.drop_table("DROP TABLE <table_name>")
Once the table is created and the insertions are done, you will have a local table in SAP Datasphere with the data you inserted. You can deploy this table in SAP Datasphere, create a view, and run further analytics on it using SAC if you would like.
For more information on the use of the library and some sample notebooks with the corresponding training scripts, please refer here.
In summary, FedML makes it extremely convenient for data scientists and developers to perform cross-platform ETL and train machine learning models on hyperscalers without the hassle of data replication and migration. The new features of V2 include deploying an AWS SageMaker model on AWS SageMaker, deploying an AWS SageMaker model on SAP BTP Kyma, running predictions on these deployed models, and writing prediction results to a table in SAP Datasphere.
If you have any questions, please leave a comment below or contact us at email@example.com.
Hi Karishma Kapur!
Thank you for the update here with this innovative approach extending the ML capabilities of SAP Data & Analytics.
Thank you for your write-up. I am trying to set up a connection to SAP Datasphere (DWC) from an AWS SageMaker notebook; however, when calling DbConnection() the package throws an error because it cannot connect to the remote server.
Could it be that SageMaker needs to be set up to allow connections from the notebook to remote machines?
Please make sure the Elastic IP that provides connectivity to your SageMaker notebook instance from internet is added to the Allow list in DWC as Trusted IP. You may have to configure VPC for your SageMaker notebook and provide connectivity through a NAT gateway. In addition, you'll also have to add the IP of DWC to your notebook's security group's inbound rules.
This should resolve this error.