Federated Machine Learning using SAP Datasphere & ...

jackseeburger · ‎06-14-2022

SAP Federated-ML or FedML is a library that enables businesses and data scientists to build, train and deploy machine learning models on hyperscalers, thereby eliminating the need for replicating or migrating data out from its original source.

If you would like to know more about the FedML library and the data federation architecture of SAP Datasphere upon which it is built, please reference our overview blog here.

Training a Model on VertexAI with FedML GCP

In this article, we focus on building a machine learning model on Google Cloud Platform VertexAI by federating the training data via SAP Datasphere without the need for replicating or moving the data from the original data storage.

To learn how to set up a connection between SAP S/4HANA and SAP Datasphere and SAP HANA (on-premise and cloud) and Datasphere, please refer here. Please also create a view in DWC with consumption turned on.

If you would like to use local tables in Datasphere instead of connecting SAP S/4HANA or SAP HANA on-premise or SAP HANA Cloud to Datasphere, please refer here. Please create a view in Datasphere with consumption turned on.

Once you have the data, you can merge these tables for a merged view and run your FedML experiment with the merged dataset.

Find more detailed examples of specific data science use cases on the FedML Github page in the sample notebooks folder.

1.Set up your environment

1. Follow this guide to create a new Vertex AI notebook instance
2. Create a Cloud Storage bucket to store your training artifacts

2. Install the FedML GCP Library in your notebook instance with the following command

pip install fedml-gcp

3. Load the libraries with the following imports

from fedml_gcp import dwcgcp

4. Instantiate some constant variables to use throughout your notebook.

PROJECT_ID = '<project_id>' 

REGION = '<region>'

BUCKET_NAME = '<bucket_name>'

BUCKET_URI = "gs://"+BUCKET_NAME

BUCKET_FOLDER = 'folderName'

MODEL_OUTPUT_DIR = BUCKET_URI+'/'+BUCKET_FOLDER

GCS_PATH_TO_MODEL_ARTIFACTS= MODEL_OUTPUT_DIR+'/model/'

TRAINING_PACKAGE_PATH = 'local-notebook-path-trainingpackage'

JOB_NAME = "job-name"

MODEL_DISPLAY_NAME = "model-name"

DEPLOYED_MODEL_DISPLAY_NAME = 'deployed-model'

TAR_BUNDLE_NAME = 'yourbundlename.tar.gz'

5. Create a new DwcGCP class instance with the following (replace the project name and bucket URI)

params = {'project':PROJECT_ID,

         'location':REGION, 

         'staging_bucket':BUCKET_URI}



dwc = dwcgcp.DwcGCP(params)

6. Determine which training image and deploying image you want to use.

1. Please refer here for the training pre-built containers: https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container
2. Please refer here for the deployment pre-built containers: https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers

TRAIN_VERSION = "scikit-learn-cpu.0-23"

DEPLOY_VERSION = "sklearn-cpu.1-0"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)

DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

7. Make a tar bundle of your training script files. This is your training package where your model scripts, utility functions and library requirements will be maintained. Vertex AI will then use this package for your custom training job.

1. You can find example training script files here
  1. Open a folder and drill down to the trainer folder (which contains the scripts)
2. And more information about GCP training application structure here

dwc.make_tar_bundle(TAR_BUNDLE_NAME, 

                    TRAINING_PACKAGE_PATH, 

                    BUCKET_FOLDER+'/train/'+TAR_BUNDLE_NAME)

8. Create your training inputs

1. More info about training inputs can be found here
2. Replace ‘DATA_VIEW’ with the name of the table you exposed in Datasphere Cloud
3. ‘cmd_args’ are the arguments you want access to in your training script. In this example case, we want access to the table name we want to get data from, the size of the table, the job-dir for where we should store job artifacts, the bucket name in gcs that we want to use for artifact storing, the bucket folder since we are using a specific folder in the bucket for this model, and the package name of the training package we created. However, if you have different arguments you want to use in your training script, you would edit these arguments here to correspond to the arguments you need.

table_name = 'DATA_VIEW'

job_dir = 'gs://'+BUCKET_NAME

cmd_args = [

    "--table_name=" + str(table_name),

    "--job-dir=" + str(job_dir),

    "--table_size=" + '1',

    "--bucket_name=" + str(BUCKET_NAME),

    "--bucket_folder=" + str(BUCKET_FOLDER),

    "--package_name=" + 'trainer'

    

]



inputs ={

    'display_name':JOB_NAME,

    'python_package_gcs_uri':BUCKET_URI + '/' + BUCKET_FOLDER+'/train/'+TAR_BUNDLE_NAME,

    'python_module_name':'trainer.task',

    'container_uri':TRAIN_IMAGE,

    'model_serving_container_image_uri':DEPLOY_IMAGE,

}

run_job_params = {'model_display_name':MODEL_DISPLAY_NAME,

                  'args':cmd_args,

                  'replica_count':1,

                  'base_output_dir':MODEL_OUTPUT_DIR,

                  'sync':True}

9. Submit your training job

job = dwc.train_model(training_inputs=inputs, training_type='customPythonPackage', params=run_job_params)

10. Deploy your model to GCP Vertex AI.

If you would like to use a custom prediction routine you would need to import the predictor class to your notebook and use the upload_custom_predictor function to deploy. Find more information in the README.

model_config = { 'deployed_model_display_name': DEPLOYED_MODEL_DISPLAY_NAME,

'traffic_split':{"0": 100},

'machine_type':'n1-standard-2',

'min_replica_count':1,

'max_replica_count':1,

'sync':True }



deployed_endpoint = dwc.deploy(model=job, model_config=model_config)

11. Now that the model is deployed you can invoke your GCP Endpoint (the variable data is a pandas dataframe).

params = {

    'instances': data.astype('float64').values.tolist()

}



response = dwc.predict(predict_params=params, endpoint=deployed_endpoint)

12. Finally, since we now have a working model and can run predictions, we can write our predictions results back to Datasphere for further use and analysis.

First, you’ll need to create a table

db.create_table("CREATE TABLE <table_name> (ID INTEGER PRIMARY KEY, <column_name> <datatype>, …)”)

You’ll then want to prepare your prediction results to follow the format of your create table statement, ensuring the proper data types and column names.

Once your data set is ready, you can start inserting it into your table. Based on the size of your dataset, insertion might take some time.

db.insert_into_table(‘<table_name>’, dwc_data)

You can also drop the table as so:

db.drop_table('<table_name>')

Once the table is created and the insertions are done, you will have a local table in Datasphere with the data you inserted. You can deploy this table in Datasphere, create a view, and run further analytics on it using SAP Analytics Cloud if you would like.

More information about the FedML GCP library and more examples can be found here.

In summary, FedML makes it extremely convenient for data scientists and developers to perform cross-platform ETL and train machine learning models on hyperscalers without focusing on the hassle of data replication and migration.

For more information about this topic or to ask a question, please leave a comment below or contact us at paa@sap.com.

Federated Machine Learning using SAP Datasphere & Google Cloud Vertex AI 2.0

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win