
Recently, we introduced the Metaflow library for SAP AI Core, which extends Metaflow’s capabilities to run ML training pipelines as Argo Workflows on SAP AI Core. If you missed it, check out this blog post. 

In this blog post, we want to cover another popular open-source tool: Kubeflow. Kubeflow enables users to create and orchestrate machine learning workflows on Kubernetes clusters, among many other functionalities. Like SAP AI Core, Kubeflow uses Argo Workflows as its underlying, powerful workflow orchestration engine. 

You will learn how to create a simple pipeline with Kubeflow and how to transform it so that it can be executed on SAP AI Core. As both Kubeflow and SAP AI Core are built on top of Argo Workflows, only a few changes are needed to run the Kubeflow pipeline on SAP AI Core. As a practical example, we will use the California housing dataset to train a simple linear regression model that predicts housing prices. 

To follow along, you should have a basic understanding of SAP AI Core and Kubeflow.
The code shown in this post can be found here.


Create a simple Kubeflow pipeline 

Our pipeline will have two steps:  

  1. Download the California housing dataset from a URL and store it as an artifact

  2. Use the stored dataset to train a linear regression model and make predictions which are also stored as an artifact  


Using the Kubeflow Python SDK kfp, we create Python functions for the two steps of our pipeline. The first function is called make_step_download. It uses pandas to read a CSV file from a URL and save it as an artifact. With the parameter output_csv: comp.OutputPath('CSV'), we declare that this function outputs a CSV file. 

The second function receives a CSV file as input (input_csv: comp.InputPath('CSV')) and again outputs a CSV file with predictions (output_csv: comp.OutputPath('CSV')). As before, all necessary libraries are imported inside the function itself, because each step later runs in its own container. We then train a linear regression model on the dataset and save a CSV file with the predictions.  

After defining these functions, we use kfp.components.create_component_from_func to make them available as steps for our Kubeflow pipeline. Here we also define the base Docker image and which Python libraries should be installed for each step.  

Afterwards, we create a pipeline function that uses the previously defined steps and specifies that the output of the first step is the input of the second step. This pipeline function can now be compiled into an Argo workflow YAML file with kfp.compiler.Compiler().  

This YAML file can be uploaded to a Kubeflow cluster to define a pipeline which can then be started. 

 
import kfp 
import kfp.components as comp

kfp.__version__ # '1.8.9'


def make_step_download(output_csv: comp.OutputPath('CSV')):
    import pandas as pd
    url = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
    df = pd.read_csv(url)
    df.fillna(0,inplace=True)
    df.to_csv(output_csv, index=False)
    return None
     
def make_step_train(input_csv: comp.InputPath('CSV'), output_csv: comp.OutputPath('CSV')):
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    import pandas as pd
     
    # load data as dataframe
    df = pd.read_csv(input_csv)
    X = df[['total_rooms','households', 'latitude','longitude','total_bedrooms','population','median_income']]
    y_target = df['median_house_value']

    # train Linear Regression model
    X_train, X_test, y_train, y_test = train_test_split(X, y_target, test_size=0.3)
    model = LinearRegression()
    model.fit(X_train, y_train)
     
    # make predictions with trained model
    y_predict = model.predict(X_test)
    rmse = mean_squared_error(y_predict, y_test, squared=False)
    print("RMSE = ",rmse)
     
    # save predictions
    df = pd.DataFrame(y_predict)
    df.to_csv(output_csv)
    return None


step_download = kfp.components.create_component_from_func(
    func=make_step_download,
    base_image='python:3.7',
    packages_to_install=['pandas==1.1.4'])

step_train = kfp.components.create_component_from_func(
    func=make_step_train,
    base_image='python:3.7',
    packages_to_install=['pandas==1.1.4','scikit-learn'])

def my_pipeline():
    download_step = step_download()
    train_step = step_train(download_step.outputs['output_csv'])
     
kfp.compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path='pipeline.yaml')
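
As mentioned above, the compiled pipeline.yaml can be uploaded to a Kubeflow cluster and started from there. This can also be done with the kfp client; the following is a minimal sketch, assuming kfp 1.x and a reachable Kubeflow Pipelines endpoint (the host URL is a placeholder):

 
import kfp

# connect to the Kubeflow Pipelines API (placeholder URL)
client = kfp.Client(host='http://<your-kubeflow-host>/pipeline')

# register the compiled pipeline under a name ...
client.upload_pipeline(
    pipeline_package_path='pipeline.yaml',
    pipeline_name='ca-housing-linreg')

# ... or start a one-off run directly from the compiled package
client.create_run_from_pipeline_package(
    pipeline_file='pipeline.yaml',
    arguments={})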

 

Transform the Kubeflow pipeline  

We will now have a look at what is needed to run a pipeline created with the Kubeflow SDK on SAP AI Core. As mentioned before, both Kubeflow and SAP AI Core are built on top of Argo Workflows, so only a few changes are required. 

The major difference is that we build our Docker images beforehand and push them to a container registry, such as Docker Hub. The reason is that a Kubeflow pipeline starts from a base Docker image, such as python:3.7, and installs the necessary Python libraries once the container has started. However, granting the privileges to install packages at runtime can be a stability and security risk in production. SAP AI Core therefore requires a Docker image that already includes the necessary packages. 

 
# Start with an already created docker image as base image 
step_download = kfp.components.create_component_from_func(
    func=make_step_download,
    base_image='docker.io/flxschneider/text-train:0.0.1')

step_train = kfp.components.create_component_from_func(
    func=make_step_train,
    base_image='docker.io/flxschneider/text-train:0.0.1')

def my_pipeline():
    download_step = step_download()
    train_step = step_train(download_step.outputs['output_csv'])
     
kfp.compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path='pipeline.yaml')

 

In the following, we explain the other, smaller changes that must be made to the YAML file so that it works with SAP AI Core. The required changes are shown in the following snippets.
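
The snippets below are shown as Python dictionaries, which is the structure you get when loading the compiled YAML with PyYAML. A small sketch to inspect the compiled pipeline.yaml this way (assuming PyYAML is installed):

 
import yaml

# load the compiled Argo workflow as a plain Python dict
with open('pipeline.yaml') as f:
    workflow = yaml.safe_load(f)

print(workflow['kind'])                    # 'Workflow'
print(workflow['metadata'].keys())         # generateName, annotations, labels
print(len(workflow['spec']['templates']))  # one template per step plus the pipeline template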

Define as WorkflowTemplate and add name 

The beginning of the previously compiled pipeline.yaml file looks like this: 

 
{'apiVersion': 'argoproj.io/v1alpha1', 
 'kind': 'Workflow',
 'metadata': {'generateName': 'my-pipeline-',
  'annotations':
{'pipelines.kubeflow.org/kfp_sdk_version': '1.8.9',
    'pipelines.kubeflow.org/pipeline_compilation_time': '2022-01-18T12:57:16.488742',
    'pipelines.kubeflow.org/pipeline_spec': '{"name": "My pipeline"}'},
  'labels':
{'pipelines.kubeflow.org/kfp_sdk_version': '1.8.9'}},

 

There are a few changes we must make here manually. First, we must change the kind from Workflow to WorkflowTemplate and, second, add a name for the workflow. 

 
{'apiVersion': 'argoproj.io/v1alpha1',
 'kind': 'WorkflowTemplate',
 'metadata': {'name': 'ca-housing-linearregression',
  ...
}}

 

Add annotations and labels 

Furthermore, we must add SAP AI Core specific annotations and labels. These annotations are important because SAP AI Core uses them to map the workflow to the corresponding scenario. We also need to define what kind of output our pipeline produces, such as a dataset or a model. The Kubeflow-specific labels can be left as-is, as SAP AI Core will ignore them. 

 
{'apiVersion': 'argoproj.io/v1alpha1', 
 'kind':'WorkflowTemplate',
 'metadata': {'generateName': 'my-pipeline-',
  'annotations':
{'pipelines.kubeflow.org/kfp_sdk_version': '1.8.9',
    'pipelines.kubeflow.org/pipeline_compilation_time': '2022-01-24T11:25:46.545733',
    'pipelines.kubeflow.org/pipeline_spec': '{"name": "My pipeline"}',
    'scenarios.ai.sap.com/description': 'CA Housing linear regression',
    'scenarios.ai.sap.com/name': 'ca-housing-train-scenario',
    'executables.ai.sap.com/description': 'CA Housing linear regression',
    'executables.ai.sap.com/name': 'ca-housing-linreg',
    'artifacts.ai.sap.com/make-step-download-output_csv.kind': 'dataset',
    'artifacts.ai.sap.com/make-step-train-output_csv.kind': 'dataset'},
  'labels':
{'pipelines.kubeflow.org/kfp_sdk_version': '1.8.9',
    'scenarios.ai.sap.com/id': 'ca-housing',
    'executables.ai.sap.com/id': 'ca-housing-linreg',
    'ai.sap.com/version': '1.0.2'},
'name': 'ca-housing-linreg10'}

 

Add docker registry secret 

Because SAP AI Core pulls the Docker images from the Docker registry that was registered during the setup of SAP AI Core, we must specify which Docker secret should be used in case the Docker image is not publicly accessible: 
'spec': {'imagePullSecrets': [{'name': 'docker-registry-secret'}]}} 

Furthermore, we must add a globalName to the output artifacts that were created by our pipeline:  
{'name': 'make-step-download-output_csv', 
 'path': '/tmp/outputs/output_csv/data',
 'archive': {'none': {}},
 'globalName': 'make-step-download-output_csv'}

 

Define template for pipeline steps 

Finally, we must define the order of the pipeline steps in case the pipeline contains multiple steps. We add a template with the name my-pipeline, which also serves as the entry point, and define the steps the pipeline should perform: 
{'name': 'my-pipeline', 
  'steps': [{'name': 'make-step-download',
            'template': 'make-step-download'},
           {'name': 'make-step-train',
            'template': 'make-step-train'}]}

Note that if another template with the same name is already defined as the entry point, we must remove it from the workflow file. 

This YAML file can now be uploaded to a Git repository that is connected to your SAP AI Core instance. After about three minutes, the workflow should be synced with SAP AI Core and become visible in SAP AI Launchpad. From there, you can start an execution in SAP AI Core. 
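
An execution can be started from SAP AI Launchpad or programmatically. The following is a rough sketch using the ai-core-sdk Python package; the credential values are placeholders, the configuration name is made up for this example, and the exact parameters should be checked against the SDK documentation:

 
from ai_core_sdk.ai_core_v2_client import AICoreV2Client

# credentials come from the SAP AI Core service key (placeholders)
client = AICoreV2Client(
    base_url='<AI_API_URL>/v2',
    auth_url='<AUTH_URL>/oauth/token',
    client_id='<CLIENT_ID>',
    client_secret='<CLIENT_SECRET>',
    resource_group='default')

# create a configuration for the synced scenario and executable
configuration = client.configuration.create(
    name='ca-housing-config',
    scenario_id='ca-housing',
    executable_id='ca-housing-linreg')

# start an execution based on this configuration
execution = client.execution.create(configuration_id=configuration.id)
print(execution.id)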

In summary, when we build a Docker image beforehand and make the small required changes to the YAML file, we can leverage the Kubeflow SDK and its advantages to create workflows for SAP AI Core. Especially handy is the Kubeflow SDK's handling of inputs and outputs between the different steps of a pipeline. 

Automate the transformation 

Many of the shown steps are easy to make, for example adding the SAP AI Core specific annotations, changing the kind to WorkflowTemplate, or adding the Docker secret. Most of the described steps can therefore be performed by a script that transforms the Kubeflow workflow into a workflow that can be executed on SAP AI Core. For the workflow from our example above, a Python script can be found in this repository. 
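
The script in the repository covers the full transformation for this example. As an illustration only, a simplified sketch of such a script could look like the following (assuming PyYAML, the pipeline.yaml compiled above, and the scenario and executable names used in this post; the output file name is arbitrary):

 
import yaml

# load the Kubeflow-compiled Argo workflow
with open('pipeline.yaml') as f:
    wf = yaml.safe_load(f)

# 1. turn the Workflow into a WorkflowTemplate and give it a name
wf['kind'] = 'WorkflowTemplate'
wf['metadata']['name'] = 'ca-housing-linreg'

# 2. add the SAP AI Core specific annotations and labels
wf['metadata']['annotations'].update({
    'scenarios.ai.sap.com/description': 'CA Housing linear regression',
    'scenarios.ai.sap.com/name': 'ca-housing-train-scenario',
    'executables.ai.sap.com/description': 'CA Housing linear regression',
    'executables.ai.sap.com/name': 'ca-housing-linreg'})
wf['metadata']['labels'].update({
    'scenarios.ai.sap.com/id': 'ca-housing',
    'executables.ai.sap.com/id': 'ca-housing-linreg',
    'ai.sap.com/version': '1.0.2'})

# 3. reference the Docker registry secret registered in SAP AI Core
wf['spec']['imagePullSecrets'] = [{'name': 'docker-registry-secret'}]

# 4. declare the step outputs as global artifacts and annotate their kind
for template in wf['spec']['templates']:
    if 'container' not in template:
        continue  # only touch the container templates of the two steps
    for artifact in template.get('outputs', {}).get('artifacts', []):
        artifact['globalName'] = artifact['name']
        key = 'artifacts.ai.sap.com/{}.kind'.format(artifact['name'])
        wf['metadata']['annotations'][key] = 'dataset'

# 5. the entry-point template still has to be replaced by the 'steps'
#    template shown above; see the repository for the complete script

with open('workflow-ai-core.yaml', 'w') as f:
    yaml.dump(wf, f, default_flow_style=False)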

In just a few lines, the necessary changes can be made to the YAML file, showing that you can automate the move from development and experimentation with Kubeflow to production with SAP AI Core. 

Conclusion 

We have seen that with only a few changes we can transform a Kubeflow workflow so that it can be executed on SAP AI Core, and we have demonstrated how to automate the transformation. Consequently, you can leverage the benefits of Kubeflow during development and bridge the gap to production with SAP AI Core easily. 

Let us know your feedback and thoughts in the comments below. 

_______________________________________________________________________________ 

For more information on SAP AI Core: