Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
stojanm
Product and Topic Expert

Introduction

MLflow is an open-source platform and the de facto standard for managing and streamlining machine learning lifecycles, including experimentation, reproducibility, and deployment. It offers a centralized repository to track experiments, share projects, and collaborate effectively, which makes it a common choice among data scientists, and it can be used with most open-source machine learning frameworks (e.g. scikit-learn, TensorFlow, etc.).

Some providers offer MLflow as a managed service (e.g. Databricks) or integrate it into their platform (e.g. Azure ML). In addition, users can deploy the service manually on their platform of choice (e.g. on SAP Business Technology Platform).

Starting with version 2.13, SAP HANA Machine Learning added support for tracking experiments with the mlflow package. This makes it easy to incorporate models developed with hana-ml into an extensive MLOps pipeline.

This blog post is part of a series on using MLflow with SAP HANA Machine Learning, co-authored by @martinboeckling and @stojanm. In this first part we present a conceptual guide on how to use MLflow with SAP Datasphere and HANA Machine Learning (through the hana-ml package). The objective is to provide the reader with a high-level template for machine learning operations (MLOps) with HANA ML, specifically with MLflow. The second part of the series, Tracking HANA Machine Learning experiments with MLflow: A technical Deep Dive, provides a more technical deep dive on how to set up an MLflow instance and a general introduction to how machine learning models trained with HANA ML can be logged with MLflow.

It is important to mention that SAP offers an extensive MLOps platform for managing ML experiments, AI Core / AI Launchpad, which is out of scope for this post. For more information on AI Core, please refer to the blog post here.

Ok, let's start reviewing our example. We will work our way along a simplified machine learning pipeline, as shown below, and comment on the architectural patterns for each task.

PipelineV3.png

To simplify the use case, we will assume that the center of gravity of the data required for our model lies within SAP. This means that the majority of the data used for model training resides in an SAP application, either on-premise or in the cloud.

Data Modeling

Enterprise data landscapes are typically quite complex, with data distributed across numerous systems. So even though the majority of the data in our example comes from an SAP source, it is realistic to assume that a portion of the data used for modeling could come from another system. It is the task of a Data Engineer to connect to the data and prepare it for algorithm training (e.g. feature engineering). As shown in the picture below, SAP Datasphere can help unify data sources in a central repository, either via federation or, where not supported, via replication.

DataModeling.jpg

In addition to the data modeling features, SAP Datasphere also offers a runtime for machine learning tasks thanks to the embedded SAP HANA Cloud instance. This runtime can be utilized by Data Scientists, and since it is embedded, it allows them to perform ML without data movement and replication. This brings several benefits related to security, execution speed, business context preservation, and compliance. For more information about those benefits, check out this blog post.

Ok, let's move to the data science tasks and model training.

Model Training

During this phase, Data Scientists experiment and iteratively develop the ML model. Most data science experts have a preferred platform for ML prototyping. The SAP HANA Machine Learning Python package, called hana-ml, can be used with any Python IDE. The development environment can either be deployed manually by the Data Scientist or hosted centrally on a dedicated platform. The following blog posts show examples of how HANA Machine Learning code can be developed on different platforms: Azure ML and Databricks.

ModelTraining.jpg

Already during training and experimentation, MLflow plays an important role: it helps evaluate progress and logs the details of each experiment run for later reference. Several algorithms from the hana-ml package (e.g. the Automated* or Unified* methods) support automatic logging of key model performance indicators during training. This is seamless for the user and uses the same interface as open-source frameworks. It allows tracking hyperparameters and model performance KPIs, as well as logging training activities with usernames and timestamps for auditability. For more technical details about these features, please refer to the second part of our blog post.
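As a rough sketch, the training-with-autologging flow could look as follows. Hostname, credentials, table and column names, and the tracking URI are placeholders, and the code assumes the hana-ml (>= 2.13) and mlflow packages plus a reachable HANA Cloud instance; the imports sit inside the function so the sketch can be loaded and read without these prerequisites.

```python
def train_with_tracking():
    """Sketch: train a UnifiedClassification model in SAP HANA Cloud and
    let hana-ml log hyperparameters and KPIs to MLflow automatically."""
    import mlflow
    from hana_ml.dataframe import ConnectionContext
    from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

    # Point MLflow at a (hypothetical) tracking server, e.g. deployed on SAP BTP.
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("hana-ml-demo")

    # Placeholder connection to the HANA Cloud instance embedded in SAP Datasphere.
    conn = ConnectionContext(address="<host>", port=443,
                             user="<user>", password="<password>", encrypt=True)
    train_df = conn.table("CHURN_TRAIN")  # reference only: the data stays in HANA

    clf = UnifiedClassification(func="HybridGradientBoostingTree")
    clf.enable_mlflow_autologging()  # hana-ml >= 2.13: KPIs of the fit go to MLflow
    clf.fit(data=train_df, key="ID", label="CHURN")
    return clf
```

The autologging call is the only MLflow-specific line needed on the training side; everything else is a regular hana-ml training script.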

Model Deployment

Once a suitable model has been trained and selected, it needs to be deployed. For our example with HANA Machine Learning, we do this in two steps. In the first step, hana-ml is used to store the model artifacts in the built-in model repository of SAP HANA Cloud. In the second step, the model is exposed via an API so that it can be consumed by other applications. This can be achieved in several ways, but a lean approach is to use a Flask application deployed, for example, on SAP Business Technology Platform. For details on this process, please refer to this blog post.
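To illustrate the shape of such a service, here is a minimal sketch of the scoring logic a Flask endpoint might wrap. The model call is stubbed out: in the real application it would load the model from the SAP HANA Cloud model repository via hana-ml and run the prediction in-database. Function names and the payload layout are illustrative, not from the original post.

```python
import json


def predict_stub(features):
    """Placeholder for the real call: in the deployed app this would use
    hana-ml's model storage to load the model from SAP HANA Cloud and
    execute the prediction inside the database."""
    return {"churn_probability": 0.5}


def handle_score_request(body: str) -> str:
    """Core logic of a hypothetical POST /score endpoint: validate the
    JSON payload, score it, and return the prediction as JSON."""
    payload = json.loads(body)
    if "features" not in payload:
        return json.dumps({"error": "missing 'features'"})
    prediction = predict_stub(payload["features"])
    return json.dumps({"prediction": prediction})
```

In the Flask application, `handle_score_request` would simply be called from the route handler with the request body, keeping the scoring logic testable independently of the web framework.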

ModelDeployment.jpg

Model Performance Tracking

In addition to tracking experiments while training the model, tracking the model's performance after deployment is also important, e.g. to monitor prediction quality and to detect effects like data drift. Some information about these concepts can be found in this blog post.

In our example we achieve the monitoring of the model during operations as follows: since our model is deployed and exposed via a Flask application, as proposed above, we use the mlflow package in the application code to log incoming data as well as predictions. This allows us to run validation tests (comparing actuals vs. predictions) once validation data becomes available.
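The log-then-validate idea can be sketched in plain Python as follows. The MLflow side is only indicated in comments: in the deployed application each record and the resulting metric would be written to the tracking server (e.g. with mlflow.log_metric inside an active run), and the class and metric here are illustrative.

```python
class PredictionLog:
    """Sketch of post-deployment monitoring: record each scored request so
    that, once ground truth arrives, actuals can be compared against
    predictions. In the deployed Flask app each record and the computed
    metric would additionally be logged to MLflow."""

    def __init__(self):
        self.records = []  # list of (request_id, prediction)

    def log(self, request_id, prediction):
        self.records.append((request_id, prediction))

    def validate(self, actuals):
        """Given {request_id: actual_value}, return the mean absolute
        deviation between predictions and actuals - a simple proxy for
        prediction quality after deployment."""
        pairs = [(pred, actuals[rid]) for rid, pred in self.records if rid in actuals]
        if not pairs:
            return None  # no ground truth available yet
        return sum(abs(p - a) for p, a in pairs) / len(pairs)
```

The validation metric produced here is exactly the kind of signal the retraining logic in the next section can act upon.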

ModelPerformanceTracking.jpg

Let's now review how to detect deterioration in model performance and how to perform retraining.

Model Re-Training

Model retraining can either be scheduled or triggered based on a condition or an event. There are several ways to achieve this, including automation flows such as SAP Build Process Automation, Airflow (https://airflow.apache.org/) or Kubeflow (https://www.kubeflow.org/), or simple helper applications deployed by the user.

For the sake of simplicity, in our case we use a simple application deployed on SAP Business Technology Platform (e.g. on Cloud Foundry), which can schedule model retraining runs (e.g. when new training data becomes available regularly). The same application can periodically check the model performance via the MLflow APIs, as described in the previous section, and if performance deteriorates (e.g. there is a high deviation between predictions and actuals), a new retraining run can be triggered. As discussed earlier, during the new training run the model parameters are logged via hana-ml and MLflow, and the model is updated in the model repository in SAP HANA Cloud.
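The decision logic of such a helper application might look roughly like this. The function name, the tolerance value, and the way metrics are obtained are assumptions; in the real setup the recent error values would be fetched from the MLflow tracking server (e.g. via MlflowClient's metric history API) rather than passed in as a list.

```python
def should_retrain(recent_errors, baseline_error, tolerance=0.10):
    """Sketch of the periodic check run by the helper application:
    compare the average of recent post-deployment errors against the
    error observed at training time, and flag retraining when the
    degradation exceeds a chosen tolerance (10% here, illustrative)."""
    if not recent_errors:
        return False  # no validation data yet, nothing to decide
    avg_error = sum(recent_errors) / len(recent_errors)
    return avg_error > baseline_error * (1 + tolerance)
```

A scheduler (e.g. a cron-style job on Cloud Foundry) would call this check periodically and, when it returns True, kick off a new hana-ml training run whose results are again logged to MLflow.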

ModelRetraining.jpg

This closes the cycle for the example ML pipeline from the first section. Let's put all the pieces together to see the big picture.

Key takeaways

Overview.jpg

In this blog post we showcased a conceptual and architectural blueprint for realizing MLOps pipelines using SAP HANA Machine Learning and the open-source framework MLflow. We discussed the end-to-end process and the advantages of integrating these tools to streamline the machine learning lifecycle, especially with regard to model lifecycle management. For the technical details and example code for the steps described here, please refer to the second part of the blog series here. Happy reading!