Technical Articles
SAP Data Intelligence: Create your first AutoML Scenario
SAP Data Intelligence 1911 provides the newly released AutoML feature. AutoML provides Citizen Data Scientists, business users and LOB users with a full graphical environment and tools to perform experiments using their datasets and gain answers to their business questions through machine learning models. AutoML automates a number of steps in the machine learning process: data preparation, feature selection, model choice and model parameter optimisation. Users provide the training and test data, select the outcome they want predicted and define the amount of time the training should run for.
AutoML then performs the steps required to process the data and train it against a number of machine learning models, each resulting in a pipeline that can be chosen for deployment. When the user decides which of the trained models to deploy, it can then be called via a REST API offered by the AutoML cockpit.
This post shows how to get started with AutoML and perform an end-to-end scenario using the tools provided. For this tutorial we’ll use the well-known Iris dataset, and using AutoML we will predict the species based on the other input columns. We’ll show how to create the necessary data collections using the newly provided ML Data Manager, upload datasets to a collection and use the ML Data Manager to extract features from a dataset. Once we have the required data in place for training and testing, we are ready to create the machine learning scenario in AutoML, create an execution and then monitor the training of the model using the AutoML cockpit. Finally we deploy the trained model and test it via a REST call.
Table of Contents
- Authorization/Policies
- ML Data Manager
- Create Data Workspace
- Create Data Collections
- Upload Data
- Extract features from Dataset
- AutoML
- Create ML Scenario
- Create Execution
- Select Pipeline/Model
- Deploy Pipeline
- Testing the Inference Pipeline
Authorization/Policies
To proceed with the scenario, your user needs the following Policies in the Data Intelligence system:
- sap.dh.metadata – to work with and upload files to the Metadata Explorer
- sap.dh.developer – to work with the Pipeline Modeler
You or your system administrator can assign these policies to your user using the System Management app from the Launchpad. Within System Management, select the User tab on the top ribbon, select your user and then the + icon to assign the above policies to your user.
ML Data Manager
The SAP Data Intelligence Launchpad has two new apps: AutoML and the ML Data Manager. We’ll start by using the ML Data Manager.
Ensure that you have saved the Iris dataset on your local machine. Split the dataset into two files on an 80/20 split: the larger file is used for training and the smaller one for testing (a minimal split sketch follows below). Let’s start by creating the Data Workspace in the ML Data Manager. The ML Data Manager allows datasets to be arranged into workspaces and collections for easier access to frequently used datasets whenever required.
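If you don’t already have the dataset split, here is a minimal sketch using pandas and scikit-learn (assumed to be installed locally; the file names iris.csv, iris_train.csv and iris_test.csv are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

# Columns expected: sepal_length, sepal_width, petal_length, petal_width, species
df = pd.read_csv("iris.csv")

# Stratify on species so both files keep the same class balance
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["species"], random_state=42)

train_df.to_csv("iris_train.csv", index=False)
test_df.to_csv("iris_test.csv", index=False)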
In the ML Data Manager app, select Create + to create a new Data Workspace, fill in a name IrisDataset and description Iris Dataset for AutoML Scenario.
Next, let’s create the Collections into which the datasets will be uploaded. Select Create + to create the training collection, name it iris_train and provide a description like Training Data for Iris Dataset.
Once created, in the Files and Folders tab in the collection, select the Metadata Explorer icon to navigate to the location where the dataset will be uploaded.
A new browser tab opens the Metadata Explorer; this location in the data lake is where you upload the training dataset. Select the upload icon, find your training dataset in CSV format and upload it.
Navigate back to the browser tab where you created the training collection and select the Features List tab.
In the File input, select your dataset and ensure that the Feature and Type columns are correctly categorised. Finally, select Save in the Features List tab.
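If you want to sanity-check the column names and types locally before relying on the automatically categorised Features List, a small pandas sketch (the file name is illustrative):

import pandas as pd

train_df = pd.read_csv("iris_train.csv")
print(train_df.dtypes)        # measurement columns should be numeric, species should be a string/object column
print(train_df.isna().sum())  # confirm there are no missing values before training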
Still in the tab where you created the training collection, repeat the steps to create the test collection and upload the test dataset to the data lake. Name your test collection something like iris_test and give it a description of Test Data for Iris Dataset. Also repeat the steps to extract features from the dataset in the Features List tab, as you did with the training dataset.
You have now created the objects your AutoML scenario will require to train and test with. Your completed Data Collections with extracted features would look similar to this.
AutoML
Using the navigation menu in the top left, open the AutoML app so that we can create the ML Scenario.
In AutoML select the Create label to create a new scenario. Name the scenario Iris_Species and enter a Business Question of Identify Species from Iris Dataset.
Let’s create a new Execution by selecting the Create label. The Create Execution wizard opens, and you start by selecting the training and test datasets you uploaded earlier under the ML Data Manager Workspace –> Collection. Select the Step 2 button when your datasets are correctly selected.
Next we select an Experiment Target; this is the column you want your model to predict. In the dropdown, select species, as we want to predict the species based on the measurement columns. For Metric, select the Log Loss option, a classification metric based on predicted probabilities. You’ll see the Task label has now switched to Classification. Select Step 3 to continue to budget selection.
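For intuition on Log Loss (lower is better; confident wrong predictions are penalised heavily), here is a small illustrative calculation with scikit-learn, independent of AutoML itself. The labels and probabilities below are made up for the example:

from sklearn.metrics import log_loss

# True species of three test samples
y_true = ["setosa", "versicolor", "virginica"]

# Predicted probabilities for [setosa, versicolor, virginica]
y_prob = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.30, 0.60],
]

print(log_loss(y_true, y_prob, labels=["setosa", "versicolor", "virginica"]))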
Here we define how long we want to train and test the model for. For this simple scenario we can leave the default of 30 minutes. Select Review to view your inputs for the Execution.
Finally select Create to start the Execution.
The Execution will then run for the allocated amount of time. You can keep track of the progress and the models being trained by opening your specific Execution in the AutoML app.
The Summary shows you key details about your Execution such as Status, Budget and Metric.
The Cockpit shows you the Performance Progression, plotting the metric against time in real time. To the right of the Performance Progression, you start to see pipelines created based on the trained models. You can navigate into these pipelines, view further details and ultimately deploy them for inference.
Still in the AutoML app, within your Execution, the Leaderboard shows the pipelines with more details. For this example let’s select the Naïve Bayes Classifier Pipeline.
Here you see the pipeline steps and details, as well as the Hyperparameters in the Feature section. In the top right, select the Deploy button to deploy this pipeline so that we can call the trained model. After some time, you’ll see a Deployment URL in the top section. This is the URL we can use to call the model.
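For intuition on what this pipeline family does, here is a minimal local sketch of a Gaussian Naive Bayes classifier on the same Iris features using scikit-learn. It is purely illustrative; the AutoML pipeline adds its own preprocessing and hyperparameter tuning on top of a model of this kind:

import pandas as pd
from sklearn.naive_bayes import GaussianNB

train_df = pd.read_csv("iris_train.csv")
test_df = pd.read_csv("iris_test.csv")
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

model = GaussianNB().fit(train_df[features], train_df["species"])
print(model.score(test_df[features], test_df["species"]))  # simple accuracy on the test split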
At this point you can also navigate to the ML Scenario Manager, where you’ll see that the scenario we created in AutoML exists too; select the scenario.
In the pipeline section you see we have two pipelines, one for training and the other for inference. If you are viewing this before your allocated budget of 30 minutes has completed, you will also see a currently running execution.
Once the 30 minutes have completed, the execution still shows here and in AutoML, but it is no longer running. In the Deployment section we should have a deployment visible; select it to navigate to the details. Here you’ll see the Status, the Flow, the full Deployment URL and additional details.
Let’s test the URL by copying it and then opening a test client such as Postman.
In Postman, add /predict to the end of the Deployment URL and select the POST method.
In the Authorization tab, select Basic Auth and add your username and password for Data Intelligence. Use the format tenant\username for the username.
In the Headers tab, enter the key “X-Requested-With” with value “XMLHttpRequest”.
Pass the following input data to the REST API. Select the Body tab, choose raw and enter the following JSON syntax:
{
  "sepal_length": 7,
  "sepal_width": 3.2,
  "petal_length": 4.7,
  "petal_width": 1.4
}
Select Send and the prediction result is returned from SAP Data Intelligence.
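If you prefer to script the call rather than use Postman, a minimal sketch with the Python requests library is shown below. The deployment URL, tenant, username and password are placeholders that you need to replace with your own values:

import requests

# Replace with your full Deployment URL, with /predict appended
url = "https://<your-data-intelligence-host>/<deployment-path>/predict"

payload = {
    "sepal_length": 7,
    "sepal_width": 3.2,
    "petal_length": 4.7,
    "petal_width": 1.4,
}

response = requests.post(
    url,
    json=payload,
    auth=("<tenant>\\<username>", "<password>"),      # Basic Auth in tenant\username format
    headers={"X-Requested-With": "XMLHttpRequest"},   # same header as set in Postman
)

print(response.status_code)
print(response.text)  # prediction result returned by SAP Data Intelligence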
Summary
You created your first AutoML scenario. You started by creating the required Workspace and Collections in the ML Data Manager app and uploaded the required datasets to the data lake for use in model training and testing. You then created the AutoML scenario and an Execution, chose a model and pipeline, and deployed that pipeline for inference. Using a REST API client, you were able to call the deployed model from outside of SAP Data Intelligence.
AutoML in SAP Data Intelligence brings the power of Machine Learning into the hands of citizen Data Scientists. For further learning, you could now follow up with more complex datasets, as well as deploying and calling some of the other classification models created for your scenario.
Excellent, thank you, Phillip!
Excellent blog, Philip! Thanks for helping us understand AutoML better!
Great instruction and screenshots Phillip! Here is a video demo using SAP Data Intelligence Cloud which follows this process and includes some basic data preparation activities. https://sapvideoa35699dc5.hana.ondemand.com/?entry_id=1_sfv66cpl
Hi Philipp. Well done. Simple, easy to follow. If you want, you can add some more slides on why the relationship between sepal length and sepal width, and between petal length and petal width, helps to identify which type of iris you have. I have a nice slide deck (2 slides) which explains it.
Hey Matthias, hope you are keeping well. Of course, please send it over and I'll update the post.
Hi Phillip,
I’ve seen that perhaps some additional policies (authorizations) for the user running your AutoML scenario might be required.
Go to SAP DI “System Management” (from the SAP DI launchpad), then to “User”, click on the “arrow on the right” to “view the details” of your user. Here you can “Add” new policies to the user.
For the upload of files (for example, the Iris dataset from your local machine) to your “Data Workspace” you need the policy “sap.dh.metadata”. Later, for the step to "Select" the dataset in the "Feature List", you also need the policy “sap.dh.developer”. So, add these policies as well.
I’ll send you the screenshots so you can add them to the blog as well.
Hi Phillip,
Thank you for this great blog!
I have a small question. When I test the inference result in Postman, if I use a record that is in my test dataset, it works well and gives me a correct result. But if I use some other input data, I get the response: "502 Bad Gateway - The server was acting as a gateway or proxy and received an invalid response from the upstream server."
Does this mean the algorithm "doesn't know" what kind of Iris this is and therefore returns an error?
It feels like it can only take data within a certain range. If that is the case, how can we make the system respond "I don't know" or "This may not be an Iris" rather than a 502 error code?