Getting started with SAP Data Intelligence and Data Science – but how?
SAP Data Intelligence (“DI”, formerly known as “SAP Data Hub”) is a very powerful but also complex solution for information processing. Some people say that SAP Data Intelligence's core capability is machine learning, others say the focus is more on information processing. Right now, SAP's strategy seems to be mainly directed at information processing. But that doesn't mean you cannot build powerful machine learning solutions on top of SAP Data Intelligence!
One major outcome of many data science projects based on SAP Data Intelligence is that a solid and quick start is an important baseline for overall project success.
For first-time DI users it is not easy to get quick results when processing data science tasks. On the one hand, this is caused by the huge number of functions and operators in DI, which often demands the developer's full attention. The second problem is often the poor quality of the source data being used. The third problem arises from the combination of the first two: what is the best way to start processing data science tasks with SAP Data Intelligence?
With this blog post I want to introduce our Ready-2-Run solution for machine learning with SAP Data Intelligence. It is predefined content for SAP Data Intelligence that can be used for your first steps and enables the project team to show first results quickly.
It is part of the official DI3 Partner content package – so please also have a look at the blog post of Matthias Kretschmer for more information.
Introduction to this solution
The goal of this machine learning scenario is to develop a model that predicts the optimal car price based on collected data. The solution covers all processing steps: importing data from a cloud data lake, developing a machine learning model in Python or R, creating a data pipeline to process the results dataset, and finally running the car price simulation in SAP Analytics Cloud. All components are ready to run, so you can apply this machine learning scenario to your own landscape in a short time.
In this blog post I will briefly describe all components of this content: the data pipeline with extended operators, the dataset, an example of machine learning model visualization in Python, and the car price simulation in SAP Analytics Cloud.
The Data Pipeline
The SAP DI data pipeline covers the data-driven workflows; each task depends on the completion of the previous ones. The automated data movement and the data transformations within the operators reflect the underlying business logic. The data pipeline itself gives an overview of the deployment process.
- The Python3 Operator produces the results dataset, the machine learning model, and the defined metrics. This operator has a Python runtime environment and contains the Python code that was previously developed in a notebook (JupyterLab).
- The metrics operator submits defined values to track model and data quality.
- The Artifact Producer persists the model, which can then be deployed as a web service for cloud-based real-time predictions.
- This operator writes a results dataset for documentation and for an additional data quality check between the development and pipeline environments.
- This operator pushes the results dataset to SAP Analytics Cloud, where it is used for price simulation and operational reporting.
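To make the first operator more concrete, here is a minimal sketch of what a script inside the Python3 Operator could look like. All port names, payloads, and metric values are assumptions for illustration; inside SAP Data Intelligence the `api` object is injected by the runtime, so the sketch stubs it to stay runnable outside DI.

```python
import json

class _StubAPI:
    """Stand-in for the `api` object that DI injects into the Python3 Operator."""
    def __init__(self):
        self.sent = []
    def send(self, port, data):
        self.sent.append((port, data))

try:
    api  # provided by the DI Python3 Operator at runtime
except NameError:
    api = _StubAPI()  # local fallback so the sketch can run standalone

def on_input(rows):
    # rows: incoming training data from the previous operator (assumed format)
    # ... train the model and compute metrics here (omitted) ...
    metrics = {"r2": 0.9}  # illustrative value only
    api.send("metrics", json.dumps(metrics))    # to the metrics operator
    api.send("modelBlob", b"serialized-model")  # to the Artifact Producer
    api.send("result", rows)                    # results dataset onward

on_input([{"price": 13495, "horsepower": 111}])
```

In a real DI graph, `on_input` would be registered as a port callback instead of being called directly, and the outgoing ports would be wired to the operators described above.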
The dataset contains different brands and car types with their associated prices. In addition to car-specific features such as fuel type, horsepower, and engine size, it also includes the insurance risk rating.
A simple statement in Python creates a table that summarizes the numerical features of the dataset. The values in the table give a first overview of the feature distributions and indicate their order of magnitude.
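Such a summary table can be produced with a single pandas statement. The following minimal sketch uses a tiny in-memory stand-in for the car dataset (column names and values assumed):

```python
import pandas as pd

# Illustrative stand-in for the car dataset (values assumed)
df = pd.DataFrame({
    "price":       [13495, 16500, 13950, 17450],
    "horsepower":  [111, 111, 154, 102],
    "engine_size": [130, 130, 152, 109],
})

# One statement: count, mean, std, min, quartiles, and max per numerical column
summary = df.describe()
print(summary)
```

In the actual content, the same call would run against the imported dataset instead of this toy DataFrame.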
The Python script
The package contains Python and R scripts. Both scripts include the same modeling steps. In this blog post you will find examples for Python only.
The Python script contains the following essential data science steps to create a machine learning model:
- Importing dataset from cloud data lake
- Exploring features with descriptive statistics and visualization
- Preparing features for model training
- Calculating statistical indicators and plotting the model to evaluate model accuracy
- Saving the machine learning model
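The steps above can be sketched end to end with scikit-learn. This is a hedged sketch, not the content's actual script: the dataset, feature names, and model choice (a plain linear regression) are assumptions for illustration.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. "Import" a tiny stand-in dataset: [horsepower, engine size] -> price
X = np.array([[111, 130], [111, 130], [154, 152], [102, 109],
              [115, 136], [160, 164], [101, 109], [121, 132]])
y = np.array([13495, 16500, 16430, 13950, 17450, 24565, 15250, 17710])

# 2./3. Prepare features: a train/test split as the simplest preparation step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 4. Train the model and evaluate it with a statistical indicator (R²)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

# 5. Save the machine learning model so the pipeline can persist it
with open("car_price_model.pkl", "wb") as f:
    pickle.dump(model, f)
```

In the DI pipeline, the saved model would be handed to the Artifact Producer rather than written to a local file.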
Below are two visualization examples created with Python.
The visualization of pairwise distributions supports finding patterns, relationships, and anomalies. All feature relationships in this visualization display an upward trend. A few outliers can be observed in the higher regions and are highlighted as dark purple points.
This linear regression plot shows that the trends for cars with gas and diesel fuel diverge with increasing predicted car price.
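A plot of this kind can be produced with seaborn's `lmplot`, which fits one regression line per category. The following sketch uses an assumed DataFrame and column names, not the content's real data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Illustrative stand-in data (values assumed)
cars = pd.DataFrame({
    "horsepower":      [68, 88, 111, 145, 68, 90, 123, 160],
    "predicted_price": [6500, 8900, 13500, 19800, 7900, 11200, 17400, 26500],
    "fuel_type":       ["gas", "gas", "gas", "gas",
                        "diesel", "diesel", "diesel", "diesel"],
})

# One regression line per fuel type; diverging slopes show the trend
g = sns.lmplot(data=cars, x="horsepower", y="predicted_price",
               hue="fuel_type", ci=None)
g.savefig("regression_by_fuel_type.png")
```

The `hue` parameter is what splits the fit by fuel type, making the diverging trends visible in a single chart.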
Car Price Simulation in SAP Analytics Cloud
An interactive simulation of the delta car price has been set up in SAP Analytics Cloud. It supports the interactive search for an optimal price with a compiled configuration.
Conclusion and “run”
SAP Data Intelligence is a very powerful tool, and predefined content makes life and your first steps much easier. Users and developers can leverage predefined content to ensure quick onboarding of all necessary resources.