SAP HANA Machine Learning Resources
Last update: 12.01.2022
This blog post aims to provide a comprehensive overview of the most recent content. It’s meant to be a living document, so we’ll try to keep it as up to date as possible.
The links are structured as follows:
- Getting started – Your first steps with SAP HANA Machine Learning
- Machine Learning Operationalization – How to bring Machine Learning to life in real-world scenarios
- Deep-dive sections on:
  - Classification
  - Outlier and Anomaly Detection
  - Time Series Analysis
  - Text Mining
- External Machine Learning – How to integrate with your external R or TensorFlow servers
- Miscellaneous – All other relevant materials.
Getting started
Denys van Kempen provides a broad overview of materials to get started with SAP HANA Machine Learning in this blog. You will find links to the documentation, recent demo videos as well as code samples in this collection:
With our simple SQL interface, developers and data scientists can easily work with all the features of the SAP HANA database and integrate them with any other SQL-based solution. SQL is considered the third most popular language for Machine Learning, but we are aware that two scripting languages are even more popular with data scientists: Python and R.
To get started with the Python and R APIs, you might want to take a look at our initial release blog post by Arun Godwin Patel:
Arun also wrote two great articles on the power of the SAP HANA dataframe object that is introduced with the APIs. It builds the foundation of all the Machine Learning features in the library, so understanding it is important for all the following articles:
- Diving into the HANA DataFrame: Python Integration – Part 1
- Diving into the HANA DataFrame: Python Integration – Part 2
Last but not least, he introduces us to the Exploratory Data Analysis feature. This toolset allows us to visually explore data using different graphics and charts, while benefiting from the remote execution of aggregation statements.
Jeremy Yu builds on this, explaining in detail how to handle Descriptive Analytics and Statistics for Data Science needs with SAP HANA Cloud in these two articles:
- Easy Descriptive Statistics with Python and SAP HANA Cloud
- Automated Descriptive Statistics Report with SAP HANA Cloud and Python
Are you looking for the ultimate step-by-step guide on how to get started with Python and Machine Learning in SAP HANA Cloud? Well, Andreas Forster has you covered with this article:
Too much Python around here so far from your perspective? Perfect timing to spotlight our R API!
Yannick Shaper prepared two articles on the basics of working with R and SAP HANA, and the integration of R-based SAP HANA Machine Learning within a Shiny App:
- Hands-On Tutorial: Becoming the Chief Data Cook with RStudio and SAP HANA
- Hands-On Tutorial: Leverage SAP HANA embedded Machine Learning through an R Shiny App
After taking the first steps into Machine Learning models, you might want to bring them to life in a production scenario. But before doing that, make sure the model meets your quality criteria!
Raymond Yao has prepared a great example of how to use the new Model Report for that. It assists in understanding and debriefing a trained model by displaying model statistics, variable importance and standard metrics.
Machine Learning Operationalization
Maintenance of the Machine Learning model lifecycle and especially versioning of different states of a model is an important part of making Machine Learning enterprise ready. This article from Xin Chen explains the details of how to set up a model storage in SAP HANA using the Python client for Machine Learning:
For managing and orchestrating large Data Science and Machine Learning architectures, SAP Data Intelligence comes into play. In this article, Andreas Forster explains how to leverage the Python client for SAP HANA Machine Learning with the Jupyter Notebook operator in Data Intelligence:
To make Machine Learning part of a complex data processing workflow, you can include SAP HANA Machine Learning in your SAP Data Intelligence pipelines. The following article from Andreas gives you all the details to get started:
Your data resides in SAP Data Warehouse Cloud together with your business reports, but you don’t want to miss the power of SAP HANA Machine Learning? We’ve got you covered! Learn how to leverage SAP HANA Machine Learning from DWC:
Andreas Forster applies our integration with SAP DWC to bring SAP HANA Machine Learning features to the SAP Marketing Cloud:
A more detailed tutorial on how to leverage the embedded Machine Learning in SAP Data Warehouse Cloud is provided by Christiano Hage in this article:
Your scenario requires SAP HANA Machine Learning on data stored in SAP HANA, but the predictions need to be executed in an independent environment? Deploying the JavaScript export of your APL model might be an option. Learn from Andreas how to do this in this article:
One of the most common scenarios for SAP HANA Machine Learning is the implementation in the context of ABAP-based SAP applications, like SAP S/4HANA or SAP BW/4HANA.
Jerome Zhao showcases how to call SAP HANA Machine Learning functions from ABAP in this article:
To provide a more sophisticated integration, especially with SAP S/4HANA, the Intelligent Scenario Lifecycle Management (ISLM) was introduced. It orchestrates all Machine Learning activities, such as the creation of scenarios and models as well as the training, deployment and activation of those Machine Learning models.
Venkata Raghu Banda has prepared a comprehensive collection of materials on ISLM in this blog post:
Some of the highlights of his collection are:
- Embedding Machine Learning into SAP S/4HANA
- Leveraging Machine Learning with the ISLM Framework
- Migration of PAi to ISLM with the embedded ML
You may also check our overview pages on Intelligent ERP and ISLM to get the latest updates:
In case you are running an SAP BW/4HANA system and are interested in leveraging Machine Learning with the data stored in your Data Warehouse, we have you covered in this overview on how that could be done:
Deep-dive sections
Let’s take a closer look into some of the most important scenarios for SAP HANA Machine Learning.
Regression
Andreas Forster created a nice demo on the use of regression techniques to predict used car prices, using the Python API.
Classification
Kurt Holst contributed a series of three blog posts focused on a classification scenario. He demonstrates the end-to-end implementation making use of the R API, as well as how to evaluate the business value of a model created that way:
- Machine Learning with SAP HANA – from R
- Machine Learning with SAP HANA – with R API Part 2
- Machine Learning with SAP HANA & R – Evaluate the Business Value
Dmitry has shared some ideas on how to derive advanced feature importance values from Hybrid Gradient Boosting Trees for classification scenarios in this post:
Outlier and Anomaly Detection
Likun Hou has prepared four blog entries on different techniques for outlier and anomaly detection.
The first article demonstrates the usage of Statistical Tests (Variance Test and IQR Test) for Outlier Detection. Likun shows that the IQR test is the more robust outlier detection method when the targeted numerical feature contains values that deviate extremely from the mean or median. However, both methods only work on 1-dimensional numerical data, so they are mainly applicable to outliers with at least one outstanding numerical feature value.
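The difference in robustness between the two tests can be illustrated with a small sketch. This is plain Python using only the standard library, not the PAL implementation; the function names, the 3-sigma threshold, the 1.5 IQR multiplier and the sample data are assumptions chosen for illustration.

```python
import statistics

def variance_test(values, sigma_num=3.0):
    """Flag values farther than sigma_num standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [abs(v - mean) > sigma_num * std for v in values]

def iqr_test(values, multiplier=1.5):
    """Flag values outside [Q1 - multiplier*IQR, Q3 + multiplier*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v < lower or v > upper for v in values]

# 18 ordinary readings plus a moderate outlier (60) and an extreme one (1000).
data = [10, 11, 9, 10, 12, 11, 10, 9, 10, 11,
        9, 10, 12, 11, 10, 9, 10, 11, 60, 1000]

variance_outliers = [v for v, flag in zip(data, variance_test(data)) if flag]
iqr_outliers = [v for v, flag in zip(data, iqr_test(data)) if flag]

print("variance test:", variance_outliers)
print("IQR test:     ", iqr_outliers)
```

In this sample the extreme value inflates the mean and standard deviation so much that the variance test no longer flags 60, while the quartile-based bounds of the IQR test are barely affected by it.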
The second option is to use the DBSCAN clustering algorithm to perform Outlier Detection. Unlike the statistical test approach above, all feature values can be taken into account if an appropriate distance metric is adopted.
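To show the idea behind clustering-based outlier detection, here is a minimal DBSCAN sketch in plain Python. It is not the PAL implementation; the Euclidean distance metric, the parameter names `eps` and `min_pts` and the sample points are assumptions made for illustration. Points that do not belong to any dense region receive the label -1 and are treated as outliers.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point: border
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point too: keep expanding
    return labels

points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2), (8.2, 8.1),
          (4.5, 20.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
print(labels)
print(outliers)
```

Because the distance is computed over all coordinates, every feature contributes to the decision, which is the difference to the one-dimensional statistical tests.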
Typically, these two methods (Statistical Test and Clustering) can only detect outliers in the input dataset, and the detection result cannot be generalized to new data points, because they do not come up with any model. The third article demonstrates how Classification methods can be adopted to overcome this difficulty.
However, all the aforementioned techniques become less applicable when the dataset of interest is of high dimensionality (i.e. contains many features), or when the boundary between normal points and anomalous ones is complicated. In his fourth article, Likun demonstrates a better approach: manually labeling the anomalous points in the dataset, and then training a supervised Machine Learning model to classify normal points and anomalies.
Another approach to Anomaly Detection is based on sensor data over time, which requires the usage of time series analysis techniques. We have the basics of that covered in the section below. Nidhi Sawhney and Rafael Pacheco showcase two scenarios in these three articles:
- Quality Identification of rail-road tracks – An application of Dynamic Time Warping using SAP HANA PAL
- Anomaly Forecast of Sensor Data in Energy Intensive Industries – Part I: The Machine Learning and Beer Production
- Anomaly Forecast of Sensor Data in Energy Intensive Industries – Part II: The Machine Learning Execution
Andreas Forster and Nico Wang have published another article on that approach, describing how to detect contextual anomalies:
Finally, Raymond Yao shows us how to apply Weibull Analysis – one of the most used algorithms for Predictive Maintenance use cases.
Time-Series Analysis
Another series of great articles from Likun Hou covers the most relevant aspects of Time-Series Analysis.
He starts off by explaining the basic principles of Time-Series Analysis, specifically the ideas of “Trends” and “Seasonality”, and how to decompose a series into these components in preparation for Anomaly Detection.
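As a rough illustration of what such a decomposition does, the following sketch splits a series into trend, seasonal and residual parts with a classical additive approach (centered moving average for the trend, per-phase averages for the seasonality). It is plain Python, not the PAL function; the function names, the known period of 4 and the sample series are assumptions. Large residuals would be the natural candidates for anomalies.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the window does not fit at the edges."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2:   # odd period: plain average over the window
            trend[t] = sum(window) / period
        else:            # even period: the two end points get half weight
            trend[t] = (0.5 * window[0] + sum(window[1:-1])
                        + 0.5 * window[-1]) / period
    return trend

def decompose_additive(series, period):
    """Split series into trend + seasonal + residual (needs >= 2 full periods)."""
    trend = centered_moving_average(series, period)
    by_phase = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(series, trend)):
        if m is not None:
            by_phase[t % period].append(y - m)
    phase_means = [sum(v) / len(v) for v in by_phase]
    offset = sum(phase_means) / period          # center the pattern around zero
    pattern = [m - offset for m in phase_means]
    seasonal = [pattern[t % period] for t in range(len(series))]
    residual = [None if m is None else y - m - s
                for y, m, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual

# Synthetic series: linear trend 10 + 2*t plus the repeating pattern [3, -1, -2, 0].
series = [13, 11, 12, 16, 21, 19, 20, 24, 29, 27, 28, 32]
trend, seasonal, residual = decompose_additive(series, period=4)
print(trend)
print(seasonal[:4])
print(residual)
```

For this noise-free synthetic series the residuals are zero wherever the trend is defined; on real data they carry the irregular component.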
The second article explains how to apply the most commonly used techniques for Time-Series Analysis: Exponential Smoothing and ARIMA.
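To give a feeling for the simpler of the two techniques, here is a minimal sketch of single exponential smoothing in plain Python. It is not the PAL implementation and does not cover ARIMA; the function name, the smoothing factor `alpha` and initializing the level with the first observation are assumptions made for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels; the last level is the one-step-ahead forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]            # assumption: start with the first observation
    smoothed = [level]
    for y in series[1:]:
        # New level = weighted mix of the latest observation and the old level.
        level = alpha * y + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

history = [10, 12, 14]
levels = simple_exponential_smoothing(history, alpha=0.5)
forecast = levels[-1]
print(levels)    # [10, 11.0, 12.5]
print(forecast)  # 12.5
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one smooths more strongly.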
Lastly, Likun introduces one of the most recent enhancements of the Predictive Analysis Library (PAL): the Additive Model Time-Series Analysis, an advanced approach that proves to be superior in dealing with time series that have a complicated trend, multiple seasonalities or cyclic patterns.
Xin Chen dedicated another article to Seasonal Decomposition, showcasing examples of how this can be done with SAP HANA PAL.
While many Time-Series scenarios are based on just one time-dependent variable, there are also many cases where Time-Series consist of more than one time-dependent variable and each variable depends not only on its past values but also has some dependency on other variables. These Multivariate Time-Series are covered in this article:
Marc Daniau introduces us to a recently added feature in the Automated Predictive Library, called Piecewise Linear Trend, that can be specifically helpful in detecting and handling change points in time series.
Text Mining
The most recent enhancement to the SAP HANA Machine Learning features is the Text Mining feature. The initial version allows for the analysis and classification of texts, like service tickets or text messages, and enables users to explore relations among the texts. Learn how to make use of this feature in the blog post by Alex Dalentzas:
Maxime Simon has prepared a more in-depth article on how to use this feature for automated classification of messages with customer complaints.
Multi-Model Data Science
The true power of SAP HANA for Data Science becomes visible when working with different data types. Since SAP HANA is a multi-model database, it supports various data types like JSON documents, graph networks or spatial data. And of course, we can use these for Machine Learning as well.
Mathias Kemeter has two great examples of how Machine Learning can be used on spatial data in these two articles.
External Machine Learning
The third flavor of SAP HANA Machine Learning is the integration of external Machine Learning servers. It mainly allows us to remotely execute Machine Learning models in TensorFlow or R on separate servers, using data from an SAP HANA database (on-premises) and consuming the results back in SAP HANA as well.
These two articles provide an overview of the R server integration.
- SAP HANA and R hands-on: From freestyle to deployment
- Parallelization options with the SAP HANA and R-Integration
More information on the TensorFlow integration can be found in these posts from Philip Mugglestone and Nandi Kishore:
- Introducing SAP HANA External Machine Learning (aka TensorFlow Integration)
- Tensorflow Machine Learning Model Integration with SAP HANA
Miscellaneous
Sometimes, you have the right data, in the right place, at the right time, but your scenario requires it to be turned by 90 degrees. That can be cumbersome, but Nidhi Sawhney shows us how pivoting can easily be done using SAP HANA and the Python API.
This article describes how to import multiple Excel files into a single SAP HANA table using the Python Machine Learning client for SAP HANA.
If you are looking for more code examples on the use of SAP HANA Machine Learning, please take a look at our sample repository on GitHub to find dozens of examples for the various use cases of Machine Learning.
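For readers unfamiliar with the term, the following sketch shows what pivoting does to a small dataset. It is plain Python over a list of dictionaries and does not use the SAP HANA Python API; the function name `pivot`, its parameters and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

def pivot(rows, index, columns, values, aggfunc=sum):
    """Turn long-format rows into one row per index value and one column per key."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row[index], row[columns])].append(row[values])
    index_keys = sorted({i for i, _ in cells})
    column_keys = sorted({c for _, c in cells})
    return [
        {index: i,
         **{c: aggfunc(cells[(i, c)]) if (i, c) in cells else None
            for c in column_keys}}
        for i in index_keys
    ]

long_rows = [
    {"region": "EMEA", "quarter": "Q1", "revenue": 100},
    {"region": "EMEA", "quarter": "Q2", "revenue": 120},
    {"region": "APJ", "quarter": "Q1", "revenue": 80},
    {"region": "EMEA", "quarter": "Q1", "revenue": 5},
]
wide_rows = pivot(long_rows, index="region", columns="quarter", values="revenue")
for row in wide_rows:
    print(row)
```

Each distinct quarter becomes its own column, values that share a cell are aggregated, and combinations without data are filled with None. When the data lives in the database, the aggregation step is the part that benefits from being executed there rather than on the client.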
Thank you!
I would like to thank all the above contributors for their tremendous effort and time to create these valuable materials!
As said in the introduction, this article will receive updates any time new relevant content gets created. If there is anything you miss in this collection (either because we missed it or because there are no resources on your specific topic), do not hesitate to reach out to Christoph Morgen (SAP HANA Product Management) or me (SAP HANA Solution Management).
Also, we are happy to take your feedback, thoughts or questions in the comment section below!
Without doubt this is the best blog covering all of the excellent material for using ML with SAP. Great Blog.
Very comprehensive.
Will explore few areas.
I read quite a few articles that have a few lines between PAL and APL and differences but all I as newbie see are algorithms accessed from hana_ml python library
https://blogs.sap.com/2021/02/25/hands-on-tutorial-leverage-sap-hana-machine-learning-in-the-cloud-through-the-predictive-analysis-library/
uses algorithm random forest
https://blogs.sap.com/2020/07/27/hands-on-tutorial-automated-predictive-apl-in-sap-hana-cloud/
uses algorithm Gradient Boosting Classification
You mention in your opening lines
"automated Machine Learning of the APL targets especially developers and business analysts, the expert Machine Learning of the PAL is designed for data scientists."
I have 2 novice questions
1. How do I know if algorithm is in PAL or in APL
other than look up the hana_ml reference?
hana_ml has 2 packages
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/hana_ml.algorithms.apl.html APL
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/hana_ml.algorithms.pal.html PAL
2. What is the "automated" aspect of APL
3. ISLM looks very attractive but only in recent versions of S4HANA
Is this conclusion correct? Not in 1909 but in 2021?
Regards
Jayanta
Hi Jayanta,
let me try to answer your questions.
Upfront the actual algorithm documentation can be found
The Python ML client documentation more focuses on the usage of the exposed algorithm functions and procedures.
Re 1) You can take a look at the documentation above for the algorithms. APL basically focuses on providing an automated function for classification, regression and time series. The internals of the algorithms used are not exposed; however, APL evaluates different approaches internally.
Whereas PAL provides documentation for each algorithm.
Re 2) APL automates this analysis, covering all steps from variable selection, data preparation, variable encoding, missing value handling, outlier handling, binning and banding to model testing and best model selection. The user therefore doesn't need to think about, for example, whether the type of the input features is supported by the algorithm. The constraints are managed by APL, which makes for a simple usage experience and a fast path to very good results. There is also a gradient boosting based classification/regression function to choose, which supports non-binary classification scenarios, for example.
Re 3) Yes, the ISLM capabilities have been enhanced since 2021. Regarding PAL use with unified classification, see AMDP Class for PAL | SAP Help Portal.
Best regards,
Christoph
Thanks Christoph
Very comprehensive answer and very convincing.
Regards
Jayanta