New R and enhanced Python API for SAP HANA Machine Learning – Released!
On April 5th 2019, HANA 2.0 SPS 04 was released! Amongst a whole bunch of great features released (see this blog by Joerg Latza for more details), I am going to focus on two exciting capabilities – the new R and the enhanced Python API for SAP HANA Machine Learning.
- The APIs have been generally available since April 5th with the release of HANA 2.0 SPS 04. You can download the packages in multiple ways, for example with the HANA Express Download Manager (see this blog), and can get started straight away, for free!
- Alongside the Python API, we now have a comparable API for R! In my previous blogs, I have given a walk-through on how to use the Python API and the value it can bring for building Machine Learning models on massive datasets, but below you’ll find a preview of one of the enhanced features – Exploratory Data Analysis. With the addition of the R API, you can train and deploy models in a similar fashion. Below I have provided some code samples for the R API, but for a detailed overview see this blog by Kurt Holst.
- The manual stages of the Machine Learning process (such as feature engineering, data encoding, sampling, feature selection and cross validation) can now be taken care of by the Automated Predictive Library (APL) algorithms. The user only needs to focus on the business problem being solved. See the documentation for more details and for a worked example use this link.
Python Example – Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an essential tool for Data Science. It is the process of understanding your dataset using statistical techniques and visualizations. The insight that you gain from EDA can help you to uncover issues and errors, give guidance on important variables, draw assumptions from the dataset and build powerful predictive models. The Python API now includes 3 EDA techniques:
- Distribution plot
- Pie plot
- Correlation plot
Note: The EDA capabilities will be expanded with further release cycles.
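To make the plot types a little more concrete, here is a minimal sketch of the numbers that sit behind a pie plot and a correlation plot. It is plain Python and does not use hana_ml. The helper names `category_shares`, `pearson` and `correlation_matrix` are hypothetical, and the use of Pearson coefficients is an assumption, since the post does not say which correlation measure the API computes.

```python
from collections import Counter
from math import sqrt


def category_shares(values):
    """Share of each category, i.e. the slice sizes of a pie plot."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise coefficients for a dict of column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Tiny made-up example, not taken from the Titanic data
shares = category_shares(["S", "C", "S", "Q"])
matrix = correlation_matrix({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
```

A correlation plot is then essentially a heat map of `matrix`, and a pie plot draws one slice per entry in `shares`.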
The benefit of leveraging these EDA plots with the HANA DataFrame is best illustrated with some performance benchmarks. I tested each plot on the same 10 million row data set and compared the time it took to return the plots in Jupyter.
- Using a Pandas DataFrame = on average 3 hours
- Using the HANA DataFrame = less than 5 seconds, for each of the 3 plots
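One plausible reason for a gap of this size is that the HANA DataFrame leaves the rows in the database and returns only the aggregated values a plot needs. The post does not describe how hana_ml does this internally, so the sketch below is only an illustration of the idea. It uses Python's built-in sqlite3 as a stand-in for HANA, and the function name `binned_counts_in_database` and the `passengers` table are hypothetical.

```python
import sqlite3


def binned_counts_in_database(conn, table, column, bins):
    """Compute histogram counts inside the database; only bin totals come back.

    `table` and `column` are interpolated into the SQL text, so they must be
    trusted identifiers. The table must contain at least one non-NULL value.
    """
    lo, hi = conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    rows = conn.execute(
        f"""
        SELECT MIN(CAST(({column} - ?) / ? AS INTEGER), ?) AS bin, COUNT(*)
        FROM {table}
        WHERE {column} IS NOT NULL
        GROUP BY bin
        ORDER BY bin
        """,
        (lo, width, bins - 1),  # the maximum value is clamped into the last bin
    ).fetchall()
    counts = [0] * bins
    for bin_index, count in rows:
        counts[bin_index] = count
    return counts


# Stand-in data: seven ages and one missing value
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (age REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?)",
    [(a,) for a in [1, 2, 3, 10, 20, 30, 40, None]],
)
counts = binned_counts_in_database(conn, "passengers", "age", bins=4)
print(counts)
```

However many rows the table holds, the client receives at most `bins` rows, whereas a Pandas-based workflow has to transfer every row first.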
The example below uses the Titanic data set. Credit for the data goes to https://www.kaggle.com/c/titanic/data.
# Import DataFrame, EDA and matplotlib
import matplotlib.pyplot as plt
from hana_ml import dataframe
from hana_ml.visualizers.eda import EDAVisualizer

# Connect to HANA
conn = dataframe.ConnectionContext('ADDRESS', 'PORT', 'USER', 'PASSWORD')

# Create the HANA DataFrame and point to the training table
data = conn.table("TABLE", schema="SCHEMA")

# Create side-by-side distribution plots for AGE of non-survivors and survivors
f = plt.figure(figsize=(18, 6))

# Left panel: non-survivors
ax1 = f.add_subplot(121)
eda = EDAVisualizer(ax1)
ax1, dist_data = eda.distribution_plot(data=data.filter("SURVIVED = 0"), column="AGE", bins=20, title="Distribution of AGE for non-survivors")

# Right panel: survivors
ax2 = f.add_subplot(122)
eda = EDAVisualizer(ax2)
ax2, dist_data = eda.distribution_plot(data=data.filter("SURVIVED = 1"), column="AGE", bins=20, title="Distribution of AGE for survivors")

plt.show()
This is just a preview of the EDA capabilities. An in-depth overview of all the plots and parameters will be detailed in my next blog… stay tuned.
R Example – K Means Clustering
K-means clustering in SAP HANA is an unsupervised machine learning algorithm that partitions data into a set of k clusters or groups. It assigns observations to groups such that objects within the same group are as similar as possible.
For this example, I will be using the Iris data set from the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/iris). This data set contains attributes of the iris plant. There are three species of iris plants:
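For readers who want to see the idea in code, here is a minimal sketch of the classic k-means procedure (Lloyd's algorithm) in plain Python. It is not the PAL implementation behind the HANA API, which the post does not describe. The function names, the random initialisation and the toy points are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centroids) for a list of equally sized numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two clearly separated groups of made-up 2D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning. Only the grouping matters, which is why the cluster IDs in the R output further down appear in arbitrary order.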
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Connecting to HANA
# Load HANA ML package
library(hana.ml.r)

# Use ConnectionContext to connect to HANA
conn.context <- hanaml.ConnectionContext('ADDRESS','USER','PASSWORD')

# Load data
data <- conn.context$table("IRIS")

# Look at the columns
as.character(data$columns)
>> "ID" "SEPALLENGTHCM" "SEPALWIDTHCM" "PETALLENGTHCM" "PETALWIDTHCM" "SPECIES"

# Look at the data types
sapply(data$dtypes(), paste, collapse = ",")
>> "ID,INTEGER,10" "SEPALLENGTHCM,DOUBLE,15" "SEPALWIDTHCM,DOUBLE,15" "PETALLENGTHCM,DOUBLE,15" "PETALWIDTHCM,DOUBLE,15" "SPECIES,VARCHAR,15"

# Number of rows
sprintf('Number of rows in Iris dataset: %s', data$nrows)
>> "Number of rows in Iris dataset: 150"
Training K-Means Clustering model
library(sets)
library(cluster)
library(dplyr)

# Train K Means model with 3 clusters
km <- hanaml.Kmeans(conn.context, data, n.clusters = 3)

# Plot clusters
kplot <- clusplot(data$Collect(), km$labels$Collect()$CLUSTER_ID, color = TRUE, shade = TRUE, labels = 2, lines = 0)

# Print cluster numbers
Cluster_number <- select(km$labels$Collect(), 2) %>% distinct()
print(Cluster_number)
>>   CLUSTER_ID
>> 1          2
>> 2          1
>> 3          0
These snippets are not meant to be an exhaustive analysis, simply to showcase some of the capabilities within the API. To learn more about the benefits of using the HANA ML API see this blog, and to get a deeper understanding of the R API, see Kurt’s blog once again.
- R and Python are undoubtedly two of the first tools within a Data Scientist’s toolbox. Now that the HANA ML package supports both programming languages, it can help to boost the productivity of your Data Science teams significantly.
- No more cumbersome data transfer and no more waiting for days for models to train: leveraging the HANA DataFrame is a game changer for EDA and Machine Learning.
- As we look to boost productivity we naturally fall into the world of automation. The APL enables easy access to automated algorithms to quickly identify contributing factors, validate hypotheses and build powerful predictive models all within the same API.
- The PAL and APL collectively house over 100 algorithms within HANA. The contents of the APIs will be updated with the release cycles. For information on what’s available today, follow the links for the R API and the Python API documentation.
- What’s new in SAP HANA SPS04 – https://blogs.sap.com/2019/04/05/whats-new-in-sap-hana-2.0-sps-04-2/
- Machine Learning from SAP HANA from R – https://blogs.sap.com/2019/04/09/machine-learning-with-sap-hana-from-r/
- Python Client API for Machine Learning in SAP HANA 2.0, Express Edition SPS 03, Revision 33 – https://blogs.sap.com/2018/10/29/python-client-api-for-machine-learning-in-sap-hana-2.0-express-edition-sps-03-revision-33/
- What is SAP HANA Automated Predictive Library – https://help.sap.com/viewer/cb31bd99d09747089754a0ba75067ed2/184.108.40.206/en-US
- End-to-End APL example – https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/hana_ml.html#end-to-end-example-using-the-automated-predictive-library-apl-module
- Blog explaining the benefits of the HANA ML package – https://blogs.sap.com/2018/12/17/diving-into-the-hana-dataframe-python-integration-part-1/
- R API Documentation – https://help.sap.com/doc/c48739beb06a4304a98e44b4d5b60a50/2.0.04/en-US/hana.ml.r/html/00Index.html
- Python API Documentation – https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/index.html
- Learn about the SAP HANA DataFrame – https://blogs.sap.com/2018/12/17/diving-into-the-hana-dataframe-python-integration-part-1/
- Learn about the ML capabilities within the Python API – https://blogs.sap.com/2019/01/28/diving-into-the-hana-dataframe-python-integration-part-2/
EDA very soon demands integrated visualisation, so this is really the way forward. My question: is the usage only for notebook purposes, or can this also be used in other deployment contexts? If so, how? Where can I find documentation on 'beyond notebooks' use-cases?
Also, the usage of HANA-native processing via PAL/APL looks promising (although it inherits the sometimes awkward approach from PAL).
I did install HANA_ML recently (from HANA Express), but this didn't deliver hana_ml.visualizers.eda. So I got the error "No module named 'hana_ml.visualizers.eda'" when executing the above in my notebook.
On top of that, I cannot find any documentation on what is available in this hana_ml context, besides https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/index.html. Where can I find the documentation concerning the visualisation part of hana_ml?
Thanks for reaching out. Can you give some examples of other deployment contexts that you'd like to use?
Also, can you please confirm the version of the HANA ML package that you have installed?
As for documentation on EDA, this is in progress. I will update you when this is added to the documentation site.
I'd also like to use this in an XSA (Python buildpack) context. Use-case scenario: one of our data scientists creates a workbook with the right interactivity (widgets) to enable usage by a broader community. I would just like to copy & paste the code into an XSA MTA setup, adapt some things like user authentication, and let the newborn (kind of self-service) application run on the HANA platform. Minimal support necessary from an ICT department, maximal reuse by the user community, and still a decent level of quality.
I've used the hana_ml which was coming with the latest version of HANA Express (I think rev33).
For now, the visualisations are only compatible with a notebook setting.
I'm unsure what is causing the error with the EDA classes. Did you manage to get this working?
I implemented my first ML application with the Python API, so I will integrate the EDA features you describe and look forward to your updates on them.
We are going to upgrade to SAP HANA 2.0 with a runtime license, so please advise whether we can use these features in our system as well. Thanks in advance.