On April 5th 2019, HANA 2.0 SPS 04 was released! Among a whole bunch of great new features (see this blog by Joerg Latza for more details), I am going to focus on two exciting capabilities – the new R API and the enhanced Python API for SAP HANA Machine Learning.
- The APIs have been generally available since April 5th with the release of HANA 2.0 SPS 04. You can download the packages in multiple ways, for example with the HANA Express Download Manager (see this blog), and get started straight away, for free!
- Alongside the Python API, we now have a comparable API for R! In my previous blogs, I have given a walk-through on how to use the Python API and the value it can bring for building Machine Learning models on massive datasets, but below you’ll find a preview of one of the enhanced features – Exploratory Data Analysis. With the addition of the R API, you can train and deploy models in a similar fashion. Below I have provided some code samples for the R API, but for a detailed overview see this blog by Kurt Holst.
- The manual stages of the Machine Learning process (such as feature engineering, data encoding, sampling, feature selection and cross validation) can now be taken care of by the Automated Predictive Library (APL) algorithms. The user only needs to focus on the business problem being solved. See the documentation for more details and for a worked example use this link.
Python Example – Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an essential tool for Data Science. It is the process of understanding your dataset using statistical techniques and visualizations. The insight that you gain from EDA can help you to uncover issues and errors, give guidance on important variables, draw assumptions from the dataset and build powerful predictive models. The Python API now includes 3 EDA techniques:
- Distribution plot
- Pie plot
- Correlation plot
Note: The EDA capabilities will be expanded with further release cycles.
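To give a feel for the statistic behind a correlation plot, here is a minimal sketch in plain Python. It assumes Pearson correlation, and the column names and values are made up for illustration. It is not the code the API runs inside HANA.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up sample columns (illustration only, not real Titanic rows).
columns = {
    "AGE": [20.0, 30.0, 40.0, 50.0, 60.0],
    "FARE": [10.0, 12.0, 30.0, 28.0, 55.0],
    "PCLASS": [3.0, 3.0, 2.0, 2.0, 1.0],
}

# Pairwise correlation matrix: the grid of numbers a correlation plot colours in.
names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

Each cell is a value between -1 and 1, and the matrix is symmetric with ones on the diagonal.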
The benefit of leveraging these EDA plots with the HANA DataFrame is best illustrated with some performance benchmarks. I tested these plots on the same 10 million row data set and compared the time it took to return the plots in Jupyter.
- Using a Pandas DataFrame = on average 3 hours
- Using the HANA DataFrame = less than 5 seconds, for each of the 3 plots
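A gap like this is easier to picture with a small sketch of the pushdown idea. The example below uses Python's built-in sqlite3 as a stand-in for HANA, purely as an assumption for illustration: the table name, column name, bin width and random data are all hypothetical, and this is not how hana_ml is implemented internally. The point is that the database does the binning, so only a handful of aggregated rows travel to the client instead of every raw row.

```python
import random
import sqlite3

# In-memory SQLite database standing in for HANA (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (age REAL)")

# Hypothetical data: 100,000 random ages between 0 and 80.
random.seed(42)
rows = [(random.uniform(0, 80),) for _ in range(100_000)]
conn.executemany("INSERT INTO passengers VALUES (?)", rows)

BIN_WIDTH = 10  # hypothetical bin width in years

# Client-side approach: transfer every row, then bin locally.
all_ages = [r[0] for r in conn.execute("SELECT age FROM passengers")]
local_counts = {}
for age in all_ages:
    bin_start = int(age / BIN_WIDTH) * BIN_WIDTH
    local_counts[bin_start] = local_counts.get(bin_start, 0) + 1

# Pushdown approach: the database bins and counts, only the bins come back.
pushed = conn.execute(
    "SELECT CAST(age / ? AS INTEGER) * ? AS bin_start, COUNT(*) "
    "FROM passengers GROUP BY bin_start ORDER BY bin_start",
    (BIN_WIDTH, BIN_WIDTH),
).fetchall()
pushed_counts = dict(pushed)

print(f"rows transferred client-side: {len(all_ages)}")
print(f"rows transferred with pushdown: {len(pushed)}")
```

Both approaches produce the same histogram, but the second one moves only one row per bin across the connection, which is the general reason a database-side DataFrame can return a plot so much faster.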
The example below uses the Titanic data set. Credit for the data goes to https://www.kaggle.com/c/titanic/data.

    # Import DataFrame, EDA and matplotlib
    import matplotlib.pyplot as plt
    from hana_ml import dataframe
    from hana_ml.visualizers.eda import EDAVisualizer

    # Connect to HANA (replace the placeholders with your own values)
    conn = dataframe.ConnectionContext('ADDRESS', 'PORT', 'USER', 'PASSWORD')

    # Create the HANA DataFrame and point to the training table
    data = conn.table("TABLE", schema="SCHEMA")

    # Create side-by-side distribution plots for AGE of non-survivors and survivors
    f = plt.figure(figsize=(18, 6))

    ax1 = f.add_subplot(121)
    eda = EDAVisualizer(ax1)
    ax1, dist_data = eda.distribution_plot(data=data.filter("SURVIVED = 0"), column="AGE", bins=20, title="Distribution of AGE for non-survivors")

    ax2 = f.add_subplot(122)
    eda = EDAVisualizer(ax2)
    ax2, dist_data = eda.distribution_plot(data=data.filter("SURVIVED = 1"), column="AGE", bins=20, title="Distribution of AGE for survivors")

    plt.show()

This is just a preview of the EDA capabilities. An in-depth overview of all the plots and parameters will follow in my next blog, so stay tuned.
R Example – K Means Clustering
K-means clustering in SAP HANA is an unsupervised machine learning algorithm that partitions data into a set of k clusters or groups. It assigns observations to groups such that objects within the same group are as similar as possible.
For this example, I will be using the Iris data set from the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/iris). This data set contains attributes of iris plants. There are three species of iris in the data set:
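For readers new to the algorithm, the idea can be sketched in a few lines of plain Python. This is a minimal version of the textbook Lloyd's algorithm on hypothetical toy data, with a function name and parameters of my own choosing. It is not the PAL implementation that runs inside HANA.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups in 2D.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The first three points end up sharing one label and the last three the other. Which group gets which label number depends on the random starting centroids, so the cluster IDs themselves carry no meaning.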
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Connecting to HANA
    # Load HANA ML package
    library(hana.ml.r)

    # Use ConnectionContext to connect to HANA
    conn.context <- hanaml.ConnectionContext('ADDRESS','USER','PASSWORD')

    # Load data
    data <- conn.context$table("IRIS")

    # Look at the columns
    as.character(data$columns)
    >> "ID" "SEPALLENGTHCM" "SEPALWIDTHCM" "PETALLENGTHCM" "PETALWIDTHCM" "SPECIES"

    # Look at the data types
    sapply(data$dtypes(), paste, collapse = ",")
    >> "ID,INTEGER,10" "SEPALLENGTHCM,DOUBLE,15" "SEPALWIDTHCM,DOUBLE,15" "PETALLENGTHCM,DOUBLE,15" "PETALWIDTHCM,DOUBLE,15" "SPECIES,VARCHAR,15"

    # Number of rows
    sprintf('Number of rows in Iris dataset: %s', data$nrows)
    >> "Number of rows in Iris dataset: 150"
Training K-Means Clustering model
    library(sets)
    library(cluster)
    library(dplyr)

    # Train K Means model with 3 clusters
    km <- hanaml.Kmeans(conn.context, data, n.clusters = 3)

    # Plot clusters
    kplot <- clusplot(data$Collect(), km$labels$Collect()$CLUSTER_ID, color = TRUE, shade = TRUE, labels = 2, lines = 0)

    # Print cluster numbers
    Cluster_number <- select(km$labels$Collect(), 2) %>% distinct()
    print(Cluster_number)
    >>
      CLUSTER_ID
    1          2
    2          1
    3          0
These snippets are not meant to be an exhaustive analysis, simply to showcase some of the capabilities within the API. To learn more about the benefits of using the HANA ML API see this blog, and to get a deeper understanding of the R API, see Kurt’s blog once again.
- R and Python are undoubtedly 2 of the first tools within a Data Scientist’s toolbox. With the HANA ML package now supporting both programming languages this can help to boost productivity of your Data Science teams significantly.
- No more cumbersome data transfer and no more waiting days for models to train: leveraging the HANA DataFrame is a game changer for EDA and Machine Learning.
- As we look to boost productivity we naturally fall into the world of automation. The APL enables easy access to automated algorithms to quickly identify contributing factors, validate hypotheses and build powerful predictive models all within the same API.
- The PAL and APL collectively house over 100 algorithms within HANA. The contents of the APIs will be updated with the release cycles. For information on what’s available today, follow the links for the R API and the Python API documentation.
- What’s new in SAP HANA SPS04 – https://blogs.sap.com/2019/04/05/whats-new-in-sap-hana-2.0-sps-04-2/
- Machine Learning with SAP HANA from R – https://blogs.sap.com/2019/04/09/machine-learning-with-sap-hana-from-r/
- Python Client API for Machine Learning in SAP HANA 2.0, Express Edition SPS 03, Revision 33 – https://blogs.sap.com/2018/10/29/python-client-api-for-machine-learning-in-sap-hana-2.0-express-edition-sps-03-revision-33/
- What is SAP HANA Automated Predictive Library – https://help.sap.com/viewer/cb31bd99d09747089754a0ba75067ed2/22.214.171.124/en-US
- End-to-End APL example – https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/hana_ml.html#end-to-end-example-using-the-automated-predictive-library-apl-module
- Blog explaining the benefits of the HANA ML package – https://blogs.sap.com/2018/12/17/diving-into-the-hana-dataframe-python-integration-part-1/
- R API Documentation – https://help.sap.com/doc/c48739beb06a4304a98e44b4d5b60a50/2.0.04/en-US/hana.ml.r/html/00Index.html
- Python API Documentation – https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/2.0.03/en-US/html/index.html
- Learn about the SAP HANA DataFrame – https://blogs.sap.com/2018/12/17/diving-into-the-hana-dataframe-python-integration-part-1/
- Learn about the ML capabilities within the Python API – https://blogs.sap.com/2019/01/28/diving-into-the-hana-dataframe-python-integration-part-2/