Hands-On Tutorial: Leverage SAP HANA embedded Machine Learning through an R Shiny App
As a student I was focused on the statistics and the modeling of the machine learning algorithms but in practice we are not done with just our R or Python script. In reality, important aspects like the data quality or the deployment of the models must be addressed. Therefore, I started setting up the R Machine Learning client for SAP HANA, which is not only an in-memory database but also comes with various machine learning capabilities. In this blog I will show you how you can stay in your used environment and beloved RStudio and leverage the power of the SAP HANA especially through the combination of the RJDBC, hana.ml.r and shiny packages. In addition, we want to create a dynamic web app directly from R to let users interact with our data and analysis. Clearly, our goal in this Hands-on Tutorial is to give data purpose.
What will you learn?
- Stop working with csv or excel files and directly connect to your SAP HANA from RStudio
- Leverage the power of SAP HANA and Machine Learning through the native Predictive Analysis Library (PAL)
- Document and historize your results and models in SAP HANA
- Create an interactive Shiny App for different users and visualize your results
Let us start from the beginning. In our use case we want to cluster different observations using the native Predictive Analysis Library on SAP HANA. Clustering refers to a broad range of techniques for finding subgroups in a dataset. Our goal is to find partitions such that the observations within these groups are very similar. For example, imagen our goal is to perform a market segmentation by identifying subgroups of people who are more receptive to a particular form of advertising or are more likely to purchase a particular product. As often in machine learning there are many ways to accomplish this goal and we will focus particularly on the example of a K-means clustering in this Hands-on tutorial. With this method we want to partition the observations into a pre-determined number of clusters. Of course, choosing the number of clusters K is quite a difficult and an important choice. On the one hand, if K is too large, we will divide the customers into too many groups and separate customers which should actually be together. On the other hand, if the number K is too small the behavior of the customers in the groups might be very different to each other. Good clustering means that the within-cluster variation is as small as possible.
For simplification we simulate our own data and hence know the true clusters to which the observations belong. Of course, we will not tell this to our K-means algorithm ?. The following plot illustrates the true clusters of our observations, which are separated into four groups.
Now, we move into my favorite environment RStudio. Before we can get started with our modeling we need to install a number of packages in R. The hana.ml.r package is available in the hana client under the following link and must be installed locally. The documentary can be found under this link. The complete R script used in this Hands-on tutorial can be found here. Now please execute the following part of the script to install the needed packages.
Further, we need to move the ngdbc.jar file, which is installed as part of the SAP HANA client installation, into our working environment. The ndgdbc.ja file is available under the following link. Make sure that your Java Home is set correctly for example:
Now, we will use the hanaml.ConnectionContext function to connect to our SAP HANA:
Rstudio will notify you in case your current HANA user does not have the needed rights to use the R Integration with SAP HANA. Please ask your admin to give you the additional rights.
Furthermore, we will use another possibility to connect to our SAP HANA using the CRAN package RJDBC. Both packages hana.ml.r and RJDBC have their individual advantages. We will use the PAL algorithms to create our K-means model from the hana.ml.r package and the functions available in the RJDBC package to document and historize our results in SAP HANA. Therefore, please execute the following R script:
Next, we simulate our data in RStudio and push it into our SAP HANA. Therefore, execute the following R script:
Please control in your SAP HANA that the data was loaded correctly.
Now, back in RStudio let us load the data using the connection created through the hana.ml.r package. There are many different possibilities to acquire the data. Two possibilities are:
In the next step we want to apply our K-means algorithm on our data set. First, let us look at the helpfile for our algorithm. Hence, call the following script, open the Help in RStudio and make yourself familiar with the function.
Now, we need to provide the key, drop the variable “WHICH” since we don’t want to tell the K-means algorithm the true clusters and set the number of clusters. In practice, one could also set a range of possible values for the number of clusters. But we will focus on 4 clusters in this Hands-on tutorial and try out different possibilities later in the shiny app.
Let us collect the results from the K-means algorithm and see how many observations we identified correctly. Therefore, please execute the following R script:
In the following plot we can see which observations belong to the same cluster identified by our K-Means algorithm.
Of course, we want to document and historize our results. Therefore, we will push the resulting table as well as our K-means model into SAP HANA through the following R script.
Now, please control your results in SAP HANA. After refreshing you should see the following two tables in your schema.
At last, we want to create a shiny app in which we can change the number of clusters in the PAL K-Means algorithm and visualize directly the results of our model. Hence, execute the following R script which defines our UI.
In the next step we need to define the server logic in which we define our PAL K-means algorithm and the visualization. Please node, you have to provide your login credentials at line 176. Then, execute the following R script:
With the UI and the server logic we can create our shiny app.
In our app we can now easily change the number of clusters in the K-means algorithm and directly visualize our results. Here is an example when you the set the slider to five clusters.
Through the R Integration of SAP HANA, the native PAL and shiny app we can provide different users with the power and insights of machine learning algorithms. Of course, this was just the tip of the iceberg. I encourage you to try it yourself and create your own awesome shiny apps in combination with the strength of SAP HANA.