Machine Learning in a Box (week 6) : SAP HANA R Integration
If you are jumping on a moving train, here is the link to the introduction blog of the Machine Learning in a Box series, which lets you follow the series from the start. At the end of that introduction blog you will find links to each element of the series.
Before we get started, a quick recap from last week
Last week, we looked at how to import data into SAP HANA, express edition, using the datasets provided with the SAP Predictive Analytics tools (and available online).
But the main idea was to show you how to import more or less any kind of text/CSV file into your HXE instance.
I hope you all managed to try this out, and some of you have probably already started playing with classification algorithms on the Census dataset or forecasting algorithms on the available Time Series data.
If you haven't started playing with algorithms yet, don't worry: the second part of the series will deal with this. So, let's complete our setup this week with the SAP HANA R integration.
Next, we will look at the External Machine Learning (EML) library.
Welcome to week 6 of Machine Learning in a Box!
SAP HANA R Integration
The SAP HANA R integration is a bit different from the Machine Learning capabilities already available with the AFL libraries (APL & PAL).
With the R integration, you can literally execute R code in SQLScript. OK, not directly like a simple SQL SELECT, but using RLANG.
Now you may ask what RLANG is, but before answering that, let's first set some context by explaining what R is (for those who have never heard of it). Then we will look at how the integration works, and finally at how to use it or how it could be used.
What is R?
R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
R is a GNU package. The source code for the R software environment is written primarily in C, Fortran, and R.
R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. While R has a command line interface, there are several graphical front-ends available.
The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices, import/export capabilities, reporting tools (knitr, Sweave), etc.
These packages are developed primarily in R, and sometimes in Java, C, C++, and Fortran.
A core set of packages is included with the installation of R, but more than 11,000 additional packages (as of July 2017) are available at the Comprehensive R Archive Network (CRAN), GitHub, and other repositories.
The SAP HANA R Integration scenarios
The goal of the integration of the SAP HANA database service with R is to enable the embedding of R code in the SAP HANA database context.
That is, the SAP HANA database allows R code to be processed in-line as part of the overall query execution plan using RLANG.
This scenario is suitable when an SAP HANA-based modeling and consumption application wants to use the R environment for specific statistical functions, for example ones that are not provided by the built-in libraries.
An efficient data exchange mechanism supports the transfer of intermediate database tables directly into the vector-oriented data structures of R.
This offers a performance advantage compared to standard SQL interfaces, which are tuple based and, therefore, require an additional data copy on the R side.
The SAP HANA R Integration explained
To process R code in the context of the SAP HANA database, the R code is embedded in SAP HANA SQL code in the form of an RLANG procedure.
The SAP HANA database uses the external R environment to execute this R code, similarly to native database operations like joins or aggregations.
This allows the application developer to elegantly embed R function definitions and calls within SQLScript and submit the entire code as part of a query to the database.
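To make this more concrete, here is a minimal sketch of an RLANG procedure that takes a table as input and returns a table as output, called from a regular SQLScript procedure. The table names (MEASURES, MEASURES_SCALED), the procedure names (SCALE_R, RUN_SCALE) and the sample values are made up for illustration. The calling pattern simply mirrors the IRIS example used to test the configuration in this post, so treat it as a starting point rather than a reference implementation.

```sql
-- Hypothetical input table with a few sample rows
CREATE COLUMN TABLE MEASURES ("ID" INTEGER, "X" DOUBLE);
INSERT INTO MEASURES VALUES (1, 1.5);
INSERT INTO MEASURES VALUES (2, 2.5);
INSERT INTO MEASURES VALUES (3, 4.0);

-- Hypothetical output table; its columns must match the R data frame below
CREATE COLUMN TABLE MEASURES_SCALED ("ID" INTEGER, "X_SCALED" DOUBLE);

-- The body between BEGIN and END is plain R:
-- "input1" arrives as an R data frame, "result" must be assigned a data frame
CREATE PROCEDURE SCALE_R(IN input1 "MEASURES", OUT result "MEASURES_SCALED")
LANGUAGE RLANG AS
BEGIN
  result <- data.frame(
    ID = input1$ID,
    X_SCALED = as.numeric(scale(input1$X))
  )
END;

-- SQLScript wrapper: feeds a query result to R and stores what comes back
CREATE PROCEDURE RUN_SCALE()
AS BEGIN
  input1 = SELECT * FROM MEASURES;
  CALL SCALE_R(:input1, result);
  INSERT INTO MEASURES_SCALED SELECT * FROM :result;
END;

CALL RUN_SCALE();
SELECT * FROM MEASURES_SCALED;
```

The R function scale() centers and standardizes the X column. Any other R function that returns a data frame matching the output table could take its place.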
The diagram below depicts the overall integration:
When the calculation model plan execution reaches an R-operator, the calculation engine’s R-client issues a request through the Rserve mechanism to create a dedicated R process on the R host.
Then, the R-client efficiently transfers the R function code and its input tables to this R process, and triggers R execution.
Once the R process completes the function execution, the resulting R data frame is returned to the calculation engine, which converts it back into an intermediate database table.
Since the internal column-oriented data structure used within the SAP HANA database for intermediate results is very similar to the vector-oriented R data frame, this conversion is very efficient.
A key benefit of having the overall control flow situated on the database side is that the database execution plans are inherently parallel and, therefore, multiple R processes can be triggered to run in parallel without having to worry about parallel execution within a single R process.
Configure the SAP HANA R Integration with SAP HANA, express edition
The pre-built versions of R are not compiled with dynamic/shared libraries enabled, which is required for the SAP HANA integration.
Therefore, you must compile R from its source code with shared library support enabled (the --enable-R-shlib configure option).
You can find all the details about that in the following tutorial:
At the end, you will also test the configuration by uploading one of the R built-in datasets (Iris).
Further details can also be found in the SAP HANA R Integration Guide.
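For orientation only, here is a rough sketch of what the SAP HANA side of the configuration looks like once Rserve is running on the R host. The parameter name (cer_rserve_addresses in the calcEngine section of indexserver.ini) and the CREATE R SCRIPT privilege are given as I recall them from the SAP HANA R Integration Guide. The host, port, configuration layer and user name are placeholders and assumptions, so follow the tutorial and the guide for the exact values for your HXE instance.

```sql
-- Tell the calculation engine where Rserve is listening
-- (<r-host> and <port> are placeholders; the 'SYSTEM' layer is an assumption)
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM')
  SET ('calcEngine', 'cer_rserve_addresses') = '<r-host>:<port>'
  WITH RECONFIGURE;

-- Allow a database user to create RLANG procedures
-- (<user> is a placeholder)
GRANT CREATE R SCRIPT TO <user>;
```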
As you may have noticed in the last step of the tutorial, you can access the R datasets and load them into SAP HANA:
CREATE COLUMN TABLE IRIS (
  "Sepal.Length" DOUBLE,
  "Sepal.Width"  DOUBLE,
  "Petal.Length" DOUBLE,
  "Petal.Width"  DOUBLE,
  "Species"      VARCHAR(5000)
);

CREATE PROCEDURE LOAD_IRIS(OUT iris "IRIS")
LANGUAGE RLANG AS
BEGIN
  library(datasets)
  data(iris)
  iris <- cbind(iris)
END;

CREATE PROCEDURE DISPLAY_IRIS()
AS BEGIN
  CALL LOAD_IRIS(iris);
  INSERT INTO IRIS SELECT * FROM :iris;
END;

CALL DISPLAY_IRIS();

SELECT * FROM IRIS;
This means that you can now import any of the sample datasets available in R. And guess what: R provides a “datasets” package with over a hundred datasets, as listed in the package documentation.
These datasets are really handy for learning, as they all come with R code examples that you can try and compare with SAP HANA APL and PAL, for example.
Now that you have the R integration set up, you can compare one of the PAL algorithms with R using the same dataset, like Census.
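As an illustration of the R side of such a comparison, here is a minimal sketch that runs k-means (from R's base stats package) on the IRIS table loaded above, so that the cluster assignments can be compared with those of a clustering algorithm from PAL. It uses IRIS rather than Census only because the IRIS columns are known from the test step. The table and procedure names (IRIS_CLUSTERS, IRIS_KMEANS_R, RUN_IRIS_KMEANS), the number of clusters and the seed are assumptions made for the example.

```sql
-- Hypothetical result table: one row per flower with its assigned cluster
CREATE COLUMN TABLE IRIS_CLUSTERS (
  "Species" VARCHAR(5000),
  "Cluster" INTEGER
);

-- R side: k-means on the four numeric columns,
-- with a fixed seed so that repeated runs give the same clusters
CREATE PROCEDURE IRIS_KMEANS_R(IN iris "IRIS", OUT result "IRIS_CLUSTERS")
LANGUAGE RLANG AS
BEGIN
  set.seed(42)
  features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
  fit <- kmeans(features, centers = 3)
  result <- data.frame(
    Species = as.character(iris$Species),
    Cluster = as.integer(fit$cluster),
    stringsAsFactors = FALSE
  )
END;

-- SQLScript wrapper: passes the IRIS rows to R and stores the assignments
CREATE PROCEDURE RUN_IRIS_KMEANS()
AS BEGIN
  iris_in = SELECT * FROM IRIS;
  CALL IRIS_KMEANS_R(:iris_in, result);
  INSERT INTO IRIS_CLUSTERS SELECT * FROM :result;
END;

CALL RUN_IRIS_KMEANS();

-- How well do the clusters line up with the known species?
SELECT "Species", "Cluster", COUNT(*) AS "Flowers"
FROM IRIS_CLUSTERS
GROUP BY "Species", "Cluster"
ORDER BY "Species", "Cluster";
```

Running the equivalent PAL procedure on the same table and comparing the two cross-tabulations is one simple way to see where the implementations agree.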
With the SAP HANA R integration in place, we are almost done with the environment setup: we are only missing the EML library, which we will dive into next week.
This means that we will install a TensorFlow serving server and connect our SAP HANA, express edition to it and consume a simple model (which I need to find now).
(Remember sharing && giving feedback is caring!)
UPDATE: Here are the links to all the Machine Learning in a Box weekly blogs:
- Introducing “Project: Machine Learning in a Box”
- Machine Learning in a Box (part 2) : Project Methodologies
- Recap Machine Learning in a Box (part 2) : Project Methodologies
- Machine Learning in a Box (part 3) : Algorithms Learning Styles
- Machine Learning in a Box (part 4) : Get your environment up and running
- Machine Learning in a Box (part 5) : Upload Machine Learning Datasets
- Machine Learning in a Box (part 6) : SAP HANA R Integration
- Machine Learning in a Box (part 7) : Jupyter Notebook
- Machine Learning in a Box (part 8) : SAP HANA EML and TensorFlow Integration
- Machine Learning in a Box (part 9) : Build your first Machine Learning application
- Machine Learning in a Box (part 10) : JupyterLab