Technical Articles
SAP Data Intelligence Cloud : How to use R Kernel in Notebook?
You might already know that you can build complex Machine Learning scenarios as part of your Data Value Journey with SAP Data Intelligence . If not then look at the blog post by Andreas Forster and Ingo Peter on how to create your first ML scenario in SAP Data Intelligence using Python & R :
SAP Data Intelligence: Create your first ML Scenario
SAP Data Intelligence: Create your first ML Scenario with R
In this blog post you will learn how to start R Kernel inside Notebook in SAP Data Intelligence Cloud to perform Data exploration and Free-Style Data Science and also to test R code before using it in the R Operator available in pipeline modeler.
Although R Kernel is not yet officially available in SAP Data Intelligence Cloud notebook (as of version 2010.29.9) , this blog post will guide you to install one using “conda” on python and will also present a way to load data sets stored in DI DATA LAKE using R library – “reticulate”.
Launch Notebook inside your ML SCENARIO
Launch a Notebook session inside your ML Scenario:
Select Python 3 Kernel :
Install R Kernel
Use below “Conda” command to install the R-Kernel :
Note -y option here is for the silent install. (else we get prompted to enter “y”/”yes” to begin the install)
!conda install -y -c r r-irkernel
All required/dependent packages will be installed and the final summary is as below :
Install Required R Packages
Before you begin using the newly installed R kernel, remember that the base R install won’t have all packages that you need ,for example look at below R code snippet requiring dplyr, sqldf,readxl etc
library("sqldf")
library("dplyr")
library("cluster")
library("tidyverse")
library("reshape2")
library("excel.link")
library("RODBC")
library("randomForest")
library("car")
Let’s first learn how to install such R packages.
Execute below command in the python 3 kernel notebook session (same as the one on which we installed R-kernel) to install “dplyr” and this is same as doing install.packages(“dplyr”) on the R kernel :
conda install r-dplyr
Follow the same for other required packages:
conda install r-reshape2
conda install r-RODBC
Launch a New Notebook with R Kernel
Select the Newly installed R-Kernel:
Or use the Launcher to create a new notebook with R kernel :
Let’s follow the IRIS R Example to use the R Kernel :
Load the Data:
library(datasets)
data(iris)
summary(iris)
names(iris) <- toupper(names(iris))
library(dplyr)
A basic Filter:
setosa <- filter(iris, SPECIES == "setosa")
head(setosa)
sepalLength5 <- filter(iris, SPECIES == "setosa", SEPAL.LENGTH > 5)
tail(sepalLength5)
Loading Data from a local CSV ( CSV uploaded in the Jupyter lab session) :
SFO landings Data Reference: https://data.sfgov.org/Transportation/Air-Traffic-Landings-Statistics/fpux-q53t
Loading Data from SAP DI Data Lake
We will make use of R library “reticulate” (library(reticulate)) to access artifacts stored in DI DATA LAKE using a python script , again this is a workaround which lets us access the shared artifacts.
First install the R reticulate library using the python 3 kernel with conda as below :
conda install r-reticulate
Create a new Text file to create a python script and insert below code cell which imports python packages ( sapdi , pandas) , and further defines a function to read a file from DI data lake :
import pandas as pd
import sapdi
def read_FILE_DL():
ws=sapdi.get_workspace(name='Air_Traffic_SFO')
dc=ws.get_datacollection(name='LANDINGS')
with dc.open('Air_Traffic_Landings_Statistics.csv').get_reader() as reader:
landings = pd.read_csv(reader)
return landings
Now Open the Notebook with R-kernel ,Restart the R Kernel before running below commands.
use below commands to load the Data :
library(reticulate)
source_python("load_data.py")
landings <- read_FILE_DL()
landings
Using the “MARATHON” dataset example
Refer Ingo’s Blog post and for Dataset use the blog post : SAP Data Intelligence: Create your first ML Scenario
df_train <- read.csv(file="RunningTimes.txt", header=TRUE, sep=";")
head(df_train)
ID HALFMARATHON_MINUTES MARATHON_MINUTES
1 73 149
2 74 154
3 78 158
4 73 165
5 74 172
6 84 173
str(df_train)
'data.frame': 117 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ HALFMARATHON_MINUTES: int 73 74 78 73 74 84 85 86 89 88 ...
$ MARATHON_MINUTES : int 149 154 158 165 172 173 176 177 177 177 ...
lm.fit<-lm(MARATHON_MINUTES~HALFMARATHON_MINUTES,data=df_train)
str(lm.fit)
List of 12
$ coefficients : Named num [1:2] -6.01 2.25
..- attr(*, "names")= chr [1:2] "(Intercept)" "HALFMARATHON_MINUTES"
$ residuals : Named num [1:117] -9.18 -6.43 -11.43 6.82 11.57 ...
..- attr(*, "names")= chr [1:117] "1" "2" "3" "4" ...
$ effects : Named num [1:117] -2361.82 294.44 -9.94 8.49 13.21 ...
..- attr(*, "names")= chr [1:117] "(Intercept)" "HALFMARATHON_MINUTES" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:117] 158 160 169 158 160 ...
..- attr(*, "names")= chr [1:117] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:117, 1:2] -10.8167 0.0925 0.0925 0.0925 0.0925 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:117] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "HALFMARATHON_MINUTES"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.09 1.18
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2.
///
SUMMARY
Its often a need in projects to quickly test R Code snippets before we can use them in Data Pipelines, so I hope that above workaround to have a working R kernel in Notebook within SAP Data Intelligence will be helpful.
I would also reiterate that the solution above serves as a workaround to test some of the R code snippets in notebook in SAP DI, and that the R SDK for SAP DI is not yet available ( unlike Python SDK for SAP DI) hence the artifacts stored in SAP DI data lake cannot be directly referenced using R kernel among other limitations( e.g. as is done using “sapdi” python library in SAP Data Intelligence)
Please follow the tag SAP Data Intelligence to get notified on latest updates, or visit the SAP Community page for SAP Data Intelligence to keep learning more: https://community.sap.com/topics/data-intelligence
That's excellent Vinay Bhatt, great to know that this is possible!
I tried it out and got the Marathon time forecast working with R in the DI Notebook
It's useful & nice blog post. Thank you, Vinay.
Hello Vinay, that's really good to know but after executing the steps i am always getting the following error, even after i have restarted the kernel or created a new notebook from scratch. Any idea?
The kernel for 057596fe-f23e-44bc-aac1-185a427834bd/notebooks/10_Loading_Datasets.ipynb appears to have died. It will restart automatically.
Thanks
Hi,
Thanks for your comment, you may try restarting the Jupyter lab application from the "System Management" Application :
Thanks!
Hi, Thanks for the answer, but I have tried several times to do it but after restarting the Jupyter Lab the kernel R dissapear and I have to execute again the steps to install the R kernel. the output of the conda info is as follows: