
SAP Data Intelligence: Create your first ML Scenario with R

Andreas Forster published the well-known blog post SAP Data Intelligence: Create your first ML Scenario, which walks through an easy-to-follow ML example on marathon times using Python. I took the opportunity to rewrite exactly the same example in R.

I deployed and ran the scenario on SAP Data Intelligence release 1908.

My main motivation was to share R code snippets for a few problems that took me quite some time to figure out when writing my first R operators, for example:

  • What is the R equivalent of Python's pickle for producing and reloading the model blob? The blob data type is needed to hand the model over to the Artifact Producer.
  • How are the model metrics turned into the JSON object that the Submit Metrics operator expects?
  • How must the prediction result be passed back to the REST API?

Training Data

We use exactly the same training data as in Andreas’ blog and save it to the Amazon S3 bucket that comes with the CAL image.

Data exploration and free-style Data Science

In the R world we use RStudio instead of a Jupyter notebook. RStudio is not as closely integrated with Data Intelligence as Jupyter is. There is some integration with SAP HANA if your data resides there, see e.g. the blog post by Kurt Holst. In our case the easiest approach is to load the CSV file into a local R installation and start analyzing the data there. We fit the model with R's lm function, i.e. an ordinary linear regression, which is the Gaussian special case of the generalized linear models offered by glm.

> df_train <- read.csv(file="C:/Downloads/RunningTimes.csv", header=TRUE, sep=";")
> df_train
     ID HALFMARATHON_MINUTES MARATHON_MINUTES
1     1                   73              149
2     2                   74              154
3     3                   78              158
4     4                   73              165
5     5                   74              172
6     6                   84              173
7     7                   85              176
8     8                   86              177
9     9                   89              177

You can check the type of the input, which is a data frame, as well as the types of its variables.

> str(df_train)
'data.frame':   117 obs. of  3 variables:
 $ ID                  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ HALFMARATHON_MINUTES: int  73 74 78 73 74 84 85 86 89 88 ...
 $ MARATHON_MINUTES    : int  149 154 158 165 172 173 176 177 177 177 ...

We train the model on all the data we have at hand, after first removing the ID column. After training, inspect the $residuals component of the model; it is used below to compute the root mean square error (RMSE) as a quality indicator of the model.

> df_train<-df_train[,-1]
> lm.fit<-lm(MARATHON_MINUTES~HALFMARATHON_MINUTES,data=df_train)
> str(lm.fit)
List of 12
 $ coefficients : Named num [1:2] -6.01 2.25
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "HALFMARATHON_MINUTES"
 $ residuals    : Named num [1:117] -9.18 -6.43 -11.43 6.82 11.57 ...
  ..- attr(*, "names")= chr [1:117] "1" "2" "3" "4" ...
 $ effects      : Named num [1:117] -2361.82 294.44 -9.94 8.49 13.21 ...
  ..- attr(*, "names")= chr [1:117] "(Intercept)" "HALFMARATHON_MINUTES" "" "" ...
 $ rank         : int 2
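As a quick sanity check of what $residuals contains, the sketch below refits the model on only the nine rows printed above (so the numbers will differ from the full 117-row fit) and verifies that the residuals are simply observed minus fitted values. The data frame name df_head is made up for this illustration.

```r
# First nine rows of the training data as printed above (not the full data set)
df_head <- data.frame(
    HALFMARATHON_MINUTES = c(73, 74, 78, 73, 74, 84, 85, 86, 89),
    MARATHON_MINUTES     = c(149, 154, 158, 165, 172, 173, 176, 177, 177)
)

fit <- lm(MARATHON_MINUTES ~ HALFMARATHON_MINUTES, data = df_head)

# residual = observed value - fitted value
manual_residuals <- df_head$MARATHON_MINUTES - fitted(fit)
print(all.equal(unname(manual_residuals), unname(residuals(fit))))

# root mean square error: square the residuals, average, take the square root
rmse <- sqrt(mean(residuals(fit)^2))
print(rmse)
```

The same sqrt(mean(...^2)) expression is what the training script later reports as its metric.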

This is also a good place to check the exact format of what has to be passed, for example, to the metrics port of the R client in the training pipeline (see below).

> library(jsonlite)
> rmse <- toString(sqrt(mean(lm.fit$residuals^2)))
> json<-toJSON(data.frame(rmse))
> json
[{"rmse":"16.959368690539"}] 
> json<-gsub('^.|.$', '', json)
> json
{"rmse":"16.959368690539"} 
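The gsub call strips the enclosing square brackets by removing the first and the last character. A possible alternative, sketched here with a made-up RMSE value, is to serialize a named list with jsonlite's auto_unbox = TRUE, which emits a single JSON object directly.

```r
library(jsonlite)

# illustrative value only; in the article this comes from the fitted model
rmse <- toString(16.96)

# a named list with auto_unbox = TRUE becomes one JSON object, not an array
json <- toJSON(list(rmse = rmse), auto_unbox = TRUE)
print(json)

# parse it back to confirm the structure
parsed <- fromJSON(json)
print(parsed$rmse)
```

Either approach yields an object with a single "rmse" key; the list variant avoids relying on string surgery.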

After training, the model has to be turned into a blob. In R this can be done by serializing it with saveRDS into an in-memory rawConnection and reading back the resulting raw vector.

> conn <- rawConnection(raw(0), "w")
> saveRDS(lm.fit, conn)
> modelBlob <- rawConnectionValue(conn)
> str(modelBlob)
 raw [1:12621] 58 0a 00 00 ...
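Base R's serialize offers a related route: called with connection = NULL it returns the raw vector directly, and unserialize restores the object. The sketch below uses a small made-up data set. Whether you prefer this over saveRDS with a rawConnection is a matter of taste, but it is safest to read a blob back with the counterpart of the function that wrote it.

```r
# small made-up data set, only to have a model to serialize
df_demo <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
model <- lm(y ~ x, data = df_demo)

# connection = NULL makes serialize() return a raw vector instead of writing out
blob <- serialize(model, connection = NULL)
print(class(blob))

# restore the model and confirm it predicts like the original
model_reloaded <- unserialize(blob)
new_data <- data.frame(x = 6)
print(predict(model_reloaded, newdata = new_data))
```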

Now let’s check that the reload works and predict for a sample value.

> glm_model_reload = readRDS(rawConnection(modelBlob, "r"))
> df_inference<-as.data.frame(fromJSON('{"HALFMARATHON_MINUTES":122}'))
> df_inference
  HALFMARATHON_MINUTES
1                  122
> marathon_minutes_prediction <- predict(glm_model_reload,newdata=df_inference,type="response")
> marathon_minutes_prediction <-toJSON(data.frame(marathon_minutes_prediction))
> result <-gsub('^.|.$', '', marathon_minutes_prediction)
> result
{"marathon_minutes_prediction":268.3888} 

The content of result finally has to be passed back via the REST API (see the script for the inference pipeline below). We now have an R script that is ready to be used in the deployment pipelines.

Deployment

Now everything is in place to deploy the model in two graphical pipelines:

  • One pipeline to train the model and save it into the ML Scenario.
  • Another pipeline to expose the model as a REST API for inference.

Training Pipeline

In Data Intelligence, create a new Machine Learning Scenario “Marathon Times w/ R” and add a producer pipeline of type R producer.

This generates a template that needs a few adjustments:

  • If necessary, replace both of the deprecated R clients com.sap.system.rClient2 in the template with the new com.sap.system.rClient3. On the new operators, the ports have to be created with exactly the same names and types as on the old R clients. The script of the second R client can be copied over.
  • Configure the “Read File” operator to read the training data from the AWS S3 bucket where we placed it before.
  • Then insert the following R script into the first R client:
# Example R Client script to perform training on input data & generate Metrics & Model Blob
library(jsonlite)

onInput <- function(in_csv) {
    # parse the CSV text from the input port and drop the ID column
    df_train <- read.table(text=in_csv, header=TRUE, sep=";", quote="\"'", dec=".")
    df_train <- df_train[,-1]

    lm.fit <- lm(MARATHON_MINUTES~HALFMARATHON_MINUTES, data=df_train)

    # produce a model blob from your training to be used with the Artifact Producer operator
    conn <- rawConnection(raw(0), "w")
    saveRDS(lm.fit, conn)
    modelBlob <- rawConnectionValue(conn)
    close(conn)

    # produce some metrics JSON which will be used by the Submit Metrics operator. Send as a message
    rmse <- toString(sqrt(mean(lm.fit$residuals^2)))
    json <- toJSON(data.frame(rmse))
    json <- gsub('^.|.$', '', json)  # strip the enclosing [ ] so a single JSON object remains
    metrics <- list(Body=json, Attributes=list(), Encoding="UTF-8")

    # send the model blob and metrics to the output ports
    list(modelBlob=modelBlob, metrics=metrics)
}

api$setPortCallback(c("input"), c("metrics", "modelBlob"), "onInput")
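Because the api object only exists inside the Data Intelligence runtime, it can help to dry-run the callback logic locally first. The sketch below repeats the core of onInput in a stand-alone function (train_from_csv is a hypothetical name) and feeds it the nine rows printed earlier as a CSV string. It assumes the input port delivers the file content as a single string.

```r
library(jsonlite)

# stand-alone copy of the training logic, without the api$... registration
train_from_csv <- function(in_csv) {
    df_train <- read.table(text = in_csv, header = TRUE, sep = ";")
    df_train <- df_train[, -1]
    fit <- lm(MARATHON_MINUTES ~ HALFMARATHON_MINUTES, data = df_train)

    # serialize the model into an in-memory raw vector
    conn <- rawConnection(raw(0), "w")
    on.exit(close(conn))
    saveRDS(fit, conn)
    model_blob <- rawConnectionValue(conn)

    # metrics as a single JSON object with the RMSE as a string
    rmse <- toString(sqrt(mean(fit$residuals^2)))
    json <- gsub('^.|.$', '', toJSON(data.frame(rmse)))
    list(modelBlob = model_blob,
         metrics = list(Body = json, Attributes = list(), Encoding = "UTF-8"))
}

# first nine rows of the training file, in the same ';'-separated layout
csv_text <- paste(
    "ID;HALFMARATHON_MINUTES;MARATHON_MINUTES",
    "1;73;149", "2;74;154", "3;78;158", "4;73;165", "5;74;172",
    "6;84;173", "7;85;176", "8;86;177", "9;89;177",
    sep = "\n"
)

out <- train_from_csv(csv_text)
print(is.raw(out$modelBlob))
print(out$metrics$Body)
```

If both outputs look right locally, the remaining risk is limited to the port wiring in the pipeline itself.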

The default R Client operator already comes with the three tags “rserve”, “rjsonlite”, and “rmsgpack”, so there is no need to create a dedicated Dockerfile and a corresponding group to provide the runtime for this pipeline.

Save the pipeline and execute it from the ML Scenario Manager. When the execution is started, you have to provide a name for the model to be trained.

The pipeline will run, complete, and save the model to the SAP data lake (SDL). It also reports the metric RMSE (root mean square error) as defined in the code above.

Next we want to use this model for real-time inference.

Prediction / Inference with REST-API

To start, create a consumer pipeline of type R consumer from the Machine Learning Scenario “Marathon Times w/ R”.

This will generate a template for the R consumer pipeline.

Again the pipeline has to be adapted. Keep the variable ${ARTIFACT:MODEL} of the operator “Submit Artifact Name” as it is. The R script in the R client has to be changed in a few places:

  • I changed the API call for receiving the model from api$setPortCallback to api$setSwitchCallback to ensure that input from the REST API is only processed once a model is at hand. After the model has reached the input port, the switch variable (switchPosition) has to be set to TRUE.
  • Instead of building the response in separate steps via resp$Body and resp$Attributes as suggested by the template, I create it in a single line.

The script below should work when copied and pasted into the script of the R client in the consumer pipeline.

library(jsonlite)

# global variables for checking the model status
globalModelBlob <<- NA
modelReady <<- FALSE

# function for validating the json sent in POST request from client
isJson <- function(msg_body) {
    json <- try(fromJSON(msg_body), silent = TRUE)
    # inherits() is safe even if the parsed object has more than one class
    !inherits(json, 'try-error')
}

# when the modelBlob reaches the input port
onModel <- function(modelBlob) {
    globalModelBlob <<- modelBlob
    modelReady <<- TRUE
    list(switchPosition=TRUE)
}

# when user sends a POST request with JSON data
onInput <- function(msg) {
    success <- FALSE
    errorMessage <- ''
    if(modelReady){
        data <- rawToChar(msg$Body)
        if(isJson(data)){
            df_inference<-as.data.frame(fromJSON(data))
            # the model expects the predictor under the column name used in training
            names(df_inference)<-"HALFMARATHON_MINUTES"
            glm_model = readRDS(rawConnection(globalModelBlob, "r"))
            marathon_minutes_prediction <- predict(glm_model,newdata=df_inference,type="response")
            marathon_minutes_prediction <-toJSON(data.frame(marathon_minutes_prediction))
            result <-gsub('^.|.$', '', marathon_minutes_prediction)
            success <- TRUE
        } else {
            errorMessage <- 'Invalid JSON provided in request.'
            success <- FALSE
        }

    } else {
        print('Model has not yet reached the input port - try again.')
        errorMessage <- 'Model has not yet reached the input port - try again.'
        success <- FALSE
    }
    if(success){
        resp =list(Body=result, Attributes=list('message.request.id'=msg$Attributes[['message.request.id']]), Encoding="UTF-8")
    } else {
        resp =list(Body=paste('{"Error": "', errorMessage, '"}', sep=""), Attributes=list('message.request.id'=msg$Attributes[['message.request.id']]), Encoding="UTF-8")    
    }
    list(response=resp)
}

api$setSwitchCallback(c("model"), c(), "onModel")
api$setPortCallback(c("input"), c("response"), "onInput")
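The request handling can likewise be exercised outside the pipeline. The following sketch condenses the prediction path into a hypothetical helper predict_from_body and calls it with hand-built raw bodies. The model is trained on the nine rows shown earlier, so the predicted value will not match the full model.

```r
library(jsonlite)

# hypothetical helper: raw request body + model blob -> JSON string
predict_from_body <- function(body_raw, model_blob) {
    data <- rawToChar(body_raw)
    parsed <- try(fromJSON(data), silent = TRUE)
    if (inherits(parsed, "try-error")) {
        return('{"Error": "Invalid JSON provided in request."}')
    }
    df_inference <- as.data.frame(parsed)
    names(df_inference) <- "HALFMARATHON_MINUTES"

    # restore the model from the blob and close the connection afterwards
    conn <- rawConnection(model_blob, "r")
    on.exit(close(conn))
    model <- readRDS(conn)

    marathon_minutes_prediction <- predict(model, newdata = df_inference)
    json <- toJSON(data.frame(marathon_minutes_prediction))
    gsub('^.|.$', '', json)
}

# train a small model and turn it into a blob, as in the training pipeline
df_head <- data.frame(
    HALFMARATHON_MINUTES = c(73, 74, 78, 73, 74, 84, 85, 86, 89),
    MARATHON_MINUTES     = c(149, 154, 158, 165, 172, 173, 176, 177, 177)
)
w <- rawConnection(raw(0), "w")
saveRDS(lm(MARATHON_MINUTES ~ HALFMARATHON_MINUTES, data = df_head), w)
blob <- rawConnectionValue(w)
close(w)

# valid and truncated request bodies, passed as raw bytes like msg$Body
print(predict_from_body(charToRaw('{"HALFMARATHON_MINUTES":122}'), blob))
print(predict_from_body(charToRaw('{"HALFMARATHON_MINUTES":'), blob))
```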

Save the pipeline and then deploy it from the Machine Learning Scenario Manager. At step 4 you will be asked which model to use. A value help is available; choose the model that was created during the training step.

As soon as the pipeline is running, a deployment URL appears, which can be used e.g. in Postman for real-time prediction of marathon times.

Take that deployment URL, append /v1/uploadjson/ to it, and enter it as the request URL in Postman. Change the request method from GET to POST.

In the “Authorization” tab, set the authorization type to Basic Auth and enter <tenant>\<user> as the username together with the password you use to log on to Data Intelligence. Go to the “Headers” tab and add the key “X-Requested-With” with the value “XMLHttpRequest”.

Finally, pass the input data to the REST API. Select the “Body” tab, choose “raw”, and enter the input as JSON, for example the payload from the local test above:

{"HALFMARATHON_MINUTES":122}

Press “Send” and you should see the prediction coming from SAP Data Intelligence. Compare with Andreas’ blog and you will see the same results: there the model was computed by LinearRegression from sklearn, here it was computed by fitting a linear model with lm in R.
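For scripted calls, the Postman steps can be reproduced from R. The sketch below assumes the httr package is installed; build_endpoint and predict_marathon are hypothetical helper names, and the deployment URL, tenant, user, and password are placeholders you must supply. The request is only sent when predict_marathon is called.

```r
# append the REST path to the deployment URL, avoiding a double slash
build_endpoint <- function(deployment_url) {
    paste0(sub("/+$", "", deployment_url), "/v1/uploadjson/")
}

# send one prediction request; mirrors the Postman settings described above
predict_marathon <- function(deployment_url, tenant, user, password, half_marathon_minutes) {
    # field name taken from the local test earlier in the article
    body <- sprintf('{"HALFMARATHON_MINUTES":%s}', format(half_marathon_minutes))
    resp <- httr::POST(
        build_endpoint(deployment_url),
        httr::authenticate(paste0(tenant, "\\", user), password, type = "basic"),
        httr::add_headers("X-Requested-With" = "XMLHttpRequest"),
        httr::content_type_json(),
        body = body
    )
    httr::content(resp, as = "text", encoding = "UTF-8")
}

# placeholder URL, shown only to illustrate the resulting endpoint
print(build_endpoint("https://<host>/<deployment-path>/"))
```

The returned text should be the same JSON string that Postman displays, which can then be parsed with jsonlite's fromJSON.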

Summary

We have rebuilt the Python machine learning example from the blog SAP Data Intelligence: Create your first ML Scenario by Andreas Forster with a linear model fitted in R. The main intention is to show the scenario from the R perspective and to share some crucial R code snippets for working with the standard R producer and R consumer templates in SAP Data Intelligence.
