Skip to Content

Since I wrote my blog When SAP HANA met R – First kiss I had received a lot of nice feedback…and one those feedbacks was…”What’s new?”…

Well…as you might now SAP HANA works with R by using Rserve, a package that allows communication to an R Server, so really…there can’t be too many new features…but…the cool thing is that SAP HANA has been tested with R 2.15 and Rserve 0.6-8 so any new features added on R and Rserve and instantly available on SAP HANA ๐Ÿ˜‰

But hey! I wouldn’t write a blog is there wasn’t at least one new cool feature, right? You can read more about it here SAP HANA R Integration Guide.

Of course…for this you need SAP HANA rev. 48 (SPS5)

So what’s the cool feature? Well…you can store a train model like lm() or ksvm() directly on a table for later use. This is really cool, because if you have a big calculation to be made, you only need to store the model and use it later without having to reprocess everything again.

Let’s make an example…and hope all the R fans doesn’t kill me for this…because when it comes to statistics…I’m really lost in the woods ๐Ÿ™

Let’s say we have two tables from the SFLIGHT package…SPFLI and STICKET, so we want to predict how many times a customer is going to flight to different destinations (CITYFROM-CITYTO) depending on how many times all the customers has flights to those very same locations.

We’re going to create one SQLScript file to get the information, transform it, create the model and store it in the database…

Build_Flight_Model.sql

–Create a TYPE T_FLIGHTS to grab the information from the SPFLI and STICKET tables.

DROP TYPE T_FLIGHTS;

CREATE TYPE T_FLIGHTS AS TABLE (

CARRID NVARCHAR(3),

CUSTOMID NVARCHAR(8),

CITYFROM NVARCHAR(20),

CITYTO NVARCHAR(20)

);

–Create a TYPE FLIGHT_MODEL_T and a table FLIGHT_MODEL to get and store the model in the database.

DROP TYPE FLIGHT_MODEL_T;

CREATE TYPE FLIGHT_MODEL_T AS TABLE (

ID INTEGER,

DESCRIPTION VARCHAR(255),

MODEL BLOB

);

DROP TABLE FLIGHT_MODEL;

CREATE COLUMN TABLE FLIGHT_MODEL (

ID INTEGER,

DESCRIPTION VARCHAR(255),

MODEL BLOB

);

–This R procedure will receive the T_FLIGHTS information, create table containing a field call FLIGHT_NAME that will contain the concatenation

–of the CARRID, CITYFROM and CITYTO. ie: AA-NEW YORK-SAN FRANCISCO.

–We’re going to convert the table into a data.frame, so all the similar values in FLIGHT_NAME are going to be summarized.

–Using the subset() function, we’re going to get rid of all the FLIGHT_NAME’s that has a frequency lower or equal than 0.

–We’re going to use the nrow() function to count all the FLIGHT_NAME occurrences and multiply that by 10 (Stored in f_rows)

–We’re going to use the sum() function to sum all the frequencies and the divide it by f_rows (Stored in f_sum)

–We’re going to use the mapply() function to divide each of the frequencies by f_sum

–We’re going to use the order() function to sort by FLIGHT_NAME

–We’re going to use the colnames() function to assign names to our data.frame

–We’re going to use the lm() function to generate a Linear Regression based on the FLIGHT_NAME and it’s frequency

–Finally, we’re going to use the generateRobjColumn() custom created function to store the result of the model in the buffer.


DROP PROCEDURE FLIGHT_TRAIN_PROC;

CREATE PROCEDURE FLIGHT_TRAIN_PROC (IN traininput “T_FLIGHTS”, OUT modelresult FLIGHT_MODEL_T)

LANGUAGE RLANG AS

BEGIN

generateRobjColumn <- function(…){

          result <- as.data.frame(cbind(

                    lapply(

                              list(…),

                              function(x) if (is.null(x)) NULL else serialize(x, NULL)

                    )

          ))

          names(result) <- NULL

          names(result[[1]]) <- NULL

          result

}

tab<-table(FLIGHT_NAME=paste(traininput$CARRID,traininput$CITYFROM,traininput$CITYTO,sep=”-“))

df<-data.frame(tab)

ss<-subset(df,(df$Freq>0))

freq<-ss$Freq

f_rows<-(nrow(ss)) * 10

fsum<-sum(freq) / f_rows

ss$Freq<-mapply(“/”,ss$Freq, fsum)

flights<-ss[order(ss$FLIGHT_NAME),]

colnames(flights)<-c(“FLIGHT_NAME”,”FREQUENCY”)

lmModel<-lm(FREQUENCY ~ FLIGHT_NAME,data=flights)

modelresult<-data.frame(

ID=c(1),

DESCRIPTION=c(“Flight Model”),

MODEL=generateRobjColumn(lmModel)

)

END;

–This SQLSCRIPT procedure will grab all the needed information from the tables SPFLI and STICKET and will assign it to flights

–We’re going to call the R procedure FLIGHT_TRAIN_PROC

–We’re going to do an INSERT to finally store the model from the buffer into the database

DROP PROCEDURE POPULATE_FLIGHTS;

CREATE PROCEDURE POPULATE_FLIGHTS ()

LANGUAGE SQLSCRIPT AS

BEGIN

flights = SELECT SPFLI.CARRID, CUSTOMID, CITYFROM, CITYTO

             FROM SFLIGHT.SPFLI INNER JOIN SFLIGHT.STICKET

             ON SPFLI.CARRID = STICKET.CARRID

             AND SPFLI.CONNID = STICKET.CONNID;

CALL FLIGHT_TRAIN_PROC(:flights, FLIGHT_MODEL_T);

INSERT INTO “FLIGHT_MODEL” SELECT * FROM :FLIGHT_MODEL_T;

END;

CALL POPULATE_FLIGHTS();

When we call POPULATE_FLIGHTS(), our FLIGHT_MODEL table should look like this…

FLIGHT_MODEL.png

If you are wondering why we have an “X”…it’s because the content is serialized and stored in a BLOB field…if you inspect the content, you will receive a bunch of weird hexadecimal numbers…

Anyway…it took 6.165 seconds to SAP HANA to process 1,842,160 records.

Now the we have our model safely stored in the database, we can move to our next SQLScript file…

GET_AND_USE_FLIGHT_MODEL.sql

–We’re going to create a TYPE T_PREDICTED_FLIGHTS and a table PREDICTED_FLIGHTS to store the information of the current number of flights and

–the estimated (according to our prediction) number of flights

DROP TYPE T_PREDICTED_FLIGHTS;

CREATE TYPE T_PREDICTED_FLIGHTS AS TABLE (

CUSTOMID NVARCHAR(8),

FLIGHT_NAME NVARCHAR(60),

FREQUENCY INTEGER,

PREDICTED INTEGER

);

DROP TABLE PREDICTED_FLIGHTS;

CREATE TABLE PREDICTED_FLIGHTS (

CUSTOMID NVARCHAR(8),

FLIGHT_NAME NVARCHAR(60),

FREQUENCY INTEGER,

PREDICTED INTEGER

);

–In this R procedure, we’re going to receive the flight for a given customer, the model stored in the database and we’re going to return the result so it can be

–stored in our PREDICTED_FLIGHTS table.

–We’re going to use the unserialize() function to extract the model.

–We’re going to create a table containing a field call FLIGHT_NAME that will contain the concatenation of the CARRID, CITYFROM and CITYTO.

–ie: AA-NEW YORK-SAN FRANCISCO. and also the CUSTOMID

–We’re going to convert the table into a data.frame, so all the similar values in FLIGHT_NAME are going to be summarized.

–We’re going to use the colnames() function to assign names to our data.frame

–We’re going to use the nrow() function to get the number of records in the data.frame (Stored in dfrows)

–We’re going to use the rep() function to repeat the CUSTOMID value of the first record dfrows times

–We’re going to use the predict() function to predict the amount of flights based on our model (retrieved from the database) and the new data that we recovered

–Finally, we’re going to create a data.frame containing all the information that should be stored in our table PREDICTED_FLIGHTS


DROP PROCEDURE USE_FLIGHT;

CREATE PROCEDURE USE_FLIGHT(IN flights T_FLIGHTS, IN modeltbl FLIGHT_MODEL_T, OUT out_flights T_PREDICTED_FLIGHTS)

LANGUAGE RLANG AS

BEGIN

lmModel<-unserialize(modeltbl$MODEL[[1]])

tab<-table(FLIGHT_NAME=paste(flights$CARRID,flights$CITYFROM,flights$CITYTO,sep=”-“),CUSTOMID=flights$CUSTOMID)

df<-data.frame(tab)

colnames(df)<-c(“FLIGHT_NAME”,”CUSTOMID”,”FREQUENCY”)

dfrows<-nrow(df)

customid<-rep(df$CUSTOMID[1],dfrows)

prediction=predict(lmModel,df,interval=”none”)

out_flights<-data.frame(CUSTOMID=customid,FLIGHT_NAME=df$FLIGHT_NAME,FREQUENCY=df$FREQUENCY,PREDICTED=prediction)

END;

–This SQLSCRIPT procedure will select the information from the FLIGHT_MODEL table and store in the flight_model variable

–We’re going to select all the needed information from the table SPFLI and STICKET based on the customer ID number

–We’re going to call the R procedure USE_FLIGHT and it will return us the PREDICTED_FLIGHTS that we’re going to store in the database

DROP PROCEDURE GET_FLIGHTS;

CREATE PROCEDURE GET_FLIGHTS(IN customId NVARCHAR(8))

LANGUAGE SQLSCRIPT AS

BEGIN

flight_model = SELECT * FROM FLIGHT_MODEL;

out_flights = SELECT SPFLI.CARRID, CUSTOMID, CITYFROM, CITYTO FROM SFLIGHT.SPFLI INNER JOIN SFLIGHT.STICKET

                   ON SPFLI.CARRID = STICKET.CARRID AND SPFLI.CONNID = STICKET.CONNID

                   WHERE CUSTOMID = :customId;

CALL USE_FLIGHT(:out_flights, :flight_model, PREDICTED_FLIGHTS);

INSERT INTO “PREDICTED_FLIGHTS” SELECT * FROM :PREDICTED_FLIGHTS;

END;

Now that we have all our Stored Procedures ready…we can create the last SQLScript file to actually fill our PREDICTED_FLIGHTS with some data…

PREDICT_FLIGHTS_FOR_CUSTOMERS.sql

CALL GET_FLIGHTS(‘00000156’);

CALL GET_FLIGHTS(‘00002078’);

CALL GET_FLIGHTS(‘00002463’);

As you can see…we only need to call the GET_FLIGHTS procedure, passing the Customer ID’s…

This process took only 970ms and generate 122 records for the 3 customers…

Now I’m sure you realize how cool is this…if we haven’t stored our model in the database…then we would have to calculate the model for each customer…and it would have took us around 7 seconds for each (get the lm(), plus the predict)…that would have been around 21 seconds for the 3 customers…while we can say that the whole process took us only 7 seconds…if you needed to calculate the prediction for all the more than 1 million customers…you will be in sort of trouble ๐Ÿ™‚

Let’s see the content of our PREDICTED_FLIGHTS table…of course, I’m going to only show a part of it…

PREDICTED_FLIGHTS.png

We can dump this information to a CSV file from SAP HANA Studio and let’s say…use it on SAP Visual Intelligence to generate a nice graphic…

VI_PREDICTED.png

Hope you like it ๐Ÿ™‚

To report this post you need to login first.

11 Comments

You must be Logged on to comment or reply to a post.

  1. Former Member

    Thanks for the information. I recently installed the PAL library and integrated the R server with my SAP HANA server. Everything is working great but I was curious to know if SAP offers an Rserve sizing guide? We are currently limiting the connection between SAP HANA and the RServe to 2 GB of data, but I would like to know the hardware recommendations  for larger data-sets.

    (0) 
    1. Alvaro Tejada Galindo Post author

      Jonathan:

      Well…if you read the SAP HANA R Integration Guide (link on the blog) you will find:

      We recommend that you set the value of maxinbuf to (physical memory size, in bytes) / 2048. For example if you installed R on a host with 256 GB of physical memory you should set maxinbuf to 134217728.

      Greetings,

      Blag.

      (0) 
      1. Former Member

        Thanks. Memory is covered by that statement. From a CPU standpoint, it appears that R is single threaded (per instance of Rserve). Would it be safe to say that you need 1-2 CPU per instance of Rserve? I know that you can run multiple instance of RServe on the same box or other boxes to use more CPU power. However, I am unsure if SAP HANA uses the host:port combo in the cer_rserve_addresses parameter in an active / passive or round robin fashion. Do you know anything about the CPU requirements and / or the best practices for running multiple RServe instances.

        Conversation of R single threading

        http://en.usenet.digipedia.org/thread/14657/226/

        (0) 
        1. Alvaro Tejada Galindo Post author

          Jonathan:

          As promised ๐Ÿ™‚ Here goes the response…

          R itself is normally single threaded. There might be some internal functions which are parallelized and there might be packages which use parallel C code and there are packages with which you can parallelize your R code.

          On the other hand Rserve does not really anything, it only forks an R runtime if there is an incoming request. So multiple requests can be handled in parallel, but each in general single threaded. This means one call to an R procedure on HANA side will consume only one CPU on the R side.

          When you want to exploit multiple cores on the R side you can parallelize on HANA side. This means you have to split your R algorithm into parts. So you build a wrapper SQLScript procedure where you split your data in parts and then you can call the R procedure in parallel for each part from within the procedure.

          Configuring multiple Rserve doesn’t work as we use the Rserve addresses out of cer_rserve_addresses for high availability and do not use them in a round robin manner.

          One note for windows: on windows Rserve do not support in general parallel processing of requests, so there it is true that only one CPU is also with the approach mentioned above. But in general it is not recommended do use Rserve on windows.

          Greetings,

          Blag.

          (0) 
          1. Former Member

            Wow.. Thank you for such a wonderful response. This was exactly what I needed to know.

            In summary, SAP HANA send requests to a configured RServe in active passive mode.However, through my coding in the Stored Proc, I can control the number of simultaneous requests that are sent to the active instance of Rserve.  RServe acts as a broker for the R binaries on the RServer. R can manage X number of requests in parallel. Therefore I should install X number of CPU on the host based on the number of parallel operations I develop in my code.

            Thanks Again.

            (0) 
  2. Former Member

    Hi,

    I have a very fundamental question concerning R and HANA: I have been working with an AWS HANA instance; I was not allowed to install RServe (or R) at this instance. In a productive HANA environment, is it possible to intall R on the same machine as HANA? Or is there some restriction from SAP’s side concerning what may run on that appliance machine?

    As with my AWS instance, HANA and R had to communicate over some TCP connection. From what I seen in my tests is that both HANA and R are increadibly efficient, but the internet connection, especially transfer of “big data” slows computations down enormously (to my understanding, even if it would have been a super-fast ethernet connection, this would not be much different).

    Did I get something wrong here?

    Thank you in advance,

    regards,

    Georg

    (0) 
    1. Alvaro Tejada Galindo Post author

      Georg:

      While R can be installed in the same SAP HANA Machine, we don’t recommend that at all…sharing processes between the two applications is not a good thing…that’s why I have my SAP HANA on one AWS and my R Server in other AWS…

      Rserve which is the package used to communicate SAP HANA and R relies on a TPC connection…but so far, in all my testing, that haven’t been a problem…the processing time is really fast…just every programming language, the performance will depend on how good you define your application..

      Also…if you have an SAP HANA Server (not AWS) then the connection to the R Server will be on the local network…

      Greetings,

      Blag.

      (0) 
      1. Former Member

        Hi Blag,

        thanks a lot for the quick reply!

        I was just wondering how HANA/R scales and performs on truly big data sets. I have done my tests mainly on several dozen MB of data on AWS. However, would that also work for, say, 10-100 GB? My question is, would you recommend HANA over any other DB in this case, when doing analytics with R? My concern here is that the performance gain using an in-memory, row-based DB (rather than a disk-based one) seems small when we have to forward all the data over a TCP connection to R for analysis, which seems like a true bottleneck here. Also, running HANA on a seperate instance than the one where R is located does not seem to fit into the general HANA philosophy of moving application logic to the DB in order to get most out of the in-memory performance advantages and real-time capabilities.

        The second thing I noticed is that PAL seems to be an alternative for common data mining tasks, which I have tried for the same algorithm (i.e. k-means). It worked very nicely. To my understanding, PAL is implemented in C/C++. Is it possible to extend this by adding functions of my own in C/C++, running on the HANA instance, i.e. machine learning algorithms etc.?

        Thank you very much in advance, regards,

        Georg

        PS. my background is rather data mining specific other than SAP / BI/BW, so I may have missed some facts obvious to SAP veterans.

        (0) 
        1. Alvaro Tejada Galindo Post author

          Georg:

          I haven’t test SAP HANA and R with that amount of big data yet…but of course I would recommend SAP HANA over other databases…the speed of SAP HANA and the compression algorithms it uses can balance the fact that you’re passing the information to R via TCP…

          Also…when that huge amount of data kicks in…you can always use additional tools like Hadoop…this blog written by Piers Harding shows how to use R, SAP and Hadoop http://scn.sap.com/community/scripting-languages/blog/2013/03/12/data-conduction–r-sap-and-hadoop I’m planning to write something similar using R, SAP HANA and Hadoop but right now I have some many things in my plate…

          I haven’t used PAL yet…so I don’t know if you can extend it or not… ๐Ÿ™ Sorry about that…

          Greetings,

          Blag.

          (0) 
  3. Anubha Sharma

    Hi Alvaro,

    This is a very informative post.

    I am trying to use an R package(zoo package) from HANA Studio, i have installed the package in R Console(RGUI).My code using the package is working fine in the console.But,the library is not being recognized in HANA Studio.

    Both HANA Studio and RGUI are using the same working directory.

    Regards,

    Anubha

    (0) 

Leave a Reply