
In this blog series I will share my experience working with HANA and R. In this entry I describe two ways of integrating HANA and R: “Inside-Out” and “Outside-In”. When you connect RStudio on your client with HANA, SAP calls it “Outside-In”. I will present use cases for this scenario and give you a quick impression of how people work with R on the client, which is the typical work of a data analyst. The limitations of R on the client (e.g. limited memory) can be mitigated by performing calculations in HANA. I close this blog entry by suggesting promising use cases for the “Inside-Out” scenario.

Working with R on the Client

For me as a mathematician, R is just one of my tools. Usually I extract data, do a lot of transformation between different file formats (mostly in Python) and then use different open source libraries to perform calculations. When I work with R on my local machine I usually use directories and workspaces. The workspace can be saved to the file system and represents the project I am working on, and the directories contain the different data sets I am working with. They also contain the scripts I use when I don’t want to type in commands again and again or when I want to automate certain things. My workspace also contains the variables, e.g. the results of my calculations. When I reload my workspace the variables are still there and I can continue my work effortlessly.

In summary, RStudio combines a set of tools: a console for typing in commands, a graphics window, a data editor and a data frame editor (which looks like a spreadsheet when you see it for the first time). The visualization is really cool: there are many resources on the web that show the possibilities, such as box plots, heat maps, 3D plots and correlograms: https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/ .

So the work of a data analyst starts with visualization. We can also use methods from descriptive statistics to understand a data set in a better way. Mostly we calculate measures of central tendency and measures of dispersion.
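A minimal sketch of such descriptive measures in base R. The sample values are invented purely for illustration and do not come from any real data set:

```r
# Hypothetical sample: number of sick days per employee in one year.
sick_days <- c(0, 2, 3, 3, 5, 7, 8, 12, 15, 30)

# Measures of central tendency
mean_days   <- mean(sick_days)
median_days <- median(sick_days)

# Measures of dispersion
sd_days    <- sd(sick_days)     # sample standard deviation
iqr_days   <- IQR(sick_days)    # interquartile range
range_days <- range(sick_days)  # minimum and maximum

# Compact overview: minimum, quartiles, mean and maximum
print(summary(sick_days))
```

The single large value pulls the mean well above the median, which is exactly the kind of effect you want to notice before drawing any conclusions.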

Another application is statistical inference. Here we assume that the data comes from a larger population and want to establish a statistical proposition about that population. One form is point estimation, like the calculation of the Value at Risk (https://en.wikipedia.org/wiki/Value_at_risk), which the Chief Risk Officer of a bank has to calculate and communicate. Another form is interval estimation, for example for “Fehlbelegung in Krankenhäusern” (see http://www.mdk.de/1321.htm in German), i.e. allocation errors in hospitals.
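To make the two kinds of estimation concrete, here is a small sketch in base R. It rests on assumptions of mine: the returns are simulated, the Value at Risk is computed by simple historical simulation (one of several common methods), and the audit numbers are invented:

```r
set.seed(42)  # make the simulated data reproducible

# Point estimation: Value at Risk by historical simulation.
# Hypothetical daily portfolio returns; real data would come from a file or a database.
returns <- rnorm(1000, mean = 0.0005, sd = 0.01)
# 95% one-day VaR: the loss that is exceeded on only 5% of the days,
# reported as a positive number.
var_95 <- -quantile(returns, probs = 0.05, names = FALSE)

# Interval estimation: confidence interval for a proportion.
# Hypothetical audit: 37 of 400 sampled hospital cases were allocation errors.
ci <- binom.test(x = 37, n = 400, conf.level = 0.95)$conf.int

print(var_95)
print(ci)
```

The first result is a single number, the second is a range that covers the unknown proportion with the chosen confidence level.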

Another application is the statistical testing of a hypothesis, called the null hypothesis. A statistical test produces a p-value: the probability of obtaining a result at least as extreme as the observed one, assuming the null hypothesis is true. If the p-value is below a chosen level of significance, you reject the null hypothesis. Otherwise you fail to reject it, which does not prove that it is true.

Let me give a very simple example. In statistics we call a characteristic with discrete values a “categorical variable”; think of the “day of the week” (Monday, Tuesday etc.). Let me continue with this example: given the number of sick leaves for each day, there is the question whether they are “equally distributed”. All data sets I know speak a different language and say, for example, that most sick leaves are on Monday (see http://de.statista.com/statistik/daten/studie/251314/umfrage/verteilung-von-krankmeldungen-arbeitsunfaehigkeitsfaelle-auf-wochentage/ e.g.). The hypothesis that the sick leaves are equally distributed can be rejected using Pearson’s chi-squared test, which is a one-liner in R: chisq.test(table(weekday)), where table(weekday) gives the number of sick leaves for every day. But then questions occur: did you really prove that there exists something like Saint Monday (see https://en.wikipedia.org/wiki/Saint_Monday)? Or do you consider it reasonable that people who become ill at the weekend are more likely to go to the doctor and report sick on Monday or Tuesday? So you could set up another hypothesis: what happens if we split the week (e.g. starting with Sunday) and ask whether the sick leaves are equally distributed between the first half and the second half of the week?
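A minimal sketch of the test on already counted data. The counts are hypothetical and not taken from the statistics cited above:

```r
# Hypothetical counts of sick leaves per weekday.
sick_leaves <- c(Monday = 350, Tuesday = 210, Wednesday = 180,
                 Thursday = 160, Friday = 100)

# Goodness-of-fit test. Null hypothesis: every weekday is equally likely
# (chisq.test uses equal probabilities when no p argument is given).
test <- chisq.test(sick_leaves)
print(test)

# Reject the null hypothesis at the 5% level if the p-value is smaller than 0.05.
reject_equal_distribution <- test$p.value < 0.05
```

With raw observations (one weekday per sick leave) instead of counts, table(weekday) produces the counts first, which is why the one-liner wraps it around the variable.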

I won’t work through the details here, but you can learn many things from this simple example, which is in fact a beginner’s exercise for undergraduate students and occurs in many books. I showed this example to illustrate the following points:

  • Even the question of what “significance” means should be considered carefully.
  • Don’t draw conclusions too fast.
  • Data doesn’t speak for itself. You have to set up new hypotheses and test them.

The last bullet point is very important. Therefore you often have to transform the data set and introduce other categorical variables like “firstHalfOfWeek” and “SecondHalfOfWeek”. R provides powerful tools to modify data frames. But what happens when the data is too big to fit into a data frame? This is what HANA is good for: just build a Calculation View that calculates everything you need, perhaps with aggregated data. And this is why the connection between HANA and R is promising.
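Such a transformation could look like the following sketch. The data are invented, and the decision to let the first half of the week run from Sunday to Tuesday is my assumption for the illustration:

```r
# Hypothetical raw data: one row per sick leave with the weekday it started on.
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
leaves <- data.frame(
  weekday = rep(days, times = c(40, 350, 210, 180, 160, 100, 20))
)

# Introduce the new categorical variable (assumed split: Sunday to Tuesday
# is the first half, Wednesday to Saturday the second half).
first_half <- c("Sunday", "Monday", "Tuesday")
leaves$weekHalf <- ifelse(leaves$weekday %in% first_half,
                          "firstHalfOfWeek", "SecondHalfOfWeek")

# Count the sick leaves per half, in a fixed order.
half_counts <- table(leaves$weekHalf)[c("firstHalfOfWeek", "SecondHalfOfWeek")]

# The halves cover 3 and 4 days, so under the null hypothesis the sick leaves
# are split 3/7 to 4/7 rather than 50/50.
test <- chisq.test(half_counts, p = c(3 / 7, 4 / 7))
print(test)
```

The same derived variable could just as well be computed in the database, so that only the two aggregated counts ever reach R.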

And last but not least: of course R offers many more tests for categorical variables, metric variables, stochastic processes, time series and much more.

Going the Client Way: Accessing HANA from RStudio

I learned about this scenario for the first time in 2015 in Blag’s blog: http://scn.sap.com/community/hana-in-memory/blog/2012/01/26/hana-meets-r . In fact it was also taught in SAP TechEd course DMM255. Here I am stealing one of the slides, which shows this perfectly. You can work with RStudio on your client PC to select your data and work on it with R.

outsidein.PNG

Another important message: IMHO SAP TechEd is the most valuable source of information for all SAP experts who want to keep pace with the progress of SAP technology. If you want to know how it works, just look at these blogs on SCN: http://scn.sap.com/community/hana-in-memory/blog/2012/02/21/sap-hana-my-experiences-on-using-sap-hana-with-r and http://scn.sap.com/community/hana-in-memory/blog/2012/01/29/r-meets-hana.
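For a rough impression, an “Outside-In” access from the client could be sketched as below. This is a sketch under assumptions, not a tested setup: it presumes the RODBC package, an ODBC data source named "hana" whose credentials are configured outside the script, and a table SICK_LEAVES with a column WEEKDAY, all of which are hypothetical names:

```r
# Fetch aggregated data from HANA into a data frame on the client.
fetch_sick_leaves <- function(dsn = "hana") {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("Package 'RODBC' is required for this sketch.")
  }
  connection <- RODBC::odbcConnect(dsn)
  # odbcConnect() returns -1 instead of a connection object on failure.
  if (!inherits(connection, "RODBC")) {
    stop("Could not connect to data source '", dsn, "'.")
  }
  on.exit(RODBC::odbcClose(connection))

  # Let HANA do the aggregation so that only one row per weekday
  # is transferred to the client (hypothetical table and column names).
  RODBC::sqlQuery(
    connection,
    "SELECT WEEKDAY, COUNT(*) AS LEAVES FROM SICK_LEAVES GROUP BY WEEKDAY"
  )
}
```

A call like fetch_sick_leaves() would then return a small data frame that can be analysed with the usual R tools, while the large table stays in the database.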

The Inside-Out-Scenario: Using R Operators in HANA

When I work with HANA and R the situation is a little bit different. SAP recommends installing R on a separate server and establishing a connection between HANA and R so that you can call R procedures from HANA scripts (see http://help.sap.com/hana/SAP_HANA_R_Integration_Guide_en.pdf ). This is called “Inside-Out” since HANA calls R procedures and adds the power of R to SQLScript. I will come back to this scenario in later blogs and discuss it in detail. Now I am stealing a picture from the SAP documentation which shows how it works.

insideout.PNG

By the way: I think SAP documentation is getting better and better since more architecture diagrams like the one above are created.

There is one thing you should definitely keep in mind: the R logic is stored within R operators in the HANA system and transferred together with the database data to the R server. The calculations are performed on the R server and the results are given back to the HANA system. This has some consequences:

  • there is no R logic persisted in the R Server
  • this scenario creates network traffic
  • the calculation on the R server will slow down the HANA queries
  • the most important data type in R you will use in analytical tasks is data.frame (and maybe data.table) – data types for columnar data
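To illustrate the last point, the logic inside an R operator typically boils down to “data frame in, data frame out”. The following is only a stand-alone sketch in plain R with invented data. Inside HANA the input would be a table parameter of the procedure, and the result would be assigned to its output parameter:

```r
# Stand-in for an input table (hypothetical columns and values).
input <- data.frame(
  weekday     = c("Monday", "Monday", "Tuesday", "Friday"),
  days_absent = c(1, 3, 2, 5)
)

# The kind of logic an R operator might contain: aggregate the rows and
# return a data frame whose columns match the declared output table type.
result <- aggregate(days_absent ~ weekday, data = input, FUN = sum)
print(result)
```

Because both sides are columnar, the mapping between a HANA table and an R data frame is quite natural.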

Those aspects will be discussed in later blogs. In this blog entry I will introduce another scenario, which is the one I have spent most of my time with so far. But before doing so I would like to discuss the following question: is it reasonable to work directly with the R server in the above-mentioned inside-out scenario? I don’t think so, and I explain my reasons below.

Not Recommended: Intrusion into the R Server

Suppose you have installed a server for the above-mentioned “inside-out” scenario. In theory (and so far I have no reason to recommend it) you could work directly with this R server to create your R applications: just log on to the R server and use packages like RODBC to perform ETL, execute DDL and run other commands. This could be done by establishing an ODBC connection:

library("RODBC")
connection <- odbcConnect("hana", uid = "system", pwd = "secret")

But again, I don’t recommend it for many reasons:

  • security
  • stability
  • complexity, especially if you have dedicated application logic on R servers in file systems and workspaces
  • integration problems: if those artifacts should be accessible from HANA in inside-out calls as well, then this is a deep intrusion into the system and you would have to use undocumented features.

I have to admit that I have no experience with this scenario. I consider it possible to work this way, but it doesn’t seem reasonable to me and only experts could work that way. Moreover, sooner or later you will need additional R packages like RHANA, which to my knowledge is not officially released for SAP customers. So unless you have solutions for the above-mentioned problems (which I don’t have), I can’t recommend it.

Summary

At the moment the “Outside-In” scenario works fine for me. I work as described at the beginning of the blog, but instead of working with flat files containing the data I can access HANA directly. The only thing I have to take care of is that the data size transferred to RStudio on my PC can be huge. There are special R libraries that can deal with massive data, but this is a topic for another blog. In fact this is not rocket science since R is very open and can import data from many sources and there

In this blog entry I described the two integration scenarios of HANA and R. The “Outside-In” approach is no rocket science: fetching data from databases is a best practice. The “Inside-Out” approach is far more interesting. I consider as promising

But those are topics for other blogs.


8 Comments


  1. Lars Breddemann

    Hi Tobias

    thanks for sharing your experiences with SAP HANA and R.

    Few remarks from my side:

    • putting user name and password into ODBC/HDBSQL connection strings for SAP HANA should be considered a bug. ODBC can make direct use of the hdbuserstore. Shown here (HANA quick note – checking my connections and using them securely …)
    • Working with R comes in many flavours. The explorative, model building approach you described being one of them.
      Other scenarios often involve having a data “wrangling” pipeline that feeds into a set of pre-made R scripts and results in output either in the form of a stored dataframe/csv file or some DB table contents.
      In my eyes, HANA and R play to their strengths when improving and integrating the data pipeline and having R only do the absolutely necessary model-magic.
    • Performance of the R scripts is obviously of concern, and a major part for that is parallel processing.
      Check Parallelization options with the SAP HANA and R-Integration for a good treatment of this topic.

    Looking forward to the next parts of your blog post series. Would be interested to learn about your usage scenarios, data volumes and results (what setup used? what runtimes settled for? what landscape management installed?).

    Cheers,

    Lars

  2. Lars Breddemann

    Dug deep in my folders and found my old “TESTR” procedure that I find quite practical when setting up RServe with SAP HANA.

    Maybe this is useful for others as well:

    create type RSERVEINFO AS TABLE (
      "R.version.platform"       NVARCHAR (256),
      "R.version.arch"           NVARCHAR (256),
      "R.version.os"             NVARCHAR (256),
      "R.version.system"         NVARCHAR (256),
      "R.version.status"         NVARCHAR (256),
      "R.version.major"          NVARCHAR (256),
      "R.version.minor"          NVARCHAR (256),
      "R.version.year"           NVARCHAR (256),
      "R.version.month"          NVARCHAR (256),
      "R.version.day"            NVARCHAR (256),
      "R.version.svn.rev"        NVARCHAR (256),
      "R.version.language"       NVARCHAR (256),
      "R.version.version.string" NVARCHAR (256),
      "R.version.nickname"       NVARCHAR (256));

    drop procedure testR;

    create procedure testR (OUT rinfo RSERVEINFO)
    LANGUAGE R
    READS SQL DATA AS
    BEGIN
      si <- sessionInfo()
      si_df <- as.data.frame (si[1])
      rinfo <- si_df
    END;

    call testR(?);

    R.version.platform        x86_64-suse-linux-gnu
    R.version.arch            x86_64
    R.version.os              linux-gnu
    R.version.system          x86_64, linux-gnu
    R.version.status          (empty)
    R.version.major           3
    R.version.minor           3.1
    R.version.year            2016
    R.version.month           06
    R.version.day             21
    R.version.svn.rev         70800
    R.version.language        R
    R.version.version.string  R version 3.3.1 (2016-06-21)
    R.version.nickname        Bug in Your Hair
    1. Tobias Trapp Post author

      This is really useful since administrators can use this for smoke tests.
      Another useful statement is returning the installed R libraries.

      Best Regards,

      Tobias

  3. Klemen Kadak

    Hi Tobias,

    Do you have any experience with the “Inside-Out” approach on massive amounts of data (1.5B rows in a couple of tables)? That’s over 450GB in memory. How do you think R would handle it?

    1. Tobias Trapp Post author

      Unfortunately not. At the moment I am doing experiments with time series in an “Inside-Out” way, but only with some 10,000 data points. This is reasonable since you apply techniques like decomposition and smoothing. I think I will blog about my experience in the Business Rules space here.

      You seem to deal with a high volume Data Mining Scenario – perhaps PAL is better here. But this is only a guess.

      Best Regards,

      Tobias

    2. Jonathan Sidhu

      Klemen,

       

      I am very interested in your use case as I have a similar one.  Have you made much progress with the “Inside-Out” approach?  Can you suggest any resources for getting started with these methods?

       

      Thank you!

       

      Jonathan

