In this blog series I will share my experience of working with HANA and R. In this entry I describe the two ways of integrating HANA and R: “Inside-Out” and “Outside-In”. When you connect RStudio with HANA, SAP calls it “Outside-In”. I will describe use cases for this scenario and give you a quick impression of how people work with R on the client, which is the typical work of a data analyst. The limitations of R on the client (for example limited memory) can be mitigated by performing calculations in HANA. I close this blog entry by suggesting promising use cases for the “Inside-Out” scenario.
Working with R on the Client
For me as a mathematician, R is just one of my tools. Usually I extract data, do a lot of transformation between different file formats (mostly in Python) and then use different open source libraries to perform calculations. When I work with R on my local machine I usually use directories and workspaces. The workspace can be saved to the file system and represents the project I am working on, and the directories contain the different data sets I am working with. They also contain the scripts I use when I don’t want to type in commands again and again or when I want to automate certain things. My workspace also contains the variables, for example the results of my calculations. When I reload my workspace the variables are still there and I can continue my work effortlessly.
In summary, RStudio contains a set of tools: a console for typing in commands, a graphics window, a data editor and a data frame editor (which looks like a spreadsheet when you see it for the first time). The visualization is really cool. There are many resources on the web that show the possibilities like box plots, heat maps, 3D plots and correlograms: https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/ .
So the work of a data analyst starts with visualization. We can also use methods from descriptive statistics to understand a data set in a better way. Mostly we calculate measures of central tendency and measures of dispersion.
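As a minimal sketch of what this first look at a data set can be, the following R lines calculate some measures of central tendency and dispersion and draw a box plot. The vector sick_days and its values are made up for illustration only.

```r
# Hypothetical sample: number of sick days per employee (made-up values).
sick_days <- c(0, 2, 3, 3, 5, 8, 1, 0, 4, 14)

# Measures of central tendency.
center <- c(mean = mean(sick_days), median = median(sick_days))

# Measures of dispersion.
spread <- c(
  sd    = sd(sick_days),
  iqr   = IQR(sick_days),
  range = diff(range(sick_days))
)

print(center)
print(spread)

# A box plot shows location and spread at a glance.
boxplot(sick_days, horizontal = TRUE,
        main = "Sick days per employee (hypothetical)")
```

Mean and median differ here because a single large value pulls the mean upwards, which is exactly the kind of thing you want to see before running any test.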
Another application is statistical inference. Here we assume that the data comes from a larger population and we want to draw conclusions about that population. One kind of inference is point estimation, like the calculation of the Value at Risk (https://en.wikipedia.org/wiki/Value_at_risk), which the Chief Risk Officer of a bank has to calculate and communicate. Another kind is interval estimation, as used for “Fehlbelegung in Krankenhäusern” (see http://www.mdk.de/1321.htm in German), that is, allocation errors in hospitals.
Another application is the statistical testing of a hypothesis, called the null hypothesis. A statistical test produces a p-value: the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. If the p-value is below the chosen level of significance you reject the null hypothesis; otherwise you fail to reject it, which does not prove that it is true.
Let me give a very simple example. In statistics we call a characteristic with discrete values a “categorical variable”. Think of “day of the week” with values like Monday, Tuesday etc. Let me continue with this example: given the number of sick leaves for each day, the question is whether they are “equally distributed”. All data sets I know speak a different language and say that most sick leaves are on Monday, for example (see http://de.statista.com/statistik/daten/studie/251314/umfrage/verteilung-von-krankmeldungen-arbeitsunfaehigkeitsfaelle-auf-wochentage/ for example). The hypothesis that the sick leaves are equally distributed can be rejected using Pearson’s chi-squared test, which is a one-liner in R: chisq.test(table(weekday)), where weekday holds the day of the week of every sick leave. But then questions occur: did you really prove that there exists something like Saint Monday (see https://en.wikipedia.org/wiki/Saint_Monday)? Or do you consider it reasonable that people become ill at the weekend, tell their employer afterwards and are therefore more likely to go to the doctor on Monday or Tuesday? So you could set up another hypothesis: what happens if we split the week (for example starting with Sunday) and ask whether the sick leaves are equally distributed between the first half and the second half of the week?
I don’t want to show how it works in detail, but you can learn many things from this simple example, which is in fact a beginner’s exercise for undergraduate students and occurs in many books. I showed this example to illustrate the following points:
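To make the difference between the two kinds of estimation concrete, here is a small sketch in R. It uses simulated returns, not real market data, and estimates the Value at Risk by historical simulation (an empirical quantile), which is only one of several common methods. The variable names are made up for this example.

```r
set.seed(42)  # reproducible simulated data, not real market data

# Hypothetical daily portfolio returns (simulated, normally distributed).
returns <- rnorm(500, mean = 0.0005, sd = 0.01)

# Point estimate: 95% one-day Value at Risk by historical simulation,
# i.e. the loss that is exceeded on only 5% of the observed days.
var_95 <- -quantile(returns, probs = 0.05, names = FALSE)

# Interval estimate: 95% confidence interval for the mean daily return.
ci_mean <- t.test(returns, conf.level = 0.95)$conf.int

print(var_95)
print(ci_mean)
```

The point estimate is a single number, while the interval estimate gives a range together with a confidence level.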
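The test can be sketched as follows. The counts are hypothetical and only cover the working days; they are not taken from the statistics linked above. When chisq.test() gets a plain vector of counts it performs a goodness-of-fit test against equal probabilities.

```r
# Hypothetical counts of sick leaves per working day (made-up numbers).
sick_leaves <- c(Mon = 260, Tue = 200, Wed = 180, Thu = 180, Fri = 180)

# Pearson's chi-squared goodness-of-fit test.
# Null hypothesis: every day has the same probability (1/5 each),
# which is the default when chisq.test() gets a vector of counts.
result <- chisq.test(sick_leaves)
print(result)

# Reject the null hypothesis at the 5% level if the p-value is below 0.05.
reject_uniform <- result$p.value < 0.05
print(reject_uniform)
```

If you have one entry per sick leave instead of counts per day, table() produces the counts first, as in the one-liner above.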
- Even the question of what “significance” means should be considered carefully.
- Don’t draw conclusions too fast.
- Data doesn’t speak for itself. You have to set up new hypotheses and test them.
The last bullet point is very important. Often you therefore have to transform the data set and introduce other categorical variables like “firstHalfOfWeek” and “SecondHalfOfWeek”. R provides powerful tools to modify data frames. But what happens when the data is too big to fit into a data frame? This is what HANA is good for: just build a Calculation View that calculates everything you need, perhaps with aggregated data. And this is why the connection between HANA and R is promising.
And last but not least: of course R offers even more tests for many kinds of categorical variables, metric variables, stochastic processes, time series and much more.
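A minimal sketch of such a transformation in R could look like this. The data frame and its values are hypothetical, and the split of the week (Sunday to Wednesday versus Thursday to Saturday) is just one assumption for this example.

```r
# Hypothetical data: one row per sick leave with the weekday it started on.
leaves <- data.frame(
  weekday = c("Mon", "Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Sat", "Sun", "Tue"),
  stringsAsFactors = FALSE
)

# Assumption for this sketch: the week starts on Sunday, so the first half
# is Sunday to Wednesday and the second half is Thursday to Saturday.
first_half <- c("Sun", "Mon", "Tue", "Wed")

# Derive the new categorical variable from the existing one.
leaves$half <- factor(
  ifelse(leaves$weekday %in% first_half, "firstHalfOfWeek", "SecondHalfOfWeek"),
  levels = c("firstHalfOfWeek", "SecondHalfOfWeek")
)

# Counts per category, ready for a new test.
half_counts <- table(leaves$half)
print(half_counts)
```

Note that the two halves contain four and three days, so under the hypothesis of equally distributed sick leaves the expected proportions are 4/7 and 3/7, not 1/2 each. With a realistic number of observations this could be tested with chisq.test(half_counts, p = c(4/7, 3/7)).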
Going the Client Way: Accessing HANA from RStudio
I learned about this scenario for the first time in 2015 in Blag’s blog: http://scn.sap.com/community/hana-in-memory/blog/2012/01/26/hana-meets-r In fact it was also taught in SAP TechEd course DMM255. Here I am stealing one of the slides which shows this perfectly. You can use RStudio on your client PC to select your data and work on it with R.
Another important message: IMHO SAP TechEd is the most valuable source of information for all SAP experts who want to keep pace with the progress of SAP technology. If you want to know how it works, then just look at these blogs on SCN: http://scn.sap.com/community/hana-in-memory/blog/2012/02/21/sap-hana-my-experiences-on-using-sap-hana-with-r and http://scn.sap.com/community/hana-in-memory/blog/2012/01/29/r-meets-hana.
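To give a rough idea of how this looks on the client, here is a sketch using the RODBC package. It assumes that an ODBC data source named "hana" is configured on the client; the schema, table, column and user names are hypothetical, and the helper functions build_query and fetch_from_hana are made up for this example. The query aggregates in HANA so that only one row per weekday is transferred to R.

```r
# Build an aggregating query so that HANA does the heavy lifting and
# only one row per weekday is transferred to the client.
# Schema, table and column names are hypothetical.
build_query <- function(schema, table) {
  sprintf(
    'SELECT "DAY_OF_WEEK", COUNT(*) AS "LEAVES" FROM "%s"."%s" GROUP BY "DAY_OF_WEEK"',
    schema, table
  )
}

# Fetch the result into a data frame via ODBC.
# "hana" is assumed to be an ODBC data source name configured on the client.
fetch_from_hana <- function(query, dsn = "hana", uid, pwd) {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("Package 'RODBC' is required for this sketch.")
  }
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  if (!inherits(channel, "RODBC")) {
    stop("Could not connect to the data source.")
  }
  # Always close the connection, even if the query fails.
  on.exit(RODBC::odbcClose(channel), add = TRUE)
  RODBC::sqlQuery(channel, query)
}

query <- build_query("MYSCHEMA", "SICK_LEAVES")
print(query)

# Not run here, because it needs a reachable HANA system:
# leaves <- fetch_from_hana(query, uid = "my_user", pwd = "my_password")
# chisq.test(leaves$LEAVES)
```

Separating the query text from the connection handling makes it easy to inspect what will be sent to HANA before any data is moved.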
The Inside-Out-Scenario: Using R Operators in HANA
When I work with HANA and R the situation is a little bit different. SAP recommends installing R on a different server and establishing a connection between HANA and R so that you can call R procedures from HANA scripts (see http://help.sap.com/hana/SAP_HANA_R_Integration_Guide_en.pdf ). This is called “Inside-Out” since HANA calls R procedures and adds the power of R to SQLScript. I will come back to this scenario in later blogs and discuss it in detail. Now I am stealing a picture from the SAP documentation which shows how it works.
By the way: I think SAP documentation is getting better and better since more architecture diagrams like the one above are created.
There is one thing you should definitely keep in mind: the R logic is stored within R operators in the HANA system and transferred together with the database data to the R server. The calculations are performed on the R server and the result is given back to the HANA system. This has some consequences:
- there is no R logic persisted in the R Server
- this scenario creates network traffic
- the calculation on the R server will slow down the HANA queries
- the most important data type in R you will use in analytical tasks is data.frame (and maybe data.table) – data types for columnar data
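To illustrate the last point: seen from R, the logic of such a call behaves like a function that receives a data frame and returns a data frame. The following sketch is plain R that can be tried on the client. The function name and the columns are hypothetical, and the wrapping into an R operator in HANA is not shown here.

```r
# Hypothetical R logic: takes a data.frame (as HANA would pass a table)
# and returns a data.frame (as HANA would receive a result table).
summarise_leaves <- function(input) {
  counts <- aggregate(LEAVES ~ DAY_OF_WEEK, data = input, FUN = sum)
  counts$SHARE <- counts$LEAVES / sum(counts$LEAVES)
  counts
}

# Local stand-in for the table that HANA would transfer.
input <- data.frame(
  DAY_OF_WEEK = c("Mon", "Tue", "Mon", "Wed"),
  LEAVES = c(3, 2, 1, 2),
  stringsAsFactors = FALSE
)

result <- summarise_leaves(input)
print(result)
```

Keeping the logic in a function of this shape means it can be developed and tested with local data before it is moved into the database.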
Those aspects will be discussed in later blogs. In this blog entry I will introduce another scenario, which is the one I have been working with most of the time so far. But before doing so I would like to discuss the following question: is it reasonable to work directly on the R server in the above mentioned inside-out scenario? I don’t think so, and I explain my reasons in the following.
Not Recommended: Intrusion into the R Server
Suppose you installed a server for the above mentioned “inside-out scenario”. In theory (and so far I have no reason to recommend it) you could work directly on this R server to create your R applications: just log on to the R server and use packages like RODBC to perform ETL, execute DDL and run other commands. This could be done by establishing an ODBC connection:
library(RODBC)
connection <- odbcConnect("hana", uid="system", pwd="secret")
But again, I don’t recommend it for many reasons:
- complexity, especially if you have dedicated application logic on R servers in file systems and workspaces
- integration problems: if those artifacts should be accessible from HANA in inside-out calls as well, then this is a deep intrusion into the system and you would have to use undocumented features.
I have to admit that I have no experience with this scenario. I consider it possible to work this way, but it does not seem reasonable to me, and only experts could work that way. Moreover sooner or later you will need additional R packages like RHANA, which to my knowledge is not officially released for SAP customers. So unless you have solutions for the above mentioned problems (which I don’t have) I can’t recommend it.
At the moment the “Outside-In” scenario works fine for me. I work as described at the beginning of the blog, but instead of working with flat files containing the data I can access HANA directly. The only thing I have to take care of is that the amount of data transferred to RStudio on my PC can be huge. There are special R libraries that can deal with massive data, but this is a topic for another blog. In fact this is not rocket science since R is very open and can import data from many sources and there
In this blog entry I described the two integration scenarios of HANA and R. The “Outside-In” approach is no rocket science: fetching data from databases is a best practice. The “Inside-Out” approach is far more interesting. I consider the following topics promising:
- visualization of Calculation Views using R operators in Fiori Apps using APF: http://scn.sap.com/community/developer-center/front-end/blog/2015/04/01/analysis-path-framework-a-hidden-gem-for-creating-awesome-analytical-apps
- best practices to work with massive data in R
- analytical business rules that are called from SAP Decision Service Management
- using R for advanced calculations like optimization: https://cran.r-project.org/web/views/Optimization.html
But those are topics for other blogs.