I decided to write this blog after facing an issue that it seems it does not have a simple solution in the standard SAP BI frontend tools, so here is the challenge;

Challenge: To be able to make a boxplot chart to obtain the data distribution of sensors with millions of data points per year, you want to chart for each sensor one box per month (Data on HANA).


So lets review the standard SAP frontend suit as September 2016,


– Lumira, comes with a very neat boxplot solution which im using extensively, works like a charm…for less than 10,000 data points

– Design Studio, does not comes with boxplot solution out of the box, there are some SDK code that can be use to make one.


There is a couple of forum threads that highlight the issue quite clearly;

– Boxplots in Lumira How to define measure in HANA Studio for Lumira Box Plot

– Boxplots in Design Studio Lessons learned in creating SDK extensions. (yep he coded the whole thing)


Solution: As the chart was for an offline report to a client i decided it was a good chance to put those R skills to work, R have one of the most advance chart capabilities around.

Here is the step by step, assuming you don’t even have R installed;

(*Note) For this exercise you just need the HANA server and your laptop/desktop, you will use R as a desktop app to analyze and chart the data.

  1. Lets install R -> How to Install R – dummies (sorry for the “dummies” it picked up from the web page title)
  2. Now lets install Rstudio (a nicer frontend for R) -> How to Install and Configure RStudio – dummies (same here)
  3. Now we need two libraries, one for the OBDC connector to HANA and the nicer plot library (in R there are many way to do the same thing, in this case there is a base plot library, we will go for the nicer one as it the one it will be expected in a work environment)

       First open Rstudio and go to Tools -> Install Packages and then install RODBC and plotly

               /wp-content/uploads/2016/09/1_1035270.png

                    2.JPG

     4. Now we create the ODBC connection, please follow the steps in point 1) on this guide (also is a very good read on more details on how HANA+R      connects) -> SAP HANA: My experiences on using SAP HANA with R

     5. Below if the full code (start a new R script as in the image below), then i explain steps by step what each lines is used for.

3.JPG


library(RODBC)
library(plotly)
ch<-odbcConnect("hana_odbc_connection",uid="hana_user",pwd="*****")
sensors<-sqlQuery(ch,"SELECT * FROM \"SCHEMA_NAME\".\"T_MD_METERS\"")
for (meter in sensors$meter){
  q1<-paste0('SELECT * FROM "SCHEMA_NAME"."TABLE/VIEW_SENSOR_DATA" WHERE "Record_Flag" = \'OK\' AND "Sensor" = \'',meter,'\'', sep="")
  res<-sqlQuery(ch,q1)
  names(res)[2]<-'CWF'
  res$year_month<-substr(res$Time,0,7)
  bp<-ggplot(res, aes(x=year_month, y=CWF, fill=year_month)) + geom_boxplot()
  fn<-paste0(meter,'.jpg')
  ggsave(filename=fn, plot=bp)
}

(Note*) Full operational code in R version 3.3.1, volume of data in HANA aprox. 100 million Records for a full year of data for 40+ sensors

(Note*) You will see “<-” and “$” a lot, so it may deserve an explanation by itself, for “<-” in R you dont actually write code to define variables name or type, the action of “definition”&”assign data” happens all at the same type, so for example if i write “a<-1” immediately R interprets as Variable “a” type Vector (simple list of values) type Numeric have in its first position the value “1” (cool right?), this type of assignment works from simple values to chart, img, files, connections to DB, etc. For “$” is simply how you refer an specific column in a matrix using the column name, so for a matrix/data frame for an entity cars you will have the columns: maker, model, type, so if you just want to work or reference only the maker you will go for cars$maker

a. Call to libraries, this will load to your session the needed set of functions for your code to work, in this case we will be using functions to connect to a      ODBC database (HANA) so we call RODBC and the prettier version of a boxplot chart function from “plotly” library


library(RODBC)
library(plotly)

   

b. Now we use the RODBC function to create a connection from R to HANA, for that we use odbcConnect with the key parameters as the ODBC      connection name (as defined on step 4) HANA user name and password, finally we pass this function call to “ch”


ch<-odbcConnect("hana_odbc_connection",uid="hana_user",pwd="*****")

c. Here is when my first actual access to HANA occurs, we are bringing the list of sensors in the database from the table that contains the list of them      all, for this we use sqlQuery function from RODBC package as well, which requires a connection (defined in “ch”) and a query in HANA SQL valid      language, the parsing of the querys can be a bit of a pain, but there is a couple of examples to guide in the code. If all successful the variable “sensors” will      receive a table of X number records and X number of columns in this case (all sensors and its attributes), by default this transform “sensors” into a data      frame (kind of internal memory table).


sensors<-sqlQuery(ch,"SELECT * FROM \"SCHEMA_NAME\".\"T_MD_METERS\"")


d. Then we start a loop operation (there is MANY other ways of doing this loop bit), i used “for” just as familiarity and the limited number of loops. this will      execute for each meter as there are quite a few is not realistic to do it manually hence we automate this part, for just a few records this loop can be omitted


for (meter in sensors$meter){


e. now we elaborate the query to the HANA table using a variable component (the sensor name contained for each loop in “meter”) for it we use “paste0”      which is a concatenate function (part of the base R package so does not require extra libraries to be called) then we pass the query as a string vector to      “q1”


  q1<-paste0('SELECT * FROM "SCHEMA_NAME"."TABLE/VIEW_SENSOR_DATA" WHERE "Record_Flag" = \'OK\' AND "Sensor" = \'',meter,'\'', sep="")


f. We execute the query and pass the results to “res”, the data set will contain at least a time dimension, the sensor ID and the metric we want to plot


res<-sqlQuery(ch,q1)


g. We apply a couple of data manipulation functions (optional and very specific to my scenario, but worth mentioning) the first use the R base function      “names” to change the name of a column, why do i need to do this? because the name of the column in HANA contain a SPACE and it is a bit hard to      reference it later, then i use the function “substr” that is exactly what it looks like it is meant for, from a date “2016-09-01 12:01:01 PM” it will retrieve just      “2016-09” and allow the plot function to aggregate per year-month, note that the column “year_month” does not exist in HANA and it is created on the fly in      this code line.


  names(res)[2]<-'CWF'
  res$year_month<-substr(res$Time,0,7)

h. Finally the plot itself, to fully explain “ggplot” usage i will need a blog for itself, but in short is quite a powerful chart function that can plot most of all the      basic and advance plots types with an extensive customizing capabilities even capable of overlying plot layers (a boxplot + a line chart) see more here      (Quick-R: ggplot2 Graphs) in short we are calling ggplot from the plotly library with the parameters x axis = year_month, y axis = sensor flow metric (CWF)      and plot type = geom_boxplot then the outcome we pass it to “bp”, if you want to see the boxplot on your screen you just execute the line without the      variable assignment (“bp<-“) on the console


  bp<-ggplot(res, aes(x=year_month, y=CWF, fill=year_month)) + geom_boxplot()

     /wp-content/uploads/2016/09/4_1035317.png


i. finally we wrap it up creating a name for the file (one plot per sensor) and saving it to the R root folder, for the file name we simply paste the sensor name      in “meter” and the extension “.jpg”, a path could be added here for other output folders. the we use the function from plotly again ggsave, that requires a plot      and a path, so we pass the plot stored in “bp” and the file name in “fn”, if all successful your root folder will start getting filled with boxplots.


  fn<-paste0(meter,'.jpg')
  ggsave(filename=fn, plot=bp)

/wp-content/uploads/2016/09/5_1035318.png  M-029-BT .jpg

M-023-BS - N .jpg

(Final Note *) Since we are on R i finalize with a real life story on how to extend its use, the plot use on point 5.h raised some question about the validity of the data plotted as ZEROs should not be use for the plots, if you check the plot by itself you will rise the same question as Boxplots use 5% 25% median 75% 95% and outliers representation (math of it here Box-and-Whisker Plots: Five-Number Summary) so to prove yourself right you can use a couple of simple base R functions, see below the use of “min”, “median” and “quantiles” quite prove that no zero’s were included, sure you can simply search for zero’s on the data but this also prove the chart right

6.JPG

To report this post you need to login first.

3 Comments

You must be Logged on to comment or reply to a post.

  1. Ted Kwon

    Jorge,

    What a wonderful and detailed document you have completed !

    . challenge – solution – final note storyline

    . R code description is the second to none, so fresh

    SAP BO Predictive Analytics supports R integration.

    You can find excellent blog on ‘How to guide for R integration’ by Andreas;

    Extending SAP Predictive Analytics Functionality

    It is similarly acting as RStudio behaviour, overpassing the native SAP functionality.

    We might think of two data analytics scenarios as follow;

    – Offline mode (dataset resides in local and R installed in local machine)

    – Online mode (dataset resides in HANA database and R/R Serve installed in another SUSE host)

    However, SAP BO Predictive Analytics does not quite well fitted with hybrid scenario, as used in your scenario;

    – use R as a local desktop app to analyze

    – dataset on HANA server

    In this regard, your scenario could provide competitive solution over SAP BO Predictive Analytics, targeting a client with economic SAP budget.

    Thank you.

    (0) 
    1. Jorge Lizama Post author

      Hi Ted,

      Thanks for your words, Your final statement is actually quite true, just bare in mind that requires quite a bit of effort to do things in R that PA and the APL/PAL library brings out of the box, also comes without saying the advance Lumira-like user interface of PA is miles away of the code-focus R for final users.

      Myself i tend to go to Rstudio to do stuff over the data that i cannot do on classic SAP BI tools, like ANOVA or regression analysis over HANA data, you can some of this in PA, but im a nerd and love the flexibility of the coding in R 🙂

      Regards

      (0) 
  2. Lars Breddemann

    Good write up of your experience! I enjoyed the read.

    While progressing through your steps, there were a couple of things that came to my mind and that maybe useful to you or others:

    • whenever using ODBC, don’t put in logon data into the code. Instead use the hdbuserstore feature. See e.g. HANA quick note – checking my connections and using them securely …
    • you select data via SELECT * instead of specifically only requesting the required columns. This will increase the execution time and the amount of data transferred between HANA and R.
    • Instead of pulling all data to R in a loop and calculate the 5 point stats for the box plot in R, why not calculate the 5 point stat for all data sets in HANA and only pull the results?

    Thanks again for sharing this piece.I think it well demonstrates how just by trying out a little bit of the available technology in a creative way can quickly deliver real solutions.

    Well done!

    (0) 

Leave a Reply