Data Conduction – R, SAP, and Hadoop
This blog post is intended as a primer, making the case for integrating Hadoop and SAP with R. You may just want to skip to the example at the bottom – I promise not to be offended.
Data Conduction
… is, I think, a reasonable description of what we do all the time with data – we pull it in from multiple sources, in multiple formats, and then manipulate it to make sense of it, or in the hope that this process of orchestration reveals secrets about the places it comes from.
Hadoop epitomises this, where the orchestration spans as many nodes as you have in your cluster, and the data is shunted through various pipelines of processing, changing format, sorting, summarising, and eventually coming out the other end in the desired marshalled state.
But Hadoop isn’t the management layer, or even a particularly good interface to data for an analyst to work with – this is where R comes in, with its natural expressiveness for interpreting the data that we wield.
In this blog, I want to explore using SAP as a primary data authority, and Hadoop as another tool to add to the list that can be seamlessly integrated with R, giving us more options when we tackle our large scale data analytics problems.
Hadoop
In the last year or so, Hadoop got much easier to use from scripting languages when a relatively small change was added to Hadoop streaming – something called Typed Bytes (first proposed in 2009, but only included in mainstream releases relatively recently), which enables the flexible definition of data passed in and out of streaming map or reduce jobs. Suddenly scripting language interfaces could format data for Hadoop in ways that suited their internal structures – game on. This has led to a number of frameworks for Hadoop, most notably Dumbo for Python and RHadoop (Antonio Piccolboni of Revolution Analytics) for R.
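To give a flavour of the programming model before we get to SAP, here is a minimal rmr2 sketch (assuming rmr2 is installed; using the local backend so no cluster is needed) that pushes a small vector into the DFS, squares it in a map job, and reads the result back as native R key/value structures:
library(rmr2)
rmr.options(backend = "local")        # run locally - no cluster required for a quick test
small.ints <- to.dfs(1:10)            # write a small vector to the (local) DFS
res <- from.dfs(                      # read the job output back into R
  mapreduce(
    input = small.ints,
    map   = function(k, v) keyval(v, v^2)   # key = the number, value = its square
  )
)
res$key                               # 1 2 3 ... 10
res$val                               # 1 4 9 ... 100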
Hadoop takes a bit of setting up – to take advantage of the new features you need to ensure that you have version >= 1.0.2. There is a good simple tutorial for getting it going here, with the standard installation instructions here.
One thing about Hadoop that is not widely discussed is that it's not only good at scaling up to manage vast quantities of computing resources for parallel processing – it is actually good at handling highly constrained resources too. The typical trade-off is time, as the job is broken down into smaller pieces and handled serially – this is never more apparent than when testing examples such as the following on a single node cluster like my laptop!
R, SAP, and Hadoop
R is a natural interface for shaping and interpreting data, as it inherently expects data to come from multiple sources – especially with RStudio as a workbench (covered here previously). I find that it allows you to approach a problem incrementally, backtrack, and put the problem aside for a while (without losing the thread) when necessary, through features such as session and history saving. All of this is highly conducive to the craft (art?) of data analytics, where most of the time it is hard to know in advance what paths will be taken to the answer, or indeed whether the answer is anything like what you thought it might be.
These unique data exploratory characteristics of R are one of open source's best kept secrets.
SAP is the core/central data authority in the business. Other data sources exist inside and outside the enterprise, but SAP (in most cases) is the single source of truth for the key data describing the business. In many cases these data sets are not huge, but they are critical – e.g. business partner lists and account balances.
Data sources that lie outside of SAP are often difficult to access in a real-time fashion, and may come in many and varied formats with differing access mechanisms – which is where R and Hadoop can help.
In the following example, I want to demonstrate how Hadoop integration works with R, and give a flavour of how it can be used to solve problems that need to span SAP and other data sources.
Solving a Problem
Using the standard flight booking example tables in SAP – SBOOK and SFLIGHTS2 – I have a fictional travel agency, and I want to know who my most valued customers are. I define my valued customers in terms of (a) how much money they spend, and (b) how far they fly. The SFLIGHTS2 table does give me an estimate of the mileage of each flight, but I don't trust this value, so I need to go outside of SAP to calculate it. It so happens that Google has a geolocation API, and I know the airport codes for the start and end points of each journey. Given this, I want to take each completed booking, determine the start and end locations, calculate the distance between them, and then summarise the total distance and total spend per customer per year.
Using RSAP, I'm going to extract the booking and flight data from SAP, and then use Hadoop to farm out the 'heavy processing' (it's not really heavy, but hey – this is just an example) of data formatting, geolocation, distance calculation and summarisation.
Setup
Firstly, you don't actually need a Hadoop cluster to try out these examples, but it makes it more fun. RHadoop has a local processing option that allows you to debug your code prior to hitting the cluster by setting rmr.options(backend = "local") (see the DEBUG flag in the example below).
However, you must have a memcached server available, as the geolocation process stashes discovered locations there. This is necessary because of Google's limit on API calls, and it is also an interesting twist on how to improve the performance of map/reduce jobs.
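The caching pattern itself is simple enough to sketch up front – this uses the rmemcached package listed below, assumes a memcached server running on localhost, and mirrors what the mapper does later with the geocoding results:
library(rmemcached)
server <- cache.connect("127.0.0.1", 11211)   # local memcached on the default port
cached.lookup <- function(key, fetch) {
  hit <- cache.get(server, key)               # NULL means a cache miss
  if (!is.null(hit)) return(hit)
  value <- fetch(key)                         # the expensive call, e.g. a web service
  cache.set(server, key, value)               # stash it for the next request
  value
}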
From R, you need to install the following packages – ensure that they are installed in a site wide package location:
- RSAP (from CRAN – further instructions can be found here)
- rmemcached (from github)
- The two required RHadoop modules (rmr2, rhdfs) as described here
- RJSONIO (from CRAN)
- fossil – for calculating distances between locations (from CRAN)
Basic R package installations take the form of install.packages(‘<package name>’) from the R command line.
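For example – the CRAN installs are straightforward; the github repository path and archive file names below are placeholders only, so check the links above for the exact sources:
# CRAN packages
install.packages(c('RSAP', 'RJSONIO', 'fossil'))
# rmemcached comes from github - devtools can install directly from a repository
install.packages('devtools')
devtools::install_github('<user>/rmemcached')      # placeholder repository path
# rmr2 and rhdfs are distributed as package archives by the RHadoop project -
# download them and install from the local files
install.packages('rmr2_<version>.tar.gz', repos = NULL, type = 'source')
install.packages('rhdfs_<version>.tar.gz', repos = NULL, type = 'source')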
Extract the SFLIGHTS2 and SBOOK tables from SAP:
dataDir = '/a/directory/R/'
# Grab data from SAP via RFC call and store in R data file
if (!exists('cbook')) {
if (!file.exists(file.path(dataDir, 'cbook.RData'))) {
print("Could not find cbook - generating it...")
# read SFLIGHTS2 for airports and SBOOK for bookings
library(RSAP)
conn <- RSAPConnect("sap.yml")
print(system.time((sflights <- RSAPReadTable(conn, "SFLIGHTS2")))) # 4880 rows
print(system.time((sbook <- RSAPReadTable(conn, "SBOOK",
fields=list('CARRID', 'CONNID', 'FLDATE', 'INVOICE', 'CANCELLED', 'CUSTOMID', 'PAYMENTSUM'))))) # 1370282 rows
RSAPClose(conn)
# merge the tables together on flight and date
cbook <- merge(sflights, sbook, by=c('CARRID', 'CONNID', 'FLDATE'))
save(cbook, file=file.path(dataDir, 'cbook.RData'), ascii=FALSE)
print("generated and saved cbook for next time")
} else {
# grab the saved dataset
print("Loading cbook from disk")
print(load(file.path(dataDir, 'cbook.RData')))
}
}
We now have two data frames, sflights and sbook, containing 4,880 and 1,370,282 rows respectively, merged into cbook and stashed in an R data file, so we don't have to go back to SAP again.
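A quick sanity check of the merged data set before shipping it off to Hadoop – these are the columns that the mapper below relies on:
dim(cbook)      # number of merged booking rows and columns
str(cbook[, c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM', 'FLDATE', 'INVOICE', 'CANCELLED')])
head(unique(cbook$AIRPFROM))    # departure airport codes that will be geocoded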
Run the map/reduce
Key things to note about the map/reduce job:
- set the environment variables for your Hadoop cluster
- the local backend option uses temp files, and chops the data up into 10,000 record chunks to emulate Hadoop processing
- Ensure that you do not exceed the Google API query limit or the job will fail
Please read the notes in the code:
# sort out loss of env with RStudio for Hadoop
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
library(rmr2)
# switch on debug to run local - no Hadoop
DEBUG=TRUE
if (DEBUG) {
rmr.options(backend = "local")
} else {
rmr.options(backend = "hadoop")
}
# clear out results from any previous run - 'out' will hold the returned data
# structure from the completed job, the natural output of a map/reduce call
if (exists('out')) {
rm(out)
}
if (exists('result.df')) {
rm(result.df)
}
out = from.dfs(
mapreduce(
to.dfs(cbook), # save the bookings data to the Hadoop file system
map = # define the mapper that will do the formatting and geo location
function(k, v) {
# connect to server here as global so that the connection can be reused
library(rmemcached)
server <- cache.connect("127.0.0.1", 11211) # change to your memcache server
# geo coding routine shamelessly taken from Jitender Aswani http://allthingsbusinessanalytics.blogspot.com/
getGeoCode <- function(gcStr) {
library("RJSONIO")
# fix up airport codes that are wrong
if (gcStr == 'TYO') {
gcStr = 'HND'
} else if (gcStr == 'KIX') {
gcStr = 'KANSAI'
} else if (gcStr == 'THF') {
gcStr = 'BRANDENBURG'
}
# use memcache to store results so we don't spam the Google API
stash <- cache.get(server, gcStr)
if (!is.null(stash)) {
names(stash) <- c("Lat", "Lng")
return(stash)
} else {
# stop(paste("could not geocode from cache", gcStr))
gcStr <- gsub(' ','%20',gcStr) #Encode URL Parameters
connectStr <-
paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, '%20airport', sep="")
con <- url(connectStr) # call the API
data.json <- fromJSON(paste(readLines(con), collapse="")) # decode the JSON results
close(con)
#Flatten the received JSON
data.json <- unlist(data.json)
# grab the lat and long from the JSON structure
if(data.json["status"]=="OK") {
lat <- data.json["results.geometry.location.lat"]
lng <- data.json["results.geometry.location.lng"]
gcodes <- c(lat, lng)
names(gcodes) <- c("Lat", "Lng")
# save in the cache for next time
cache.set(server, gcStr, as.numeric(gcodes))
return(as.numeric(gcodes))
} else {
# if you got here then you probably exceeded the API limit
stop(paste("could not geocode", gcStr, '-', data.json))
}
}
}
# function to calculate the distance between two geo loc points
calcDistance <- function(origin, dest) {
library(fossil) # has earth.dist
return(as.numeric(earth.dist(as.table(rbind(getGeoCode(origin), getGeoCode(dest))))))
}
# now the real mapper function - notice how it uses native R structures of data.frame or matrix
# to describe inputs and outputs
# select only relevant records and columns:
# * invoiced
# * not cancelled
v <- v[v$INVOICE == 'X' & v$CANCELLED == "", c('AIRPFROM', 'AIRPTO', 'CUSTOMID', 'PAYMENTSUM', 'FLDATE')]
if (nrow(v) > 0) {
v$year <- substr(v$FLDATE, 1, 4) # grab the year from the flight date
v$distance <- unlist(lapply(1:nrow(v), function(i) {calcDistance(v[i,'AIRPFROM'], v[i,'AIRPTO'])})) # calculate distance
} else {
v$year <- c() # empty result placeholders
v$distance <- c()
}
# we return a matrix of keys and a matrix of values
return(keyval(cbind(v$year, v$CUSTOMID), cbind(rep.int(1, nrow(v)), round(as.numeric(v$distance), 2), as.numeric(v$PAYMENTSUM))))
},
# use the reducer to summarise the distance and cost by customer by year
reduce =
function(k, vv) if(nrow(vv) > 0) keyval(k, cbind(sum(vv[, 1], na.rm=T), # number of flights
sum(vv[, 2], na.rm=T), # total distance
sum(vv[, 3], na.rm=T))), # total paid
combine=TRUE # the reducer doubles as the combiner (sums are associative)
))
# reformat output to data.frame
result.df <- data.frame(cbind(out$key, out$val), stringsAsFactors = FALSE)
colnames(result.df) <- c('year', 'customid', 'flights', 'distance', 'cost')
You may receive a warning like:
In to.dfs(cbook) : Converting to.dfs argument to keyval with a NULL key
This is because no key has been specified for the dataset, and can be disregarded.
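One small optional step (not part of the original run): because result.df is built with cbind(), all of its columns come back as character, so it can be convenient to convert the measure columns to numerics before analysing them:
# cbind() of the character keys and the numeric values yields a character matrix,
# so turn the measure columns back into numbers for sorting and arithmetic
num.cols <- c('flights', 'distance', 'cost')
result.df[num.cols] <- lapply(result.df[num.cols], as.numeric)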
The Results
And there we have it in the results below. Not surprisingly, we get a different set of customers when viewed by mileage than by revenue. The higher-mileage customers might think they have more of a claim on a customer reward scheme, but we may see it differently, as revenue by customer may indicate higher value. Either way, we can now reward our top customers.
# 2012 top 10 most valued customers by revenue
subset(result.df[order(as.numeric(result.df$cost), decreasing=TRUE),], year=="2012")[1:10,]
year customid flights distance cost
2012 1731 77 669032.78 27619910.84
2012 1764 74 626387.3 26672217.89
2012 3077 65 567615.38 26331673.76
2012 1919 65 524266.25 26318934.19
2012 4015 69 589793.49 26160147.92
2012 1402 75 616431.91 25427669.25
2012 2793 76 692470.6 25044591.17
2012 1922 70 644232.32 24998612.27
2012 4528 75 636527.83 24915862.29
2012 52 70 643509.53 24757153.08
# 2012 top 10 most valued customers by airmiles
subset(result.df[order(as.numeric(result.df$distance), decreasing=TRUE),], year=="2012")[1:10,]
year customid flights distance cost
2012 2793 76 692470.6 25044591.17
2012 3348 80 674154.39 21708568.64
2012 2525 78 672130.08 22931827.02
2012 1731 77 669032.78 27619910.84
2012 1511 83 663782.33 22940185.46
2012 3429 84 661984.52 22914627.49
2012 4268 77 655323.39 21989782.32
2012 18 72 653851.2 21678816.85
2012 2150 68 650492.96 20314354.86
2012 861 73 650359.4 23346561.28
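Purely as an illustrative extra (not part of the original output), the two rankings can be eyeballed together by plotting total distance against total spend for 2012 – the most valued customers sit towards the top right:
y2012 <- subset(result.df, year == "2012")
plot(as.numeric(y2012$distance), as.numeric(y2012$cost),
     xlab = "total distance", ylab = "total spend",
     main = "2012 customers: distance vs spend")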
In Summary
RHadoop provides new and novel ways of blending data within the R workbench, adding another useful tool to the life of a data analyst. The interface to the map/reduce process is very R-like, allowing us to relate to the data in the pipeline as R data types.
R is a blank canvas with a whole raft of widgets for 'painting' a path to data analytics, and the same is true of Hadoop. What you put inside a map/reduce job could literally be anything in terms of how you source and manipulate data – in fact, there is nothing stopping you from mixing and matching steps from different programming languages and data interfaces to get the job done.
Almost too much choice!
Piers! This is just awesome! 😀 Great example! I'm still trying to find some time to learn Hadoop but your blog really puts me on the right track...
As soon as I have some time...I will try to write a blog using SAP HANA, R and Hadoop... 😉
Greetings,
Blag.
Hey thanks Alvaro - and I look forward to the SAP Hana version.
Cheers,
Piers Harding.
Hi Piers,
Piers Harding It's very rare to see an SAP guy writing about Hadoop 😉. This blog just connected another missing dot in the circle.
Awesome one, with a most understandable example. 😉
I think very soon Alvaro is going to put an end to my wait, with a wonderful blog featuring HANA, R and HADOOP. 🙂 with some of his other expertise also... say a video in his blog 😉
Regards
Kumar.
Thanks Kumar.
Hello Piers,
Really good work you have done. I have a problem with Rgui (Windows, 64-bit) crashing when I try to connect.
Can compile RSAP with warnings.
Can load RSAP
Trying to use an adapted RSAPconn(...) from the test.R file on your github site to find out whether I can connect to the AS ABAP Server.
Rgui crashes in RSAP
First attempt with 64 bit R 3.0.2 which was already installed here. Used Rtools 3.0 for that compile. Crashed.
Then uninstalled/removed totally R and all libs etc from the system and started with a new install of R 3.1.1 and Rtools 3.1.
Compiles with warnings too. Same crash in the RSAP DLL when trying to connect.
Any hints?
Best regards
George
Hi Georg -
I have heard of this problem and I think it is restricted to MS Windows, and has only appeared with R 3.x (I have no access to Windows to attempt to recreate it). Have you tried building the extension under Linux, or perhaps you could try R 2.x?
Cheers,
Piers Harding.
Hi Piers,
thanks for the follow up. Well understood that you do not have a Windows system at hand.
Me neither - I normally use OSX. But for that there is no SAP SDK. So I tried it first in a Windows VM and made the comment here about my problems.
Later I got it going under a fresh install of Ubuntu 14.04 LTS in another VM. The only problem was that it took me a while to find out that I had to copy the SAP library files from the SDK to a place where library files reside, and make entries in the Linux library configuration path /etc/ld.so.conf.d for that. Probably a well-known procedure for those who regularly use Linux/UNIX. When I did this years ago the procedure was different.
I had the impression from your readme that Alvaro gave hints on how to get it going under newer Windows and R 3.x versions. Maybe I should ask him whether he has current experience with that.
Anyhow I can access RFC functions on our ABAP ERP from R. GREAT!
Thanks for your effort.
Yes - good point. I must have left that out for R, as I have it there for Perl and Ruby ...
From Perl sapnwrfc -
YOU MUST MAKE SURE THAT YOU ADD THE LIB PATH TO LD_LIBRARY_PATH
EG:
export LD_LIBRARY_PATH=/usr/sap/rfcsdk/lib
Example:
perl Makefile.PL --source /opt/rfcsdk --addlibs '-lsomethingwacky'
Cheers,
Piers Harding.