An in-depth look at the principles of integrating SAP HANA with R
We also have a Chinese version of this blog.
Starting with this article, I will look into the principles behind combining SAP HANA with R. After the SAP D-code meeting, I realized that many related applications use R, especially applications for analysis and prediction, so I want to study the details in depth. These articles assume some experience with R, ideally including R inside SAP HANA. You can find the related documentation at https://help.sap.com/hana/SAP_HANA_R_Integration_Guide_en.pdf, and you can also read another document of mine at http://scn.sap.com/community/chinese/hana/blog/2014/02/14/r%E8%AF%AD%E8%A8%80%E5%8C%85%E5%AE%89%E8%A3%85%E5%B9%B6%E5%AE%9E%E7%8E%B0%E4%B8%8Ehana%E7%9A%84%E6%95%B4%E5%90%88
These articles will show the bi-directional data flow between SAP HANA and R. Knowing it helps you design more efficient R procedures on SAP HANA and track down the cause of problems. You will also be able to check the R logs, and you can even integrate R with your own applications, provided they can communicate over TCP/IP.
(1) R’s embedded execution environment
SAP HANA can integrate with R because R supports an embedded execution environment. This means that once the required libraries are installed, you can call R code from a C program.
Under R’s installation directory (such as /usr/local/lib64/R) there are header files that provide function prototypes, and a dynamic-link library file, libR.so. With these files, a C program can run R code. For example:
#include <stdio.h>
#include "Rembedded.h"   /* Rf_initEmbeddedR, Rf_endEmbeddedR */
#include "Rdefines.h"    /* SEXP, NEW_INTEGER, INTEGER_DATA, ... */

int main(void) {
    /* Arguments passed to the embedded R session */
    char *argv[] = { "REmbeddedPostgres", "--gui=none", "--silent" };
    int argc = sizeof(argv) / sizeof(argv[0]);
    SEXP e, fun, arg;
    int i;

    Rf_initEmbeddedR(argc, argv);

    /* Look up R's print() function */
    fun = Rf_findFun(Rf_install("print"), R_GlobalEnv);
    PROTECT(fun);

    /* Build the integer vector 1..10 */
    arg = NEW_INTEGER(10);
    PROTECT(arg);
    for (i = 0; i < GET_LENGTH(arg); i++)
        INTEGER_DATA(arg)[i] = i + 1;

    /* Build the call print(arg) as a two-element language object */
    e = allocVector(LANGSXP, 2);
    PROTECT(e);
    SETCAR(e, fun);
    SETCAR(CDR(e), arg);

    /* Evaluate the call to the R function. Ignore the return value. */
    eval(e, R_GlobalEnv);
    UNPROTECT(3);

    Rf_endEmbeddedR(0);   /* shut the embedded session down cleanly */
    return 0;
}
This code mainly uses functions and macros defined in the R language kernel. First, it initializes an embedded execution environment by calling Rf_initEmbeddedR(argc, argv). SEXP is a pointer type that points to R's internal data structures (see R's source code under R-2.15.0/src/main). The code then builds an integer vector arg holding the values 1 to 10. Finally, it executes the function print() on that vector with eval(e, R_GlobalEnv).
We can compile the code with the following command:
gcc embed.c -I/usr/local/lib64/R/include -L/usr/local/lib64/R/lib -lR
-I: the path of the header files;
-L: the path of the dynamic-link library;
-lR: links the program against libR.so.
When the program runs, it prints the vector just as an interactive R session would.
Building on this embedded execution environment, we can create an R server that works as a TCP/IP server: it accepts requests from TCP/IP clients, executes the R program, and returns the result to the client. This is the primary reason Rserve was developed.
All of the above is the basis for combining SAP HANA with R.
(2) Introduction to Rserve
Rserve first appeared in October 2003; the newest version is Rserve 1.7-3, published in 2013. Its author is Simon Urbanek (http://simon.urbanek.info/), who does research work at AT&T Labs. The Rserve server and clients can be downloaded from http://www.rforge.net/Rserve/, where you can also find more details about Rserve.
The server is implemented in C. It accepts requests and data from a client, performs the calculation, and returns the results to the client. Clients are provided in C++, Java and PHP. The C++ interface offers only basic functionality; in the author's own words, “This C++ interface is experimental and does not come in form of a library”. It supports only a few basic data structures, such as lists, vectors, and doubles. For other types you have to write the support yourself, as the author puts it: “Look at the sources to see how to implement other types if necessary”
SAP HANA's R client is also implemented in C++, but it is more complicated than the original C++ interface. In theory, you can implement a client in any language that supports TCP/IP.
(3) Message-oriented communication protocol: QAP1
QAP1 (quad attributes protocol v1) is the protocol Rserve uses to communicate with its clients. Under QAP1 the client speaks first: it sends a message that contains a specific command and any related data, then waits for the response message from the server. The response message contains a response code and the result data. Each message consists of a 16-byte header portion followed by a data portion. The structure of the header is as follows:
offset  type   meaning
[0]     (int)  request command or response code
[4]     (int)  length of the message (bits 0 to 31)
[8]     (int)  offset of the data part
[12]    (int)  length of the message (bits 32 to 63)
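To make the layout concrete, here is a minimal sketch of packing and unpacking the 16-byte header in C. The struct and function names are hypothetical and not part of Rserve. The sketch assumes the four integers travel in little-endian byte order, and it treats the two length fields as the low and high halves of one 64-bit value, as the table describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QAP1_HEADER_SIZE 16

/* Hypothetical in-memory view of the 16-byte header described above. */
struct qap1_header {
    uint32_t command;     /* offset 0: request command or response code */
    uint64_t length;      /* offsets 4 and 12: low and high halves of the length */
    uint32_t data_offset; /* offset 8: offset of the data part */
};

/* Write a 32-bit value in little-endian order (assumed wire order). */
static void put_u32_le(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v & 0xffu);
    p[1] = (unsigned char)((v >> 8) & 0xffu);
    p[2] = (unsigned char)((v >> 16) & 0xffu);
    p[3] = (unsigned char)((v >> 24) & 0xffu);
}

/* Read a 32-bit little-endian value. */
static uint32_t get_u32_le(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Serialize the header into a buffer of at least QAP1_HEADER_SIZE bytes. */
void qap1_pack_header(const struct qap1_header *h, unsigned char *out) {
    assert(h != NULL && out != NULL);
    put_u32_le(out, h->command);
    put_u32_le(out + 4, (uint32_t)(h->length & 0xffffffffu));
    put_u32_le(out + 8, h->data_offset);
    put_u32_le(out + 12, (uint32_t)(h->length >> 32));
}

/* Parse a header from a buffer of at least QAP1_HEADER_SIZE bytes. */
void qap1_unpack_header(const unsigned char *in, struct qap1_header *h) {
    assert(in != NULL && h != NULL);
    h->command = get_u32_le(in);
    h->length = (uint64_t)get_u32_le(in + 4) |
                ((uint64_t)get_u32_le(in + 12) << 32);
    h->data_offset = get_u32_le(in + 8);
}
```

A client would call qap1_pack_header() before writing a request to the socket, and qap1_unpack_header() on the first 16 bytes it reads back, to learn how many more bytes to expect.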
The data portion of the message may contain additional parameters, such as DT_INT, DT_STRING or other parameter types. See Rsrv.h for the details.
Here are the commands that Rserve supports:
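As an illustration of how a client might walk through those parameters, the sketch below reads one parameter at a time from a data portion. It rests on my understanding of Rsrv.h, which should be checked against your copy: each parameter starts with a 4-byte header holding the type in the first byte and a 24-bit little-endian length in the next three, and the numeric type codes are as listed in the enum. The struct and function names are hypothetical. Parameters flagged as DT_LARGE are simply rejected here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Type codes as I recall them from Rsrv.h; treat the numbers as assumptions. */
enum { DT_INT = 1, DT_STRING = 4, DT_BYTESTREAM = 5, DT_SEXP = 10, DT_LARGE = 64 };

/* Hypothetical helper type: one decoded parameter of the data portion. */
struct qap1_param {
    uint8_t type;              /* DT_* code */
    uint32_t length;           /* payload length in bytes */
    const unsigned char *data; /* points into the caller's buffer */
};

/*
 * Decode the parameter that starts at buf[pos].
 * Returns the position of the next parameter, or 0 if the data is
 * truncated or uses the large-length form, which this sketch skips.
 */
size_t qap1_next_param(const unsigned char *buf, size_t size, size_t pos,
                       struct qap1_param *out) {
    assert(buf != NULL && out != NULL);
    if (pos > size || size - pos < 4)
        return 0;                       /* no room for a parameter header */
    if (buf[pos] & DT_LARGE)
        return 0;                       /* extended header not handled here */
    out->type = buf[pos];
    out->length = (uint32_t)buf[pos + 1] |
                  ((uint32_t)buf[pos + 2] << 8) |
                  ((uint32_t)buf[pos + 3] << 16);
    if (out->length > size - pos - 4)
        return 0;                       /* payload runs past the buffer */
    out->data = buf + pos + 4;
    return pos + 4 + (size_t)out->length;
}
```

Calling the function in a loop, starting at position 0 and feeding each return value back in as pos until it equals the buffer size, visits every parameter in turn.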
command             parameters            | response data
CMD_login           DT_STRING             | -
CMD_voidEval        DT_STRING             | -
CMD_eval            DT_STRING or DT_SEXP  | DT_SEXP
CMD_shutdown        [DT_STRING]           | -
CMD_openFile        DT_STRING             | -
CMD_createFile      DT_STRING             | -
CMD_closeFile       -                     | -
CMD_readFile        [DT_INT]              | DT_BYTESTREAM
CMD_writeFile       DT_BYTESTREAM         | -
CMD_removeFile      DT_STRING             | -
CMD_setSEXP         DT_STRING, DT_SEXP    | -
CMD_assignSEXP      DT_STRING, DT_SEXP    | -
CMD_setBufferSize   DT_INT                | -
CMD_setEncoding     DT_STRING             | - (since 0.5-3)
since 0.6:
CMD_ctrlEval        DT_STRING             | -
CMD_ctrlSource      DT_STRING             | -
CMD_ctrlShutdown    -                     | -
since 1.7:
CMD_switch          DT_STRING             | -
CMD_keyReq          DT_STRING             | DT_BYTESTREAM
CMD_secLogin        DT_BYTESTREAM         | -
CMD_OCcall          DT_SEXP               | DT_SEXP
The most commonly used command is CMD_eval. It receives a piece of R code, parses it, executes it, and sends the result back in the response message.
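A CMD_eval request can be sketched in C as one buffer holding the 16-byte header followed by a single DT_STRING parameter. Several details below are assumptions on my part rather than facts from this article. The numeric values of CMD_eval and DT_STRING are as I recall them from Rsrv.h. The byte order is taken to be little-endian, and the length field is taken to count only the data portion. The string is NUL-terminated and padded to a multiple of 4 bytes, as the reference clients appear to do. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as I recall them from Rsrv.h; verify against your copy. */
#define CMD_eval  0x003u
#define DT_STRING 4u

/* Write a 32-bit value in little-endian order (assumed wire order). */
static void put_u32_le(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v & 0xffu);
    p[1] = (unsigned char)((v >> 8) & 0xffu);
    p[2] = (unsigned char)((v >> 16) & 0xffu);
    p[3] = (unsigned char)((v >> 24) & 0xffu);
}

/*
 * Hypothetical helper: write a CMD_eval request for `code` into `buf`.
 * Returns the total message size in bytes, or 0 if `buf` is too small
 * or the code does not fit a 24-bit parameter length.
 */
size_t qap1_build_eval(const char *code, unsigned char *buf, size_t cap) {
    assert(code != NULL && buf != NULL);
    size_t text = strlen(code) + 1;           /* include the terminating NUL */
    size_t padded = (text + 3) & ~(size_t)3;  /* round up to a multiple of 4 */
    size_t data_len = 4 + padded;             /* parameter header + payload */
    if (padded > 0xffffffu || cap < 16 + data_len)
        return 0;

    /* Message header: command, length (low), data offset, length (high). */
    put_u32_le(buf, CMD_eval);
    put_u32_le(buf + 4, (uint32_t)data_len);
    put_u32_le(buf + 8, 0);
    put_u32_le(buf + 12, 0);

    /* Parameter header: type in byte 0, 24-bit length in bytes 1 to 3. */
    put_u32_le(buf + 16, DT_STRING | ((uint32_t)padded << 8));

    /* Payload: the R code, zero-padded to the rounded length. */
    memset(buf + 20, 0, padded);
    memcpy(buf + 20, code, text - 1);
    return 16 + data_len;
}
```

For the R code "1+1" this produces a 24-byte message: 16 bytes of header, 4 bytes of parameter header, and the 4-byte string "1+1" including its terminating NUL. A client would then write those bytes to the Rserve socket and wait for the response.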
In principle, running R embedded directly inside SAP HANA would be simpler and more efficient. However, this is not done because of the licensing issues of open-source software.
That’s it for now. In the following blogs I will introduce how Rserve operates and how SAP HANA communicates with it. Understanding these details should help you write better R procedures.