当SAP HANA遇上R——有什么更新?[翻译后转载]
🙂 原文链接:http://scn.sap.com/community/developer-center/hana/blog/2013/02/18/when-sap-hana-met-r–whats-new
自从我写了博客当SAP HANA遇上R——第一次接吻,我收到了很多好的反馈……而那些反馈之一就是……“有什么更新?”……
好的……也许你现在用Rserve来让SAP HANA和R一起工作,一个允许与R服务器交流的包,所以真是如此……不会有太多的新功能……但是……很酷的是SAP HANA已经经过与R 2.15和Rserve 0.6-8的测试,所以任何新增到R和Rserve的新功能将立刻在SAP HANA中可用。 😉
但是嘿!我不会写一篇博客因为不只有一个很酷的新功能,是吗?你可以从SAP HANA R整合向导读到更多。
当然……为此你需要SAP HANA rev.48(SPS5)
所以什么是很酷的功能?好的……你可以把一个火车模型存储成像lm()或ksvm()直接放在一个表里以供以后使用。这很酷,因为如果你需要大量计算,你只需要存储模型,并且在以后使用它的时候不用再一次预处理每件事。
让我们举个例子……希望所有R的粉丝不会因此杀了我……因为当涉及到统计……我实在是迷失了方向。 🙁
比方说我们有两个来自SFLIGHT包的表……SPFLI和STICKET,因此我们要预测一名客户有多少次飞往不同目的地(CITYFROM-CITYTO),依赖于所有客户飞往那些相同地点的次数。
我们将创建一个SQL脚本文件来获取信息,转换它,创建模型并且存储到数据库中……
Build_Flight_Model.sql |
---|
–Create a TYPE T_FLIGHTS to grab the information from the SPFLI and STICKET tables. DROP TYPE T_FLIGHTS; CREATE TYPE T_FLIGHTS AS TABLE ( CARRID NVARCHAR(3), CUSTOMID NVARCHAR(8), CITYFROM NVARCHAR(20), CITYTO NVARCHAR(20) ); –Create a TYPE FLIGHT_MODEL_T and a table FLIGHT_MODEL to get and store the model in the database. DROP TYPE FLIGHT_MODEL_T; CREATE TYPE FLIGHT_MODEL_T AS TABLE ( ID INTEGER, DESCRIPTION VARCHAR(255), MODEL BLOB ); DROP TABLE FLIGHT_MODEL; CREATE COLUMN TABLE FLIGHT_MODEL ( ID INTEGER, DESCRIPTION VARCHAR(255), MODEL BLOB ); –This R procedure will receive the T_FLIGHTS information, create table containing a field call FLIGHT_NAME that will contain the concatenation –of the CARRID, CITYFROM and CITYTO. ie: AA-NEW YORK-SAN FRANCISCO. –We’re going to convert the table into a data.frame, so all the similar values in FLIGHT_NAME are going to be summarized. –Using the subset() function, we’re going to get rid of all the FLIGHT_NAME’s that has a frequency lower or equal than 0. –We’re going to use the nrow() function to count all the FLIGHT_NAME occurrences and multiply that by 10 (Stored in f_rows) –We’re going to use the sum() function to sum all the frequencies and the divide it by f_rows (Stored in f_sum) –We’re going to use the mapply() function to divide each of the frequencies by f_sum –We’re going to use the order() function to sort by FLIGHT_NAME –We’re going to use the colnames() function to assign names to our data.frame –We’re going to use the lm() function to generate a Linear Regression based on the FLIGHT_NAME and it’s frequency –Finally, we’re going to use the generateRobjColumn() custom created function to store the result of the model in the buffer. DROP PROCEDURE FLIGHT_TRAIN_PROC; CREATE PROCEDURE FLIGHT_TRAIN_PROC (IN traininput “T_FLIGHTS”, OUT modelresult FLIGHT_MODEL_T) LANGUAGE RLANG AS BEGIN generateRobjColumn <- function(…){ result <- as.data.frame(cbind( lapply( list(…), function(x) if (is.null(x)) NULL else serialize(x, NULL) ) )) names(result) <- NULL names(result[[1]]) <- NULL result } tab<-table(FLIGHT_NAME=paste(traininput$CARRID,traininput$CITYFROM,traininput$CITYTO,sep=”-“)) df<-data.frame(tab) ss<-subset(df,(df$Freq>0)) freq<-ss$Freq f_rows<-(nrow(ss)) * 10 fsum<-sum(freq) / f_rows ss$Freq<-mapply(“/”,ss$Freq, fsum) flights<-ss[order(ss$FLIGHT_NAME),] colnames(flights)<-c(“FLIGHT_NAME”,”FREQUENCY”) lmModel<-lm(FREQUENCY ~ FLIGHT_NAME,data=flights) modelresult<-data.frame( ID=c(1), DESCRIPTION=c(“Flight Model”), MODEL=generateRobjColumn(lmModel) ) END; –This SQLSCRIPT procedure will grab all the needed information from the tables SPFLI and STICKET and will assign it to flights –We’re going to call the R procedure FLIGHT_TRAIN_PROC –We’re going to do an INSERT to finally store the model from the buffer into the database DROP PROCEDURE POPULATE_FLIGHTS; CREATE PROCEDURE POPULATE_FLIGHTS () LANGUAGE SQLSCRIPT AS BEGIN flights = SELECT SPFLI.CARRID, CUSTOMID, CITYFROM, CITYTO FROM SFLIGHT.SPFLI INNER JOIN SFLIGHT.STICKET ON SPFLI.CARRID = STICKET.CARRID AND SPFLI.CONNID = STICKET.CONNID; CALL FLIGHT_TRAIN_PROC(:flights, FLIGHT_MODEL_T); INSERT INTO “FLIGHT_MODEL” SELECT * FROM :FLIGHT_MODEL_T; END; CALL POPULATE_FLIGHTS(); |
当我们调用POPULATE_FLIGHTS(), 我们的FLIGHT_MODEL表应该看上去如是……
如果你好奇为什么我们有个“X”……这是因为内容被序列化而且存储到一个BLOB字段……如果你检查内容,你会收到一堆奇怪的十六进制数字……
不管怎么样……SAP HANA花了6.165秒来处理1842160条记录。
现在我们已经把模型安全的存储在数据库中,我们可以来看下一个SQL脚本文件了……
GET_AND_USE_FLIGHT_MODEL.sql |
---|
–We’re going to create a TYPE T_PREDICTED_FLIGHTS and a table PREDICTED_FLIGHTS to store the information of the current number of flights and –the estimated (according to our prediction) number of flights DROP TYPE T_PREDICTED_FLIGHTS; CREATE TYPE T_PREDICTED_FLIGHTS AS TABLE ( CUSTOMID NVARCHAR(8), FLIGHT_NAME NVARCHAR(60), FREQUENCY INTEGER, PREDICTED INTEGER );
DROP TABLE PREDICTED_FLIGHTS; CREATE TABLE PREDICTED_FLIGHTS ( CUSTOMID NVARCHAR(8), FLIGHT_NAME NVARCHAR(60), FREQUENCY INTEGER, PREDICTED INTEGER ); –In this R procedure, we’re going to receive the flight for a given customer, the model stored in the database and we’re going to return the result so it can be –stored in our PREDICTED_FLIGHTS table. –We’re going to use the unserialize() function to extract the model. –We’re going to create a table containing a field call FLIGHT_NAME that will contain the concatenation of the CARRID, CITYFROM and CITYTO. –ie: AA-NEW YORK-SAN FRANCISCO. and also the CUSTOMID –We’re going to convert the table into a data.frame, so all the similar values in FLIGHT_NAME are going to be summarized. –We’re going to use the colnames() function to assign names to our data.frame –We’re going to use the nrow() function to get the number of records in the data.frame (Stored in dfrows) –We’re going to use the rep() function to repeat the CUSTOMID value of the first record dfrows times –We’re going to use the predict() function to predict the amount of flights based on our model (retrieved from the database) and the new data that we recovered –Finally, we’re going to create a data.frame containing all the information that should be stored in our table PREDICTED_FLIGHTS DROP PROCEDURE USE_FLIGHT; CREATE PROCEDURE USE_FLIGHT(IN flights T_FLIGHTS, IN modeltbl FLIGHT_MODEL_T, OUT out_flights T_PREDICTED_FLIGHTS) LANGUAGE RLANG AS BEGIN lmModel<-unserialize(modeltbl$MODEL[[1]]) tab<-table(FLIGHT_NAME=paste(flights$CARRID,flights$CITYFROM,flights$CITYTO,sep=”-“),CUSTOMID=flights$CUSTOMID) df<-data.frame(tab) colnames(df)<-c(“FLIGHT_NAME”,”CUSTOMID”,”FREQUENCY”) dfrows<-nrow(df) customid<-rep(df$CUSTOMID[1],dfrows) prediction=predict(lmModel,df,interval=”none”) out_flights<-data.frame(CUSTOMID=customid,FLIGHT_NAME=df$FLIGHT_NAME,FREQUENCY=df$FREQUENCY,PREDICTED=prediction) END; –This SQLSCRIPT procedure will select the information from the FLIGHT_MODEL table and store in the flight_model variable –We’re going to select all the needed information from the table SPFLI and STICKET based on the customer ID number –We’re going to call the R procedure USE_FLIGHT and it will return us the PREDICTED_FLIGHTS that we’re going to store in the database
DROP PROCEDURE GET_FLIGHTS; CREATE PROCEDURE GET_FLIGHTS(IN customId NVARCHAR(8)) LANGUAGE SQLSCRIPT AS BEGIN flight_model = SELECT * FROM FLIGHT_MODEL; out_flights = SELECT SPFLI.CARRID, CUSTOMID, CITYFROM, CITYTO FROM SFLIGHT.SPFLI INNER JOIN SFLIGHT.STICKET ON SPFLI.CARRID = STICKET.CARRID AND SPFLI.CONNID = STICKET.CONNID WHERE CUSTOMID = :customId; CALL USE_FLIGHT(:out_flights, :flight_model, PREDICTED_FLIGHTS); INSERT INTO “PREDICTED_FLIGHTS” SELECT * FROM :PREDICTED_FLIGHTS; |
现在我们准备好了所有存储过程……我们可以创建最后一个SQL脚本文件来实际地填一些数据到表PREDICED_FLIGHTS中……
PREDICT_FLIGHTS_FOR_CUSTOMERS.sql |
---|
CALL GET_FLIGHTS(‘00000156’); CALL GET_FLIGHTS(‘00002078’); CALL GET_FLIGHTS(‘00002463’); |
正如你所见……我们只需要调用GET_FLIGHTS存储过程,传递客户ID……
这个过程只花了970毫秒,并且为三位客户生成了122条记录……
现在我确定你意识到了这有多酷……如果我们没有在数据库中存储模型……那么我们将必须为每位客户计算模型……并且每位客户会花去我们7秒时间(获取lm(), 加上预计)……这样三位客户大概将用掉21秒……而我们可以说整个过程仅仅花了我们7秒……如果你需要为所有大于100万客户做预期计算……你将陷入一堆麻烦。 🙂
让我们看看PREDICTED_FLIGHTS表的内容……当然,我将只显示其中一部分……
我们可以通过SAP HANA Studio把这些信息导入一个CSV文件,可以说……用它在SAP虚拟智能中生成一个很好的图表……
但愿你喜欢它。 🙂