In this first part, we’re just going to write down all the hypothesis and some extra information.
The second part will contain all the results visualization stuff, including the use of SAP UI5 and its combination with D3js.
What is the Link Prediction problem?
For the problem definition, I’m taking the most part of the title from this paper. In that paper, we have a main question that states:
Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future?
The main point is to measure the proximity in nodes to try to find interesting information. This means that, with the Link Prediction algorithm, we can anticipate those relationships that will occur given a directed graph.
In the Predictive Analysis Library (PAL) included in HANA, there is an algorithm called Link Prediction that makes all this calculations.
The number of possible applications is very large. Actually, Lucas Sparvieri has written a very interesting one.
Doing something similar, in our particular case, we’re going to analyse the relations between those who buy something from others, and those who sell something to others.
Our data set consists of those buyers and sellers with whom they establish a transaction. We take the difference when the person A buys something to the person B; this is a relation of type “B”. When the person A sells something to the person B, this is a relation of type “S”. Both cases are just the same, and we’re only going to address the “B”. We have information about the IDs of the buyers and sellers, the number of the month when the transaction was made, and also the year.
You can download our data set from here.
With the data already loaded, we have to prepare the procedure that will do the analysis.
We start creating all the control variables that will define the structure of the information, and so, execute the Link Prediction algorithm implemented in the Predictive Analysis Library:
create type PAL_LP_DATA_T as table ("NODE1" INTEGER, "NODE2" INTEGER); create type PAL_LP_RESULT_T as table ("NODE1" INTEGER, "NODE2" INTEGER, "SCORE" DOUBLE); create type PAL_CONTROL_T as table ( "NAME" VARCHAR(100), "INT_ARGS" INTEGER, "DOUBLE_ARGS" DOUBLE, "STRING_ARGS" VARCHAR(100) ); create column table PAL_LP_PDATA_TBL( "ID" INTEGER, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100) );
According to the documentation, we need to do the next:
insert into PAL_LP_PDATA_TBL values (1, 'CARLOS.PAL_LP_DATA_T', 'in'); insert into PAL_LP_PDATA_TBL values (2, 'CARLOS.PAL_CONTROL_T', 'in'); insert into PAL_LP_PDATA_TBL values (2, 'CARLOS.PAL_LP_RESULT_T', 'out'); -- some times is necesary to uncomment the next line -- grant select on CARLOS.PAL_LP_PDATA_TBL to system;
After that, we generate the AFL wrappers:
call "SYSTEM"."AFL_WRAPPER_ERASER"('PAL_LINK_PREDICTION_BUYERS'); call "SYSTEM"."AFL_WRAPPER_GENERATOR"('PAL_LINK_PREDICTION_BUYERS', 'AFLPAL', 'LINKPREDICTION', PAL_LP_PDATA_TBL);
You can play with some method details. These details are well explained in the SAP PAL documentation.
create local temporary column table #PAL_CONTROL_TBL like PAL_CONTROL_T; insert into #PAL_CONTROL_TBL values ('THREAD_NUMBER', 1, null, null); insert into #PAL_CONTROL_TBL values ('METHOD', 1, null, null); insert into #PAL_CONTROL_TBL values ('BETA', null, 0.005, null);
The table with the data that will feed the algorithm is created then:
create column table PAL_LP_DATA_TBL AS ( select ID_FROM "NODE1", ID_TO "NODE2" from "CARLOS"."LOG_TRANS" where MONTH_IN_YEAR = 6 -- Selection of the desired month group by ID_FROM, ID_TO );
You have to create a table where the results will be deposited, and call the procedure:
create column table PAL_LP_RESULT_TBL LIKE PAL_LP_RESULT_T; call _SYS_AFL.PAL_LINK_PREDICTION_BUYERS(PAL_LP_DATA_TBL, #PAL_CONTROL_TBL, PAL_LP_RESULT_TBL) with overview; select NODE1 "BUYER", NODE2 "SELLER", SCORE from PAL_LP_RESULT_TBL;
You can also download this SQL script that contains all the previous steps.
Those are the results from the Link Prediction analysis. The Link Prediction algorithm gives you the most probably next relations, with their score. This score is the probability to occur. Remember that these are the edges of a directed graph, where each BUYER (and eventually the SELLERS) are the vertices.