Social Network Analysis (SNA) is a fundamental research field and the problem of finding the optimal strategy for the spread of information, in our case over an entire community inside the network, is called Viral Marketing.

Data scientists know that some of the key factors for an idea to go viral are: the rate at which people become “infected”, the “connectedness” of the network and how a target group of individuals, called “seed”, who first become infected, are linked to the rest. This elements distinguish the networks you want to study but are not the only ones.

Our results are based on a “tipping model” proposed by Shakarian et. Al [1]. They have found a way to identify a seed group that can spread a message across a network. For very large networks we have chosen the integration of SAP HANA with R as a useful research and development tool.

As the authors point out: “The problem is NP-Complete so approximation algorithms must be used”. The method is based on the idea that an individual will eventually receive your information if a certain proportion of his or her friends already have that message. This proportion is called the “threshold” and is a crucial parameter in the model. The next step is to examine the network and look for all those individuals who have more friends than this critical number and then remove those who exceed the threshold by the largest amount. Repeating this process leaves us with a group of people in the network who have more friends than the threshold. When this happens, whoever is left is the seed group. In the paper the authors also investigate the limitations of the tipping model, so this must be taken into account.

This is a simple-to-code way to find a set of nodes that causes the entire population to activate – but it is not necessarily of minimal size.

We have tested the algorithm in well known networks to see how it works. Integration of R with SAP HANA is crucial for optimization. Our examples come from the Stanford Large Network Dataset Collection [2]. The algorithm performs very well:

```
--TippingImplementationOnSQLAndR
--Copyright (C) 2014 Sergio Samuel Nieto Mejía
--This program is free software: you can redistribute it and/or modify
--it under the terms of the GNU General Public License as published by
--the Free Software Foundation, either version 3 of the License, or
--(at your option) any later version.
--This program is distributed in the hope that it will be useful
--but WITHOUT ANY WARRANTY; without even the implied warranty of
--MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
--GNU General Public License for more details.
--You should have received a copy of the GNU General Public License
--along with this program. If not, see <http://www.gnu.org/licenses/>.
CREATE PROCEDURE schema.VIRALMarketing
(IN initial_data schema.Data, OUT viral_nodes schema.Data)
LANGUAGE RLANG SQL SECURITY INVOKER
AS BEGIN
tipp_model <- function(Edges_Data, threshold){
library(igraph)
graph_Data <- graph.data.frame(d=Edges_Data,directed=T)
Edges_Data[,1]<-as.character(Edges_Data[,1])
Edges_Data[,2]<-as.character(Edges_Data[,2])
inner_degree <- degree(graph_Data, mode= "in", loops=T)
outer_degree <- degree(graph_Data, mode="out", loops=T)
k <- threshold*inner_degree
k2<- threshold*outer_degree
dist_in <- inner_degree - k
dist_out <- outer_degree - k2
node_list <- data.frame(d_in=dist_in, d_out=dist_out)
while (any(node_list$d_in>=0)){
non <- node_list$d_in>=0
min_val <- min(node_list$d_in[non])
remove <- node_list$d_in==min_val
neigh <- row.names(node_list[remove,])
m <- Edges_Data[Edges_Data$[,2] %in% neigh,]
m$V1 <- as.character(m$V1)
m$V2 <- as.character(m$V2)
node_list <- node_list[!remove,]
if (nrow(m)>0){
m1 <- m[1]
no_neg <- m1[m1>0]
neg <- m1[m1<=0]
node_list$d_in[row.names(node_list) %in% neg] <- -Inf
node_list$d_in[row.names(node_list) %in% no_neg] <- node_list$d_in[row.names(node_list) %in% no_neg]-1
}
}
viral_nodes<- rownames(node_list)
return(viral_nodes)
}
result <- tipp_model(data,0.8)
resulting_viral_nodes <- data.frame(A=result,B=result)
END;
```

1.- On the Facebook social network we analyzed a small network with 4,039 nodes and 88,234 edges and obtained the result: 170 viral nodes

2.- On the Twitter social network we analyzed a network with 70,097 nodes and 2,420,766 edges and obtained the result: 942 viral nodes

3.- On data crawled over the Amazon website we analyzed a product co-purchasing network with 403,394 nodes and 3,387,388 edges and obtained the result: 193,743 viral nodes

4.- On the Google+ social network we analyzed a network with 72,271 nodes and 30,494,866 edges and obtained the result: 2,496 viral nodes

(In all the networks we assume that the threshold value is 80% of the node degree, so the viral nodes are a small set in all networks)

We observe that the seed group scales with the size of the dataset almost linearly, but also processing time increases. Many of the datasets we used are very simple and not highly clustered, so the result is a very small seed set. In a future post we’ll work on different applications of this implementation and improve the algorithm for better performance for high-clustered networks. The important conclusion is that R implementation with SAP HANA greatly improves efficiency and consistency of our data science algorithms.

REFERENCES

1. [1309.2963] A Scalable Heuristic for Viral Marketing Under the Tipping Model

2. Leskovec, J.: Stanford network analysis project (SNAP) (2012). URL http://snap.stanford.edu/index.html