I am very lazy when it comes to social networks. I would love to have thousands of followers on Twitter, but I don’t have the will to tweet often enough to grow that number. Even so, I regularly check my Twitter account hoping that some new follower has magically come my way, and when one does, I feel like I accomplished something. I know it’s silly, but I can’t help it. Anyway, I wonder if I could use the SAP HANA Predictive Analysis Library (PAL) to see who my next follower will be.

SAP introduced many new features with the release of SPS06, and one of those is the Link Prediction algorithm in PAL. Predicting links in social networks is nothing new; it has been around for many years. The algorithm tries to answer the following question: given a snapshot of a social network, can we predict which new interactions among its members are likely to occur in the near future? This is commonly known as the link prediction problem, and there are multiple approaches based on measures of the “proximity” of the different nodes in a network.

When we say social networks, we don’t only mean Twitter or Facebook; the idea also applies to, for example, the employees of a company. The algorithm is also often used in fraud prevention to detect missing nodes (fraudsters) in criminal networks.

As I said, there are multiple ways to approach the link prediction problem. PAL implements four different methods to compute the distance between any two existing nodes using the existing links in a network:

  • Common Neighbours
  • Jaccard’s Coefficient
  • Adamic/Adar
  • Katz
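
To give a rough feel for what these four measures compute, here is a minimal Python sketch based on their textbook definitions. It is not PAL's implementation: the function names, the undirected treatment of links, the toy edge list, and the truncation of Katz to short walks (`max_len`) are all my own assumptions for illustration, and PAL may compute or scale the scores differently.

```python
import math
from collections import defaultdict


def build_neighbours(edges):
    """Build undirected neighbour sets from (a, b) pairs (an assumption: links have no direction)."""
    nbrs = defaultdict(set)
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def common_neighbours(nbrs, x, y):
    """Number of nodes linked to both x and y."""
    return len(nbrs[x] & nbrs[y])


def jaccard(nbrs, x, y):
    """Shared neighbours divided by all neighbours of either node."""
    union = nbrs[x] | nbrs[y]
    return len(nbrs[x] & nbrs[y]) / len(union) if union else 0.0


def adamic_adar(nbrs, x, y):
    """Shared neighbours, weighted so that rarely connected ones count more.

    Assumes x != y, so every shared neighbour has at least two links and log() > 0.
    """
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[x] & nbrs[y])


def katz(nbrs, x, y, beta=0.005, max_len=4):
    """Truncated Katz score: sum of beta**l times the number of walks of length l from x to y."""
    walks = {x: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        extended = defaultdict(int)
        for node, count in walks.items():
            for neighbour in nbrs[node]:
                extended[neighbour] += count
        walks = extended
        score += beta ** length * walks.get(y, 0)
    return score


# Toy network with made-up node ids
toy_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
toy = build_neighbours(toy_edges)
cn_score = common_neighbours(toy, 1, 4)
jc_score = jaccard(toy, 1, 4)
aa_score = adamic_adar(toy, 1, 4)
katz_score = katz(toy, 1, 4, beta=0.1, max_len=3)
```

In this toy network, nodes 1 and 4 are not linked but share the neighbours 2 and 3, so all four measures give the pair a positive score.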

I’m not going to get into the details of how the different methods work; for that you can take a look at the PAL User Guide. Instead, I’m going to get my hands on it ;).

I want to predict my next Twitter follower, so the first thing I need to do is download data from Twitter that I can use to train the algorithm. For that I’m going to use Python and, more specifically, a Python library called Tweepy, which is basically a wrapper around the Twitter API.

First we need to set up Python to be able to connect to HANA. If you don’t know how to do this, you can take a look at this wonderful post by Blag that shows how to do it.

Then we need to download and install Tweepy (https://github.com/tweepy/tweepy).

Now that we are all set, we can start downloading data from Twitter. I’m going to create a Column Table in HANA to store the data.

CREATE COLUMN TABLE LINK_PREDICT( FOLLOWER INTEGER, FOLLOWING INTEGER );

First I’m going to download my Followers List by running the following Python Script. I don’t have a lot of followers so this will only take a couple of seconds.

import tweepy
import dbapi

consumer_key = "..."  # Your Consumer Key
consumer_secret = "..."  # Your Consumer Secret
access_token = "..."  # Your Access Token
access_token_secret = "..."  # Your Access Token Secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

con = dbapi.connect('hana_host', 30015, 'SYSTEM', 'password')
cur = con.cursor()

my_user_id = 'userid'  # Replace userid with your numeric Twitter User ID
# Each of my followers becomes one (FOLLOWER, FOLLOWING) row pointing at my account
for user in tweepy.Cursor(api.followers_ids, screen_name="LukiSpa").items():
    cur.execute("INSERT INTO LINK_PREDICT VALUES(?,?)", (user, my_user_id))
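
If you want to try the insert logic without a HANA system at hand, the sketch below runs the same parameterised INSERT against an in-memory SQLite database. SQLite is only a stand-in I picked because it ships with Python and accepts the same `?` placeholders; the helper name `store_followers` and the ids are made up, and whether the HANA driver behaves identically for `executemany` and `commit` is an assumption.

```python
import sqlite3


def store_followers(cur, follower_ids, my_id):
    """Insert one (FOLLOWER, FOLLOWING) row per follower id and return the row count."""
    rows = [(int(follower_id), int(my_id)) for follower_id in follower_ids]
    cur.executemany("INSERT INTO LINK_PREDICT VALUES(?,?)", rows)
    return len(rows)


# In-memory SQLite database standing in for the HANA connection
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE LINK_PREDICT(FOLLOWER INTEGER, FOLLOWING INTEGER)")

# Made-up follower ids; 1 plays the role of my own numeric user id
inserted = store_followers(cur, [101, 102, 103], my_id=1)
con.commit()

stored = cur.execute(
    "SELECT FOLLOWER, FOLLOWING FROM LINK_PREDICT ORDER BY FOLLOWER"
).fetchall()
```

Converting the ids with `int()` up front means a leftover text placeholder fails immediately instead of ending up in an INTEGER column.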

Now I would like to get the followers of my followers, and for that I’m going to run the Python script below. Beware that Twitter limits the number of requests you can make to the API. To avoid exceeding that limit and getting an error message, I wait 60 seconds before making a new call to the API. That means this code can run for quite a long time, so I would suggest running it overnight.

import tweepy
import dbapi
import time

consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

con = dbapi.connect('hana_host', 30015, 'SYSTEM', 'password')
cur = con.cursor()

# Read back the followers stored by the previous script
cur.execute("SELECT FOLLOWER FROM LINK_PREDICT")
rows = cur.fetchall()
for row in rows:
    ids = []
    for page in tweepy.Cursor(api.followers_ids, id=row[0]).pages():
        ids.extend(page)
        time.sleep(60)  # Wait between API calls to stay under the rate limit
    # Store every follower of this follower as a (FOLLOWER, FOLLOWING) row
    for user in ids:
        cur.execute("INSERT INTO LINK_PREDICT VALUES(?,?)", (user, row[0]))
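
The wait-between-pages pattern used above can be pulled into a small helper, which also makes it testable without actually sleeping. This is only a sketch: `throttled` and `collect_ids` are names I made up, and the fake pages stand in for what `tweepy.Cursor(...).pages()` would yield.

```python
import time


def throttled(pages, delay_seconds=60, sleep=time.sleep):
    """Yield each page, then pause before the next one is requested."""
    for page in pages:
        yield page
        sleep(delay_seconds)


def collect_ids(pages, delay_seconds=60, sleep=time.sleep):
    """Flatten all pages of ids into one list, pausing after each page."""
    ids = []
    for page in throttled(pages, delay_seconds, sleep):
        ids.extend(page)
    return ids


# Demo with made-up pages; the pauses are recorded instead of actually slept
pauses = []
fake_pages = [[11, 12], [13]]
collected = collect_ids(fake_pages, delay_seconds=60, sleep=pauses.append)
```

Passing the sleep function as a parameter is just a convenience so that a dry run finishes instantly; with the default `time.sleep` the helper waits exactly like the script above.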

And finally, I want to download my Followings plus the Followings of my Followers (besides me):

import tweepy
import dbapi
import time

consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

con = dbapi.connect('hana_host', 30015, 'SYSTEM', 'password')
cur = con.cursor()

# Every account that has followers stored so far: me and my followers
cur.execute("SELECT DISTINCT FOLLOWING FROM LINK_PREDICT")
rows = cur.fetchall()
for row in rows:
    ids = []
    for page in tweepy.Cursor(api.friends_ids, id=row[0]).pages():
        ids.extend(page)
        time.sleep(60)  # Wait between API calls to stay under the rate limit
    # Store every account this user follows as a (FOLLOWER, FOLLOWING) row
    for user in ids:
        cur.execute("INSERT INTO LINK_PREDICT VALUES(?,?)", (row[0], user))
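
One thing to watch out for: since my followers follow me, this last script can store pairs that the first script already inserted, so the table may contain the same link more than once. Whether PAL minds duplicate rows is not something I checked. If you prefer to clean up before inserting, a minimal sketch (the function name and sample ids are made up) could look like this:

```python
def unique_edges(rows):
    """Drop repeated (follower, following) pairs, keeping the first-seen order."""
    seen = set()
    result = []
    for follower, following in rows:
        edge = (follower, following)
        if edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result


# Made-up rows: the pair (101, 1) was downloaded twice
downloaded = [(101, 1), (102, 1), (101, 1), (1, 200)]
edges = unique_edges(downloaded)
```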

Now I’m ready to run the Link Prediction algorithm. I wanted to run it using the AFM (Application Function Modeler), but for some reason this algorithm is not available in the tools palette. I’m not sure if this is a bug or something wrong with my PAL installation (any comments here will be much appreciated), so I will need to do it the old way.

First I create the procedure by calling the AFL Wrapper Generator:

SET SCHEMA MYSCHEMA;
DROP TYPE PAL_LP_DATA_T;
CREATE TYPE PAL_LP_DATA_T AS TABLE("FOLLOWER" INTEGER, "FOLLOWING" INTEGER);
DROP TYPE PAL_LP_RESULT_T;
CREATE TYPE PAL_LP_RESULT_T AS TABLE("FOLLOWER" INTEGER, "FOLLOWING" INTEGER, "SCORE" DOUBLE);
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE( "NAME" VARCHAR(100), "INT_ARGS" INTEGER, "DOUBLE_ARGS" DOUBLE, "STRING_ARGS" VARCHAR(100));
DROP TABLE PAL_LP_PDATA_TBL;
CREATE COLUMN TABLE PAL_LP_PDATA_TBL( "ID" INTEGER, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
INSERT INTO PAL_LP_PDATA_TBL VALUES (1,'MYSCHEMA.PAL_LP_DATA_T','in');
INSERT INTO PAL_LP_PDATA_TBL VALUES (2,'MYSCHEMA.PAL_CONTROL_T','in');
INSERT INTO PAL_LP_PDATA_TBL VALUES (3,'MYSCHEMA.PAL_LP_RESULT_T','out');
CALL SYSTEM.afl_wrapper_generator('PREDICT_FOLLOWER','AFLPAL','LINKPREDICTION', PAL_LP_PDATA_TBL);

And then I execute the procedure with the data I downloaded from Twitter:

SET SCHEMA MYSCHEMA;
DROP TABLE #PAL_CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL LIKE PAL_CONTROL_T;
INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('METHOD', 1, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('BETA', null, 0.005, null);
DROP TABLE LP_RESULT;
CREATE COLUMN TABLE LP_RESULT LIKE PAL_LP_RESULT_T;
CALL _SYS_AFL.PREDICT_FOLLOWER(LINK_PREDICT, #PAL_CONTROL_TBL, LP_RESULT) with overview;
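
Once the call has finished, LP_RESULT holds one scored row per node pair in the (FOLLOWER, FOLLOWING, SCORE) layout defined above. As a rough sketch of how such rows could be ranked after fetching them into Python, the helper below reads them the way I do in this post: rows with my id in FOLLOWING are candidate followers, rows with my id in FOLLOWER are accounts I might follow. The function name, the sample rows, and the assumption that links I already have are not part of the result are mine.

```python
def rank_for(rows, me, top=5):
    """Return (candidate follower ids, suggested following ids), best score first."""
    to_me = sorted(
        (row for row in rows if row[1] == me), key=lambda row: row[2], reverse=True
    )
    from_me = sorted(
        (row for row in rows if row[0] == me), key=lambda row: row[2], reverse=True
    )
    return [row[0] for row in to_me[:top]], [row[1] for row in from_me[:top]]


# Made-up result rows in (FOLLOWER, FOLLOWING, SCORE) order; 1 is my own id
sample_rows = [(7, 1, 3.0), (8, 1, 5.0), (1, 9, 2.0), (4, 5, 9.0)]
next_followers, should_follow = rank_for(sample_rows, me=1)
```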

Let’s take a look at the results

/wp-content/uploads/2013/09/1_272355.png

Hmmm, it seems I will have one new follower. Let’s see on tweeterid.com who he/she is:

/wp-content/uploads/2013/09/2_272356.png

@atul_vaikul, I have no idea who you are but I’m here waiting mate! :)

We went through all this trouble to find my next follower, but that’s not all: the results also show who I should be following.

/wp-content/uploads/2013/09/3_272357.png

/wp-content/uploads/2013/09/4_272358.png

@SAPCommNet is the Twitter account of SCN; I was surprised that I didn’t already follow it. The same goes for @SAPinMemory, almost a no-brainer to follow, and @JohannesSchnatz is in fact blogging a lot about SAP and SAP HANA. I don’t really share his interest in SAP HCM (and fishing), but we are both guitar players, it seems!

Hope you liked it!

Follow me on Twitter: @LukiSpa (especially you, @atul_vaikul)

Information in Spanish about SAP HANA™:

www.HablemosHANA.com

21 Comments

  1. Dolly Mishra

    Tweepy” :(

    Hi all, it is indeed a nice article. I have used twitter4j and Java to get tweets into HANA in one of my POCs.

    This time, however, I am trying the Python Twitter API.

    I downloaded tweepy-1.2 and installed it successfully using python setup.py install.

    However, when I finally run my Python script I end up with an

    “ImportError: No module named tweepy” error.

    I already searched the forums but found nothing.

    Any help will be really appreciated.

    Thanks…

  2. sumit bisht

    This is one of those blogs that inspire you to look into technology, as it presents an end-to-end use case succinctly. Thank you again Lucas (@LukiSpa)!

  3. Tharindu Fernando

    Genius!!!

    The potential of such logic is imperative to an amazing (and mind-boggling!) future.

    Reading your post inspired me to get into deep thinking about random scenarios and all the factors affecting each; and about how SAP HANA could be used to make related predictive analyses.

    I apologize if my comment appears to be unstructured, but… the power of HANA is simply overwhelming, and you did a great job illustrating a part of its real-world uses!

    Tharindu Fernando
