Detecting World Cup GOAL using Twitter and SAP HANA
I am a software developer who wants to learn to develop applications on HANA. Yes, I have done HANA-related courses at openSAP and gone through several videos provided by SAP HANA Academy, and they are excellent resources for getting familiar with HANA. However, I found myself wanting to learn more. Rather than going through HANA guides without a useful context, I challenged myself to develop a scenario/application that utilises HANA.
I came across this blog describing a strategy to detect World Cup goals using Twitter data, so I thought to myself: why not build this on HANA, since all major components (except the Raspberry Pi) are available and can be easily configured in HANA? The major components are:
- Connection with Twitter API using HANA XS Outbound Connection
- Full-text indexes for text analysis to find GOAL
- Exponential smoothing algorithm to produce smoothed version of tweets data
- SAPUI5 for presentation
After that, I will compare the result with actual events and I will describe some improvements that can be made.
My source code is available here.
- SAP HANA SPS6 or above is needed to make an XS outbound connection. I am using SAP HANA developer edition rev 72, available in SAP CAL. To get your own developer edition, refer to http://scn.sap.com/docs/DOC-28294
- A Twitter application created in your Twitter account. Refer to the following link for more information: http://scn.sap.com/docs/DOC-49203
1. Connection with Twitter API – HANA XS Outbound Connection
To get data from Twitter into HANA, I need an XSJS outbound connection, an XSJS service to store tweets in the database, and an XSJS scheduler to run the service on a regular basis.
XSJS Outbound connection – twitter_connection.xshttpdest
XSJS service – TwitterCollector.xsjs
In the service, I want to collect tweets that contain “#BELvsUSA”, which is the hashtag suggested by Twitter for the Belgium vs USA soccer game.
If you are familiar with the Twitter API, you may be wondering why my code does not use HTTPS, since Twitter only allows HTTPS. That is because I don’t have access to download the SAP cryptographic library (sapgenpse), which is the prerequisite for making an HTTPS outbound connection with HANA XS (reference). So I created a PHP service to act as an adapter to Twitter. I would have made an HTTPS connection directly to the Twitter API if I had access to the library.
2. Full-text indexes for text analysis
Once I have the tweet data, I use the full-text indexes that I have created on it for text analysis. I use LINGANALYSIS_FULL to break the tweets down into individual words. For example, user @juanvofficial tweeted “Well deserved goal by the Belgium team #BELvsUSA”. The index lists each word of this tweet as a separate entry.
Why do I need to break the tweet down into words? It just makes it easier for me to exclude text that does not indicate an actual goal. For example, @SmileyElie tweeted “#USA has the best goalkeeper ever!!! So impressed!!! #BELvsUSA”. The tweet contains “goal”, but it does not indicate that one of the teams has scored.
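To illustrate why word-level matching helps, here is a minimal Python sketch. It is not the HANA full-text index itself: the `tokens` and `mentions_goal` helpers are hypothetical names, and the simple regular-expression split only approximates what LINGANALYSIS_FULL does.

```python
import re

def tokens(text):
    """Lower-case a tweet and split it into word tokens (rough stand-in for the index)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def mentions_goal(text):
    """True only when 'goal' appears as a word of its own, not inside another word."""
    return "goal" in tokens(text)

print(mentions_goal("Well deserved goal by the Belgium team #BELvsUSA"))                # True
print(mentions_goal("#USA has the best goalkeeper ever!!! So impressed!!! #BELvsUSA"))  # False
```

A plain substring search for "goal" would have flagged both tweets, while the token comparison only flags the first one.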
3. Exponential smoothing algorithm
The measure I chose to indicate whether a goal has happened is the percentage of tweets mentioning goal:
No of tweets mentioning goal per minute * 100 / No of tweets per minute
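As a rough illustration of this formula outside HANA, the Python sketch below groups tweets by minute and computes the percentage. The function name and the sample tweets are made up for the example and are not taken from the collected data.

```python
from collections import Counter
from datetime import datetime

def goal_percentage_per_minute(tweets):
    """tweets: iterable of (created_at, mentions_goal) pairs. Returns {minute: percentage}."""
    total, with_goal = Counter(), Counter()
    for created_at, mentions_goal in tweets:
        minute = created_at.replace(second=0, microsecond=0)  # truncate to the minute
        total[minute] += 1
        if mentions_goal:
            with_goal[minute] += 1
    return {m: with_goal[m] * 100 / total[m] for m in sorted(total)}

# Made-up sample: four tweets in the first minute, two in the second.
sample = [
    (datetime(2014, 1, 1, 12, 0, 5), False),
    (datetime(2014, 1, 1, 12, 0, 20), True),
    (datetime(2014, 1, 1, 12, 0, 41), False),
    (datetime(2014, 1, 1, 12, 0, 59), False),
    (datetime(2014, 1, 1, 12, 1, 2), True),
    (datetime(2014, 1, 1, 12, 1, 30), False),
]
for minute, pct in goal_percentage_per_minute(sample).items():
    print(minute.strftime("%H:%M"), f"{pct:.1f}%")
# 12:00 25.0%
# 12:01 50.0%
```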
Once I have all the tweets that mention GOAL, I want to apply an exponential smoothing algorithm, which is commonly used to remove noise from data. Let me first show you a line chart of the percentage of tweets mentioning goal per minute before applying the algorithm.
I chose the single exponential smoothing algorithm, which is available in the SAP HANA Predictive Analysis Library (PAL). After applying the algorithm, the result is shown as the green line, which is smoother than the original graph.
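For readers who have not met the technique, this is a minimal Python sketch of the textbook single exponential smoothing recurrence, s_0 = x_0 and s_t = alpha * x_t + (1 - alpha) * s_(t-1). It is not the PAL procedure, which may differ in initialisation and indexing, and the alpha value and input numbers are assumptions chosen only for illustration.

```python
def single_exponential_smoothing(values, alpha):
    """Return the smoothed series for the given smoothing factor 0 < alpha <= 1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # start from the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Made-up percentages per minute with one spike.
print(single_exponential_smoothing([2, 4, 8], alpha=0.5))  # [2.0, 3.0, 5.5]
```

A smaller alpha gives a smoother line but reacts more slowly to a sudden spike, which matters when the spike is the event being detected.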
4. SAPUI5 for presentation
Oh well, you can see that the graphs above are SAPUI5 graphs.
Now let’s compare our graphs with the actual events. There were 3 goals during the Belgium vs USA game in the round of 16 of World Cup 2014.
If I translate the timeline onto our graph, the goals happened at the bold green points in the graph below, and after a goal happens, the percentage of tweets mentioning goal increases to more than approximately 8%.
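One way to turn this observation into a rule is to flag the minutes where the percentage rises above the threshold. The Python sketch below is only an assumption about how such a rule could look: the function name, the minute labels and the numbers are made up, and the 8% default simply mirrors the value read off the graph.

```python
def goal_candidates(series, threshold=8.0):
    """series: list of (minute_label, percentage). Returns labels where the
    percentage crosses from at-or-below the threshold to above it."""
    hits = []
    above = False
    for label, pct in series:
        if pct > threshold and not above:
            hits.append(label)  # first minute of a new spike
        above = pct > threshold
    return hits

# Made-up series with two separate spikes.
series = [("m1", 2.0), ("m2", 9.0), ("m3", 12.0), ("m4", 3.0), ("m5", 10.0)]
print(goal_candidates(series))  # ['m2', 'm5']
```

Reporting only the upward crossing avoids counting one goal several times while the percentage stays high.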
The first goal by Belgium in extra time definitely got people excited, because a lot of people had been expecting a goal. The percentage went up again because of the second goal by Belgium, but there weren’t as many tweets as before. Twitter was busy again after the first goal by USA in the second half of extra time. This definitely raised hope for USA to turn the game around. Unfortunately, USA could not score another goal and lost the game.
1. Twitter API
I collect the data from search/tweets, which is a REST API resource that provides relevant tweets from a limited corpus of recent tweets. To get more complete search results, the Streaming API should be used, but it requires a more complex solution. However, for the purpose of this exercise, I did not want to put too much effort into getting the tweets and, as you can see from the graph above, the tweets collected from the REST API give sufficient information to indicate when a goal happens.
2. Text analysis
Improvements can be made when performing the analysis, especially if you are working with data in different languages, e.g. golazo means goal in Spanish, and there could be many variations in other languages.
There is also a limitation in the text analysis I chose. For example, @Guevara_Caro tweeted “Were it not for Howard and Beasley, #USA would be losing by a couple of goals! #BEL has a great team. Let’s pick it up!! #GoUSA #BELvsUSA”. Obviously, the tweet contains the word goal, but the sentence as a whole does not indicate that a goal has happened. Regular expressions (which are available in R and other programming languages, but not in HANA SQL in the revision I am using) should be used to capture goal in tweets more accurately.
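As a hedged sketch of what such a regular expression could look like, the Python example below matches "goal", stretched spellings such as "GOOOOAL", and the Spanish "gol"/"golazo" as whole words, while ignoring "goals" and "goalkeeper". The pattern is my own assumption, not something taken from the application, and skipping the plural is a trade-off: it also misses tweets such as "two goals in extra time".

```python
import re

# Whole-word match for goal / gooooal / gol / golazo, case-insensitive.
GOAL_PATTERN = re.compile(r"\b(?:g+o+a+l+|g+o+l+(?:azo)?)\b", re.IGNORECASE)

def mentions_goal(text):
    """True when the tweet contains a goal-like word on its own."""
    return GOAL_PATTERN.search(text) is not None

print(mentions_goal("GOOOOAL!!! #BELvsUSA"))                      # True
print(mentions_goal("Golazo de Belgica #BELvsUSA"))               # True
print(mentions_goal("#USA would be losing by a couple of goals!"))  # False
print(mentions_goal("#USA has the best goalkeeper ever!!!"))      # False
```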
The various components available in HANA make it easier for me to obtain unstructured data, perform analysis on it and present the results using SAPUI5. In this case, goals can be identified using Twitter data. The solution is far from perfect, and the 8% threshold may not indicate a goal in other World Cup matches, but I am hoping to gain more analytical knowledge and explore more interesting scenarios in the near future using SAP HANA.