Ian Henry

Creating a Rugby World Cup Sentiment Tracker

With the Rugby World Cup now on, I decided to put some of the SAP kit bag to the test.

Update, 25-Sept-2015: We have updated the Lumira Cloud visualisations and added a few more screenshots below.

The latest output of this *should* be automatically republished daily at 22:00 BST to Lumira Cloud, allowing you to interact with it.

http://tiny.cc/RWCTweets

Rugby Tweet Analysis v2.png

During the first 7 days of the tournament I have already captured over 1.4 million tweets from the #RWC2015 Twitter feed. I hope to keep the data capture running throughout the tournament.

In this example I have used:

1. Smart Data Integration (SDI) within SAP HANA to acquire the tweets from Twitter in real time from the #RWC2015 feed

2. SAP HANA to store and process the data

3. Text Analysis to turn Tweets into a structured form

4. Text Mining to identify Relevant Terms

5. SAP HANA Studio to model

6. SAP Lumira Desktop to create some analytics

7. SAP Lumira Cloud to expose the output

1. Data Acquisition through the SDI Data Provisioning Agent

Since HANA SPS 09, Smart Data Integration has been built directly into HANA. One of the data provisioning (DP) sources available is Twitter. I won’t repeat the steps to set up the DP agent here, as Bob has created a great series of SAP HANA Academy videos covering this setup:

SAP HANA Academy – Smart Data Integration/Quality : Twitter Replication Pt 1 of 3 [SPS09] – YouTube

With the Twitter remote source available in HANA, you can create a virtual table and replicate the tweets in real time by issuing the following SQL.


SET SCHEMA HANA_EIM;

-- Create the SDA virtual table on the Twitter remote source
CREATE VIRTUAL TABLE "HANA_EIM"."RWC_R_STATUS" AT
"TWITTER"."<NULL>"."<NULL>"."status";

-- Create a target table with the same structure
CREATE COLUMN TABLE "HANA_EIM"."RWC_T_STATUS" LIKE "HANA_EIM"."RWC_R_STATUS";

-- Create the subscription, keeping only tweets that contain #RWC2015
CREATE REMOTE SUBSCRIPTION "HANA_EIM"."rt_trig1"
AS (SELECT * FROM "HANA_EIM"."RWC_R_STATUS" WHERE "Tweet" LIKE '%#RWC2015%')
TARGET TABLE "HANA_EIM"."RWC_T_STATUS";

-- SELECT * FROM "HANA_EIM"."RWC_T_STATUS";
-- TRUNCATE TABLE "HANA_EIM"."RWC_T_STATUS";

-- Queue the subscription and start streaming
ALTER REMOTE SUBSCRIPTION "HANA_EIM"."rt_trig1" QUEUE;
ALTER REMOTE SUBSCRIPTION "HANA_EIM"."rt_trig1" DISTRIBUTE;

-- Check how many tweets have arrived
SELECT COUNT(*) FROM "HANA_EIM"."RWC_T_STATUS";

-- Stop the subscription
-- ALTER REMOTE SUBSCRIPTION "HANA_EIM"."rt_trig1" RESET;










The target table holds the raw tweets coming in from Twitter.

Twitter Data Table.png

Twitter provides a number of columns; the Tweet column itself is the most useful of these for this analysis.

Twitter Table Definition.png

With the data now being acquired automatically, it’s possible to monitor the acquisition via the XS Monitoring URL: http://ukhana.mo.sap.corp:8000/sap/hana/im/dp/monitor/?view=DPSubscriptionMonitor

DPSubscriptionMonitor.png

3. Text Analysis

As I previously described in “Using Custom Dictionaries with Text Analysis in HANA SPS9” for Formula One Twitter analysis, creating custom dictionaries for your subject area is very easy.

I’ve added one that includes the rugby teams with their Twitter handles and short names. This new dictionary was included in a new configuration.

HANA Web IDE.png
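
Conceptually, a custom dictionary maps each variant of a name (short name, Twitter handle) to one standard form. The Python sketch below mimics that lookup with a plain dict. It is only an analogy for what the dictionary achieves, not how Text Analysis is implemented; the entries are examples and the function name normalise is made up for illustration.

```python
# Example variants mapped to a standard form. A real dictionary would list
# every team; these entries are illustrative only.
VARIANTS = {
    "ENG": "England",
    "EnglandRugby": "England",
    "England": "England",
}


def normalise(token):
    """Return the standard form for a known variant, or None if unknown.

    A leading '@' or '#' is stripped so that handles and hashtags match.
    """
    return VARIANTS.get(token.lstrip("@#"))


print(normalise("@EnglandRugby"))
print(normalise("ENG"))
print(normalise("scrum"))
```

In HANA the equivalent result shows up in the TA_NORMALIZED column of the $TA table.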

To turn on Text Analysis for the acquired Twitter data, use the following syntax:


CREATE FULLTEXT INDEX "RWC-TWEETS" ON "HANA_EIM"."RWC_T_STATUS"("Tweet")
CONFIGURATION 'RWC::RUGBY_SOCIAL_CONFIG'
FAST PREPROCESS OFF
LANGUAGE COLUMN "isoLanguageCode"
LANGUAGE DETECTION ('EN','FR','DE','ES','ZH','IT')
TEXT ANALYSIS ON
TEXT MINING ON
FUZZY SEARCH INDEX ON;








Text Analysis is really clever and identifies some useful elements beyond the basics (Who, Where, When, etc.). The more advanced output is often known as fact extraction. Of these “facts”, Sentiment, Emotion and Requests are three that could potentially be useful in the Rugby Tweet data.
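
To make the sentiment output concrete, here is a minimal Python sketch of how the sentiment types that Text Analysis writes to the TA_TYPE column could be turned into a single number per tweet for charting. This is not part of the HANA setup described here. The type names follow the Voice of the Customer output (for example StrongPositiveSentiment); the numeric weights and the function name sentiment_score are assumptions made for illustration.

```python
# The TA_TYPE names come from the Text Analysis output; the numeric weights
# are an assumption for this sketch, not something HANA provides.
SENTIMENT_WEIGHTS = {
    "StrongPositiveSentiment": 2,
    "WeakPositiveSentiment": 1,
    "NeutralSentiment": 0,
    "WeakNegativeSentiment": -1,
    "StrongNegativeSentiment": -2,
}


def sentiment_score(ta_types):
    """Sum the weights of the sentiment types found for one tweet.

    Types that are not in SENTIMENT_WEIGHTS (entities, topics, emoticons)
    contribute nothing to the score.
    """
    return sum(SENTIMENT_WEIGHTS.get(t, 0) for t in ta_types)


# TA_TYPE values for a single made-up tweet.
example = ["StrongPositiveSentiment", "Sentiment", "Topic", "WeakNegativeSentiment"]
print(sentiment_score(example))
```

A different weighting, or a simple positive/negative count, would work just as well; the point is only that the extracted types are categorical and need some mapping before they can be aggregated.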

4. Text Mining the Tweets

Now I wanted to try something more than just sentiment, mentions and emotion. For this I decided to use Text Mining, which is also built into HANA and has been further enhanced in SPS10 with SQL access to the Text Mining functions. Activating Text Mining is very easy: it’s done when creating the full-text index, by specifying TEXT MINING ON as in the syntax above.

Text Mining has multiple capabilities that apply at the document level. For this I treated each Tweet as a document, which served its purpose, but as tweets are by nature very short you don’t gain much additional insight from document-level analysis.


SELECT *
FROM TM_GET_RELEVANT_TERMS (
DOCUMENT IN FULLTEXT INDEX WHERE "Tweet" like '%England%'
SEARCH "Tweet" FROM "HANA_EIM"."RWC_T_STATUS"
RETURN
TOP 16
) AS T;










After investigating the Text Mining functions TM_GET_RELEVANT_TERMS and TM_GET_RELATED_TERMS with Twitter data, I found the core Text Analysis functions to be more than capable for my analysis purposes. If, however, I were analyzing news reports, blogs or documents, then Text Mining would be much more appropriate.

Text Mining Output.png

5. HANA Modelling

This piece took the longest and was fairly challenging, as you need to model the Tweets with the final output in mind. The model turns the structured $TA table into a format suitable for analysis in Lumira (or another BI tool) by identifying the entities and their relationships: Countries, Tweets and Sentiment.

I created 2 Calculation Views in HANA Studio. They are still a work in progress, but are sufficient to give some useful output.

I felt it easier to create 2 as they are at different levels of granularity. One is at the Country level, the other at the Country and Key Word level.
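
The two levels of granularity can be illustrated with a small Python sketch. It does not reproduce the Calculation Views; it only shows, on made-up rows, what aggregating at Country level versus Country and Key Word level means. The row layout (country, keyword, score) and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (country, keyword, sentiment score) per extracted mention.
rows = [
    ("England", "scrum", 2),
    ("England", "scrum", -1),
    ("England", "try", 1),
    ("Wales", "try", 2),
]


def aggregate(rows, key):
    """Return {group: (mention count, total score)} for the given key function."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        group = key(row)
        totals[group][0] += 1       # number of mentions in this group
        totals[group][1] += row[2]  # summed sentiment score
    return {group: tuple(vals) for group, vals in totals.items()}


# Coarse view: one line per country.
by_country = aggregate(rows, key=lambda r: r[0])
# Finer view: one line per country and key word.
by_country_keyword = aggregate(rows, key=lambda r: (r[0], r[1]))

print(by_country)
print(by_country_keyword)
```

Because the finer view has more rows per country, mixing both levels in one model would double count mentions, which is one reason to keep them as separate views.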

Text_Analysis_Calc_View_Annotated.png

Base Data in the $TA_RWC-TWEETS table

Screen Shot 2015-09-25 at 10.55.17.png

Selected output from the Projection_3 above

Screen Shot 2015-09-25 at 10.53.47.png

Aggregation_2 from the Calc View above, showing fields being used.

Screen Shot 2015-09-25 at 10.56.54.png

Text_Analysis_Words_CV_Annotated.png

6. SAP Lumira Desktop to create some visualisations

With the modelling and manipulation taken care of in HANA, using Lumira is then easy (although you can spend some time perfecting your final output).  Here we can build some visualisations as below and then encapsulate them into a story board.

Screen Shot 2015-09-23 at 10.34.32.png

My original visualisations have now been greatly enhanced by Daniel Davis into a great Lumira Story.

Daniel has also created an England Rugby wall chart, available for download from http://www.thedavisgang.com/

Screen Shot 2015-09-23 at 10.46.32.png

7. SAP Lumira Cloud

To share the output in an interactive way we can publish the visualisations, stories and dataset to SAP Lumira Cloud. There’s one crucial story option, “Refresh page on open”, which is OFF by default and is required to update the visualisations within the story. Set this to ON and the story also gets updated.

Lumira Desktop has a scheduling agent built in. Once enabled, it can automatically refresh and republish to Lumira Cloud.

I have set this to refresh the Rugby Tweet Analysis every day at 22:00.

Within Lumira Cloud we now need to make the story public; this is set under the Story options.

Lumira Cloud Share.png

Change Access.png

Public.png

We now have the URL which can be shared with others. For ease of consumption I created a short URL pointing to this long URL with http://tiny.cc/

To view the full interactive Lumira Story Board please use the link below:

http://tiny.cc/RWCTweets

Tweets over Time.png

      14 Comments
      Martin Chambers

      Very, very cool!

      A few questions:

      Does Twitter provide the sentiment data?

      Would it be possible to get a more detailed description of how you transformed the data?

      Regards,

      Martin

      Ian Henry
      Blog Post Author

      Thanks Martin, the feedback is most welcome.

      No, Twitter does not provide the Sentiment data.  For Analysis I am just taking the raw tweet and the date/time values.

      For the sentiment I used the Text Analysis "Voice of the Customer" linguistic capability.  This takes the raw tweet text and identifies the parts of speech, nouns, verbs, etc.  It then identifies entities such as people and organisations.  The more advanced capability it performs is the fact extraction, which includes 5 levels of sentiment: strong positive, weak positive, neutral, weak negative and strong negative.  As well as sentiment I am also looking for emotion, which also has 5 levels from strong positive to strong negative.  Text Analysis also identifies Requests and Minor and Major Problems, some of which I have included in the analysis performed.

      The output of Text Analysis is a $TA_ table which identifies each of the elements per tweet.  I think some additional screenshots would help explain this better?

      If there's something specific I'm happy to edit the blog and include the details you are looking for.

      Cheers, Ian.

      Martin Chambers

      Hi Ian,

      I'm considering implementing Twitter analysis as a showcase. I just hope I'm not biting off more than I can chew. Is your Formula One Twitter analysis less demanding?

      I come from the BW side and have been introduced to HANA only indirectly via BW on HANA. So far, I've only created a few HANA views and written some very, very basic SQL. But I do believe that in future BW will become more and more "HANAfied" (see B/4).

      Is the whole HANA Add-on EIM (=SDI + SDQ) required? I looked at the HANA Academy youtube videos and they only installed the DP agent.

      You mention HANA SPS10. Will SPS09 suffice?

      Could you add a screenshot of the table (or view) structures at the beginning and end of your dataflow?

      Cheers,

      Martin

      Ian Henry
      Blog Post Author

      Hi Martin,

      With analysis you should start with the end goal in mind. Choosing a subject that interests you makes this slightly easier, but understanding how you would like to present the data should be one of your first thoughts.  This is one of the challenges when using Twitter: deciding what to aggregate, and how.

      If you just want to use Text Analysis (TA) and Twitter data then you don't need to create custom dictionaries; TA has a great capability already.  I would pick a hashtag that interests you and start small; even just a few records makes for a great learning experience.  You can also use TA with any other unstructured data: description fields, comments, PDFs, web pages, etc.

      In terms of prerequisites you don't need the full SDI, SDQ capability, but I believe it is licensed with those. SP10 has some improved capability on Text Analysis and Text Mining, but SP09 would also be sufficient.

      I will add a couple more screenshots to show you what I have done.

      Thanks, Ian.

      Martin Chambers

      Hi Ian,

      Thanks for the additional screenshots. They were quite helpful. Unfortunately for you, I now have some further questions.

      1. Where did you install the DP agent, on your own PC?
      2. I see you've created your own text analysis config file "RUGBY_SOCIAL_CONFIG.hdbtextconfig". Which of SAP's config files did you use as a template?
        EXTRACTION_CORE or EXTRACTION_CORE_VOICEOFCUSTOMER?
        Did you change your copy?
      3. Does the text analysis fill the language column, i.e. detect that the language is English and write EN into isoLanguageCode? Or is this already filled coming from twitter?
      4. SDI contains a somewhat restricted Data Services. But it seems to me, that you only use SDI because of the Twitter extractor and then create an SDA datasource. Is that correct?
      5. My select on the TM_GET_RELEVANT_TERMS function results in an SQL error.
        Does this function work with SQL in SPS09?
      6. Creating the fulltext index worked, but the column TA_NORMALIZED remained empty, despite my adding my custom dictionary RWC2015 to a copy of the voiceofcustomer config file.

        <string-list-value>sap.hana.ta.config::RWC2015.hdbtextdict</string-list-value>

        Do custom dictionaries work in SPS09?
        Is data added to the Fulltext table $TA_RWC-TWEETS automatically when new data is added to the underlying table TWEET_STATUS?

        CREATE FULLTEXT INDEX "RWC-TWEETS" ON
        "MARCHAMB"."TWEET_STATUS"("TWEET")
        CONFIGURATION 'sap.hana.ta.config::RWC2015'
        FAST PREPROCESS OFF
        LANGUAGE COLUMN "ISO_LANGUAGE_CODE"
        LANGUAGE DETECTION ('EN','FR','DE','ES','ZH','IT')
        TEXT ANALYSIS ON
        TEXT MINING ON
        FUZZY SEARCH INDEX ON;

      Cheers,

      Martin

      Ian Henry
      Blog Post Author

      1. I have the DP Agent running on a Windows VM

      2. I used VOICEOFCUSTOMER as this contains the ability to understand sentiment.  I purely added my custom dictionary to the end.

      3. Twitter provides the language of a tweet, when creating the text index you can specify the column holding the language, ISO_LANGUAGE_CODE above.

      4. SDI contains most of the Data Services feature set.  In the example above I only used the DP Agent that comes with SDI, not SDI itself as I was not transforming any data.

      5. The Text Mining SQL interface was new in SPS10, so that won't work in SPS09.

      6. Yes, custom dictionaries have always been part of Text Analysis, and I have created them with SPS09.  Did you add your dictionary inside an existing <dictionary> entry? You can also check the preprocessor diagnosis file for more details.  Yes, the $TA is automatically updated as new data arrives. Do you have other entries in your $TA table?

      Martin Chambers

      Hi Ian,

          

      Thanks a lot for your detailed answers. Our basis guy has promised to install SDI next week. So I found an internet based, free solution which unfortunately only delivers exactly 50 tweets an hour. This makes time analysis rather pointless. You must have millions of tweets by now. Makes me quite envious. Sorry England crashed out. Or do you support another country?

      Everything else seems to be working. I have tried various SAP config files and got the expected results. It’s just my custom dictionary that keeps getting ignored!!

      I tried misspelling my dictionary name and this did produce a preprocessor error. But the correctly spelled dictionary name does not produce any preprocessor error.

      This is what my RWC2015_neu.hdbtextdict looks like.

      <?xml version="1.0" encoding="UTF-8"?>

        <dictionary xmlns="http://www.sap.com/ta/4.0" transient="true">

          <entity_category name="Rugby_Country">

            <entity_name standard_form="England">

                  <variant name="ENG"/>

                  <variant name="EnglandRugby"/>

                        …

            </entity_name>   

        </entity_category>

        </dictionary>

      This is what my copy of the VOC config file “RWC2015_neu.hdbtextdict” looks like. The red line (the <string-list-value> entry) is the line I added, right at the end of the file.

      <?xml version="1.0" encoding="utf-8" ?>

        <!--Standard text analysis configuration for comprehensive linguistic analysis

              plus sentiment analysis ("voice of the customer" extractions).

        -->

        <tasdk-configuration xmlns="http://www.sap.com/ta/config/4.0">

            <configuration name="SAP.TextAnalysis.DocumentAnalysis.Extraction.ExtractionAnalyzer.TF" based-on="CommonSettings">

         …..

      <!-- List of Text Analysis extraction dictionaries for Sentiment Analysis. -->

        <property name="Dictionaries" type="string-list">

        …..

      <string-list-value>sap.hana.ta.config::RWC2015.hdbtextdict</string-list-value>

      </property>

      </configuration>

      </tasdk-configuration>

      And this is the result of one tweet. As you can see, VOC works but ENG is NOT mapped to England in the column TA_NORMALIZED.

          

      1) Table TWEET_STATUS

      TWEET                                               | ISO_LANGUAGE_CODE
      The best. This is awesome. 🙂 I love ENG. #RWC2015  | EN

      2) Table $TA_RWC-TWEETS

      ID | TA_RULE           | TA_TOKEN                                 | TA_LANGUAGE | TA_TYPE                    | TA_NORMALIZED
      1  | Entity Extraction | RT                                       | en          | ORGANIZATION/MEDIA         | ?
      1  | Entity Extraction | JiffyRugby                               | en          | SOCIAL_MEDIA/ID_TWITTER    | ?
      1  | Entity Extraction | best                                     | en          | StrongPositiveSentiment    | ?
      1  | Entity Extraction | best                                     | en          | Sentiment                  | ?
      1  | Entity Extraction | This is awesome. 🙂 I love ENG. #RWC2015 | en          | Emoticon                   | ?
      1  | Entity Extraction | awesome                                  | en          | StrongPositiveSentiment    | ?
      1  | Entity Extraction | awesome                                  | en          | Sentiment                  | ?
      1  | Entity Extraction | 🙂                                       | en          | WeakPositiveEmoticon       | ?
      1  | Entity Extraction | love                                     | en          | StrongPositiveSentiment    | ?
      1  | Entity Extraction | love ENG                                 | en          | Sentiment                  | ?
      1  | Entity Extraction | ENG                                      | en          | Topic                      | ?
      1  | Entity Extraction | #RWC2015                                 | en          | SOCIAL_MEDIA/TOPIC_TWITTER | ?


      Ian Henry
      Blog Post Author

      Yes, I'm English so I have to look to the other home nations for interest. 

      I've got about 3.5 million tweets captured now.

      Does your RWC2015_neu.hdbtextdict open in an XML editor / web browser without any errors?

      My XML file does not have the transient="true" parameter, so perhaps try removing that.

      Did you create the custom dictionary and config with the web IDE?

      Martin Chambers

      Hi Ian,

      well, since the UK had four participants, you do have some choice. Perhaps you can watch Wales in the finals? I wonder, do you think a unified UK team would win the cup?

      Is there any chance I could lay my hands on your rugby tweets, via Dropbox or Google Drive? It's OK if you think this might prove too complex.

      Thanks for your tip about transient='true'. I think I copied it from one of SAP's Thesaurus files.

      Do you understand how they work? They only have standard forms, no variants.

      <dictionary xmlns="http://www.sap.com/ta/4.0" transient="true">

        <entity_category name="COR@Noun">

          <entity_name standard_form="call">

          </entity_name>

          <entity_name standard_form="calls">

          </entity_name>

      Anyway, deleting it solved my problem. It works now just fine.

      Cheers,

      Martin

      Ian Henry
      Blog Post Author

      Hi Martin,

      The Lions, well, they have performed well in the past, but then so have England! 🙂

      Happy to share the tweets if you ping me; my email is my name at sap.com

      I haven't used the thesaurus capability, but I know people that have, so I can follow up with them...

      Martin Chambers

      Hi Ian,

      Thanks for your kind offer. I will.

      Tried installing the SDI. Turned out I had installed the DP agent for SPS10 instead of SPS9.

      Can I just uninstall the DP agent? Or do I have to do lots of extra things?

      Marin VIDENOV

      This piece of work is absolutely fantastic!!!

      Well done, Ian!

      I'd love to deep-dive into such end-to-end, innovative and very compelling solutions to support my job (biz development, mostly by storytelling around digital transformation supported by data-driven, data-science-based innovative solutions).

      Some questions I have:

      1) How's this different/better from Hybris Marketing capabilities?

      2) In your opinion, is HANA usage absolutely needed and why?

      3) What do you think is the effort to build (from scratch) or reproduce your solution (with your help) in another similar environment?

      4) How easy would it be to augment your solution with additional predictive analytics capabilities?

      I'd absolutely love to leverage your great work, enrich it with additional predictive or visualization capabilities and use it to support innovation topics with my customers.

      I'd love to hear from you  

      Thanks,

      Marin

      Ian Henry
      Blog Post Author

      Hi Marin,

      Thanks for the kind comments, it's always good to receive any sort of feedback.

      1. I'm not an expert on what Hybris marketing provides, so I would need to discuss with someone who can brief me on what that does/doesn't provide.

      2. HANA does provide a lot here.  It is the platform.  Without HANA you have to resort to the traditional method.  Data Services to acquire the data, Data Services to transform the data for the Text Analysis, Data Services to re-structure the data for reporting.  It could be done but it would likely be more complex and take longer.  If you were developing from scratch it would definitely take longer.  HANA is a major productivity boost.

      3. If it is exactly the same this can be done in 1 day.

      4. I was thinking of using predictive, but in the short time I have thought about it I could not identify an appropriate use.  I have seen others cluster the tweets, but I wanted to use Country to drive most of the output and I struggled to see how this would fit together.  I am familiar with Predictive Analysis, PAL and the APL, so if you have a use case this could be performed quickly.

      vijaykumar ijeri

      Great work Ian! Loved it. I would like to try this as well. I will connect if I need some help.

      This is motivating work, cheers!

      Regards,

      Vijay