Real Time Social Media Analysis in SAP Analytics Cloud
In this blog post you will learn how to build an end-to-end social media scenario using SAP HANA and SAP Analytics Cloud. This project would not have been possible without the technical expertise and creativity of David Probst, Daniel Gerdes, Fabian Filsinger & Lukas Brueggemann from SAP Consulting. Because consultants make the world a better place.
The objective of this Proof of Concept (PoC) was to establish a connection between Twitter and the SAP HANA data platform and then visualize the results in SAP Analytics Cloud. SAP HANA was used to extract generic data directly, calculate various key figures, and evaluate the data automatically with its built-in functions for sentiment analysis. Finally, the results of this analysis were presented in SAP Analytics Cloud in a clear and understandable way. The approach depicted below allowed us to run a sentiment analysis in SAP HANA and present the results in SAP Analytics Cloud in near real time. The following picture shows a concept for automated search (1), processing (2), analysis (3) and preparation of data for automatic display (4):
The first step of the sentiment analysis is data extraction, which requires a connection between SAP HANA and Twitter via the Data Provisioning Agent (DP Agent). SAP HANA then sends a request to the Twitter Search API via the DP Agent. The response is returned to SAP HANA, where the data is prepared and structured in tables so that only the relevant information remains. The sentiment analysis automatically starts with preprocessing, followed by feature extraction. The last step of the knowledge extraction is to create a calculation view that joins the tables and views created previously. This SAP HANA calculation view serves as the interface: it contains all the key metrics and can be visualized in SAP Analytics Cloud through a direct live connection.
How to fetch Twitter Data
SAP HANA can be used to search Twitter data for a specific keyword and to derive sentiments from the results. Twitter provides several interfaces for automated data access. However, only the Search API, the interface to Twitter's search function, offers the required scope of functions, and it is therefore our recommendation. The Search API can be used much like the native Twitter search, and the response to a request contains many Twitter metrics. Here are several useful ones to begin with:
• favorite count
• retweet count
The “favorite count” reflects the number of likes a tweet has received. The “retweet count” represents the number of times a tweet has been retweeted. Since both values give greater weight to a positive or negative tweet, they are essential for calculating the sentiment score. The metric “isolanguagecode” contains the language of the tweet as an ISO 639-1 code, already classified by Twitter. “Country” contains the name of the country of origin. The variable “userid” holds the ID of the author of the respective tweet. The ID of the tweet itself is contained in “id”. These two values can be used to calculate the number of distinct tweets and the number of distinct users who tweeted about the search term. “Tweet” contains the actual text of the tweet and thus forms the basis of the sentiment analysis.
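To illustrate how “id” and “userid” yield the two counts described above, here is a minimal Python sketch. The field names follow the metrics listed in this section; the sample records and the function name count_distinct are made up for illustration, and in the PoC itself this aggregation runs inside SAP HANA rather than in Python.

```python
from typing import Dict, List


def count_distinct(tweets: List[Dict]) -> Dict[str, int]:
    """Count distinct tweet IDs and distinct author IDs in a result set."""
    tweet_ids = {t["id"] for t in tweets}
    user_ids = {t["userid"] for t in tweets}
    return {"tweets": len(tweet_ids), "users": len(user_ids)}


# Hypothetical sample records, shaped like the metrics described above.
sample = [
    {"id": 1, "userid": 10, "isolanguagecode": "en",
     "favoritecount": 3, "retweetcount": 1, "tweet": "I like SAP"},
    {"id": 2, "userid": 10, "isolanguagecode": "en",
     "favoritecount": 0, "retweetcount": 0, "tweet": "TechEd was great"},
    {"id": 3, "userid": 11, "isolanguagecode": "de",
     "favoritecount": 5, "retweetcount": 2, "tweet": "SAP HANA ist schnell"},
]

print(count_distinct(sample))  # {'tweets': 3, 'users': 2}
```

Two of the three sample tweets share the same author, so the user count is lower than the tweet count.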
Connecting SAP HANA to Twitter
The Smart Data Access Data Store allows SAP HANA to access and display data from external systems in a so-called “virtual table”. A virtual table represents the tabular data of the external system in SAP HANA without creating a copy. As soon as the connection to the external system is terminated, the connection to the table is terminated as well. Access to external systems can also be established via so-called “virtual functions”. Like a virtual table, a virtual function represents an external system without creating a copy in SAP HANA. To access the Twitter APIs over a secure connection, the Smart Data Access Data Store uses the so-called “DP Agent”. It acts as a proxy and forwards the data to a Data Provisioning Server (DP Server) running on the SAP HANA platform. The DP Agent provides several preinstalled adapters that can connect to an external system. Once these adapters have been activated and registered on the DP Server of SAP HANA, the connection to the respective external system is established via the DP Agent. The Smart Data Access Data Store enables direct access to the Twitter data and, thanks to the nature of virtual tables and virtual functions, requires no additional storage capacity.
SAP HANA stores the values returned by the virtual function in a result table called “Result”. In the first step, the relevant data from the Result table is narrowed down in a view called “Reduced_Result”. The Reduced_Result view is the starting point of the sentiment analysis in SAP HANA. During preprocessing, the sentences of each tweet are split into words and phrases. These units are called “tokens”. The next step, feature extraction, is based on these tokens. They are analyzed at the aspect level, and each token is assigned a type (a sentiment or a label such as “organization”, “person”, etc.). The token types are stored in a separate column of the $TA table. The overview below shows all possible columns of the $TA table.
Set of Columns in Table $TA*
In our project we used only the following columns from this table:
• Key Column (ID)
The ID of the tweet from the source table is needed to link the calculation view to the original text. The TA_TYPE column contains the classification of the individual tokens. For example, the word “SAP” is recognized as “Organization” and the word “like” as “Weak Positive Sentiment”. To determine the sentiments, the following values of the TA_TYPE column are considered mood-relevant:
• Strong Positive Sentiment
• Weak Positive Sentiment
• Neutral Sentiment
• Weak Negative Sentiment
• Minor Problem
• Strong Negative Sentiment
• Major Problem
These types are used to calculate the sentiment score with the help of the so-called “Score_Mapping” table. For instance, the Strong Negative Sentiment (or Major Problem) token type is rated -2, and the Strong Positive Sentiment token type is rated +2. The Score_Mapping table is then joined to the reduced_$TA_RESULT view so that numeric values can be assigned to the types. The ratings of the tokens are summed per tweet and stored in the “Score” view.
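The mapping and per-tweet summation can be sketched in a few lines of Python. Only the ratings of -2 and +2 are taken from this post; the values for the weak, neutral and minor types are assumptions chosen for illustration, as are the sample tokens and the function name score_tweets. In the PoC, the same logic is implemented as a join and an aggregation in SAP HANA.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Stand-in for the Score_Mapping table. Only the +2 and -2 ratings are
# stated in the post; the remaining values are illustrative assumptions.
SCORE_MAPPING = {
    "Strong Positive Sentiment": 2,
    "Weak Positive Sentiment": 1,    # assumption
    "Neutral Sentiment": 0,          # assumption
    "Weak Negative Sentiment": -1,   # assumption
    "Minor Problem": -1,             # assumption
    "Strong Negative Sentiment": -2,
    "Major Problem": -2,
}


def score_tweets(tokens: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Sum the mapped ratings per tweet ID.

    Each token is a (tweet_id, ta_type) pair. Token types without an
    entry in SCORE_MAPPING (e.g. "Organization") are skipped, which
    mirrors an inner join against the mapping table.
    """
    scores: Dict[int, int] = defaultdict(int)
    for tweet_id, ta_type in tokens:
        if ta_type in SCORE_MAPPING:
            scores[tweet_id] += SCORE_MAPPING[ta_type]
    return dict(scores)


# Hypothetical tokens as they might appear in the $TA table.
tokens = [
    (1, "Organization"),
    (1, "Weak Positive Sentiment"),
    (2, "Major Problem"),
    (2, "Weak Negative Sentiment"),
    (3, "Organization"),
]

print(score_tweets(tokens))  # {1: 1, 2: -3}
```

Tweet 3 contains no mood-relevant token and therefore gets no row in the result, just as it would drop out of an inner join.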
Calculation views allow us to combine different tables and to build aggregations with aggregated columns. To calculate the key figures required in this PoC, the Score view is joined with the Reduced_Result view via the ID. As a result, each tweet receives a rating. Based on the result of the join, aggregated columns can be calculated from isolanguagecode, country, userid and id. Using the favoritecount, the retweetcount, and the number of tweets, a sentiment score can be calculated. The resulting calculation view exposes all analyzed key figures, makes them available to other applications, and completes the knowledge extraction.
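The post does not spell out the exact formula for the overall sentiment score, so the following Python sketch shows just one plausible weighting, not the formula used in the PoC. It assumes that each tweet's rating is multiplied by 1 + favoritecount + retweetcount and that the weighted sum is divided by the number of tweets. Tweets without a rating are treated as 0, which is also an assumption. The function name and sample data are hypothetical.

```python
from typing import Dict, List


def overall_sentiment(tweets: List[Dict], scores: Dict[int, int]) -> float:
    """Join tweets with their ratings via the ID and aggregate them.

    Assumed formula (not taken from the post):
        sum(score * (1 + favoritecount + retweetcount)) / number of tweets
    Tweets missing from `scores` count as 0 (left-join behaviour).
    """
    if not tweets:
        return 0.0
    weighted_sum = 0
    for t in tweets:
        score = scores.get(t["id"], 0)                    # join via the ID
        weight = 1 + t["favoritecount"] + t["retweetcount"]
        weighted_sum += score * weight
    return weighted_sum / len(tweets)


# Hypothetical rows of the Reduced_Result view and the Score view.
reduced_result = [
    {"id": 1, "favoritecount": 3, "retweetcount": 2},
    {"id": 2, "favoritecount": 0, "retweetcount": 0},
    {"id": 3, "favoritecount": 4, "retweetcount": 1},
]
score_view = {1: 1, 2: -3}

print(overall_sentiment(reduced_result, score_view))  # 1.0
```

In this example the well-liked positive tweet (rating 1, weight 6) outweighs the unnoticed negative one (rating -3, weight 1), giving (6 - 3) / 3 = 1.0.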
Presentation of Results in SAP Analytics Cloud
Finally, the calculation view can be used to build dashboards in SAP Analytics Cloud via a direct live SAP HANA connection. Below are our final results. For more details on how to establish this connection, see: https://www.sapanalytics.cloud/guided_playlists/sap-hana/
SAP Analytics Cloud Dashboard with key Twitter metrics of the word “SAP”
Sentiment Analysis in SAP Analytics Cloud of the word “SAP” and “TechEd” based on Twitter Data
This concept demonstrates that SAP HANA can be used to access Twitter data generically and to run a sentiment analysis at no additional charge. However, for productive deployment, the existing text analysis dictionary must be extended manually. Furthermore, standard access to the Twitter APIs is quite limited: the Search API is restricted to a look-back period of at most 7 days, a maximum of 1,500 tweets per query, and a fixed number of queries overall. Popular topics, such as political discussions about Brexit, can generate 1,500 tweets within minutes. For this reason, we recommend Twitter's fee-based interfaces for productive use of this concept. The approach presented above remains unchanged and can be followed without modifications.
* Mamidela, Dilip (2017). SAP HANA TA – Text Analysis. SAP Blogs. URL: https://blogs.sap.com/2017/05/21/sap-hana-ta-text-analysis/