On using Data Services for Twitter data sentiment analysis
In this blog I will discuss using Data Services for consumer sentiment analysis of the data collected from Twitter using JSONAdapter.
As discussed in the first blog entry, the Twitter Search API has been accessed with the word ‘cityrail’ as the search term. For those not in the know, Cityrail is the train network of the Sydney (Australia) metropolitan area. It was a very obvious target: with a relatively big customer base, it was guaranteed to produce enough unstructured data. Over time (2-3 months) Data Services collected a few thousand tweets supposedly related to Cityrail.
It is worth elaborating on the data collection process. In one request-response session, the Twitter Search API returns up to 100 most recent tweets. Provided that the number of tweets about Cityrail does not exceed 100 in any 15-minute interval (that assumption has been confirmed), a Data Services Batch Job running every 15 minutes can collect all such tweets almost in real time.
I would also like to mention a conversation I had with a colleague recently. He wondered if JSONAdapter could help to obtain a large number of tweets for analysis instantly, and if the Twitter Streaming API might help with that. The answer to the last question seems to be negative: once you open a Stream, Twitter feeds entries there in real time, but those are current entries only, not historical ones; they can be grouped into time slots or used in another ETL process by Data Services. The Streaming API should be used, for example, when the number of tweets on some topic exceeds, say, 100 per minute, so that the Batch Job described above may not be able to cope even if executed every minute.
Otherwise, the only difference between the Streaming and Search APIs is that the Streaming API provides raw data, while Search applies some extra filtering/ranking by relevance to the search term. In fact, it is possible to build a Data Services job that gets historical results by executing consecutive requests to the Search API deeper and deeper into the past (by restricting the TweetID field in a request). The process would not be instant, though: it would probably run for 1-2 hours (consider it an Initial Load), and it is hard to tell how far into the past it can go.
The bottom line is: if there is an immediate need to analyze historical data, you may have to contact Twitter’s partner data providers. Otherwise, JSONAdapter may help to start collecting the data and implement (near) real-time analysis.
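The reasoning behind the 15-minute schedule can be sketched in a few lines of Python. This only illustrates the logic and is not the Data Services job itself: `fetch_latest` is a hypothetical stand-in for the Search API call made through JSONAdapter, and the tweet layout (dicts with an integer `id`) is assumed.

```python
PAGE_LIMIT = 100  # maximum number of tweets returned per request, as stated above


def collect_new(fetch_latest, last_seen_id):
    """Run one polling cycle.

    fetch_latest is a hypothetical callable returning the most recent tweets
    (at most PAGE_LIMIT of them) as dicts with an integer 'id'.
    Returns (new_tweets, new_last_seen_id, may_have_missed).
    """
    page = fetch_latest()
    new_tweets = [t for t in page if t["id"] > last_seen_id]
    # If a full page consists of new tweets only, older unseen tweets may have
    # fallen off the end of the page: the polling interval is too long.
    may_have_missed = len(page) == PAGE_LIMIT and len(new_tweets) == PAGE_LIMIT
    newest = max((t["id"] for t in new_tweets), default=last_seen_id)
    return new_tweets, newest, may_have_missed
```

As long as `may_have_missed` stays false, the assumption of fewer than 100 tweets per interval holds for the chosen schedule.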
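A minimal sketch of such a backward walk, in Python, might look as follows. `fetch_page` is a hypothetical callable standing in for one Search API request restricted by tweet ID; the page layout and the `max_pages` safety limit are assumptions made for illustration.

```python
def backfill(fetch_page, max_pages=50):
    """Walk backwards through search results, page by page.

    fetch_page(max_id) is a hypothetical callable returning up to one page of
    tweets with id <= max_id (or the newest page when max_id is None),
    as dicts with an integer 'id'.
    """
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break  # the API has nothing older to offer
        collected.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(t["id"] for t in page) - 1
    return collected
```

The loop stops either when a request comes back empty, which is the point where the API refuses to go further into the past, or when the page limit is reached.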
The further discussion will be around the following points:
- text parsing using Data Services,
- ‘noise’ reduction,
- Topic-Sentiment links rebuilding,
- the sarcasm problem (no pun intended!).
Text parsing itself is simpler than one might expect. A special Transform in Data Services v.4.0, called Entity Extraction, parses the input unstructured text and extracts entities and facts about them. Its output is a number of ID’ed records containing one entity/fact each, accompanied by location attributes (paragraph number, sentence number and offset) and categorized according to the rules specified in the Transform options.
Provided out of the box are a dictionary of categorized entities and a few rulesets for fact extraction. They are located in the TextAnalysis folder of a standard Data Services installation (availability for use is subject to the license: either full DI+DQ or DI Premium). One of those rulesets, Voice Of Customer (VOC), is used for this work. SAP allows customization of rulesets (at your own risk, of course) and implementation of user-defined dictionaries and rules. SAP has also published several blueprints, which can be used as a starting point for new text analysis developments. For this blog, a blueprint for sentiment analysis has been used. It does the following:
- parse the incoming unstructured text into Topic and Sentiment entities using the Voice Of Customer ruleset: for example, the phrase “I like apples” would be parsed into Topic=apples and Sentiment=like (accompanied by the Sentiment Type ‘StrongPositive’),
- process Topics data and put Topics into groups, to enable some measures, like number of Sentiments per group, and ensure topics like ‘apple’ and ‘apples’ would fall into the same group,
- build a SAP BusinessObjects Universe on top of resulting tables, to enable WebI reporting with slice and dice capabilities.
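To illustrate the grouping step in the list above, here is a rough Python sketch. It is not the Data Quality matching logic used by the blueprint: it swaps in `difflib.SequenceMatcher` from the standard library, and the similarity threshold of 0.8 is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def group_topics(topics, threshold=0.8):
    """Greedily assign each topic to the first group whose label is similar enough."""
    groups = {}  # group label -> list of member topics
    for topic in topics:
        key = topic.lower().strip()
        for label in groups:
            if SequenceMatcher(None, key, label).ratio() >= threshold:
                groups[label].append(topic)
                break
        else:
            # No existing group is close enough: the topic starts a new group.
            groups[key] = [topic]
    return groups
```

With this sketch, ‘apple’ and ‘apples’ end up in the same group, so the number of Sentiments can be counted per group rather than per spelling.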
A few important changes have been made to the blueprint design to deal with Twitter data, the first one covering the issue of noise elimination. For starters, the blueprint assumed the original text to be in plain English; in reality, tweets constitute quite a lingo, full of abbreviations, expletives of all kinds, and with incomplete grammar. In the upcoming SP release for Data Services, SAP makes an attempt to keep in touch with social media and adds new entities for trend/hashtag and for handle/mention. That did not seem enough, so a custom ruleset has been implemented and added to the Entity Extraction transform to detect and mark words that should be excluded from further processing. The picture below demonstrates the options of the transform, including two out-of-the-box rulesets followed by the custom one:
This way, if an entity is extracted as both e.g. Topic and a (custom-defined) Blather type of entity, it will be detected in a simple join:
The following screenshot displays a sample output from the transform ‘Blather’; the second column contains the entity extracted from the original tweet and categorized as an expletive:
All such entities would be filtered out, thus clearing the output of most of the ‘yucks’, ‘hells’, and ‘lols’. There is one more use for those, which will be discussed a couple of paragraphs below.
Noise may also occur on the macro level. Tweet analysis is different from the blueprint in one more way: while the blueprint assumed the source text to be completely relevant (for example, customer feedback on the imaginary Swiftmobile, collected in a separate folder), tweets don’t have to be. Filtering tweets by a word X returns not just customers’ views on X, but all aspects of people’s lives that somehow involve X and that people care to write about. The amount of such macro-noise in the ‘cityrail’ selection is, actually, small, but in a selection for, say, ‘westfield’ (a major chain of shopping malls in Australia) it becomes much bigger, for obvious reasons. A possible way to further filter the results would be to use a predefined list of topics specific to the bigger theme.
By default, the output of the Entity Extraction transform looks like what might be called a ‘spaghetti’ type of data, i.e. it doesn’t care about relationships between Topics and Sentiments. While that may be sufficient for simple counts, reporting a Sentiment against the Topic it refers to requires relating the two. Assuming that a related topic and sentiment should be located close to each other in a sentence, it’s possible to derive Topic-Sentiment pairs from the ParentID and Offset fields of the Entity Extraction transform output:
This design obviously ignores topics not accompanied by sentiments and vice versa, though those could still be added to the reporting data model.
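The filtering described above boils down to an anti-join. A minimal Python sketch of the same logic follows; the record layout (dicts with `tweet_id`, `offset`, `text` and `type`) is assumed for illustration and is not the actual transform output schema.

```python
def drop_blather(entities):
    """Remove every entity that was also extracted as a 'Blather' entity.

    Two rows are treated as the same word when they share the tweet and the
    offset within it; that matching key is an assumption of this sketch.
    """
    blather_keys = {
        (e["tweet_id"], e["offset"]) for e in entities if e["type"] == "Blather"
    }
    return [
        e
        for e in entities
        if e["type"] != "Blather"
        and (e["tweet_id"], e["offset"]) not in blather_keys
    ]
```

A Topic row for ‘lol’ at offset 12 of some tweet is dropped when a Blather row exists for the same tweet and offset, while the other Topics of that tweet survive.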
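The pairing idea can be sketched in Python as follows. The row layout is a simplification assumed for illustration (dicts with `parent_id`, `offset`, `kind` and `text`), as is the rule that each Sentiment is attached to the single nearest Topic sharing its parent.

```python
from collections import defaultdict


def pair_topics_with_sentiments(rows):
    """Pair each sentiment with the nearest topic that shares its parent_id."""
    by_parent = defaultdict(lambda: {"Topic": [], "Sentiment": []})
    for row in rows:
        if row["kind"] in ("Topic", "Sentiment"):
            by_parent[row["parent_id"]][row["kind"]].append(row)

    pairs = []
    for group in by_parent.values():
        for sentiment in group["Sentiment"]:
            if not group["Topic"]:
                continue  # sentiment without a topic: ignored, as noted above
            nearest = min(
                group["Topic"],
                key=lambda t: abs(t["offset"] - sentiment["offset"]),
            )
            pairs.append((nearest["text"], sentiment["text"]))
    return pairs
```

Topics without any Sentiment under the same parent never appear in the result, which matches the limitation mentioned above.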
A preview of the ‘raw’ tweets in the database revealed that tweets mostly expressed negative feedback on Cityrail: people tend to complain more often than praise, and, by the way, I wrote the first draft of this blog on the day and hour when a Cityrail equipment failure caused a major suspension of service and delays. Therefore, it was surprising to see significant ‘StrongPositiveSentiment’-related numbers in the reporting. The reason was that many tweets were sarcastic and should not have been taken literally; their actual sentiment was the opposite of their literal meaning. So, if a tweet is deemed sarcastic, its positive Sentiment should be reversed, while its negative Sentiment still counts as is.
Sarcasm detection in user feedback is, apparently, a much bigger problem with no general solution. Even a human cannot detect sarcasm perfectly (one study reported 73% accuracy), as familiarity with the context is often required. Given the ability of Data Services to run Python scripts in User-Defined Transforms, one might attempt to build sarcasm detection in Data Services based on a few published approaches, using not only words but also markers of emotion: emoticons, the ‘blather’ words discussed above, words highlighted with ‘*’ (as in “I *love* when trains go slowly in rainy weather”) or enclosed in quotation marks, and, of course, the hashtag #sarcasm. The coincidence of negative and positive (rather, strong positive, in terms of the VOC ruleset) sentiments or, rather, emotions in one tweet is also a potential sarcasm marker. The last one, actually, can be implemented with regular Data Services ETL:
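The rule in the last sentence above can be written down as a tiny Python function. Sentiment type names other than ‘StrongPositiveSentiment’ are assumptions here, and the mapping only illustrates the rule; it is not taken from the VOC ruleset.

```python
# Only 'StrongPositiveSentiment' is named in the text; the other type names
# are assumed for the purpose of this sketch.
INVERTED = {
    "StrongPositiveSentiment": "StrongNegativeSentiment",
    "WeakPositiveSentiment": "WeakNegativeSentiment",
}


def effective_sentiment(sentiment_type, tweet_is_sarcastic):
    """Flip positive sentiments of sarcastic tweets; leave everything else alone."""
    if tweet_is_sarcastic:
        return INVERTED.get(sentiment_type, sentiment_type)
    return sentiment_type
```

Negative types are absent from the mapping on purpose, so a negative Sentiment in a sarcastic tweet is counted unchanged.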
The results below could have been slightly better if VOC knew that ‘con’ is a short form for ‘conditioner’, not a negative sentiment expression. Some extra customization of the dictionary may be required.
Implementing the full scope of the sarcasm detection functionality outlined above, however, seems to be a project of its own and is beyond this blog.
Setting up reports on the analysed data was not a primary goal of this work, so the SAP blueprint’s approach of using a BusinessObjects Universe was adopted. The original plan was to use SAP BW BEx reporting, but as storage of texts longer than 60 characters in BW InfoProviders is not trivial, the idea was discarded.
Consumer sentiment is quantified here by counting the number of feedback items; restricted measures have been created for each feedback type. The screenshot below demonstrates how Data Quality grouped topics into larger groups using fuzzy matching logic:
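Still, a few of the simpler markers mentioned above are easy to prototype. The following Python sketch checks only three of them (the #sarcasm hashtag, ‘*’ emphasis, and mixed strong positive and negative sentiments); it is not a full detector, and it assumes that sentiment type names contain ‘StrongPositive’ and ‘Negative’.

```python
import re

EMPHASIS = re.compile(r"\*\w+\*")  # words highlighted like *love*


def sarcasm_markers(text, sentiment_types):
    """Return the set of sarcasm markers found in one tweet.

    sentiment_types is the collection of sentiment type names extracted from
    the tweet; the naming pattern of those types is assumed.
    """
    markers = set()
    if "#sarcasm" in text.lower():
        markers.add("hashtag")
    if EMPHASIS.search(text):
        markers.add("emphasis")
    has_strong_positive = any(s.startswith("StrongPositive") for s in sentiment_types)
    has_negative = any("Negative" in s for s in sentiment_types)
    if has_strong_positive and has_negative:
        markers.add("mixed_sentiment")
    return markers
```

How many markers should be present before a tweet is deemed sarcastic is left open here; that decision would need tuning against manually labelled tweets.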
A drilldown into a group is then possible, like below:
An extra characteristic, time, has been added to reporting as an obvious choice: there is a clear correlation between the number of ‘cityrail’ tweets and the morning/afternoon transport peak hours. One might think of implementing a “rolling total negative sentiment” over a 30-minute window and raising an alert if that value exceeds some threshold.
Lastly, beyond consumer sentiment analysis, another obvious idea would be to geocode tweets using either geolocation information (GPS coordinates) from tweet metadata, or geographical names from the tweets themselves (post-processing is required for the latter, of course, to eliminate noise). The geocoded data could be made available for visualization in BusinessObjects or provided to a GIS product like ArcGIS for use in spatial analyses.
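The rolling total idea can be sketched in Python with a simple sliding window. The class name and the default threshold of 20 are assumptions made for illustration; a real threshold would have to be derived from the collected data.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingNegativeCount:
    """Count negative sentiments seen within a sliding time window."""

    def __init__(self, window=timedelta(minutes=30), threshold=20):
        self.window = window
        self.threshold = threshold  # assumed value, to be tuned on real data
        self.events = deque()  # timestamps of negative sentiments, oldest first

    def add(self, timestamp):
        """Register one negative sentiment; return True if an alert is due.

        Timestamps are expected in non-decreasing order.
        """
        self.events.append(timestamp)
        cutoff = timestamp - self.window
        # Discard events that have dropped out of the window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) > self.threshold
```

Each negative Sentiment is fed to `add` with the timestamp of its tweet, for example `RollingNegativeCount().add(datetime(2012, 1, 1, 8, 0))`, and the return value tells whether the window total has exceeded the threshold.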
– Roman Bukarev