# Sharknado Social Media Analysis Using SAP Predictive Analysis
In July, SyFy premiered the latest addition to its B-movie arsenal, the insta-hit Sharknado, which, obviously, is about a tornado full of sharks raining terror on Los Angeles. I would explain more of the plot, but when the tagline of the movie is “ENOUGH SAID!”, I guess the creators thought the title was pretty self-explanatory.
Sharknado quickly became a social media sensation, with the #sharknado hashtag trending worldwide. While the initial airing generated somewhat disappointing viewership numbers compared to predecessors like Sharktopus (yeah, Shark + Octopus) and Dinoshark (…), the social media chatter generated so much excitement and buzz that SyFy was able to schedule second and third airings of the film, with increasing viewership each time, and negotiate a limited release into theaters. Clearly, social media was incredibly influential here, transforming this disappointing B-movie starring Tara Reid and Ian Ziering into an instant cult classic that will return huge dividends to SyFy in ratings, advertising, and theatrical deals.
It is therefore important to collect, analyze, monitor, and react to social media data quickly, which allows an organization to leverage this powerful tool. Mining social media data for customer feedback is perhaps one of the greatest untapped opportunities for customer analysis in many organizations today. Social media data is freely available and allows organizations to personally identify and potentially interact directly with customers to resolve any potential dissatisfaction. In today’s blog post, I’ll discuss using SAP Predictive Analysis to visualize and analyze social media data related to Sharknado.
## Visualization and Analysis of #Sharknado Data
For this analysis, I used SAP Data Services to collect over 33,000 tweets related to the topic “sharknado” over a period of days and to perform entity extraction and sentiment analysis on them. In all, over 200,000 individual entities were extracted from these tweets. A natural first step is generating descriptive charts to explain the nature of these extracted entities and tweets. The chart below shows an area chart of all the entities extracted from the tweets by category. Twitter hashtags were the most commonly identified entities, followed by sentiments, Twitter users, topics, and organizations. The depth of color indicates the tweet-level average sentiment, showing that tweets with topic entities have the highest (most positive) overall sentiment, while tweets with hashtags were much less positive.
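As a sketch of the preparation behind this kind of chart, the counts and average sentiments per entity category can be tabulated from the entity-level output. This pandas example uses made-up rows and column names, not the actual SAP Data Services output schema:

```python
import pandas as pd

# Hypothetical entity-level extract: one row per extracted entity, with
# the category assigned by text analysis and the sentiment score of the
# tweet it came from (0 = strongly negative, 1 = strongly positive).
entities = pd.DataFrame({
    "category":  ["HASHTAG", "HASHTAG", "SENTIMENT", "TOPIC", "TOPIC", "PERSON"],
    "sentiment": [0.40, 0.45, 0.50, 0.80, 0.70, 0.55],
})

# The two measures behind the area chart: entity count per category
# (area size) and average tweet-level sentiment (depth of color).
summary = (entities.groupby("category")["sentiment"]
           .agg(count="size", avg_sentiment="mean")
           .sort_values("count", ascending=False))
print(summary)
```

In the real data set, the same two aggregates would simply run over the 200,000+ extracted entities.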
Here are a few other fast facts on the Sharknado tweets I collected:
- 38% of the tweets collected included a retweet from another user
- 41% of tweets had a topic entity extracted from the text
- 7.5% of tweets had a location entity within the tweet text
- 45% of tweets had a sentiment entity identified in the text
- 54.5% of tweets had 5 or more entities extracted from the text
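Each of these fast facts is just the share of tweets where a given flag is set; a minimal pandas sketch, assuming a tweet-level table with hypothetical column names:

```python
import pandas as pd

# Hypothetical tweet-level table: one row per tweet, with flags derived
# from the text-analysis output (all names and values are illustrative).
tweets = pd.DataFrame({
    "is_retweet":    [True, False, True, False],
    "has_topic":     [True, True, False, False],
    "has_location":  [False, False, False, True],
    "has_sentiment": [True, False, True, False],
    "entity_count":  [6, 3, 8, 2],
})

# Each "fast fact" is the mean of a boolean column, i.e. the share of
# tweets where the flag is set.
facts = {
    "retweets":       tweets["is_retweet"].mean(),
    "topic":          tweets["has_topic"].mean(),
    "location":       tweets["has_location"].mean(),
    "sentiment":      tweets["has_sentiment"].mean(),
    "5plus_entities": (tweets["entity_count"] >= 5).mean(),
}
for name, share in facts.items():
    print(f"{name}: {share:.1%}")
```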
The chart below shows a histogram of tweets by the length of the tweet text. Tweets most commonly fall right around the 140-character limit, with about 25% of tweets at 135 characters and above. I’ve also plotted the average tweet sentiment (on a scale of 0 to 1, with 0 as strongly negative, 0.5 as neutral, and 1 as strongly positive) as a line graph; while there is a slight increase in average sentiment between 0 and 45 characters, the average sentiment is relatively steady across all tweet lengths.
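The histogram and sentiment line can be derived by bucketing tweets into fixed-width length bins; a small pandas sketch with illustrative values:

```python
import pandas as pd

# Illustrative data: tweet length in characters and tweet-level sentiment
# score (0 = strongly negative, 0.5 = neutral, 1 = strongly positive).
df = pd.DataFrame({
    "length":    [30, 60, 100, 135, 138, 140],
    "sentiment": [0.45, 0.50, 0.55, 0.50, 0.60, 0.50],
})

# Bucket tweets by length, then compute the two plotted series:
# tweet volume per bin (bars) and mean sentiment per bin (line).
bins = pd.cut(df["length"], bins=range(0, 141, 35))
hist = df.groupby(bins, observed=True)["sentiment"].agg(
    volume="size", avg_sentiment="mean")
print(hist)
```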
Now, we can start to examine the individual entities extracted from the tweets and the sentiments associated with each entity. For example, we can pull the Person entities identified by the text analysis into a word cloud, shown below. The word cloud shows the most common entities (larger size) and the sentiment associated with each person entity (depth of color).
This shows that Tara Reid, Cary Grant, Tatiana Maslany, Ian Ziering, and Steve Sanders were the most commonly identified person entities, with Tatiana Maslany and Tara Reid appearing in tweets with higher average sentiments. Tara Reid and Ian Ziering are actors who appeared in Sharknado, and Steve Sanders was Ian Ziering’s character in Beverly Hills, 90210, but I was confused by the appearance of Cary Grant, whom Wikipedia identifies as an English actor of debonair demeanor who died in 1986, and Tatiana Maslany, a lesser-known Canadian actress, neither of whom appeared in Sharknado. Further filtering the tweet text for these particular entities, I found an extremely high retweet frequency for two influential tweets:
@TVMcGee: #Sharknado is even more impressive when you realize Tatiana Maslany played all the different sharks.
@RichardDreyfuss: People don’t talk about it much in Hollywood (omertà and everything) but Cary Grant actually died in a #sharknado
The entity “impressive” was considered strongly positive for Tatiana Maslany, while “n’t talk” was considered a minor problem for the Cary Grant tweet. Further analysis can be done to identify popular characters and portions of the movie, which the Sharknado filmmakers can mine to identify the characters, plots, or topics to revisit in the already-approved sequel to Sharknado (coming Summer 2014).
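The inputs for a word cloud like the ones above reduce to a frequency and an average sentiment per entity. A plain-Python sketch with made-up sentiment values (the real ones come from the text analysis):

```python
from collections import Counter
from statistics import mean

# Hypothetical (person entity, tweet sentiment) pairs from the extract.
mentions = [
    ("Tara Reid", 0.7), ("Tara Reid", 0.8), ("Ian Ziering", 0.5),
    ("Cary Grant", 0.3), ("Cary Grant", 0.35), ("Tatiana Maslany", 0.9),
]

# Word-cloud inputs: font size ~ mention count, color ~ average sentiment.
counts = Counter(name for name, _ in mentions)
avg_sentiment = {
    name: mean(s for n, s in mentions if n == name) for name in counts
}
for name, count in counts.most_common():
    print(name, count, round(avg_sentiment[name], 2))
```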
Similarly, investigating the location entities shown in the word cloud below, we can see that the most common references were to Texas and Hollywood, with tweets about Texas being more positive than those about Hollywood.
And similarly, the organizations identified by text analysis show that SyFy (the channel that brought you Sharknado) and the phrase Public Service Announcement, as well as Lego and Nova, were common in tweets, as shown in the word cloud below.
The SyFy and Public Service Announcement phrases were found in a retweeted tweet that showed up numerous times in the data:
@Syfy: Public Service Announcement: #Sharknado will be rebroadcast on Thurs, July 18, at 7pm. Please retweet this important information.
Other organization entities that bubbled up in the analysis included:
- Nova: a character in the film who may have met an untimely end, which apparently did not elicit positive sentiments.
- Lego: the term “lego” was included in a commonly re-tweeted tweet of a picture of a sharknado made of Legos.
## Predictive Analysis on #Sharknado Data
After summarizing and visualizing the data, I can leverage SAP Predictive Analysis’s Predict pane to build and evaluate models using predictive algorithms. We can further summarize tweet data across multiple numeric characteristics using a clustering algorithm. Clustering is an unsupervised learning technique and one of the most popular segmentation methods; it creates groups of similar observations based on numeric characteristics. In this case, the numeric characteristics available are the length of the tweet, the number of entities extracted from the tweet, and binary flags for the presence of a topic or a sentiment entity. While binary variables are not technically appropriate for a clustering model, I’m including them here to increase the complexity of the model and make the results more interesting.
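For readers without Predictive Analysis, the same kind of segmentation can be sketched with k-means in scikit-learn. The feature values below are illustrative; standardizing first keeps tweet length from dominating the distance calculation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix, one row per tweet:
# [tweet length, entity count, has-topic flag, has-sentiment flag].
X = np.array([
    [ 30, 2, 0, 0], [ 45, 3, 0, 1], [ 80, 5, 1, 0],
    [ 90, 6, 1, 1], [135, 9, 1, 1], [140, 8, 1, 1],
], dtype=float)

# Standardize so tweet length does not dominate the distance metric,
# then fit k-means with three clusters, as in the post's model.
scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(labels)
```

With real data, the cluster sizes and centroids (rather than per-tweet labels) are what you would inspect to characterize the segments.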
The clustering model results show three groups of tweets, roughly separated by size: Cluster 3 contains the short tweets, Cluster 1 the longer tweets, and Cluster 2 falls in between. This clustering model does show us that longer tweets were more likely to have more entities identified by the text analysis and were more likely to contain both a sentiment and a topic.
While this is an extremely simple example, with additional descriptive statistics, we could cluster tweets according to sentiment and occurrences of key phrases or words; if the organization could link these tweet segments to customer satisfaction or other key metrics (such as referrals generated through social media buzz or calls to a customer service center), monitoring the frequency of tweets by segment would be a great nearly real-time leading indicator of viral buzz, customer complaints, or referral business.
Another potential application for predictive models would be attempting to estimate the impact of tweet characteristics on the sentiment value of the tweet. In this case, I’ve arbitrarily determined that a tweet with an average sentiment of 0.4 or higher is “Positive”. I can then use the R-CNR Decision Tree algorithm or a custom R function for Logistic Regression (see this previous blog on Custom R Modules) to predict which elements are most indicative of positive tweets. In order to compare these models, I use a filter transform to remove tweets without sentiments. Then, I configure the Logistic Regression and R-CNR Tree modules to include all my descriptive data, including tweet length, number of entities extracted, and the presence of location and topic entities.
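As a rough stand-in for the PA workflow, here is how the same comparison might look with scikit-learn’s logistic regression and decision tree (made-up feature rows; the 0.4 sentiment cutoff matches the one described above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Illustrative feature matrix, one row per tweet:
# [tweet length, entity count, has-location flag, has-topic flag],
# and the target: 1 if average tweet sentiment >= 0.4 ("Positive").
X = np.array([
    [140, 8, 1, 1], [120, 6, 0, 1], [135, 9, 1, 1], [100, 5, 1, 0],
    [ 40, 2, 0, 0], [ 55, 3, 0, 1], [ 30, 1, 0, 0], [ 70, 4, 0, 0],
], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Fit both model types and compare them on AUC, the accuracy measure
# reported for the models in the post.
models = {
    "logistic": LogisticRegression(max_iter=1000).fit(X, y),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y),
}
auc = {name: roc_auc_score(y, m.predict_proba(X)[:, 1])
       for name, m in models.items()}
print(auc)
```

On this toy data the models fit nearly perfectly; on the real tweets, scoring a held-out sample rather than the training data would give an honest AUC comparable to the one reported below.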
Once this predictive workflow has been run, I can review the logistic regression and decision tree results.
### Logistic Regression results
These model output charts show that the logistic regression model is not terribly predictive, with an AUC (area under the ROC curve) of only 0.598 (AUC varies from 0 to 1, with a baseline of 0.5 and values closer to 1 indicating more accurate predictions).
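As a quick illustration of the AUC scale, a scorer that ranks every positive tweet above every negative one reaches 1.0, while a constant (uninformative) score lands exactly on the 0.5 baseline:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Actual labels for eight hypothetical tweets (1 = positive).
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# A perfect ranking puts every positive above every negative (AUC = 1.0);
# a constant score cannot rank at all (AUC = 0.5, the baseline).
perfect = y.astype(float)
constant = np.full(len(y), 0.5)
print(roc_auc_score(y, perfect), roc_auc_score(y, constant))
```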
This chart shows that there is a slight increase in predicted average sentiment (red line) across the actual average tweet sentiment (x axis). Blue bars represent tweet volume for each level of average sentiment. Ideally, the red line would be approximately diagonal from bottom left to top right.
### Decision Tree results
The decision tree results show that the model is able to identify large pockets of tweets that are much more likely to be positive.
In summary, the models show potential to distinguish tweet positivity based on tweet content characteristics. These models could be further enhanced with more Sharknado-related model attributes, such as whether the tweet mentioned specific plot points, emotions, or characters. In these preliminary models, the results suggest that having a location entity, a longer tweet length, and the presence of a retweet contribute to positive sentiments. Perhaps this suggests that people are more likely to retweet positive tweets than negative ones? Developers of the Sharknado sequel, which has already been approved for 2014, could determine which specific aspects of the film were most positively and negatively received by the audience and incorporate these concepts into the sequel.
See my full blog post on SAP BI Blog for more details on the data collection, EIM, and text analysis process.