SAP PA and Twitter – Sentiment Analysis
Continuing from the previous post, we now explore Sentiment Analysis. First, let's talk about Sentiment Analysis and Text Mining and what exactly these terms mean. Wikipedia defines Sentiment Analysis as follows: "Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document." It is sometimes also called Opinion Mining, that is, extracting information from people's opinions. Opinions are usually expressed as text, so to do Sentiment Analysis we also need some knowledge of Text Mining. Text Mining, in the words of Hearst (1999), is "the use of large online text collections to discover new facts and trends about the world itself". Standard techniques are text classification, text clustering, ontology and taxonomy creation, document summarization and latent corpus analysis. We are going to use a combination of Sentiment Analysis and Text Mining in the example scenario discussed below.
Before I start, let me make it clear that this is only sample data, analyzed purely for the purpose of learning. It is not meant to target or influence any brand. The outputs and analysis shown here are based purely on opinions and should not be considered facts.
I downloaded some public opinion data regarding car manufacturers from the NCSI-UK website.
The data is from 2009-2013. My intention was simply to see what the public sentiment towards these manufacturers is on the social networking site Twitter, and to build a probable score for 2014 based on a sample Twitter population. The idea is just to see whether the scores are similar to those obtained in 2013.
The steps to do sentiment analysis using SAP PA and Twitter are shown below. The full code is shown at the end of this post.
1. Load the necessary packages. Also load the credential file that stores the credential information required to connect to Twitter; this credential file was created using the steps shown in the post referenced below (a rough sketch of the credential-creation commands is also included just before the code section). Finally, establish the handshake with Twitter.
2. Retrieve the tweets for each of the brands in our data set (nine in total) and save the information in a separate data frame for each car brand.
3. The next step is to analyse the retrieved tweets for negative and positive words. For this we use something called lexicons. As per the wiki definition, the word "lexicon" means "of or for words"; a lexicon is basically a dictionary, a collection of words. For our sentiment analysis we are going to use the lexicon of Hu and Liu, available at Opinion Mining, Sentiment Analysis, Opinion Extraction. The Hu and Liu lexicon is a list of positive and negative opinion words, or sentiment words, for English (around 6,800 words). We download the lexicon, save it on our local desktop, and load the files to create arrays of positive and negative words, as shown in the code. We can also append our own positive and negative words as required.
4. Now that we have arrays of positive and negative words, we need to compare them with the words in the tweets we obtained, assigning a score of +1 to each positive word in a tweet and -1 to each negative word. For example, a tweet containing "love" and "upgrade" but also "breakdown" would score +1 +1 -1 = +1. Each +1 is considered a positive sentiment and each -1 a negative sentiment.
The sum of the individual sentiment scores gives us the net sentiment for that brand. For this we require a sentiment scoring function; I have used the function as-is from the website below and give full credit to the author who created it. The function is not created by me.
How-To | Information Research and Analysis (IRA) Lab
5. After getting the sentiment score for each brand, the next step is to sum the scores and assign them to an array. We then bind this array to our original data set and use the resulting table to generate heat maps, as shown below:
Final Output with Sentiment Score
Histogram
Heat Maps
As we can see from the above analysis, although the industry score for one brand (Audi) is quite high, the current public sentiment is with another brand (Vauxhall) that had a lower overall industry score. This is just a basic analysis with 500 tweets. We can extend the analysis further by increasing the number of tweets and by creating a more advanced scoring function that uses other parameters such as region, time and historical data when calculating the final sentiment score.
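As one illustration of such an extension (my own rough sketch, not part of the original analysis), the individual tweet scores could be weighted by recency so that newer opinions count for more. The sketch below reuses the score.sentiment function and the per-brand tweet data frames defined in the code at the end of this post, together with the created timestamp column that twListToDF provides:

## Rough sketch: recency-weighted sentiment, where newer tweets carry more weight
weighted.sentiment = function(df, pos.words, neg.words)
{
scores = score.sentiment(df$text, pos.words, neg.words)$score
## age of each tweet in days, based on the 'created' column produced by twListToDF
age.days = as.numeric(difftime(Sys.time(), df$created, units = "days"))
weights = 1 / (1 + age.days)   # weights close to 1 for fresh tweets, smaller for old ones
sum(scores * weights)
}
## For example: weighted.sentiment(Audi.df, pos.words, neg.words)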
This post serves as a starting point for anyone interested in doing Sentiment Analysis using Twitter. There are certainly a lot of possibilities left to explore.
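A note on the credential file loaded in step 1: the commands below are a rough sketch of how such a file can be created, based on the standard twitteR/ROAuth pattern, and are not part of the code used inside PA. The consumer key and secret are placeholders that you obtain by registering your own application on the Twitter developer site; the save path simply matches the one loaded in the code below.

## Rough sketch: creating the Twitter credential file (run once, outside PA)
library(twitteR)
library(ROAuth)
library(RCurl)
consumerKey = "YOUR_CONSUMER_KEY"        # placeholder from your own Twitter app
consumerSecret = "YOUR_CONSUMER_SECRET"  # placeholder from your own Twitter app
credential = OAuthFactory$new(consumerKey = consumerKey,
consumerSecret = consumerSecret,
requestURL = "https://api.twitter.com/oauth/request_token",
accessURL = "https://api.twitter.com/oauth/access_token",
authURL = "https://api.twitter.com/oauth/authorize")
## the handshake opens a PIN-based authorisation in the browser the first time it is run
credential$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(credential, file = 'C:/Users/bimehta/Documents/twitter authentication.Rdata')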
Code:
mymain<- function(mydata, mytweetnum)
{
## Load the necessary packages for the Twitter connection
library(twitteR)
library(RJSONIO)
library(bitops)
library(RCurl)
##Packages required for sentiment analysis
library(plyr)
library(stringr)
## Load the saved credential file and register it for the Twitter handshake
load('C:/Users/bimehta/Documents/twitter authentication.Rdata')
registerTwitterOAuth(credential)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
## Retrieve the tweets for each brand in our input data set
tweetList <- searchTwitter("#Audi", n=mytweetnum)
Audi.df = twListToDF(tweetList)
tweetList <- searchTwitter("#BMW", n=mytweetnum)
BMW.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Nissan", n=mytweetnum)
Nissan.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Toyota", n=mytweetnum)
Toyota.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Volkswagen", n=mytweetnum)
Volkswagen.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Peugeot", n=mytweetnum)
Peugeot.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Vauxhall", n=mytweetnum)
Vauxhall.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Ford", n=mytweetnum)
Ford.df = twListToDF(tweetList)
tweetList <- searchTwitter("#Renault", n=mytweetnum)
Renault.df = twListToDF(tweetList)
## Load the Lexicon of Hu and Liu saved on your desktop
hu.liu.pos = scan('C:/Users/bimehta/Desktop/Predictive/Text Mining & SA/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('C:/Users/bimehta/Desktop/Predictive/Text Mining & SA/negative-words.txt', what='character', comment.char=';')
##Build an array of positive and negative words based on Lexicon and own set of words
pos.words = c(hu.liu.pos, 'upgrade')
neg.words = c(hu.liu.neg, 'wtf', 'wait', 'waiting', 'fail', 'mechanical', 'breakdown')
## Build the score sentiment function that will return the sentiment score
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
# we want a simple array ("a") of scores back, so we use
# "l" + "a" + "ply" = "laply":
scores = laply(sentences, function(sentence, pos.words, neg.words) {
# clean up sentences with R's regex-driven global substitute, gsub():
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
# and convert to lower case:
sentence = tolower(sentence)
# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}
## Creating a Vector to store sentiment scores
a = rep(NA, 10)
## Calculate the sentiment score for each brand and store the score sum in array
Audi.scores = score.sentiment(Audi.df$text, pos.words, neg.words, .progress='text')
a[1] = sum(Audi.scores$score)
Nissan.scores = score.sentiment(Nissan.df$text, pos.words, neg.words, .progress='text')
a[2] = sum(Nissan.scores$score)
BMW.scores = score.sentiment(BMW.df$text, pos.words, neg.words, .progress='text')
a[3] = sum(BMW.scores$score)
Toyota.scores = score.sentiment(Toyota.df$text, pos.words, neg.words, .progress='text')
a[4] = sum(Toyota.scores$score)
## The sentiment score for the remaining brand is considered 0
a[5]=0
Volkswagen.scores = score.sentiment(Volkswagen.df$text, pos.words, neg.words, .progress='text')
a[6] = sum(Volkswagen.scores$score)
Peugeot.scores = score.sentiment(Peugeot.df$text, pos.words, neg.words, .progress='text')
a[7] = sum(Peugeot.scores$score)
Vauxhall.scores = score.sentiment(Vauxhall.df$text, pos.words, neg.words, .progress='text')
a[8] = sum(Vauxhall.scores$score)
Ford.scores = score.sentiment(Ford.df$text, pos.words, neg.words, .progress='text')
a[9] = sum(Ford.scores$score)
Renault.scores = score.sentiment(Renault.df$text, pos.words, neg.words, .progress='text')
a[10]=sum(Renault.scores$score)
## Plot the histograms for a few brands
par(mfrow=c(4,1))
hist(Audi.scores$score, main="Audi Sentiments")
hist(Nissan.scores$score, main="Nissan Sentiments")
hist(Vauxhall.scores$score, main="Vauxhall Sentiments")
hist(Ford.scores$score, main="Ford Sentiments")
## Return the results by combining sentiment score with original dataset
result <- as.data.frame(cbind(mydata, a))
return(list(out=result))
}
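Within SAP PA the mymain function is wired in as an R custom component, so PA supplies the mydata table and the mytweetnum parameter. For quick testing outside PA, a rough, hypothetical example of calling it directly would be (the file name here is made up; any data frame holding the NCSI-UK scores in the same brand order as the vector a will do):

## Hypothetical stand-alone call outside SAP PA
ncsi = read.csv("ncsi-uk-car-scores.csv")   # made-up file name holding the NCSI-UK scores
res = mymain(ncsi, 500)                     # retrieve and score 500 tweets per brand
res$out                                     # the original data with the sentiment score column 'a' appended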
Code Acknowledgements:
Opinion Mining, Sentiment Analysis, Opinion Extraction
How-To | Information Research and Analysis (IRA) Lab
R by example: mining Twitter for consumer attitudes towards airlines