Continuing from the previous post, we now explore Sentiment Analysis. First, let’s look at what Sentiment Analysis and Text Mining actually mean. Wikipedia defines Sentiment Analysis as follows: “Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document”. It is sometimes also called Opinion Mining, that is, extracting information from people’s opinions. Opinions are usually expressed as text, so Sentiment Analysis also requires some knowledge of Text Mining. Text Mining, in the words of Hearst (1999), is “the use of large online text collections to discover new facts and trends about the world itself”. Standard techniques include text classification, text clustering, ontology and taxonomy creation, document summarization and latent corpus analysis. We use a combination of Sentiment Analysis and Text Mining in the example scenario discussed below.

Before I start, let me make it clear that this is only sample data, analyzed purely for the purpose of learning. It is not meant to target or influence any brand. The outputs and analysis shown here are based on opinion and should not be considered facts.


I downloaded some public opinion data on car manufacturers from the NCSI-UK website.

Scores By Industry

The data covers 2009-2013. My intention was to gauge public sentiment towards these manufacturers on the social networking site Twitter and to build a probable score for 2014 based on a sample of tweets. The aim is simply to see whether these scores are similar to those obtained in 2013.

The steps to do sentiment analysis using SAP PA and Twitter are shown below. The code is shown at the end of this post.

1. Load the necessary packages. Also load the credential file that stores the information required to connect to Twitter. This credential file was created using the steps shown in the post below. Then establish the handshake with Twitter.

2. Retrieve the tweets for each of the brands in our data set (9 in total) and save them in a separate data frame for each car brand.

3. The next step is to analyse the retrieved tweets for negative and positive words. For this we use a lexicon. As per Wiki, the word “lexicon” means “of or for words”. A lexicon is essentially a dictionary: a collection of words. For our sentiment analysis we use the Hu and Liu lexicon, available at Opinion Mining, Sentiment Analysis, Opinion Extraction. It is a list of positive and negative opinion words (sentiment words) for English, around 6800 words in all. We download the lexicon and save it on our local desktop, then load the files to create arrays of positive and negative words, as shown in the code. We can also append our own positive and negative words as required.

4. Now that we have arrays of positive and negative words, we compare them with the tweets we obtained and assign a score of 1 to each positive word and -1 to each negative word in a tweet. Each score of 1 counts as a positive sentiment and each score of -1 as a negative sentiment.

The sum of these scores gives the net sentiment for that brand. This requires a sentiment scoring function. I have used the function from the website below as-is and give full credit to its author; I did not create it.

How-To | Information Research and Analysis (IRA) Lab

5. After getting the sentiment scores for each brand, the next step is to sum them and store the totals in an array. We then bind this array to our original data set. We use this final table to generate heat maps as shown below:
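Before the output, steps 3 and 4 can be illustrated on toy data. The sketch below is not the scoring function used in this post; it is a simplified stand-in that uses base R only. The word lists, the temporary lexicon file, the helper name score_tweet and the three example tweets are all made up for illustration.

```r
# Toy stand-in for a Hu and Liu word file: header lines start with ';'
lexicon_file <- tempfile(fileext = ".txt")
writeLines(c("; header comment, skipped by scan()", "good", "great", "love"),
           lexicon_file)

# comment.char = ";" drops the header lines; what = "character" reads strings
pos.words <- scan(lexicon_file, what = "character", comment.char = ";",
                  quiet = TRUE)
neg.words <- c("bad", "fail", "breakdown")  # made-up negative list

# Hypothetical helper: +1 for each positive word, -1 for each negative word
score_tweet <- function(tweet, pos.words, neg.words) {
  tweet <- tolower(gsub("[[:punct:]]", "", tweet))  # strip punctuation
  words <- unlist(strsplit(tweet, "\\s+"))          # split on whitespace
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

tweets <- c("Love the new model, great drive!",
            "Another breakdown. Bad service, total fail.",
            "Picked up the car today")

# One integer score per tweet
scores <- vapply(tweets, score_tweet, integer(1),
                 pos.words = pos.words, neg.words = neg.words,
                 USE.NAMES = FALSE)
print(scores)
```

A tweet with no lexicon words scores 0, so neutral tweets do not move a brand's total in either direction.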

Final Output with Sentiment Score

Pic12.PNG

Histogram

Pic13.PNG

Heat Maps

Pic14.PNG

Pic15.PNG
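The table behind these outputs comes from step 5: summing each brand's tweet scores and binding the totals to the original data. A minimal sketch on made-up numbers follows. The brand names, tweet scores and the score_2013 column are hypothetical and are not the results shown above.

```r
# Made-up per-tweet scores for three brands (not real results)
brand_scores <- list(
  BrandA = c(1, 0, -1, 2),
  BrandB = c(-1, -1, 0, 0),
  BrandC = c(0, 3, 1, -1)
)

# Net sentiment per brand is the sum of that brand's tweet scores
net_sentiment <- vapply(brand_scores, sum, numeric(1))

# Hypothetical stand-in for the industry score table, in the same brand order
industry <- data.frame(brand = names(brand_scores),
                       score_2013 = c(80, 75, 78))

# Bind the sentiment vector to the table as a new column
result <- cbind(industry, sentiment = unname(net_sentiment))
print(result)
```

Because cbind() binds by position, the score vector has to be in the same order as the rows of the table. Merging on the brand name would be a safer choice if the orders could differ.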

The above analysis shows that although the industry score for one brand (Audi) is quite high, current public sentiment favours another brand (Vauxhall) that had a low overall industry score. This is just a basic analysis with 500 tweets. We can extend it by increasing the number of tweets and by creating a more advanced scoring function that takes parameters such as region, time and historical data into account when calculating the final sentiment score.

This post serves as a starting point for anyone interested in doing Sentiment Analysis using Twitter. There is certainly a lot left to explore.

Code:

mymain <- function(mydata, mytweetnum)
{

## Load the necessary packages for the Twitter connection
library(twitteR)
library(RJSONIO)
library(bitops)
library(RCurl)

## Packages required for sentiment analysis
library(plyr)
library(stringr)

## Load the saved credential file and establish the handshake with Twitter
load('C:/Users/bimehta/Documents/twitter authentication.Rdata')
registerTwitterOAuth(credential)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

## Retrieve the tweets for the brands in our Excel data, one data frame per brand
tweetList <- searchTwitter("#Audi", n = mytweetnum)
Audi.df = twListToDF(tweetList)

tweetList <- searchTwitter("#BMW", n = mytweetnum)
BMW.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Nissan", n = mytweetnum)
Nissan.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Toyota", n = mytweetnum)
Toyota.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Volkswagen", n = mytweetnum)
Volkswagen.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Peugeot", n = mytweetnum)
Peugeot.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Vauxhall", n = mytweetnum)
Vauxhall.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Ford", n = mytweetnum)
Ford.df = twListToDF(tweetList)

tweetList <- searchTwitter("#Renault", n = mytweetnum)
Renault.df = twListToDF(tweetList)

## Load the Hu and Liu lexicon saved on your desktop
## (lines starting with ';' in the files are header comments and are skipped)
hu.liu.pos = scan('C:/Users/bimehta/Desktop/Predictive/Text Mining & SA/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('C:/Users/bimehta/Desktop/Predictive/Text Mining & SA/negative-words.txt', what='character', comment.char=';')

## Build arrays of positive and negative words from the lexicon plus our own words
pos.words = c(hu.liu.pos, 'upgrade')
neg.words = c(hu.liu.neg, 'wtf', 'wait', 'waiting', 'fail', 'mechanical', 'breakdown')

## Build the score sentiment function that will return the sentiment score
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  # we want a simple array ("a") of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {

    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)

    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')

    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)

    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)

    return(score)
  }, pos.words, neg.words, .progress=.progress)

  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

## Create a vector to store the sentiment scores
a = rep(NA, 10)

## Calculate the sentiment score for each brand and store the score sum in the array
Audi.scores = score.sentiment(Audi.df$text, pos.words, neg.words, .progress='text')
a[1] = sum(Audi.scores$score)

Nissan.scores = score.sentiment(Nissan.df$text, pos.words, neg.words, .progress='text')
a[2] = sum(Nissan.scores$score)

BMW.scores = score.sentiment(BMW.df$text, pos.words, neg.words, .progress='text')
a[3] = sum(BMW.scores$score)

Toyota.scores = score.sentiment(Toyota.df$text, pos.words, neg.words, .progress='text')
a[4] = sum(Toyota.scores$score)

## Sentiment score for other brands is considered 0
a[5] = 0

Volkswagen.scores = score.sentiment(Volkswagen.df$text, pos.words, neg.words, .progress='text')
a[6] = sum(Volkswagen.scores$score)

Peugeot.scores = score.sentiment(Peugeot.df$text, pos.words, neg.words, .progress='text')
a[7] = sum(Peugeot.scores$score)

Vauxhall.scores = score.sentiment(Vauxhall.df$text, pos.words, neg.words, .progress='text')
a[8] = sum(Vauxhall.scores$score)

Ford.scores = score.sentiment(Ford.df$text, pos.words, neg.words, .progress='text')
a[9] = sum(Ford.scores$score)

Renault.scores = score.sentiment(Renault.df$text, pos.words, neg.words, .progress='text')
a[10] = sum(Renault.scores$score)

## Plot the histograms for a few brands, stacked in four rows
par(mfrow=c(4,1))
hist(Audi.scores$score, main="Audi Sentiments")
hist(Nissan.scores$score, main="Nissan Sentiments")
hist(Vauxhall.scores$score, main="Vauxhall Sentiments")
hist(Ford.scores$score, main="Ford Sentiments")

## Return the results by combining the sentiment scores with the original dataset
## (cbind() binds by position, so the order of a must match the rows of mydata)
result <- as.data.frame(cbind(mydata, a))
return(list(out=result))

}

Code Acknowledgements:

Opinion Mining, Sentiment Analysis, Opinion Extraction

How-To | Information Research and Analysis (IRA) Lab

R by example: mining Twitter for consumer attitudes towards airlines

