Sentiment Analysis on Twitter Data using SAP Data Intelligence
This blog post describes how to perform sentiment analysis on Twitter data in SAP Data Intelligence and then report on it in SAP Analytics Cloud by creating a dashboard. The idea is to introduce the overall process through a simple integration scenario, which you can then adapt to more complex requirements.
Companies always look for ways to improve their services and gain a competitive edge in the market. One way to do so is to offer better service than the competition, or to gauge the sentiments of their own users about the services on offer. Such an analysis can be based on a search string related to the company, its products, or its services. In this article, we will analyze recent tweets and extract the sentiments that users express in them.
Here are the overall high-level steps to achieve this task:
- Connecting to Twitter API using library tweepy
- Fetching and cleaning the Twitter Data
- Extracting sentiments using library TextBlob
- Pushing the processed data from SAP Data Intelligence to SAP Analytics Cloud
- Reporting on the Extracted Sentiments in SAP Analytics Cloud
- Summary
Connecting to Twitter API using library tweepy
In order to fetch live tweets from Twitter, you need Twitter API credentials (Access Token, Access Secret, Consumer Key, and Consumer Secret). These can be generated by creating an application on Twitter. To get the Consumer Key and Consumer Secret, you need to log into the developer section of Twitter (you can refer to this blog post) and create an app there. Before we move ahead, please keep these details ready.
Now it's time to log on to SAP Data Intelligence and get started. Once you have logged in, click the "ML Scenario Manager" tile to create a machine learning scenario. If you have already created an ML scenario before, you can simply reuse it.
Create a Python notebook where we will write the Python code to carry out our analysis. Select the "Notebooks" tab and click the "+" sign. Give it a name and description and click "Create"; the notebook opens in a new window. When prompted for the kernel, select the default kernel "Python 3".
A blank notebook will open in a new window on Jupyter Lab.
Now we are ready to code in Python to explore the Twitter data and do the sentiment analysis. You may have to install the required libraries before you import them. To install a Python library, use a command like the one below in a notebook cell and run it. You will see the progress of the package installation. If required, you can install the other libraries in the same way.
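As a minimal sketch, installing the libraries from a notebook cell looks roughly like this (the package list is assumed from the imports used below):
# Install the libraries needed for this analysis (one-time setup per notebook environment)
!pip install tweepy textblob nltk seaborn matplotlib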
Now it’s time to import all the required libraries and establish the connection to Twitter API. We need the following libraries to carry this task:
- Tweepy – an easy-to-use Python library for accessing the Twitter API.
- re – to clean the fetched tweets.
- NLTK – to tokenize the cleaned tweets into words and remove stop words.
- Seaborn, Matplotlib – to visualize the data in the notebook (optional).
- TextBlob – a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. I have used this package to extract the sentiments from the tweets.
# Import all relevant libraries and set the connection to Twitter
import tweepy as tw
import re
from textblob import TextBlob
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
pd.set_option('display.max_colwidth',300)
ACCESS_TOKEN = '1017107168-xxxxxxxxxxxxxxxxxxYlE5Fxur'
ACCESS_SECRET = '8mQLSJ72xxxxxxxxxxxxxxxxxxxA7SgbT2vU'
CONSUMER_KEY = 'aIGuxxxxxxxxxqPSVVskotq'
CONSUMER_SECRET = 'hLHVxxxxxxxxxxxxxxxxxxxxxxxxxxxJ070i0Qu2'
# Setup tweepy to authenticate with Twitter credentials:
auth = tw.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tw.API(auth, wait_on_rate_limit=True)
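Optionally, you can check that the credentials work before fetching any data. This call is not part of the original flow, but tweepy provides it for exactly this purpose:
# Optional sanity check: raises an error if the credentials are invalid
user = api.verify_credentials()
print("Authenticated as:", user.screen_name)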
Fetching and cleaning the Twitter Data
Once the connection is established, I pass a search string to the Twitter API, which returns the latest matching tweets. Several parameters can be specified to shape the dataset. For my analysis, I have used the following parameters:
- search string – pass the string of your choice; I have replaced the actual string with xxxxxx here
- filter:retweets – to exclude retweeted tweets
- date_since – tweets older than this date are ignored
- language – to exclude tweets in any language other than English
- items – number of tweets to be fetched (500, in my case)
# Define the search string and the date
search_string = "xxxxxx -filter:retweets" # look for xxxxxx excluding retweets
date_since = "2020-07-01" # date since you want to search tweets
# Fetching 500 tweets for analysis
tweets = tw.Cursor(api.search,
                   q=search_string,
                   lang="en",
                   since=date_since).items(500)
I have also defined a function clean_tweet() to clean the tweets. A typical tweet contains lots of special characters such as punctuation, exclamation marks, and emojis. I have used the Python library re to clean the tweets and stored the cleaned tweets in a list for further processing.
# defining a function to clean the tweets
def clean_tweet(tweet):
    tweet = tweet.lower()  # convert text to lower-case
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with the token URL
    tweet = re.sub('@[^\s]+', 'USER', tweet)  # replace @usernames with the token USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag but keep the word
    return " ".join(re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())  # drop remaining special characters and extra whitespace
# Cleaning the tweets
clean_tweets = [clean_tweet(tweet.text) for tweet in tweets]
print("Total no. of tweets fetched: ",len(clean_tweets))
print(clean_tweets[0:5])
Even after removing the special characters, the cleaned tweets will still contain stop words. Stop words do not add much value to text analysis because they carry little meaning. You can use the Python library nltk and import its stop word list for this step (see the sketch below). However, one problem with the nltk stop words is that even a word like "not" is flagged as a stop word. In my case I wanted to keep it, so that I can report on negative feedback too. Therefore, I have defined my own stop word list and used it for further cleaning. I have also included tokens like USER and URL in my stop word list, because user mentions and URLs were replaced by these tokens while cleaning the tweets.
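For reference, here is a minimal sketch of how the built-in nltk stop word list could be loaded instead of a custom list; it relies on the nltk imports from above, and the download calls are an assumption about your environment (the 'punkt' resource is also what word_tokenize needs later):
# Download the NLTK resources once per environment
nltk.download('stopwords')   # built-in stop word lists
nltk.download('punkt')       # tokenizer data used by word_tokenize
nltk_stop_words = set(stopwords.words('english'))
print(len(nltk_stop_words), "English stop words loaded, e.g.:", list(nltk_stop_words)[:10])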
# We'll remove the stop words from the tweets. I am defining my own list as I do not want to remove
# a word like "not". I also want to get rid of words like "USER" and "URL".
stop_words = [u'i', u'im', u'USER', u'URL', u'hi', u'hey', u'plz', u'please', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off',u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why',u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'only',u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
def remove_stop(tweet):
    word_tokens = word_tokenize(tweet)
    print(word_tokens)  # optional: show the tokens of each tweet
    filtered_tweet = [w for w in word_tokens if w not in stop_words]
    result = ' '.join(filtered_tweet)
    return result
tweet_stopwords_removed = [remove_stop(tweet) for tweet in clean_tweets]
print("Couple of tweets after removing stop words")
tweet_stopwords_removed[0:2]
Now, my dataset is ready for sentiment extraction. Let’s see how to achieve this.
Extracting sentiments using library TextBlob
There are multiple ways to carry out sentiment analysis. In this article, we use the Python library TextBlob, a powerful NLP library built on top of NLTK that provides an easy-to-use interface to it. It comes with a sentiment property that returns a polarity score for the text passed to it.
Using the TextBlob library, I created a TextBlob object for each cleaned tweet, read the polarity from each object's sentiment attribute, and finally stored the sentiment scores for my dataset in a pandas data frame.
# Create textblob objects
sentiment_objects = [TextBlob(tweet) for tweet in tweet_stopwords_removed]
sentiment_score = [[tweet.sentiment.polarity, str(tweet)] for tweet in sentiment_objects]
sentiment_df = pd.DataFrame(sentiment_score, columns=["polarity", "tweet"])
The resulting data frame has a polarity score associated with each tweet. A negative score means a negative sentiment and a positive score means a positive sentiment. A polarity of 0 means that the package was not able to extract any strong sentiment from the tweet, i.e. the tweet is neutral.
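If you also want a categorical label per tweet in addition to the numeric score, a small sketch like the one below could derive it from the polarity; the label_sentiment function and the "sentiment" column are my own additions and not part of the original code:
# Derive a categorical sentiment label from the numeric polarity score
def label_sentiment(polarity):
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    return "Neutral"

sentiment_df["sentiment"] = sentiment_df["polarity"].apply(label_sentiment)
print(sentiment_df["sentiment"].value_counts())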
We also have the option to explore the data in the Python notebook using libraries such as Seaborn and Matplotlib. The code below displays a histogram of the processed data. I have removed the neutral tweets (polarity score = 0) to keep only tweets that carry either a positive or a negative sentiment, and I have defined the bin edges so the data can be viewed at different points of the polarity scale.
# Remove polarity values equal to zero from cleaned tweets
polarized_sentiment = sentiment_df[sentiment_df.polarity != 0]
comp = search_string.split("-")[0]  # keep only the search term (drop the -filter:retweets part) for the chart title
fig, ax = plt.subplots(figsize=(8, 6))
polarized_sentiment.hist(bins=[-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1], ax=ax, color="purple")
plt.title("Polarized Sentiments from Twitter for " + comp.capitalize())
print("No. of tweets with polarized sentiments: ", len(polarized_sentiment))
print("Percentage of tweets with polarized sentiments: ", len(polarized_sentiment)*100/len(sentiment_df))
The above cell will give you the following graph in the python notebook.
However, I want to integrate this data into SAP Analytics Cloud. For that, I need to store this data in SAP Data Intelligence so that it is available for any other complex integration scenario. In order to save this data frame, I executed the below command from Jupyter Notebook. As a result, the processed data will be saved as a CSV file in SAP Data Intelligence.
sentiment_df.to_csv("Polarized tweets.csv")  # save the processed data as a CSV file
This file can also be copied, downloaded, shared as a link, or copied as a download link. Right-click on the file to see all the available options.
Pushing the processed data from SAP Data Intelligence to SAP Analytics Cloud
Before we create a pipeline in SAP Data Intelligence to push the data to SAP Analytics Cloud, we must configure an OAuth client for the given SAP Data Intelligence instance in SAP Analytics Cloud under App Integration. Note down the host, Authorization URL, Token URL, Client ID, and Secret. If you are using SAP Analytics Cloud on the Neo platform, the screen to register the OAuth client may look different from the one on Cloud Foundry (read more about Cloud Foundry vs. Neo).
In order to push the data from SAP Data Intelligence to SAP Analytics Cloud, we need to create a pipeline in SAP Data Intelligence using two operators – SAP Analytics Cloud Formatter and SAP Analytics Cloud Producer. Please refer to this excellent blog post for detailed steps on SAP Data Intelligence and SAP Analytics Cloud integration.
From SAP Data Intelligence home, click on Modeler to launch the modeler and create a graph. The detailed steps are given in the blog post above, so I’ll skip that part. The final graph should look like this:
A few important points on these operators and their configuration:
- Read File – used to read the data from SAP Data Intelligence. Specify the path of the file that was generated using the Python code.
- From File – used to extract the path and feed it into Decode Table.
- Decode Table – used to decode the input CSV file into a table message.
- SAP Analytics Cloud Formatter – used to convert the message.table input from the previous operator into message format before sending it to SAP Analytics Cloud. It is placed before the SAP Analytics Cloud Producer operator.
- SAP Analytics Cloud Producer – used to send the data from SAP Data Intelligence to SAP Analytics Cloud. We need to provide the access token, access URL, OAuth ID, and secret key generated earlier in SAP Analytics Cloud during App Integration.
Finally, we need to use the Wiretap operator to close the graph. Now our pipeline is ready. The next step is to save and run this graph from the SAP Data Intelligence modeler.
On first execution, you need to grant permission for the OAuth authentication/access token request by clicking "Open UI" on the SAP Analytics Cloud Producer operator (refer to this blog post for more details). Once the permission is granted, the pipeline should be stopped and run again. You should not face any issues on the next run, assuming the SAP Analytics Cloud dataset API is enabled on the SAP Analytics Cloud tenant where you generated the OAuth ID and access token. A successful run of the pipeline pushes the data to SAP Analytics Cloud, and you can verify this by logging in to SAP Analytics Cloud: the data will be placed under "My Files".
Reporting on the Extracted Sentiments in SAP Analytics Cloud
Once the data is available in SAP Analytics Cloud, it can be consumed in many ways. I have created a simple story on this dataset. For my analysis, I created a pie chart that shows the percentage distribution of tweets by sentiment polarity – Positive, Neutral, and Negative. I also created a ranked chart that displays the 10 worst pieces of feedback (negative sentiments with the lowest polarity scores). This helps me see what is going wrong with my current service. Here is how it looks:
Summary
In this blog post, we have seen how to carry out sentiment analysis on live tweets using the Python libraries tweepy and TextBlob. This is one of the easiest ways to do sentiment analysis. The process is slightly different from training a regular ML model: we don't have to split the dataset into train and test sets, and there is no need to tune any hyper-parameters, since we have not created a machine learning model here at all. We also saw how to feed the results into a graphical pipeline for productive use. Once the data is generated and saved in SAP Data Intelligence, you can leverage this dataset in any application and in any way you want. In my analysis, I pushed the data to SAP Analytics Cloud for dashboard reporting. Thanks for reading, and I would like to hear your feedback. If you require any additional information, please feel free to reach out to me at sap_dmlt_gce@sap.com.
Hi,
Managed to recreate your blog and it's working great!
Do you have any idea whether the file you generated through the Jupyter notebook (polarized tweets.csv) could be saved directly into the Shared folder in Metadata Explorer? Or alternatively into the files folder in System Management, ideally through a line of code in the Jupyter notebook.
I'm asking because I'm trying to automate the Read File operator so that it immediately reads the file generated through the Jupyter notebook. That way it would not be necessary to download the file and upload it manually.
Thanks in advance!