Hope you are doing well and safe too 🙂
Since long i am seeing various post on COVID-19 data analysis and forecasting and having the same thought i decided to post my recent work mainly on text mining & text analysis on COVID-19 tweets & data.
IDEA is to connect twitter from python and insert data into SAP HANA and utilize the built-in powerful capabilities in SAP HANA for text mining & text analysis & displaying word cloud and using few text mining functions as well.
we have seen people are expressing their views and thoughts onto twitter with different sentiments and you know we have lot of unstructured data & to dig into that is really a very difficult job hence i pick this area to try my hands on and thought to showcase few insights of it hope you will like it and please feel free to share your suggestion and feedback.
Technical flow for twitter and HANA communication – we have now python which is going to make our job easy and provides a smooth integration with SAP HANA + so many built-in python functions used for mining and analysis.
HANA ML Python APIs – For more information about HANA Machine learning API please refer SAP HANA document.
Pre-requisites: HANA ML Installation environment + SAP HANA DB (UP and Running)
HANA ML Installation – https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html
Let’s go step by step
- First of all we need one twitter developer account this is pretty easy and straightforward go to twitter for developer url and setup one app for yourself and save all the tokens mentioned there somewhere in notepad
- After successful creation of twitter account you will see below & please have them handy & save them somewhere in notepad
Twitter developer tokens
- Go to your conda/python command prompt and install the twitter API (there are few APIs for python install based on your comfort or like)
- execute the below command
pip install tweepy ## if this is successful execute the below command to check import tweepy as twitter
- Initialize all the tokens , give them the value we noted from twitter developer account this is going to be used for connecting with twitter
consumer_key = 'e6QvccFkjlkdsjlksadjlkjdsa' consumer_secret = 'aslkdjsalkdjsalkdjlsakd' token='1384909104-2alakdlsakdlsakdlakdslkdsalkdsaladkads' token_secret='rT0ZEUdasldkasdkjsadlkalkdjlkdajlkdsa'
- Start with twitter authentication if connection is successful test for downloading few tweets based on the Tags
# Authenticate to Twitter passing key and secret tokens auth = twitter.OAuthHandler(consumer_key, consumer_secret) auth.set_access_token(token, token_secret) # Create API object - Wait on rate limit is true so that we can download the tweets #without any connection break api = twitter.API(auth, wait_on_rate_limit=True) # Create a tweet - let' see this api.update_status("Hi from HANA ML API to Twitter - Analysing tweets")
- Tweeting using python – last command will post a tweet using my account
- The connectivity between python and twitter is fine now – let’s try to download few tweets and see how it’s coming at client and then cleansing and some formatting will be required before sending them to SAP HANA DB
- Try fetching first 10 tweets based on Hash tag (COVID19) & lets what we get :
hashtag = '#COVID19' date = '2020-06-30' tweets = twitter.Cursor(api.search, q=hashtag,lang='en',since=date, ).items(10) for tweet in tweets: for i in range(len(tweet.entities.get('hashtags'))): print(tweet.entities.get('hashtags')[i]['text'])
- Here we are passing date as well so it’s going to give us 10 items since 30th June & lets have a look at output.It’s up to you what parameter you choose & you are free to change Hash Tag .
- If you see in below image mainly people are talking about mask / air travel etc. we will see this in text analysis in detail this is just to check if we can download the tweets or not.
- Now execute the same command for 5K items
tweets = twitter.Cursor(api.search, q=hashtag,lang='en',since=date).items(5000) tweet_data = [[tweet.geo,tweet.text,tweet.user.screen_name,tweet.user.location,tweet.created_at,tweet.entities['hashtags']]for tweet in tweets]
- This is going to download 5000 tweets for you and surely we can do some text analysis with this data – you can go for 50K items as well it’s gonna take sometime & save those 50K times somewhere in the excel-sheet , and off-course you can try more function for text mining.
- GEO (Geo Location)
- Tweet Text
- User ID
Create a dataframe for storing the tweets
import pandas as pd tweet_df = pd.DataFrame(data=tweet_data,columns=['geo','text','user_name','loc','date','tags']) pd.set_option('max_colwidth',800)
Define a function to fetch Hash Tags
def hash_tag(tdata): tags = '' for i in range(len(tdata)): tags = tags + str(tdata[i]['text']) + ',' tags = tags.strip(',') return tags
- Tweets Processing only Hash Tag column
#process tweets , to have clean data for hastags for i in range(len(tweet_df['tags'])): tweet_df['tags'][i] = hash_tag(tweet_df['tags'][i])
Regex function for cleaning the text
Let’s build now some regex to clean the text before any text mining and also before inserting the data into SAP HANA
import re re.sub("RT @[\w]*:", "" , tweet_df.text) Output ' BREAKING NEWS Kimberly Guilfoyle Trump Jr s girlfriend has tested positive for COVID19' # RT and @ user name replaced with space , lets clean more ##Complete function which is going to remove @ , RT , HTTPS , New lines , # Tags , Idea is to clean the ##text. def clean_data(tweet): tweet = re.sub("RT @[\w]*:", "", tweet) tweet = re.sub("@[\w]*", "", tweet) tweet = re.sub("https://[A-Za-z0-9./]", "", tweet) tweet = re.sub("\n", "", tweet) tweet = re.sub("&", "", tweet) tweet = re.sub("#", "", tweet) tweet = re.sub(r"[^\w]", ' ', tweet ) return tweet
Apply the regex function and clean the text
tweet_df['text'] = tweet_df['text'].apply(lambda x: clean_data(x)) Print the text tweet_df['text']
if you look at the text i see most of things are clean but still scope is always there for improvement , i see here if i can remove the “HEY” but still HANA is going to remove it using the stop-words
Save this dataframe in your local file so that you have it for later use –
Establish the connection to SAP HANA System
from hana_ml import dataframe from data_load_utils import DataSets, Settings url, port, user, pwd = Settings.load_config(r"C:\Users\ABC\config\abc.ini") connection_context = dataframe.ConnectionContext(url, port, user, pwd) print(connection_context.connection.isconnected()) ##if connection is successful it will print True
Go to your HANA System now –
- Choose the schema where you want to create your table for storing tweets go to sql console
CREATE COLUMN TABLE TWEETS_DATA( "ID" INTEGER NOT NULL, "USER_NAME" NVARCHAR(100), "CREATED_AT" DATE, "TEXT" NVARCHAR (140), "HASH_TAGS" NVARCHAR (100), "GEO" NVARCHAR(100), "LOCATION" NVARCHAR(100), PRIMARY KEY("ID") );
- Note Once this is done create the full text index and text mining as well for this table
CREATE FULLTEXT INDEX index_table_name ON table_name (column_name) ASYNC FAST PREPROCESS OFF SEARCH ONLY OFF TEXT ANALYSIS ON TEXT MINING ON TEXT MINING CONFIGURATION MYTMCONFIGURATION.textminingconfig LANGUAGE DETECTION ('EN') CONFIGURATION 'LINGANALYSIS_FULL'
Come back to Jupyter notebook (I am using jupyter for all the python work i have posted above)
- Create an empty list example data = [ ]
- Append all the tweets into this list and insert this data into tweet table
data =  for indx , row in tweet_df.iterrows(): data.append([indx , row['geo'], row['text'], row['user_name'], row['loc'], row['date'], row['tags']]) sql = ('insert into ABC.TWEETS("ID", "GEO", "TEXT", "USER_NAME", "LOC", "DATE", "HASH_TAGS")' 'VALUES (?, ?, ?, ?, ?, ?, ?)') with connection_context.connection.cursor() as cur: rows_inserted = cur.executemany(sql, data) print(rows_inserted)
Note – you can also use csv upload approach instead of it up to you which approach to follow
Text Mining / Text Analysis
Let’s see word cloud first from HANA –
Let’s try the same thing using Python which will be talking to SAP HANA using Dataframe.
# Start with loading all necessary libraries import numpy as np import pandas as pd from os import path from PIL import Image from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator import matplotlib.pyplot as plt % matplotlib inline
## Prepare SQL for Tokens sql_new = """SELECT STRING_AGG( "TA_TOKEN", ',' ) AS TOKENS FROM "ABC"."$TA_TWEET_DATA" """
##Collect will fetch the data df_new_sql = connection_context.sql(sql_new).collect()
Generate word cloud using python
Set the properties for word cloud before display
wordcloud = WordCloud(width = 800, height = 800, background_color ='white', min_font_size = 12).generate(df_new_sql.TOKENS) # plot the WordCloud image plt.figure(figsize = (12, 12), facecolor = None) plt.imshow(wordcloud) plt.axis("off") plt.tight_layout(pad = 5) plt.show()
Note : – you may see some bad data as well (it’s not 100 % cleaned) but yes you will see the important text mining output as well
Execute the above python code to generate word cloud
- if you see the word cloud – data coming directly from SAP HANA – you see the covid 19 tag off-course and addition to this you see words like “severe” , “respiratory” , “sick” , “neurological”, ” cases” , “mask” etc.
- Important words – banned , plasma , President
So using this way you can insert more information and for sure it’s going to give you more relevant data for displaying into word cloud.
Fetching terms and checking importance using SAP HANA (text mining) –
Function Used – GET_SUGGESTED_TERM (See SAP HANA SQL for more info)
- Let’s see how text mining capability of SAP HANA is going to score the probability of word “CORONA”
- Call the function
##Create DataFrame to Store Result df_terms = pd.DataFrame() ## Call TM function for to fetch score for token in input_tokens: df_new = GET_SUGGESTED_TERMS(token,"TEXT") df_terms = df_terms.append(df_new)
See the score for first record is 0.75 means 75% probability that this is getting used in text so ranking of this term is very high
Let’s check for different token –
input_tokens = ['trump'] Output df_terms
Ranking for this word is really High 97 %
We can also do the analysis for sentence as well and you will see the different ranking and probability of each word in sentence.
input_text = "corona is spreading in india during lockdown as well"
- India is also ranked very high with 98% so you can say recent tweets are more related to india and you know people are also talking about lock-down its 45% around
Note – We have taken only 5000 tweets for analysis i am sure this can be better if we take more data , may be in my next post i will be uploading dataset from kaggle and will be playing more with TA & TM functions.
i will be sharing the notebook on my GitHub