Shivam Shukla

COVID-19 Tweets Analysis Using HANA ML Python APIs

Hello Everyone,

Hope you are doing well and staying safe 🙂

For a long time I have been seeing various posts on COVID-19 data analysis and forecasting, and with the same thought in mind I decided to share my recent work, mainly text mining and text analysis on COVID-19 tweets.

The idea is to connect to Twitter from Python, insert the data into SAP HANA, and then use SAP HANA's powerful built-in capabilities for text mining and text analysis: displaying a word cloud and trying out a few text mining functions as well.

People are expressing their views and thoughts on Twitter with different sentiments, which gives us a lot of unstructured data. Digging into it is a genuinely difficult job, which is why I picked this area to try my hand at and to showcase a few insights from it. I hope you like it, and please feel free to share your suggestions and feedback.

Technical flow for Twitter and HANA communication – Python makes the job easy here: it provides smooth integration with SAP HANA, plus many built-in functions we can use for mining and analysis.

HANA ML Python APIs – for more information about the SAP HANA machine learning APIs, please refer to the SAP HANA documentation.

Prerequisites: a working HANA ML Python environment + an SAP HANA database (up and running)

HANA ML Installation – https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html
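Before going any further, it is worth confirming that the hana_ml package imports cleanly in the same Python environment (a quick sanity check; if the version attribute is not exposed in your build, the import succeeding is enough):

# quick sanity check that the HANA ML client library is installed
import hana_ml
print(hana_ml.__version__)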

Let’s go step by step

  • First of all, we need a Twitter developer account. This is pretty easy and straightforward: go to the Twitter developer portal, set up an app for yourself, and save all the tokens shown there somewhere handy (e.g. a notepad file).
  • After the app is created successfully you will see the tokens below; please keep them handy and note them down.

 

Twitter developer tokens

  • Go to your conda/Python command prompt and install a Twitter client library (there are a few Python clients available; install whichever you prefer – here we use Tweepy).
  • Execute the command below:
    pip install tweepy
    
    ## if the install succeeds, verify it by importing the package
    
    import tweepy as twitter
    
    
  • Initialize all the tokens with the values we noted down from the Twitter developer account; these will be used to connect to Twitter (a sketch for loading them from environment variables instead of hardcoding them follows below).
    consumer_key = 'e6QvccFkjlkdsjlksadjlkjdsa'
    consumer_secret = 'aslkdjsalkdjsalkdjlsakd'
    token='1384909104-2alakdlsakdlsakdlakdslkdsalkdsaladkads'
    token_secret='rT0ZEUdasldkasdkjsadlkalkdjlkdajlkdsa'
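Hardcoding credentials in a notebook is fine for a quick demo, but a slightly safer pattern (a minimal sketch, assuming you export the four values as environment variables with the hypothetical names shown) looks like this:

import os

# hypothetical environment variable names - adjust to whatever you use
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
token = os.environ["TWITTER_ACCESS_TOKEN"]
token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]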

     

  • Start with Twitter authentication; if the connection is successful, test it by downloading a few tweets based on hashtags.
  • # Authenticate to Twitter by passing the key and secret tokens
    auth = twitter.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(token, token_secret)
    
    # Create the API object - wait_on_rate_limit=True so that we can download
    # tweets without hitting connection breaks on the rate limit
    api = twitter.API(auth, wait_on_rate_limit=True)
    
    # Create a tweet - let's see this
    api.update_status("Hi from HANA ML API to Twitter - Analysing tweets")
  • Tweeting using Python – the last command will post a tweet from my account.

 

 

 

 

  • The connectivity between Python and Twitter is fine now. Let's try to download a few tweets and see how they arrive at the client; some cleansing and formatting will be required before sending them to the SAP HANA DB.

 

  • Try fetching the first 10 tweets based on the hashtag (#COVID19) and let's see what we get:
    hashtag = '#COVID19'
    date = '2020-06-30'
    # fetch the first 10 matching English tweets posted since the given date
    tweets = twitter.Cursor(api.search, q=hashtag, lang='en', since=date).items(10)
    # print every hashtag attached to each tweet
    for tweet in tweets:
        for i in range(len(tweet.entities.get('hashtags'))):
            print(tweet.entities.get('hashtags')[i]['text'])

 

  • Here we are passing a date as well, so it will return 10 items posted since 30th June. Let's have a look at the output. It's up to you which parameters you choose, and you are free to change the hashtag.

 

  • If you look at the image below, people are mainly talking about masks, air travel, etc. We will look at this in detail during text analysis; this is just to check whether we can download the tweets at all.
  • Now execute the same command for 5,000 items:
  • tweets = twitter.Cursor(api.search, q=hashtag, lang='en', since=date).items(5000)
    tweet_data = [[tweet.geo, tweet.text, tweet.user.screen_name, tweet.user.location,
                   tweet.created_at, tweet.entities['hashtags']] for tweet in tweets]
  • This will download 5,000 tweets, and we can certainly do some text analysis with that much data. You can go for 50K items as well; it will take some time, so save those 50K tweets somewhere in a spreadsheet, and of course you can then try more text mining functions on them.

 

Columns

  • GEO (geo location)
  • Tweet text
  • User name (screen name)
  • Location
  • Created date
  • Hashtags

 

Create a dataframe for storing the tweets 

import pandas as pd
tweet_df = pd.DataFrame(data=tweet_data,columns=['geo','text','user_name','loc','date','tags'])
pd.set_option('display.max_colwidth', 800)   # show the full tweet text when displaying the DataFrame

 

Define a function to fetch Hash Tags 

def hash_tag(tdata):
    # flatten the list of hashtag entities into a single comma-separated string
    tags = ''
    for i in range(len(tdata)):
        tags = tags + str(tdata[i]['text']) + ','
    tags = tags.strip(',')
    return tags
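For example, on a typical Tweepy hashtag entities list (the sample below is made up) the function behaves like this:

# hypothetical sample of what tweet.entities['hashtags'] contains
sample = [{'text': 'COVID19', 'indices': [0, 8]}, {'text': 'StaySafe', 'indices': [9, 18]}]
print(hash_tag(sample))   # -> 'COVID19,StaySafe'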
  • Process the tweets – flatten the hashtag column into plain comma-separated strings:
# process the tweets to get clean data for the hashtags column

tweet_df['tags'] = tweet_df['tags'].apply(hash_tag)

 

Let's check how many records tweet_df has: it's 5000 × 6.

 

Regex function for cleaning the text

Let's now build some regex to clean the text before doing any text mining and before inserting the data into SAP HANA.

import re

re.sub(r"RT @[\w]*:", "", tweet_df.text[0])

Output 
' BREAKING NEWS  Kimberly Guilfoyle  Trump Jr s girlfriend has tested positive for COVID19'

# RT and the @username are replaced with a space; let's clean further

## Complete cleaning function: it removes the RT prefix, @mentions, URLs, new lines,
## '&amp' entities, '#' characters and any remaining non-word characters.
def clean_data(tweet):
    tweet = re.sub(r"RT @[\w]*:", "", tweet)
    tweet = re.sub(r"@[\w]*", "", tweet)
    tweet = re.sub(r"https?://[A-Za-z0-9./]*", "", tweet)
    tweet = re.sub(r"\n", "", tweet)
    tweet = re.sub(r"&amp", "", tweet)
    tweet = re.sub(r"#", "", tweet)
    tweet = re.sub(r"[^\w]", ' ', tweet)
    return tweet
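A quick check on a made-up raw tweet shows what the function strips out:

# hypothetical raw tweet, just to illustrate the cleaning steps
raw = "RT @someuser: Stay home &amp; stay safe!\n#COVID19 https://t.co/abc123"
print(clean_data(raw))   # the RT prefix, mention, URL, '#' and punctuation are gone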

 

Apply the regex function and clean the text

tweet_df['text'] = tweet_df['text'].apply(lambda x: clean_data(x))

Print the text 

tweet_df['text']

Output

If you look at the text, most of it is clean now, but there is always scope for improvement. For example, I could also remove filler words like "HEY", but HANA will drop those anyway through its stop-word handling.

 

Save DataFrame

Save this DataFrame to a local file so that you have it available for later use –

tweet_df.to_csv('test_new.csv')

Establish the connection to SAP HANA System 

from hana_ml import dataframe
from data_load_utils import DataSets, Settings
url, port, user, pwd = Settings.load_config(r"C:\Users\ABC\config\abc.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

## if the connection is successful, this prints:

True

 

Go to your HANA System now – 

  • Choose the schema where you want to create your table for storing the tweets and go to the SQL console (the column names below match the INSERT statement we will run from Python later):
    CREATE COLUMN TABLE TWEETS_DATA(
    "ID" INTEGER NOT NULL,
    "GEO" NVARCHAR(100),
    "TEXT" NVARCHAR(140),
    "USER_NAME" NVARCHAR(100),
    "LOC" NVARCHAR(100),
    "DATE" DATE,
    "HASH_TAGS" NVARCHAR(100),
    PRIMARY KEY("ID")
    );
    
  • Note: once the table is created, also create a full text index with text analysis and text mining enabled on it (generic syntax below; a concrete sketch for our table follows right after this list).
  • CREATE FULLTEXT INDEX index_name ON table_name (column_name)
     ASYNC
     FAST PREPROCESS OFF
     SEARCH ONLY OFF
     TEXT ANALYSIS ON
     TEXT MINING ON
     TEXT MINING CONFIGURATION MYTMCONFIGURATION.textminingconfig
     LANGUAGE DETECTION ('EN')
     CONFIGURATION 'LINGANALYSIS_FULL'
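For our table, a concrete statement might look like the sketch below. Assumptions: schema ABC, table TWEETS_DATA, text column "TEXT", and the index name TWEET_DATA chosen so that the generated token table becomes "$TA_TWEET_DATA" (which is what we query later); the optional text mining configuration is left out. It can also be run straight from Python over the hana_ml connection:

# hypothetical concrete index creation, run from Python instead of the SQL console
create_index_sql = """
CREATE FULLTEXT INDEX "TWEET_DATA" ON "ABC"."TWEETS_DATA" ("TEXT")
 ASYNC
 FAST PREPROCESS OFF
 SEARCH ONLY OFF
 TEXT ANALYSIS ON
 TEXT MINING ON
 LANGUAGE DETECTION ('EN')
 CONFIGURATION 'LINGANALYSIS_FULL'
"""

with connection_context.connection.cursor() as cur:
    cur.execute(create_index_sql)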

 

Come back to the Jupyter notebook (I am using Jupyter for all the Python work posted above).

  • Create an empty list, e.g. data = [ ]
  • Append all the tweets to this list and insert the data into the tweets table:
data = []
for indx, row in tweet_df.iterrows():
    data.append([indx, row['geo'], row['text'], row['user_name'], row['loc'], row['date'], row['tags']])

sql = ('insert into ABC.TWEETS_DATA("ID", "GEO", "TEXT", "USER_NAME", "LOC", "DATE", "HASH_TAGS") '
       'VALUES (?, ?, ?, ?, ?, ?, ?)')
with connection_context.connection.cursor() as cur:
    rows_inserted = cur.executemany(sql, data)
print(rows_inserted)

 

Note – you can also use a CSV upload approach instead; it's up to you which approach to follow.

 

Text Mining /  Text Analysis 

Let's first look at the word cloud generated directly from HANA –

 

Let's now try the same thing using Python, which talks to SAP HANA through a DataFrame.

 

# Start with loading all the necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

 

## Prepare SQL for Tokens 

sql_new = """SELECT STRING_AGG( "TA_TOKEN", ',' ) AS TOKENS FROM "ABC"."$TA_TWEET_DATA" """

 

## collect() fetches the result set from HANA into a pandas DataFrame
df_new_sql = connection_context.sql(sql_new).collect()
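If you want to peek at what text analysis produced before aggregating, you can preview the token table first (a small sketch; it assumes the "$TA_TWEET_DATA" table generated by the full text index above and its standard TA_TOKEN / TA_TYPE columns):

# hypothetical peek at the tokens produced by text analysis
ta_preview = connection_context.sql('SELECT "TA_TOKEN", "TA_TYPE" FROM "ABC"."$TA_TWEET_DATA"').head(10).collect()
print(ta_preview)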

 

Generate word cloud using python

Set the properties of the word cloud before displaying it:

wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                min_font_size = 12).generate(df_new_sql.TOKENS[0]) 
  
# plot the WordCloud image                        
plt.figure(figsize = (12, 12), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 5) 
  
plt.show() 

 

Note – you may still see some bad data (the text is not 100% clean), but you will also see the important text mining output clearly.
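If the noise bothers you, one option is to pass a stop-word list to WordCloud (a small sketch, reusing the STOPWORDS set imported above plus an assumed handful of junk tokens) and then re-run the plotting code above:

# hypothetical extra filtering: common English stopwords plus a few custom junk tokens
custom_stopwords = set(STOPWORDS) | {"https", "co", "RT", "amp"}

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = custom_stopwords,
                min_font_size = 12).generate(df_new_sql.TOKENS[0])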

 

Execute the Python code above to generate the word cloud.

 

  • If you look at the word cloud (data coming directly from SAP HANA), you of course see the COVID-19 tag, and in addition words like "severe", "respiratory", "sick", "neurological", "cases", "mask", etc.
  • Other notable words – banned, plasma, President

By inserting more tweets this way, you will surely get more relevant data to display in the word cloud.

 

Fetching terms and checking their importance using SAP HANA text mining –

 

Function used – GET_SUGGESTED_TERMS (see the SAP HANA text mining SQL reference, e.g. TM_GET_SUGGESTED_TERMS, for more info)

  • Let's see how SAP HANA's text mining capability scores the probability of the word "CORONA":
input_tokens = ['CORONA']
  • Call the function
## Create a DataFrame to store the result
df_terms = pd.DataFrame()

## Call the TM function to fetch the score for each input token
for token in input_tokens:
    df_new = GET_SUGGESTED_TERMS(token, "TEXT")
    df_terms = df_terms.append(df_new)
 

Output

See that the score for the first record is 0.75, i.e. a 75% probability that this term is being used in the text, so the ranking of this term is very high.

Let's check a different token –

input_tokens = ['trump']

Output
df_terms

The ranking for this word is really high: 97%.

 

We can also run the analysis on a whole sentence, and you will see a different ranking and probability for each word in the sentence.

 

input_text = "corona is spreading in india during lockdown as well"

 

  • India is also ranked very high, at 98%, so you can say recent tweets are strongly related to India; people are also talking about the lockdown, which scores around 45%.

 

Note – we have taken only 5,000 tweets for this analysis, and I am sure the results can get better with more data. Maybe in my next post I will upload a dataset from Kaggle and play more with the TA & TM functions.

 

I will be sharing the notebook on my GitHub.

 

Thanks.

 

 
