Intro

Former Member · ‎10-30-2014

You can find the series and project from Real-time sentiment rating of movies on SAP HANA (part 6) - wrap-up and looking ahead

Intro

Till now we do not need to worry about the lack of data, since we've prepared the metadata of new release movies and tweets in Real-time sentiment rating of movies on SAP HANA (part 1) - create schema and tables and Real-time sentiment rating of movies on SAP HANA (part 2) - data preparation. With job scheduling, the data will be inserted into SAP HANA automatically and continuously. So, in this blog we will make some text analysis on our data and prepare some models which will be consumed later in the UI part.

Sentiment analysis on tweets

I won't explain the details of text analysis in SAP HANA. If you are interested in SAP HANA text analysis, you can have a look at SAP HANA Platform – SAP Help Portal Page, "SAP HANA Search and Text Analysis References" part.

Regarding our scenario, we want to do the sentiment analysis on tweets, especially the "text" field. Now let's do it! It's not difficult for us to implement the sentiment analysis part, since what we need is just a full text index. Unfortunately, we cannot use full text index in SAP HANA XS currently. So, in order to realize this, we have to run the following SQL statements manually.



CREATE FULLTEXT INDEX "MOVIE_RATING"."TWEETS_I" ON "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ("text") CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER' ASYNC LANGUAGE DETECTION ('EN') TEXT ANALYSIS ON;

ALTER TABLE "MOVIE_RATING"."$TA_TWEETS_I" ADD CONSTRAINT TWEETS_FK FOREIGN KEY("id") REFERENCES "MOVIE_RATING"."movieRating.data::movieRating.Tweets"("id") ON DELETE CASCADE;

Here is the explanation.

- For the first SQL statement, we create a full text index "TWEETS_I" on the "text" field of tweets table. In addition, we use the voice of customer configuration which will detect the sentiment of the "text" field and since we only search tweets in English, we can limit the language detection to English only. After you run this SQL statement successfully, a new table named "$TA_TWEETS_I" will be created which maintains the analysis results. For the syntax of "CREATE FULLTEXT INDEX", please refer CREATE FULLTEXT INDEX - SAP HANA SQL and System Views Reference - SAP Library

- For the second SQL statement, it is optional. We just define a foreign key "id" in the automatic generated table "$TA_TWEETS_I" which references the "id" column in our "TWEET" table. Imagine if we delete a tuple in "TWEET" table, the corresponding text analysis records will also be deleted automatically.

Now we can find what's going on in the "$TA_TWEETS_I" table. You may find with the capability of text analysis, SAP HANA is able to detect the property， e.g., SOCIAL_MEDIA/TOPIC_TWITTER, DAY, DATE, URI/URL, PRODUCT, Sentiment, etc. (in "TA_TYPE" column) of the extracted token (in "TA_TOKEN" column). Here I highlighted some records in blue boxes and red boxes respectively. For example, there was a movie titled "Justice is Mind", since we searched tweets with the hashtag "#JusticeisMind" for this movie, you can find lots of "#justiceismind" in the "TA_TOKEN" field which are marked in blue boxes. As in our smart app, we focus on the sentiment analysis, I selected some records in red boxes, e.g., "great" is tagged as "StrongPositiveSentiment" and "Nice" is tagged as "WeakPositiveSentiment". Everything seems reasonable.

GENERATED ALWAYS AS

Because we plan to make some analysis based on the time dimension, let's first have a look at what we have now. From the last two pictures in Real-time sentiment rating of movies on SAP HANA (part 2) - data preparation, you can find we have "release_date" of movies and "created_at" of tweets.

Movies

Tweets

But we want more such as year, month, day something like that, so we can create some "GENERATED ALWAYS AS" columns to achieve this. Since currently CDS does not support this feature, we need to write some SQL statements manually. You can find the syntax from ALTER TABLE - SAP HANA SQL and System Views Reference - SAP Library or CREATE TABLE - SAP HANA SQL and System Views Reference - SAP Library which means you can create "GENERATED ALWAYS AS" columns not only when you create tables but when you alter tables.

We create the following "GENERATED ALWAYS AS" columns for movies.



ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_year" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'YYYY'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_month" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'MON'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_day" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'DD'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_week" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'WW'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("year_str" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("year"));

Meanwhile we create the following "GENERATED ALWAYS AS" columns for tweets.



ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_year" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'YYYY'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_month" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'MON'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_day" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'DD'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_hour" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'HH24'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_week" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'WW'));

ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("source_str" NVARCHAR(200) GENERATED ALWAYS AS SUBSTR_BEFORE(SUBSTR_AFTER("source", '>'), '<'));

That's it. Now you can see additional columns in the "Movies" and "Tweets" table.

Movies

Tweets

Modeling

Now let's create some information views which can be easily consumed in the UI part.

1. AT_TWEETS.attributeview

Step 1: Data Foundation

This attribute view is focused on sentiments with tweet and movie info, so we use "$TA_TWEETS_I", "Tweets" and "Movies" these three tables as data foundation. Sub-steps are shown as follows.

Join "$TA_TWEETS_I" and "Tweets" on the ID of tweet;
Join "Tweets" and "Movies" on the ID of movie.
Filter "TA_TYPE" with only five values "StrongPositiveSentiment", "WeakPositiveSentiment", "NeutralSentiment", "WeakNegativeSentiment" and "StrongNegativeSentiment", i.e., we only show tweets with sentiment and we ignore tweets without sentiment.
Add necessary columns to the output and create some calculated columns based on the output.

Step 2: Semantics

In this step, we define key attributes and hide some useless columns.

Step 3: Validate, activate and preview data

The following list displays the final output attributes.

token: the sentiment token, e.g., great, nice, wonderful, amazing, love, bad, ...
sentiment: StrongPositiveSentiment/WeakPositiveSentiment/NeutralSentiment/WeakNegativeSentiment/StrongNegativeSentiment
user_screen_name: the screen name of the user who posted the tweet
user_profile_image_url: the URL of the user profile icon
tweet_source: the source of the tweet, e.g., Twitter Web Client, Twitter for iPhone, ...
tweet_time_year: year of created_at
tweet_time_month: month of created_at
tweet_time_day: day of created_at
tweet_time_hour: hour of created_at
tweet_time_week: week of created_at
movie_id: movie ID
title: movie title
mpaa_rating: MPAA rating
movie_year: movie in year
runtime: movie runtime
release_date: movie release date in theater
poster: the URL of movie poster in thumbnail
studio: the studio of the movie
id_counter: the primary key
sent: sentiment in text like "Strong Positive Sentiment" instead of "StrongPostiveSentiment"
created_at_str: created_at in 'YYYY-MM-DD HH24:MM:SS' format
created_at_time_str: created at in 'MM-DD HH24:MM:SS' format
source_url: the URL of tweet source, like Download the free Twitter app | Twitter
text_head: the text before the token
text_tail: the text after the token
synopsis: movie synopsis
release_date_year: year of release_date
poster_pro: the URL of movie poster in profile
release_date_month: month of release_date
release_date_day: day of release_date
release_date_week: week of release_date

2. CV_MOVIES.calculationview

Compared with the above attribute view, this calculation view is focused more on new release movies themselves. What we want to do is given a release date range getting several attributes and calculated measures of movies in this range, e.g., # of mentions and the rating score.

Step 1: We use SQLScript instead of graphical view. Here is what we want to get:

id: movieID
title: movie title
poster: the URL of movie poster in profile
release_date: the release date of movie
mention: # of sentiments detected related with this movie
rating: the rating score

rating = (# of strong positive sentiment * 10 + # of weak positive sentiment * 8 + # of neutral sentiment * 6 + # of weak negative sentiment * 4 + # of strong negative sentiment * 2) / # of total sentiments

Step 2: Define input parameters date_from and date_to

Step 3: Add target columns

Step 4: Handle semantics

Step 5: Choose dates and preview data

Next Steps

Till now we can automatically analyze the sentiment from real-time tweets which we've searched in Real-time sentiment rating of movies on SAP HANA (part 2) - data preparation and we've built two information views, one for sentiments and the other for movies. In the next two blogs we'll have a look at the UI part and consume the information views in this blog.

Hope you enjoyed reading my blog. :smile:

Real-time sentiment rating of movies on SAP HANA (part 3) - text analysis and modeling

Intro

Sentiment analysis on tweets

GENERATED ALWAYS AS

Modeling

1. AT_TWEETS.attributeview

2. CV_MOVIES.calculationview

Next Steps

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win