You can find the series and project from Real-time sentiment rating of movies on SAP HANA (part 6) – wrap-up and looking ahead

Intro

Till now we do not need to worry about the lack of data, since we’ve prepared the metadata of new release movies and tweets in Real-time sentiment rating of movies on SAP HANA (part 1) – create schema and tables and Real-time sentiment rating of movies on SAP HANA (part 2) – data preparation. With job scheduling, the data will be inserted into SAP HANA automatically and continuously. So, in this blog we will make some text analysis on our data and prepare some models which will be consumed later in the UI part.

Sentiment analysis on tweets

I won’t explain the details of text analysis in SAP HANA. If you are interested in SAP HANA text analysis, you can have a look at SAP HANA Platform – SAP Help Portal Page, “SAP HANA Search and Text Analysis References” part.

Regarding our scenario, we want to do the sentiment analysis on tweets, especially the “text” field. Now let’s do it! It’s not difficult for us to implement the sentiment analysis part, since what we need is just a full text index. Unfortunately, we cannot use full text index in SAP HANA XS currently. So, in order to realize this, we have to run the following SQL statements manually.


CREATE FULLTEXT INDEX "MOVIE_RATING"."TWEETS_I" ON "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ("text") CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER' ASYNC LANGUAGE DETECTION ('EN') TEXT ANALYSIS ON;
ALTER TABLE "MOVIE_RATING"."$TA_TWEETS_I" ADD CONSTRAINT TWEETS_FK FOREIGN KEY("id") REFERENCES "MOVIE_RATING"."movieRating.data::movieRating.Tweets"("id") ON DELETE CASCADE;
















Here is the explanation.

– For the first SQL statement, we create a full text index “TWEETS_I” on the “text” field of tweets table. In addition, we use the voice of customer configuration which will detect the sentiment of the “text” field and since we only search tweets in English, we can limit the language detection to English only. After you run this SQL statement successfully, a new table named “$TA_TWEETS_I” will be created which maintains the analysis results. For the syntax of “CREATE FULLTEXT INDEX”, please refer CREATE FULLTEXT INDEX – SAP HANA SQL and System Views Reference – SAP Library

– For the second SQL statement, it is optional. We just define a foreign key “id” in the automatic generated table “$TA_TWEETS_I” which references the “id” column in our “TWEET” table. Imagine if we delete a tuple in “TWEET” table, the corresponding text analysis records will also be deleted automatically.

Now we can find what’s going on in the “$TA_TWEETS_I” table. You may find with the capability of text analysis, SAP HANA is able to detect the property, e.g., SOCIAL_MEDIA/TOPIC_TWITTER, DAY, DATE, URI/URL, PRODUCT, Sentiment, etc. (in “TA_TYPE” column) of the extracted token (in “TA_TOKEN” column). Here I highlighted some records in blue boxes and red boxes respectively. For example, there was a movie titled “Justice is Mind”, since we searched tweets with the hashtag “#JusticeisMind” for this movie, you can find lots of “#justiceismind” in the “TA_TOKEN” field which are marked in blue boxes. As in our smart app, we focus on the sentiment analysis, I selected some records in red boxes, e.g., “great” is tagged as “StrongPositiveSentiment” and “Nice” is tagged as “WeakPositiveSentiment”. Everything seems reasonable.

/wp-content/uploads/2014/10/10_1_573040.png

GENERATED ALWAYS AS

Because we plan to make some analysis based on the time dimension, let’s first have a look at what we have now. From the last two pictures in Real-time sentiment rating of movies on SAP HANA (part 2) – data preparation, you can find we have “release_date” of movies and “created_at” of tweets.

Movies

/wp-content/uploads/2014/10/8_1_573155.png

Tweets

/wp-content/uploads/2014/10/9_1_573078.png

But we want more such as year, month, day something like that, so we can create some “GENERATED ALWAYS AS” columns to achieve this. Since currently CDS does not support this feature, we need to write some SQL statements manually. You can find the syntax from ALTER TABLE – SAP HANA SQL and System Views Reference – SAP Library or CREATE TABLE – SAP HANA SQL and System Views Reference – SAP Library which means you can create “GENERATED ALWAYS AS” columns not only when you create tables but when you alter tables.

We create the following “GENERATED ALWAYS AS” columns for movies.


ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_year" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'YYYY'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_month" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'MON'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_day" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'DD'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("release_date_week" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("release_date", 'WW'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Movies" ADD ("year_str" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("year"));












Meanwhile we create the following “GENERATED ALWAYS AS” columns for tweets.


ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_year" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'YYYY'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_month" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'MON'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_day" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'DD'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_hour" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'HH24'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("created_at_week" NVARCHAR(20) GENERATED ALWAYS AS TO_NVARCHAR("created_at", 'WW'));
ALTER TABLE "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ADD ("source_str" NVARCHAR(200) GENERATED ALWAYS AS SUBSTR_BEFORE(SUBSTR_AFTER("source", '>'), '<'));












That’s it. Now you can see additional columns in the “Movies” and “Tweets” table.

Movies

/wp-content/uploads/2014/10/11_1_573156.png

Tweets

/wp-content/uploads/2014/10/12_1_573157.png

Modeling

Now let’s create some information views which can be easily consumed in the UI part.

1. AT_TWEETS.attributeview

Step 1: Data Foundation

This attribute view is focused on sentiments with tweet and movie info, so we use “$TA_TWEETS_I”, “Tweets” and “Movies” these three tables as data foundation. Sub-steps are shown as follows.

  1. Join “$TA_TWEETS_I” and “Tweets” on the ID of tweet;
  2. Join “Tweets” and “Movies” on the ID of movie.
  3. Filter “TA_TYPE” with only five values “StrongPositiveSentiment”, “WeakPositiveSentiment”, “NeutralSentiment”, “WeakNegativeSentiment” and “StrongNegativeSentiment”, i.e., we only show tweets with sentiment and we ignore tweets without sentiment.
  4. Add necessary columns to the output and create some calculated columns based on the output.

13.PNG

Step 2: Semantics

In this step, we define key attributes and hide some useless columns.

14.PNG

Step 3: Validate, activate and preview data

15.PNG

The following list displays the final output attributes.

  • token: the sentiment token, e.g., great, nice, wonderful, amazing, love, bad, …
  • sentiment: StrongPositiveSentiment/WeakPositiveSentiment/NeutralSentiment/WeakNegativeSentiment/StrongNegativeSentiment
  • user_screen_name: the screen name of the user who posted the tweet
  • user_profile_image_url: the URL of the user profile icon
  • tweet_source: the source of the tweet, e.g., Twitter Web Client, Twitter for iPhone, …
  • tweet_time_year: year of created_at
  • tweet_time_month: month of created_at
  • tweet_time_day: day of created_at
  • tweet_time_hour: hour of created_at
  • tweet_time_week: week of created_at
  • movie_id: movie ID
  • title: movie title
  • mpaa_rating: MPAA rating
  • movie_year: movie in year
  • runtime: movie runtime
  • release_date: movie release date in theater
  • poster: the URL of movie poster in thumbnail
  • studio: the studio of the movie
  • id_counter: the primary key
  • sent: sentiment in text like “Strong Positive Sentiment” instead of “StrongPostiveSentiment”
  • created_at_str: created_at in ‘YYYY-MM-DD HH24:MM:SS’ format
  • created_at_time_str: created at in ‘MM-DD HH24:MM:SS’ format
  • source_url: the URL of tweet source, like Download the free Twitter app | Twitter
  • text_head: the text before the token
  • text_tail: the text after the token
  • synopsis: movie synopsis
  • release_date_year: year of release_date
  • poster_pro: the URL of movie poster in profile
  • release_date_month: month of release_date
  • release_date_day: day of release_date
  • release_date_week: week of release_date

2. CV_MOVIES.calculationview

Compared with the above attribute view, this calculation view is focused more on new release movies themselves. What we want to do is given a release date range getting several attributes and calculated measures of movies in this range, e.g., # of mentions and the rating score.

Step 1: We use SQLScript instead of graphical view.  Here is what we want to get:

  • id: movieID
  • title: movie title
  • poster: the URL of movie poster in profile
  • release_date: the release date of movie
  • mention: # of sentiments detected related with this movie
  • rating: the rating score

rating = (# of strong positive sentiment * 10 + # of weak positive sentiment * 8 + # of neutral sentiment * 6 + # of weak negative sentiment * 4 + # of strong negative sentiment * 2) / # of total sentiments

16.PNG

Step 2: Define input parameters date_from and date_to

17.PNG

Step 3: Add target columns

18.PNG

Step 4: Handle semantics

19.PNG

Step 5: Choose dates and preview data

20.PNG

21.PNG

Next Steps

Till now we can automatically analyze the sentiment from real-time tweets which we’ve searched in Real-time sentiment rating of movies on SAP HANA (part 2) – data preparation and we’ve built two information views, one for sentiments and the other for movies. In the next two blogs we’ll have a look at the UI part and consume the information views in this blog.

Hope you enjoyed reading my blog. 🙂

To report this post you need to login first.

3 Comments

You must be Logged on to comment or reply to a post.

Leave a Reply