You can find the series and project from Real-time sentiment rating of movies on SAP HANA (part 6) – wrap-up and looking ahead

Intro

In the previous blog Real-time sentiment rating of movies on SAP HANA (part 1) – create schema and tables, we’ve determined APIs which we’ll use in our app and create some tables like movies and tweets with CDS. In this blog, we will prepare the data. In the first version of the smart app, I just crawled the metadata of new release movies once which means there were only about 20 movies at all. With the lack of Outbound Connectivity and Job Scheduling features, I could only use Java multithreading to crawl tweets continuously. With the introduction of Outbound Connectivity since SPS06 and  Job Scheduling since SPS07, now I can use SAP HANA XS to prepare data. We will discuss it in this blog.

Create roles

At the end of Real-time sentiment rating of movies on SAP HANA (part 1) – create schema and tables, we can see tables are already successfully created. However, they are owned by _SYS_REPO instead of the user which you use to add the SAP HANA system and login. So, first of all, we need to create few roles and grant these roles to some users. We can create two roles as follows. The user role has the select privilege on our design-time schema which means this role can only select data from this schema, while the administration role extends the user role and has additional insert/delete/update privileges on the schema which means the admin role can do select/insert/update/delete on this schema.

Notice: we use schema instead of catalog schema in our role definitions. You’re recommended to do so since we will build a pure SAP HANA native app.

User.hdbrole


role movieRating.roles::User {
  schema movieRating.data:MOVIE_RATING.hdbschema: SELECT;
}






Admin.hdbrole


role movieRating.roles::Admin
extends role movieRating.roles::User {
  schema movieRating.data:MOVIE_RATING.hdbschema: INSERT, DELETE, UPDATE;
}






After the activation, we can use the following SQL to grant the activated roles. In this example, we grant the admin role to a user who will crawl data from APIs and insert into SAP HANA.


CALL "_SYS_REPO"."GRANT_ACTIVATED_ROLE"('movieRating.roles::Admin', '<USERNAME>');






Create HTTP/HTTPS destinations

Since we want to use Outbound Connectivity in SAP HANA XS, the first thing is to create HTTP/HTTPS destinations. We will call APIs from Rotten Tomatoes API and Twitter API, so we need to create two destinations, one for Rotten Tomatoes API, the other for Twitter API.

rottenTomatoesApi.xshttpdest


description = "rotten tomatoes api";
host = "api.rottentomatoes.com";
port = 80;
pathPrefix = "/api/public/v1.0";
useProxy = true;
proxyHost = "proxy.pal.sap.corp";
proxyPort = 8080;
authType = none;
useSSL = false;
timeout = 0;






twitterApi.xshttpdest


description = "twitter api";
host = "api.twitter.com";
port = 443;
pathPrefix = "/1.1";
useProxy = true;
proxyHost = "proxy.pal.sap.corp";
proxyPort = 8080;
authType = none;
useSSL = true;
timeout = 0;






As we can call Rotten Tomatoes API via HTTP, you don’t need to configure anything. However for Twitter API, we can only call APIs via HTTPS and it is kind of complex to configure. It will take some time here. Please make sure you’ve finished Use XSJS outbound connectivity to search tweets before you go forward. So, for twitterApi.xshttpdest, we need to configure the trust store as showed in the red box below.

5.PNG

Create XSJS to crawl data

Now we can code XSJS to crawl data. We can create two XSJS files. One is for getting metadata of new release movies, the other is for searching tweets.

searchMovies.xsjs


function hashtag(title) {
  return "#" + title.split(":")[0].replace(/\W/g, "");
}
function searchMovies() {
  var baseURL = "/lists/movies/opening.json?limit=50&country=us";
  var apikey = "<your_api_key>";
  var destination = $.net.http.readDestination("movieRating.services", "rottenTomatoesApi");
  var client = new $.net.http.Client();
  var request = new $.net.http.Request($.net.http.GET, baseURL + "&apikey=" + apikey);
  var response = client.request(request, destination).getResponse();
  var movies = JSON.parse(response.body.asString()).movies;
  if (movies) {
  var movie;
  var conn = $.db.getConnection();
  var pstmtMovies = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.Movies" ("id", "title", "year", "mpaa_rating", "runtime", "release_date", "synopsis", "poster", "studio", "hashtag", "timestamp", "since_id") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)');
  var pstmtGenres = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.Genres" ("movie_id", "genre") VALUES (?, ?)');
  var pstmtAbridgedCast = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.AbridgedCast" ("movie_id", "cast") VALUES (?, ?)');
  var pstmtAbridgedDirectors = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.AbridgedDirectors" ("movie_id", "director") VALUES (?, ?)');
  for (var i in movies) {
  request = new $.net.http.Request($.net.http.GET, "/movies/" + movies[i].id + ".json?apikey=" + apikey);
  response = client.request(request, destination).getResponse();
  movie = JSON.parse(response.body.asString());
  //Movie
  pstmtMovies.setInteger(1, movie.id);
  pstmtMovies.setString(2, movie.title);
  pstmtMovies.setInteger(3, movie.year, 10);
  pstmtMovies.setString(4, movie.mpaa_rating);
  pstmtMovies.setString(5, movie.runtime + "");
  pstmtMovies.setDate(6, movie.release_dates.theater, "YYYY-MM-DD");
  pstmtMovies.setString(7, movie.synopsis);
  pstmtMovies.setString(8, movie.posters.thumbnail);
  pstmtMovies.setString(9, movie.studio === undefined ? "" : movie.studio);
  pstmtMovies.setString(10, hashtag(movie.title));
  pstmtMovies.setTimestamp(11, new Date());
  pstmtMovies.setString(12, '0');
  pstmtMovies.execute();
  //Genres
  if (movie.genres) {
  for (var i in movie.genres) {
  pstmtGenres.setInteger(1, movie.id);
  pstmtGenres.setString(2, movie.genres[i]);
  pstmtGenres.execute();
  }
  }
  //AbridgedCast
  if (movie.abridged_cast) {
  for (var i in movie.abridged_cast) {
  pstmtAbridgedCast.setInteger(1, movie.id);
  pstmtAbridgedCast.setString(2, movie.abridged_cast[i].name);
  pstmtAbridgedCast.execute();
  }
  }
  //AbridgedDirectors
  if (movie.abridged_directors) {
  for (var i in movie.abridged_directors) {
  pstmtAbridgedDirectors.setInteger(1, movie.id);
  pstmtAbridgedDirectors.setString(2, movie.abridged_directors[i].name);
  pstmtAbridgedDirectors.execute();
  }
  }
  }
  pstmtMovies.close();
  pstmtGenres.close();
  pstmtAbridgedCast.close();
  pstmtAbridgedDirectors.close();
  conn.commit();
  conn.close();
  }
}






I don’t plan to explain the whole code, since you can find the XSJS reference from JSDoc: Index. I just want to mention several key points.

1. We first call Rotten Tomatoes API – Opening Movies to get all IDs of new release movies of current week, then we use Rotten Tomatoes API – Movie Info to get the detail info of each movie and insert into SAP HANA.

2. The hashtag() function: There are two steps: First, we remove the subtitle, for example there is a new release movie named “My Little Pony: Equestria Girls – Rainbow Rocks”, we just use “My Little Pony” as the movie title. Second, we just keep A-Za-z0-9_. Otherwise, some movie titles will be too long or there will be some marks in the movie title which are both not good for searching tweets.

3. For Rotten Tomatoes API – Opening Movies, you cannot get more than 50 new release movies and the default value of “limit” is 16 which is few to us. So, we set “limit” to 50.

4. We just use the release date in theaters and we change the data type from string to date.

5. since_id is 0 by default.

searchTweets.xsjs


function searchTweets() {
  var baseURL = "/search/tweets.json?lang=en&result_type=recent&count=100";
  var token = "<your_bearer_token>";
  var destination = $.net.http.readDestination("movieRating.services", "twitterApi");
  var client = new $.net.http.Client();
  var request, response, result, tweets, max_id_str;
  var conn = $.db.getConnection();
  var pstmtSelectMovies = conn.prepareStatement('SELECT "id", "hashtag", "since_id" FROM "MOVIE_RATING"."movieRating.data::movieRating.Movies" WHERE DAYS_BETWEEN("release_date", CURRENT_DATE) BETWEEN 0 AND 6 ORDER BY "release_date" DESC');
  var pstmtUpdateMovies = conn.prepareStatement('UPDATE "MOVIE_RATING"."movieRating.data::movieRating.Movies" SET "since_id" = ? WHERE "id" = ?');
  var pstmtTweets = conn.prepareStatement('INSERT INTO "MOVIE_RATING"."movieRating.data::movieRating.Tweets" ("id", "created_at", "text", "source", "user_screen_name", "user_profile_image_url", "longitude", "latitude", "movie_id", "timestamp") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)');
  var rs = pstmtSelectMovies.executeQuery();
  while (rs.next()) {
  try {
  request = new $.net.http.Request($.net.http.GET, baseURL + "&q=" + rs.getString(2) + "&since_id=" + rs.getString(3));
  request.headers.set('Authorization', token);
  response = client.request(request, destination).getResponse();
  result = JSON.parse(response.body.asString());
  tweets = result.statuses;
  max_id_str = result.search_metadata.max_id_str;
  if (tweets.length === 1) {
  pstmtTweets.setString(1, tweets[0].id_str);
  pstmtTweets.setTimestamp(2, tweets[0].created_at.slice(4, 20) + tweets[0].created_at.slice(-4), "MON DD HH24:MI:SS YYYY");
  pstmtTweets.setString(3, tweets[0].text);
  pstmtTweets.setString(4, tweets[0].source);
  pstmtTweets.setString(5, tweets[0].user.screen_name);
  pstmtTweets.setString(6, tweets[0].user.profile_image_url);
  if (tweets[0].coordinates === null) {
  pstmtTweets.setNull(7);
  pstmtTweets.setNull(8);
  } else {
  pstmtTweets.setDecimal(7, tweets[0].coordinates.coordinates[0]);
  pstmtTweets.setDecimal(8, tweets[0].coordinates.coordinates[1]);
  }
  pstmtTweets.setInteger(9, rs.getInteger(1));
  pstmtTweets.setTimestamp(10, new Date());
  pstmtTweets.execute();
  } else if (tweets.length > 1) {
  pstmtTweets.setBatchSize(tweets.length);
  for (var i in tweets) {
  pstmtTweets.setString(1, tweets[i].id_str);
  pstmtTweets.setTimestamp(2, tweets[i].created_at.slice(4, 20) + tweets[i].created_at.slice(-4), "MON DD HH24:MI:SS YYYY");
  pstmtTweets.setString(3, tweets[i].text);
  pstmtTweets.setString(4, tweets[i].source);
  pstmtTweets.setString(5, tweets[i].user.screen_name);
  pstmtTweets.setString(6, tweets[i].user.profile_image_url);
  if (tweets[i].coordinates === null) {
  pstmtTweets.setNull(7);
  pstmtTweets.setNull(8);
  } else {
  pstmtTweets.setDecimal(7, tweets[i].coordinates.coordinates[0]);
  pstmtTweets.setDecimal(8, tweets[i].coordinates.coordinates[1]);
  }
  pstmtTweets.setInteger(9, rs.getInteger(1));
  pstmtTweets.setTimestamp(10, new Date());
  pstmtTweets.addBatch();
  }
  pstmtTweets.executeBatch();
  }
  //Update "since_id" of the movie
  pstmtUpdateMovies.setString(1, max_id_str);
  pstmtUpdateMovies.setInteger(2, rs.getInteger(1));
  pstmtUpdateMovies.executeUpdate();
  } catch (e) {
  }
  }
  conn.commit();
  rs.close();
  pstmtSelectMovies.close();
  pstmtUpdateMovies.close();
  pstmtTweets.close();
  conn.close();
}






Similar to searchMovies.xsjs, I just explain some key points.

1. Regarding the following SQL, the first thing we want to do in searchTweets.xsjs is to get new release movies, because we will search tweets based on the hashtags of movies. Since there were only about 20 movies in the first version of this app, I can always search tweets about all movies. However, in the second version of the smart app, we plan to get new release movies continuously which means we will get new release movies of each week. So, we will have huge amounts of movies in our movie table. It’s impossible to crawl tweets about all these movies because of API Rate Limits | Twitter Developers. Imagine there are about 30 new release movies each week in the US, after one year there will be 1500+ movies in the movie table. The result is that we’ll be not able to crawl tweets in real time! So my current solution is just searching tweets about new release movies which has been released less than one week. For example, if a movie is released on 2014-09-30, the app will crawl tweets about this movie from 2014-09-30 to 2014-10-06.


SELECT "id", "hashtag", "since_id" FROM "MOVIE_RATING"."movieRating.data::movieRating.Movies" WHERE DAYS_BETWEEN("release_date", CURRENT_DATE) BETWEEN 0 AND 6 ORDER BY "release_date" DESC




2. We use since_id as one of the parameters, because we don’t expect old results which we have already crawled. And we keep since_id for each movie, so after we insert tweets into SAP HANA, we will update since_id of the movie. Next time we want to crawl tweets about this movie, we can use the updated since_id directly.

3. Parameters “lang”, “result_type”, “count”

– For “lang=en”, we just search tweets in English for simplicity. Currently SAP HANA supports sentiment analysis the following five languages. See Voice of the Customer Content – SAP HANA Text Analysis Language Reference Guide – SAP Library

  • English
  • German
  • French
  • Spanish
  • Simplified Chinese

– For “result_type=recent”, since we want to search tweets in real-time, we just need recent tweets. You can find other options from GET search / tweets | Twitter Developers

– For “count=100”, the maximum count of each return is 100, so we set to maximum.

4. We use batch when inserting tweets if possible. If we get only one tweet, cannot use batch. You can find the logic from line 27 – 67.

5. We change the data type of “created_at” from string to timestamp.

Notice: Before going to the job scheduling step, you are highly recommended to activate all objects by now and call searchMovies.xsjs and searchTweets.xsjs manually. If everything is OK, please jump into the job scheduling part. Since I will use searchMovies.xsjs and searchTweets.xsjs in the job scheduling, I wrapped them in two functions. If you call them manually, please remove function() {}, that is the first line and the last line

Create job scheduling

Now you should be able to search new release movies and tweets manually. Since SAP HANA XS now supports the job scheduling feature, why not use this awesome feature?

searchMovies.xsjob


{
    "description": "Search movies",
    "action": "movieRating.services:searchMovies.xsjs::searchMovies",
    "schedules": [
       {
          "description": "00:00 every Tuesday (UTC)",
          "xscron": "* * * tue 0 0 0"
       }
    ]
}




Since the new release movies of current week does not change usually, we can just search new release movies once per week. Here we call the “searchMovies” function at 00:00 every Tuesday. You may ask why this specific time? I noticed that most movies are released on Friday. The reason is obvious, we can go to the theaters at weekends. See When Is Opening Your Film on Wednesday a Good Idea? – Film.com So, Tuesday is early enough.

Notice: The time in .xsjob is UTC time and you cannot change it to other timezones.

searchTweets.xsjob


{
    "description": "Search tweets",
    "action": "movieRating.services:searchTweets.xsjs::searchTweets",
    "schedules": [
       {
          "description": "every 2 minutes",
          "xscron": "* * * * * */2 0"
       }
    ]
}




The above is the job scheduling we will use to search tweets. Due to the API Rate Limits | Twitter Developers, two minutes interval is currently a good choice which means the app will crawl tweets about a range of new release movies every two minutes.

After the activation, do not forget to activate both job scheduling in the SAP HANA XS Administration Tool as follows. If you are not familiar with that, please have a look at Scheduling XS Jobs – SAP HANA Developer Guide – SAP Library

6.PNG

7.PNG

Next steps

So far we’ve finished the data preparation which means we are now searching new release movies and related tweets and inserting into SAP HANA in real-time! You may have crawled the similar results as follows. With the metadata of new release movies and huge amounts of tweets, we can do the sentiment analysis in the next blog.

Movies

8.PNG

Tweets

9.PNG

Hope you enjoyed reading my blog. 🙂

To report this post you need to login first.

8 Comments

You must be Logged on to comment or reply to a post.

  1. Christoforos Verras

    Hello Wenjun Zhou !

    I am trying to follow up your interesting series , but when i create the User.hdbrole and the Admin.hdbrole i get the following errors:

    Description Resource Path Location Type
    [9000002] Syntax error: role name “movieRating.roles::Admin” must be identical to the file name “movieRating::Admin” (without extension) Admin.hdbrole /movieRating Row 1, Column 6 SAP HANA Activation Problem
    Description Resource Path Location Type
    [9000002] Syntax error: role name “movieRating.roles::User” must be identical to the file name “movieRating::User” (without extension) User.hdbrole /movieRating Row 1, Column 6

    SAP HANA Activation Problem

    Is there something that i do wrong ?

    PS: My project name is movieRating

    Thank you in advance,

    Christoforos Verras

    (0) 
      1. Wenjun Zhou Post author

        Sorry for the delay reply. Just saw your problem. Yes, it should be a path issue. From the error message, the “roles” folder was missing. 🙂

        Best regards,

        Wenjun

        (0) 
        1. Christoforos Verras

          Hi Wenjun,

          Thank you for you reply, considering that in China it must be night time by now. 🙂

          Yes , my bad. I am so amateur in HANA and it’s components.

          We are trying to follow up your series but things got really complicated when you referred to:

          Outbound httpS with HANA XS (part 1) – set up your HANA box to use SSL/TLS

          Which is a very large project , and also we are not allowed to download the Cryptographic Library in the first place.

          We just want to use the search API of Twitter , instead of the streaming which we have already accomplished to develop a minor project.

          Do you think that there is an easier way to use the search API ?

          Thank you,

          Christoforos Verras

          (0) 
            1. Christoforos Verras

              Hi Wenjun,

              Is SAP HANA One something different from we now know ?

              If we make an AWS account , then we will create a SAP HANA One instance ?

              Now , we have a Monsoon Instance and a Hana Cloud Platform Trial instance.

              Thank you SO much !

              Christoforos

              (0) 
              1. Wenjun Zhou Post author

                Hi,

                The biggest difference between SAP HANA One and SAP HANA developer edition is that SAP HANA One is for production/commercial use, but SAP HANA developer edition is not. When you use SAP HANA developer edition, you just pay the cost of instance, e.g., AWS EC2 instance. However, when you run SAP HANA One, you pay not only the cost of instance, but also the SAP HANA license hourly.

                If you are interested in the SAP HANA developer editon, you may have a look at Step by Step Tutorials for SAP HANA Developer Edition

                For SAP HANA One, unfortunately only Chinese blog posts are available now.

                http://scn.sap.com/community/chinese/hana/blog/2014/04/16/sap-hana-cloud%E5%85%A5%E9%97%A8%E6%95%99%E7%A8%8B%E7%B3%BB%E5&hellip;

                http://scn.sap.com/community/chinese/hana/blog/2014/04/21/sap-hana-cloud%E5%85%A5%E9%97%A8%E6%95%99%E7%A8%8B%E7%B3%BB%E5&hellip;

                Best regards,

                Wenjun

                (0) 

Leave a Reply