Text Analysis in SAP HANA integrate with Twitter API
Note: This blog is not up to date as new functionalities have been introduced in late HANA revisions.
The example I am going to describe was actually created about two years ago when Text Analysis was first time introduced to HANA Platform. However, I think it is still a good example to demonstrate how simple the text analysis feature is and programming with HANA in Java language that I’d like to share to you. I’m from the Startup Focus team, if you are a startup and interested in developing on HANA, visit here for more information.
Prerequisites
Register an Application at Twitter Developers
As we are going to use the Twitter API to extract the data from Twitter, it is required to create an application at Twitter Developer and we will need the authentication information of the application and use them to invoke the APIs later.
In case you haven’t use Twitter before, you need to create your twitter account firstly. You can register an application and create your oAuth Tokens at https://dev.twitter.com/. Logon with your twitter account, click your profile picture and click the “My applications”.
Click the button “Create a new application”.
Follow the form instructions to complete the registration. You need to input the application name, description, your websites and leave the call back URL as blank. Accept the developer rules and click the button “Create your Twitter application”.
After that, you will be able to see the oAuth settings like below, save the values of Consumer Key, Consumer secret, Access token and Access token secret. We need to use them later in the APIs.
Download Twitter API Java library – Twitter4J
Twitter4J is an unofficial open source Java library for the Twitter API. With Twitter4J, you can easily integrate your Java application with the Twitter services. The link to download it is http://twitter4j.org/en/index.html.
Extracting the downloaded zip file, go the sub folder lib and you will see the file twitter4j-core-3.0.3.jar, which is the library we need in the Java project and it must be added as the library or class path in the java runtime.
There are some useful examples and you can simply check them to help yourselves getting familiar with the Twitter APIs.
Prepare the HANA jdbc library
In order to access SAP HANA from java, we will need the jdbc library, which you can find it at C:\Program Files\SAP\hdbclient\ngdbc.jar in windows and /usr/sap/hdbclient/ngdbc.jar in Linux by the default installation.
Exercise
Now it is ready to go, in the end of the blog, we will understand the source code of the project and know how to connect HANA from java, how to use the twitter services in java and the most impressive thing is how simple it is to run the text analysis in HANA, which combines the unstructured data from various sources like twitter, documents with the structured data in RDBMS.
Import the Java Project in Eclipse
To save your time, I will upload the project here later so you can import the existing java project instead of starting from scratch. Do not worry and we will explain all the components of the project in details below. Open your HANA Studio and follow the steps below:
1. In the File menu, choose Import….
2. Select the import source General > Existing Projects into Workspace and choose Next. You should have created the workspace in the XS exercise. Otherwise, you may need to have your workspace created first.
3. Select the root directory where your project files located, selects the project TwitterAnalysis and click Finish to complete the import.
The project structure looks like this:
Understand the Java Project
The following table lists the major files in the project and we will explain them in details later in the exercise.
Directory |
File |
Description |
src |
TwitterConnection.java |
Build the connection to twitter services |
HDBConnection.java |
Build the jdbc connection to HANA |
|
Configurations.java |
The public interface for the network, twitter authentication configurations, override it by your own account or settings |
|
Tweet.java |
The java bean class for the tweet objects |
|
TweetDAO.java |
The data access object |
|
lib |
ngdbc.jar |
SAP HANA jdbc library |
twitter4j-core-3.0.3.jar |
Twitter4j library for twitter services in java |
|
CreateTable.sql |
The SQL statement to create the column table in HANA |
|
CreateFullTextIndex.sql |
The SQL statement to create the fulltext index for text analysis |
|
Readme.txt |
The file describes the steps to execute the project |
Create a column table in HANA
Firstly, we need to create a table in HANA, where we want to store the tweets we fetched from the twitter services.
1. Open HANA Studio, copy the SQL statement from the CreateTable.sql and execute it in the SQL Console. You need to replace the current schema with your own schema.
2. Expand the Catalog folder in HANA studio, you should find the table TWEETS in your schema and the definition of the table is like:
Update the configurations
In the purpose to maintain the configurations easily, we put all the required information in a single interface and it is mandatory for you update it with your own account or settings before you can connect to either HANA or Twitter.
1. Open the file Configurations.java in your project. Basically, there are 4 category of setting you can override:
- Network Proxy Settings: The proxy host and port, set the HAS_PROXY as false if you do not need to
use proxy
- HANA Connection Settings: Replace the HANA URL with your own HANA host and port, user,
password and the schema where you created your table
- Twitter Authentication Settings: Replace with your own authentication information from your twitter
application as described in the prerequisites
- Search Term: We will search the twitter based on the search term “startup” and we want to know what
people were talking around the startups in twitter. You can always replace it with your own term if you are interested in other topics
Test Connection to Twitter
Once have the twitter authentication maintained correctly in the previous step. You can open TwitterConnection.java and run it. You will see the message “Connection to Twitter Successfully!” following with your twitter user id in the console as the screenshot shows below.
Test Connection to SAP HANA
Now let us open the file HDBConnection.java and run it. You will see the message “Connection to HANA Successfully!” in the console as the screenshot shows below. Check the Configurations.java if you encountering any issue.
The data access object TweetDAO is the single point to communicate with HANA from java, take a look how the source code looks like and you will know the SQL statement and how to use the jdbc library.
Invoke Twitter API and save the tweets into HANA
Now it’s time to the do the real stuff. Open the file SearchTweets.java and run it, which will search the tweets based on the search term we specified in the Configurations.java and everything we got will saved to HANA table. You will see the messages in the console indicate the tweets have been inserted to HANA successfully like the screenshot shows:
After that, you can run the data preview in HANA studio and see the contents of the table TWEETS in your schema like this:
Run text analysis in HANA
Now we already have the tweets stored in the HANA table. The next step, we are going to run the text analysis to see what people are talking around the “startup” in twitter.
To run the text analysis, the only thing we need to do is creating a fulltext index for the column of the table we want to analysis and HANA will process the linguistic analysis, entity extraction, stemming for us and save the results in a generated table $TA_YOUR_INDEX_NAME at the same schema. After that, you can build views on top of the table and leverage all existing analysis tools around HANA to do the visualization even the predictive analysis.
1. Copy the SQL statement from the CreateFullTextIndex.sql and execute it in SQL console:
— Replace the Scheme with your own Schema! —
SET SCHEMA “I045664”;
DROP FULLTEXT INDEX “TWEETS_FTI”;
Create FullText Index “TWEETS_FTI” On “TWEETS”(“TEXT”) TEXT ANALYSIS ON CONFIGURATION ‘EXTRACTION_CORE’;
2. Do you believe the text analysis is already done by HANA? Yes, it is. Now you know how simple it is! You will be able to find a generated table $TA_TWEETS_FTI in your schema. The structure of the table looks like this, which is the standardized format for the results of text analysis:
3. And here is the data preview of the $TA table, you will see the Tokens extracted from the tweets and the number of occurrence and entity type of each token.
Thanks for the explanation, Eric, that answers one of my open questions I always wanted to know but never found the time to dive into.
On the Twitter connectivity however, such an Adapter exists already, no need to write one.
Here a video of the Hana Academy on how to use the Adapter to copy the data to Hana and keep pushing Tweets from Twitter in realtime.
SAP HANA Academy - Smart Data Integration/Quality : Twitter Replication Pt 3 of 3 [SPS09] - YouTube
And this video explains how to configure the Adapter. (There is a third video, on how to install all)
SAP HANA Academy - Smart Data Integration/Quality : Twitter Replication Pt 2 of 3 [SPS09] - YouTube
The Smart Data Integration and Smart Data Quality is different, with various adaptors you can combine different kinds of data sources and perform data transformations, the Twitter Library used in this blog is for java programmers who is interested to build something by their owns.
Not getting the point. If I write the SQL command
create remote subscription s_twitter using
(select * from v_tweets)
target table TWEETS;
I get 100% of what you did above. Get all tweets into the Hana table called TWEETS.
Plus the advantage that all is administrated, monitored etc from within Hana.
And if a user wants to build a Java program to get data into Hana, it would probably bmake sense to write an adapter using the Java Adapter SDK for Hana. Just like the twitter adapter was built.
Anyway, just want to show the options.
Yes, there are various options available. It really depends on the architecture design, leverage as many HANA native features as possible or use HANA as a fast DB only.
Nice blog. Need to try.
Hello Eric,
Very nice job !
Do you know whether there is a way - or can you think of, that thiscan work on a more friedly UI ?
Also, sometimes it terminates searches by itself. Do you know why is this happening ?
thanks,
Christoforos Verras
Hello Christoforos,
You can find the $TA table as the result of the text analysis, from where you can declare a web service for the UI layer to consume.
Please also check the limitation of Twitter APIs like the total number of calls allowed to see if that restricts your use case.
Regards,
Eric Du
Good morning Eric,
I will try that today ! Thanks so much for the advice !
Christoforos Verras
Hello Eric,
When I imported the twitter4j-4.0.7 zip file, I couldn’t able to see the src folder and the packages under it.
I have also checked the downloaded ZIP file and see no "src" folder under it.
Can you please advise ?
Thanks.
Krishna.