Streaming Real-time Data to HADOOP and HANA

Former Member · ‎08-07-2013

For those interested in Hadoop and HANA, I've recently created a prototype that attempts to leverage some of the key strengths of these companion solutions to deal with Big Data challenges.

The term ‘Big Data’ is commonly used these days, but is perhaps best reflected by the high volumes of data generated every minute by social media, web logs and remote sensing/POS equipment. Depending on the source, it probably doesn't make sense to stream all of this data into HANA; instead, it may be better to store in HANA only the subset most relevant for analytic reporting. [As an example, Twitter alone may generate 100,000s of tweets a minute - High Volume, Low Value]

The following Diagram illustrates an example of how 'Big Data' might flow to HANA via HADOOP:

The key point is that I use Hadoop Flume to establish a connection to Twitter (via the Twitter4J API) and then store the details of each tweet in HBase, while simultaneously sending a subset of the fields to HANA via Server Side JavaScript.

This slide, including a definitions page, can be found here: https://docs.google.com/file/d/0Bxydpie8Km_fVXhHWkFENl9iWms/edit?usp=sharing
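To make the flow described above a little more concrete, here is a minimal Python sketch of the split: every tweet is flattened into a full row (the kind of detail that would go to HBase), and a smaller reporting subset is derived from it (the kind of row that would go to HANA). This is not the prototype's code, which runs inside Flume; the function names and the choice of fields in HANA_FIELDS are assumptions made purely for illustration.

```python
from typing import Any, Dict, Tuple

# Hypothetical subset of fields kept for reporting in HANA.
# The actual prototype may select different fields.
HANA_FIELDS = ("id", "created_at", "text", "user_screen_name")


def flatten_tweet(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few top-level and nested values out of a raw tweet dict."""
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "lang": tweet.get("lang"),
        "user_screen_name": user.get("screen_name"),
        "user_followers": user.get("followers_count"),
    }


def split_for_sinks(tweet: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (full_row, subset_row): full detail vs. reporting subset."""
    full = flatten_tweet(tweet)
    subset = {field: full[field] for field in HANA_FIELDS}
    return full, subset
```

Calling split_for_sinks(tweet) on each incoming tweet yields one wide row and one narrow row, so the high-volume detail and the reporting subset can be written to different stores.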

The following YouTube video briefly demonstrates my prototype:

To build this Prototype in your own environment:

1. Set up Hadoop and follow Cloudera’s Twitter example, which sets up Flume and the Twitter4J API to write tweets to HDFS:

http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager...

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

https://github.com/cloudera/cdh-twitter-example

http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/

Dan Sandler (www.datadansandler.com):

http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html

Dan has also created videos walking through the entire process:

http://www.youtube.com/watch?v=2xO_8P09M38&list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G

2. Set up Flume to write to HBase, Impala & HANA

https://github.com/AronMacDonald/Twitter_Hbase_Impala

Note: Inspired by Dan Sandler’s Apache Web Log Flume Hbase example

https://github.com/DataDanSandler/log_analysis

3. Set up a HANA Server Side script for inserting tweets

https://github.com/AronMacDonald/HANA_XS_Twitter

Note: Inspired by Thomas Jung’s Hana XS videos & Henrique Pinto’s blog

http://scn.sap.com/docs/DOC-33902

Note2: In SPS06 there is also the option to use OData create/update services, which may remove the need for Server Side JS.
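As a rough illustration of what a client of such a server-side script might send, the Python sketch below builds (but does not send) an HTTP POST carrying one tweet row as JSON. The host, port and .xsjs path in XS_URL are placeholders, not the endpoint from the repository above, and authentication is left out entirely; a real XS service would normally require it.

```python
import json
import urllib.request

# Hypothetical endpoint: host, port and script path are placeholders.
XS_URL = "http://hana.example.com:8000/twitter/insert_tweet.xsjs"


def build_insert_request(row: dict, url: str = XS_URL) -> urllib.request.Request:
    """Build a JSON POST request for one tweet row without sending it."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example row (made-up values) and the request that would carry it.
request = build_insert_request({"id": 1, "text": "hello HANA"})
# Sending is deliberately left out; it would look like:
# urllib.request.urlopen(request, timeout=5)
```

Keeping the request construction separate from the network call makes it easy to inspect or log the payload before anything is posted to HANA.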

FYI: My tiny Hadoop cluster on the AWS cloud costs approx. $175/month to operate ($70/month if you sign up to a 3-year deal with AWS). Building your own cluster will be cheaper, but is less flexible than cloud computing.
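For a feel of what those two monthly figures mean over the length of the commitment, here is a small worked calculation in Python. It assumes both quoted amounts are all-in monthly averages (so any upfront reservation fee is already spread into the $70) and that usage stays constant for 36 months; neither assumption is confirmed above.

```python
ON_DEMAND_PER_MONTH = 175  # USD, approximate on-demand figure quoted above
RESERVED_PER_MONTH = 70    # USD, figure quoted for a 3-year commitment
MONTHS = 36                # length of the commitment in months


def total_cost(per_month: int, months: int = MONTHS) -> int:
    """Total spend over the period at a flat monthly rate."""
    return per_month * months


on_demand_total = total_cost(ON_DEMAND_PER_MONTH)  # 175 * 36 = 6300
reserved_total = total_cost(RESERVED_PER_MONTH)    # 70 * 36 = 2520
saving = on_demand_total - reserved_total          # 3780
saving_share = saving / on_demand_total            # 0.6, i.e. 60%

print(on_demand_total, reserved_total, saving, saving_share)
```

Under those assumptions the 3-year deal comes to roughly $2,520 against $6,300 on demand, a difference of about 60%.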

Wejun Zhou has written an excellent example of using social media data and HANA XS to provide interesting 'voice of customer' analysis with a very pretty UI.

http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-mov...

He also makes use of the Twitter4J API. The subtle difference is that in his example tweets are queried from Twitter upon request and a subset of the results is saved to HANA, rather than data being streamed to HANA based on predefined key words.

There are pros and cons to both methods.
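The difference between the two methods can be sketched in a few lines of Python. This is only an in-memory stand-in: no Twitter API is called, and the key words, function names and limit parameter are all hypothetical. The streaming style fixes its key words up front and keeps every matching tweet as it arrives; the on-request style takes a search term at call time and returns a bounded set of results.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence

# Hypothetical predefined key words for the streaming style.
KEYWORDS = ("hana", "hadoop")


def matches(text: str, keywords: Sequence[str]) -> bool:
    """Case-insensitive check for any key word inside the text."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def stream_filter(tweets: Iterable[dict],
                  keywords: Sequence[str] = KEYWORDS) -> Iterator[dict]:
    """Streaming style: keep each arriving tweet that matches the key words."""
    for tweet in tweets:
        if matches(tweet.get("text") or "", keywords):
            yield tweet


def query_on_request(tweets: Iterable[dict], term: str,
                     limit: int = 10) -> List[dict]:
    """On-request style: search for a term only when asked, up to a limit."""
    return list(islice(stream_filter(tweets, (term,)), limit))
```

Streaming captures everything that matches, including tweets nobody has asked about yet, at the cost of storing more; querying on request stores less but only sees what the source returns at the moment of the request.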

While I use Twitter in this example, a similar approach for streaming data to HANA could be used for other sources of ‘Big Data’, such as remote sensing equipment.

It’s also worth noting that rather than streaming data to HANA using Flume, Hadoop has other tools such as Oozie & Sqoop which could potentially be used to schedule data loads between Hana and Hadoop, to help keep Hana lean and mean.

Other Thoughts:

In my first blog I provided some benchmarks comparing HANA and Hadoop Impala running on AWS, in extremely small environments:

http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop...

My primary conclusion was that while Impala achieves good query speeds as your Hadoop cluster grows, HANA’s in-memory solution still provides the best performance. To get the best return on your HANA investment, you may not want to have it bogged down storing high volumes of low value data.

That data may still have value, though. It could simply be archived, but with Hadoop (which is open source) you have the opportunity to keep the data ‘live’ in a lower cost storage solution, designed explicitly for storing and analyzing large volumes of data.

Starting from HANA SPS06 you are even able to expose Hadoop tables (though currently limited to the Intel distribution of Hadoop) to HANA as virtual tables.

I've explored this further in the following blog:

http://scn.sap.com/community/developer-center/hana/blog/2013/08/22/smart-data-access-and-hadoop

Virtual tables in HANA will definitely have their use, but to get the true power of HANA, important data will need to be stored in HANA.

29-April-2014: As a follow up to this blog I've posted another blog with a sample XS application that enables summary information to be read from HANA and details from HBase:

Tip of the iceberg: Using Hana with Hadoop Hbase
