
Streaming Real-time Data to HADOOP and HANA

For those interested in Hadoop and HANA, I’ve recently created a prototype that attempts to leverage the key strengths of these companion solutions to deal with Big Data challenges.

The term ‘Big Data’ is commonly used these days, and is perhaps best illustrated by the high volumes of data generated every minute by social media, web logs and remote sensing/POS equipment. Depending on the source, it probably doesn’t make sense to stream all of this data into HANA; instead it may be better to store only the subset most relevant for analytic reporting in HANA. [As an example, Twitter alone may generate hundreds of thousands of tweets a minute: high volume, low value.]

The following Diagram illustrates an example of how ‘Big Data’ might flow to HANA via HADOOP:

[Diagram: HADOOP to HANA.jpg]

The key point is that I use Flume to establish a connection to Twitter (via the Twitter4J API) and then store the details of each tweet in HBase, while simultaneously sending a subset of the fields to HANA via server-side JavaScript.
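As a rough illustration of that split, the sketch below takes one tweet and produces two outputs: the full record destined for HBase and a reduced payload destined for HANA. It is written in Python purely for brevity (the prototype itself runs inside Flume, which is Java), and the field names, the HANA_FIELDS tuple and the split_tweet helper are assumptions made for this example, not names taken from the prototype.

```python
import json

# Hypothetical subset of fields worth keeping in HANA for reporting.
HANA_FIELDS = ("id", "created_at", "lang", "retweet_count")


def split_tweet(raw_json):
    """Return (hbase_row, hana_payload) for one tweet given as a JSON string."""
    tweet = json.loads(raw_json)

    # HBase keeps every field; the tweet id serves as the row key.
    hbase_row = {"row_key": str(tweet["id"]), "columns": tweet}

    # HANA receives only the selected fields that are present.
    hana_payload = {k: tweet[k] for k in HANA_FIELDS if k in tweet}
    return hbase_row, hana_payload


# Made-up sample tweet, for illustration only.
sample = json.dumps({
    "id": 1,
    "created_at": "2013-09-01T10:00:00Z",
    "lang": "en",
    "retweet_count": 3,
    "text": "Streaming tweets to Hadoop and HANA",
    "user": {"screen_name": "example_user", "followers_count": 42},
})

row, payload = split_tweet(sample)
print(sorted(payload))
```

The bulky parts of the tweet (text, user details) stay in HBase only, which is the point of the design: HANA holds just what the reports need.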

This slide, including a definitions page, can be found here: https://docs.google.com/file/d/0Bxydpie8Km_fVXhHWkFENl9iWms/edit?usp=sharing

The following YouTube video briefly demonstrates my prototype:

To build this Prototype in your own environment:

1. Set up Hadoop and follow Cloudera’s Twitter example, which sets up Flume and the Twitter4J API to write tweets to HDFS:

http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

https://github.com/cloudera/cdh-twitter-example

http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/

See also Dan Sandler’s post on the Cloudera Twitter stream (www.datadansandler.com):

http://www.datadansandler.com/2013/03/making-clouderas-twitter-stream-real.html

Dan has also created videos walking through the entire process:

http://www.youtube.com/watch?v=2xO_8P09M38&list=PLPrplWpTfYTPU2topP8hJwpekrFj4wF8G

2. Set up Flume to write to HBase, Impala & HANA

            https://github.com/AronMacDonald/Twitter_Hbase_Impala

            Note: Inspired by Dan Sandler’s Apache Web Log Flume Hbase example

                      https://github.com/DataDanSandler/log_analysis

3. Set up the HANA server-side script for inserting tweets

            https://github.com/AronMacDonald/HANA_XS_Twitter

          Note: Inspired by Thomas Jung’s HANA XS videos & Henrique Pinto’s blog

                    http://scn.sap.com/docs/DOC-33902

           Note2: In SPS06 there is also the option to use OData create/update services, which may remove the need for server-side JS.
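To give a feel for the OData option, here is a minimal sketch of the HTTP request a client might build to create one record through such a service. It is in Python for brevity and only constructs the request without sending it. The URL, the entity set name and the field names are hypothetical, and authentication and any CSRF-token handling the server may require are left out.

```python
import json
import urllib.request

# Hypothetical OData entity-set URL; the real path depends on your own service definition.
ODATA_URL = "http://hana.example.com:8000/twitter/services/tweets.xsodata/Tweets"


def build_odata_insert(record, url=ODATA_URL):
    """Build (but do not send) an HTTP POST that would create one entity."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )


# Hypothetical column names, for illustration only.
request = build_odata_insert({"ID": "1", "LANG": "en", "RETWEET_COUNT": 3})
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing the request to urllib.request.urlopen once the host, credentials and service definition are real.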

FYI: My tiny Hadoop cluster on the AWS cloud costs approximately $175 per month to operate ($70 per month if you sign up for a 3-year deal with AWS). Building your own cluster will be cheaper, but is less flexible than cloud computing.
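To put those two monthly figures side by side over the full three years, here is a quick calculation. It assumes flat monthly rates and that the $70 figure already includes any upfront payment, which may not match how AWS actually bills the deal.

```python
# Monthly figures quoted above, in US dollars.
ON_DEMAND_PER_MONTH = 175
RESERVED_PER_MONTH = 70
MONTHS = 36  # length of the 3-year deal


def total_cost(per_month, months=MONTHS):
    """Total spend over the period, assuming a flat monthly rate."""
    return per_month * months


on_demand_total = total_cost(ON_DEMAND_PER_MONTH)
reserved_total = total_cost(RESERVED_PER_MONTH)
saving = on_demand_total - reserved_total

print(on_demand_total, reserved_total, saving)  # 6300 2520 3780
```

Under those assumptions the 3-year deal saves roughly $3,780, provided the cluster really does run for the whole period.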

Wenjun Zhou has written an excellent example of using social media data and HANA XS to provide interesting ‘voice of customer’ analysis with a very pretty UI.

http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one

He also makes use of the Twitter4J API. The subtle difference is that in his example tweets are queried from Twitter upon request and a subset of the results is saved to HANA, rather than data being streamed to HANA based on predefined keywords.

There are pros and cons to both methods.
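The streaming side of that comparison can be sketched as a filter that passes along only the tweets matching predefined keywords. This is an in-memory illustration in Python; a real streaming setup would normally hand the keywords to the Twitter API rather than filter locally. The keyword list and the sample tweets are made up for the example.

```python
# Hypothetical keywords; the prototype's actual keyword list is not shown here.
KEYWORDS = ("hana", "hadoop")


def matches_keywords(text, keywords=KEYWORDS):
    """True if the text mentions any keyword, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in keywords)


def stream_filter(tweets, keywords=KEYWORDS):
    """Lazily yield only the tweets whose text matches a keyword."""
    for tweet in tweets:
        if matches_keywords(tweet["text"], keywords):
            yield tweet


# Made-up incoming tweets, for illustration only.
incoming = [
    {"id": 1, "text": "Loving SAP HANA performance"},
    {"id": 2, "text": "Lunch time"},
    {"id": 3, "text": "Hadoop cluster is up"},
]

kept = [t["id"] for t in stream_filter(incoming)]
print(kept)  # [1, 3]
```

With streaming, the keywords have to be chosen up front and everything that matches arrives continuously; with the on-request approach nothing is stored until someone asks.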

While I use Twitter in this example, a similar approach for streaming data to HANA could be used for other sources of ‘Big Data’, such as remote sensing equipment.

It’s also worth noting that, as an alternative to streaming data to HANA with Flume, Hadoop has other tools such as Oozie & Sqoop, which could potentially be used to schedule data loads between HANA and Hadoop and so help keep HANA lean and mean.


Other Thoughts:

In my first blog I provided some benchmarks comparing HANA and Hadoop Impala running on AWS, using extremely small environments:

http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws

My primary conclusion was that while Impala achieves good query speeds as your Hadoop cluster grows, HANA’s in-memory solution still provides the best performance. To get the best return on your HANA investment, you may not want to have it bogged down storing high volumes of low-value data.

That data may still have value, though. It could simply be archived, but with Hadoop (which is open source) you have the opportunity to keep the data ‘live’ in a lower-cost storage solution designed explicitly for storing and analyzing large volumes of data.

Starting from HANA SPS06 you are even able to expose Hadoop tables to HANA as virtual tables (though this is currently limited to the Intel distribution of Hadoop).

I’ve explored this further in the following blog:

http://scn.sap.com/community/developer-center/hana/blog/2013/08/22/smart-data-access-and-hadoop

Virtual tables in HANA will definitely have their use, but to get the true power of HANA, important data will need to be stored in HANA.




29-April-2014: As a follow-up to this blog I’ve posted another blog with a sample XS application that enables the summary information to be read from HANA and the details from HBase:

Tip of the iceberg: Using Hana with Hadoop Hbase



      20 Comments
      Kamal Mehta

      Just excellent.

      Several key learning indicators to focus on for the future.

      Thanks for sharing it.

      Former Member
      Blog Post Author

      Thanks for that.

      If you are interested, Hortonworks have recently created a video showing a Hadoop use case for sentiment analysis, somewhat similar to Wenjun Zhou's excellent HANA example.


      http://www.youtube.com/watch?feature=player_embedded&v=y3nFfsTnY3M

      The video also gives a quick glimpse of what Hue (Hadoop User Experience) looks like.

      Kumar Mayuresh

      Hi Aron

      Thanks for sharing an excellent blog and the steps to build the Prototype environment.

      Regards

      Kumar 🙂

      Former Member
      Blog Post Author

      Hi Kumar,

      Thanks for that. I do hope you also have success, if you give it a try.

      If you get stuck at any point, just let me know and I can try to point you in the right direction.

      Cheers

      Aron

      Kumar Mayuresh

      Hi Aron

      I want to give it a try, but I am not able to because of the cost of a Hadoop cluster on AWS.

      Is there an alternative to an AWS Hadoop cluster that achieves the same, say by using the Hortonworks VM?

      Regards

      Kumar.

      Former Member
      Blog Post Author

      I've not tried it, but you could perhaps try Cloudera's VM.

      http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html

      You may also be able to get it to work with the Hortonworks VM, if you already have that up and running, though the steps to get a Twitter feed running may be slightly different, as I used Cloudera Manager for some of the settings, and that is unique to Cloudera's Hadoop distribution.

      If the VMs cause you any connection issues, then the next cheapest alternative is to install Hadoop on some old PCs, if you have any lying around. But you will need a lot of spare time for that.

      For a bit of fun, some guy has even run Hadoop on Raspberry Pis. It's a bit pointless as a Big Data solution, but good fun, and it shows Hadoop can be run on very low-powered, cheap machines as well.  😎

      http://blog.ittoby.com/2013/08/starting-small-set-up-hadoop-compute.html

      If, though, you follow Jeff's link below, you will notice a 30-day free trial of SAP's ESP.

      That would also make for an interesting new blog.

      Hope that helps

      Aron

      Kumar Mayuresh

      Hi Aron,

      Thanks for your info. I will give it a try and will update 🙂

      Regards

      Kumar.

      Former Member

      Is there any SAP product that can be used in place of Flume?

      Thanks for your input

      Regards

      Divya

      Former Member
      Blog Post Author

      Hi Divya,

      That's probably best answered by someone who works for SAP.

      BO Data Services can schedule data loads into HANA, but I'm not aware whether it has event processing capabilities.

      SAP sells the Sybase Event Stream Processor:

      http://www.sybase.co.uk/products/financialservicessolutions/complex-event-processing

      Unfortunately I have no knowledge of how easy it is to configure to listen for events (e.g. for social media), whether it can write directly to HANA, or how it handles the large data volumes such sources may produce. I would welcome a comment from someone at SAP, though.

      I hope that info helps.

      Regards

      Aron

      Jeff Wootton

      As Aron says, SAP Sybase Event Stream Processor (ESP) can be used to receive, process and capture streaming data in SAP HANA. ESP includes native support for SAP HANA: in addition to capturing streaming data in HANA, ESP can also query HANA, enabling HANA to provide context, reference info - and even analytics - that ESP can use in processing incoming data. ESP is designed to be very scalable, processing hundreds of thousands of messages per second on a single server (we've tested it processing over a million per second on a large server), and to deliver consistently low latency (milliseconds). More info is available in the ESP Developer Center.

      As for integration with streaming sources, ESP includes a number of standard inputs, including web services, message buses, etc., plus an adapter toolkit and APIs.

      Former Member
      Blog Post Author

      Thanks Jeff for the good intro to ESP.

      I've checked out some of the intro videos under the link you gave.  Very interesting.

      Flume certainly doesn't have a graphical user interface like ESP Studio; you still have to write the business logic yourself in Java.

      I especially liked the HANA integration video.

      http://scn.sap.com/docs/DOC-40248

      I've seen HADOOP Hive ODBC drivers that enable SQL inserts, so could HADOOP Hive be configured as an output data service (in ESP), in addition to HANA?


      Jeff Wootton

      Yes, you can configure Hadoop as an output data service for ESP - in fact we plan for this to be a standard ESP configuration option in the future.

      Also, ESP goes beyond providing the visual editor.  The visual editor and the underlying continuous query language of ESP reduce the time and effort to set up the business logic to apply to events when you want to do event processing - things you would have to code yourself in Java to do event processing in Flume.

      Former Member
      Blog Post Author

      Thanks, Jeff.  If I get the free time I'll be sure to try out the ESP 30 day trial as well.

      Vivek Singh Bhoj

      Great blog, Aron

      Thanks for sharing valuable info on building our own environment

      Regards,

      Vivek

      Former Member

      Hi Gurus,

      Please let me know how to install Hive drivers in HANA. Please help us.

      Thanks in advance.

      Regards,

      Teja

      Vivek Singh Bhoj

      Hi Teja,

      Check the below document on How To Install Database Drivers for SAP HANA Smart Data Access:

      https://websmp203.sap-ag.de/~sapidb/012006153200000561122013E/HANA_smart_data_access_drivers.pdf

      Regards,

      Vivek

      Former Member

      Hi Vivek,

      Thank you very much for your quick reply. I have the same document, which explains the setup once the Hive drivers are installed in HANA. I have downloaded the Hive drivers and am looking for how to install them on the HANA box at the O/S level.

      If you have any idea, please explain clearly how to do this.

      Thank you very much.

      Regards,

      Teja

      Kumar Mayuresh

      Hi Aron

      Just a quick question: can you please confirm the type of SAP HANA box used for the HANA and Hadoop prototype?

      Did you use the SAP HANA Developer Edition or SAP HANA One from AWS?

      Regards

      Kumar 🙂

      Former Member
      Blog Post Author

      Hi Kumar,

      I built my above example using HANA Developer Edition on AWS.

      The HANA build portion is very light, though, and assuming the network and port are open, any HANA box could be the target.

      Rather than using server side JS to save the data, I'm in the process of updating it to use ODATA to perform a POST. Same result but more elegant design. 🙂

      Cheers

      Aron

      Former Member

      Great blog Aron