Big Data Analytics: HANA vs HADOOP IMPALA on AWS
Hi All,
For those that are interested, I’ve made an initial attempt at benchmarking HANA and HADOOP Impala against each other.
My PowerPoint slide comparing them is publicly shared on Google docs at:
https://docs.google.com/file/d/0Bxydpie8Km_fWTd3RmJTbjVHd00/edit?usp=sharing
As most of you are aware there is a revolution taking place in Big Data Analytics, with many new solutions appearing on the market, including open source solutions running on HADOOP. For a brief explanation of HADOOP please read http://blogs.sap.com/innovation/big-data/what-is-hadoop-018605
HADOOP is designed to handle very large datasets. Large volumes of data can be processed, but jobs need to be scheduled.
The key benefit of HADOOP is that it is open source and runs on affordable, scalable infrastructure.
Real-time reporting has been a weakness, as reports may take minutes instead of seconds.
Recently Cloudera released Impala, a new open source real-time reporting solution that runs on HADOOP.
It also has the option to use column-store tables (PARQUET) to optimise query run times.
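As a rough sketch of what that looks like in Impala SQL (the table names are just TPC-H examples and syntax details vary between Impala releases):
-- Create a Parquet (column store) copy of an existing text-format table
CREATE TABLE lineitem_parquet LIKE lineitem STORED AS PARQUETFILE;
INSERT INTO lineitem_parquet SELECT * FROM lineitem;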
Cloudera Impala 1.0 GA was released on the 29th April 2013.
With the advent of Cloud computing it’s now easier than ever to test new products
I’ve been using HANA for almost a year now and I love it. To get your own HANA box see Get your own SAP HANA, developer edition on Amazon Web Services
Over the past couple of months I’ve also used AWS to set up a small HADOOP cluster to test out Impala (from the earlier BETA releases).
I’ve tested Impala with 1, 3, 9 & 18 node clusters (each node represents a separate cloud machine). [Companies such as Yahoo, Twitter & Facebook may use many-thousand-node clusters]
By contrast, HANA on AWS runs only on a single machine.
I don’t consider HANA & HADOOP/IMPALA rival products, just different tools for different purposes, though there is an overlap.
I focused on SQL read-times, row limits and costs between the two solutions, both running on cloud machines hosted by Amazon Web Services (AWS).
To benchmark them I used sample SAP SPL Data and TPC-H data both loaded with 60 million records
For details on TPC-H see http://www.tpc.org/tpch/
At this point the analysis only focuses on queries running on a single table. Depending on feedback I may broaden the scope of comparison to include more complex queries with Joins.
If you notice any glaring inaccuracies or omissions then please feel free to let me know. Where possible I’m happy to update my slides accordingly.
All the best
Aron
Thanks for sharing this excellent presentation. Kudos to you for doing such an excellent piece of useful work. I learnt a number of new things from your presentation deck.
Hi Sujoy,
Thanks for the positive feedback. I'm glad you found it useful.
Kind Regards
Aron
SAP BW on HANA already supports SYBASE IQ as NLS. Technically it's just a 2nd DB connection. But because the SAP kernel doesn't support IMPALA, your wish may...
Thanks for the very helpful comment. I've also heard that from BW on HANA 7.4 onwards there are even more exciting enhancements coming with SYBASE IQ NLS integration.
So for the next few years, at least, Sybase IQ (as an SAP product) will probably be the gold standard solution for NLS.
What if, though, over the next 5 years HADOOP continues to develop and mature at its current pace? Organisations running SAP ECC/BW on HANA alongside HADOOP may start questioning whether they still need a third product such as Sybase IQ for NLS.
I certainly wouldn't suggest anyone use HADOOP for NLS today. 😉
Also, IMPALA is just one of many new real-time reporting solutions being developed on HADOOP.
Here is a list of some of its competitors:
- IBM (Big SQL)
- Hortonworks (Stinger)
- MapR (Drill)
- Pivotal (HAWQ)
- Teradata (SQL-H)
I'd love to benchmark them all but there are only so many hours in the day.
If anyone does though please do contribute. 🙂
When the HADOOP footprint is large enough in enterprise IT systems, I think SAP will definitely consider it. But nowadays enterprises still use data warehouses a lot, I think.
Another question is whether SAP is willing to open the door to other vendors. These days software vendors would rather provide a package than give too many choices. 😛
You might be right, only time will tell. I might need to wait till SAP acquires one of the Hadoop vendors. 😉
On a related note, the following slide pack by SAP and Cloudera suggests some work might be taking place behind closed doors to integrate HADOOP and HANA (slide 9), though it might just be marketing spin for now.
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Meeting_the_New_Data_Challenge_with_SAP_and_Cloudera.pdf
Also earlier this year SAP and Intel made this press announcement:
The two companies plan to build and bring to market a breakthrough "big data" solution for enterprise customers centered on the SAP HANA® platform and Intel® Distribution for Apache™ Hadoop® software.
http://www.sap.com/corporate-en/news.epx?PressID=20498
It also mentions:
Upcoming phases of the big data solution from SAP and Intel are planned to enable integrated query processing, optimized data loading, and unified administration.
If anyone has more technical details about either of these strategies or other similar ones they can share, then please contribute.
Yeah, true. SAP has been investigating HANA and Hadoop integration for a long time. I remember reading some PoC reports on this last year; at that time the integration was done with JavaScript. I have been expecting an official, mature integration for a long time, because HANA and Hadoop obviously complement each other, much like current DB vendors all provide in-memory features alongside the traditional disk-based engine in newer releases. The problem for SAP is that it has no hardware business, so it may not have the same incentive as IBM or Oracle. They (and HP) offered Hadoop appliances very early on, but I think the main purpose was to sell hardware. Only when enterprises really need Hadoop plus a traditional DB as one platform to store any kind of data, even data produced by ERP/CRM etc., will SAP make big progress on it, I think.
David, you may find this “How to Use Hadoop with your SAP Software Landscape” paper interesting.
Thanks for the link. If that slide deck from SAP on integration had been publicly available sooner, it might have saved me a lot of effort. 🙂
There's a lot of great content inside, though unsurprisingly it's missing some of the latest real-time reporting capabilities in HADOOP, which are currently under heavy development by the big players.
Don't forget EMC with Greenplum HD!
I think your theory on the direction of EDW is correct. Cold data will become about $/GB alone and you can't beat Hadoop here. It's just not ready yet.
Posts like this, sharing experience, are always nice to read!
I looked at the time degradation of the load into HANA (slide 5) and got curious how you did these loads and whether there is any way to improve them.
In your wishlist you mentioned the ability of "a single query". As you may have seen, there is a new "Smart Data Access" capability in HANA SPS6, which does not allow you to connect to Impala yet, but does allow you to have a virtual table on the Intel distribution of Hadoop/Hive. Please have a look at Note 1868209 - SAP HANA Smart Data Access: Central Note.
Thanks for that OSS note, I’d not realised that the new ‘Smart Data Access’ could work with the Intel Distribution of Hadoop. That's great progress from SAP.
I'm now very keen to try out SPS06, to see if there is a way to ‘tweak’ it to work with other distributions of Hadoop (obviously not supported by SAP). Hopefully other Hadoop distributions will be enabled in subsequent revisions.
Enabling high volume, low value data to sit outside of HANA, and yet still be queried from within HANA is a fantastic idea to leverage the benefits of the respective products.
Perhaps when the dust settles on the battle for the best Hadoop real-time reporting engine (e.g. Impala, Stinger, MapR etc.) then SAP may enable ‘Smart Data Access’ for that as well, as these engines offer 10-100 times better performance than HIVE.
I believe the answer to this is that Smart Data Access uses ODBC (from the IQ federation engine). So it should be possible to create virtual tables against any DB that supports ANSI SQL statements and has a Linux ODBC driver.
Currently it's only supported for Intel/Hadoop/Hive but I'm sure that won't get in the way of your testing!
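Just to give a feel for the shape of it, an SDA definition looks roughly like the following (the adapter name, DSN and four-part remote path here are assumptions that depend entirely on your ODBC setup, so treat it purely as a sketch):
-- Register an ODBC remote source, then expose a remote table inside HANA
CREATE REMOTE SOURCE "HIVE_SRC" ADAPTER "hiveodbc"
    CONFIGURATION 'DSN=hive_dsn'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret';

-- The four-part remote path is an assumption; check the SDA documentation for your source
CREATE VIRTUAL TABLE "MYSCHEMA"."V_LINEITEM"
    AT "HIVE_SRC"."<NULL>"."default"."lineitem";

-- The virtual table can then be queried like any local HANA table
SELECT COUNT(*) FROM "MYSCHEMA"."V_LINEITEM";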
Thanks, that's good to know.
Sorry I forgot to answer your question re: slide 5.
In essence I ran something like the following for both HANA and IMPALA:
<NOTE: the following is just the PSEUDO LOGIC>
FOR COMPANY_CODE = A001 TO A010
  FOR PERIOD = 1 TO 12
    $VAR1 = COMPANY_CODE
    $VAR2 = PERIOD
    INSERT INTO <table>
      ( SELECT '$VAR1' AS "A", '$VAR2' AS "B", "C", "D", etc.
        FROM <tableref> )
  NEXT
NEXT
The above logic performs 120 INSERT statements.
Initially I did this in HANA using a SQL procedure, however, since I was benchmarking HADOOP Impala as well, I found it easier to use PYTHON to run the tests and insert my interim results into HBASE for both. I could then just query my performance logs using HIVE or IMPALA.
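For reference, the HANA SQL procedure version looked roughly like the sketch below (procedure, table and column names here are placeholders rather than my actual code):
CREATE PROCEDURE "LOAD_TEST_DATA"()
LANGUAGE SQLSCRIPT AS
BEGIN
    -- 10 company codes x 12 periods = 120 INSERT ... SELECT statements
    FOR v_cc IN 1..10 DO
        FOR v_per IN 1..12 DO
            INSERT INTO "TARGET_TABLE"
                SELECT 'A0' || LPAD(TO_VARCHAR(:v_cc), 2, '0') AS "A",  -- company code A001..A010
                       :v_per                                  AS "B",  -- period 1..12
                       "C", "D"                                         -- remaining columns copied as-is
                FROM "SOURCE_TABLE";
        END FOR;
    END FOR;
END;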
PYTHON is a very simple language, which I think is ideal for repeatable load/query performance testing in HANA.
The following link has details on how to use PYTHON with HANA:
http://scn.sap.com/community/developer-center/hana/blog/2012/06/08/sap-hana-and-python-yes-sir
Alternatively it might have been fun to use HANA XS to do the same. 😉
With regards to the HANA load performance, I didn't make any attempts to tune it as it still ran within 1 hour, which was fine given my needs.
I'm sure the following would have sped up the loads considerably:
- Partitioning the table (e.g. by 5)
- Using IMPORTS (instead of INSERTS)
- Managing the MERGE DELTA, if the uncompressed memory was consuming too much of the available resources, causing <tableref> to be unloaded (see the sketch below).
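On that last point, here is roughly what I mean by managing the merge myself (the statements are standard HANA SQL, but the table name and file path are just placeholders):
-- Switch off automatic merges while the bulk load runs, then merge once at the end
ALTER TABLE "TARGET_TABLE" DISABLE AUTOMERGE;
IMPORT FROM CSV FILE '/data/load_file.csv' INTO "TARGET_TABLE"
    WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' THREADS 4 BATCH 100000;
MERGE DELTA OF "TARGET_TABLE";
ALTER TABLE "TARGET_TABLE" ENABLE AUTOMERGE;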
It would be interesting to see whether, after trying these tuning options, there is still a noticeable degradation of load times as the HANA table grows. I might try that when I have time.
If you have any other performance tuning tips to speed up HANA inserts/imports feel free to contribute. 🙂
Yes, I was primarily curious how you managed MERGE DELTA during your load. You may have seen as well some experiments John Appleby described in his blog http://scn.sap.com/community/hana-in-memory/blog/2012/03/20/inside-sap-hana--optimising-data-load-performance
Thanks again. That's an excellent blog, which I have seen and taken inspiration from previously. I notice you were also mentioned as assisting with that, so an added thanks. 🙂
John's HANA box has significantly more power than AWS HANA, but upon re-reading his blog there might also be a slight degradation in IMPORT times as the cube row count increased, though it doesn't look like much to stress about.
As a side note, I've also found using the LOAD <table> ALL statement an important step prior to benchmarking query run-times, as my sneaky little HANA box might UNLOAD columns of relevant tables during IMPORT/INSERT in low-memory situations,
e.g. to free up space for the new UNCOMPRESSED data prior to the DELTA MERGE.
On AWS HANA this happens all too frequently, e.g. once you have over 100M records.
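In practice that just means running something like this before each timed query run (schema and table names are placeholders):
-- Pre-load all columns of the table into memory before benchmarking
LOAD "MYSCHEMA"."TARGET_TABLE" ALL;
-- Verify what is actually resident in memory
SELECT TABLE_NAME, LOADED FROM "SYS"."M_CS_TABLES" WHERE TABLE_NAME = 'TARGET_TABLE';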
I've not analysed it on AWS, but on a 4Tb (8*512Gb) HANA box, I've noticed that UNLOADS happen when Free Memory goes below ~20% (on a given Host). There might be a parameter for that but I've not bothered digging for one yet.
I've also observed that queries performing complex Joins on very large tables can temporarily consume a HUGE amount of Memory, causing UNLOADs to happen (even as of Rev 57), which may invalidate your query run-time benchmarking, as the data will have the added overhead of reloading into memory. Obviously if you notice that happening you should seriously revisit your query design, but that's another story 😉
The answer is yes - if you don't setup your partitions right, then load performance will degrade as table size increases. If you partition and load your data by e.g. date, then older partitions will stay static as you load new data, and load performance will remain flat.
It's an important consideration when working with large volumes of data in HANA.
Also note that in multi-node environments you typically partition by both HASH(COLUMN) and RANGE(DATE), where COLUMN is a well distributed column. This ensures good table/host distribution.
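For illustration only, a two-level definition might look something like this (table, columns and ranges are made up):
-- First-level HASH for host distribution, second-level RANGE by date
CREATE COLUMN TABLE "SALES_FACT" (
    "CUSTOMER_ID"  INTEGER,
    "POSTING_DATE" DATE,
    "AMOUNT"       DECIMAL(15,2)
)
PARTITION BY HASH ("CUSTOMER_ID") PARTITIONS 8,
             RANGE ("POSTING_DATE")
             ( PARTITION '2012-01-01' <= VALUES < '2013-01-01',
               PARTITION '2013-01-01' <= VALUES < '2014-01-01',
               PARTITION OTHERS );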
Thanks John for that valuable comment, as well as your earlier blog linked above.
Unsurprisingly tables in HADOOP can also be partitioned in a similar fashion for the same good reasons.
When the data size in HANA necessitates multiple hosts, then RANGE(DATE) may no longer be an option on its own, without an additional RANGE(Object) partition to support it.
For example, on a 4Tb (8 Hosts * 512Gb) HANA Side Car (HANA Live) we are using SLT to replicate ECC tables.
Round robin isn't an option as the tables have keys, so HASH is the logical choice.
Adding RANGE partitions as well makes sense, but every table may be different. Phew, that might take a lot of work to define each ECC table differently!
For reporting performance one recommendation we had from SAP was to keep commonly joined tables on the same host, which doesn't do a lot for load distribution.
Imagine knowing all your query requirements in advance and determining your partition strategy (across hosts) accordingly. I suppose you could put different SAP module tables in different hosts but the lines are often blurred in complex reports.
Using Date for the RANGE partition (where possible) would put current year/period reporting all on a single host, which also wouldn't be ideal.
Just using a HASH partition across hosts is the simplest way to implement partitioning, but it will slow down queries as they need to gather data across the network and aggregate. Hopefully they won't slow down much, because the effort to implement further partitioning might be prohibitive.
It'd be nice to partition SAP ECC tables in HANA by SAP config organisational object ranges (e.g. Company Code, Controlling Areas, Plants, Sales Orgs (or perhaps a new custom derived field in SLT)), but I'm not aware of any easy way of doing that short of adding that partitioning logic to each of the largest ECC tables, which would be very time consuming (see the hypothetical example below). Hopefully I've missed something obvious, or there are enhancements coming to SLT / HANA in subsequent revisions.
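For illustration, re-partitioning just one of the larger replicated tables along those lines might look something like this (the partition counts and company code ranges are purely hypothetical):
-- Hypothetical: HASH on document number for host distribution,
-- plus RANGE on company code (BUKRS) as a second level
ALTER TABLE "ECC"."BSEG"
    PARTITION BY HASH ("BELNR") PARTITIONS 8,
                 RANGE ("BUKRS")
                 ( PARTITION VALUE = '1000',
                   PARTITION VALUE = '2000',
                   PARTITION OTHERS );
And that would then need repeating, with different column choices, for every other large ECC table.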
Have you seen anyone using HASH partition ONLY working well across multiple hosts or do you think it's worth investing the time upfront before we hit a performance wall?
Well ECC has a lot of tables and with ECC on HANA they sorted the partitioning strategy for a single node already in HANA 1.0 SP06. I suspect this is one of the reasons ERP on HANA does not yet support scale-out. Yes, it's an overhead, but most side-car solutions include tens of tables, not the 40k tables in ERP 🙂
To answer the HASH partition question - yes, I have it working great for massive tables spanning many hosts. What we do is to replicate the master data through multiple hosts, which avoids the hash partition intra-host join problem that you describe. Now, you can partition your transactional data however you like, and get all the power of your HANA cluster. This is of course at the expense of some memory wasted - but most master data elements add up to just a few GB per node in total, so this is acceptable in many scenarios.
Unfortunately that feature is not supported yet, hopefully we will see it in SP07.
I hope that the HANA dev team will figure out how to solve this properly. What they need to do is allow the ability to define a common hash join ID between tables, like a foreign key, which then matches the host contents between tables. This would be a generic solution to the config OU solution you describe.
Clearly the dev team has work to do for SP07 and SP08 🙂
Thanks for your great ideas on partitioning.
You are quite right; we have so far only replicated 65 tables (of the ~40K ECC tables), but give it time and that number will grow 😉
I'll let you know if we deviate much from HASH partitioning while we wait for a generic solution to become available in SPSxx. 🙂
I think there are two important things here:
1) I think SAP is focussing its efforts on SoH and not sidecar solutions, so probably the important question is when you will look to migrate from a sidecar, to full SoH. I doubt that SAP will optimize partition strategies for sidecar scenarios.
2) Partition strategies are all about needs. I guess you don't need to optimize it, or you would be doing it already!
I don't know if it's a factor of the scenarios I work on, or of my nature, but I tend to look for every way to optimize HANA. I found for instance, in my latest model, that we could get from 30s response times to 600ms response times with some tuning and rethinking.
That's the difference between 200x faster than their existing EDW and 10,000x faster, and it's the ability to write apps on top of 10,000x faster that is differentiating to them. At 200x faster, it's just a faster data mart.
If you spent time optimizing your solution to be 10-50x faster, would that make any difference?
You are absolutely right to aim for optimal performance.
I suppose it's about striking the balance depending on the varying reporting needs.
In our case we could partition further, e.g. HASH & RANGE across the nodes; however, given our reporting requirements are evolving and changing rapidly, any sophisticated partitioning strategy we adopt may become outdated. We do, though, give special attention to the largest tables 😉
Personally I think it would be nice if SAP had some default partitioning strategies for side-car and SoH, based on their own HANA Live reporting models, which customers could adopt and modify accordingly.
I suppose the key point is that partitioning isn't just for day-one but will need to be monitored and reassessed over time, to meet the changing user requirements and expectations.
Thank you Vitaliy. I found this cool smart data access feature weeks ago as well. SAP HANA is evolving really fast, each SP offers many cool features.
Dear Aron,
Hadoop is write optimized (files upload faster). You can find information on this in slide 4 of the following link: http://www.stanford.edu/class/ee380/Abstracts/111116-slides.pdf
SAP HANA reads data faster with the power of in-memory, column store, parallelization of queries, etc. It can read data faster than Hadoop. Hence reports will naturally be faster in SAP HANA than in Hadoop Impala.
Rgds,
Mat.
Thanks for contributing, that’s a useful link.
You are of course correct that HANA and other in-memory databases are going to be faster than disk-bound solutions. Hadoop is designed for parallel computing, though, and the Hadoop software vendors are also working on ways to make use of the memory (cache) on each of the nodes to speed up reporting.
The big boys like Facebook have HADOOP clusters with 1000’s of nodes
(E.g. 2000 machines, with a combined 64Tb of Memory, 21 Pb storage and 16,000 Cores)
http://hadoopblog.blogspot.nl/2010/05/facebook-has-worlds-largest-hadoop.html
That's a lot of Memory that could be used to speed up reporting, if/when the HADOOP developers work out how to utilize it efficiently across distributed nodes.
NOTE: Impala already has the capability to cache its data in memory, but you have to have executed the query first. It's a step in the right direction.
Interesting article Aron. I think the main thing I see is that Hadoop and HANA coexist rather nicely - the use cases are different, and often symbiotic when used together.
I have a few ways in which you might improve HANA performance.
- Your load performance looks bad. I typically load 300-500MB/sec of raw data into a HANA node. Probably your performance is bad because of the size of your box - you need 10GB of RAM just to run HANA, so you only have 3.5GB of space left for data and 3.5GB for reporting with a 17GB system.
- If you're getting load degradation it's because your partitions are getting big. Each time you do a MERGE DELTA (or merge dog does it for you), it will create a copy of any changed partition to merge. If you're doing big loads with Data Services, then it will take care of this for you. If you do them yourself with LOAD statements then you need to control this yourself.
- In my tests, DOUBLE performs better than DECIMAL with HANA.
- SELECT * always performs badly with HANA because it scans every column. Consider rewriting your queries to use just the columns you need - all your slower HANA queries include a * (see the sketch after this list).
- Consider using a partition strategy that uses a commonly used field. HANA will like you for this because it will reduce the data scanned.
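As a TPC-H-style illustration of that rewrite (the query and column names are just an example):
-- Name only the columns the report needs instead of SELECT *
SELECT "L_RETURNFLAG",
       "L_LINESTATUS",
       SUM("L_QUANTITY")      AS SUM_QTY,
       SUM("L_EXTENDEDPRICE") AS SUM_BASE_PRICE
FROM   "LINEITEM"
WHERE  "L_SHIPDATE" <= '1998-09-01'
GROUP BY "L_RETURNFLAG", "L_LINESTATUS";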
And some other points:
- The largest certified HANA system is IBM's 56x1TB = 56TB system according to PAM. My understanding is that all vendors will consider certifying larger systems if you can afford them.
- HANA only runs on SUSE Linux for SAP Applications.
- Integration with Hive for query federation is in HANA 1.0 SP06, which has been released. The AWS version is imminent, I hear. NLS is already possible using Data Services.
- You can put HANA data on disk - it loads only the tables and columns it needs, and you can force unload. Save Points are always saved to disk to allow restarts.
- You didn't mention the HANA XS development platform, which I think is a big differentiator.
I think an interesting test would be to use a proper HANA cluster for this. All the HANA projects I work on use billions of rows and we get billions of aggregations per second. That's when HANA really shines and Hadoop will time out every equivalent query, even with a large cluster.
Have you considered a $/performance slide? In one example, HANA is 4x more expensive than Hadoop and 21x faster, which makes HANA roughly 5x better on cost/performance. For most of my customers, that's the important question.
John
Hi John,
Thanks for some excellent comments and suggestions.
Though I am comparing HANA & HADOOP Impala, I firmly believe HANA and HADOOP are currently best suited to different use cases that can co-exist, as you point out.
I would have loved to have performed a similar exercise on a decent size HANA and HADOOP Cluster, with billions of records, but my budget is limited 😉
If anyone wants to loan their environment out for a few days though I'll give it a crack 🙂
It's very much a work in progress, but below is a link to a collaborative spreadsheet built with other members of the Impala user forum [special thanks to Jung-Yup, Henrick and OnMarc], aimed at comparing performance and cost for small HADOOP clusters and HANA on AWS (Row 21 on the summary sheet):
https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEVMeTQwZGJSOVQwcFRSRFFFUmcxWWc#gid=6
If you look on the summary sheet you can see that OnMarc have run a subset of TPC-H on Impala with 1.2 Billion records, and run times were pretty good for hardware costing only $475 p/month.
It still falls short of HADOOP's best use case of reporting on tens or hundreds of billions of records, but it gives a taste of what might be possible.
Using HADOOP on AWS has a relatively bad cost/performance factor, but I think it's ideal for a small PoC. I imagine though that large organisations would be able to negotiate better hourly rates on AWS (or its rivals) to improve its cost/performance factor.
In addition, AWS recommends using much larger instance types than I used for productive HADOOP clusters. They claim the increased cost per virtual machine is more than offset by the significant performance improvement, which makes sense if it's properly utilised.
But as you suggested, having performance/cost stats on large HANA and HADOOP systems would be very useful, though arguments will always rage as people debate whether you can truly compare them using TPC-H or TPC-DI or ......
I believe it's better to have something to compare solutions on rather than just relying on all the Marketing.
Thanks again
Aron
We can do some tests some time when I have a box free.
In my current solution we can do select sum(amount) from transactions on 3bn rows in 600ms. If we add restrictions and group by, the numbers decrease because HANA has to do less aggregation and more scanning. If we add complex joins then we can get up to 3-5s response times (e.g. on 60m customers grouped into 10 bins).
I'd expect this to translate pretty equivalently into performance on TPC-H, and HANA seems to always scale linearly.
By the way we shouldn't need those count(*)s and they are very expensive in HANA. Should be possible to do count(l_extendedprice) and get the same result?
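For what it's worth: COUNT(*) counts rows while COUNT(column) counts non-NULL values, so they return the same result whenever the column has no NULLs, which is the case for L_EXTENDEDPRICE in TPC-H. e.g.:
-- Counts rows via a single NOT NULL column, as suggested above
SELECT COUNT("L_EXTENDEDPRICE") AS ROW_COUNT FROM "LINEITEM";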
Benchmarks are benchmarks. I'm in two minds as to the usefulness of them, but at their best they allow people to understand how solutions work.
Cool. If we were to do that on a bigger HANA box I certainly agree with all your partitioning and query enhancements. Tune it to the max 🙂
TPC-H also has complex SQL queries with joins that would also be interesting to include. We could then also benchmark different partition strategies and the implications on reporting.
My other blog has the steps to create and import the TPC-H dataset into a HANA environment:
http://scn.sap.com/community/developer-center/hana/blog/2013/06/14/bread-butter-8-tables-with-all-you-can-eat-data
The question then is how many billions of records would your spare box handle? 😉
If you are still keen, then ping me a private email with the Memory/Host specs of your HANA box. I'd then like to try and find another volunteer in the Hadoop Impala community, with a suitably sized HADOOP cluster, who might also want to participate.
Alternatively, if we could gather donations of $3000 p/day we could compare up to a 50 Node HADOOP Cluster on AWS, using their most powerful instance type (cc2.8xlarge), which would give us 1600 hyperthreaded virtual CPUs, 3Tb of Memory & 164Tb of storage to play with. All going well, with no distractions, the comparison should only take a couple of days. But in reality, for this limited testing, it would only cost $100 p/day for 2 instances during the preparation stage; you then launch the other 48 instances when everything is in place to execute the finalised test scripts. The beauty of AWS is you only pay hourly for what you use.
Cost aside, my preference would be to find another willing participant in the HADOOP world; they might then bring some further tuning tips.
At worst this bench-marking exercise, on larger environments, might provide some interesting lessons learnt to share.
There's nothing like a race to encourage performance tuning competition.
Surprised to see HANA do so well despite being represented by the smallest instance (17GB / m2.xlarge), and I don't think there is any other limit to its offering but budget constraints.
Yes, I'd agree that performance even on the smallest of HANA boxes was still very impressive.
One small point though: HANA is still limited even if you have the deepest pockets.
As John commented above: "The largest certified HANA system is IBM's 56x1TB = 56TB system according to PAM."
HADOOP by comparison handles Petabytes of data and beyond. Theoretically you can just keep throwing money at it, to get bigger and bigger, faster and faster.
Although don't expect SQL queries (using HIVE, IMPALA, STINGER etc.) to come back in under 10 seconds if they need to aggregate trillions of rows of data. 😉
They are just different tools for different jobs.
Aron,
The way I look at it is that there's no limit to what IBM will certify for a large enough budget, and HADOOP is not really certifiable as it is open source, so I agree they are different tools for different jobs and different budgets (G$).
Just stumbled upon these recent slides from Cloudera.
http://www.slideshare.net/cloudera/presentations-25757981
Looks like they have tested Impala out on TPC-H on a 10 Node cluster, using a 1Tb dataset. That represents 6 Billion records in the line item table.
Their test queries, using Parquet tables (Column store), appear to run in under 30 seconds (see slide 12).
I can't tell from the slide exactly which queries they ran on the column store table, except perhaps the TPC-H Q1 query, the sample 'Pricing Summary' report, which (unlike most TPC-H queries) runs on the LINEITEM table alone without joins.
Uncompressed, the TPC-H lineitem csv file they loaded is ~700Gb, which is just a little bit more than what HANA on AWS can cope with, even if we managed 5 times compression. 😛
I presume a 1TB Hana box should handle the load/merge of this number of records. Similar queries should then run in less than a few seconds. Any volunteers to try?? 😉