HANA vs Hadoop – showdown
HANA and Hadoop are very good friends. HANA is a great place to store high-value, often used data, and Hadoop is a great place to persist information for archival and retrieval in new ways – especially information which you don’t want to structure in advance, like web logs or other large information sources. Holding this stuff in an in-memory database has relatively little value.
As of HANA SP06 you can connect HANA to Hadoop and run batch jobs in Hadoop to load more information into HANA, on which you can then perform super-fast aggregations within HANA. This is a very co-operative existence.
However, Hadoop is capable – in theory – of handling analytic queries. Documentation from Hadoop distributions like Hortonworks and Cloudera suggests this isn’t the primary purpose of Hadoop, but it’s clear that Hadoop is headed in this direction. And as it heads there, Hadoop has evolved to contain structured tables using Hive or Impala, and with the ORC and Parquet file formats within the HDFS filesystem, Hadoop also uses columnar storage.
So in some sense Hadoop and HANA are converging, and I was interested to see how Hadoop and HANA compare from an aggregation perspective. With HANA, we get very good parallelization even across a very large system, and near-linear scalability. This translates to between 9 and 30m aggregations/sec/core depending on query complexity. For most of my test examples I expect around 14m, with a moderate amount of grouping – say 1,000 groups. On my 40-core HANA system that works out to roughly 560m aggregations/second (14m × 40 cores).
My research suggests that Cloudera Impala has the best aggregation engine, so I’ve started with that. I’d welcome your feedback.
Setup Environment
I’m using one 32-core AWS EC2 Compute Optimized c3.8xlarge instance with 60GB of RAM. In practice this is about 40% faster core-for-core than my 40-core HANA system. Yes, that’s a nice secret – HANA One uses the same tech, and HANA One is also about 40% faster core-for-core than on-premise HANA systems.
I’ve decked it out with Red Hat Enterprise Linux 6.4 and the default options. A few notes on configuring Cloudera:
– Make sure you set an Elastic IP for your box and bind it to the primary interface
– Ensure that port 8080 is open in your security group
– Disable SELinux by editing /etc/selinux/config and setting SELINUX=disabled
– Make sure you configure a fully qualified hostname in /etc/sysconfig/network and /etc/hosts
– Reboot after the last two steps
– Disable iptables during installation using chkconfig iptables off && /etc/init.d/iptables stop (see the shell sketch below)
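For reference, here’s a minimal sketch of those last few steps as shell commands – the hostname and IP address are placeholders, so substitute your own:

# Disable SELinux (takes effect after the reboot)
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# Configure a fully qualified hostname (example values only)
echo "HOSTNAME=hadoop1.example.com" >> /etc/sysconfig/network
echo "10.0.0.10 hadoop1.example.com hadoop1" >> /etc/hosts
# After rebooting, disable iptables before installing
chkconfig iptables off && /etc/init.d/iptables stop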
Installation is straightforward – just log in as root and run the following:
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin && chmod +x cloudera-manager-installer.bin && ./cloudera-manager-installer.bin
The only things to note during the installation are to use fully qualified hostnames, log in to all hosts as ec2-user, and use your AWS Private Key as the Authentication Method. This works for Cloudera and Hortonworks alike.
Testing using Hive
The first thing I did was to benchmark my test using Hive. My test data is some Financial Services market data and I’m using 28m rows for initial testing. With HANA, we get 100ms response times when aggregating this data, but let’s start small and work up.
I can load data quickly enough – 5-10 seconds. We can’t compare this to HANA (which takes a similar time), because HANA also orders, compresses and dictionary-encodes the data as it loads; Hadoop just dumps it into a filesystem. A simple aggregation against TEXTFILE storage in Hive runs in around a minute – 600x slower than HANA.
That’s roughly what we would expect, because Hive compiles each query into MapReduce jobs over raw text files – it isn’t optimized in any way.
CREATE TABLE trades ( tradetime TIMESTAMP, exch STRING, symb STRING, cond STRING, volume INT, price DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/var/load/trades.csv' INTO TABLE trades;
select symb, sum(price*volume)/sum(volume) from trades group by symb;
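Incidentally, Hive can also store this table columnar using ORC, as mentioned above. I haven’t benchmarked that here, but a minimal sketch would look like this (the trades_orc name is just for illustration):

-- Create a columnar copy of the table using the ORC file format
CREATE TABLE trades_orc STORED AS ORC AS SELECT * FROM trades;
-- Run the same aggregation against the columnar table
select symb, sum(price*volume)/sum(volume) from trades_orc group by symb;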
Moving from Hive to Impala
I struggled a bit here because Cloudera 5.0 Beta is more than a little buggy. Sometimes Impala could see the Hive tables, sometimes not. Sometimes it would throw up random errors. This is definitely not software you could use in production.
I used Parquet with Snappy compression, which should provide a good blend of performance and compression. You can’t load files directly into Impala – instead, you have to load into Hive and then copy into Impala. That’s quite frustrating.
create table trades_parquet like trades;
set PARQUET_COMPRESSION_CODEC=snappy;
insert into trades_parquet select * from trades;
Query: insert into trades_parquet select * from trades
Inserted 28573433 rows in 127.11s
So we are loading at around 220k rows/sec (28,573,433 rows in 127.11s) – on equivalent hardware we could expect nearer 5m from HANA. This appears to be because Impala doesn’t parallelize loading, so we are CPU-bound in a single thread. I’ve read that Impala’s write path hasn’t been optimized yet, so that makes sense.
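Before timing queries, it’s worth gathering statistics so the Impala planner knows row counts and column cardinalities – standard Impala practice, though I won’t claim it changes these particular numbers:

-- Gather table and column statistics for the query planner
COMPUTE STATS trades_parquet;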
select symb, sum(price*volume)/sum(volume) from trades_parquet group by symb;
Now the first time this runs, it takes 40 seconds. The next time it runs, it takes just 7 seconds – still 70x slower than HANA’s 100ms. I see 4 active CPUs, so we have 10x less parallelization than HANA and around 7x less efficiency (7s × 4 cores = 28 core-seconds, versus 0.1s × 40 cores = 4 core-seconds for HANA), which translates to at least 7x less throughput in multi-user scenarios.
Final Words
For me, this confirms what I already suspected – Hadoop is pretty good at consuming data (and I’m sure with more nodes it would be even better) and good at batch jobs to process data. It’s no better than HANA in this respect, but its $/GB is much lower of course, and if your data isn’t that valuable to you and isn’t accessed often, storing it in HANA will be cost-prohibitive.
But when it comes to aggregating, even in the best-case scenario, Hadoop is 7x less efficient on the same hardware. Add HANA’s simplicity of operation and the fact that you store the data only once: if your data is hot, and is accessed and aggregated often in different ways, HANA is king.
And we didn’t even cover in this blog the number of features HANA has, or the maturity of HANA’s SQL and OLAP engines compared to what is in Hadoop – plus the fact that Impala, the fastest Hadoop engine, is supported only by Cloudera and is still very immature.
And since Smart Data Access lets us store our hot data in HANA and our cold data in Hadoop, HANA and Hadoop are very good friends rather than competitors.
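To give a flavour of what that looks like, here is a sketch of exposing the Hive trades table to HANA via Smart Data Access – the remote source name, DSN, credentials and schema are illustrative placeholders, not taken from a real system:

-- Register Hive as a remote source over ODBC (DSN and credentials are placeholders)
CREATE REMOTE SOURCE "HADOOP_SRC" ADAPTER "hiveodbc" CONFIGURATION 'DSN=hive1' WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret';
-- Expose the Hive table as a virtual table in HANA
CREATE VIRTUAL TABLE "MYSCHEMA"."VT_TRADES" AT "HADOOP_SRC"."HIVE"."default"."trades";
-- Cold data in Hadoop can now be queried from HANA like any other table
SELECT symb, SUM(price*volume)/SUM(volume) FROM "MYSCHEMA"."VT_TRADES" GROUP BY symb;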
What do you think?
Hi John,
Another great blog from you
Hadoop is great for storing huge amounts of data and its MapReduce functionality is awesome, while HANA is very fast.
I agree with you that HANA and Hadoop can become good friends – we can surely store cold data in Hadoop and hot data in HANA, and using Hadoop will also save companies money.
SAP and Hortonworks already entered into a partnership last year.
You can learn more about their partnership and check their reference architecture using HANA and Hadoop:
SAP + Hortonworks = Instant Access + Infinite Scale with HANA + Hadoop
Regards,
Vivek
Thanks - right, SAP, Intel and Hortonworks are all in partnership. Do you know of any reference customers?
EMC now uses Hadoop as the persistent storage engine for Greenplum (now known as Pivotal). I hear SAP has also considered using Hadoop as the persistent storage engine for HANA, which is interesting.
This would provide a capability like IBM GPFS, and would do away with the SAN in HANA scale-out clusters, but without any license implications.
John
Hi John
I am currently working in an Oil and Gas company which has bought the whole SAP vision of BIG DATA: HANA for hot data, Sybase IQ for warm data and HADOOP for cold data, with SRS as the replication agent.
SAP has been advocating SDA with remote caching (for faster prefetches), however I remain skeptical about performance. HADOOP seems infinitely slow, and since it does not have native SQL features, even a simple select * from table gets broken down into a series of MapReduce jobs.
I feel that HADOOP + HANA is best served by use cases where HADOOP is used for volume, batch-lifted via ETL, and HANA for velocity.
Google originally brought out BigTable and MapReduce, but has moved on to Dremel and Percolator, and they seem to address the very issues that I find problematic in HADOOP.
I think your statement is too generic. A well built Hive table with Parquet can perform acceptably. Totally depends on the design, hardware and volumes.
Obviously if your Hadoop performance sucks then SDA will probably suck too.
I do agree that Hadoop will converge towards a more SQL-esque standard – we see this already with Hive, Impala and Parquet – and this will mean that the scenario SAP describes will work better, in more scenarios.
Hi John,
I'm very pleased to see you giving it a try now too. Careful, it's addictive. 🙂
I totally agree with your comment that it 'depends on design, hardware & volumes'.
Just to back that up, a few people in the Cloudera community replicated my TPC-H trial from my blog last year.
https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEVMeTQwZGJSOVQwcFRSRFFFUmcxWWc#gid=6
Hadoop is designed to work best across a cluster of machines, not just on a single node.
It might also have been interesting if you had compared HANA with a cluster of smaller AWS machines rather than one larger node.
You'll see in the spreadsheet linked above that a 3-node HADOOP cluster worked 5 times faster than 1 node; thereafter the performance improvement in HADOOP was more linear as nodes were added.
I agree with your conclusion that they are friends rather than competitors. I look forward to your next HADOOP/HANA blog.
Cheers
Aron
Hi John,
Thank you for putting it together. Surprised to know this.
I agree with you on the HANA + HADOOP friendship. Considering Hadoop can run on cheaper hardware, it will definitely be a cost saver for many companies, but going with HANA + HADOOP altogether will be a big decision for many of them.
My understanding was that HANA was supposed to end batch jobs – this has been discussed widely – yet when Hadoop comes in alongside HANA, it brings batch jobs back...!!
Regards
Kumar.
Kumar,
it seems unfair to say that HANA brings batch jobs back with HADOOP 😉.
HANA eliminates batch jobs in ERP end-to-end business processes, like in MRP and PM scheduling, and for operational reporting with HANA Live, so there is no bringing back...
In the context of HADOOP and the logical EDW, we can use SDA (Smart Data Access) to read data stored in HADOOP without data replication, as John pointed out.
The only time you would load from HADOOP into HANA (batch) would be a scenario where high-value data has been stored in HADOOP as the primary data store (for whatever reason) and this high-value data requires faster and more flexible analysis than HADOOP and the HADOOP tools can provide.
If one designs the logical EDW, this scenario would be the exception rather than the rule, because if you know up-front which data is high-value, you would store it in HANA as the primary storage, and use HADOOP as a kind of NLS – low-cost, downstream storage from a Data Lifecycle perspective. For example, big data that you do not need for real-time operations/analysis but rather for long-term trend analysis.
In a scenario with streaming data, you would analyse the data on the fly (up to 2.5 billion events per day) with ESP (SAP Event Stream Processor, preferably on SAP HANA) as it flows through in real-time.
The streamed data could be persisted in HADOOP as primary storage for reference, or for further analysis once the real-time analysis indicates further use cases. Only then does the batch job kick in – again, as an exception.
Thank you
erich
Yes, true. There are a couple of other scenarios:
- Hadoop stores very large, low-value data which it aggregates into fairly large, high-value data in HANA
- As per the above, new ideas are considered for aggregation, and the very large, low-value data is re-aggregated into HANA in a new way
John
Erich, John – thanks for clearing my head 🙂
Regards
Kumar.
Hi John,
Nice info.
Hi John,
Nice post, congratulations.
Did you test Pivotal HAWQ? I got good results in some tests comparing it with Impala (even using Parquet). HAWQ is based on Greenplum DB.
Best regards
Hi,
Great knowledge dissemination. It would be interesting to see if Hadoop can move away from HDFS as the data store and start processing things in memory... and, to make things even better, leverage existing virtualized memory.
Hadoop is evolving very fast. A standard 10-node cluster can have 1 TB of RAM in total and do the entire processing in memory (distributed) using MapReduce2/YARN and newer frameworks like Spark and Shark, though the most mature is still Impala. Having said that, distributed computing is at the core of the Hadoop architecture, and one server, however powerful it might be, is not the right way to benchmark or compare.
HANA has its own place in the technology stack for SAP and so do Data Warehouses and its related technologies. But the days when Hadoop will be the system that will host data warehouses are not far away. Ralph Kimball, the guru of data warehousing, is very upbeat on using Hadoop for Data Warehouses. Currently, it is used to complement and augment existing DWs but when people realize (and the technology evolves) that they will be spending a fraction to implement a very large DW of 1 PB and above (the often quoted number is $300 to $1000 per TB for the hardware cost), the move might accelerate.
Hi Barath,
I agree with most of your comments.
I don't think there is any move yet towards an ACID-compliant in-memory RDBMS on HADOOP.
But with Silicon Valley's best and brightest working on improving Hadoop, who knows what the future holds.
Cheers
Aron
That is true. Hadoop is following two routes: in-memory Impala and Shark for analytics and machine learning, and HBase, which is currently the closest to providing ACID properties. Having different tools for doing different things might itself be a problem, but the predominant focus has been to make Hadoop an analytics tool. There are architectural reasons in HDFS that prevent ACID properties but, as you say, someone might come out with that.
Hi,
Nice blog, thanks for your valuable information!
Hey John - nice blog, to you and all who added such interesting comments...
I've started a separate discussion on something similar (SAP DMS, OpenText DM & Archiving, Hadoop and HANA), but it seemed like a good link in to this blog also.
I'm wondering whether there would be any theoretical (cost permitting) case for:
1* DMS can work for managing documents in SAP
2* OpenText can extend DMS to store documents and archive them externally
3* Hadoop probably overlaps OpenText usage, but can store large numbers of documents and has many tools for searching them. I wonder if OpenText could even be a feeder for Hadoop, so that DMS/OpenText manage your on-going business with documents and Hadoop can be used to search through the unstructured data in those documents (perhaps also providing cheap 2nd-level archiving on HDFS)
4* HANA as an option (esp if appliances already in the landscape) to run very quick searches on "hot data" swapped in/out of SAP/Hadoop as required
I guess there is also some consideration of whether Sybase IQ would be in the picture here too. Basically, I'm thinking about the theoretical or academic ideal for managing documents for any company with a lot of paper-based and historic records that need to be kept.
It seems there is almost no ideal for document management that I've seen in practice yet, but each of these tools brings something to the game, even if there is a lot of overlap.
Regards all
Maybe Andrew Gunn has a view on this.
Thinking of a battle between Apache Spark and HANA 🙂
Can't imagine that anyone would consider Spark if they are looking at an SAP roadmap, but perhaps in a wider ecosystem where you wanted to run in-memory processes on other data sources, then sure.
It would be good to do a benchmark between those two!