Big Data Analytics: HANA vs HADOOP IMPALA on AWS
Hi All,
For those that are interested, I’ve made an initial attempt at benchmarking HANA and HADOOP Impala against each other.
My PowerPoint slide comparing them is publicly shared on Google docs at:
https://docs.google.com/file/d/0Bxydpie8Km_fWTd3RmJTbjVHd00/edit?usp=sharing
As most of you are aware there is a revolution taking place in Big Data Analytics, with many new solutions appearing on the market, including open source solutions running on HADOOP. For a brief explanation of HADOOP please read http://blogs.sap.com/innovation/big-data/what-is-hadoop-018605
HADOOP is designed to handle very large datasets. Large volumes of data can be processed, but jobs need to be scheduled.
The key benefit of HADOOP is that it is open source and runs on affordable, scalable infrastructure.
Real-time reporting has been a weakness, as reports may take minutes instead of seconds.
Recently Cloudera released Impala, a new open source real-time reporting solution that runs on HADOOP.
It also has the option to use column-store tables (PARQUET) to optimise query run times.
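As a rough sketch of what that looks like in Impala SQL (the table names are just TPC-H examples and syntax details vary between Impala releases):
-- Create a Parquet (column store) copy of an existing text-format table
CREATE TABLE lineitem_parquet LIKE lineitem STORED AS PARQUETFILE;
INSERT INTO lineitem_parquet SELECT * FROM lineitem;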
Cloudera Impala 1.0 GA was released on the 29th April 2013.
With the advent of Cloud computing it’s now easier than ever to test new products
I’ve been using HANA for almost a year now and I love it. To get your own HANA box see Get your own SAP HANA, developer edition on Amazon Web Services
Over the past couple of months I’ve also used AWS to set up a small HADOOP cluster to test out Impala (from the earlier BETA releases).
I’ve tested Impala with 1, 3, 9 & 18 node clusters (each node represents a separate cloud machine). [Companies such as Yahoo, Twitter & Facebook may use many-thousand-node clusters]
By contrast, HANA on AWS runs only on a single machine.
I don’t consider HANA & HADOOP/IMPALA rival products, just different tools for different purposes, though there is an overlap.
I focused on SQL read-times, row limits and costs between the two solutions, both running on cloud machines hosted by Amazon Web Services (AWS).
To benchmark them I used sample SAP SPL Data and TPC-H data both loaded with 60 million records
For details on TPC-H see http://www.tpc.org/tpch/
At this point the analysis only focuses on queries running on a single table. Depending on feedback I may broaden the scope of comparison to include more complex queries with Joins.
If you notice any glaring inaccuracies or omissions then please feel free to let me know. Where possible I’m happy to update my slides accordingly.
All the best
Aron
Thanks for sharing this excellent presentation. Kudos to you for doing such an excellent piece of useful work. I learnt a number of new things from your presentation deck.
Hi Sujoy,
Thanks for the positive feedback. I'm glad you found it useful.
Kind Regards
Aron
SAP BW on HANA already supports SYBASE IQ as NLS. Technically it's just a 2nd DB connection. But because the SAP kernel doesn't support IMPALA, your wish may...
Thanks for the very helpful comment. I've also heard that from BW on HANA 7.4 onwards there are even more exciting enhancements coming with SYBASE IQ NLS integration.
So for the next few years, at least, Sybase IQ (as an SAP product) will probably be the gold standard solution for NLS.
What if, though, over the next 5 years HADOOP continues to develop and mature at its current pace? Organisations running SAP ECC/BW on HANA alongside HADOOP may start questioning whether they still need a third product such as Sybase IQ for NLS.
I certainly wouldn't suggest anyone use HADOOP for NLS today. 😉
Also, IMPALA is just one of many new real-time reporting solutions being developed on HADOOP.
Here is a list of some of its competitors:
- IBM (Big SQL)
- Hortonworks (Stinger)
- MapR (Drill)
- Pivotal (HAWQ)
- Teradata (SQL-H)
I'd love to benchmark them all but there are only so many hours in the day.
If anyone does though please do contribute. 🙂
When the HADOOP footprint is large enough in enterprise IT systems, I think SAP will definitely consider it. But nowadays enterprises still use data warehouses a lot, I think.
Another question is whether SAP is willing to open the door to other vendors. These days software vendors would rather provide a package than give too many choices. 😛
You might be right, only time will tell. I might need to wait till SAP acquires one of the Hadoop vendors. 😉
On a related note, the following slide pack by SAP and Cloudera suggests some work might be taking place behind closed doors to integrate HADOOP and HANA (slide 9), though it might just be marketing spin for now.
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Meeting_the_New_Data_Challenge_with_SAP_and_Cloudera.pdf
Also earlier this year SAP and Intel made this press announcement:
The two companies plan to build and bring to market a breakthrough "big data" solution for enterprise customers centered on the SAP HANA® platform and Intel® Distribution for Apache™ Hadoop® software.
http://www.sap.com/corporate-en/news.epx?PressID=20498
It also mentions:
Upcoming phases of the big data solution from SAP and Intel are planned to enable integrated query processing, optimized data loading, and unified administration.
If anyone has more technical details about either of these strategies or other similar ones they can share, then please contribute.
Yeah, true. SAP has been investigating HANA and Hadoop integration for a long time. I remember reading some PoC reports on this last year; at that time the integration was done with JavaScript. I have been expecting an official, mature integration for a long time, because HANA and Hadoop obviously complement each other, much like current DB vendors all provide in-memory features alongside the traditional disk-based engine in newer releases. The problem for SAP is that it has no hardware business, so it may not have the same incentive as IBM or Oracle. They (and HP) offered Hadoop appliances very early on, but I think the main purpose was to sell hardware. Only when enterprises really need Hadoop plus a traditional DB as one platform to store any kind of data, even data produced by ERP/CRM etc., will SAP make big progress on it, I think.
David, you may find this “How to Use Hadoop with your SAP Software Landscape” paper interesting.
Thanks for the link. If that slide deck from SAP on integration had been publicly available sooner, it might have saved me a lot of effort. 🙂
There's a lot of great content inside, though unsurprisingly it's missing some of the latest real-time reporting capabilities in HADOOP, which are currently under heavy development by the big players.
Don't forget EMC with Greenplum HD!
I think your theory on the direction of EDW is correct. Cold data will become about $/GB alone and you can't beat Hadoop here. It's just not ready yet.
Posts like this, sharing experience, are always nice to read!
I looked at the time degradation of the load into HANA (slide 5) and got curious how you did these loads and whether there is any way to improve them.
In your wishlist you mentioned the ability of "a single query". As you may have seen, there is a new "Smart Data Access" capability in HANA SPS6, which does not allow you to connect to Impala yet, but does allow you to have a virtual table on the Intel distribution of Hadoop/Hive. Please have a look at Note 1868209 - SAP HANA Smart Data Access: Central Note.
Thanks for that OSS note, I’d not realised that the new ‘Smart Data Access’ could work with the Intel Distribution of Hadoop. That's great progress from SAP.
I'm now very keen to try out SPS06, to see if there is a way to ‘tweak’ it to work with other distributions of Hadoop (obviously not supported by SAP). Hopefully other Hadoop distributions will be enabled in subsequent revisions.
Enabling high volume, low value data to sit outside of HANA, and yet still be queried from within HANA is a fantastic idea to leverage the benefits of the respective products.
Perhaps when the dust settles on the battle for the best Hadoop real-time reporting engine (e.g. Impala, Stinger, MapR etc.) then SAP may enable ‘Smart Data Access’ for that as well, as these engines offer 10-100 times better performance than HIVE.
I believe the answer to this is that Smart Data Access uses ODBC (from the IQ federation engine). So it should be possible to create virtual tables against any DB that supports ANSI SQL statements and has a Linux ODBC driver.
Currently it's only supported for Intel/Hadoop/Hive but I'm sure that won't get in the way of your testing!
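Just to give a feel for the shape of it, an SDA definition looks roughly like the following (the adapter name, DSN and four-part remote path here are assumptions that depend entirely on your ODBC setup, so treat it purely as a sketch):
-- Register an ODBC remote source, then expose a remote table inside HANA
CREATE REMOTE SOURCE "HIVE_SRC" ADAPTER "hiveodbc"
    CONFIGURATION 'DSN=hive_dsn'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret';

-- The four-part remote path is an assumption; check the SDA documentation for your source
CREATE VIRTUAL TABLE "MYSCHEMA"."V_LINEITEM"
    AT "HIVE_SRC"."<NULL>"."default"."lineitem";

-- The virtual table can then be queried like any local HANA table
SELECT COUNT(*) FROM "MYSCHEMA"."V_LINEITEM";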
Thanks, that's good to know.
Sorry I forgot to answer your question re: slide 5.
In essence I ran something like the following for both HANA and IMPALA:
<NOTE: the following is just the PSEUDO LOGIC>
FOR COMPANY_CODE = A001 TO A010
  FOR PERIOD = 1 TO 12
    $VAR1 = COMPANY_CODE
    $VAR2 = PERIOD
    INSERT INTO <table>
      ( SELECT '$VAR1' AS "A", '$VAR2' AS "B", "C", "D", etc.
        FROM <tableref> )
  NEXT
NEXT
The above logic performs 120 INSERT statements.
Initially I did this in HANA using a SQL procedure, however, since I was benchmarking HADOOP Impala as well, I found it easier to use PYTHON to run the tests and insert my interim results into HBASE for both. I could then just query my performance logs using HIVE or IMPALA.
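For reference, the HANA SQL procedure version looked roughly like the sketch below (procedure, table and column names here are placeholders rather than my actual code):
CREATE PROCEDURE "LOAD_TEST_DATA"()
LANGUAGE SQLSCRIPT AS
BEGIN
    -- 10 company codes x 12 periods = 120 INSERT ... SELECT statements
    FOR v_cc IN 1..10 DO
        FOR v_per IN 1..12 DO
            INSERT INTO "TARGET_TABLE"
                SELECT 'A0' || LPAD(TO_VARCHAR(:v_cc), 2, '0') AS "A",  -- company code A001..A010
                       :v_per                                  AS "B",  -- period 1..12
                       "C", "D"                                         -- remaining columns copied as-is
                FROM "SOURCE_TABLE";
        END FOR;
    END FOR;
END;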
PYTHON is a very simple language, which I think is ideal for repeatable load/query performance testing in HANA.
The following link has details on how to use PYTHON with HANA:
http://scn.sap.com/community/developer-center/hana/blog/2012/06/08/sap-hana-and-python-yes-sir
Alternatively it might have been fun to use HANA XS to do the same. 😉
With regards to the HANA load performance, I didn't make any attempts to tune it as it still ran within 1 hour, which was fine given my needs.
I'm sure the following would have sped up the loads considerably:
- Partitioning the table (e.g. by 5)
- Using IMPORTS (instead of INSERTS)
- Managing the MERGE DELTA, if the uncompressed memory was consuming too much of the available resources, causing <tableref> to be unloaded (see the sketch below).
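On that last point, here is roughly what I mean by managing the merge myself (the statements are standard HANA SQL, but the table name and file path are just placeholders):
-- Switch off automatic merges while the bulk load runs, then merge once at the end
ALTER TABLE "TARGET_TABLE" DISABLE AUTOMERGE;
IMPORT FROM CSV FILE '/data/load_file.csv' INTO "TARGET_TABLE"
    WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' THREADS 4 BATCH 100000;
MERGE DELTA OF "TARGET_TABLE";
ALTER TABLE "TARGET_TABLE" ENABLE AUTOMERGE;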
It would be interesting to see whether, after trying these tuning options, there is still a noticeable degradation of load times as the HANA table grows. I might try that when I have time.
If you have any other performance tuning tips to speed up HANA inserts/imports feel free to contribute. 🙂
Yes, I was primarily curious how you managed MERGE DELTA during your load. You may have seen as well some experiments John Appleby described in his blog http://scn.sap.com/community/hana-in-memory/blog/2012/03/20/inside-sap-hana--optimising-data-load-performance
Thanks again. That's an excellent blog, which I have seen and taken inspiration from previously. I notice you were also mentioned as assisting with that, so an added thanks. 🙂
John's HANA box has significantly more power than AWS HANA, but upon re-reading his blog there might also be a slight degradation in IMPORT times as the cube row count increased, though it doesn't look like much to stress about.
As a side note, I've also found using the LOAD <table> ALL statement an important step prior to benchmarking query run-times, as my sneaky little HANA box might UNLOAD columns of relevant tables during IMPORT/INSERT in low-memory situations,
e.g. to free up space for the new UNCOMPRESSED data prior to the DELTA MERGE.
On AWS HANA this happens all too frequently, e.g. once you have over 100M records.
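In practice that just means running something like this before each timed query run (schema and table names are placeholders):
-- Pre-load all columns of the table into memory before benchmarking
LOAD "MYSCHEMA"."TARGET_TABLE" ALL;
-- Verify what is actually resident in memory
SELECT TABLE_NAME, LOADED FROM "SYS"."M_CS_TABLES" WHERE TABLE_NAME = 'TARGET_TABLE';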
I've not analysed it on AWS, but on a 4Tb (8*512Gb) HANA box, I've noticed that UNLOADS happen when Free Memory goes below ~20% (on a given Host). There might be a parameter for that but I've not bothered digging for one yet.
I've also observed that queries performing complex Joins on very large tables can temporarily consume a HUGE amount of Memory, causing UNLOADs to happen (even as of Rev 57), which may invalidate your query run-time benchmarking, as the data will have the added overhead of reloading into memory. Obviously if you notice that happening you should seriously revisit your query design, but that's another story 😉
The answer is yes - if you don't setup your partitions right, then load performance will degrade as table size increases. If you partition and load your data by e.g. date, then older partitions will stay static as you load new data, and load performance will remain flat.
It's an important consideration when working with large volumes of data in HANA.
Also note that in multi-node environments you typically partition by both HASH(COLUMN) and RANGE(DATE), where COLUMN is a well distributed column. This ensures good table/host distribution.
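For illustration only, a two-level definition might look something like this (table, columns and ranges are made up):
-- First-level HASH for host distribution, second-level RANGE by date
CREATE COLUMN TABLE "SALES_FACT" (
    "CUSTOMER_ID"  INTEGER,
    "POSTING_DATE" DATE,
    "AMOUNT"       DECIMAL(15,2)
)
PARTITION BY HASH ("CUSTOMER_ID") PARTITIONS 8,
             RANGE ("POSTING_DATE")
             ( PARTITION '2012-01-01' <= VALUES < '2013-01-01',
               PARTITION '2013-01-01' <= VALUES < '2014-01-01',
               PARTITION OTHERS );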
Thanks John for that valuable comment, as well as your earlier blog linked above.
Unsurprisingly tables in HADOOP can also be partitioned in a similar fashion for the same good reasons.
When the data size in HANA necessitates multiple hosts, then RANGE(DATE) may no longer be an option on its own, without an additional RANGE(Object) partition to support it.
For example, on a 4Tb (8 Hosts * 512Gb) HANA Side Car (HANA Live) we are using SLT to replicate ECC tables.
Round robin isn't an option as the tables have keys, so HASH is the logical choice.
Adding RANGE partitions as well makes sense, but every table may be different. Phew, that might take a lot of work to define each ECC table differently!
For reporting performance one recommendation we had from SAP was to keep commonly joined tables on the same host, which doesn't do a lot for load distribution.
Imagine knowing all your query requirements in advance and determining your partition strategy (across hosts) accordingly. I suppose you could put different SAP module tables in different hosts but the lines are often blurred in complex reports.
Using Date for the RANGE partition (where possible) would put current year/period reporting all on a single host, which also wouldn't be ideal.
Just using a HASH partition across hosts is the simplest way to implement partitioning, but it will slow down queries as they need to gather data across the network and aggregate. Hopefully they won't slow down much, because the effort to implement further partitioning might be prohibitive.
It'd be nice to partition SAP ECC tables in HANA by SAP config organisational object ranges (e.g. Company Code, Controlling Areas, Plants, Sales Orgs (or perhaps a new custom derived field in SLT)), but I'm not aware of any easy way of doing that short of adding that partitioning logic to each of the largest ECC tables, which would be very time consuming (see the hypothetical example below). Hopefully I've missed something obvious, or there are enhancements coming to SLT / HANA in subsequent revisions.
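For illustration, re-partitioning just one of the larger replicated tables along those lines might look something like this (the partition counts and company code ranges are purely hypothetical):
-- Hypothetical: HASH on document number for host distribution,
-- plus RANGE on company code (BUKRS) as a second level
ALTER TABLE "ECC"."BSEG"
    PARTITION BY HASH ("BELNR") PARTITIONS 8,
                 RANGE ("BUKRS")
                 ( PARTITION VALUE = '1000',
                   PARTITION VALUE = '2000',
                   PARTITION OTHERS );
And that would then need repeating, with different column choices, for every other large ECC table.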
Have you seen anyone using HASH partition ONLY working well across multiple hosts or do you think it's worth investing the time upfront before we hit a performance wall?
Well ECC has a lot of tables and with ECC on HANA they sorted the partitioning strategy for a single node already in HANA 1.0 SP06. I suspect this is one of the reasons ERP on HANA does not yet support scale-out. Yes, it's an overhead, but most side-car solutions include tens of tables, not the 40k tables in ERP 🙂
To answer the HASH partition question - yes, I have it working great for massive tables spanning many hosts. What we do is to replicate the master data through multiple hosts, which avoids the hash partition intra-host join problem that you describe. Now, you can partition your transactional data however you like, and get all the power of your HANA cluster. This is of course at the expense of some memory wasted - but most master data elements add up to just a few GB per node in total, so this is acceptable in many scenarios.
Unfortunately that feature is not supported yet, hopefully we will see it in SP07.
I hope that the HANA dev team will figure out how to solve this properly. What they need to do is allow the ability to define a common hash join ID between tables, like a foreign key, which then matches the host contents between tables. This would be a generic solution to the config OU solution you describe.
Clearly the dev team has work to do for SP07 and SP08 🙂
Thanks for your great ideas on partitioning.
You are quite right; we have so far only replicated 65 tables (of the ~40K ECC tables), but give it time and that number will grow 😉
I'll let you know if we deviate much from HASH partitioning while we wait for a generic solution to become available in SPSxx. 🙂
I think there are two important things here:
1) I think SAP is focussing its efforts on SoH and not sidecar solutions, so probably the important question is when you will look to migrate from a sidecar, to full SoH. I doubt that SAP will optimize partition strategies for sidecar scenarios.
2) Partition strategies are all about needs. I guess you don't need to optimize it, or you would be doing it already!
I don't know if it's a factor of the scenarios I work on, or of my nature, but I tend to look for every way to optimize HANA. I found for instance, in my latest model, that we could get from 30s response times to 600ms response times with some tuning and rethinking.
That's the difference between 200x faster than their existing EDW and 10,000x faster, and it's the ability to write apps on top of 10,000x faster that is differentiating to them. At 200x faster, it's just a faster data mart.
If you spent time optimizing your solution to be 10-50x faster, would that make any difference?
You are absolutely right to aim for optimal performance.
I suppose it's about striking the balance depending on the varying reporting needs.
In our case we could partition further, e.g. HASH & RANGE across the nodes; however, given our reporting requirements are evolving and changing rapidly, any sophisticated partitioning strategy we adopt may become outdated. We do, though, give special attention to the largest tables 😉
Personally I think it would be nice if SAP had some default partitioning strategies for side-car and SoH, based on their own HANA Live reporting models, which customers could adopt and modify accordingly.
I suppose the key point is that partitioning isn't just for day-one but will need to be monitored and reassessed over time, to meet the changing user requirements and expectations.
Thank you Vitaliy. I found this cool smart data access feature weeks ago as well. SAP HANA is evolving really fast, each SP offers many cool features.
Dear Aron,
Hadoop is write optimized (files upload faster). You can find information on this in slide 4 of the following link: http://www.stanford.edu/class/ee380/Abstracts/111116-slides.pdf
SAP HANA reads data faster with the power of in-memory, column store, parallelization of queries, etc. It can read data faster than Hadoop. Hence reports will naturally be faster in SAP HANA than in Hadoop Impala.
Rgds,
Mat.
Thanks for contributing, that’s a useful link.
You are of course correct that HANA and other in-memory databases are going to be faster than disk-bound solutions. Hadoop is designed for parallel computing, though, and the Hadoop software vendors are also working on ways to make use of the memory (cache) on each of the nodes to speed up reporting.
The big boys like Facebook have HADOOP clusters with 1000’s of nodes
(E.g. 2000 machines, with a combined 64Tb of Memory, 21 Pb storage and 16,000 Cores)
http://hadoopblog.blogspot.nl/2010/05/facebook-has-worlds-largest-hadoop.html
That's a lot of Memory that could be used to speed up reporting, if/when the HADOOP developers work out how to utilize it efficiently across distributed nodes.
NOTE: Impala already has the capability to cache its data in memory, but you have to have executed the query first. It's a step in the right direction.
Interesting article Aron. I think the main thing I see is that Hadoop and HANA coexist rather nicely - the use cases are different, and often symbiotic when used together.
I have a few ways in which you might improve HANA performance.
- Your load performance looks bad. I typically load 300-500MB/sec of raw data into a HANA node. Probably your performance is bad because of the size of your box - you need 10GB of RAM just to run HANA, so you only have 3.5GB of space left for data and 3.5GB for reporting with a 17GB system.
- If you're getting load degradation it's because your partitions are getting big. Each time you do a MERGE DELTA (or merge dog does it for you), it will create a copy of any changed partition to merge. If you're doing big loads with Data Services, then it will take care of this for you. If you do them yourself with LOAD statements then you need to control this yourself.
- In my tests, DOUBLE performs better than DECIMAL with HANA.
- SELECT * always performs badly with HANA because it scans every column. Consider rewriting your queries to use just the columns you need - all your slower HANA queries include a * (see the sketch after this list).
- Consider using a partition strategy that uses a commonly used field. HANA will like you for this because it will reduce the data scanned.
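As a TPC-H-style illustration of that rewrite (the query and column names are just an example):
-- Name only the columns the report needs instead of SELECT *
SELECT "L_RETURNFLAG",
       "L_LINESTATUS",
       SUM("L_QUANTITY")      AS SUM_QTY,
       SUM("L_EXTENDEDPRICE") AS SUM_BASE_PRICE
FROM   "LINEITEM"
WHERE  "L_SHIPDATE" <= '1998-09-01'
GROUP BY "L_RETURNFLAG", "L_LINESTATUS";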
And some other points:
- The largest certified HANA system is IBM's 56x1TB = 56TB system according to PAM. My understanding is that all vendors will consider certifying larger systems if you can afford them.
- HANA only runs on SUSE Linux for SAP Applications.
- Integration with Hive for query federation is in HANA 1.0 SP06, which has been released. The AWS version is imminent, I hear. NLS is already possible using Data Services.
- You can put HANA data on disk - it loads only the tables and columns it needs, and you can force unload. Save Points are always saved to disk to allow restarts.
- You didn't mention the HANA XS development platform, which I think is a big differentiator.
I think an interesting test would be to use a proper HANA cluster for this. All the HANA projects I work on use billions of rows and we get billions of aggregations per second. That's when HANA really shines and Hadoop will time out every equivalent query, even with a large cluster.
Have you considered a $/performance slide? In one example, HANA is 4x more expensive than Hadoop and 21x faster, which makes HANA roughly 5x better on cost/performance. For most of my customers, that's the important question.
John
Hi John,
Thanks for some excellent comments and suggestions.
Though I am comparing HANA & HADOOP Impala, I firmly believe HANA and HADOOP are currently best suited to different use cases that can co-exist, as you point out.
I would have loved to have performed a similar exercise on a decent size HANA and HADOOP Cluster, with billions of records, but my budget is limited 😉
If anyone wants to loan their environment out for a few days though I'll give it a crack 🙂
It's very much a work in progress, but below is a link to a collaborative spreadsheet built with other members of the Impala user forum [special thanks to Jung-Yup, Henrick and OnMarc], aimed at comparing performance and cost for small HADOOP clusters and HANA on AWS (Row 21 on the summary sheet):
https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEVMeTQwZGJSOVQwcFRSRFFFUmcxWWc#gid=6
If you look on the summary sheet you can see that OnMarc have run a subset of TPC-H on Impala with 1.2 Billion records, and run times were pretty good for hardware costing only $475 p/month.
It still falls short of HADOOP's best use case of reporting on tens or hundreds of billions of records, but it gives a taste of what might be possible.
Using HADOOP on AWS has a relatively bad cost/performance factor, but I think it's ideal for a small PoC. I imagine though that large organisations would be able to negotiate better hourly rates on AWS (or its rivals) to improve its cost/performance factor.
In addition, AWS recommends using much larger instance types than I used for productive HADOOP clusters. They claim the increased cost per virtual machine is more than offset by the significant performance improvement, which makes sense if it's properly utilised.
But as you suggested, having performance/cost stats on large HANA and HADOOP systems would be very useful, though arguments will always rage as people debate whether you can truly compare them using TPC-H or TPC-DI or ......
I believe it's better to have something to compare solutions on rather than just relying on all the Marketing.
Thanks again
Aron
We can do some tests some time when I have a box free.
In my current solution we can do select sum(amount) from transactions on 3bn rows in 600ms. If we add restrictions and group by, the numbers decrease because HANA has to do less aggregation and more scanning. If we add complex joins then we can get up to 3-5s response times (e.g. on 60m customers grouped into 10 bins).
I'd expect this to translate pretty equivalently into performance on TPC-H, and HANA seems to always scale linearly.
By the way we shouldn't need those count(*)s and they are very expensive in HANA. Should be possible to do count(l_extendedprice) and get the same result?
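For what it's worth: COUNT(*) counts rows while COUNT(column) counts non-NULL values, so they return the same result whenever the column has no NULLs, which is the case for L_EXTENDEDPRICE in TPC-H. e.g.:
-- Counts rows via a single NOT NULL column, as suggested above
SELECT COUNT("L_EXTENDEDPRICE") AS ROW_COUNT FROM "LINEITEM";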
Benchmarks are benchmarks. I'm in two minds as to the usefulness of them, but at their best they allow people to understand how solutions work.
Cool. If we were to do that on a bigger HANA box I certainly agree with all your partitioning and query enhancements. Tune it to the max 🙂
TPC-H also has complex SQL queries with joins that would also be interesting to include. We could then also benchmark different partition strategies and the implications on reporting.
My other blog has the steps to create and import the TPC-H dataset into a HANA environment:
http://scn.sap.com/community/developer-center/hana/blog/2013/06/14/bread-butter-8-tables-with-all-you-can-eat-data
The question then is how many billions of records would your spare box handle? 😉
If you are still keen, then ping me a private email with the Memory/Host specs of your HANA box. I'd then like to try and find another volunteer in the Hadoop Impala community, with a suitably sized HADOOP cluster, who might also want to participate.
Alternatively, if we could gather donations of $3000 p/day we could compare up to a 50 Node HADOOP Cluster on AWS, using their most powerful instance type (cc2.8xlarge), which would give us 1600 hyperthreaded virtual CPUs, 3Tb of Memory & 164Tb of storage to play with. All going well, with no distractions, the comparison should only take a couple of days. But in reality, for this limited testing, it would only cost $100 p/day for 2 instances during the preparation stage; you then launch the other 48 instances when everything is in place to execute the finalised test scripts. The beauty of AWS is you only pay hourly for what you use.
Cost aside, my preference would be to find another willing participant in the HADOOP world; they might then bring some further tuning tips.
At worst this bench-marking exercise, on larger environments, might provide some interesting lessons learnt to share.
There's nothing like a race to encourage performance tuning competition.
Surprised to see HANA do so well despite being represented by the smallest instance (17GB / m2.xlarge), and I don't think there is any other limit to its offering but budget constraints.
Yes, I'd agree that performance even on the smallest of HANA boxes was still very impressive.
One small point though: HANA is still limited even if you have the deepest pockets.
As John commented above: "The largest certified HANA system is IBM's 56x1TB = 56TB system according to PAM."
HADOOP by comparison handles Petabytes of data and beyond. Theoretically you can just keep throwing money at it, to get bigger and bigger, faster and faster.
Although don't expect SQL queries (using HIVE, IMPALA, STINGER etc.) to come back in under 10 seconds if they need to aggregate trillions of rows of data. 😉
They are just different tools for different jobs.
Aron,
The way I look at it is that there's no limit to what IBM will certify for a large enough budget, and HADOOP is not really certifiable as it is open source, so I agree they are different tools for different jobs and different budgets (G$).
Just stumbled upon these recent slides from Cloudera.
http://www.slideshare.net/cloudera/presentations-25757981
Looks like they have tested Impala out on TPC-H on a 10 Node cluster, using a 1Tb dataset. That represents 6 Billion records in the line item table.
Their test queries, using Parquet tables (Column store), appear to run in under 30 seconds (see slide 12).
I can't tell from the slide exactly which queries they ran on the column store table, except perhaps the TPC-H Q1 query, the sample 'Pricing Summary' report, which (unlike most TPC-H queries) runs on the LINEITEM table alone without joins.
Uncompressed, the TPC-H lineitem csv file they loaded is ~700Gb, which is just a little bit more than what HANA on AWS can cope with, even if we managed 5 times compression. 😛
I presume a 1TB Hana box should handle the load/merge of this number of records. Similar queries should then run in less than a few seconds. Any volunteers to try?? 😉