Parallel DBs faster than Google's MapReduce

Former Member · ‎04-24-2009

Data Management in the Cloud: Parallel DBs are faster than Google's MapReduce

There isn’t a conference or blog where people aren’t enthusiastic about cloud computing. John Ousterhout, a seasoned computer science veteran (UC Berkeley, Sun, a couple of startups and now Stanford), said recently at the Stanford Computer Forum’s Annual Meeting that this is the most interesting time of his live. Further to this point, the RAD Lab at UC Berkeley has just published a very comprehensive and optimistic paper on cloud computing http://abovetheclouds.cs.berkeley.edu/.

We all agree, for small software companies as well as big corporations virtualization of computing resources will improve efficiency considerably. New business models will open up for small SW companies as they don’t have to invest upfront in their IT but can consume as they go. TCO will come done dramatically for IT departments from midsized companies to large corporations. Bechtel, for example, reported recently that they have increased their server utilization to 70% by transitioning to a private cloud. "Our total costs are the same, but with a heck of a lot more capacity," says Ramleth, Bechtel’s CIO. “Ten times more, to be exact.” Bechtel has just gone half way; about 50% of their corporate users use this new virtualized environment. http://www.cio.com/article/453214/Cloud_Computing_to_the_Max_at_Bechtel?page=1

Intel’s IT chief architect Gregg Wyant is responsible for all IT-related innovations, and he is currently overseeing Intel IT’s plan for consolidating from 117 to 8 data center locations worldwide. Intel intends to achieve a net present value of US$550 to 650 million and overall cost-avoidance savings of US$1 billion or more through its data center efficiency program (SAP Research Summit 2008 and www.intel.com/it/apr.htm).

There are many questions that have to be answered before mission critical enterprise applications can run in a public cloud. What about SLA in the cloud? Google and Amazon e.g. have been down for many hours on several days last year. Security, audibility and privacy are other big issues. The Above The Cloud paper by the RAD Lab (see link above) discusses from a research point of view ten general obstacles and possible ways to overcome them.

One of the most pressing and challenging questions is scalable storage.

There have been many heated discussions about what is the “right” data storage strategy for future biz apps. Do we have to throw away databases and have to rewrite all applications for MapReduce? The fans of SQL say it’s a very powerful language, easy to use by business programmers and fairly portable with a proven track record of 30 years. Further SQL databases have tons of existing tools and applications for reporting and data analysis. Web and cloud aficionados will say most programmers are more familiar with object-oriented, imperative programming than with SQL.

Wouldn’t it be cool to compare those two paradigms in a benchmark?

Michael Stonebraker, Sam Madden (both MIT CSAIL) et al have done exactly that. They have compared RDBMS parallel clusters which have been commercially available for nearly two decades versus MapReduce for large-scale data analysis in terms of performance and development complexity. http://db.csail.mit.edu/pubs/benchmarks-sigmod09.pdf

The good news is, parallel databases are still 3 to 6 times (on the average) faster than MapReduce. RDBMS is better with regards to speed, query automation, less code to implement per task, etc. The paper also shows that RDBMS has disadvantages compared to MapReduce like lack of ease of setup and slow when importing a lot of data. MapReduce is slower since it has to plow through the entire input file at the start of each query. There is also a potential performance price to pay for materializing the intermediate files between the map and reduce phases. MapReduce stores the interim results of the map instance before passing it on to the reduce worker. Materialized intermediates are an advantage with regards to recovery though. For long queries for example, less data is lost when a node dies since recovery can use the stored interim results.

Further, the paper shows that brute force HW parallelization is not necessarily the best solution. DB clusters are so successful since they have developed over time the right balance between scale up and scale out. A data base needs fewer nodes than MR to manage the same amount of data due to the superior efficiency of modern DBMSs. Let’s take a total disk capacity of 2PB as example. MR would use 1000 nodes with 2TB of disk/node. Compare that to eBay’s Teradata configuration. It uses just 72 nodes (two quad-core CPUs, 32GB RAM, 104 300GB disks per node) to manage about 2.4PB of relational data.

SAP has its own in-memory column data storage, TREX. SAP internal benchmarks (not published so far) have shown that TREX is 6 times faster than the fastest parallel data base of the MIT benchmark (Vertica) as long as all data analyzed could be kept in main memory. This seems not too much of a restriction since compression reduces the original data size by up to 10 times and memory becomes cheaper fast. In other words, it’s fair to say our TREX is about 20 to 35 times faster than MapReduce when analyzing large data sets.

This is all good news as it shows that (parallel) data bases have still a comfortable speed advantage over cloud based storage.

We can go own using proven technology that is the basis for most applications and data stored in enterprises. Furthermore we can combine the relational and the object-oriented paradigm with new application frameworks like Ruby and LINQ that have implemented object-relational mapping patterns. By the way, we at SAP have just finished an exploratory research project that allows running Ruby in our robust and proven SAP Web Application Server for ABAP. http://www.sdn.sap.com/irj/scn/wiki?path=/display/research/blueruby