
Data Management in the Cloud: Parallel DBs are faster than Google’s MapReduce

There isn’t a conference or blog where people aren’t enthusiastic about cloud computing. John Ousterhout, a seasoned computer science veteran (UC Berkeley, Sun, a couple of startups and now Stanford), said recently at the Stanford Computer Forum’s Annual Meeting that this is the most interesting time of his life. Further to this point, the RAD Lab at UC Berkeley has just published a very comprehensive and optimistic paper on cloud computing.


We all agree: for small software companies as well as big corporations, virtualization of computing resources will improve efficiency considerably. New business models will open up for small SW companies, as they no longer have to invest upfront in their IT but can consume as they go. TCO will come down dramatically for IT departments, from midsized companies to large corporations. Bechtel, for example, reported recently that it has increased server utilization to 70% by transitioning to a private cloud. “Our total costs are the same, but with a heck of a lot more capacity,” says Ramleth, Bechtel’s CIO. “Ten times more, to be exact.” Bechtel is just halfway there; about 50% of its corporate users work in this new virtualized environment.

Intel’s IT chief architect Gregg Wyant is responsible for all IT-related innovations, and he is currently overseeing Intel IT’s plan for consolidating from 117 to 8 data center locations worldwide. Intel intends to achieve a net present value of US$550 to 650 million and overall cost-avoidance savings of US$1 billion or more through its data center efficiency program (SAP Research Summit 2008).


There are many questions that have to be answered before mission-critical enterprise applications can run in a public cloud. What about SLAs in the cloud? Google and Amazon, for example, were each down for several hours on multiple days last year. Security, auditability, and privacy are other big issues. The Above the Clouds paper by the RAD Lab (see link above) discusses, from a research point of view, ten general obstacles and possible ways to overcome them.

One of the most pressing and challenging questions is scalable storage.

There have been many heated discussions about what the “right” data storage strategy for future biz apps is. Do we have to throw away databases and rewrite all applications for MapReduce? The fans of SQL say it is a very powerful language, easy for business programmers to use, and fairly portable, with a proven track record of 30 years. Furthermore, SQL databases come with tons of existing tools and applications for reporting and data analysis. Web and cloud aficionados will counter that most programmers are more familiar with object-oriented, imperative programming than with SQL.
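To make the contrast between the two camps concrete, here is the same aggregation expressed both ways — declaratively in SQL and imperatively in a hand-written loop. The sales table and its rows are made up purely for illustration:

```python
import sqlite3

# Hypothetical sales data: (region, amount)
rows = [("EMEA", 100.0), ("APJ", 50.0), ("EMEA", 75.0), ("AMER", 30.0)]

# Declarative (SQL): describe *what* you want; the engine plans the execution.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Imperative: spell out *how* to compute it, step by step.
loop_result = {}
for region, amount in rows:
    loop_result[region] = loop_result.get(region, 0.0) + amount

assert sql_result == loop_result  # both yield EMEA: 175.0, APJ: 50.0, AMER: 30.0
```

The SQL version leaves indexing, grouping strategy, and parallelization to the optimizer; the loop makes every step explicit — which is exactly the trade-off the two camps argue about.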

Wouldn’t it be cool to compare those two paradigms in a benchmark? 

Michael Stonebraker, Sam Madden (both MIT CSAIL), et al. have done exactly that. They compared parallel RDBMS clusters, which have been commercially available for nearly two decades, against MapReduce for large-scale data analysis in terms of performance and development complexity.

The good news is that parallel databases are still 3 to 6 times faster than MapReduce on average. The RDBMS wins on speed, automatic query optimization, less code to implement per task, and so on. The paper also shows disadvantages of the RDBMS compared to MapReduce: it is harder to set up and slow when importing large amounts of data. MapReduce is slower in part because it has to plow through the entire input file at the start of each query. There is also a potential performance price to pay for materializing the intermediate files between the map and reduce phases: MapReduce stores the interim results of each map instance before passing them on to the reduce workers. Materialized intermediates are an advantage with regard to recovery, though. For long queries, for example, less data is lost when a node dies, since recovery can resume from the stored interim results.
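The map-materialize-reduce pipeline described above can be sketched in a few lines. This is a toy single-machine word count, not Google's implementation; the function names and the in-memory `intermediate` list (standing in for the intermediate files MapReduce writes to local disk) are my own illustration:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs. In real MapReduce this output is
    materialized to local disk before any reduce worker touches it."""
    intermediate = []
    for doc in documents:
        for word in doc.split():
            intermediate.append((word, 1))
    return intermediate

def reduce_phase(intermediate):
    """Reduce: group by key and sum. Because the map output was
    materialized, a crashed reduce worker can be restarted from it
    without re-running the whole map phase."""
    counts = defaultdict(int)
    for word, n in intermediate:
        counts[word] += n
    return dict(counts)

docs = ["map reduce map", "reduce"]
print(reduce_phase(map_phase(docs)))  # {'map': 2, 'reduce': 2}
```

The extra I/O for the intermediate step is the performance price; the ability to restart from it after a node failure is the recovery benefit the paper weighs against it.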


Furthermore, the paper shows that brute-force HW parallelization is not necessarily the best solution. DB clusters are so successful because they have developed, over time, the right balance between scaling up and scaling out. A database needs fewer nodes than MR to manage the same amount of data, thanks to the superior efficiency of modern DBMSs. Take a total disk capacity of 2PB as an example: MR would use 1,000 nodes with 2TB of disk per node. Compare that to eBay’s Teradata configuration, which uses just 72 nodes (two quad-core CPUs, 32GB RAM, and 104 300GB disks per node) to manage about 2.4PB of relational data.
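The node counts above are easy to verify as back-of-envelope arithmetic, using only the figures quoted in the text:

```python
TB = 1
PB = 1000 * TB

# Commodity MapReduce-style cluster from the example: 2 TB of disk per node.
mr_nodes = (2 * PB) // (2 * TB)          # 1000 nodes for 2 PB

# eBay's Teradata configuration: 104 disks x 300 GB per node, 72 nodes.
disk_per_node = 104 * 0.3 * TB           # ~31.2 TB of raw disk per node
teradata_capacity = 72 * disk_per_node   # ~2246 TB, i.e. roughly 2.4 PB raw

print(mr_nodes, round(teradata_capacity))
```

So the Teradata cluster covers a comparable amount of data with roughly one-fourteenth the node count — the scale-up-vs-scale-out balance the paragraph describes.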


SAP has its own in-memory column data store, TREX. SAP-internal benchmarks (not published so far) have shown that TREX is 6 times faster than the fastest parallel database in the MIT benchmark (Vertica), as long as all the data being analyzed can be kept in main memory. This is not too severe a restriction, since compression reduces the original data size by up to 10 times and memory is becoming cheaper fast. In other words, it is fair to say our TREX is about 20 to 35 times faster than MapReduce when analyzing large data sets.
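The roughly 20-to-35-times claim follows from chaining the two reported speedups; here is that arithmetic, plus a memory-fit check under the stated ~10x compression. The data-size and RAM figures in the second half are hypothetical, chosen only to illustrate the check:

```python
# Chaining the two reported speedups (figures from the text).
vertica_over_mr = (3, 6)   # parallel DBs vs MapReduce, per the MIT benchmark
trex_over_vertica = 6      # SAP-internal, unpublished benchmark

low = trex_over_vertica * vertica_over_mr[0]    # 6 * 3 = 18x
high = trex_over_vertica * vertica_over_mr[1]   # 6 * 6 = 36x
print(f"TREX vs MapReduce: roughly {low}x to {high}x")

# Memory-fit check under ~10x compression (sizes below are made up):
raw_tb, compression, ram_tb = 5.0, 10, 0.75
fits_in_memory = raw_tb / compression <= ram_tb   # 0.5 TB compressed vs 0.75 TB RAM
print(fits_in_memory)
```

The 18x-36x range brackets the article's rounded "20 to 35 times" figure, and the compression step is what makes the all-in-main-memory assumption plausible for sizable data sets.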


This is all good news, as it shows that (parallel) databases still have a comfortable speed advantage over cloud-based storage.

We can go on using proven technology that is the basis for most applications and data stored in enterprises. Furthermore, we can combine the relational and the object-oriented paradigms with new application frameworks like Ruby on Rails and LINQ that have implemented object-relational mapping patterns. By the way, we at SAP have just finished an exploratory research project that allows running Ruby in our robust and proven SAP Web Application Server for ABAP.

  • Hmm… thanks for the good references in this article. Although I feel a bit unsure here: I think MapReduce and parallel RDBMSs solve different problems. Nobody can or wants to run a business application on a MapReduce base. On the other hand, nobody would run a one-time log file analysis on an OLAP cube.


    • It is a versus in many senses. 1) UC Berkeley doesn’t teach SQL any more, so there is a SQL vs. object-oriented programming divide. 2) Amazon runs its business on MR. Many people believe that MR-type applications are the future of OLAP, and that OLAP will disappear since it doesn’t scale the way MR does.
      • I don’t think it is correct that anything business-critical at Amazon runs on M/R. They use the Dynamo KV store and also Oracle. They are at the forefront of async architectures and are big believers in the CAP theorem (pick any two of consistency, availability, and partition tolerance).

        For those massive scalability requirements, it is BASE vs. ACID.

        A major business that is based on M/R is Google’s Search — and that’s not a typical business application.


        • Yes, Amazon uses Dynamo for ordering (OLTP), but they use MR for analytics (OLAP). MR and parallel DBs are for OLAP. Amazon computes statistics like “people who bought X also bought Y,” “based on your preferences you might be interested in X,” and many more things with MR.

          You are right that for transactions it’s CAP, and Amazon gave up consistency. One can produce inconsistent shopping carts at Amazon; they live very well with “eventually consistent.” Did you ever experience an inconsistent shopping cart at Amazon?

          The big question for transactional systems is whether eventual consistency and a new ACID are good enough for biz app transactions. I think for the most part, yes. Again, the blog and the MIT paper are about OLAP (analyzing big amounts of data), not OLTP.

          I don’t agree that Google’s search is not a biz app. Many companies use google search with a small g (the company-specific version) as their search engine. Isn’t that a biz app?

          Amazon Web Services has just come out with its elastic Hadoop service.

          Soon we will see tons of new apps, especially from SMEs, in areas like analytics, market research, etc.