
On July 24, Daniel McWeeney posted How Google could revolutionize BI. Since he missed the true significance of the new BI accelerator, I retell the tale in new words to put the right spin on it. In short, a “googly” approach reflects very closely the line we took to develop the BI accelerator.

Business Intelligence

A Business Intelligence (BI) application collects vast amounts of business data in a central repository and enables a business user to access and interpret the data to make more intelligent business decisions. SAP NetWeaver BI helps companies create such a repository and also gives the business users a point of easy and effective access to the data.

Most BI applications occupy many terabytes of disk space, use large amounts of memory, and tend to be CPU intensive. This can be difficult in consolidated systems, where many virtually separate servers run on one piece of hardware. To guarantee the availability of many terabytes of data, the best solution is parallel writes to remote storage. As for memory and CPU, to ensure acceptable response times the best approach is parallel processing over a scalable landscape of servers that can exceed the capacity of a single box.

The SAP NetWeaver BI accelerator parallelizes the execution of a query over a scalable blade server landscape and therefore takes a big step toward ensuring good response times. The BI accelerator is marketed as a preconfigured appliance that simply plugs into the existing customer landscape. But we can run a lot further with the basic ideas behind the accelerator.

Clusters, Chunks, and MapReduce

Google is really good at keeping huge amounts of data available, allowing very many machines to access data simultaneously, and processing requests in parallel across thousands of machines. Let’s see how we might use some Google ideas to rethink the basics of implementing BI.

The Google File System (GFS) is designed to be fault tolerant and self-monitoring, and to work in scenarios where the data files are up to many GB in size. Files are divided into standard 64 MB chunks, each with a unique handle and each replicated on multiple servers. Reads are either large streaming reads of up to a few MB or small random reads of a few KB. Writes are mostly large sequential writes that append data to files. This sort of file system looks good for many BI scenarios.

A GFS cluster has a single master server and a number of chunkservers. The master server stores in memory an index with the names and addresses of the chunkservers for each part of each data file. When a client makes a request for a part of a file, it asks the master server where to find it. The master server responds with a chunk handle and the locations of the replicas. Throughput is optimized by separating file system control via the master from data transfer between chunkservers and clients.
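
As a rough illustration of that split between control and data paths, here is a minimal sketch in Python. All names and structures are invented for the example; this is not Google's actual API.

    # Hypothetical sketch of a GFS-style chunk lookup; all names are invented.
    CHUNK_SIZE = 64 * 1024 * 1024  # standard 64 MB chunks

    class Master:
        """Holds the in-memory index mapping file chunks to chunkservers."""
        def __init__(self):
            # (file name, chunk index) -> (chunk handle, replica locations)
            self.index = {
                ("sales.dat", 0): ("chunk-0001", ["srv-a", "srv-b", "srv-c"]),
                ("sales.dat", 1): ("chunk-0002", ["srv-b", "srv-d", "srv-e"]),
            }

        def lookup(self, file_name, offset):
            return self.index[(file_name, offset // CHUNK_SIZE)]

    def read(master, chunkservers, file_name, offset, length):
        # Control path: ask the master where the data lives.
        handle, replicas = master.lookup(file_name, offset)
        # Data path: fetch the bytes directly from one replica, so the
        # master never sits in the middle of the data transfer.
        return chunkservers[replicas[0]].read(handle, offset % CHUNK_SIZE, length)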

The GFS balances load on the chunkservers as it adds new data to the cluster. The master server selects a chunkserver for new data based on current disk usage, replica location, and so on, to avoid overloading any single chunkserver. So as the files grow, the GFS automatically distributes the chunks over the cluster. All metadata changes are persisted to an operation log that is replicated to multiple servers.

Google uses a widely known programming paradigm called MapReduce. Map and reduce functions operate on lists of values. Map does the same computation on each value to produce a new list of values. Reduce collapses or combines the resulting values into a smaller number of values, again doing the same computation on each value. Again, all this looks good for a BI application.
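
To make the paradigm concrete, here is a toy sketch in Python; the numbers and the currency conversion are made up purely for illustration.

    from functools import reduce

    revenues_eur = [120.0, 75.5, 310.0, 42.25]

    # Map: do the same computation on each value (convert each amount to USD).
    revenues_usd = list(map(lambda amount: amount * 1.28, revenues_eur))

    # Reduce: combine the resulting values into fewer values
    # (here, collapse the whole list into one total).
    total_usd = reduce(lambda acc, amount: acc + amount, revenues_usd, 0.0)

    print(total_usd)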

Google’s “secret sauce” is to distribute map and reduce operations across thousands of machines in parallel. This opens up new possibilities for data processing. If you need to find a few records in a huge collection, MapReduce enables you to do it on an army of cheap servers, even if a few of them die on the way. Massively parallel processing makes the job fast and fault tolerant.

Parallel Processing to Accelerate BI

In unaccelerated SAP NetWeaver BI, a query goes like this (a rough code sketch follows the list):

  1. A BI user opens a GUI, types into the boxes and hits execute.
  2. The OLAP processor defines a query from the input.
  3. It then executes a select statement on the database to pull all the records it needs to answer the query.
  4. It then aggregates those records to return just the data the user asked for.
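
In code, steps 3 and 4 amount to one big select followed by an in-memory aggregation. The following is a hedged sketch with invented table and field names, not actual BI code:

    def run_query(connection, region, year):
        # `connection` is any Python DB-API connection (sqlite3, or a real RDBMS driver).
        # Step 3: one big select pulls every record needed to answer the query.
        rows = connection.execute(
            "SELECT product, revenue FROM sales_cube "
            "WHERE region = ? AND year = ?",
            (region, year),
        ).fetchall()

        # Step 4: aggregate those records down to just what the user asked for
        # (here, total revenue per product).
        totals = {}
        for product, revenue in rows:
            totals[product] = totals.get(product, 0.0) + revenue
        return totals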

A “googly” approach would parallelize steps 2 to 4 like this (again, a sketch in code follows the list):

  1. A client program constructs a map function based on the metadata for the server cluster to determine what data to read and where it is.
  2. The program constructs a reduce function to execute the query and aggregate the records on the keys the user requested.
  3. This MapReduce is sent out to the server cluster. The machines read their chunks of data and perform the execution and aggregation.
  4. The client program gets all the reduced chunks, merges them, and returns the result set to the BI system.
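
Here is a minimal Python sketch of the idea. The chunk layout, the record structure, and the use of threads to stand in for remote servers are all assumptions chosen for the example; this illustrates the MapReduce pattern, not the accelerator's actual code.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    # Each "chunk" stands in for the slice of data one server holds locally.
    chunks = [
        [{"region": "EMEA", "product": "A", "revenue": 10.0},
         {"region": "EMEA", "product": "B", "revenue": 4.0}],
        [{"region": "EMEA", "product": "A", "revenue": 7.5},
         {"region": "APJ",  "product": "A", "revenue": 99.0}],
    ]

    def map_chunk(chunk):
        # Steps 1-2: read the local chunk, filter it, and aggregate on the
        # keys the user requested (here: revenue per product for EMEA).
        partial = defaultdict(float)
        for record in chunk:
            if record["region"] == "EMEA":
                partial[record["product"]] += record["revenue"]
        return partial

    def merge(partials):
        # Step 4: the client merges the reduced chunks into the result set.
        result = defaultdict(float)
        for partial in partials:
            for key, value in partial.items():
                result[key] += value
        return dict(result)

    # Step 3: ship the work out; each machine processes its own chunk in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))

    print(merge(partials))  # {'A': 17.5, 'B': 4.0}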

This approach is fault tolerant and very scalable, using standard components. The data can be aggregated in various ways, since this is done by a function acting on each of the chunks. The approach eliminates the need for precalculated aggregates, so administration costs are lower than for traditional BI systems.

In all essentials, this is the story of how the SAP NetWeaver BI accelerator works, using a number of inexpensive blades mounted on their own storage system. This whole approach allows great flexibility in terms of analysis and in future might allow BI to answer arbitrary queries against the data, including semantically deep ones entered in natural language.

The BI accelerator developers are excited by such “googly” ideas and paradigms such as grid computing. What we want to ship is an accelerator that runs in parallel on as many blades as our customers care to deploy and answers their questions as fast as they can ask them.

3 Comments

  1. David Halitsky
    Thanks for a great overview post, Andrew.  Two points I’m confused about:

    1)
    You write:

    “The SAP NetWeaver BI accelerator parallelizes the execution of a query over a scalable blade server landscape and therefore takes a big step toward ensuring good response times. The BI accelerator is marketed as a preconfigured appliance that simply plugs into the existing customer landscape. But we can run a lot further with the basic ideas behind the accelerator.”

    Should we assume from this that the current accelerator still runs against standard databases such as DB2, Oracle, etc.  Or does “accelerated SAP BI” already use its own internal file system or database?

    2)

    You write:

    “A “googly” approach would parallelize steps 2 to 4 like this:

    A client program constructs a map function based on the metadata for the server cluster to determine what data to read and where it is.
    The program constructs a reduce function to execute the query and aggregate the records on the keys the user requested. This MapReduce is sent out to the server cluster. The machines read their chunks of data and perform the execution and aggregation. The client program gets all the reduced chunks, merges them, and returns the result set to the BI system.”

    Do we assume from this that a “googly” accelerated BI would use its own file system/database and not rely on the usual 3rd-party databases?  Or could a “googly” accelerated BI use 3rd-party databases that were already “googly” themselves?

    Thanks for any clarification you can provide.

  2. Thanks, David, I’m glad to have the chance to clarify these points. There was a lot I left unsaid.

    To your 1: “Should we assume from this that the current accelerator still runs against standard databases such as DB2, Oracle, etc.  Or does “accelerated SAP BI” already use its own internal file system or database?”

    The BI accelerator appliance comes with its own dedicated storage. But the attached BI system still runs on a standard database. The BI infocubes stay on the database. The BIA indexes that the BI accelerator creates from the infocubes are written to the BIA dedicated storage. The BI accelerator loads the relevant BIA indexes into memory (the BIA indexes are highly compressed relative to infocubes and the blades have many gigs of installed memory), either on startup or to answer specific queries, and does all the query processing in memory. The BI accelerator is fast because all the query processing runs in memory.
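
    To give a flavour of why in-memory processing over compressed indexes is fast, here is a crude Python sketch. The dictionary-encoded column layout is an assumption chosen for illustration, not a description of the actual BIA index format.

        # Illustrative only: an assumed dictionary-encoded, in-memory column layout.
        product_dictionary = ["A", "B", "C"]        # each distinct value stored once
        product_column = [0, 1, 0, 2, 0, 1]         # small integer codes, one per row
        revenue_column = [10.0, 4.0, 7.5, 3.0, 2.5, 1.0]

        # Aggregating revenue per product is a single pass over arrays already
        # in memory, with no disk I/O on the query path.
        totals = [0.0] * len(product_dictionary)
        for code, revenue in zip(product_column, revenue_column):
            totals[code] += revenue

        print(dict(zip(product_dictionary, totals)))  # {'A': 20.0, 'B': 5.0, 'C': 3.0}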

    In theory the BI accelerator dedicated storage is just storage, but in fact it needs to be tied closely to the blades for good load performance, so we have a validation process for BIA storage solutions. Logically, the accelerator storage is secondary, since the BI infocubes still reside on the BI database as the primary source of truth from which the BIA indexes can in principle be rebuilt from scratch, but as a disaster recovery strategy this is rather primitive, and redundant storage of BIA indexes is preferable.

    To your 2: “Do we assume from this that a “googly” accelerated BI would use its own file system/database and not rely on the usual 3rd-party databases?  Or could a “googly” accelerated BI use 3rd-party databases that were already “googly” themselves?”

    In a more generalized approach to storage, the BIA storage would be part of an integrated solution offering remote mirroring and fast failover for the entire landscape, to ensure disaster-proof HA. If commodity storage were used, one would have to work hard on I/O bottlenecks to ensure respectable performance. Simply storing BIA indexes on the BI database, which already has a mature HA concept, is tempting, but it would certainly exacerbate bottleneck issues.

    In a “googly” approach, the BIA indexes would be stored on the blades’ local drives, perhaps in triply redundant chunks as in the GFS. The benefits are high I/O bandwidth for local reads and high availability for free with blade and chunk redundancy. The drawbacks are synchronization and consistency issues that can only be solved by a large global logging overhead. Google has relaxed synchronization and consistency requirements, but a BI solution would have to do better. As you see, there is still some wishful thinking behind these “googly” ideas.

    1. David Halitsky
      I think your clarification tends to support the point I was making to Dirk here:

      Bill Mann and “Invisible” Fields (or, what SAP SHOULD do with ADABAS)

      I mean no offense to proponents of accelerated SAP BI or any other a posteriori and ex post facto technologies for remedying the deficiencies in the so-called “relational paradigm”, which was really nothing more than a typical IBM SRI over-simplification until Ellison got a hold of it and turned it into a marketing ploy centered around the notion that even “mom and pop” can understand enough IT to automate their candy stores.

      Back in the last century there was a laxative called “Serutan” whose slogan was:

      “Serutan is ‘natures’ spelled backwards.”

      This slogan was once parodied by Paul Krassner of the now-legendary Realist magazine, who was prompted to write that “Tang is **** spelled sideways” after early American astronauts reported that Tang (a powdered citrus beverage mix) seemed to make them gaseous.

      So I’ll take the liberty of parodying this slogan once again by saying that all BI and BW approaches (not just SAP’s) are the industry’s way of spelling “index-centrism” neither backwards nor sideways, but implicitly.

      BTW, I checked out your personal URL and have to say I was impressed by the credentials you bring to your job at SAP.

      Folks like you used to be around in larger numbers at IBM’s Thomas Watson Research Center, until IBM decided it could no longer support anything but hard science there.  Some of them did not deserve the sinecures they had, but others did/do great and honest work.  I’m thinking of Art Appel, who pioneered some of the earliest computer vision algorithms, and of course Charlie Bennett, known as one of the fathers of quantum cryptography.

      It is a credit to SAP that they understand the relevance of creds like yours to the jobs SAP needs performed.  Now if SAP were REALLY smart, they’d let Mark Finnern bring his future-salon in-house …

