On July 24, Daniel McWeeney blogged about how Google could revolutionize BI. Since he missed the true significance of the new BI accelerator, I retell the tale in my own words to put the right spin on it. In short, a “googly” approach reflects very closely the line we took to develop the BI accelerator.
A Business Intelligence (BI) application collects vast amounts of business data in a central repository and enables a business user to access and interpret the data to make more intelligent business decisions. SAP NetWeaver BI helps companies create such a repository and also gives the business users a point of easy and effective access to the data.
Most BI applications occupy many terabytes of disk space, use large amounts of memory, and tend to be CPU intensive. This can be difficult in consolidated systems, where many virtually separate servers run on one piece of hardware. To guarantee the availability of many terabytes of data, the best solution is parallel writes to remote storage. As for memory and CPU, to ensure acceptable response times the best approach is parallel processing over a scalable landscape of servers that can exceed the capacity of a single box.
The SAP NetWeaver BI accelerator parallelizes the execution of a query over a scalable blade server landscape and therefore takes a big step toward ensuring good response times. The BI accelerator is marketed as a preconfigured appliance that simply plugs into the existing customer landscape. But we can run a lot further with the basic ideas behind the accelerator.
Clusters, Chunks, and MapReduce
Google is really good at keeping huge amounts of data available, allowing very many machines to access data simultaneously, and processing requests in parallel across thousands of machines. Let’s see how we might use some Google ideas to rethink the basics of implementing BI.
The Google File System (GFS) is designed to be fault tolerant and self-monitoring, and to work in scenarios where the data files are up to many GB in size. Files are divided into standard 64 MB chunks, each with a unique handle and each replicated on multiple servers. Reads are either large streaming reads of up to a few MB or small random reads of a few KB. Writes are mostly large sequential writes that append data to files. This sort of file system looks good for many BI scenarios.
A GFS cluster has a single master server and a number of chunkservers. The master server stores in memory an index with the names and addresses of the chunkservers for each part of each data file. When a client makes a request for a part of a file, it asks the master server where to find it. The master server responds with a chunk handle and the locations of the replicas. Throughput is optimized by separating file system control via the master from data transfer between chunkservers and clients.
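The master's lookup can be sketched as a simple in-memory index. This is a toy illustration, not Google's actual data structures; all names and addresses here are invented:

```python
# Minimal sketch of a GFS-style master index, assuming the fixed 64 MB chunk size.
# The chunk_index contents and server addresses are hypothetical.
CHUNK_SIZE = 64 * 1024 * 1024

# (file name, chunk number) -> (chunk handle, replica locations)
chunk_index = {
    ("/logs/clicks", 0): ("handle-00a1", ["cs1:7000", "cs4:7000", "cs9:7000"]),
    ("/logs/clicks", 1): ("handle-00a2", ["cs2:7000", "cs5:7000", "cs7:7000"]),
}

def locate(file_name, byte_offset):
    """Answer a client request: which chunk holds this offset, and where are its replicas?"""
    chunk_number = byte_offset // CHUNK_SIZE
    handle, replicas = chunk_index[(file_name, chunk_number)]
    return handle, replicas

# A read at offset 70 MB falls into the second 64 MB chunk.
handle, replicas = locate("/logs/clicks", 70 * 1024 * 1024)
```

After this one exchange with the master, the client talks directly to a chunkserver, which is exactly the control/data separation described above.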
The GFS balances load on the chunkservers as it adds new data to the cluster. The master server selects a chunkserver for new data based on current disk usage, replica location, and so on, to avoid overloading any single chunkserver. So as the files grow, the GFS automatically distributes the chunks over the cluster. All metadata changes are persisted to an operation log that is replicated to multiple servers.
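The placement decision can be sketched as picking the least-loaded servers for each new chunk. This is a deliberately crude heuristic with made-up numbers; the real master also weighs replica spread across racks and recent creation rates:

```python
# Toy replica placement: choose the chunkservers with the lowest disk usage.
# The usage figures are hypothetical.
disk_usage = {"cs1": 0.82, "cs2": 0.41, "cs3": 0.65, "cs4": 0.38, "cs5": 0.90}

def place_replicas(usage, n=3):
    """Return the n least-loaded chunkservers for a new chunk's replicas."""
    return sorted(usage, key=usage.get)[:n]

placement = place_replicas(disk_usage)  # least-loaded servers first
```

Run for every new chunk as files grow, a rule like this spreads data over the cluster without any manual balancing.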
Google uses a programming paradigm called MapReduce, built from the map and reduce functions familiar from functional programming. Map and reduce functions operate on lists of values. Map does the same computation on each value to produce a new list of values. Reduce then collapses or combines the resulting values into a smaller number of values. Again, all this looks good for a BI application.
Google’s “secret sauce” is to distribute map and reduce operations across thousands of machines in parallel. This opens up new possibilities for data processing. If you need to find a few records in a huge collection, MapReduce enables you to do it on an army of cheap servers, even if a few of them die on the way. Massively parallel processing makes the job fast and fault tolerant.
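A toy single-process sketch of the paradigm, using the classic word-count example (the secret sauce, distributing the same two functions across thousands of machines, is what Google adds on top):

```python
from functools import reduce

# Word counting in MapReduce style, run serially here for clarity.
docs = ["big data", "big servers", "cheap servers"]

# Map: the same computation on each document, emitting (word, 1) pairs.
mapped = [pair for doc in docs for pair in ((word, 1) for word in doc.split())]

# Reduce: combine the pairs into per-word totals.
def combine(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(combine, mapped, {})
print(word_counts)  # {'big': 2, 'data': 1, 'servers': 2, 'cheap': 1}
```

Because map touches each input independently and reduce only needs partial results to merge, both halves parallelize naturally, and a failed machine's inputs can simply be mapped again elsewhere.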
Parallel Processing to Accelerate BI
In unaccelerated SAP NetWeaver BI, a query goes like this:
1. A BI user opens a GUI, types into the boxes, and hits execute.
2. The OLAP processor defines a query from the input.
3. It then executes a select statement on the database to pull all the records it needs to answer the query.
4. It then aggregates those records to return just the data the user asked for.
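The serial select-then-aggregate path above can be sketched in a few lines. The fact table, its columns, and the query are invented stand-ins for whatever SQL the OLAP processor would generate:

```python
# Serial BI query sketch: select all matching records, then aggregate them.
# Table contents and column names are hypothetical.
fact_table = [
    {"region": "EMEA", "year": 2005, "revenue": 120},
    {"region": "EMEA", "year": 2006, "revenue": 150},
    {"region": "APJ",  "year": 2006, "revenue": 90},
]

def run_query(rows, year):
    # The select step: pull all the records needed to answer the query.
    selected = [r for r in rows if r["year"] == year]
    # The aggregation step: return just the totals the user asked for.
    totals = {}
    for r in selected:
        totals[r["region"]] = totals.get(r["region"], 0) + r["revenue"]
    return totals

result = run_query(fact_table, 2006)
print(result)  # {'EMEA': 150, 'APJ': 90}
```

The point to notice is that both the select and the aggregation run on a single database server, which is exactly where the bottleneck sits.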
A “googly” approach would parallelize steps 2 to 4 like this:
1. A client program constructs a map function based on the metadata for the server cluster to determine what data to read and where it is.
2. The program constructs a reduce function to execute the query and aggregate the records on the keys the user requested.
3. This MapReduce is sent out to the server cluster. The machines read their chunks of data and perform the execution and aggregation.
4. The client program gets all the reduced chunks, merges them, and returns the result set to the BI system.
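The steps above can be sketched as a map over chunks followed by a merge. For clarity the "cluster" here is just a list of in-memory chunks processed in a loop; a real accelerator would ship the two functions to the blades holding the data. All names and figures are illustrative:

```python
# Chunked sketch of the "googly" query path: each chunk is aggregated
# independently (the map/reduce sent to each server), then the client
# merges the partial results. Chunk contents are hypothetical.
chunks = [
    [{"region": "EMEA", "revenue": 120}, {"region": "APJ", "revenue": 90}],
    [{"region": "EMEA", "revenue": 150}, {"region": "APJ", "revenue": 60}],
]

def aggregate_chunk(rows):
    """What each machine does to its own chunk of data."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["revenue"]
    return totals

def merge(partials):
    """What the client does: merge the reduced chunks into one result set."""
    result = {}
    for part in partials:
        for key, value in part.items():
            result[key] = result.get(key, 0) + value
    return result

partials = [aggregate_chunk(c) for c in chunks]  # runs in parallel on a real cluster
answer = merge(partials)
print(answer)  # {'EMEA': 270, 'APJ': 150}
```

Because each chunk's aggregation is independent, adding blades adds capacity linearly, and a dead blade's chunks can be re-aggregated from their replicas.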
This approach is fault tolerant and very scalable, using standard components. The data can be aggregated in various ways, since this is done by a function acting on each of the chunks. The approach eliminates the need for precalculated aggregates, so administration costs are lower than for traditional BI systems.
In all essentials, this is the story of how the SAP NetWeaver BI accelerator works, using a number of inexpensive blades mounted on their own storage system. This whole approach allows great flexibility in terms of analysis and in future might allow BI to answer arbitrary queries against the data, including semantically deep ones entered in natural language.
The BI accelerator developers are excited by such “googly” ideas and paradigms such as grid computing. What we want to ship is an accelerator that runs in parallel on as many blades as our customers care to deploy and answers their questions as fast as they can ask them.