Background and perspective
Just a quick definition of BI for the uninitiated: BI or Business Intelligence is the buzz word to describe collecting vast amounts of business data in a central repository and allowing a business user the ability to access and interpret the data to make better informed, intelligent if you will, business decisions. Currently, SAP sells as part of their Netweaver stack a module called BI formerly known as BW that helps companies create this type of repository and also helps give the business users a point of entry into the information.
The ebb and flow of powerful computing is once again turning. A few years ago your average large company had a large air-conditioned room full of many separate systems all doing just fine. However, recently your average large company has started to consolidate instances and use virtualization to cram more and more servers onto one piece of really expensive hardware. There are three problems with putting applications like BI on these consolidated systems; BI systems are huge, occupying multiple terabytes of disk space, they also tend to use fairly large amounts of memory and tend to be very CPU intensive.
The first of these problems, the disk one, causes more issues for availability then for virtualization, how do you keep a constant backup on hand or in a different location for many terabytes of data? The second two are a problem best served not by server consolidation but by massive parallel processing. There is one company out there that does all of these things really well: Google.
Google does a few things really well ( actually they do lots of things really well, but well just focus on the ones that pertain to BI ): keep huge amounts of data available at all times, allow vast numbers of machines to simultaneously access data and parallel tasks across an immense, possibly tens of thousands of machines.
What could this possibly have to do with BI, well if you were at any SAP conference for the past two years you have probably seen the demo of the BIA. This is SAPs first attempt at parallelizing the execution of a query. Google on the other hand, has been doing this since its creation in some mythical lab at Stanford. Lets look at some technologies Google has at its arsenal first and then how we might assemble these parts together to make something that could change the face of BI.
GFS is arguably the main reason Google is able to do what it does best — searching and indexing. Google makes a few assumptions about the data that will be updated and served from the GFS: the files are big (100 MB to many GBs), most reads are either large streaming reads, individual operations [that] typically read hundreds of KBs, more commonly 1MB or more, or small random read[s] typically a few KBs, and the system must be highly fault tolerant and self monitoring.
One cluster of the GFS is relatively simple when looked at from a very high level; it has two major parts a single master server and a number of chunkservers. The master server stores a simple index, what parts of what files are located on what chunkservers and obviously where those chunkservers addresses. When a client ( in this sense this is probably some application, not you and I doing a search ) makes a request for some section of some file it asks the master server where it can locate it. The master server responds with the address of the chunkserver and other information that allows the client to access the chunkserver directly. The master is never recontacted unless a new file is needed. The other interesting bit about the GFS is that one file will exist on many different chunkservers. This allows the system to be highly available. If a client cannot connect to get the file from one chunkserver it simply asks the master where else it can find it. Think of this as having tons of hot spare distributed around the server room but none of the complexity of managing what that spare is doing. ( Some people think that Google has begun to distribute these chunks around the country to keep availability high even in the advent of a major disaster, but the GFS doesnt care if the server is in the same room or on the other side of the world. ) One of the other truly amazing parts of the GFS is its ability to rebalance the cluster. What this means is as new data is added to the cluster the master governing that cluster decides what chunkserver to place the new chunk on. It makes this decision based on several factors current disk usage, replica location and attempts to avoid putting a lot of new chunks on one system. So as the files are growing the system will automatically spread the new data around so no one chunkserver gets drilled with tons of new updates. For more information feel free to read the document linked to in above the paper is very good and provides more details to the adventurous.
Map/Reduce is a programming paradigm found in many functional languages. Here is the best definition I have found, from a Cornell CS class, Map operates on a list of values in order to produce a new list of values, by applying the same computation to each value. Reduce operates on a list of values to collapse or combine those values into a single value (or more generally a smaller number of values), again by applying the same computation to each value. This however isnt a Google construct, what Google has done however is come up with a way of distributing this task across vast numbers of machines in parallel without much heavy lifting on the part of the programmer. This opens up some truly amazing possibilities to data processing tasks. Need to deal with some massive data set and sift through it looking for certain things, before it used to be painful trying to run it on some huge expensive server. Whereas, Google will do it on a few thousands dirt cheap PCs and beat the huge server every time, even if a few of the cheap PCs die along the way. In the cited paper above Google discusses data sets that are approximately one terabyte. They are able to sort the entire data set in 891 seconds which includes writing the whole entire file out to the file system. Another test Google talks about is grep-ing for a particular 3 character pattern of the same terabyte of data takes 2.5 minutes. They go as far as killing some of the processes on some of the systems and still the task finishes. The ability to distribute these processes allows them to be much faster but also highly fault tolerant.
What this means to BI
What happens today when a user executes a query? ( lets assume no aggregates because as we all know the BIA does away with aggregates ):
- BW reads the query definition and presents some user interface to the user.
- The user types some stuff in the boxes and hits execute
- The OLAP processor looks at the query definition and splits the query into parts
- It then executes a select statement on the database to pull all the records it needs to suffice the query
- It then summarizes those records up to the exact table the user asked for
How can Google change this?
First they can take steps 3 through 5 and make use of the GFS and Map/Reduce to remove this from the BI server altogether. (Sounds familiar doesnt it? BIA anyone? ) The beauty of the Google model is the fault tolerance and the ability to easily take large tasks and fracture them into little parts. The way this would be distributed would probably work something like this ( keeping parts 1 and 2 — for now ):
- A client program constructs a map function that is based on the meta data for the GFS cluster that the data resides in determines what needs to be read and where
- The reduce function handles all the calculations for the query execution summarizing the records on the key the user requested ( this could be a complex calculation not just limited to the Default Aggregation allowed in BW )
- This map/reduce is sent out to a massive cluster of machines that in turn reads their chunks of data and then being to perform the reduce function
- The client program gets each one of the chunks performs a final reduce and passes back the exact solution set to the user query
This system would be highly fault tolerant, totally scalable and more then likely blow the doors off anything we have today. On top of that, the way the data can now be summarized is endless as it would be generated as a function to be called for each one of the smaller data sets and not restricted by database operations. There would be no need for aggregates as the granular data can be search so fast by the swarm of machines. In reality this is similar to what the BIA does, just on a much smaller scale, its using a handful of blades all linked into some proprietary piece of hardware, Google does it with dirt cheap off the shelf components and on a much larger scale.
This approach allows much greater flexibility in terms of analysis and at some point in the future will allow the user to just ask for what they want instead of having to navigate around this query thingy and that is is probably the Holy Grail of BI.
If you like this blog please “Digg” it here.