Hitchhiker's guide to Big Data

werner_daehn · ‎10-02-2017

Big Data marketing versus reality

When reading current publications, the IT world seems to have moved to Big Data. Forget about databases, with Big Data tools you can do everything - in no-time and on large volumes in addition. For every problem there is a Big Data tool: Hadoop, Spark, Hive, HBase, Kafka,...

The reality is slightly different. This post aims to explain why Big Data technology does complement databases - OLTP and OLAP alike. And that Big Data use-cases are orthogonal to Business Intelligence questions. In a follow up post, a future-proof architecture will be presented, meeting the business needs by combining various technologies in a way to get all advantages and none of the disadvantages.

The foundation of all the arguments will be the CAP theorem of distributed storage. It says that Consistency (reading the most recent data), Availability (redundant storage) and Partitioning (distributing the data) are requirements with different emphasis and a single system cannot be perfect in all three at all times. Example: If the network interconnect between two cluster nodes is down, either the query fails (missing redundancy) or there is the danger the data is not the most recent (redundant data is outdated while the interconnect is down).

Transactions versus Parallel Processing

It is important to notice above rule talks about distributed storage. But for distributed processing, variations of the CAP theorem apply as well. An obvious one: processing data in a single transaction but fully parallel with zero synchronization is a logical contradiction.

Hence, the first important difference between databases and Big Data tools is: database-like transactions are supported by databases only. For example, nobody would ever run an ERP system using Big Data technologies.

The argument can be turned upside down as well by saying that a database can be run on a cluster but it will never scale linear with the number of nodes. Twice the number of nodes does not provide twice the throughput (100% increase) but less, maybe a 50% increase? That is bound to the fact that transactions need synchronization between all nodes. As consequence, a database cluster will consist of few but powerful servers, whereas a Big Data cluster is based on a large number of small and cheap computers.

Above argument about transactions does not count if the Big Data environment is used to analyze data, but there are other reasons...

Joins versus Aggregations in the Big Data world

A computer cluster shall consist of 100 nodes and the data is evenly distributed. In such example, counting the number of rows can be distributed. Each node counts its own dataset and returns one value. Another process sums up the individual counts. Expressed in the SQL language:

select count(*) from table (partition1, partition2);
and
select sum(counts) from (select count(*) as counts from partition1 union all select count(*) as counts from partition2);
does return the same result.

This is an example of an almost perfect workload for a cluster. Each node works on its data, each node's resultset is tiny.

Another example shall be a join between two tables, a sales order table and the customer master data. Very likely the sales orders located on a single node will be from arbitrary customers, hence the required customer records need to be read from the other nodes prior to the join. In worst case, every single node needs to request the customer data from the 99 other nodes. In such case the entire customer data would be copied over the network 99(!) times (1/100th of the customer data moved to 99 nodes and that for each of the 100 nodes).

This join strategy is called broadcasting. It makes sense if the broadcasted table is small or better, if the entire table is available on all nodes from the get go.

The more common strategy to distribute the data is a reshuffle. Here both datasets are partitioned by the join condition and each node requests one partition. As end result, node1 would have all sales orders of customer1 and the masterdata of customer1. Now the data is all local and the join can be performed quickly. But overall the customer master data and the sales order data had to be sent across the network once to reshuffle it to the various computing nodes.

The only way to avoid reshuffling is when both datasets were partitioned by the join-condition initially, when creating and loading the table. This is a first indication that in the Big Data world the Data Model has to be purpose-built for one (type of) query.

Apache Cassandra is a good example where the data model serves one query only.

The different usage of Partitions

Above statement about purpose-built data models becomes obvious when joining the sales order data either to the customer master and/or the material master data. The data cannot be partitioned by customerID and materialID at the same time. Well, in theory via sub-partitioning it is possible technically, but by adding more and more potential columns to join with, the number of partitions grows exponentially. And the cluster will be busy reshuffling the data across the network instead of processing it.

To make things worse, partitioning is also used to speed up filter queries. If for example the data is partitioned by year and the query requests the 2017 data only, all partitions except the 2017-one can be skipped, thus drastically reducing the amount of data to be read. Now the data needs to be partitioned per join condition plus all possible large filter conditions, all at the same time! That is not possible.

The third reason to store data partitioned by yet another column is to support distributed processing. The year=2017 query finds the 2017 data immediately, it is a single partition. But now 100 worker nodes are available to process the data. So each of the worker nodes has to read the entire partition to finds its assigned subset of data. Would be much better if the data is partitioned by year and day as then only the files of 2017 are assigned to each worker node.

The consequence of all that is in many Big Data environments the same data will be stored multiple times so the user can pick the best-partitioned table for his query. Sometimes even like having three copies of the sales order table, each partitioned by something else. A Big Data system is definitely no general purpose OLAP system to connect a Business Intelligence tool with. Either it serves only specific queries or the query execution times are long.

That is a bit dangerous to say however, as it makes a couple of assumptions on the terminologies. If the understanding of Business Intelligence revolves around aggregation of purpose built data models matching the BI query, this is obviously possible. That would be the aggregation example of above. Another argument could be that Business Intelligence is more like reporting, meaning predefined reports are executed once a day and users just view them - will be possible most of the time as well. But Business Intelligence in its true meaning - self service, interactive, data discovery, data comparison - requires a generic data model and a high degree of flexibility for the user with response times of a second or less.

If somebody tells differently, try a query that requires multiple reshuffles. Often a three-way join like sales order with customer and material master is enough to prove this person wrong.

Big Data ≠ OLAP

When talking to customers, the term "Big Data" is often used as a synonym for "lots of data". But is that the case? If a company stores all sales orders in the ERP database, just because we are living in a Big Data world today, does that mean there are 1000 times more sales orders? No. Maybe old ERP data gets archived as a cost saving measure, but for that there are better ways like Data Aging (hot data in the in-memory engine, cold data in the disk engine of the database) - no argument for Big Data.

The term Big Data is better described by the statement to make vast amounts of operational data available - data that's not currently used. For example, sensors trigger an alarm but their readings are not stored permanently yet. Or weblogs from the company's web server are aggregated into access statistics and then the raw data is deleted. Or images and pdf files being archived and no longer available.

The promise of Big Data is to analyze this data to bolster business success. If that promise is met, more data will be stored permanently and more and more systems will produce such data by adding more sensors. There are the terms again, "analyze data" & "business" - isn't that Business Intelligence? If yes, isn't a typical Business Intelligence query an OLAP query? I would argue No in that instance.

To explain the thought, the company's webserver weblogs should be analyzed. The idea shall be, if more people look up one product page, the probability that its demand will go up is higher and hence the stock level should be increased a little. A simple OLAP query, isn't it? Maybe a query like select url, count(*) from weblog group by url;?

Such weblogs are very large files, simple structured and the lines look like this:

192.168.198.92 - - [22/Dec/2002:23:08:37 -0400] "GET / HTTP/1.1" 200 6394

192.168.198.92 - - [22/Dec/2002:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807

192.168.198.92 - - [22/Dec/2002:23:08:38 -0400] "GET /js/ajax.js HTTP/1.1" 200 8033

192.168.198.92 - - [22/Dec/2002:23:08:38 -0400] "GET /stylesheets/global.css HTTP/1.1" 200 3522

192.167.133.64 - - [22/Dec/2002:23:08:44 -0400] "GET /mysql/admin.php HTTP/1.1" 404 321

192.168.72.177 - - [22/Dec/2002:23:32:14 -0400] "GET /products/tennis/racket1.html HTTP/1.1" 200 3500

Storing the entire weblog in the Data Warehouse is impossible due to its size. On the other hand, the weblog contains a lot of lines nobody is interested in. The first line in above sample is the home page - not interesting for the use case. The second line is the request to show the company logo as an icon. Then there is access to java script libraries, other images, style sheet files.... none of which are of any interest. As a result only 10% of the lines in the weblog might even be product pages. Then there is web access from the various automatic crawlers like the Google search index crawler, the same from Microsoft Bing, access where somebody tries to see if an admin page exists to exploit vulnerabilities etc. The amount of potential interesting line goes down to 5%.

But even the data reduction is beside the point. When one page is opened by a single user ten times within a short time, is that the same as if ten users opened the page once? Of course not, there was a connection problem, he did reload the page multiple times or whatever. But such query same-user-same-page-within-60-seconds is no longer a simple OLAP query select count(*), url from weblog group by url; It also needs a bit of time windowing logic.

The weblog analysis could be made even more clever. Only those page impressions should count where the user did read the text actually. The Google Chrome browser (and others) have a feature to preload pages linked on the current page, under the assumption the user will likely follow some of those links. A pre-load operation should not count. Hence the web page can be designed to load content dynamically when scrolling down and that would be a clear indication the user did look at the page and based on the time between opening the page and scrolling down to the details, the time spent reading can be guessed. As a result the query would return the intensity(!) of the product page interest instead of a simple count.

These are all various arguments why Big Data queries and OLAP queries are two different things. Totally different things in fact. Big Data means vast amounts of raw(!) data and information needs to be derived from, prior using it for analysis. In many cases this data transformation is not even SQL-like but something else. Examples? Big Data is raw data of...

MRI scans in TIFF format: The transformation result shall be the location and the probability of a fracture as identified by the image recognition software. If the image recognition is certain but the doctor says other, there is something to look at. Doctor to check the marked area in the image, Data Scientist to optimize the algorithm.

Temperature sensor readings, one per second from 1000s of sensors: Does not make sense to store that if the temperature is 20°C always. It is more interesting to see when did the temperature change significantly and for how long. Like from 09:00 to 11:55 it was 20°C ±2, from 11:55 to 12:01 it was 30°C ±2.

...

That again is not something new. All examples are some kind of data aggregation, storing the results in a database and throwing away the raw data. So what is the story of Big Data then? It gets obvious in our weblog example, where somebody defined that multiple page hits within 60 seconds from the same browser should count as one. Why 60 seconds? Why not 10 minutes? What is the effect of changing the time window? For an average BI user these questions are irrelevant but experts on the subject - we call them Data Scientists nowadays - might be interested and could fine-tune the aggregation logic for all others. Or they find more use cases, like raising an alarm when a link does point do an non-existing page.

The new thing in Big Data is to keep the raw data, and hence allow thinking about usage of the data later instead of a use-case-first/single-use-case approach.

And the best approach doing that is to let Business Intelligence users play with the prepared data in a flexible to use database, while the Data Scientist has access to the raw data but knows what he is doing. This is the best combination from a user point of view and from a technical point of view.

Summary

Keep the raw data, it does not cost much and using Apache Spark Data Scientists can squeeze out valuable information.

Aggregate the raw data into information and provide that summary to the OLAP users by putting it into the SAP HANA database. There they can correlate the new data with other operational data using Business Intelligence tools.

Big data tools do not replace databases, not even remotely.

Adding computer processing power decreases the run times of only a very specific type of problem - algorithms that can work on an indefinitely small subset of the data and return a tiny resultset only. A car with 500hp has a high top speed, yes, but won't get you through city traffic faster.

Consider the Big Data target architecture which will be described in an upcoming post.