Getting it all wrong with big data.
Much is being made in the industry right now of the burgeoning interest in Big Data, and of course the characteristics of modern databases, high-speed internet, and the general connectedness of systems are all facilitating this concept.
From my viewpoint, the buzz around the topic has been driven largely by vendors and analysts since early 2011. This is borne out in Google news trends, which report a steady stream of use of the term, but genuinely popular use across many spheres only from mid-2011, spiking when McKinsey published their insight in May 2011.
The Winshuttle big data timeline demonstrates that while the term is popular and commonly used today, the concept itself was in use long before 2011.
In the early 1990s, while working in the health sector, I had the opportunity to get some early exposure to the implications of big data; the organization I was working for was still considered a research institution.
At the time, a transition was in play: much of the epidemiological research being undertaken was happening in the big cities, where there were higher concentrations of people, greater diversity, and bigger data sets from more disparate sources.
In the locale where I was working there was a lower level of true research, and much of the interest was driven by the assignments of individuals pursuing research as part of an academic thesis. Research in the lesser economic centres was considered novel but not necessarily mainstream or significant, because we were using small data sets from fewer sources.
I mention this research because at the heart of big data is analytics: the notion that by taking various bits of data from disparate sources, you can arrive at a variety of conclusions. In fact the little chart above, which I pulled from Google trends, is nothing more than a big data trend line for, well, 'Big Data' as an item. At the time in these labs, we were using all kinds of data sources and trying to mash them together in a variety of system reports and databases to help arrive at conclusions. We were using chemistry, microbiology, histological and cytological results to draw conclusions, which meant working with both numbers and text.
In those days we referred to the process as data mining, but it has become more than mining: many of those mines are now exposed and accessible, and the work has shifted to the refinement stage, the analytics itself.
The tools we had in those days were very rudimentary, though I am sure the larger institutions had access to the more popular statistical tools, SAS in particular. We did a lot of hand stitching and correlation work.
One of the interesting items I came across most recently, though, was the notion that the advanced analytics methods and tools available today don't necessarily require a large data set, or a large number of them, and that large data sets aren't necessarily advantageous. We logically assume that more data means better analytics, but that assumption is proven wrong time and time again. Stuff gets lost in all the volume.
In fact a recent article by Christina Lioma, Associate Professor in the Department of Computer Science at the University of Copenhagen, bears out the suggestion that my graph above gets easily contaminated, even when I am doing my little string search on Google.
Lioma describes the concept of 'small data thinking' – a clear division between efficiency and effectiveness in the search for the proverbial 'needle in a haystack' – a metaphor that she leverages effectively to describe different challenges. Her critique is that different types of analytics require different approaches, and that the notion "more is better" isn't necessarily appropriate to your analytics mission.
This notion about size and data sets becomes even more interesting when you consider Simpson's paradox, first described in a technical paper in 1951. Sometimes the conclusions drawn from the large combined data set are exactly the opposite of the conclusions drawn from the smaller sets. Unfortunately, the conclusions from the large set are also usually the wrong ones.
Simpson's Paradox arises when a variable that has not been properly accounted for interacts with data from unequally sized groups that are combined into a single data set. The unequal group sizes, in the presence of this 'lurking variable', can skew the aggregated results and lead to seriously flawed conclusions.
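The flip is easy to see with a few lines of arithmetic. Below is a minimal sketch using the widely cited kidney-stone treatment figures often used to illustrate the paradox (these numbers are the textbook example, not data from the article); stone size is the lurking variable, and the treatment groups are unequal in size.

```python
# Simpson's paradox: treatment A wins in every subgroup, yet B wins
# once the subgroups are pooled, because B was mostly given the easy
# cases (small stones) and A the hard ones (large stones).
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},   # (successes, patients)
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within each subgroup, A has the higher success rate.
for name, t in groups.items():
    a, b = rate(*t["A"]), rate(*t["B"])
    print(f"{name}: A={a:.0%}  B={b:.0%}")

# Pool the unequal groups into one data set and the conclusion flips.
a_succ = sum(t["A"][0] for t in groups.values())
a_tot = sum(t["A"][1] for t in groups.values())
b_succ = sum(t["B"][0] for t in groups.values())
b_tot = sum(t["B"][1] for t in groups.values())
print(f"combined: A={rate(a_succ, a_tot):.0%}  B={rate(b_succ, b_tot):.0%}")
```

Run it and A beats B in both subgroups (93% vs 87%, and 73% vs 69%), while the combined data set says the opposite (78% vs 83%) – precisely the kind of flawed conclusion a naive merge of big data sources invites.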
An obvious way to prevent it is to avoid combining data sets of different sizes from diverse sources.
In a well-thought-out big data design, the data sets therefore need to be carefully evaluated for the risk of this skewing.
Recommended additional reads: