Some time back, I started a bit of a blogging journey with this post. As I could have predicted at the time, it has taken me much longer to find space in my schedule to carry on with the initiative and get another part published but here we go…
This is at once a difficult, complex question and a very straightforward one. I’m inclined to suggest a better question would be something like “Why is Big Data suddenly such a Big Thing?” Of course, as many readers of this post will be all too familiar, our industry is very good at creating “The Next Big Thing®” that just happens to help sell the latest version of some platform, solution or system…
Etymology of Big Data
Ok, I admit – I only used this as a sub-title to get a nice big word like etymology into one of my blog posts… Seriously, for those who don’t know what this is, please read this Wikipedia entry.
I wanted to try and get an understanding of where and when the term Big Data first crashed into our world – I have a personal recollection but wanted a wider view. A quick Google (just what did we do before Google?) yields a vast number of thoughts along these lines. One of my favourites was this post, partly for the main content and (as usual with the internet) partly for the comments. Interestingly, the author’s rough stab at Big Data hitting the mainstream around 2012 isn’t too far from my own thinking (2011 is in my head for some reason). I recall it reaching my consciousness around that time too, mainly thanks to SAP’s HANA announcements and the increasing momentum the appliance was gaining. It is worth noting that we are talking about our current understanding and use of the term Big Data here, and recognising that it had been used in a few other ways prior to this point.
The important point here, though, is that we are talking about the history of the term Big Data, not that of the data itself. Or put another way – we had been generating and collecting vast quantities of data for many years before every man and his dog rushed to get Big Data or Data Scientist onto their CVs… What changed? Why are we suddenly using the term “Big Data” so much, and in so many (often vague) ways?
How do you define Big?
Let’s break the term down and think just about the first word for a bit. I think the word “Big” can be the most misleading aspect of this whole subject. Having said that, I’m not sure I can think of a suitable alternative. We often hear that size isn’t everything, and I believe this applies to Big Data more than many would have you believe. As usual, it comes down to perspective and how you want to measure and compare. As someone quite famous once said, it is all relative. We live in an age where data is generated through so many channels, at such an alarming rate, that we probably don’t know what is happening with it all. Conversely, we also don’t know what other data we could or should be generating and therefore capturing. Once we’ve generated and captured all of this data, what are we going to do with it? What happens if we haven’t captured any data that we can actually use? Ultimately, one of the intended benefits of our current obsession with Big Data and Data Scientists is how they enable us to focus on a specific, highly tailored subset of the overall picture – our slice of the pie, as it were. If we don’t have the ingredients for our pie, we will never get a slice of it…
I spotted an interesting exchange on Twitter not too long ago where Ethan Jewett was (I think!) trying to make a point about how we capture data. I didn’t manage to track the whole exchange (some observers might have suggested Ethan was having a drunken conversation with himself!) however I did take away some sense of agreement with this tweet. It really piqued my interest in the whole Big Data thing (as well as helping me finally make a bit more effort to complete this post).
All of this got me thinking, and it reminded me of an aspect of Quantum Mechanics – the Uncertainty Principle – that I thought was quite apt for our current Big Data world and especially Ethan’s comments. I was idly wondering how we cannot measure or capture all data and how, in fact, choosing to measure one aspect of a system could lead us to miss other, important measurements that we actually need and would find useful. I’m officially naming this “Jewett’s Data Uncertainty Principle” 🙂
My Slice of the Pie
The challenge for all of our ‘new’ data scientists is how they take all of the data and information available at their fingertips and turn it into something useful. Just how do we capitalise on the sheer volume of information combined with the processing power at our disposal? At a UKISUG Conference a year or two ago, a colleague was speaking to a senior customer representative who asked what HANA could do for them – the answer was “what do you want it to do for you?” I have seen lots of Twitter traffic in recent weeks in a similar vein, where SAP users are struggling to understand what the actual use-cases for HANA might be. That suggests they don’t understand what Big Data is and, more importantly, what it can offer.
This is one of the key challenges with the current state of Big Data, IMHO. We’ve reached a brave new world where almost anyone can access almost endless amounts of data; they can generate almost endless amounts of data; and then anyone can consume and mash all of this data up into all sorts of random results. What is the point and where is the value in all of this data? How do enterprises get value out of this data wrangling? Are we creating roles for data scientists that are somewhat self-serving?
As a rather pointless example, I discovered LinkedIn InMaps recently and duly generated my network map… Wow, doesn’t it look impressive with all of my connections there on one screen?
The problem is though, what does it do? What’s the point? What value does it create or add? This is effectively my slice of the much larger LinkedIn data pie but it doesn’t really serve much purpose. To make it useful, it needs something else added, some extra context. As soon as you start talking about context in relation to data and information, things start getting interesting fast…
It’s all about Context
I’m pretty sure Vishal Sikka said something along these lines some time last year. No doubt I have it as a favourite tweet, an SCN bookmark, or saved to Pocket… Ok, I know I’ve got it hidden somewhere anyway. The point is, one element of data on its own is often nearly meaningless, but add another element, another dimension, and suddenly it becomes valuable and useful.
As a real-world example, here in the UK our motorway network has overhead gantry signs that display useful information. Often, on a journey you will see a message such as “To junction 18 – 22 minutes”, the idea being that you can then gauge roughly how well the traffic is moving. However, there is a problem with this: you are only getting one dimension or measurement. It’s like a scalar value – it means something but isn’t easy to interpret in isolation. Now, some of the overhead signs have more space, and instead you get “To junction 18 – 25 miles, 22 minutes”. This extra dimension, which turns our data into a vector-type value, suddenly enables a better interpretation of the information presented. In your head you can do a rough calculation to determine whether the traffic is running at or below the speed limit (70mph in the UK – though I base my calculations on 60mph, which is a mile per minute). Now that is useful!
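The rough mental arithmetic behind the gantry-sign example can be sketched in a few lines of code. The figures come straight from the example above; the 60mph congestion threshold is my own rule of thumb, not anything the signs themselves encode.

```python
# The gantry-sign example: a journey time alone is a scalar; adding
# distance turns it into something we can actually interpret.

def average_speed_mph(distance_miles: float, time_minutes: float) -> float:
    """Average speed implied by a distance/time pair."""
    return distance_miles / (time_minutes / 60.0)

SPEED_LIMIT_MPH = 70  # UK motorway limit

# "To junction 18 – 25 miles, 22 minutes"
speed = average_speed_mph(25, 22)
print(f"Implied average speed: {speed:.0f} mph")  # roughly 68 mph
print("Traffic moving freely" if speed >= 60 else "Expect congestion")
```

With only “22 minutes” on the sign there is nothing to calculate; with distance added, the comparison against the limit falls out immediately.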
The above example is a clear showcase of how bringing a constant source of data (the fixed distance between sign and junction) together with a dynamic source of data (current motorway speed) delivers a compound piece of information that is useful to someone. Let’s extrapolate this example a bit, though, into what might happen in future… What if the sat-nav systems in our cars could tap into this real-time data and make calculations and decisions accordingly? Would we see journey times estimated much more accurately? If we added in another dimension, such as weather or local events, which we know will impact traffic, then we suddenly have a multi-dimensional source to base decisions on. We are already seeing this sort of technology appearing – I should be taking delivery of a new Audi A6 in a couple of weeks. Nothing out of the ordinary, but it has an 8-speed automatic gearbox and online integration with Google Maps – a combination that allows the car to look ahead and determine whether it is worth changing gear. So, if you are approaching a T-junction in 4th gear, it won’t bother changing up to 5th, as it knows you will be slowing again soon and it is therefore more economical to hold the current gear ratio. It might not make a massive difference on its own, but consider if every single car on the roads was able to do similar things, and more, using multi-dimensional decisions?
This use of multiple sources of data, often from completely unrelated areas, is how I see the Big Data movement developing, and no doubt how those who have always been close to it have always understood it. It requires a bit of a stretch in how you understand the word Big, though, as you don’t necessarily end up with vast volumes of data but perhaps instead vast numbers of small, finite sources of information.
SAP users need to re-think how they approach their use of Big Data and, indeed, HANA. If it is deployed simply to speed up BI, they have missed the point. Whilst having your dunning run complete in minutes rather than days is great, where is the value add? I’m not aware of anyone in the SAP world who sits staring at their SAP system waiting for a dunning run to complete… However, I suspect that if a financial controller could begin to predict outcomes and take proactive, mitigating decisions early in the dunning process, based on multiple sources of information about customers, some people would start getting excited.
Finally we get to the end, and no doubt you wonder what I think Big Data is? Well, I don’t imagine it would generate as much interest or excitement if it were called “Multi-Source, Multi-Dimensional, Intelligent, Decision-Making Data”, would it? 😉