David Jonker

Big Data is Not About ‘Big’ Data

Only 50% of the term Big Data is correct – it’s about data, but it’s not primarily about ‘big’. Size or volume of data isn’t the ‘big’ issue. For that matter, it’s not primarily about variety either. While both of those factors play important roles, they matter only because of their impact on velocity.

Data storage never was the problem

Traditional relational databases have been able to store massive data sets for a long time. An Oracle 10g database can store over 8 Petabytes while for many years DB2 databases have been capable of storing well over 500 Petabytes. Of course, this is all theoretical. No customer has an Oracle or DB2 database that approaches sizes even close to that. Why? Because the speed, or velocity, at which data can be loaded and queries can be executed approaches zero well before then.

Similarly, all traditional relational databases can store any variety of data as text or binary large objects. The problem is that large volumes of unstructured data cannot be moved fast enough to enable rapid search and retrieval.

Hadoop and MapReduce enable organizations to distribute the search simultaneously across many machines, reducing the time it takes to find relevant nuggets of information in large volumes of data, and to do so in a scalable way. That’s why Hadoop is being adopted by bleeding-edge enterprises moving into the multi-petabyte club. Some environments already break the 100 Petabyte level and, in theory, can continue to scale.
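
To make the idea of distributing a search concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API. The function names (`map_search`, `reduce_matches`, `distributed_search`) and the sample records are invented for illustration, and worker threads stand in for the machines of a cluster. Each worker scans only its own partition of the data (the map step), and the partial results are then merged (the reduce step).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_search(partition, term):
    """Map step: scan one partition and keep the records containing the term."""
    return [record for record in partition if term in record]


def reduce_matches(left, right):
    """Reduce step: merge the partial results from two partitions."""
    return left + right


def distributed_search(partitions, term, workers=4):
    # Each worker thread is a stand-in for one machine in a cluster
    # (an assumption made for this sketch, not how Hadoop schedules work).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda p: map_search(p, term), partitions))
    # Combine the per-partition results into a single list of matches.
    return reduce(reduce_matches, partials, [])


# Hypothetical sample data, already split into three partitions.
partitions = [
    ["order shipped", "payment failed"],
    ["login ok", "payment received"],
    ["payment failed", "order cancelled"],
]

matches = distributed_search(partitions, "payment")
print(matches)
```

Because every partition is scanned independently, adding partitions and workers spreads the scan across more machines. The scan itself does not get any smarter, which is why the approach remains batch-oriented.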

Velocity is the real Big Data challenge

Don’t get me wrong. Data volume and variety matter, but only insofar as they affect velocity.

Hadoop can search massive data sets much faster than any traditional database, but its batch nature means results come back in a timeframe that is nowhere near real-time, which is proving a limiting factor. That’s why we are seeing many open-source projects attempting to build in-memory caches on top of Hadoop. But is this enough?

Enterprises are looking for a solution that minimizes the delay between when an event or transaction happens and when a decision is made. This requires more than an in-memory cache on a Hadoop cluster. It requires a radical rethink of enterprise architecture that dramatically accelerates the time from data to decision.

This is beyond the capabilities of a traditional relational database, analytic database, or Hadoop cluster. It requires a new approach: an in-memory data platform that transacts and analyzes simultaneously, that leverages Hadoop as a searchable data repository, and that moves from data to decisions in real-time.

It’s all about real-time data. Real-time is the new ‘Big’.

      Derek Klobucher

      That's a gutsy move, David: removing two of the sacred V's from Big Data.


      Seriously though, do you see the industry dropping "Big" from Big Data because it's understood or redundant?

      David Jonker
      Blog Post Author

      There's lots of discussion around the term 'Big' - it's not the best descriptor in many ways. I think 'Big' will either be dropped or modified. Some people are starting to talk about Fast Data, which at first blush appears to be a better descriptor of the challenge that enterprises are actually facing.

      Former Member

      I like fast data. It says it all really, doesn't it?