Water, water, every where,
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
— “The Rime of the Ancient Mariner”, Samuel Taylor Coleridge
How many times have you read the above quote turned into “data, data everywhere; nor any drop of info to be had” or something similar? It’s pretty incredible that, for all the progress made in both hardware and software, and their subsequent adoption by IT departments the world over, the common complaint from the business side of corporations is that they still lack actionable information. While I don’t want to make excuses for any IT organization in that predicament, consider the following quote:
“If every image made and every word written from the earliest stirring of civilization to the year 2003 were converted to digital information, the total would come to five exabytes. An exabyte is one quintillion bytes, or one billion gigabytes—or just think of it as the number one followed by 18 zeros. That’s a lot of digital data, but it’s nothing compared with what happened from 2003 through 2010: We created five exabytes of digital information every two days. Get ready for what’s coming: By next year, we’ll be producing five exabytes every 10 minutes. How much information is that? The total for 2010 of 912 exabytes is the equivalent of 18 times the amount of information contained in all the books ever written. The world is not just changing, and the change is not just accelerating; the rate of the acceleration of change is itself accelerating”
– Michael Shermer, WSJ, 2/22/2012, book review of “Abundance: The Future Is Better than You Think”
It is unmistakable that widespread technology adoption and the digitization, and subsequent automation, of our world are driving that acceleration of change and generating data at that scale. The Era of Big Data is here, and it is here to stay. It is no longer just marketers hyping the need to buy new solutions to deal with the challenges; it is a reality. A stock trading firm I consulted at generates almost 1 TB of raw data PER DAY with 20% growth year over year; comScore runs a 147 TB data warehouse and loads 150 GB of data per day; Sybase 365 processes more than 1.8 billion messages per day, which roughly translates to about 1.5 TB of data, and that’s only a small fraction of global messaging traffic. Publications from the Harvard Business Review to McKinsey & Company to PwC have written tomes about the opportunities and challenges posed by all things Big Data. To what extent your particular organization needs to deal with its implications is an exercise in individual introspection, isn’t it?
While the challenges posed by Big Data do not rest solely with IT, as IT professionals it is our job to enable the enterprises we work in to harness the opportunities associated with Big Data. The challenges are great. There’s no doubt about that.
Since I work in SAP’s Database & Technology Services (DTS) organization, in this blog series I want to cover the task of data modeling for data big and small, plus the ecosystem needed to turn it into an operating model. But data in and of itself is boring, and it isn’t an island either, so I also want to touch on the architectural aspects necessary to support the collection, manipulation, distribution, analysis, and use of all that data. It’s a tall order, and I don’t purport to know everything there is to know about the subject. I like to think of this blog series as a journey. In the parts of this journey where I have already traveled, I would like to share some of my experiences. In the parts I have yet to travel, I invite you to either share your experiences or join me as a companion so we can explore topics together for a richer and more varied experience.
Before we start discussing models, though, let’s take a step back and think about some macro-level trends in what’s going on around us. These are some of the things that come to my mind:
- The volume of data has gotten too large for us to move around and process in the manner we would like, within a reasonable timeframe dictated by the needs of the business. We now talk about tera-, peta-, and exabyte scales as the new norm
- The unit of computation is moving to where the data resides (Hadoop/MapReduce, in-memory computing) because of the problem posed by the first item, instead of retrieving the data to the point of computation
- The sources of data that must be collected, analyzed, and synthesized have grown very large, very fast, and many of them are external to our companies, e.g., Facebook, Google, and Twitter
- The time in which business demands actionable information has shrunk (again). This demand seems to follow Moore’s Law too
- Scale-out is being achieved or is desirable to achieve using commodity hardware
- Distributed/Parallel architectures of computation are the norm
- Unstructured data must now be mined and synthesized along with structured data. The rise of the NoSQL movement manifests itself in products like MongoDB, HBase, CouchDB, and Cassandra, just to name a few (as I write this, nosql-database.org lists 150 of them!)
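To make the second trend above concrete: the map/shuffle/reduce pattern behind Hadoop can be sketched in a few lines of plain Python. This toy merely counts words on one machine, but the same shape is what lets a framework ship the map and reduce steps out to the nodes where the data already lives:

```python
from collections import defaultdict

def map_phase(record):
    # Emit a (word, 1) pair for each word in one input record
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for each word
    return (key, sum(values))

records = ["Water water every where", "Nor any drop to drink"]
mapped = [pair for rec in records for pair in map_phase(rec)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["water"])  # → 2
```

In a real cluster, `records` would be blocks of a distributed file and each `map_phase`/`reduce_phase` call would run on the node holding its block, so only the small intermediate pairs travel over the network instead of the raw data.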
Given the above as a backdrop, what must enterprise/data architects do to contend with these challenges? What are the ingredients for enabling a successful platform? What tools are available today, and what tools need to be built, to help us with these challenges? These are the questions I want to explore from a practitioner’s point of view. Analyst reports or whitepapers from marketing are fine for a general framing of the topic, but they are of limited use for practitioners like you and me. At the end of the day, I’m a foot soldier, a tinkerer at heart, and get paid to deliver working systems. I also learn best by getting my hands dirty.
As a start, I don’t want to take anything for granted about who the reader is nor what level of “technophile” you are so please allow me to start at the beginning.
What is a data model and why is it necessary?
Like all endeavors of human thought, a model is a means to an end, not an end in itself. To me, a model
- Helps us define the scope/boundary of a project. After all, projects are supposed to be finite. You will always have follow-on projects but they will all have some finite scope.
- Reveals the available (and not available or missing) set of data based on goals. What is available internally and what items can be collected from external sources?
I worked on a telephony project once where I discussed with the product manager opportunities for delivering new value to customers (other telephone companies) related to text messaging. Halfway into the meeting, I realized that the product manager had no idea that location information accompanied every text message going across his network, because he had never looked at a Call Data Record (CDR) before. Similarly, before the advent of Java, I worked in marketing for a small company that developed tools for C++ software developers. One of the main value propositions of the product set was a promise of platform portability, i.e., use our tools and you’ll be able to port to other platforms easily. The key decision point for the engineering department was: what should be the sequence of platform certifications, given limited resources? Well, the marketing department had been collecting platform information from users for over three years but never bothered to share that info with engineering, and engineering didn’t know to ask marketing! It was a guessing game as to which platform should be certified first until I joined the company and was able to make the connection.
- Defines a common lexicon for discussing business requirements, processes, architectures, and more. Think about models such as the double helix, a home you want to build, an office building, a new airplane. They all allow us to visualize and discuss the various facets of the model in common terms. It’s also cheaper to fix things in a model than in something running in production.
- Serves as a roadmap for what Ralph Kimball calls the “Enterprise Data Warehouse Bus Matrix,” which represents the organization’s core business processes and their associated dimensionality, some of which you will build in stages (think projects). Suppose your fully-envisioned model encompasses 30 dimensions, but your immediate business priority and project scope can only accommodate the top 5 dimensions, such as customer, product, channel, line-of-business, and time. The bus matrix is a tool/technique that allows you to keep track of what’s going to be implemented and when.
- Serves as a base data structure of sorts which will provide some benefits like ease-of-use, storage & retrieval efficiency, performance, maintainability, and more. After all, data, structured or not, has to be brought into memory at some point to process it.
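To illustrate the bus matrix idea mentioned above, here is a minimal sketch in Python. The matrix is just a grid of business processes versus conformed dimensions; the process and dimension names below are made up for the example, not taken from any real project:

```python
# Rows are business processes, columns are conformed dimensions;
# an "X" means the process uses that dimension. Names are illustrative.
dimensions = ["Customer", "Product", "Channel", "Line of Business", "Time"]
bus_matrix = {
    "Order Entry": {"Customer", "Product", "Channel", "Time"},
    "Shipments":   {"Customer", "Product", "Time"},
    "Billing":     {"Customer", "Product", "Line of Business", "Time"},
}

def print_matrix(matrix, dims):
    # Render the grid: one row per process, one column per dimension
    print(f"{'Process':<14}" + "".join(f"{d[:8]:>10}" for d in dims))
    for process, used in matrix.items():
        print(f"{process:<14}" + "".join(
            f"{'X' if d in used else '':>10}" for d in dims))

print_matrix(bus_matrix, dimensions)
```

Dimensions shared across every row (here Customer, Product, and Time) are the conformed dimensions you would build first, since each project that implements a row can then reuse them.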
I hope that this intro has whetted your appetite for models, and I hope you will join me on my exploration of this subject.
P.S. Got your own “adventures in modeling” stories? It’d be great to hear the good, bad, funny, horrific, and all other types of stories. Join in!
For other database related topics, be sure to check out the Database Services Content Library.
Learn more about SAP Database Services at http://www.sap.com/dbservices.