Thoughts on Big Data
I thought that I would take this opportunity to talk about some of the hype and misconceptions surrounding “Big Data”. Everywhere you turn, you see articles, analysts, and vendors all talking about “Big Data” and their solutions to the “Big Data” problem. It seems as if we have suddenly been blindsided by a new and potentially unsolvable problem.
What exactly is “Big Data”? Ask 10 people, and you will get 10 different answers. Some define it as huge amounts of data. Others define it as unstructured data. Still others simply shout: Hadoop. That fount of all wisdom, Wikipedia, defines “Big Data” as a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. So, what is it? Why the sudden emphasis on “Big Data”? I guess the answer depends on the context. Let’s break it down in the following paragraphs.
Is “Big Data” a synonym for huge amounts of data? The answer is: Yes, it can be. No one will deny that the amount of data captured and used by organizations is exploding. Some is by regulation, some is by choice. Whatever the reason, data volumes seem to be growing exponentially. Thinking back over my career with Sybase and SAP, I remember that in the mid-1990s, a 150MB (yes, that’s megabytes) OLTP database was considered “Big”. Shortly thereafter, a 1GB database was “Big”. A couple of years later, I remember user databases breaking 100GB, and the awe that struck in most people. From the late 2000s on, we rapidly progressed to 1TB and beyond. Now we are in the days of the petabyte.
Fortunately, as data volumes grew, technological capabilities (both hardware and software) increased to allow us to handle and process the data. On the hardware front, memory sizes increased, CPU core counts increased, and storage capabilities increased. On the software front, parallelism became common, optimizers became better, and storage compression became better. At no point that I can recall did data volumes become an unsolvable issue. Some technique or component always arose to deal with the issue.
If data volumes aren’t a “Big Data” issue, then perhaps the use of “Unstructured Data” has become a problem? Is this something new? Did we never process unstructured data before? The short answer is that we have always captured, stored, and processed unstructured data. In databases, remember the “Varchar”, “Long Varchar”, and “Text” data types? Those have been with us almost since the beginning. And if you think back to what database vendors have offered as options, such as “Text Processing” and “Text Search”, you’ll remember that we have had these capabilities to process unstructured data in relational database management systems almost since day one.
We’ve also always had the ability to process unstructured data programmatically, outside of databases. I remember working on a project in the 1980s to scan, read, and interpret newspaper articles. This was done programmatically (in “C” and “LISP”). Sound familiar? Programmatically acquiring, storing, and processing unstructured data? Hadoop, anyone?
What about Hadoop? It’s hard to initiate or participate in a discussion of “Big Data” without the conversation turning to Hadoop. Is Hadoop Big Data? Is Hadoop the answer to Big Data? Hadoop is implemented as a distributed storage system combined with a programmatic/computational paradigm (MapReduce). Hadoop is touted for its ability to store, distribute, and process vast amounts of data on a system composed of cheap, commodity compute nodes. Is Hadoop the panacea for Big Data problems? I submit to you that the answer is no. It can be part of the solution, but, contrary to what the various vendors are shouting, Hadoop is not the end-all solution for Big Data. More on Hadoop as the savior of all things data in an upcoming issue.
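For readers who have not seen the MapReduce paradigm mentioned above, here is a minimal single-process sketch of the idea in Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. A real cluster would run the map and reduce steps in parallel across many nodes, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two tiny "articles" standing in for a large corpus of unstructured text.
    articles = ["big data is big", "data is data"]
    print(word_count(articles))
```

The point of the sketch is that the paradigm itself is simple: the programmer supplies a map function and a reduce function, and the framework handles the grouping and distribution.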
Let’s get back to my original question. Is Big Data new, and more importantly, are we unable to solve the Big Data problem? My answer to the first part is: No, Big Data is not new. We have always had an ever increasing volume (and velocity) of data. Names have changed somewhat in that we have gone from “Data Warehouse” to VLDB to Big Data. However, I believe that the problem is the same. As for unsolvable, technology has always risen to the occasion. Indeed, the first relational database management systems were designed and implemented to provide for storage and straightforward, standardized access to data. Throughout the years, both hardware and software innovations were devised to cope with the ever increasing volume (and velocity) of data.
One of those innovations was the development of a practical, usable columnar store database management system. Sybase IQ (now SAP Sybase IQ) was developed in the early 1990s as a response to the ever increasing size of data, and the ever increasing computational requirements. Right from the start, IQ demonstrated the ability to ingest, store, compress, and most importantly query huge amounts of data. Features that enable IQ to be the very best Big Data platform include:
- Volume: Ability to efficiently store and handle immense quantities of data – Compression via Columnar Store
- Velocity: High-Speed/Real Time Data Ingestion
- Variety: Ability to ingest, store, and process both structured AND unstructured data
- Ability to handle large and complex queries/processes – IQ is a Massively Parallel Processing (MPP) implementation
- Scalability: IQ has the ability to efficiently and effectively scale to almost unlimited numbers of concurrent users
- Low TCO – Implementation of all the above features on inexpensive commodity hardware.
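To give a feel for why a columnar layout lends itself to compression (the Volume point above), here is a small Python sketch. It is only an illustration under stated assumptions: the sample rows are made up, the helper names (to_columns, rle_encode, rle_decode) are hypothetical, and run-length encoding is used as one simple, well-known technique. It is not a description of how IQ stores or compresses data internally.

```python
from itertools import groupby

# Made-up sample rows: (sale_date, country, amount).
ROWS = [
    ("2013-01-01", "US", 100),
    ("2013-01-01", "US", 250),
    ("2013-01-01", "US", 75),
    ("2013-01-02", "DE", 300),
    ("2013-01-02", "DE", 120),
]


def to_columns(rows):
    """Pivot row-oriented tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(encoded):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in encoded for _ in range(count)]


if __name__ == "__main__":
    dates, countries, amounts = to_columns(ROWS)
    # Values within one column tend to repeat, so runs collapse well.
    print(rle_encode(dates))
    print(rle_encode(countries))
    # A column with few repeats gains little from this particular encoding.
    print(rle_encode(amounts))
```

Storing each column together puts similar values next to each other, which is what makes simple encodings like this effective. In a row-oriented layout the same repeated values are interleaved with unrelated fields and are harder to compress.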
Over the next 20 years, IQ grew to be the market leader with twice as many installs as all the other “Data Warehouse/VLDB” specialized systems. As for “Big Data”, in addition to the thousands of happy and successful IQ customers, IQ also holds several records in the Guinness Book of World Records: Most Data Ever Loaded in a Database and World Record Load Rate (34.3 TB/Hour).
Is SAP Sybase IQ still relevant, and more importantly, is IQ a valid solution for handling Big Data? The answer is a resounding YES. If you are looking to solve a Big Data problem, then SAP Sybase IQ is perhaps the best data management tool you can acquire. IQ is the most mature, proven, and widely used Big Data management solution available. In an upcoming edition, we will dive deeply into the IQ technology to demonstrate why IQ is such a good solution.