Let’s talk about the ‘Elephant’ in the room – The prequel
Every time I hear the opening of my favorite TV series, Jonathan Nolan’s ‘Person of Interest’, and Michael Emerson says “You are being watched...”, it gets me thinking: “Could this be true?”, “What if someone IS watching us?”
Last month I was planning to buy a new camera and did my ‘mandatory due diligence’ online, hunting for the best model at a reasonable price. After a couple of hours of detailed analysis, I had zeroed in on a model. A couple of days later, browsing a famous developer forum website, what do I find? Staring at me was an advertisement from a famous online store promising me my camera model at the ‘lowest price possible’.
I am INDEED being watched…!!! 😯
If you take a closer and, of course, calmer look at this scenario, you realize that this is possible largely because most of us stay logged in to our Google accounts in the browser. Together with browser cookies, this makes it very easy for the websites we visit to track our searches, clicks and wish-lists and come up with targeted ads for us.
An example is ‘Google AdSense’, an online advertising program, which Google offers to any interested websites to generate revenue on a per-click basis.
But is it really possible to store every click, every search and the entire browsing history of every internet user??
Consider the sheer volume of the data which would be produced!!
Also, at a given point in time, millions of people would be on the internet and we would need to store all their clicks as they happen.
Consider the velocity with which this data would hit the data store!!
It is evident that this new kind of data benefits businesses by helping them target buyers and, in turn, increase sales.
But it is also evident that traditional data management techniques and systems are unable to manage it.
This led to the advent of the concept of ‘Big Data’.
‘Big Data’ is a broad term for data sets so large or complex that traditional data processing applications are inadequate to handle them.
A much more formal definition would be:
“Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value”
Just to cite a few examples of the enormity of Big Data:
Facebook has to deal with 0.5 petabytes (500,000 gigabytes) of updates, including 40 million photos, on a daily basis!
YouTube receives enough new video every single day to keep you watching continuously for a whole year!
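To get a feel for the velocity hiding behind those daily totals, here is a small back-of-the-envelope sketch in Python. It only uses the two figures quoted above; the decimal unit convention (1 petabyte = 10^15 bytes), the assumption of a perfectly even load across the day, and all the names in the code are my own choices for illustration.

```python
# Figures quoted in the post; everything else is derived from them.
FACEBOOK_BYTES_PER_DAY = 0.5 * 10**15    # 0.5 PB, assuming decimal units
YOUTUBE_VIDEO_HOURS_PER_DAY = 365 * 24   # one year of footage arriving per day

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def bytes_per_second(bytes_per_day: float) -> float:
    """Average sustained ingest rate implied by a daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


def video_hours_per_minute(video_hours_per_day: float) -> float:
    """Average hours of footage arriving every minute."""
    return video_hours_per_day / MINUTES_PER_DAY


# Assumes the load is spread evenly over the day, which real traffic is not.
facebook_gb_per_s = bytes_per_second(FACEBOOK_BYTES_PER_DAY) / 10**9
youtube_hours_per_min = video_hours_per_minute(YOUTUBE_VIDEO_HOURS_PER_DAY)

print(f"Facebook: ~{facebook_gb_per_s:.2f} GB of updates per second")
print(f"YouTube:  ~{youtube_hours_per_min:.2f} hours of video per minute")
```

Even as a rough average, that works out to several gigabytes arriving every second and several hours of video arriving every minute, around the clock.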
With smarter devices being introduced, which have the ability to connect and communicate through the internet (Internet of Things), the amount of data generated will only grow exponentially in the days to come.
Big Data has long been ignored in data analysis due to the limitations of traditional data management systems.
Of late, organizations are realizing that data external to an organization is becoming just as strategic as internal data.
New types of data, like sentiments from social media, sensor data from smart machines and clickstreams from websites, are now viewed as being just as important to the business as traditional data.
It is evident that correlating data from multiple sources yields much richer analysis results, and businesses that can use big data to generate more detailed results with a higher degree of accuracy will have a competitive advantage.
There is a pressing need for a data management system which can store and process all kinds of data, while being scalable and inexpensive.
Another term which has become almost synonymous with Big Data is ‘Hadoop’. It is the most popular and widely used Big Data solution in the market currently.
Hadoop is a Java framework which allows for distributed storage and processing of large data sets.
It is optimized for handling massive amounts of structured, semi-structured and unstructured data in parallel, using commodity hardware.
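The "split the data, work on the pieces in parallel, then combine" idea can be sketched in a few lines of plain Python. To be clear, this is not Hadoop and does not use any Hadoop API: the threads merely stand in for the machines of a cluster, and the clickstream records and function names are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical clickstream records: (user, product viewed).
CLICKS = [
    ("alice", "camera"), ("bob", "laptop"), ("carol", "camera"),
    ("dave", "phone"), ("erin", "camera"), ("frank", "laptop"),
]


def split(records, parts):
    """Cut the data set into roughly equal blocks, one per worker."""
    return [records[i::parts] for i in range(parts)]


def map_block(block):
    """'Map' step: each worker counts product views in its own block only."""
    return Counter(product for _user, product in block)


def reduce_counts(partials):
    """'Reduce' step: merge the per-block counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_views(records, workers=3):
    """Process the blocks in parallel, then combine the partial counts."""
    blocks = split(records, workers)
    # Threads stand in for cluster nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_counts(partials)


views = count_views(CLICKS)
print(views.most_common())
```

Because each block is counted independently, adding more workers (or, in a real cluster, more machines) lets the same logic cope with more data, which is the intuition behind the word ‘scalable’.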
The popularity of Hadoop stems from the fact that it is an open source project of the Apache Software Foundation and is massively scalable, reliable and fault-tolerant. Another important point to note is that the Hadoop ecosystem is constantly growing and adding new capabilities to Hadoop, which makes it all the more powerful and capable of supporting new business use cases.
In this series of blogs – “Let’s talk about the ‘Elephant’ in the room” – I will try to share my knowledge of Hadoop, its components and its ecosystem, which can help you prepare for the Big Data revolution.
Big Data is here to stay. Be prepared, and stay ahead of the game!
Coming up next in the series: Hadoop Overview and the Hadoop Distributed File System (HDFS).