What Is Hadoop?

As pointed out in the earlier post What is MapReduce?, there are two very important things to know when discussing Big Data: MapReduce and Hadoop.

Hadoop is a term you will hear over and over again when discussing the processing of big data.

You might also have seen the yellow elephant logo that represents Hadoop. The project was named after a toy elephant that belonged to the son of Doug Cutting, Hadoop's co-creator.

In the other post, I broke MapReduce down into its most easily digestible form; here I do the same for Hadoop.

A little history… Hadoop was born out of a need to process big data as the amount of generated data continued to increase rapidly. As the Web grew, indexing its content became quite challenging, so Google developed MapReduce and described it in a paper published in 2004. Doug Cutting and Mike Cafarella then built an open-source implementation of the idea as part of the Nutch search engine project, and it became Hadoop. Yahoo! hired Cutting in 2006 and became one of Hadoop's biggest early backers. Hadoop is now an open-source project of the Apache Software Foundation.

Overall, Hadoop enables applications to work with huge amounts of data stored across many servers. Since data today is typically distributed rather than centralized, Hadoop does not pull it all into one place. Instead, it uses MapReduce to push the processing code out to the servers that hold the data, runs the analysis there, and combines the partial answers into the desired result.
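To make that idea a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; it only imitates the map, shuffle, and reduce steps on a single machine. The `servers` dictionary and every function name in it are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for lines of text spread across three servers.
servers = {
    "server-a": ["big data is big", "hadoop stores data"],
    "server-b": ["hadoop runs mapreduce"],
    "server-c": ["mapreduce processes big data"],
}


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in line.split()]


def run_map_on_server(lines):
    """Apply the map function to the lines held by one server."""
    pairs = []
    for line in lines:
        pairs.extend(map_words(line))
    return pairs


def shuffle(all_pairs):
    """Group the emitted values by key so each word is reduced once."""
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: combine all values for one key into a total."""
    return word, sum(counts)


def word_count(cluster):
    """Run map on every server's data, then shuffle and reduce."""
    mapped = []
    for lines in cluster.values():
        mapped.extend(run_map_on_server(lines))
    return dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())


result = word_count(servers)
print(result)
```

In a real cluster the map step would run on the machines that store each piece of data, and only the much smaller intermediate pairs would travel over the network. Here everything runs in one process, which is enough to show the shape of the computation.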

As for the more specific functions, Hadoop provides a large-scale file system (the Hadoop Distributed File System, or HDFS) and a framework for writing MapReduce programs. It manages the distribution of those programs across the cluster, collects their results, and generates a final result set.
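HDFS handles large files by cutting them into fixed-size blocks and keeping copies of each block on more than one machine. The sketch below imitates that idea in plain Python. It is a toy, not the real HDFS placement policy: the round-robin assignment, the tiny block size, and the node names are all assumptions chosen to keep the example readable.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Real HDFS also considers racks and free space; this toy version does not.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


# Hypothetical 10-byte "file", 4-byte blocks, four nodes, three copies each.
file_data = b"0123456789"
nodes = ["node-1", "node-2", "node-3", "node-4"]

blocks = split_into_blocks(file_data, block_size=4)
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, losing one machine does not lose the data, and a MapReduce job has more than one place where it can process each block.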

This, in a nutshell (or more accurately one succinct post), is what you need to know about Hadoop and how it works with big data.