Let’s talk about the ‘Elephant’ in the room – The Origins
The birth of every hero has an interesting story behind it.
Be it a simple spider’s bite, exposure to gamma radiation, or being struck by lightning in a chemistry lab during a particle accelerator blast, there is always an “ORIGIN” story. Our hero is no different...
The hero of our story, HADOOP, also has an extremely fascinating story behind his birth.
Let me share this story with all of you, before things get serious.
In 1997, Douglas Cutting, better known as ‘the father of HADOOP’, started work on a new search indexer called “Lucene”. Before building Lucene, Doug had held positions at several search technology builders, including “Excite”, “Apple Inc” and “Xerox PARC”. Lucene, now an Apache Software Foundation project, is a full-text search library that builds inverted indexes over ordinary text to make searches faster. In layman’s terms, an inverted index is the kind of structure that lets a search engine return results within milliseconds of a query being fired.
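To make the idea of an inverted index a little more concrete, here is a minimal Python sketch. It is not how Lucene is implemented; the function names `build_inverted_index` and `search`, the sample documents, and the naive lowercase-and-split tokenization are all assumptions made purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "the yellow elephant",
    2: "the elephant in the room",
    3: "a yellow toy",
}
index = build_inverted_index(docs)
matches = search(index, "yellow elephant")
```

The point of the structure is that a query only touches the entries for its own terms, instead of scanning every document.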
After a rather quiet start, Lucene gathered momentum only towards the end of 2001. Once the Apache Lucene community started thriving, Doug and Mike Cafarella, a graduate student at the University of Washington, began their quest to make the entire Internet ‘searchable’ by indexing web pages. This resulted in a new Lucene subproject called Apache Nutch.
Nutch is a web crawler, i.e. a program which ‘crawls’ the web, going from page to page by following the links between them. Nutch then uses Lucene to index the contents of each page, making it ‘searchable’. Very soon, Nutch was able to index 100 web pages per second while deployed on a single machine.
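The ‘page to page’ traversal described above can be sketched in a few lines of Python. This is only an illustration of the general crawling idea, not Nutch’s actual code: the `crawl` function, the `get_links` callback and the tiny in-memory `fake_web` are hypothetical, and no real network requests are made.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl: follow links from page to page, visiting each page once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        # A real crawler would fetch the page here and hand it to the indexer.
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web": each page maps to its outgoing links.
fake_web = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/about", "c.example/"],
    "c.example/": [],
}
order = crawl("a.example/", lambda url: fake_web.get(url, []))
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other.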
As with most pilots and proofs of concept, the initial version of Mike and Doug’s project had no real focus on non-functional aspects like performance and scalability. Very soon it became evident that indexing the entire Internet on a single machine was practically impossible, so they increased the number of machines to four. This, however, brought in the complexity of managing communication and data exchange between the machines, along with other operational concerns.
It soon became clear that a distributed storage layer was needed: one that was open, scalable and durable, and that could handle the operational aspects automatically. Mike and Doug started work on such a system. Meanwhile, in 2003, Google published a white paper on the Google File System (GFS), which already answered many of the problems Doug and Mike were facing in trying to scale Nutch.
Seizing this opportunity, they quickly implemented a file system in Java based on the Google File System white paper and named it NDFS, the ‘Nutch Distributed File System’. It was designed to be a distributed, reliable file system that hides the operational complexity from the user and provides consistent, durable and fault-tolerant storage on cheap commodity hardware.
In 2006, NDFS moved out of the Nutch umbrella into a new Apache project, ‘Hadoop’ (initially a subproject of Lucene), consisting of Hadoop Common, HDFS and MapReduce.
HADOOP WAS BORN!!
Why ‘Hadoop’, did you say? There is an intriguing story behind this name too!
Doug’s son, then aged 2, was just learning to talk and had named his beloved yellow stuffed elephant “Hadoop”. Doug felt that it would make a good brand name for some future software project and had been saving it for the right time.
“The rules of names for software is that they’re meaningless because sometimes the use of a particular piece of software drifts, and if your name is too closely associated with that, it could end up being wrong over time,” Cutting says in an interview.
A good software name should be easy to remember, carry no overly specific associations, and be able to withstand changes in the direction the software takes. So, when Doug was searching for a name for his new Apache project, ‘Hadoop’ was a clear choice.
Incidentally, this was not the first time Doug had picked a name so close to home for his software. His search indexer, ‘Lucene’, gets its name from his wife’s middle name, which was also her grandmother’s first name.
Cutting says, “I like having the silly roots. You want to keep some fun and levity in things. It’s too easy to get too serious, especially when you’re talking about enterprise business software. It makes people who are working on it, think they’re having a little fun.”
THE YELLOW ELEPHANT RISES..
With Google growing by leaps and bounds as ‘THE’ search engine of choice, Yahoo was facing serious problems and was considering adopting Hadoop. In 2006, Doug joined Yahoo to help the team headed by Eric Baldeschwieler make the transition to Hadoop. The decision paid off for Yahoo: Hadoop helped spawn new ideas, resulting in fresh new products throughout the company.
Very quickly, the ecosystem started to blossom, with projects like ‘HBase’ and ‘ZooKeeper’ taking shape. Facebook contributed ‘Hive’, a SQL-based data warehouse on top of Hadoop. Yahoo itself came up with a project called ‘Pig’ to simplify working with MapReduce. We will look into all these components in fair detail in the upcoming blogs.
In 2008, ‘Cloudera’ was founded by Mike Olson of Berkeley DB fame, Christophe Bisciglia from Google, Jeff Hammerbacher from Facebook and Amr Awadallah from Yahoo!. In 2009, Doug Cutting left Yahoo to join Cloudera as chief architect, a position he holds to date.
With Hadoop professionals in high demand, Yahoo was not able to retain its star employees, and in 2011 Eric Baldeschwieler, along with 7 of his Product Management colleagues from Yahoo, founded a startup, ‘Hortonworks’, aimed at keeping Hadoop 100% open source.
Along with ‘MapR’, Cloudera and Hortonworks currently form the ‘big 3’ in the space of Hadoop distributions.
More on Hadoop distributions and the differences between them in the upcoming blogs.
Hope the origin and rise of our hero Hadoop was an interesting read.
See you next time with the ‘Hadoop distributed file system – HDFS’.