Hadoop, Its Ecosystem and Use Cases

vivekbhoj · ‎10-14-2013

Hi Everyone,

In my previous blog Big Data Facts and Its Importance, I shared what I learnt while exploring Big Data.

In this blog, I would like to share what I have learned while exploring Hadoop.

What Is Hadoop?

According to the official Apache Hadoop site, Hadoop is open-source software for reliable, scalable, distributed computing.

Hadoop is fault-tolerant software designed to run on a large number of machines, and it offers massively parallel processing using the Map Reduce programming model.

Read the article on Hadoop by James Turner, in which Mike Olson explains what Hadoop is, how it works and what it can do.

Read the document on Hadoop by Dhruba Borthakur.

Also check Yahoo's tutorial on Hadoop

Hadoop Ecosystem:

Hive is a data-warehousing framework built on Hadoop, originally developed by Facebook. It makes it possible to use Hadoop much like a read-only RDBMS, and we can write queries to retrieve data using an SQL-like language called HiveQL.

HBase is a non-relational, distributed, column-oriented database that allows low-latency lookups in Hadoop. It adds random read and write capabilities to Hadoop, allowing users to perform updates, inserts and deletes on individual rows. Facebook Messages is one of the apps that use HBase.

Pig is a platform, developed at Yahoo, that turns scripts written in the Pig Latin language into Map Reduce programs.

Mahout provides a library of data mining and machine learning algorithms and implements them using the Map Reduce model.

Sqoop is a tool designed for transferring large amounts of data between Hadoop and structured data stores such as an RDBMS.

Oozie is a workflow processing engine from Yahoo that lets users define a series of jobs written in multiple languages, such as Map Reduce and Pig, and then links them to one another.

ZooKeeper is a centralized service for maintaining configuration information, naming, distributed synchronization and group services, which Hadoop clusters rely on for coordination.

Flume is an event queue manager generally used to collect data, such as logs, and move it into Hadoop.

HCatalog is a centralized metadata management and sharing service that provides a unified view of all data in Hadoop clusters. It allows tools like Pig and Hive to process data without knowing the underlying file names or formats, or where the data physically resides in the cluster.

HDFS & Map Reduce are the two core components of Hadoop.

HDFS is a distributed file system that provides high-throughput access to data. It creates multiple replicas of each data block and distributes them across the computers in a cluster to enable reliable and rapid access. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It supports a traditional hierarchical file organization, like a native operating system: a user or an application can create directories and store files inside them, just as on Linux or Windows.
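To make the idea of block replication a little more concrete, here is a minimal sketch in plain Python. It is not HDFS code. The function names, the round-robin placement and the node names are my own assumptions for illustration; real HDFS uses a rack-aware placement policy, and the block size and replication factor are configurable.

```python
import math


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if file_size == 0:
        return []
    count = math.ceil(file_size / block_size)
    sizes = [block_size] * (count - 1)
    # The last block only holds whatever is left over.
    sizes.append(file_size - block_size * (count - 1))
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes.

    Hypothetical round-robin policy, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300, 128)  # sizes in MB, for readability
    print(blocks)
    print(place_replicas(len(blocks), ["node1", "node2", "node3", "node4"]))
```

Because each block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of every block, which is the intuition behind the fault tolerance described above.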

Also check the HDFS Architecture Guide

Read the article published at Hortonworks to learn the differences between HDFS and other storage technologies.

Map Reduce is a framework for performing analytics and processing jobs in parallel using the Map Reduce programming paradigm.

Map performs filtering and sorting: it splits the problem to be solved into multiple parallel tasks, each of which produces intermediate partial results.

Reduce performs a summary operation: it combines the intermediate results of the map phase to produce the final results.
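The two phases can be sketched with the classic word count example. This is a minimal, single-process illustration in plain Python, not the Hadoop API. The function names are my own, and the shuffle step stands in for the grouping by key that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    intermediate = []
    for line in lines:  # on a cluster, each line could go to a different mapper
        intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data moves to the cluster"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the reduce calls run in parallel per key; here everything runs in one loop just to show the data flow.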

Check the Simple MapReduce Example by IBM.

Also check Map Reduce Tutorial

SQL and No SQL Databases:

As you already know, we query data in relational databases using SQL. Databases that use SQL as their primary access language, such as Oracle, HANA and DB2, are called SQL databases.

But there are many non-relational databases that do not use SQL for querying data, such as HBase, which runs on top of Hadoop.

To get a list of all No SQL Databases, check this link: http://nosql-database.org/

To learn about the differences between HBase and an RDBMS, get the free chapter HBase versus RDBMS from the book Hadoop: The Definitive Guide at Inkling.

Importance of Hadoop:

Read the article by Wired on Why Hadoop is the future of Database

Read the article by eWeek on Why Hadoop is Important for Business

Read the article by IBM on Benefits of Hadoop

History of Hadoop:

Check the four great blogs below at Gigaom, which discuss the history of Hadoop and its first 10 years.

In their first blog, Hadoop creator Doug Cutting and others explain Hadoop's history.

In their second blog, they show an infographic of companies that are selling Hadoop products.

In their third blog, they discuss the future of Hadoop.

In their final blog, they highlight how Hadoop has grown over the years and include a list of videos and use cases.

Hadoop Use Cases:

Today many big companies use Hadoop for their day-to-day operations.

You can get a list of the companies using Hadoop here

To name a few: Facebook, Twitter, Yahoo.

Facebook:

Facebook runs the world’s largest Hadoop cluster. Just one of several Hadoop clusters operated by the company spans more than 4,000 machines and houses over 100 petabytes of data. As I mentioned earlier, Facebook Messages runs on top of HBase. Facebook also uses Hadoop and Hive to generate reports for third-party developers and advertisers who need to track the success of their applications or campaigns.

To know more about Facebook's use of Hadoop, check the links below:

http://hortonworks.com/big-data-insights/how-facebook-uses-hadoop-and-hive/

http://gigaom.com/2013/08/14/facebooks-trillion-edge-hadoop-based-graph-processing-engine/

http://www.wired.com/wiredenterprise/2013/02/facebook-data-team/

https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/hadoop-and-hive-at-fac...

http://gigaom.com/2012/06/13/how-facebook-keeps-100-petabytes-of-hadoop-data-online/

Yahoo:

Yahoo was Hadoop's first large-scale user; it started using Hadoop to speed up indexing of web crawl results for its search engine. It also uses Hadoop to block spam trying to get into its email servers. It stores 140 petabytes in Hadoop.

To learn more about how Yahoo uses Hadoop, check the links below:

http://developer.yahoo.com/blogs/hadoop/

http://www.informationweek.com/development/database/yahoo-and-hadoop-in-it-for-the-long-term/2400021...

Twitter:

Twitter uses Hadoop for product analysis, social graph analysis, generating indices for people search, natural language processing and many other applications. It uses HDFS to store its data and uses Pig for analysis.

You can get more details here:

http://blog.cloudera.com/blog/2010/09/twitter-analytics-lead-kevin-weil-and-a-presenter-at-hadoop-wo...

Learning Hadoop

If you are interested in learning Hadoop, you can install Hadoop on your own PC.

Hortonworks provides a Hadoop Sandbox that includes a Hadoop environment with lots of tutorials.

Check the document below for installing the Hadoop Sandbox using VMware virtualization:

http://hortonworks.com/products/hortonworks-sandbox/#install

To install this Sandbox, you can watch the YouTube video below:

Note: If you are going to install the Hadoop Sandbox on your laptop or PC, make sure that you have at least 4 GB of RAM. I would recommend 6 or 8 GB; otherwise it will slow down your system.

You can also buy the book Hadoop: The Definitive Guide by Tom White.

You can also attend Hadoop Training from Hortonworks at http://hortonworks.com/hadoop-training/register-for-hadoop-training/

Additionally, you can check the following free courses:

Free Course at http://bigdatauniversity.com/courses/

Access Hadoop Tutorials from Yahoo at http://developer.yahoo.com/hadoop/tutorial/

Learn more about Hadoop at http://www.mapr.com/academy/

Also read about Hadoop & HANA Integration

Thank you for reading my blog.