Hadoop Ecosystem – Welcome to the Hadoop Zoo
This blog is part of the series My Learning Journey for Hadoop. In this blog I will focus on the Hadoop Ecosystem and its different components. If you have reached this blog directly, I recommend reading my previous blog first: Introduction to Hadoop in simple words.
What is Hadoop Ecosystem?
So far, we have only talked about the core components of Hadoop: HDFS and MapReduce.
These core components are good at storing and processing data. Later, the Apache Software Foundation (the corporation behind Hadoop) added many new components to enhance Hadoop's functionality. These new components make up the Hadoop Ecosystem and make Hadoop very powerful.
The image below shows the different components of the Hadoop Ecosystem.
Categorization of Hadoop Components
Each of these components has a different purpose and role to play in the Hadoop Ecosystem. The image below shows the components categorized by role.
Data Storage Layer
HDFS (Hadoop Distributed File System)
HDFS is a distributed file system that stores data on multiple machines in the cluster. It is suitable for storing huge files.
Note that HDFS does not store data in tabular form. It splits each file into chunks (blocks) and distributes those blocks across different machines.
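To make the chunking idea concrete, here is a small sketch in plain Python. It is not HDFS code: the tiny block size, the node names and the round-robin placement are all assumptions chosen for illustration. Real HDFS uses much larger blocks and also keeps replicated copies of each block, which this sketch ignores.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Cut the file content into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplification, not HDFS's real policy)."""
    placement = {node: [] for node in nodes}
    for index, node in zip(range(len(blocks)), cycle(nodes)):
        placement[node].append(index)
    return placement


data = b"x" * 1000                       # stand-in for a large file
blocks = split_into_blocks(data, 256)    # hypothetical 256-byte block size
placement = place_blocks(blocks, ["node1", "node2", "node3"])

print(len(blocks))
print(placement)
```

Joining the blocks back together in order gives the original file, which is essentially what a client does when it reads a file from the cluster.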
HBase is a column-oriented database that runs on top of HDFS and provides a structured data model. It stores data in tabular form.
You can think of HDFS as a local file system and HBase as a database management system (DBMS) in which we store data in tabular format. Internally, the DBMS takes care of writing that logical tabular data to the physical file system.
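As a rough picture of what "column-oriented, tabular" means here, the sketch below models a table as row keys that map to columns grouped into families. This is a toy in-memory structure, not the HBase API: the `put` and `get` helpers and the sample families `info` and `stats` are hypothetical names made up for the example.

```python
from collections import defaultdict

# Toy table: row key -> {"family:qualifier": value}
table = defaultdict(dict)


def put(row_key, family, qualifier, value):
    """Store one cell under the given row and column."""
    table[row_key][f"{family}:{qualifier}"] = value


def get(row_key, family=None):
    """Return all cells of a row, optionally restricted to one column family."""
    row = table.get(row_key, {})
    if family is None:
        return dict(row)
    return {col: val for col, val in row.items() if col.startswith(family + ":")}


put("user1", "info", "name", "Ann")
put("user1", "info", "city", "Pune")
put("user1", "stats", "logins", 12)
put("user2", "info", "name", "Raj")   # rows need not have the same columns

print(get("user1", "info"))
print(get("user2"))
```

Notice that the second row has fewer columns than the first. That flexibility is one of the ways this kind of store differs from a fixed-schema relational table.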
HCatalog provides a standard view of the data stored in HDFS, so that different processing tools (such as Pig and Hive) can read and write data more easily.
HCatalog presents a relational, tabular view of the data, so users need not worry about where or in what format their data is stored in HDFS, whether that is RCFile, text files or sequence files.
Data Processing Layer
This is the layer where data is processed and analyzed. In Hadoop 1.0 this layer consisted only of MapReduce. In Hadoop 2.0, YARN was introduced.
MapReduce is a programming model used to process large data sets in parallel.
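The classic way to see the model is a word count. The sketch below runs the three stages (map, shuffle, reduce) in a single Python process; the function names and sample lines are my own for illustration. On a real cluster the map and reduce calls would run in parallel on different machines, and the framework would do the shuffle for you.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(counts)
```

Because each line is mapped independently and each key is reduced independently, both stages can be spread over many machines without the pieces needing to talk to each other.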
YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0.
In Hadoop 1.0, MapReduce was responsible for managing cluster resources as well as processing data. YARN took over the task of managing the Hadoop cluster, so MapReduce is now streamlined to do only data processing, which is what it does best.
Data Access Layer
This layer is used to access data from the Data Processing Layer.
Writing MapReduce programs requires Java skills and a lot of Java code. Pig and Hive are high-level languages that sit on top of the MapReduce layer.
At run time, Pig and Hive scripts are converted into MapReduce jobs.
Pig is a tool used to analyze large amounts of data. Its scripting language, Pig Latin, makes data analysis and processing easier to write.
Hive provides a SQL-like interface for data stored in Hadoop. Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements.
HQL statements are broken down by the Hive service into MapReduce jobs and executed across the Hadoop cluster.
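To get a feel for that translation, take a query such as SELECT dept, SUM(salary) FROM employees WHERE salary > 1000 GROUP BY dept. The sketch below shows one way such a query could map onto a map step and a reduce step. The table, the column names and the functions are hypothetical, and the plan Hive actually generates is more involved than this.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical "employees" table
employees = [
    {"name": "a", "dept": "sales", "salary": 1200},
    {"name": "b", "dept": "sales", "salary": 900},
    {"name": "c", "dept": "hr", "salary": 1500},
    {"name": "d", "dept": "hr", "salary": 2000},
]


def mapper(row):
    """WHERE becomes a filter in the map step; the GROUP BY column becomes the key."""
    if row["salary"] > 1000:
        yield row["dept"], row["salary"]


def reducer(dept, salaries):
    """The aggregate function (SUM) becomes the reduce step."""
    return dept, sum(salaries)


mapped = [pair for row in employees for pair in mapper(row)]
mapped.sort(key=itemgetter(0))  # stands in for the shuffle/sort between map and reduce
result = [
    reducer(dept, [salary for _, salary in group])
    for dept, group in groupby(mapped, key=itemgetter(0))
]

print(result)
```

The point is that someone who knows SQL writes only the one-line query, and the filtering, grouping and summing code is produced for them.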
Sqoop is a tool to transfer data between Hadoop and relational databases.
You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
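The import, transform, export round trip can be pictured with a small stand-in. The sketch below does not use Sqoop at all: an in-memory SQLite database plays the role of the RDBMS, a Python list plays the role of the data landed in HDFS, and the table names `orders` and `order_totals` are made up for the example.

```python
import sqlite3

# Stand-in for the relational database (MySQL, Oracle, ...)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5), (3, 4.5)])
db.execute("CREATE TABLE order_totals (order_count INTEGER, total REAL)")

# 1. Import: copy the rows out of the database (the list stands in for files in HDFS)
imported = db.execute("SELECT id, amount FROM orders").fetchall()

# 2. Transform: the processing that would run as a MapReduce job in Hadoop
summary = (len(imported), sum(amount for _, amount in imported))

# 3. Export: write the processed result back into a database table
db.execute("INSERT INTO order_totals VALUES (?, ?)", summary)
db.commit()

exported = db.execute("SELECT order_count, total FROM order_totals").fetchall()
print(exported)
```

With Sqoop, steps 1 and 3 are what the tool does for you, so you do not have to write your own code to move data in and out of the database.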
Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.
Avro is a remote procedure call and data serialization framework developed within the Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format.
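Here is a small hand-rolled sketch of both halves of that sentence: a schema written in JSON, and a record turned into compact bytes. It does not use the Avro library. It only covers the string and int types, the `User` schema is a made-up example, and the encoder follows my understanding of Avro's binary encoding (zig-zag variable-length integers, strings as a length followed by UTF-8 bytes, record fields written in schema order with no field names).

```python
import json

# Hypothetical schema, defined in JSON as Avro schemas are
SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "int"}]}
""")


def encode_long(n):
    """Zig-zag encode, then emit 7 bits per byte, low-order group first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Length (as a long) followed by the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


def encode_record(schema, record):
    """Write the field values in schema order; field names are not stored."""
    encoders = {"string": encode_string, "int": encode_long}
    return b"".join(encoders[f["type"]](record[f["name"]]) for f in schema["fields"])


encoded = encode_record(SCHEMA, {"name": "Ann", "age": 30})
print(encoded, len(encoded))
```

The record takes 5 bytes here, while the same data as JSON text would take several times that. The saving comes from leaving the field names in the schema instead of repeating them in every record.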
This is the layer that meets the user. Users access the system through this layer, which has various components. A few of them are listed below.
Ambari is a web-based tool for provisioning, managing and monitoring Hadoop Clusters.
Chukwa is a data collection system for monitoring distributed systems, and more specifically Hadoop clusters.
ZooKeeper is an open-source project that deals with maintaining configuration information, naming, distributed synchronization and group services for various distributed applications. It implements these protocols on the cluster so that applications need not implement them on their own.
Check the blog series My Learning Journey for Hadoop, or jump directly to the next blog: Hello World program in Hadoop using Hortonworks Sandbox.