Type one word in google “Hadoop” and you will get numerous result. It is very difficult to decide what to read first, how to go with it. Hadoop has 3 main paths to go.
- – Data Scientists
- – Developers/Analysts
- – Administration
To go any of these path you have to know few basic things like what Hadoop is, why Hadoop is and how is Hadoop. Here I have tried to capture very basic few things which will escort you to start with Hadoop.
What is Hadoop?
Apache Hadoop is an open source software framework written in JAVA for distributed storage and distributed processing of very large data set on computer cluster build on commodity hardware.
Challenge: Data is too big store in one computer
Hadoop Solution: Data is stored in multiple computer.
Challenge: Very high end machines are expensive
Hadoop solution: Run on commodity hardware
Challenge: commodity hardware will fail.
Hadoop Solution: Software is intelligent enough to deal with hardware failure.
Challenge: hardware failure may lead to data loss
Hadoop Solution: replicate (duplicate) data
Challenge: how will the distributed nodes co-ordinate among themselves
Hadoop solution: There is a master node that co-ordinates all the worker nodes
3 V attributes that are used to describe the big data problem.
— Volume: Volume reflects the large amount of data that needs to be processed. As the various data sets are stacked together the amount of data increases.
— Variety: Varity reflects different sources of data. It can vary from webserver logs to structured data from databases to unstructured data from social media.
— Velocity: Velocity reflects the amount of data which keeps on accumulating with time.
What is the difference between Hadoop and RDBMS?
Eco System Suite of java based(mostly) projects, A framework
One project with multiple components
Designed to support distributed architecture
Designed with idea of server client Architecture
Designed to run on commodity hardware
High usage would expect High end server
High fault tolerance
Based on distributed file system like GFS, HDFS.
Rely on OS file system
Very good support of unstructured data
Needs structured data
Flexible, evolvable and fast
Needs to follow defined constraints
Has lots of very good products like oracle, sql.
Suitable for Batch processing
Real time Read/Write
Arbitrary insert and update
Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It has emerged as the best way to handle massive amounts of data, including not only structured data but also complex, unstructured data as well.
Its popularity is due in part to its ability to store, analyze, and access large amounts of data quickly and cost effectively across clusters of commodity hardware. Hadoop is not actually a single product but instead a collection of several components.
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs.
Pig’s language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible. Pig was originally  developed at Yahoo Research around 2006.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
Both Pig and hive are on map reduce layer. The code written on Pig and hive gets converted into map reduce job and then run on hdfs.
HBase (Hadoop DataBase) is a distributed, column oriented database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and point queries (random reads).
The main components of HBase are as below:
– HBase Master is responsible for negotiating load balancing across all Region Servers and maintain the state of the cluster. It is not part of the actual data storage or retrieval path.
– RegionServer is deployed on each machine and hosts data and processes I/O requests.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
Apache Oozie is a workflow/coordination system to manage Hadoop jobs.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
Apache Flume is a service for collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
Flume is used to inject the data into hadoop system.
It has three entity. Source, channel and sink.
Source is the entity through which data enters into Flume.
Sink is the entity that delivers the data to the destination.
Sources ingest events into the channel and the sinks drain the channel.
In order to facilitate the data movement into or out of Hadoop sqoop/flume is used.
Hue provides a Web application interface for Apache Hadoop. It supports a file browser, Hive, Pig, Oozie, HBase, and more.
In order to facilitate the data movement into or out of Hadoop sqoop/flume is used.
HDFS – Hadoop distributed file system:
Master / Worker Design:
In HDFS design there are one master node and many worker node.
Master node named as name node (NN) and worker node named as data node (DN).
Master node keeps all the Meta data information about the HDFS and information about the data node. Master node is in charge of file system (like creating files, user permission etc). Without it cluster node will be inoperable.
Data node is the slave / worker node holds the user data in the form of data blocks. There can be any number of data node in the Hadoop cluster.
A data block can be considered as the standard unit of data or files stored in HDFS.
Each incoming files are broken into 64 MB by default.(Currently the size has changed to 128 MB)
Any larger than 64MB is broken down in to 64 MB blocks.
All the blocks which make up a particular file are of the same size (64MB) except for the last block which might be lesser that 64 MB depending upon the size of the file.
Runs on commodity hardware:
As we saw Hadoop doesn’t need fancy high end hardware. Hadoop runs on commodity hardware. Hadoop stack is built to deal with hardware failure. So if one node fails still file system will continue to run. Hadoop accomplish this by duplicating the data in across the nodes.
Data is replicated:
So how does Hadoop keep the data safe or resilient? Simple, it keeps the multiple copies of the data across the nodes.
Example: Data segment #2 is replicated 3 times, on a data node A, B and D. Let say if data node A fails still data will be accessible from data node B and D.
Data blocks are replicated across different nodes in the cluster to ensure a high degree of fault tolerance. Replication enables the use of low cost commodity hardware for the storage of data. The number of replicas to be made/maintained is configurable at the cluster level as well as for each file. Based on the Replication Factor, each file (data block which forms each file) is replicated many times across different nodes in the cluster.
Data is replicated across different nodes in the cluster to ensure reliability/fault tolerance. Replication of data blocks is done based on the location of the data node, so as to ensure high degree of fault tolerance.
For instance, one or two copies of the data block stored on the same rack and one copy is stored on a different rack in same data center and another copy is stored on a rack in different data center and so on.
HDFS is better suited for large data:
Generic file system like Linux EXT file system will store files of varying sizes, from few bytes to few giga bytes. HDFS is however designed to large files. Large as in a few hundred megabytes to a few gigabytes. HDFS was built to work with mechanical disk drives, whose capacity has gone up in recent years. However, seek times haven’t improved all that much. So Hadoop tries to minimize disk seeks.
Files are write once only (Not updatable):
HDFS supports writing files once (they cannot be updated). This is a stark difference between HDFS and a generic file system (like a Linux file system). Generic file systems allows files to be modified. However appending to a file is supported.
- Name Node servers as the master and each Data Node servers as a worker/slave.
- Name Node and each Data Node have built-in web servers.
- Name Node is the heart of HDFS and is responsible for various tasks including – it holds the file system namespace, controls access to file system by the clients, keeps track of the Data Nodes, keeps track of replication factor and ensures that it is always maintained.
- User data is stored on the local file system of Data Nodes. Data Node is not aware of the files to which the blocks stored on it belong to. As a result of this, if the Name Node goes down then the data in HDFS is non-usable as only the Name Node knows which blocks belong to which file, where each block located etc.
- Name Node can talk to all the live Data Nodes and the live Data Nodes can talk to each other.
- There is also a Secondary Name Node which comes in handy when the Primary Name Node goes down. Secondary Name Node can be brought up to bring the cluster online. This process of switching of nodes needs to be done manually and there is no automatic failover mechanism in place.
- Name Node receives heartbeat signals and a Block Report periodically from each of the Data Nodes.
- Heartbeat signals from a Data Node indicates that the corresponding Data Node is alive and is working fine. If a heartbeat signal is not received from a Data Node then that Data Node is marked as dead and no further I/O requests are sent to that Data Node.
- Block Report from each Data Node contains a list of all the blocks that are stored on that Data Node.
- Heartbeat signals and Block Reports received from each Data Node help the Name Node to identify any potential loss of Blocks/Data and to replicate the Blocks to other active Data Nodes in the event when one or more Data Nodes holding the data blocks go down.
Here are few general highlights about HDFS:
- HDFS is implemented in Java and any computer which can run Java can host a Name Node/Data Node on it.
- Designed to be portable across all the major hardware and software platforms.
- Basic file operations include reading and writing of files, creation, deletion, and replication of files, etc.
- Provides necessary interfaces which enable the movement of computation to the data unlike in traditional data processing systems where the data moves to computation.
- HDFS provides a command line interface called “FS Shell” used to interact with HDFS.
When to Use HDFS:
There are many use cases for HDFS including the following:
- HDFS can be used for storing the Big Data sets for further processing.
- HDFS can be used for storing archive data since it is cheaper as HDFS allows storing the data on low cost commodity hardware while ensuring a high degree of fault-tolerance.
When Not to Use HDFS:
There are certain scenarios in which HDFS may not be a good fit including the following:
- HDFS is not suitable for storing data related to applications requiring low latency data access.
- HDFS is not suitable for storing a large number of small files as the metadata for each file needs to be stored on the Name Node and is held in memory.
- HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file.
So here are the basics of hadoop which will help you to start thinking in hadoop arena. If you want to know more about how sap is going with hadoop. You can follow the below links.