Type one word, “Hadoop”, into Google and you will get numerous results. It is very difficult to decide what to read first and how to proceed. Hadoop has three main paths to follow:

  • Data Scientists
  • Developers/Analysts
  • Administrators

To follow any of these paths you need to know a few basics: what Hadoop is, why it exists and how it works. Here I have tried to capture the essentials that will help you get started with Hadoop.

What is Hadoop?

Apache Hadoop is an open source software framework, written in Java, for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Why Hadoop?

Challenge: Data is too big to store on one computer.

Hadoop solution: Data is stored across multiple computers.

Challenge: Very high-end machines are expensive.

Hadoop solution: Run on commodity hardware.

Challenge: Commodity hardware will fail.

Hadoop solution: The software is intelligent enough to deal with hardware failure.

Challenge: Hardware failure may lead to data loss.

Hadoop solution: Replicate (duplicate) the data.

Challenge: How will the distributed nodes coordinate among themselves?

Hadoop solution: There is a master node that coordinates all the worker nodes.

Three “V” attributes are commonly used to describe the big data problem:

  • Volume: Volume reflects the large amount of data that needs to be processed. As various data sets are combined, the amount of data increases.

  • Variety: Variety reflects the different sources of data. It can range from web server logs to structured data from databases to unstructured data from social media.

  • Velocity: Velocity reflects the speed at which new data arrives and keeps accumulating over time.

 
3v's.png

What is the difference between Hadoop and RDBMS?

Hadoop | RDBMS
------ | -----
Open source | Mostly proprietary
An ecosystem: a suite of (mostly) Java-based projects; a framework | One project with multiple components
Designed to support a distributed architecture | Designed around a client-server architecture
Designed to run on commodity hardware | Heavy usage calls for high-end servers
Cost efficient | Costly
High fault tolerance | Legacy procedures
Based on a distributed file system (HDFS, modeled on GFS) | Relies on the OS file system
Very good support for unstructured data | Needs structured data
Flexible, evolvable and fast | Needs to follow defined constraints
Still evolving | Has many mature products (Oracle and other SQL databases)
Suitable for batch processing | Real-time reads and writes
Sequential writes | Arbitrary inserts and updates

Hadoop Ecosystem:

Ecosystem.png

Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It has emerged as a popular way to handle massive amounts of data, including not only structured data but also complex, unstructured data.

Its popularity is due in part to its ability to store, analyze, and access large amounts of data quickly and cost effectively across clusters of commodity hardware. Hadoop is not actually a single product but instead a collection of several components.

Pig:

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs.

Pig’s language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible. Pig was originally developed at Yahoo Research around 2006.

Hive:

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

Both Pig and Hive sit on top of the MapReduce layer. Code written in Pig Latin or HiveQL is converted into MapReduce jobs, which then run against the data stored in HDFS.
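To make “converted into MapReduce jobs” a little more concrete, here is a minimal single-process sketch in Python of how a count-per-key query (conceptually something like `SELECT page, COUNT(*) FROM views GROUP BY page`) breaks down into map, shuffle and reduce phases. The function names and the sample `views` records are hypothetical, and this is only an illustration of the idea, not what Pig or Hive actually generate.

```python
from collections import defaultdict

def map_phase(records):
    """Emit a (key, 1) pair for every record, as a mapper would."""
    for page, _user in records:
        yield page, 1

def shuffle(pairs):
    """Group values by key; the framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical page-view log: (page, user)
views = [("home", "u1"), ("cart", "u2"), ("home", "u3")]
counts = reduce_phase(shuffle(map_phase(views)))
print(counts)  # {'home': 2, 'cart': 1}
```

On a real cluster the map and reduce steps run in parallel on many nodes, each working on the blocks stored closest to it.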

HBase:

HBase (Hadoop DataBase) is a distributed, column oriented database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and point queries (random reads).

The main components of HBase are as below:

  • HBase Master is responsible for negotiating load balancing across all Region Servers and for maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.

  • A RegionServer is deployed on each machine; it hosts data and processes I/O requests.

Apache ZooKeeper:

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.

Apache Oozie:

Apache Oozie is a workflow/coordination system to manage Hadoop jobs.

Apache Sqoop:

Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.


Flume:

Apache Flume is a service for collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

Flume is used to ingest data into the Hadoop system.

It has three entities: source, channel and sink.

  • The source is the entity through which data enters Flume.
  • The sink is the entity that delivers the data to the destination.
  • The channel sits between them: sources put events into the channel and sinks drain the channel.
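The source, channel and sink relationship can be sketched in a few lines of Python. This is a toy in-memory model with hypothetical names (`Channel`, `source`, `sink`, `store`); it does not use Flume's real API or configuration format, it only shows the direction in which events flow.

```python
from collections import deque

class Channel:
    """Buffer that holds events between the source and the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._events.popleft() if self._events else None

def source(lines, channel):
    """Source: ingest incoming data into the channel as events."""
    for line in lines:
        channel.put(line.strip())

def sink(channel, destination):
    """Sink: drain the channel and deliver events to the destination."""
    while (event := channel.take()) is not None:
        destination.append(event)

channel = Channel()
store = []  # stands in for the destination, for example a file in HDFS
source(["GET /index.html\n", "GET /cart\n"], channel)
sink(channel, store)
print(store)  # ['GET /index.html', 'GET /cart']
```

Because the channel buffers events, the source and the sink do not have to run at the same speed.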

Sqoop and Flume are the tools used to move data into or out of Hadoop.

Hue:

Hue provides a Web application interface for Apache Hadoop. It supports a file browser, Hive, Pig, Oozie, HBase, and more.


HDFS – Hadoop distributed file system:

/wp-content/uploads/2015/08/nodes_764399.png

Master / Worker Design:

In the HDFS design there is one master node and many worker nodes.

The master node is called the name node (NN) and each worker node is called a data node (DN).

The master node keeps all the metadata about HDFS as well as information about the data nodes. It is in charge of the file system (creating files, user permissions and so on). Without it the cluster is inoperable.

A data node is a slave/worker node that holds the user data in the form of data blocks. There can be any number of data nodes in a Hadoop cluster.

DATA block:

A data block can be considered as the standard unit of data or files stored in HDFS.

Each incoming file is broken into blocks, which were 64 MB by default in Hadoop 1.x (the default is 128 MB in Hadoop 2.x and later).

Any file larger than the block size (64 MB in the examples here) is broken down into 64 MB blocks.

All the blocks that make up a particular file are the same size (64 MB) except for the last block, which may be smaller than 64 MB depending on the size of the file.
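The block arithmetic is easy to check with a short Python sketch. The helper name `split_into_blocks` is made up for illustration, and sizes are given in whole megabytes to keep the example simple.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left over
    return sizes

print(split_into_blocks(200))       # [64, 64, 64, 8]
print(split_into_blocks(200, 128))  # [128, 72]
```

A 200 MB file therefore needs four blocks at 64 MB but only two at 128 MB, which means fewer blocks for the name node to track.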

Runs on commodity hardware:

As we saw, Hadoop doesn’t need fancy high-end hardware; it runs on commodity hardware. The Hadoop stack is built to deal with hardware failure, so if one node fails the file system will continue to run. Hadoop accomplishes this by duplicating the data across the nodes.

Data is replicated:

So how does Hadoop keep the data safe and resilient? Simple: it keeps multiple copies of the data across the nodes.

Example: Data segment #2 is replicated 3 times, on data nodes A, B and D. If data node A fails, the data is still accessible from data nodes B and D.

Data blocks are replicated across different nodes in the cluster to ensure a high degree of fault tolerance. Replication enables the use of low cost commodity hardware for the storage of data. The number of replicas to be made and maintained is configurable at the cluster level as well as for each file. Based on the replication factor, each data block of a file is replicated that many times across different nodes in the cluster.
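The example above (segment #2 stored on data nodes A, B and D) can be expressed as a tiny Python sketch. The `block_locations` map and the helper names are hypothetical; a real name node keeps a far richer structure.

```python
# Block-to-node map taken from the example: segment #2 lives on A, B and D.
block_locations = {2: ["A", "B", "D"]}

def live_replicas(block_id, failed_nodes):
    """Return the data nodes that still hold a readable copy of the block."""
    return [n for n in block_locations[block_id] if n not in failed_nodes]

def is_readable(block_id, failed_nodes):
    """A block stays readable as long as at least one replica survives."""
    return len(live_replicas(block_id, failed_nodes)) > 0

print(live_replicas(2, {"A"}))          # ['B', 'D']
print(is_readable(2, {"A", "B"}))       # True
print(is_readable(2, {"A", "B", "D"}))  # False
```

With a replication factor of 3, the block is lost only if all three nodes holding it fail before it can be copied elsewhere.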

Rack awareness:

Data is replicated across different nodes in the cluster to ensure reliability and fault tolerance. Replica placement takes the location (rack) of each data node into account, so that the loss of a single rack does not make a block unavailable.

For instance, with the default replication factor of 3, HDFS typically stores one copy of a block on a node in the writer's local rack and the other two copies on two different nodes in a single remote rack. Placement is decided at the rack level within one cluster; copying data to a different data center is normally handled by separate tooling rather than by block replication.
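Below is a minimal sketch of rack-aware placement in Python. It assumes a replication factor of 3 and the commonly documented default HDFS policy (one replica on the writer's node, two on different nodes of one remote rack). The `topology` layout and the function name are hypothetical, and real HDFS picks nodes randomly among the candidates instead of taking the first ones.

```python
def place_rack_aware(topology, writer_node):
    """Choose three nodes: the writer's node plus two nodes on one remote rack.

    `topology` maps a rack name to the list of node names in that rack.
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    # Pick the first remote rack that has at least two nodes (simplification).
    remote_rack = next(
        r for r, nodes in topology.items() if r != local_rack and len(nodes) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
placement = place_rack_aware(topology, "n1")
print(placement)  # ['n1', 'n3', 'n4']
```

The three copies span two racks, so the block survives the loss of either rack, while only one copy has to cross the slower link between racks during the write.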

HDFS is better suited for large data:

A generic file system such as the Linux ext file system stores files of varying sizes, from a few bytes to a few gigabytes. HDFS, however, is designed for large files: large as in a few hundred megabytes to a few gigabytes. HDFS was built to work with mechanical disk drives, whose capacity has gone up in recent years while seek times haven’t improved all that much. So Hadoop tries to minimize disk seeks by reading large blocks sequentially.
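A rough back-of-the-envelope calculation in Python shows why large blocks help. The disk figures used here (10 ms per seek, 100 MB/s sequential transfer) are assumptions chosen for illustration, not measurements, and the helper name is made up.

```python
def seek_overhead(block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of total read time spent seeking, with one seek per block."""
    transfer_ms = block_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)

for size_mb in (0.004, 1, 64, 128):  # 4 KB, 1 MB, 64 MB, 128 MB
    print(f"{size_mb} MB block: {seek_overhead(size_mb):.1%} of time spent seeking")
```

Under these assumptions a 1 MB block spends about half of its read time seeking, while a 64 MB or 128 MB block spends only a percent or so, which is why HDFS favors large blocks and sequential reads.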

Files are write once only (Not updatable):

HDFS supports writing a file only once: existing content cannot be updated in place. This is a stark difference between HDFS and a generic file system (like a Linux file system), which allows files to be modified. Appending to a file, however, is supported.

/wp-content/uploads/2015/08/nodes_764399.png

HDFS Architecture:

/wp-content/uploads/2015/08/archi_764401.png

  • Name Node serves as the master and each Data Node serves as a worker/slave.
  • Name Node and each Data Node have built-in web servers.
  • Name Node is the heart of HDFS. It holds the file system namespace, controls access to the file system by clients, keeps track of the Data Nodes, and keeps track of the replication factor and ensures that it is always maintained.
  • User data is stored on the local file system of the Data Nodes. A Data Node is not aware of the files to which its blocks belong. As a result, if the Name Node goes down the data in HDFS is unusable, because only the Name Node knows which blocks belong to which file, where each block is located and so on.
  • Name Node can talk to all the live Data Nodes and the live Data Nodes can talk to each other.
  • There is also a Secondary Name Node. It is not a hot standby: it periodically merges the Name Node's edit log into a checkpoint of the file system image. If the primary Name Node goes down, that checkpoint can be used to bring a Name Node back up and get the cluster online. This switch has to be done manually; in Hadoop 1.x there is no automatic failover mechanism in place.
  • Name Node receives heartbeat signals and a Block Report periodically from each of the Data Nodes.
    • A heartbeat signal from a Data Node indicates that the Data Node is alive and working fine. If no heartbeat is received from a Data Node, that Data Node is marked as dead and no further I/O requests are sent to it.
    • Block Report from each Data Node contains a list of all the blocks that are stored on that Data Node.
    • Heartbeat signals and Block Reports from each Data Node help the Name Node identify any potential loss of blocks and re-replicate those blocks to other active Data Nodes when one or more Data Nodes holding them go down.
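The heartbeat and Block Report logic described above can be sketched in Python as follows. This is a toy model: the class and function names are hypothetical, time is passed in as a plain number, and the 30-second timeout is an arbitrary value chosen for the example, not an HDFS default.

```python
class HeartbeatMonitor:
    """Toy model of how a Name Node could track Data Node liveness."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical timeout, not an HDFS default
        self.last_seen = {}         # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t > self.timeout_s}

def missing_replicas(block_report, dead, target=3):
    """Return {block_id: copies to re-create} for under-replicated blocks."""
    result = {}
    for block_id, holders in block_report.items():
        live = [n for n in holders if n not in dead]
        if len(live) < target:
            result[block_id] = target - len(live)
    return result

monitor = HeartbeatMonitor(timeout_s=30)
for node in ("A", "B", "D"):
    monitor.heartbeat(node, now=0)
monitor.heartbeat("B", now=40)  # A stops sending heartbeats after time 0
monitor.heartbeat("D", now=40)

dead = monitor.dead_nodes(now=50)
print(dead)                                          # {'A'}
print(missing_replicas({2: ["A", "B", "D"]}, dead))  # {2: 1}
```

Once a block is known to be one copy short, the Name Node can schedule a new replica on another live Data Node.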

  Here are a few general highlights about HDFS:

  • HDFS is implemented in Java and any computer which can run Java can host a Name Node/Data Node on it.
  • Designed to be portable across all the major hardware and software platforms.
  • Basic file operations include reading and writing of files, creation, deletion, and replication of files, etc.
  • Provides the interfaces needed to move computation to the data, unlike traditional data processing systems where the data moves to the computation.
  • HDFS provides a command line interface called “FS Shell” used to interact with HDFS.

  When to Use HDFS:

There are many use cases for HDFS including the following:

  • HDFS can be used for storing the Big Data sets for further processing.
  • HDFS can be used for storing archive data cheaply, since it allows the data to be stored on low cost commodity hardware while ensuring a high degree of fault tolerance.

  When Not to Use HDFS:

There are certain scenarios in which HDFS may not be a good fit including the following:

  • HDFS is not suitable for storing data related to applications requiring low latency data access.
  • HDFS is not suitable for storing a large number of small files as the metadata for each file needs to be stored on the Name Node and is held in memory.
  • HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file.
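The small-files point can be illustrated with a rough estimate in Python. It assumes the often quoted rule of thumb that each file or block object costs about 150 bytes of Name Node memory; that figure, the 128 MB block size and the helper name are assumptions for illustration, not exact values.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure

def namenode_bytes(total_mb, file_size_mb, block_mb=128):
    """Estimate Name Node memory for storing total_mb as equal-sized files."""
    files = math.ceil(total_mb / file_size_mb)
    blocks_per_file = math.ceil(file_size_mb / block_mb)
    objects = files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT

one_tb_in_mb = 1024 * 1024
small = namenode_bytes(one_tb_in_mb, file_size_mb=1)     # about a million 1 MB files
large = namenode_bytes(one_tb_in_mb, file_size_mb=1024)  # 1024 files of 1 GB each
print(small, large)
```

Under these assumptions the same terabyte of data needs roughly 300 MB of Name Node memory as 1 MB files but only a little over 1 MB as 1 GB files.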

So these are the basics of Hadoop, which should help you start thinking in the Hadoop arena. If you want to know more about how SAP is working with Hadoop, you can follow the links below.

SAP Embraces Hadoop in The Enterprise

Fitting Hadoop in an SAP Software Landscape – Part 3 ASUG Webcast

