Type the word “Hadoop” into Google and you will get numerous results. It is very difficult to decide what to read first and how to proceed. There are a few main paths into Hadoop.
To follow any of these paths you need to know a few basic things: what Hadoop is, why Hadoop exists, and how Hadoop works. Here I have tried to capture the basics that will escort you into Hadoop.
What is Hadoop?
Apache Hadoop is an open source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Why Hadoop?
Challenge: Data is too big to store on one computer.
Hadoop solution: Store the data across multiple computers.
Challenge: Very high-end machines are expensive.
Hadoop solution: Run on commodity hardware.
Challenge: Commodity hardware will fail.
Hadoop solution: The software is intelligent enough to deal with hardware failure.
Challenge: Hardware failure may lead to data loss.
Hadoop solution: Replicate (duplicate) the data.
Challenge: How will the distributed nodes coordinate among themselves?
Hadoop solution: A master node coordinates all the worker nodes.
The 3 Vs are the attributes commonly used to describe the big data problem.
-- Volume: the large amount of data that needs to be processed. As various data sets are stacked together, the amount of data grows.
-- Variety: the different sources and formats of data, which can vary from web server logs to structured data from databases to unstructured data from social media.
-- Velocity: the speed at which data arrives and keeps accumulating over time.
What is the difference between Hadoop and RDBMS?
| Hadoop | RDBMS |
| --- | --- |
| Open source | Mostly proprietary |
| An ecosystem: a suite of (mostly) Java-based projects, a framework | One product with multiple components |
| Designed for a distributed architecture | Designed around a client-server architecture |
| Designed to run on commodity hardware | Heavy usage expects high-end servers |
| Cost efficient | Costly |
| High fault tolerance | Legacy recovery procedures |
| Built on a distributed file system such as GFS or HDFS | Relies on the OS file system |
| Very good support for unstructured data | Needs structured data |
| Flexible, evolvable and fast | Must follow defined constraints |
| Still evolving | Mature, with well-established products such as Oracle and SQL Server |
| Suitable for batch processing | Real-time read/write |
| Sequential writes | Arbitrary inserts and updates |
Hadoop Ecosystem:
Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It has emerged as the best way to handle massive amounts of data, including not only structured data but also complex, unstructured data.
Its popularity is due in part to its ability to store, analyze, and access large amounts of data quickly and cost effectively across clusters of commodity hardware. Hadoop is not actually a single product but instead a collection of several components.
Pig:
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs.
Pig's language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible. Pig was originally developed at Yahoo! Research around 2006.
Hive:
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
Both Pig and Hive sit on top of the MapReduce layer: code written in Pig or Hive is compiled into MapReduce jobs, which then run over data stored in HDFS.
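To make the MapReduce layer concrete, here is a toy word-count sketch of the model that Pig and Hive scripts are compiled down to. A real job runs distributed on a Hadoop cluster; here the map and reduce phases are simulated in plain Python, with the shuffle step reduced to simple grouping.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per word (grouping stands in for the shuffle)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

data = ["hadoop stores big data", "pig and hive run on hadoop"]
print(reduce_phase(map_phase(data)))  # 'hadoop' appears twice
```

A Pig Latin or HiveQL query expressing the same count would generate an equivalent chain of map and reduce steps for you.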
HBase:
HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for its underlying storage. It supports both batch-style computations using MapReduce and point queries (random reads).
The main components of HBase are as below:
- The HBase Master is responsible for negotiating load balancing across all RegionServers and for maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.
- A RegionServer is deployed on each machine; it hosts the data and processes I/O requests.
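HBase's data model is what makes cheap point queries possible: each cell is addressed by a row key plus a column (family:qualifier). The sketch below is a toy in-memory illustration of that addressing scheme, not the real HBase client API.

```python
class ToyHBaseTable:
    """Toy column-oriented table: cells addressed by (row key, column)."""

    def __init__(self):
        self.rows = {}  # row key -> {column name -> value}

    def put(self, row_key, column, value):
        """Write a single cell."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        """Point query: random read of a single cell, None if absent."""
        return self.rows.get(row_key, {}).get(column)

table = ToyHBaseTable()
table.put("user1", "info:name", "Alice")
print(table.get("user1", "info:name"))  # Alice
```

In real HBase the rows are sorted by key and partitioned into regions served by the RegionServers described above.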
Apache Zookeeper:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
Apache Oozie:
Apache Oozie is a workflow/coordination system to manage Hadoop jobs.
Apache Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
Flume:
Apache Flume is a service for collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
Flume is used to inject data into the Hadoop system.
It has three entities: source, channel and sink.
The source is the entity through which data enters Flume.
The sink is the entity that delivers the data to its destination.
Sources ingest events into the channel and sinks drain the channel.
Sqoop and Flume are the tools used to move data into or out of Hadoop.
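The source/channel/sink relationship can be sketched with a plain queue standing in for the channel. This is conceptual only; real Flume agents are configured declaratively in properties files, not written in Python.

```python
from collections import deque

channel = deque()  # the channel buffers events between source and sink

def source(events):
    """Source: the entry point that ingests events into the channel."""
    for event in events:
        channel.append(event)

def sink(destination):
    """Sink: drains the channel and delivers events to the destination."""
    while channel:
        destination.append(channel.popleft())

hdfs = []  # stand-in for the destination (e.g. HDFS)
source(["log line 1", "log line 2"])
sink(hdfs)
print(hdfs)  # ['log line 1', 'log line 2']
```

The channel decouples the two ends: a slow sink does not block the source, it just lets events accumulate in the channel.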
Hue:
Hue provides a Web application interface for Apache Hadoop. It supports a file browser, Hive, Pig, Oozie, HBase, and more.
HDFS - Hadoop distributed file system:
Master / Worker Design:
In the HDFS design there is one master node and many worker nodes.
The master node is called the NameNode (NN) and the worker nodes are called DataNodes (DN).
The NameNode keeps all the metadata about HDFS and about the DataNodes. It is in charge of the file system (creating files, user permissions, etc.). Without it, the cluster is inoperable.
A DataNode is a worker node that holds the user data in the form of data blocks. There can be any number of DataNodes in a Hadoop cluster.
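The division of labor above can be sketched as two lookup tables: the NameNode holds only metadata (which blocks make up a file, and which DataNodes hold each block), while the DataNodes hold the actual block contents. All names below are illustrative, not real HDFS identifiers.

```python
# NameNode metadata: file path -> ordered list of block IDs
namenode = {
    "/logs/app.log": ["blk_1", "blk_2"],
}

# NameNode metadata: block ID -> DataNodes holding a replica
block_locations = {
    "blk_1": ["datanodeA", "datanodeB", "datanodeD"],
    "blk_2": ["datanodeB", "datanodeC", "datanodeD"],
}

def read_file(path):
    """A client asks the NameNode where each block lives,
    then fetches the blocks directly from the DataNodes."""
    return [(blk, block_locations[blk]) for blk in namenode[path]]

print(read_file("/logs/app.log"))
```

Note that file contents never flow through the NameNode; clients only consult it for locations, which is why losing it makes the cluster inoperable even though no user data lives on it.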
Data blocks:
A data block can be considered the standard unit in which data (files) is stored in HDFS.
Each incoming file is broken into blocks of 64 MB by default (in newer Hadoop versions the default is 128 MB).
Any file larger than 64 MB is broken down into 64 MB blocks.
All the blocks which make up a particular file are of the same size (64 MB), except for the last block, which may be smaller depending on the size of the file.
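The block-splitting rule above is simple arithmetic: a file is cut into fixed-size blocks, and only the last block may be smaller.

```python
BLOCK_SIZE = 64  # MB (128 MB by default in newer Hadoop versions)

def split_into_blocks(file_size_mb):
    """Return the list of block sizes for a file of the given size."""
    full, remainder = divmod(file_size_mb, BLOCK_SIZE)
    blocks = [BLOCK_SIZE] * full
    if remainder:
        blocks.append(remainder)  # only the last block may be smaller
    return blocks

print(split_into_blocks(200))  # [64, 64, 64, 8]
```

So a 200 MB file occupies three full 64 MB blocks plus one 8 MB block; the small last block does not consume a full 64 MB on disk.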
Runs on commodity hardware:
As we saw, Hadoop doesn’t need fancy high-end hardware; it runs on commodity hardware. The Hadoop stack is built to deal with hardware failure, so if one node fails, the file system continues to run. Hadoop accomplishes this by duplicating data across the nodes.
Data is replicated:
So how does Hadoop keep the data safe and resilient? Simple: it keeps multiple copies of the data across the nodes.
Example: data block #2 is replicated 3 times, on DataNodes A, B and D. If DataNode A fails, the data is still accessible from DataNodes B and D.
Data blocks are replicated across different nodes in the cluster to ensure a high degree of fault tolerance. Replication enables the use of low-cost commodity hardware for storing data. The number of replicas to be maintained is configurable at the cluster level as well as per file. Based on this replication factor, each data block that forms a file is replicated across different nodes in the cluster.
Rack awareness:
Data is replicated across different nodes in the cluster to ensure reliability and fault tolerance. Replication of data blocks takes the location of each DataNode into account, to ensure a high degree of fault tolerance.
For instance, one or two copies of a data block are stored on the same rack, one copy is stored on a different rack in the same data center, another copy is stored on a rack in a different data center, and so on.
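A simplified placement policy in the spirit of the description above can be sketched as follows: keep one replica close to the writer, and spread the rest across racks so a whole-rack failure cannot lose every copy. The rack and node names are illustrative; real HDFS implements a configurable rack-aware policy inside the NameNode.

```python
def place_replicas(racks, local_rack, replication_factor=3):
    """racks: {rack name: [node, ...]}. Returns one node per replica."""
    placement = [racks[local_rack][0]]          # 1st replica: the local rack
    remote = next(r for r in racks if r != local_rack)
    placement.append(racks[remote][0])          # 2nd replica: a different rack
    if replication_factor >= 3:
        placement.append(racks[remote][1])      # 3rd: another node, same remote rack
    return placement

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "rack1"))  # ['n1', 'n3', 'n4']
```

With this placement, losing all of rack1 still leaves two replicas on rack2, while keeping two of the three replicas on one remote rack limits cross-rack write traffic.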
HDFS is better suited for large data:
A generic file system, like the Linux EXT file system, stores files of varying sizes, from a few bytes to a few gigabytes. HDFS, however, is designed for large files: large as in a few hundred megabytes to a few gigabytes. HDFS was built to work with mechanical disk drives, whose capacity has gone up in recent years while seek times haven't improved all that much, so Hadoop tries to minimize disk seeks.
Files are write once only (Not updatable):
HDFS files are written once and cannot be updated in place. This is a stark difference between HDFS and a generic file system (like a Linux file system), which allows files to be modified. Appending to an HDFS file, however, is supported.
HDFS Architecture:
Here are a few general highlights about HDFS:
- One NameNode (master) holds the metadata; many DataNodes (workers) hold the data blocks.
- Files are split into fixed-size blocks (64 MB by default, 128 MB in newer versions).
- Blocks are replicated across nodes and racks for fault tolerance.
- Files are written once; in-place updates are not supported, but appends are.
When to Use HDFS:
There are many use cases for HDFS, including the following:
- Storing very large files, from a few hundred megabytes to gigabytes and beyond.
- Batch processing with large, sequential reads over whole data sets.
- Running on clusters of low-cost commodity hardware, with fault tolerance handled by replication.
When Not to Use HDFS:
There are certain scenarios in which HDFS may not be a good fit, including the following:
- Low-latency, real-time read/write access, where an RDBMS or HBase is a better fit.
- Large numbers of small files, since the NameNode must keep metadata for every block.
- Workloads that need arbitrary in-place updates to files, since HDFS files are write-once.
So these are the basics of Hadoop, which should help you start thinking in the Hadoop arena. If you want to know more about how SAP is working with Hadoop, you can follow the links below.
SAP Embraces Hadoop in The Enterprise
Fitting Hadoop in an SAP Software Landscape – Part 3 ASUG Webcast