Data volumes are growing exponentially, so technologies such as Hadoop and the Cassandra File System have become popular choices among IT professionals and business communities.

Hadoop and Cassandra are both growing rapidly and have proven themselves as leading technologies for handling huge amounts of structured and unstructured data.

Both are open source software under the Apache umbrella. Both have large, rapidly growing customer bases, and each has its own pros and cons. Since both are popular and widely used for handling big data, it is worth comparing them to help readers understand the differences between the two technologies:

| S.No | Parameters | Hadoop | Cassandra |
|---|---|---|---|
| 1 | CAP theorem | CP | AP |
| 2 | Architecture | Master/slave. The NameNode works as the master and the DataNodes as worker nodes. | Peer-to-peer. Distributed architecture where all nodes are the same. |
| 3 | Read and write design | Write-once, read-many access model. | Read and write anywhere model. |
| 4 | Area of utilization | Batch-oriented analytical solutions. | Real-time online transactional processing. |
| 5 | Mode of accessing data | MapReduce for read/write operations. | Cassandra Query Language (CQL) and command-line interface tools. |
| 6 | Data storage model | File system. Large files are broken into blocks and replicated on many data nodes. | Data is stored in column families within a keyspace; primary and secondary indexes make the data readily accessible. |
| 7 | Fault tolerance | Single point of failure. In a setup without NameNode high availability, the cluster is vulnerable when the master node is down. | High availability, no single point of failure (SPOF). All nodes in the cluster are the same and can handle requests. |
| 8 | Storage schema | Physical file system schema. | Combines ideas from Google Bigtable and Amazon Dynamo. |
| 9 | Communication | RPC/TCP and UDP | Gossip protocol |
| 10 | Indexing | Indexing is difficult. To achieve distributed indexing with Hadoop, Apache Solr or Terrier can be configured with Hadoop. | Cassandra supports secondary indexes of type KEYS. Indexing is easy. |
| 11 | Data persistence | Data is written directly to the data nodes. | Data is first written to an in-memory structure called a memtable; when the memtable is full, it is flushed to disk as an SSTable. |
| 12 | Throughput and latency | Reading a chunk of data takes from tens of milliseconds in the best case to hundreds of milliseconds in the worst case. Write latency is reduced by the large number of data nodes. | Unlike most databases, Cassandra achieves both high throughput and low latency. |
| 13 | Naming | Central metadata server. | Cassandra uses an 'inode' column family to store metadata information. |
| 14 | Load balancing | A cluster is balanced when each data node's utilization differs from the utilization of the cluster by no more than a threshold value. If a node is unbalanced, HDFS moves replicas from it to other nodes while respecting the placement policy. | When new nodes are added to the cluster, data is not automatically redistributed equally across them, which can leave the cluster unbalanced. The token ranges must be recalculated so that every node holds an equal share of the data, and each node is then shifted to its new token with the nodetool move command. |
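The two load-balancing rules in the last row can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and sample node names are hypothetical, the 10 percent threshold is an assumed setting, and the token calculation assumes a cluster using RandomPartitioner (token space 0 to 2^127 - 1) with one token per node.

```python
from typing import Dict, List

# Token space of Cassandra's RandomPartitioner: 0 .. 2**127 - 1 (assumed partitioner).
RANDOM_PARTITIONER_RANGE = 2 ** 127


def unbalanced_datanodes(used: Dict[str, int], capacity: Dict[str, int],
                         threshold_pct: float = 10.0) -> List[str]:
    """Return the data nodes whose utilization differs from the cluster's
    utilization by more than threshold_pct percentage points."""
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    result = []
    for node in sorted(used):
        node_pct = 100.0 * used[node] / capacity[node]
        if abs(node_pct - cluster_pct) > threshold_pct:
            result.append(node)
    return result


def balanced_tokens(node_count: int) -> List[int]:
    """Return evenly spaced tokens, one per node, so that each node
    owns an equal slice of the token ring."""
    return [i * RANDOM_PARTITIONER_RANGE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Hypothetical three-node HDFS cluster: used and total space per node.
    used = {"dn1": 90, "dn2": 50, "dn3": 40}
    capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
    print("Unbalanced data nodes:", unbalanced_datanodes(used, capacity))

    # Tokens for a hypothetical four-node Cassandra ring.
    for index, token in enumerate(balanced_tokens(4)):
        print(f"node {index}: token {token}")
```

In the HDFS half, the cluster is 60 percent full, so a node at 50 percent is within the threshold while nodes at 90 and 40 percent are not. In the Cassandra half, each computed token would be passed to nodetool move on the corresponding node after the ring grows.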

1 Comment

  1. Abhishek Datta Gupta

    Nice comparison between Hadoop and Cassandra. I could not find a similar point-by-point comparison anywhere else on the internet. Thank you for posting this.
