Difference between Hadoop and Cassandra
As data volumes grow exponentially, technologies such as Hadoop and the Cassandra File System have become popular choices among IT professionals and businesses.
Hadoop and Cassandra are growing rapidly and have proven themselves in handling large amounts of structured and unstructured data.
Both are open source projects under the Apache umbrella. Both have large, fast-growing user bases, and each has its own pros and cons. Because both are popular and widely used for handling big data, it is worth comparing them to help readers understand how they differ:
| S.No | Parameters | Hadoop | Cassandra |
| --- | --- | --- | --- |
| 1 | CAP theorem | CP | AP |
| 2 | Architecture | Master/slave: the NameNode acts as the master and the DataNodes as worker nodes. | Peer-to-peer distributed architecture in which all nodes are equal. |
| 3 | Read and write design | Write-once, read-many access model. | Read-and-write-anywhere model. |
| 4 | Area of utilization | Batch-oriented analytical solutions. | Real-time online transactional processing. |
| 5 | Mode of accessing data | MapReduce for read/write operations. | Cassandra Query Language (CQL) and command-line interface tools. |
| 6 | Data storage model | File system. Large files are split into blocks that are replicated across many DataNodes. | Data is stored in column families grouped into keyspaces, with primary and secondary indexes for data access. |
| 7 | Fault tolerance | Single point of failure: the cluster is vulnerable when the master node is down. | High availability with no single point of failure (SPOF): all nodes in the cluster are equal and can handle requests. |
| 8 | Storage schema | Physical file system schema. | Combines the designs of Google Bigtable and Amazon Dynamo. |
| 9 | Communication | RPC/TCP and UDP | Gossip protocol |
| 10 | Indexing | Indexing is difficult. Distributed indexing can be achieved by configuring Apache Solr or Terrier with Hadoop. | Indexing is easy. Cassandra supports secondary indexes of type KEYS. |
| 11 | Data persistence | Data is written directly to the DataNodes. | Data is first written to an in-memory structure called the memtable; when the memtable is full, it is flushed to disk as an SSTable. |
| 12 | Throughput and latency | Reading a chunk of data takes from tens of milliseconds in the best case to hundreds of milliseconds in the worst case. The large number of DataNodes reduces write latency. | Designed for high throughput and low latency. |
| 13 | Naming | Central metadata server. | The Cassandra File System stores metadata in an 'inode' column family. |
| 14 | Load balancing | A DataNode is considered balanced when its utilization differs from the overall cluster utilization by no more than a threshold value. If a node is unbalanced, HDFS moves replicas from it to other nodes while respecting the replica placement policy. | Data is not automatically redistributed evenly when new nodes are added to the cluster, which can leave the cluster unbalanced. Token ranges must be recalculated so that data is shared equally, and then reassigned with the nodetool move command. |
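
The load balancing row can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the Cassandra part assumes the Murmur3Partitioner (token range -2^63 to 2^63-1) and one token per node, and the HDFS part assumes utilization is compared in percentage points against a threshold of 10. The function names `balanced_tokens` and `hdfs_node_is_balanced` are hypothetical and do not belong to either project.

```python
# Assumption: Murmur3Partitioner, whose tokens span -2**63 .. 2**63 - 1.
MURMUR3_MIN = -(2 ** 63)
RING_SIZE = 2 ** 64


def balanced_tokens(node_count):
    """Return one evenly spaced token per node (hypothetical helper).

    Each value is a candidate argument for `nodetool move` on one node,
    assuming every node owns exactly one token.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [MURMUR3_MIN + (i * RING_SIZE) // node_count
            for i in range(node_count)]


def hdfs_node_is_balanced(node_used, node_capacity,
                          cluster_used, cluster_capacity,
                          threshold_pct=10.0):
    """Check the balance rule from the table (hypothetical helper).

    A node counts as balanced when its utilization differs from the
    cluster utilization by no more than threshold_pct percentage points.
    """
    node_pct = 100.0 * node_used / node_capacity
    cluster_pct = 100.0 * cluster_used / cluster_capacity
    return abs(node_pct - cluster_pct) <= threshold_pct


if __name__ == "__main__":
    # Four-node ring: tokens are spaced a quarter of the ring apart.
    for token in balanced_tokens(4):
        print(token)
    # A node at 80% utilization in a cluster at 50% is outside the threshold.
    print(hdfs_node_is_balanced(80, 100, 500, 1000))
```

With a different partitioner the token range changes, so the two constants at the top would need to be adjusted accordingly.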
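
The data persistence row for Cassandra describes writes landing in memory first and moving to an SSTable once the memtable fills up. The toy model below sketches that flow in Python. It is not Cassandra's implementation: the class `ToyTable`, its entry-count flush threshold, and the in-memory list standing in for on-disk SSTables are all assumptions made for illustration, and durability mechanisms are left out.

```python
class ToyTable:
    """Toy model of a memtable that is flushed into immutable sorted runs."""

    def __init__(self, memtable_limit=3):
        # Assumption: flush after a fixed number of entries, not a byte size.
        self.memtable_limit = memtable_limit
        self.memtable = {}   # mutable, in-memory buffer of recent writes
        self.sstables = []   # stand-in for on-disk SSTables, oldest first

    def write(self, key, value):
        """Buffer the write in memory; flush when the memtable is full."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into an immutable run sorted by key."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then the flushed runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


if __name__ == "__main__":
    table = ToyTable(memtable_limit=2)
    table.write("a", 1)
    table.write("b", 2)      # second entry fills the memtable and triggers a flush
    table.write("a", 3)      # newer value stays in the memtable
    print(table.read("a"), table.read("b"), len(table.sstables))
```

Because flushed runs are never modified, a newer value for the same key simply shadows the older one at read time, which is why reads check the most recent data first.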