Tuesday, April 11, 2017

Hadoop 1.x Limitations


Limitations of Hadoop 1.x:

  • No horizontal scalability of the NameNode
  • No NameNode high availability
  • Overburdened JobTracker
  • Non-MapReduce big data applications cannot run on HDFS
  • No multi-tenancy

Problem: Namenode - No Horizontal Scalability

Hadoop 1.x supports only a single NameNode managing a single namespace, so the size of the filesystem is bounded by that NameNode's RAM. The NameNode keeps all of its metadata in memory, so even with hundreds of DataNodes in the cluster, the whole cluster is limited to roughly 50-100 million files.
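To see why RAM is the limit, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block) and, by default, one block per file. Both numbers are assumptions for illustration, not measured values, and the function name is hypothetical.

```python
# Rough NameNode heap estimate.
# ASSUMPTION: ~150 bytes of heap per namespace object (a widely quoted
# rule of thumb, not an exact figure).
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_gb(num_files, blocks_per_file=1):
    """Return the approximate heap (in GB) needed to hold the namespace."""
    # Each file costs one file entry plus one entry per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9


if __name__ == "__main__":
    for files in (50_000_000, 100_000_000):
        heap = estimate_namenode_heap_gb(files)
        print(f"{files:,} files -> ~{heap:.0f} GB of heap")
```

Under these assumptions, 50-100 million single-block files already need tens of gigabytes of heap on one machine. Adding DataNodes does not help, because the metadata lives only on the NameNode.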

Problem: Namenode - No High Availability

The NameNode is a single point of failure: without it the filesystem cannot be used. If it fails, an administrator must recover manually from the Secondary NameNode. Because the Secondary NameNode's checkpoint always lags behind the primary, namespace changes made after the last checkpoint are likely to be lost.

Problem: Job Tracker is Overburdened

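The JobTracker handles both cluster resource management and the scheduling and monitoring of every job, so it spends a significant portion of its time and effort managing the life cycle of applications.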

Problem: No Multi-Tenancy. Non-MapReduce jobs not supported

Hadoop 1.x dedicates all DataNode resources to Map and Reduce slots. Other workloads, such as graph processing, cannot use the data on HDFS.
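The cost of fixed Map and Reduce slots can be shown with a small sketch. It compares a node whose capacity is split into typed slots with the same capacity treated as one shared pool. The function names and numbers are hypothetical, and the pool is a deliberate simplification of generic resource allocation, not a model of any real scheduler.

```python
def runnable_tasks_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Tasks that can run when capacity is split into typed slots."""
    # A map task may only use a map slot, a reduce task only a reduce slot.
    running_maps = min(map_slots, pending_maps)
    running_reduces = min(reduce_slots, pending_reduces)
    return running_maps + running_reduces


def runnable_tasks_with_pool(total_units, pending_maps, pending_reduces):
    """Tasks that can run when the same capacity is one untyped pool."""
    return min(total_units, pending_maps + pending_reduces)


if __name__ == "__main__":
    # Hypothetical node: 8 map slots and 4 reduce slots, map-only workload.
    print(runnable_tasks_with_slots(8, 4, pending_maps=20, pending_reduces=0))
    print(runnable_tasks_with_pool(12, pending_maps=20, pending_reduces=0))
```

With a map-only workload the four reduce slots sit idle in the slot model, while the pooled model keeps all twelve units busy. A non-MapReduce task has no slot type at all in the first model, which is why such workloads are shut out.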

Hadoop 2.x features addressing Hadoop 1.x limitations:

  • HDFS Federation 
  • HDFS High Availability
  • YARN 

HDFS Federation:

HDFS Federation solves the "Namenode - No Horizontal Scalability" problem by using multiple independent NameNodes, each of which manages a portion of the filesystem namespace.
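One way to picture the split namespace is a client-side mount table that maps path prefixes to the NameNode responsible for them. The sketch below is only an illustration of that idea: the NameNode names, the paths, and the resolve_namenode helper are all hypothetical and are not Hadoop APIs.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that subtree.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/data/logs": "namenode-3",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode whose prefix is the longest match for path."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/database" does not hit "/data".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namenode manages {path}")
    return mount_table[best]


if __name__ == "__main__":
    for p in ("/user/alice/report.csv", "/data/raw/part-0", "/data/logs/2017/04"):
        print(p, "->", resolve_namenode(p))
```

Because each NameNode holds metadata only for its own subtree, adding a NameNode adds metadata capacity, which is the horizontal scaling that a single namespace lacks.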

HDFS High Availability:

HDFS High Availability in Hadoop 2.x resolves the NameNode high availability issue of Hadoop 1.x. In this implementation, a pair of NameNodes runs in an active-standby configuration. If the active NameNode fails, the standby takes over as the new active. In this configuration, DataNodes must send block reports to both NameNodes, active and standby, so the standby always has up-to-date block locations available in memory.
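The active-standby idea can be sketched as a toy model. Every DataNode reports its blocks to both NameNodes, so the standby already knows the block locations when it is promoted. All class and function names here are hypothetical. The model leaves out everything else a real failover involves, such as sharing the edit log and fencing the old active.

```python
class NameNode:
    """Toy NameNode that only tracks block locations."""

    def __init__(self, name, state):
        self.name = name
        self.state = state      # "active", "standby", or "failed"
        self.block_map = {}     # block id -> set of DataNode ids

    def receive_block_report(self, datanode_id, block_ids):
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode_id)


def send_block_report(namenodes, datanode_id, block_ids):
    """A DataNode reports to every NameNode, active and standby alike."""
    for namenode in namenodes:
        namenode.receive_block_report(datanode_id, block_ids)


def fail_over(active, standby):
    """Mark the active as failed and promote the standby."""
    active.state = "failed"
    standby.state = "active"
    return standby


if __name__ == "__main__":
    nn1 = NameNode("nn1", "active")
    nn2 = NameNode("nn2", "standby")
    send_block_report([nn1, nn2], "dn-1", ["blk_1", "blk_2"])
    send_block_report([nn1, nn2], "dn-2", ["blk_2", "blk_3"])
    new_active = fail_over(nn1, nn2)
    print(new_active.name, new_active.state, sorted(new_active.block_map))
```

Since both NameNodes received the same reports, the promoted standby can serve block locations immediately, without waiting for DataNodes to report again.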

YARN:

YARN is designed to relieve the excessive burden on the JobTracker in Hadoop 1.x by splitting its duties between a global ResourceManager and a per-application ApplicationMaster. YARN also supports multi-tenancy, and it adds a more general interface for running non-MapReduce jobs within the Hadoop framework.

