Thursday, July 25, 2024

Hadoop Interview Questions and Answers


  • Explain GenericOptionsParser?
    • GenericOptionsParser is a class that interprets the standard Hadoop command-line options (such as -D, -files, and -libjars) and sets them on a Configuration object. It is more commonly used indirectly, through the Tool interface and ToolRunner.
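    • A minimal sketch of direct use (assuming the Hadoop 2.x Java API; the class name and arguments are illustrative):
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.util.GenericOptionsParser;

            public class ParseDemo {
                public static void main(String[] args) throws Exception {
                    Configuration conf = new Configuration();
                    // Consumes standard options such as -D, -files and -libjars,
                    // applies them to conf, and returns the application's own args
                    String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
                    System.out.println("mapreduce.job.reduces = " + conf.get("mapreduce.job.reduces"));
                    for (String arg : remaining) {
                        System.out.println("application argument: " + arg);
                    }
                }
            }
      Invoked, for example, as: hadoop ParseDemo -D mapreduce.job.reduces=2 in out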
  • What is an Uber Job?
    • If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
  • How Application Master qualifies a job as a small job?
    • By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.
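    • These thresholds are configurable; a fragment for the job driver (standard MRv2 property names):
            Configuration conf = new Configuration();
            conf.setBoolean("mapreduce.job.ubertask.enable", true); // uber mode is off by default
            conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // "small" = fewer than 10 mappers
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most one reducer
            // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size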
  • What is the default OutputCommitter?
    • FileOutputCommitter is the default. It creates the final output directory for a job and temporary working space for task output.
  • Do data locality constraints apply to reducers?
    • No, reducers can work anywhere in the cluster. Only mappers have data locality constraints.
  • What are the roles of OutputCommitter?
    • The OutputCommitter ensures that jobs and tasks either succeed or fail cleanly.
    • When a job starts, the output committer performs job setup, such as creating the final output directory and the temporary working space for task output.
    • When a job succeeds, the output committer deletes the temporary working space, moves the output files to the final destination directory, and creates a _SUCCESS marker to indicate successful completion of the job.
    • When a job fails, the output committer deletes the temporary working space and makes sure the job stops cleanly.
    • With speculative execution or multiple task attempts, the output committer ensures that only the files of the successful task attempt are promoted to the final output directory; the files of the other attempts are deleted.
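    • A skeleton showing the lifecycle hooks named above (new-API OutputCommitter; FileOutputCommitter supplies the standard implementations, so a custom committer is rarely needed):
            import java.io.IOException;
            import org.apache.hadoop.mapreduce.JobContext;
            import org.apache.hadoop.mapreduce.JobStatus;
            import org.apache.hadoop.mapreduce.OutputCommitter;
            import org.apache.hadoop.mapreduce.TaskAttemptContext;

            public class SketchCommitter extends OutputCommitter {
                @Override public void setupJob(JobContext ctx) throws IOException { }   // create output dir and temp space
                @Override public void commitJob(JobContext ctx) throws IOException { }  // delete temp space, write _SUCCESS
                @Override public void abortJob(JobContext ctx, JobStatus.State state)
                        throws IOException { }                                          // clean up after a failed job
                @Override public void setupTask(TaskAttemptContext ctx) throws IOException { }
                @Override public boolean needsTaskCommit(TaskAttemptContext ctx) throws IOException { return true; }
                @Override public void commitTask(TaskAttemptContext ctx) throws IOException { } // promote this attempt's files
                @Override public void abortTask(TaskAttemptContext ctx) throws IOException { }  // delete this attempt's files
            }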
  • What constitutes progress in MapReduce?
    • Reading an input record (in a mapper or reducer)
    • Writing an output record (in a mapper or reducer)
    • Setting the status description (via Reporter’s or TaskAttemptContext’s setStatus() method)
    • Incrementing a counter (using Reporter’s incrCounter() method or Counter’s increment() method)
    • Calling Reporter’s or TaskAttemptContext’s progress() method
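    • A minimal sketch of a mapper reporting progress through the new (org.apache.hadoop.mapreduce) API; the counter enum and status text are illustrative:
            import java.io.IOException;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapreduce.Mapper;

            public class ProgressMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
                enum Records { PROCESSED }  // hypothetical counter

                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    context.getCounter(Records.PROCESSED).increment(1); // incrementing a counter is progress
                    context.setStatus("at offset " + key.get());        // so is setting the status
                    context.progress();                                 // explicit report for long-running work
                    context.write(value, new LongWritable(1));          // writing output is progress too
                }
            }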
  • How are status updates propagated through the MapReduce system?
    • The task reports its progress and status (including counters) back to its application master, which has an aggregate view of the job, every three seconds over the umbilical interface. The umbilical interface is the channel through which a child process communicates with its parent; in this case, the task communicates with the application master.
    • On the other hand, the client receives the latest status by polling the application master every second.
  • What is the maximum number of failed task attempts before the whole job is marked as failed?
    • By default, the application master reschedules a failed task up to four attempts in total. If any task fails four times, the whole job is marked as failed.
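    • The limit is configurable per job; a fragment for the driver (standard MRv2 property names):
            Configuration conf = new Configuration();
            // Allow up to 8 attempts per task instead of the default 4
            conf.setInt("mapreduce.map.maxattempts", 8);
            conf.setInt("mapreduce.reduce.maxattempts", 8);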
  • What could be causes of Task failure and how does MapReduce handle them?
    • Failure in user code (a runtime exception is thrown): the application master reschedules the task on another node.
    • Sudden exit of the task JVM: the application master reschedules the task on another node.
    • Hanging tasks: tasks that don't report progress to the application master are considered hanging. By default, if the application master doesn't receive a progress update for 10 minutes, the task is marked as failed and rescheduled on another node.
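    • The hanging-task timeout is configurable; a fragment for the driver (the value is in milliseconds; 0 disables timeout checking entirely):
            Configuration conf = new Configuration();
            // Allow 30 minutes without a progress report instead of the default 10
            conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);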
  • If a particular task's attempts are killed 4 times, will the application master mark the whole job as failed?
    • No. A task attempt may also be killed, which is different from it failing. A task attempt may be killed because it is a speculative duplicate. Killed task attempts do not count against the number of attempts to run the task because it wasn’t the task’s fault that an attempt was killed.
  • What is the maximum number of attempts allowed for YARN application master before marking it as failed?
    • By default, two attempts (set by mapreduce.am.max-attempts, subject to the cluster-wide cap yarn.resourcemanager.am.max-attempts).
  • What happens when application master fails?
    • The resource manager starts a new instance of the application master in a new container. The new application master can recover the state of the tasks that were already run by the failed instance, so they don't have to be rerun (recovery is enabled by default).
  • What happens when Node Manager fails?
    • If a node manager stops sending heartbeats to the resource manager for 10 minutes (configured via yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms), the resource manager marks it as failed and removes it from the pool of nodes on which containers are scheduled.
  • What happens to the successfully completed mappers and reducers in an incomplete job if node manager fails?
    • All map tasks that completed successfully on the failed node are rerun by the application master on a different node, because their intermediate output is stored on the failed node's local filesystem and is no longer accessible.
    • Reduce tasks that completed successfully on the failed node manager are not rerun, because their output is stored on HDFS and is not lost.
  • Blacklisting of Node Managers is done by ______________
    • Application Master
  • When are node managers blacklisted?
    • A node manager may be blacklisted by an application master if the number of task failures for that application on the node is high: by default, if more than three tasks fail on a node manager, the application master marks it as blacklisted and tries to reschedule tasks on different nodes (the threshold is set by mapreduce.job.maxtaskfailures.per.tracker). Note that blacklisting is done per application: the failed tasks must belong to the same application, so a node manager is not blacklisted when three tasks from different applications fail on it.
  • Explain failover controller?
    • The transition of a resource manager from standby to active is handled by a failover controller. The default failover controller is an automatic one, which uses ZooKeeper to ensure that there is only a single active resource manager at one time.
  • Define shuffle stage in MapReduce?
    • The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.
  • How do reducers know which machines to fetch map output from?
    • As map tasks complete successfully, they notify their application master using the heartbeat mechanism. Therefore, for a given job, the application master knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
  • How many phases does reduce task have?
    • The reduce task is divided into three phases: the copy phase, the sort phase, and the reduce phase.
    • Copy phase: the reduce task starts copying map outputs as soon as each map task completes. The reduce task has a small number of copier threads (five by default) so that it can fetch map outputs in parallel. Map outputs are first copied into the reduce task's memory buffer; when the buffer fills up, they are spilled to disk.
    • Sort phase: when the data from all mappers has been copied, the sort phase begins. It merges the map outputs, maintaining their sort ordering, in rounds governed by the merge factor (10 by default).
    • Reduce phase: the reduce function is invoked for each key in the sorted output, and the output of this phase is written directly to the output filesystem, typically HDFS. (The copier thread count and merge factor are configurable, as sketched below.)
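    • A fragment for the driver adjusting both knobs:
            Configuration conf = new Configuration();
            // Copier threads used in the copy phase (default 5)
            conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
            // Merge factor used in the sort phase (default 10)
            conf.setInt("mapreduce.task.io.sort.factor", 20);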
  • What are the default types of Input/Output Key and Values?
    • The default input key class is LongWritable and the default input value class is Text (from the default input format, TextInputFormat).
    • The default output key class is LongWritable and the default output value class is Text.
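    • To use different types, set them explicitly in the driver; a minimal fragment (a word-count-style job is assumed):
            Job job = Job.getInstance(new Configuration(), "wordcount");
            // Output types must match what the mapper and reducer actually emit
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);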
  • How to exclude certain files from the input in MapReduce?
    • To exclude certain files from the input, set a filter using the setInputPathFilter() method on FileInputFormat. Filters set this way act in addition to FileInputFormat's default filter, which excludes hidden files (those whose names begin with a dot or an underscore).
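    • A sketch of a custom filter (the class name and matching rule are illustrative):
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.fs.PathFilter;

            public class ExcludeFilter implements PathFilter {
                @Override
                public boolean accept(Path path) {
                    // Skip any file whose name contains "exclude"
                    return !path.getName().contains("exclude");
                }
            }
            // In the driver (org.apache.hadoop.mapreduce.lib.input.FileInputFormat):
            // FileInputFormat.setInputPathFilter(job, ExcludeFilter.class);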
  • How to exclude certain files from the input using Hadoop Streaming?
    • Input paths are set with the -input option in the Streaming interface. The -input option accepts file glob patterns, which can be used to select only the required files from the input directory.
    • The input path /user/root/wordcount contains two files, words_include.txt and words_exclude.txt. The following command includes words_include.txt and excludes words_exclude.txt (the glob is quoted so the local shell does not try to expand it):
                    hadoop jar hadoop-streaming-2.7.1.2.4.0.0-169.jar \
                        -input '/user/root/wordcount/*include*' \
                        -output /user/root/out \
                        -mapper /bin/cat
  • What is the optimal split size?
    • By default, the split size is the same as the HDFS block size, which is usually optimal: if the split size were larger than the block size, some splits would span multiple blocks, increasing the number of map tasks that are not data-local and reducing performance.
  • If reducer doesn't emit output, are the part files still created?
    • Yes. FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty.
  • How to avoid creating reducer part files with zero size?
    • LazyOutputFormat ensures that the output file is created only when the first record is emitted for a given partition. 
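    • A fragment for the driver, wrapping the real output format so that empty part files are never created:
            import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
            import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);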
  • How is sort order controlled?
    • If the property mapreduce.job.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used.
    • Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
    • If there is no registered comparator, the framework falls back to a RawComparator that deserializes the byte streams being compared into objects and delegates to the WritableComparable's compareTo() method.
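    • A sketch of a custom sort comparator (the class name and the reversed ordering are illustrative):
            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.WritableComparable;
            import org.apache.hadoop.io.WritableComparator;

            public class ReverseIntComparator extends WritableComparator {
                public ReverseIntComparator() {
                    super(IntWritable.class, true); // true = instantiate keys for compare()
                }
                @Override
                @SuppressWarnings({"rawtypes", "unchecked"})
                public int compare(WritableComparable a, WritableComparable b) {
                    return -a.compareTo(b); // reverse of the natural IntWritable order
                }
            }
            // In the driver: job.setSortComparatorClass(ReverseIntComparator.class);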
  • What are the types of samplers used in MapReduce?
    • Samplers implement the InputSampler.Sampler interface; Hadoop ships with three implementations, all nested classes of InputSampler:
    • SplitSampler
    • RandomSampler
    • IntervalSampler
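    • Samplers are typically used with TotalOrderPartitioner to write the partition file for a total sort; a hedged fragment for the driver (the sampling parameters and partition-file path are illustrative):
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
            import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

            // Sample 1% of records, up to 10,000 samples from at most 10 splits
            InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<>(0.01, 10000, 10);
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partitions"));
            InputSampler.writePartitionFile(job, sampler);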
  • What is the difference between MapSide Join and ReduceSide Join?
    • If the join is performed by the mapper it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join. 
    • For map-side join, each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition.
    • A reduce-side join is more general than a map-side join, in that the input datasets don’t have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle.
  • What is the filesystem used by distcp when transferring data between two clusters running different versions of Hadoop?
    • Using HftpFileSystem as the source is recommended. The distcp command must be run on the destination cluster so that the HDFS RPC versions are compatible. HFTP is a read-only filesystem.
    • hadoop distcp hftp://namenodeA:port/data/logs hdfs://namenodeB/data/logs
