Thursday, July 25, 2024

Amazon S3: Basic Concepts


Amazon S3 is a reliable, scalable online object storage service for storing files.

  • Bucket: A bucket is a container in Amazon S3 into which files are uploaded. Every file is stored in a bucket, so we need at least one bucket before we can store anything.
    • A bucket name has to be globally unique because the namespace is shared by all users.
    • Buckets can't contain other buckets, but they can contain nested directories.
    • By default, a maximum of 100 buckets can be created in a single account.
    • There is no size limit on buckets.
    • We can't rename a bucket once it is created.
    • Buckets can be accessed via HTTP URLs as follows.
      • http://<BUCKET_NAME>.s3.amazonaws.com/<OBJECT_NAME>
      • http://s3.amazonaws.com/<BUCKET_NAME>/<OBJECT_NAME>
    • Buckets can be managed via
      • REST-Style HTTP Interface
      • SOAP Interface
    • The access logging feature, if enabled, keeps track of bucket requests, recording details such as the request type, the resources accessed, and the date and time of the request.
  • Object: An object is a file on Amazon S3. Each object is assigned a unique identifier. Every object is stored in a bucket. Objects consist of data and metadata.
    • Objects can be managed via
      • REST-style HTTP Interface
      • SOAP Interface
    • Objects can be downloaded via
      • HTTP GET Interface
      • BitTorrent protocol
    • Every object is assigned a unique key as its identifier.
    • Objects can be added to a folder in either of two ways
      • Add Files option - Individual files can be uploaded using this option.
      • Enable enhanced uploader - This option is used when we need to upload whole folders.
    • There are two options under the Set Details section for files.
      • Use Reduced Redundancy Storage
        • Non-critical data can be set to use reduced redundancy storage. This stores the file at a lower level of redundancy than the standard storage class.
      • Use Server Side Encryption
        • This is for security. Data is encrypted when it is stored. When the object is accessed, Amazon S3 decrypts the data.
    • Use Server Side Encryption has two options.
      • Use the Amazon S3 service master key
      • Use an AWS Key Management Service master key
  • Key: A key is a unique identifier for an object within the bucket. The combination of bucket, key, and version ID uniquely identifies each object.
  • Region: We can choose the geographical region where Amazon S3 will store the buckets we create.
  • Folder: Folders in Amazon S3 are S3 objects used to group other Amazon S3 objects together. A folder is analogous to a directory.
  • Versioning: Versioning lets us retrieve old objects, including objects that have been deleted or updated. When an object is deleted, Amazon S3 inserts a delete marker rather than deleting it permanently.
    • Versioning is enabled at the bucket level.
    • A bucket can be in one of the following versioning states.
      • Unversioned - the default
      • Versioning enabled
      • Versioning suspended
    • By default, when versioning is enabled, Amazon S3 stores all versions of an object.
    • To limit the number of versions kept, enable "Lifecycle rules". These rules delete the old versions.
  • Data consistency model: S3 originally provided eventual consistency for reads that follow an update; since December 2020 it provides strong read-after-write consistency. The points below describe the original eventual-consistency behaviour.
    • If we make a GET request for an object right after an update request, we might get the old data if the update has not fully propagated; otherwise we get the latest data.
    • S3 returns either the old data or the updated data, but never partial data.
    • Amazon S3 provides high availability by replicating data across multiple servers.
    • If a PUT request is successful, the data is safely stored across multiple servers.
    • If changes are made to a file, the change has to be replicated across all locations, and this takes time. Any GET request during this period might return old data until the change is fully propagated.
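The versioning behaviour described above (every write keeps a new version, and a delete only adds a delete marker) can be pictured with a small in-memory model. This is a sketch for illustration only, not the S3 API: the VersionedBucket class, its method names, and the integer version IDs are all hypothetical.

```python
from collections import defaultdict

DELETE_MARKER = object()  # hypothetical sentinel standing in for an S3 delete marker


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket (not the real S3 API)."""

    def __init__(self):
        # key -> list of versions, oldest first
        self._versions = defaultdict(list)

    def put(self, key, data):
        """Store a new version and return its (hypothetical) integer version ID."""
        self._versions[key].append(data)
        return len(self._versions[key]) - 1

    def delete(self, key):
        """Append a delete marker instead of removing any stored version."""
        self._versions[key].append(DELETE_MARKER)

    def get(self, key, version_id=None):
        """Return the latest version, or a specific one when version_id is given."""
        versions = self._versions.get(key, [])
        if not versions:
            raise KeyError(key)
        data = versions[-1] if version_id is None else versions[version_id]
        if data is DELETE_MARKER:
            raise KeyError(key)
        return data


bucket = VersionedBucket()
v0 = bucket.put("report.txt", "first draft")
v1 = bucket.put("report.txt", "second draft")
bucket.delete("report.txt")

try:
    bucket.get("report.txt")
    latest_visible = True
except KeyError:
    latest_visible = False

print(latest_visible)                 # False
print(bucket.get("report.txt", v0))   # first draft
```

After the delete, a plain get behaves as if the object were gone, yet both earlier versions can still be read by version ID, which is what makes recovery of deleted objects possible.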

Resources for Hadoop Questions

https://www.facebook.com/pages/Hadoop-Certification-Dumps-Q-A/1412348605698718

http://www.dattamsha.com/big-data-quiz/


http://searchdatamanagement.techtarget.com/quiz/Quiz-Test-your-understanding-of-the-Hadoop-ecosystem

http://www.crinlogic.com/hadooptest.html

http://hadoopquestions.blogspot.in/2014/10/latest-hadoop-interview-questions-and.html

http://www.slideshare.net/rohitkapa/hadoop-interview-questions

http://www.aiopass4sure.com/cloudera-exams/cca-410-exam-questions

Hadoop Interview Questions and Answers


  • Explain GenericOptionsParser?
    • GenericOptionsParser is a class that interprets Hadoop command-line options and sets them on a Configuration object.
  • What is an Uber Job?
    • If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
  • How does the application master qualify a job as a small job?
    • By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.
  • What is the default OutputCommitter?
    • FileOutputCommitter is the default. It creates the final output directory for a job and temporary working space for task output.
  • Do data locality constraints apply to reducers?
    • No, reducers can run anywhere in the cluster. Only mappers have data locality constraints.
  • What are the roles of OutputCommitter?
    • The output committer ensures that jobs and tasks either succeed or fail cleanly.
    • When a job starts, the output committer performs job initialization, such as creating the output directory and the temporary working space for task output.
    • When a job succeeds, the output committer moves the output files to the final destination folder, deletes the temporary working space, and creates the _SUCCESS marker to indicate successful completion of the job.
    • When a job fails, the output committer deletes the temporary working space and makes sure the job stops cleanly.
    • In case of speculative execution or multiple task attempts, the output committer makes sure that only the files of the successful attempt are promoted to the final output directory. The files of the other attempts are deleted.
  • What constitutes progress in MapReduce?
    • Reading an input record (in a mapper or reducer)
    • Writing an output record (in a mapper or reducer)
    • Setting the status description (via Reporter’s or TaskAttemptContext’s setStatus() method)
    • Incrementing a counter (using Reporter’s incrCounter() method or Counter’s increment() method)
    • Calling Reporter’s or TaskAttemptContext’s progress() method
  • How are status updates propagated through the MapReduce system?
    • The task reports its progress and status (including counters) back to its application master, which has an aggregate view of the job, every three seconds over the umbilical interface. The umbilical interface is the channel through which a child process communicates with its parent process; in this case, the task communicates with the application master.
    • The client, on the other hand, receives the latest status by polling the application master every second.
  • What is the maximum number of failed task attempts before the whole job is marked as failed?
    • The application master tries to run a failed task up to four times by default. If any task fails four times, the whole job is marked as failed.
  • What could cause a task to fail, and how does MapReduce handle each case?
    • User code throws a runtime exception: the application master tries to reschedule the task on another node.
    • Sudden exit of the task JVM: the application master tries to reschedule the task on another node.
    • Hanging tasks: tasks that don't report progress to the application master are called hanging tasks. By default, if the application master doesn't receive a progress update for 10 minutes, the task is marked as failed and the application master tries to reschedule it on another node.
  • If a particular task is killed four times, will the application master mark the whole job as failed?
    • No. A task attempt may also be killed, which is different from it failing. A task attempt may be killed because it is a speculative duplicate. Killed task attempts do not count against the number of attempts to run the task because it wasn’t the task’s fault that an attempt was killed.
  • What is the maximum number of attempts allowed for YARN application master before marking it as failed?
    • By default, two attempts.
  • What happens when the application master fails?
    • The resource manager starts a new instance of the application master in a new container.
  • What happens when a node manager fails?
    • If a node manager stops sending heartbeats to the resource manager for more than 10 minutes, the resource manager marks that node manager as failed.
  • What happens to the successfully completed mappers and reducers of an incomplete job if a node manager fails?
    • All successful mappers on the failed node are rerun by the application master on a different node, because the mappers' intermediate output is stored on the local filesystem of the failed node.
    • Reducers that completed successfully on the failed node are not rerun, because their output is stored on HDFS and is not lost.
  • Blacklisting of Node Managers is done by ______________
    • Application Master
  • When are node managers blacklisted?
    • Node managers may be blacklisted if the number of failures for an application is high. If more than three tasks of the same application fail on a node manager, the application master marks that node manager as blacklisted and tries to reschedule tasks on different nodes. Note that the failed tasks must belong to the same application; if three tasks from different applications fail, the node manager is not blacklisted.
  • Explain failover controller?
    • The transition of a resource manager from standby to active is handled by a failover controller. The default failover controller is an automatic one, which uses ZooKeeper to ensure that there is only a single active resource manager at one time.
  • Define shuffle stage in MapReduce?
    • The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.
  • How do reducers know which machines to fetch map output from?
    • As map tasks complete successfully, they notify their application master using the heartbeat mechanism. Therefore, for a given job, the application master knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
  • How many phases does a reduce task have?
    • A reduce task is divided into three phases: the copy phase, the sort phase, and the reduce phase.
    • Copy phase - The reduce task starts copying map outputs as soon as each map task completes. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel; the default is five threads. Map outputs are first copied into the reduce task's memory and are spilled to disk when that memory fills up.
    • Sort phase - When the outputs of all mappers have been copied, the sort phase starts. It merges the map outputs in rounds determined by the merge factor, maintaining their sort ordering. The default merge factor is 10.
    • Reduce Phase - During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.
  • What are the default types of the input/output keys and values?
    • The default input key class is LongWritable and the default input value class is Text.
    • The default output key class is LongWritable and the default output value class is Text.
  • How to exclude certain files from the input in MapReduce?
    • To exclude certain files from the input, you can set a filter using the setInputPathFilter() method on FileInputFormat. FileInputFormat uses a default filter that excludes hidden files.
  • How to exclude certain files from the input using Hadoop Streaming?
    • Input paths are set with the -input option of the Streaming interface. The -input option accepts glob patterns (wildcards), which can be used to include only the required files from the input directory.
    • Suppose the input path /user/root/wordcount contains two files, words_include.txt and words_exclude.txt. The command below includes words_include.txt and excludes words_exclude.txt. The pattern is quoted so that the local shell does not try to expand it.
                    hadoop jar hadoop-streaming-2.7.1.2.4.0.0-169.jar \
                    -input '/user/root/wordcount/*include*' \
                    -output /user/root/out \
                    -mapper /bin/cat
  • What is the optimal split size?
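    • By default, the split size is the same as the HDFS block size, which is usually the best choice. If the split size is greater than the block size, a split spans more than one block, so more map tasks have to read data that is not local to their node, which reduces performance.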
  • If reducer doesn't emit output, are the part files still created?
    • Yes. FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty.
  • How to avoid creating reducer part files with zero size?
    • LazyOutputFormat ensures that the output file is created only when the first record is emitted for a given partition. 
  • How is sort order controlled?
    • If the property mapreduce.job.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used.
    • Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
    • If there is no registered comparator, then a RawComparator is used. The RawComparator deserializes the byte streams being compared into objects and delegates to the WritableComparable’s compareTo() method.
  • What are the types of samplers used in MapReduce?
    • The InputSampler class provides the following Sampler implementations:
      • IntervalSampler
      • RandomSampler
      • SplitSampler
  • What is the difference between MapSide Join and ReduceSide Join?
    • If the join is performed by the mapper it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join. 
    • For map-side join, each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition.
    • A reduce-side join is more general than a map-side join, in that the input datasets don’t have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle.
  • Which filesystem does distcp use when transferring data between two clusters running different versions of Hadoop?
    • HftpFileSystem is the recommended source filesystem. HFTP is a read-only filesystem, so the distcp command has to be run on the destination cluster.
    • hadoop distcp hftp://namenodeA:port/data/logs hdfs://namenodeB/data/logs
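The map-side join precondition described above (both inputs partitioned identically and sorted by the join key) is what makes a single-pass merge possible. A minimal sketch of that merge over one pair of partitions follows. The function name merge_join and the sample records are hypothetical, and a real Hadoop job would rely on the framework's join support rather than hand-written code like this.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are sorted by key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Collect the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return joined


# Hypothetical partition 0 of two datasets, both sorted by station ID.
stations = [("s1", "Oslo"), ("s2", "Bergen"), ("s4", "Tromso")]
readings = [("s1", 12), ("s1", 15), ("s2", 9), ("s3", 20)]

result = merge_join(stations, readings)
print(result)
# [('s1', 'Oslo', 12), ('s1', 'Oslo', 15), ('s2', 'Bergen', 9)]
```

Because neither input is ever re-sorted or buffered in full, the join needs no shuffle. If either input were unsorted or partitioned differently, this merge would silently miss matches, which is why the reduce-side join is the more general option.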

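The retry rules in the answers above (a task may fail four times by default before the job fails, while killed attempts are not counted) can be sketched as a small bookkeeping function. The function name and the outcome strings are hypothetical; this only mirrors the counting rule and is not Hadoop's scheduler.

```python
MAX_FAILED_ATTEMPTS = 4  # default limit quoted in the answers above


def job_fails(attempt_outcomes, max_failed=MAX_FAILED_ATTEMPTS):
    """Return True if one task's attempt history would fail the whole job.

    attempt_outcomes is a list of strings: "failed", "killed" or "succeeded".
    Only "failed" attempts count against the limit; "killed" attempts
    (for example speculative duplicates) are ignored.
    """
    failed = 0
    for outcome in attempt_outcomes:
        if outcome == "succeeded":
            return False
        if outcome == "failed":
            failed += 1
            if failed >= max_failed:
                return True
    return False


print(job_fails(["failed", "failed", "failed", "failed"]))     # True
print(job_fails(["killed", "killed", "killed", "killed"]))     # False
print(job_fails(["failed", "killed", "failed", "succeeded"]))  # False
```

The second call shows the point of the distinction: four killed attempts leave the failure count at zero, so the job is not marked as failed.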
Map and Reduce functions to find maximum temperature in Python

Sample dataset in which each line is a four-digit year followed by a two-digit temperature.
200935
200942
200912
201040
201020
201015

Map function in Python:

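#!/usr/bin/env python
import sys

# Each input line is a 4-digit year followed by a 2-digit temperature.
for line in sys.stdin:
    val = line.strip()
    if not val:
        continue  # skip blank lines
    key, value = val[0:4], val[4:6]
    print("%s\t%s" % (key, value))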

Reduce function in Python:

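#!/usr/bin/env python

import sys

# Input arrives sorted by year, so all lines for one year are adjacent.
prev_year, max_temp = None, None
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    year, temp = line.split("\t")
    temp = int(temp)
    if prev_year is not None and year != prev_year:
        # The year changed: emit the maximum for the previous year.
        print("%s\t%s" % (prev_year, max_temp))
        max_temp = None
    prev_year = year
    max_temp = temp if max_temp is None else max(max_temp, temp)

# Emit the last year, unless there was no input at all.
if prev_year is not None:
    print("%s\t%s" % (prev_year, max_temp))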
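To check the logic without a cluster, the map, sort, and reduce steps can be imitated in one process. The sketch below is written for Python 3 and is only a stand-in for Hadoop Streaming: the helper names map_line, reduce_sorted, and run_local are hypothetical, and sorted() plays the role of the shuffle.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line):
    """Split a record such as '200935' into (year, temperature)."""
    record = line.strip()
    return record[0:4], int(record[4:6])


def reduce_sorted(pairs):
    """Yield (year, max temperature) from pairs already sorted by year."""
    for year, group in groupby(pairs, key=itemgetter(0)):
        yield year, max(temp for _, temp in group)


def run_local(lines):
    """Imitate map -> sort (shuffle) -> reduce in a single process."""
    mapped = [map_line(line) for line in lines if line.strip()]
    return list(reduce_sorted(sorted(mapped, key=itemgetter(0))))


sample = ["200935", "200942", "200912", "201040", "201020", "201015"]
result = run_local(sample)
for year, max_temp in result:
    print("%s\t%s" % (year, max_temp))
# 2009	42
# 2010	40
```

Using groupby here relies on the same guarantee the streaming reducer depends on: records for one year must be adjacent, which is why the pairs are sorted by year first.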

