Learning Big Data

Thursday, July 25, 2024

Amazon S3: Basic Concepts

Amazon S3 is an reliable, scalable, online object storage that stores files.

Bucket: A bucket is a container in Amazon S3 where the files are uploaded. Files are stored in buckets. We need at least one bucket to store the files.

Bucket name has to be unique because it is shared by all users.
Buckets can't have nested buckets but can have nested directories.
Maximum of 100 buckets can be created in a single account.
There is no size limit on buckets.
We can't rename a bucket once created.
Buckets can be accessed via HTTP URLs as follows.

http://<BUCKET_NAME>.s3.amazonaws.com/<OBJECT_NAME>
http://s3.amazonaws.com/<BUCKET_NAME>/<OBJECT_NAME>

Buckets can be managed via

REST-Style HTTP Interface
SOAP Interface

The access logging feature if enabled, keeps track of bucket requests such as request type, resources accessed, date and time when requested.

Object: An object is a file on Amazon S3. Each object is assigned a unique identifier. Every object is stored in a bucket. Objects consist of data and metadata.

Objects can be managed via

REST-style HTTP Interface
SOAP Interface

Objects can be downloaded via

HTTP GET Interface
BitTorrent protocol

Every object is assigned a key as an identifier and is unique.
Objects can be added to a folder in either of two ways

Add Files option - Individual files can be uploaded using this option.
Enable enhanced uploader - This option is used when we need to upload whole folders.

There are two options under Set Details section on files.

Use Reduced Redundancy Storage - Non critical data can be set to use reduced redundancy storage. Using this will store the file at lower levels of redundancy compared to standard storage class.

Use Server Side Encryption

This is for security. Data is encrypted while storing. When object is accessed, Amazon S3 decrypts the data.

Use server side encryption has two options.

Use the Amazon S3 service master key
Use an AWS Key Management Service master key

Key: A key is a unique identifier for an object within the bucket. Combination of bucket, key and version ID uniquely identifies each object.
Region: We might want to choose the geographical region where Amazon S3 will store buckets.
Folder: The folders in Amazon S3 are S3 files that are used to put Amazon S3 objects together under one group. This is analogous to Directory.
Versioning: Versioning helps us to retrieve old objects. We can retrieve deleted and updated objects. When an object is deleted, Amazon S3 inserts a delete marker rather than deleting it permanently.

Versioning is enabled at bucket level.
Versioning can be enabled in any of the following states.

Unversioned - the default
Versioning enabled
Versioning suspended.

By default, when versioning is enabled Amazon S3 stores all versions of an object.
To control the limit of versions, enable "Lifecycle rules" for the object. These rules will delete the old files.

Data consistency Model: S3 provides eventual consistency for read-after-write.

If we make a GET request to an object after an update request, we might get old data if update is not complete, else we will get latest data.
S3 would return old data or updated data but will never return partial data.
Amazon S3 provides high consistency by replication data across multiple servers.
If a PUT request is successful, data is safely stored across multiple servers.
If changes have to be made to a file, the change has to be replicated across all locations and this will take time. Any GET request during this time period might return old data until change is fully propagated.

Resources for Hadoop Questions

https://www.facebook.com/pages/Hadoop-Certification-Dumps-Q-A/1412348605698718

http://www.dattamsha.com/big-data-quiz/

http://searchdatamanagement.techtarget.com/quiz/Quiz-Test-your-understanding-of-the-Hadoop-ecosystem

http://www.crinlogic.com/hadooptest.html

http://hadoopquestions.blogspot.in/2014/10/latest-hadoop-interview-questions-and.html

http://www.slideshare.net/rohitkapa/hadoop-interview-questions

http://www.aiopass4sure.com/cloudera-exams/cca-410-exam-questions

Hadoop Interview Questions and Answers

Explain GenericOptionsParser?

GenericOptionsParser is a class that interprets Hadoop command-line options and sets them on a Configuration Object.

What is an Uber Job?

If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.

How Application Master qualifies a job as a small job?

By default, a small job is one that has less than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.

What is the default OutputCommitter?

FileOutputCommitter is the default. It creates the final output directory for a job and temporary working space for task output.

Does data locality constraints applies to Reducers?

No, reducers can work anywhere in the cluster. Only mappers have data locality constraints.

What are the roles of OutputCommitter?

OutputCommitters ensures that jobs and tasks succeed or fails cleanly.
When a job starts, output committer performs job initialization like creating output directory and temporary working space for task output.
When job succeeds, output committer deletes the temporary working space and creates the _SUCCESS marker to indicate successful completion of job. Output files are moved to final destination folder.
When job fails, output committer deletes the temporary working space and makes sure job stops cleanly.
In case of speculative jobs or multiple task attempts, output committer makes sure only files of successful task be promoted to final output directory. The other failed tasks will have their files deleted.

What constitutes progress in mapreduce?

Reading an input record (in a mapper or reducer)
Writing an output record (in a mapper or reducer)
Setting the status description (via Reporter’s or TaskAttemptContext’s setStatus() method)
Incrementing a counter (using Reporter’s incrCounter() method or Counter’s increment() method)
Calling Reporter’s or TaskAttemptContext’s progress() method

How status updates are propagated through the MapReduce system?

The task reports its progress and status (including counters) back to its application master, which has an aggregate view of the job, every three seconds over the umbilical interface. Umbilical interface is the channel through which a child process communicates with the parent process, in this case task communicates with application master.
On the other hand, the client receives the latest status by polling the application master every second.

What is the maximum number of failed task attempts before the whole job is marked as failed?

Application master tries to reschedule the failed task for 4 times by default. If any task fails four times the whole job is marked as fail.

What could be causes of Task failure and how does MapReduce handle them?

Failure in user code throws runtime exception : In this case application master tries to reschedule the task in another node.
Sudden exit of task JVM : In this case application master tries to reschedule the task in another node.
Hanging tasks : The tasks that doesn't report progress to application master are called hanging tasks. By default, if application master doesn't receive progress update for 10 minutes then that task is marked as failed. Application master tries to reschedule the task in another node.

If a particular task is killed for 4 times, will application master mark the whole job as failed?

No. A task attempt may also be killed, which is different from it failing. A task attempt may be killed because it is a speculative duplicate. Killed task attempts do not count against the number of attempts to run the task because it wasn’t the task’s fault that an attempt was killed.

What is the maximum number of attempts allowed for YARN application master before marking it as failed?

By default, two attempts.

What happens when application master fails?

Resource manager starts a new instance of application master in a new container.

What happens when Node Manager fails?

If node manager stops sending heartbeats to Resource Manager for more than 10 minutes, then that particular node manager is marked as failed.

What happens to the successfully completed mappers and reducers in an incomplete job if node manager fails?

All the successful mappers on a failed node will be rerun by the application master on different node because mappers intermediate output is stored on local filesystem.
If reducers have run successfully on failed node manager, they aren't considered for rerun because their output is stored on HDFS and is not lost.

Blacklisting of Node Managers is done by ______________

Application Master

When are node managers blacklisted?

Node managers may be blacklisted if the number of failures for the application is high. Application master will try to reschedule tasks on different nodes if more than three tasks fail on a node manager and this node manager is marked as blacklisted. Please note, the three failed tasks should be part of same application. If three tasks from different applications are failed, the node manager is not marked as blacklisted.

Explain failover controller?

The transition of a resource manager from standby to active is handled by a failover controller. The default failover controller is an automatic one, which uses ZooKeeper to ensure that there is only a single active resource manager at one time.

Define shuffle stage in MapReduce?

The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.

How do reducers know which machines to fetch map output from?

As map tasks complete successfully, they notify their application master using the heartbeat mechanism. Therefore, for a given job, the application master knows the mapping between map outputs and hosts. A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.

How many phases does reduce task have?

Reduce task is divided into three phases - Copy phase, Sort phase and Reduce phase.
Copy Phase - The reduce task starts copying the map outputs as soon as a map task completes. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads. Map outputs are first copied to reducer tasks memory, when full it's spilled to disk.
Sort Phase - When data from all mappers are copied, then sort phase starts. Sort phase merges the map outputs based on the merge factor, maintaining the data's sort ordering. The default merge factor is 10.
Reduce Phase - During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.

What are the default types of Input/Output Key and Values?

Default Input Key class is LongWritable and Input Value class is Text.
Default Output Key class is LongWritable and Output Value class is Text.

How to exclude certain files from the input in MapReduce?

To exclude certain files from the input, you can set a filter using the setInputPathFilter() method on FileInputFormat. FileInputFormat uses a default filter that excludes hidden files.

How to exclude certain files from the input using Hadoop Streaming?

Setting paths is done with the -input option for the Streaming interface. -input option accepts regular expressions to include the required files from the input directory.
The input path /user/root/wordcount contains two files words_include.txt and words_exclude.txt. Below is the command to include the file words_include.txt and exclude words_exclude.txt.

hadoop jar hadoop-streaming-2.7.1.2.4.0.0-169.jar
-input /user/root/wordcount/*include*
-output /user/root/out
-mapper /bin/cat

What is the optimal split size?

By default, split size is same as HDFS block size. If split size is greater than block size, then number of map tasks that are not local to block increases reducing performance.

If reducer doesn't emit output, are the part files still created?

Yes. FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty.

How to avoid creating reducer part files with zero size?

LazyOutputFormat ensures that the output file is created only when the first record is emitted for a given partition.

How is sort order controlled?

If the property mapreduce.job.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used.
Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
If there is no registered comparator, then a RawComparator is used. The RawComparator deserializes the byte streams being compared into objects and delegates to the WritableComparable’s compareTo() method.

What are the types of samplers used in mapreduce?

InputSampler
IntervalSampler
RandomSampler
SplitSampler

What is the difference between MapSide Join and ReduceSide Join?

If the join is performed by the mapper it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join.
For map-side join, each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition.
A reduce-side join is more general than a map-side join, in that the input datasets don’t have to be structured in any particular way, but it is less efficient because both datasets have to go through the MapReduce shuffle.

What is the filesystem used by distcp when transferring data between two clusters running different versions of Hadoop?

HftpFileSystem is recommended to use as a source. Distcp command has to be run from destination server. HFTP is a read-only filesystem.

hadoop distcp hftp://namenodeA:port/data/logs hdfs://namenodeB/data/logs

Map and Reduce functions to find maximum temperature in Python

Sample dataset which has year and temperature.
200935
200942
200912
201040
201020
201015

Map function in Python:

#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
val = line.strip()
(key,value) = (val[0:4],val[4:6])
print "%s\t%s" % (key,value)

Reduce function in Python:

#!/usr/bin/env python

import sys

(prev_year, max_temp) = (None, -sys.maxint)
for line in sys.stdin:
(year,temp) = line.strip().split("\t")
if not prev_year:
prev_year = year
max_temp = temp
if prev_year != year:
print "%s\t%s" % (prev_year,max_temp)
prev_year = year
max_temp = temp
if prev_year == year:
max_temp = max(int(max_temp),int(temp))

print "%s\t%s" % (prev_year,max_temp)

Friday, May 12, 2017

Amazon Athena: Key highlights on Amazon Athena

Amazon Athena is a serverless interactive query service that is used to analyze data in Amazon S3 using standard SQL.

Amazon Athena applies schema-on-read.
Amazon Athena can be used to analyze structured, semi-structured and unstructured datasets.
Athena integrates with Amazon QuickSight for easy visualization.
Amazon Athena uses Presto, a distributed SQL engine to execute the queries.
Apache Hive DDL is used in Athena to define tables.
Amazon Athena can be accessed in either of two ways.

AWS Management Console
JDBC Connection

A user requires User permissions and Amazon S3 permissions to run queries.

Athena User permissions: Amazon IAM provides two managed policies for Athena: AmazonAthenaFullAccess and AWSQuicksightAthenaAccess. These policies can be attached to the user profile to get required permissions.
Athena S3 permissions: To access data from a particular S3 location, a Athena user needs appropriate permissions on S3 buckets.

Amazon Athena supports encryption.
If dataset is encrypted on Amazon S3, a table DDL can have TBLPROPERTIES('has_encrypted_data'='true') to inform Athena that data to read is encrypted.
Amazon Athena stores the query result sets by default in S3 staging directory.
Any settings defined to store encrypted results, will apply to all tables and queries. Configuring settings for individual databases, tables or queries is not possible.
To encrypt query results stored in Amazon S3 using the console, provide the required details in Settings Tab.
Tables can be created in Athena using

AWS console
Using JDBC Driver
using Athena create table wizard.

Databases and Tables are simply logical objects pointing to actual files on Amazon S3.
Athena catalog stores the metadata about the databases and tables.
Athena doesn't support

CREATE TABLE AS SELECT
Transaction based operations.
ALTER INDEX
ALTER TABLE .. ARCHIVE PARTITION
ALTER TABLE ... CLUSTERED BY ..
ALTER TABLE .. EXCHANGE PARTITION
ALTER TABLE .. NOT CLUSTERED
ALTER TABLE ... NOT SKEWED
ALTER TABLE .. RENAME TO
COMMIT
CREATE INDEX
CREATE ROLE
CREATE TABLE LIKE
CREATE VIEW
DESCRIBE DATABASE
INSERT INTO
ROLLBACK
EXPORT/IMPORT TABLE
DELETE FROM
ALTER TABLE .. ADD COLUMNS
ALTER TABLE..REPLACE COLUMNS
ALTER TABLE .. CHANGE COLUMNS
ALTER TABLE ..TOUCH
UDF's are not supported
Stored procedures

Athena limitations:

Operations that change table stats like create,update ,delete tables are ACID compliant.
All tables are EXTERNAL.
Table names are case-insensitive
Table names allow only underscore character, cannot contain any other special character.

Advantages of accessing Athena using JDBC Driver

Using driver Athena can connect to third party applications such as SQL Workbench.
We can run queries programmatically against Athena.

JDBC Driver Options:

s3_staging_dir
query_results_encryption_option
query_results_aws_kms_key
aws_credentials_provider_class
aws_credentails_provider_arguments
max_error_retries
connection_timeout
socket_timeout
retry_base_delay
retry_max_backoff_time
log_path
log_level

JDBC commands:

connection.createStatement();
statement.executeQuery("<query>")

AWS CloudTrail is a service that records AWS API calls and events for AWS accounts. CloudTrail generates encrypted (*.gzip) logfiles and stores them in Amazon S3 in JSON format.
CloudTrail SerDe is used by Athena to read log files generated by CloudTrail in JSON format.
Compression Formats supported:

Snappy
Zlib
GZIP
LZO

Thursday, May 4, 2017

Hadoop Streaming: Specifying Map-Only Jobs

Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

-D mapred.reduce.tasks=0

Hadoop Streaming also supports map-only jobs by specifying "-D mapred.reduce.tasks=0".

To specify map-only jobs, use

hadoop jar hadoop-streaming-2.7.1.2.4.0.0-169.jar
-D mapred.reduce.tasks=0
-input /user/root/wordcount
-output /user/root/out
-mapper /bin/cat

We can also achieve map-only jobs by specifying "-numReduceTasks 0"

hadoop jar hadoop-streaming-2.7.1.2.4.0.0-169.jar
-input /user/root/wordcount
-output /user/root/out
-mapper /bin/cat
-numReduceTasks 0

Hadoop MapReduce job on Hortonworks Sandbox to find maximum temperature on NCDC weather dataset

If you want to setup Hortonworks Sandbox, follow the link
http://hortonworks.com/wp-content/uploads/2012/03/Tutorial_Hadoop_HDFS_MapReduce.pdf

Please follow my another post on how to download weather dataset from NCDC.

http://hadoopsupport.blogspot.com/2017/05/download-national-climatic-data-center.html

I have downloaded a sample weather dataset for the year 2017 on my local filesystem in the directory, /home/pluralsight/mapreduce/weather. You are free to download any number of files and any file.

[root@sandbox weather]# pwd

/home/pluralsight/mapreduce/weather

[root@sandbox weather]# ls -ltr

total 72

-rw-r--r-- 1 root root 70234 2017-05-04 16:59 007026-99999-2017.gz

Upload this file to HDFS

[root@sandbox weather]# hadoop dfs -put /home/pluralsight/mapreduce/weather/007026-99999-2017.gz /home/pluralsight/weather

[root@sandbox weather]# hadoop dfs -ls /home/pluralsight/weather/

Found 1 items

-rw-r--r-- 3 root hdfs 70234 2017-05-04 17:16 /home/pluralsight/weather/007026-99999-2017.gz

Now the file is in HDFS directory. Below is the MapReduce program in java to find the maximum temperature.

Application to find maximum temperature, maxTempDriver.java

------------------------------------------------------------------------------------------------------------------------

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.io.LongWritable;

import java.io.IOException;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

public class maxTempDriver extends Configured implements Tool{

public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

@Override

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{

String line = value.toString();

String year = line.substring(15,19);

int airTemp = Integer.parseInt(line.substring(87,92));

context.write(new Text(year), new IntWritable(airTemp));

}

public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

@Override

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

int maxValue = Integer.MIN_VALUE;

for (IntWritable value: values) {

maxValue = Math.max(maxValue, value.get());

}

context.write(key, new IntWritable(maxValue));

}

public int run(String[] args) throws Exception {

Job job = new Job(getConf(),"Max Temp");

job.setJarByClass(maxTempDriver.class);

job.setMapperClass(MaxTempMapper.class);

job.setReducerClass(MaxTempReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job,new Path(args[1]));

return job.waitForCompletion(true) ? 0 : 1;

}

public static void main(String[] args) throws Exception{

int exitCode = ToolRunner.run(new maxTempDriver(),args);

System.exit(exitCode);

}

-------------------------------------------------------------------------------------------------------------------

From the command line, compile .java program. First, check to see if your CLASSPATH variable is set in your environment.

[root@sandbox maxTemp]# echo $CLASSPATH

[root@sandbox maxTemp]#

If your CLASSPATH is not set to include hadoop jar files required to compile your program, you can set them directly in the command. We need hadoop-common*.jar and hadoop-mapreduce-client-core*.jar files.

You can use find command to get the absolute path of these jar files in your sandbox.

[root@sandbox maxTemp]# find / -name hadoop-common*.jar

[root@sandbox maxTemp]# find / -name hadoop-mapreduce*.jar

For example in my environment hadoop-common*.jar file is present in /usr/hdp/2.4.0.0-169/hadoop/hadoop-common.jar. Please include the relative paths of these jar files in your javac command as per your find results.

[root@sandbox maxTemp]# javac -classpath /usr/hdp/2.4.0.0-169/hadoop/hadoop-common.jar:/usr/hdp/2.4.0.0-169/hadoop/client/hadoop-mapreduce-client-core.jar maxTempDriver.java

This command creates three class files, one for driver, one for mapper and one for reducer classes.

[root@sandbox maxTemp]# ls -ltr

total 20

-rw-r--r-- 1 root root 2254 2017-05-04 15:55 maxTempDriver.java

-rw-r--r-- 1 root root 1665 2017-05-04 15:58 maxTempDriver$MaxTempMapper.class

-rw-r--r-- 1 root root 1707 2017-05-04 15:58 maxTempDriver$MaxTempReducer.class

-rw-r--r-- 1 root root 1689 2017-05-04 15:58 maxTempDriver.class

For a start in distributed setting, job's classes must be packaged into a job JAR file to send to cluster. Creating a JAR file can be conveniently achieved by

[root@sandbox maxTemp]# jar -cvf maxTempDriver.jar *.class

This command packages all .class files into maxTempDriver.jar JAR file.

[root@sandbox maxTemp]# ls -ltr

total 20

-rw-r--r-- 1 root root 2254 2017-05-04 15:55 maxTempDriver.java

-rw-r--r-- 1 root root 1665 2017-05-04 15:58 maxTempDriver$MaxTempMapper.class

-rw-r--r-- 1 root root 1707 2017-05-04 15:58 maxTempDriver$MaxTempReducer.class

-rw-r--r-- 1 root root 1689 2017-05-04 15:58 maxTempDriver.class

-rw-r--r-- 1 root root 3038 2017-05-04 16:02 maxTempDriver.jar

To launch the job, we need to run the driver. Driver picks the cluster information by default from conf/core-site.xml file.

[root@sandbox wordCount]# hadoop jar maxTempDriver.jar maxTempDriver --libjars maxTempDriver.jar /home/pluralsight/weather /home/pluralsight/out

Here is the stats summary of map and reduce jobs.

17/05/04 17:46:41 INFO mapreduce.Job: map 100% reduce 100%

17/05/04 17:46:41 INFO mapreduce.Job: Job job_1493910685734_0010 completed successfully

17/05/04 17:46:41 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read=43863

FILE: Number of bytes written=354815

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=70380

HDFS: Number of bytes written=9

HDFS: Number of read operations=6

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=4220

Total time spent by all reduces in occupied slots (ms)=4377

Total time spent by all map tasks (ms)=4220

Total time spent by all reduce tasks (ms)=4377

Total vcore-seconds taken by all map tasks=4220

Total vcore-seconds taken by all reduce tasks=4377

Total megabyte-seconds taken by all map tasks=1055000

Total megabyte-seconds taken by all reduce tasks=1094250

Map-Reduce Framework

Map input records=3987

Map output records=3987

Map output bytes=35883

Map output materialized bytes=43863

Input split bytes=146

Combine input records=0

Combine output records=0

Reduce input groups=1

Reduce shuffle bytes=43863

Reduce input records=3987

Reduce output records=1

Spilled Records=7974

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=93

CPU time spent (ms)=2300

Physical memory (bytes) snapshot=361009152

Virtual memory (bytes) snapshot=1722482688

Total committed heap usage (bytes)=265814016

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=70234

File Output Format Counters

Bytes Written=9

After successful completion of mapreduce job, output is saved by OutputCommitter on HDFS in the directory /home/pluralsight/out. Lets have a look at the output,

[root@sandbox wordCount]# hadoop dfs -cat /home/pluralsight/out2/*

2017 290

Since, we analyzed weather dataset for 2017 year we have single record in the output file. You can include multiple files from different years to get maximum temperature by year. Please note, maximum temperature for 2017 is shows as 290 degress which is unusual. If we include the logic to check the valid temperatures in our code, we can catch these scenarios.

Hope this helps.

Learning Big Data

Thursday, July 25, 2024

Amazon S3: Basic Concepts

Resources for Hadoop Questions

Hadoop Interview Questions and Answers

Map and Reduce functions to find maximum temperature in Python

Friday, May 12, 2017

Amazon Athena: Key highlights on Amazon Athena

Thursday, May 4, 2017

Hadoop Streaming: Specifying Map-Only Jobs

Hadoop MapReduce job on Hortonworks Sandbox to find maximum temperature on NCDC weather dataset

Amazon S3: Basic Concepts

Total Pageviews

Search This Blog