Hadoop Map Reduce Process

MapReduce Details

[Figure: MapReduce job run diagram]

Components:
• The client
Submits the MapReduce job.
• The jobtracker
Coordinates the job run.
• The tasktrackers
A job is a collection of tasks (the work is split into several tasks); the tasktrackers are responsible for running them.
• HDFS
The distributed filesystem used by Hadoop. Input files, intermediate job files, and output files are placed on HDFS when the job runs in pseudo-distributed or fully distributed mode.

Job Submission (JobClient)

1) The JobClient's submitJob() method asks the jobtracker for a new job ID by calling its getNewJobId() method.

2) To run, a job needs resources: the job JAR file, configuration files, and the input splits (the input is split into pieces, and each map task runs over one split). All of these resources are copied to the jobtracker's filesystem, into a directory named after the job ID.
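
For illustration, here is a minimal sketch of submitting a classic (MRv1) job through the old org.apache.hadoop.mapred API. The WordCountMapper and WordCountReducer classes are hypothetical placeholders, not part of this post; JobClient.runJob() performs the submitJob() call described above and then waits for the job to finish.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitExample.class); // locates the job JAR
        conf.setJobName("submit-example");
        conf.setMapperClass(WordCountMapper.class);      // hypothetical mapper
        conf.setReducerClass(WordCountReducer.class);    // hypothetical reducer
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // calls submitJob() internally and waits for completion
    }
}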

Job Initialization

1) The jobtracker puts the job into an internal queue.

2) The job scheduler picks it up and initializes it.

3) The job scheduler retrieves the input splits. One map task is created per input split, and each task is assigned its own ID.

4) The number of reduce tasks is determined by the mapred.reduce.tasks property.
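
As a sketch, the reduce-task count can be set either through the JobConf API or as the raw property (conf here is the hypothetical JobConf from the submission example above):

conf.setNumReduceTasks(4);            // equivalent to setting mapred.reduce.tasks to 4
conf.set("mapred.reduce.tasks", "4"); // the raw property form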

Task Assignment

1) Tasktrackers periodically send a heartbeat to the jobtracker to signal that they are alive.

2) The heartbeat also conveys whether the tasktracker is ready to run a new task; if it is, the jobtracker allocates one.

3) A tasktracker has a fixed number of slots for map tasks and for reduce tasks. With the default scheduler, map tasks get priority over reduce tasks.

4) Data-local task: the task runs on the same node that its input split resides on.

5) Rack-local task: the task and its input split are on different nodes, but in the same rack.

It is possible for a task to be neither data-local nor rack-local; it must then fetch its input split from a node in a different rack (see the sketch below).
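
The three locality cases can be summarised in a small illustrative method. This is not Hadoop's actual scheduler code, just a sketch of the decision it makes:

public class LocalityCheck {
    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    // Classify a task's placement relative to its input split.
    static Locality classify(String taskNode, String taskRack,
                             String splitNode, String splitRack) {
        if (taskNode.equals(splitNode)) return Locality.DATA_LOCAL; // same node as the split
        if (taskRack.equals(splitRack)) return Locality.RACK_LOCAL; // same rack, different node
        return Locality.OFF_RACK; // neither data-local nor rack-local
    }

    public static void main(String[] args) {
        System.out.println(classify("node1", "rack1", "node1", "rack1")); // DATA_LOCAL
        System.out.println(classify("node2", "rack1", "node1", "rack1")); // RACK_LOCAL
        System.out.println(classify("node3", "rack2", "node1", "rack1")); // OFF_RACK
    }
}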

Task Execution

1) Once a tasktracker has been assigned a task, it needs the resources to run it, so it copies the job JAR from the shared filesystem to the tasktracker's local filesystem. It creates a local working directory for the task and extracts the contents of the JAR into that directory.

2) It then creates a TaskRunner instance to run the task. The TaskRunner launches a new Java Virtual Machine for each task; the reason for a separate JVM is that a bug in the user's MapReduce code should not be able to affect the tasktracker.
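
In Hadoop 1.x the one-JVM-per-task behaviour is tunable via the mapred.job.reuse.jvm.num.tasks property; a sketch on the same hypothetical conf object:

// 1 (the default) launches a fresh JVM per task; -1 reuses the JVM
// for any number of tasks of the same job.
conf.set("mapred.job.reuse.jvm.num.tasks", "1");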

Job Completion

1) The JobClient polls the jobtracker for status. As soon as the jobtracker receives notification that the last task has completed, it announces the job complete.
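
A minimal polling sketch using the classic JobClient API, assumed to run inside a method that declares throws Exception and to continue the submission example above:

JobClient client = new JobClient(conf);
RunningJob running = client.submitJob(conf);
while (!running.isComplete()) {
    Thread.sleep(5000); // ask the jobtracker for status every five seconds
}
System.out.println("Job " + (running.isSuccessful() ? "succeeded" : "failed"));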

Task Failure

1) If the user's MapReduce code throws a runtime exception, the task attempt fails.

2) The tasktracker marks the task attempt as failed, freeing up a slot to run another task.
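
A failed attempt is normally rescheduled on another tasktracker; in Hadoop 1.x the retry budget is bounded by two job properties, both defaulting to 4. A sketch on the same hypothetical conf:

conf.set("mapred.map.max.attempts", "4");    // give up on a map task after 4 failed attempts
conf.set("mapred.reduce.max.attempts", "4"); // likewise for reduce tasks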

Tasktracker Failure

1) A tasktracker may crash or fail; when that happens it stops sending heartbeats to the jobtracker.

2) The jobtracker checks the mapred.tasktracker.expiry.interval property, expressed in milliseconds.

If it receives no heartbeat within that interval, it concludes that the tasktracker has failed and removes it from its pool of tasktrackers to schedule tasks on.
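
A sketch of that setting; note it is a cluster-side property, normally placed in mapred-site.xml on the jobtracker rather than set per job, and its default is 10 minutes:

conf.setLong("mapred.tasktracker.expiry.interval", 600000L); // 10 minutes, in milliseconds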

Jobtracker Failure

1) Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has no mechanism for dealing with failure of the jobtracker.

Shuffle and Sort

1) The output of the map tasks is the input to the reduce tasks; the process of transferring map output to the reducers is called the shuffle.

2) Map task output is not written directly to disk. Each map task has a circular memory buffer, 100 MB by default; when the buffer reaches 80% of its capacity, its contents are spilled to disk. Spill files are created under the directory given by the mapred.local.dir property, in a job-specific subdirectory.

3) Before each spill, a background thread divides the data into partitions (one per reducer), sorts each partition, and runs the output through the combiner class if one is configured. The resulting map output files are made available to the reducers over HTTP. Each reducer has copier threads that fetch the map data; the number of copier threads is configurable via the mapred.reduce.parallel.copies property.
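
The buffer, spill, combiner, and copier knobs above correspond to Hadoop 1.x configuration; a sketch on the same hypothetical conf, reusing the placeholder WordCountReducer as a combiner:

conf.setCombinerClass(WordCountReducer.class);  // mini-reduce over spilled map output
conf.set("io.sort.mb", "100");                  // circular buffer size in MB (default 100)
conf.set("io.sort.spill.percent", "0.80");      // spill at 80% full (the default)
conf.set("mapred.reduce.parallel.copies", "5"); // reducer copier threads (default 5)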

4) How do reducers know which tasktrackers to fetch map output from?

The notification travels through the regular heartbeat:

map task completes -> tasktracker -> jobtracker

A thread in the reducer then periodically asks the jobtracker for map output locations.
