The Hadoop framework comprises two main components: HDFS (the Hadoop Distributed File System) for storage and MapReduce for processing.
In this post we will discuss the anatomy of a MapReduce job in Apache Hadoop. A typical Hadoop MapReduce job is divided into a set of Map and Reduce tasks that execute on a Hadoop cluster. The execution flow occurs as follows:
- The input data is divided into splits, and one map task is scheduled for each split.
- Each map task processes its split and emits intermediate key-value pairs.
- The intermediate pairs are partitioned, sorted, and shuffled to the reduce tasks.
- Each reduce task aggregates the values for its keys and writes the final output to HDFS.
Let's concentrate on the Map and Reduce phases in this blog. (We will review the input data splitting and shuffle process in detail in future posts.)
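Before we get to the example job, here is a minimal sketch of what these two phases look like in code, using Hadoop's org.apache.hadoop.mapreduce API. This is a generic word-count pair; the class names TokenMapper and SumReducer are illustrative, not part of Hadoop itself:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each map task receives one input split and emits an
// intermediate (word, 1) pair for every token it sees.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: after the shuffle groups values by key, each reduce
// task sums the counts for one word and writes the final pair to HDFS.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total));
    }
}
```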
Let us look at a simple MapReduce job execution using one of the bundled sample programs, "teragen", in CDH3. This program generates a large amount of data for benchmarking clusters, and it is available in the Cloudera CDH3 Quick Demo VM.
The size of the data to be generated and the output location are passed as arguments to the "teragen" program, which then runs a MapReduce job to generate the data. We will analyze this MapReduce job's execution.
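For example, an invocation on the CDH3 VM looks roughly like this (the examples-jar path and the output directory are assumptions; adjust them for your installation):

```
hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen 10000 /user/cloudera/teragen-out
```

Here 10000 is the number of rows to generate (teragen writes 100-byte rows, so this produces about 1 MB), and the last argument is the HDFS directory where the output is written.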
The generated data is stored in this output location on HDFS. The following figure shows the execution process and all the intermediate phases of a MapReduce job execution:
Let's review the execution log to understand the job's execution flow:
The "teragen" program launches two map tasks and three reduce tasks to generate the required data.
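These counts are not fixed: the number of map tasks is normally derived from the input splits (for a generator like teragen it comes from a configuration hint such as mapred.map.tasks), while the number of reduce tasks is whatever the job requests. A minimal CDH3-era driver sketch, assuming the illustrative TokenMapper and SumReducer classes from above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(3); // the job, not the framework, fixes the reduce-task count
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits decide the map count
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```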
Note that the reduce tasks start only after the map tasks complete, and the number of records shrinks at each stage of the pipeline.
Got a question for us? Mention it in the comments section and we will get back to you.