Map Side Join Vs. Join

Last updated on May 22, 2019

edureka.co

In this blog, we shall discuss Map-side join and its advantages over the normal join operation in Hive. This is an important concept that you’ll need to learn to implement your Big Data Hadoop Certification projects. But before getting into this, we should first understand the concept of ‘Join’ and what happens internally when we perform a join in Hive.

Join is a clause that combines the records of two tables (or data sets).
Assume that we have two tables, A and B. When we perform a join operation on them, it returns records that combine the columns of A and B for the rows whose join keys match.

Now let us understand the functionality of a normal join with an example.

Whenever we apply a join operation, the job is assigned to a MapReduce task that consists of two stages: a ‘Map stage’ and a ‘Reduce stage’. A mapper’s job during the Map stage is to read the data from the join tables and to write each ‘join key’ and ‘join value’ pair into an intermediate file. Then, in the shuffle stage, this intermediate file is sorted and merged. The reducer’s job during the Reduce stage is to take this sorted result as input and complete the join.
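To make these three stages concrete, here is a minimal single-process sketch in Python. It is not how Hive or Hadoop is implemented; it only imitates the map, shuffle and reduce steps described above. The sample rows, column layout and function names are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; the tables from the original post are not reproduced here.
EMPLOYEES = [(1, "Asha", 10), (2, "Ben", 20), (3, "Chen", 10)]  # (emp_id, name, dept_id)
DEPARTMENTS = [(10, "Sales"), (20, "Engineering")]              # (dept_id, dept_name)


def map_stage(employees, departments):
    """Emit (join_key, tagged_value) pairs, as the mappers would."""
    for _emp_id, name, dept_id in employees:
        yield dept_id, ("emp", name)
    for dept_id, dept_name in departments:
        yield dept_id, ("dept", dept_name)


def shuffle_stage(pairs):
    """Sort the intermediate pairs by join key and group the values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _key, value in group]


def reduce_stage(grouped):
    """For each key, pair every employee value with every department value."""
    for _key, values in grouped:
        names = [v for tag, v in values if tag == "emp"]
        depts = [v for tag, v in values if tag == "dept"]
        for name in names:
            for dept in depts:
                yield name, dept


def reduce_side_join(employees, departments):
    """Run the three stages in order and collect the joined rows."""
    return list(reduce_stage(shuffle_stage(map_stage(employees, departments))))


if __name__ == "__main__":
    for row in reduce_side_join(EMPLOYEES, DEPARTMENTS):
        print(row)
```

Notice that every row of both tables passes through the sort in the shuffle step. That is the cost a map-side join tries to avoid.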

 

How will the map-side join optimize the task?

Assume that we have two tables, one of which is small. When we submit the join, a MapReduce local task is created before the original join MapReduce task. This local task reads the data of the small table from HDFS and stores it in an in-memory hash table. After reading, it serializes the in-memory hash table into a hash table file.
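The work of the local task can be sketched in a few lines of Python. This is only an illustration under assumptions: the small table is a hypothetical list of (dept_id, dept_name) rows, `pickle` stands in for whatever serialization format Hive actually uses, and a temporary directory stands in for the real file location.

```python
import os
import pickle
import tempfile


def build_hash_table(small_rows, key_index=0):
    """Read the small table once and index its rows by join key."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key_index], []).append(row)
    return table


def write_hash_table_file(table, path):
    """Serialize the in-memory hash table into a file."""
    with open(path, "wb") as f:
        pickle.dump(table, f)


# Hypothetical small table: (dept_id, dept_name)
departments = [(10, "Sales"), (20, "Engineering")]

hash_table = build_hash_table(departments)
hash_table_path = os.path.join(tempfile.mkdtemp(), "dept.hashtable")
write_hash_table_file(hash_table, hash_table_path)
```

A dictionary keyed by the join column is used so that each later lookup is a constant-time probe instead of a scan of the small table.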

In the next stage, when the original join MapReduce task runs, it moves the hash table file to the Hadoop distributed cache, which copies the file to each mapper’s local disk. Each mapper can then load this persistent hash table file back into memory and do the join work in the map stage itself. The execution flow of the optimized map join is shown in the figure below. After optimization, the small table needs to be read just once. Also, if multiple mappers are running on the same machine, the distributed cache only needs to push one copy of the hash table file to that machine.
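The mapper side of this flow can be imitated as follows. This is a simplified sketch, not Hive's implementation: a temporary directory plays the role of the mapper's local disk, the pickled dictionary plays the role of the hash table file, and the employee rows and split boundaries are made up for illustration.

```python
import os
import pickle
import tempfile


def load_hash_table(path):
    """Load the persisted hash table file back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


def map_join_mapper(big_table_split, hash_table):
    """Join one split of the big table by probing the in-memory hash table.

    The joined rows are produced directly by the mapper, so no shuffle
    or reduce stage is needed. Rows without a matching key are dropped.
    """
    for _emp_id, name, dept_id in big_table_split:
        for _dept_id, dept_name in hash_table.get(dept_id, []):
            yield name, dept_name


# Setup standing in for the local task and the distributed cache (hypothetical data).
cached_file = os.path.join(tempfile.mkdtemp(), "dept.hashtable")
with open(cached_file, "wb") as f:
    pickle.dump({10: [(10, "Sales")], 20: [(20, "Engineering")]}, f)

# Each inner list stands for the split handled by one mapper: (emp_id, name, dept_id)
splits = [
    [(1, "Asha", 10), (2, "Ben", 20)],
    [(3, "Chen", 10), (4, "Dana", 30)],
]

joined = []
for split in splits:
    hash_table = load_hash_table(cached_file)  # every mapper loads the same file
    joined.extend(map_join_mapper(split, hash_table))
```

Only the small table is held in memory; the big table is streamed row by row, which is why this approach depends on the small table fitting in each mapper's memory.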

Advantages of using map side join:

Disadvantages of Map-side join:

Simple Example for Map Reduce Joins:

Let us create two tables:

Create two input files as shown in the following image to load the data into the tables created. 

employee.txt

dept.txt

Now, let us load the data into the tables.

Let us perform the Map-side Join on the two tables to extract the list of departments in which each employee is working.

Here, the second table, dept, is the small table. Remember, the number of departments in an organization is almost always smaller than the number of employees.

Now let’s perform the same task with a normal Reduce-side join.

While executing both the joins, you can find two differences:

Hence, a Map-side join is your best bet for completing the job in a short span of time when one of the tables is small enough to fit in memory.

In a real-world environment, you will have data sets with huge amounts of data, so performing analysis and retrieving the data with a normal join can be time consuming. If one of the data sets is small, a Map-side join will help to complete the job in less time.
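The "small enough to fit in memory" rule can be expressed as a simple size check. The sketch below is purely illustrative: the function name, the threshold values and the sample file are assumptions, and file size on disk is only a rough proxy for the memory the hash table will need. Hive has its own size-based settings for this, such as hive.auto.convert.join and hive.mapjoin.smalltable.filesize, and this code is not a description of Hive's logic.

```python
import os
import tempfile


def choose_join_strategy(small_table_path, max_hash_table_bytes):
    """Pick a join strategy from the on-disk size of the smaller table.

    max_hash_table_bytes is a hypothetical limit chosen by the caller.
    """
    size = os.path.getsize(small_table_path)
    return "map-side" if size <= max_hash_table_bytes else "reduce-side"


# Hypothetical input file standing in for the small table.
sample_path = os.path.join(tempfile.mkdtemp(), "dept.txt")
with open(sample_path, "w") as f:
    f.write("10,Sales\n20,Engineering\n")

if __name__ == "__main__":
    print(choose_join_strategy(sample_path, max_hash_table_bytes=1024))
```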

References:
https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919
