1.	What is a map side join? How is it different from a reduce side join?
Answer»

A join operation combines two or more datasets on a common key. In MapReduce, joins are of two types: map side joins and reduce side joins.

A map side join is one in which the join between two tables is performed in the Map phase, without the involvement of the Reduce phase. It can be used when one of the datasets is much smaller than the other and can easily be stored in the DistributedCache. The small dataset is typically added to the DistributedCache when the job is configured and then loaded into memory in the setup() method of the Mapper, so that every map task can look up matching records locally. Since the join is performed in the map phase itself, it avoids the cost of sorting and merging data in the shuffle and reduce phases, thereby improving the performance of the task.
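The idea can be illustrated with a minimal sketch in plain Python. It does not use the Hadoop API; it only simulates the pattern. The function names (load_small_table, map_side_join) and the sample records are hypothetical, chosen for illustration: load_small_table stands in for what a Mapper would do in setup(), and map_side_join stands in for the map() calls over one input split of the large dataset.

```python
from typing import Dict, Iterable, Iterator, Tuple


def load_small_table(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Stand-in for Mapper.setup(): build an in-memory lookup from the small dataset."""
    return dict(rows)


def map_side_join(
    large_split: Iterable[Tuple[str, str]],
    lookup: Dict[str, str],
) -> Iterator[Tuple[str, str, str]]:
    """Stand-in for map(): join each large-dataset record against the lookup.

    No shuffle or reduce step is involved; each record is joined where it is read.
    """
    for key, value in large_split:
        match = lookup.get(key)
        if match is not None:  # inner join: drop records with no matching key
            yield key, value, match


# Hypothetical sample data: a small "departments" table and one split of "employees".
departments = [("d1", "Engineering"), ("d2", "Sales")]
employees_split = [("d1", "Alice"), ("d2", "Bob"), ("d3", "Carol"), ("d1", "Dave")]

lookup = load_small_table(departments)
map_joined = list(map_side_join(employees_split, lookup))
print(map_joined)
```

Because the lookup table is held in memory by every map task, this approach only works while the smaller dataset fits comfortably in memory.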

A reduce side join, on the other hand, works well when both datasets are large. Here the reducer is responsible for performing the join operation. It is comparatively simpler to implement than the map side join, because the sorting and shuffling phase sends all values having identical keys to the same reducer, so the data arrives at the reducer already organized by join key. However, the I/O cost is much higher due to the data movement involved in the sorting and shuffling phase.
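A reduce side join can be sketched in the same way. Again this is plain Python that simulates the phases, not Hadoop code. The function names (map_phase, shuffle, reduce_phase), the "L"/"R" source tags, and the sample records are assumptions made for illustration. Tagging each record with its source dataset is one common way to let the reducer tell the two inputs apart.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Iterator, List, Tuple

Tagged = Tuple[str, str]  # (source tag, value)


def map_phase(records: Iterable[Tuple[str, str]], tag: str) -> Iterator[Tuple[str, Tagged]]:
    """Emit (join key, (source tag, value)) so the reducer knows each record's origin."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(mapped: Iterable[Tuple[str, Tagged]]) -> Dict[str, List[Tagged]]:
    """Simulate sort and shuffle: all values with the same key end up together."""
    groups: Dict[str, List[Tagged]] = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, tagged_values: List[Tagged]) -> Iterator[Tuple[str, str, str]]:
    """Join within one key group: pair every left value with every right value."""
    left = [v for t, v in tagged_values if t == "L"]
    right = [v for t, v in tagged_values if t == "R"]
    for l_val, r_val in product(left, right):
        yield key, l_val, r_val


# Hypothetical sample data: both inputs are treated as "large" and go through the shuffle.
customers = [("c1", "Alice"), ("c2", "Bob"), ("c3", "Carol")]
orders = [("c1", "order-100"), ("c2", "order-101"), ("c1", "order-102")]

mapped = list(map_phase(customers, "L")) + list(map_phase(orders, "R"))
grouped = shuffle(mapped)
reduce_joined = [row for key, values in grouped.items() for row in reduce_phase(key, values)]
print(reduce_joined)
```

Every record from both inputs passes through the shuffle step here, which is where the higher I/O cost of the reduce side join comes from.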
