1.

In a map-reduce job, under what scenario does a combiner get triggered? What are the various options to reduce the shuffling of data in a map-reduce job?

Answer

The map-reduce framework doesn’t guarantee that the combiner will be executed for every job run. The combiner is executed at each buffer spill: during a spill, the thread writing data to disk first divides the data into partitions corresponding to the number of reducers. Within each partition, the thread performs an in-memory sort on the data and then applies the combiner function (if one is configured) to the sorted output.
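A minimal sketch of how a combiner is attached to a job, assuming a word-count style workload where the reduce logic (summing counts) is associative and commutative, so the same class can be reused as the combiner. The class names TokenizerMapper and IntSumReducer are illustrative placeholders, not something stated in the question.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count with combiner");
          job.setJarByClass(WordCountDriver.class);

          job.setMapperClass(TokenizerMapper.class);   // assumed mapper class
          // The combiner runs on map-side spill output; the framework may run it
          // zero, one, or many times, so it must not change the final result.
          job.setCombinerClass(IntSumReducer.class);   // assumed reducer class reused as combiner
          job.setReducerClass(IntSumReducer.class);

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

Because the framework may skip the combiner entirely, correctness must never depend on it running; it is purely an optimization of the shuffle.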

 Various ways to reduce data shuffling in a map-reduce job are:

  • Use a combiner to perform associative and commutative operations on the mapper output, which reduces the amount of data shuffled to the reducers.
  • If the data size is huge, a single reducer is not a good idea; an optimal number of reducers should be chosen instead.
  • Compress the mapper output. This reduces the amount of data written to disk and transferred to the reducers, and hence the shuffling of data.
  • Increase the sort buffer size used by mappers, which reduces the number of spills to disk. This is controlled by the property mapreduce.task.io.sort.mb.
  • Leverage the cleanup() method of the Mapper. This method is called once per mapper task and can be used to perform associative and commutative aggregation on the output of the map() function (see the sketches after this list).
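
Two hedged sketches of the options above. The first shows driver-side configuration for the reducer count, mapper output compression, and the sort buffer size; the property names are the standard Hadoop 2.x/3.x ones, while the codec choice and the values 256 MB and 8 reducers are illustrative.

  Configuration conf = new Configuration();
  // Compress intermediate (mapper) output before it is written to disk and shuffled.
  conf.setBoolean("mapreduce.map.output.compress", true);
  conf.set("mapreduce.map.output.compress.codec",
           "org.apache.hadoop.io.compress.SnappyCodec");
  // A larger in-memory sort buffer (value in MB) means fewer spills to disk.
  conf.setInt("mapreduce.task.io.sort.mb", 256);

  Job job = Job.getInstance(conf, "shuffle-tuned job");
  // Choose an explicit reducer count instead of relying on a single reducer.
  job.setNumReduceTasks(8);

The second sketch illustrates in-mapper combining with cleanup(): counts are aggregated in memory per mapper task and emitted once, after all map() calls have finished. The class name InMapperCountMapper is hypothetical.

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class InMapperCountMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

      private final Map<String, Integer> counts = new HashMap<>();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
          // Aggregate locally instead of emitting one record per token.
          for (String token : value.toString().split("\\s+")) {
              counts.merge(token, 1, Integer::sum);
          }
      }

      @Override
      protected void cleanup(Context context)
              throws IOException, InterruptedException {
          // Called once per mapper task: emit the pre-aggregated counts.
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
          }
      }
  }

Unlike a combiner, this aggregation happens before records reach the sort buffer at all, at the cost of holding the partial counts in mapper memory.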

