InterviewSolution
| 1. |
How does a map task partition the output in the case of multiple reducers? |
|
Answer» For large data sets, it is advisable to use more than one reducer. With multiple reducers, the thread that spills map output to disk first divides the data into partitions, one per reducer. Within each partition, the data is sorted by key in memory. A combiner, if one is configured, is then applied to the sorted output. Finally, each partition is transferred to its corresponding reducer. Partitioning ensures that all values with the same key go to the same reducer, where they are grouped together. The default partitioner in a MapReduce job is HashPartitioner, which computes a hash value for the key and assigns the partition based on the result. This usually spreads keys fairly evenly across reducers, but it does not guarantee an even load: if a few keys carry most of the records, the reducers that receive them get more work. Care must therefore be taken to ensure that the partitioning logic sends data evenly to the reducers. With a sub-optimal design, some reducers have more work to do than others, and the entire job waits for the slowest reducer to finish its extra share of the load. |
|
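The partition-then-sort step described in the answer can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Hadoop code: the function names (`java_string_hash`, `hash_partition`, `partition_map_output`) are hypothetical, and the hash mimics Java's `String.hashCode()` as a stand-in, whereas real Hadoop key types such as `Text` define their own `hashCode()`. The formula in `hash_partition` (mask the sign bit, then take the remainder by the number of reducers) follows the approach used by the default hash partitioner.

```python
from collections import defaultdict


def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() for simple text (assumption: a stand-in
    for whatever hashCode() the real key type provides)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    # Reinterpret the 32-bit value as a signed integer.
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a partition: clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def partition_map_output(records, num_reducers):
    """Group (key, value) records by partition, then sort each partition by key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash_partition(key, num_reducers)].append((key, value))
    # In-memory sort within each partition, as in the spill step.
    return {p: sorted(kvs, key=lambda kv: kv[0]) for p, kvs in partitions.items()}


if __name__ == "__main__":
    # Hypothetical map output: "apple" is a hot key that skews one partition.
    records = [("apple", 1)] * 5 + [("banana", 1), ("cherry", 1)]
    for partition, kvs in sorted(partition_map_output(records, 3).items()):
        print(partition, len(kvs), kvs)
```

Running the demo shows that every record for a given key lands in the same partition, and that a hot key such as `"apple"` makes its partition larger than the others, which is the skew the answer warns about.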