1.
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>48</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>what is the correct value here</value>
</property>
Answer» Basically, "mapreduce.task.io.sort.mb" (historically io.sort.mb) is the total amount of buffer memory to use while sorting files, expressed in megabytes. Tune the io.sort.mb value so that the number of spilled records equals, or comes as close as possible to, the number of map output records.

A MapReduce job guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and then transfers the mapper output to the reducers as input is known as the shuffle. In a MapReduce job, the shuffle is an area of the code where fine-tuning and improvements are continually being made; in many ways, the shuffle is the heart of the job.

When the map function starts producing output, the process takes advantage of buffering: output is written to memory, and some presorting is done there for efficiency. Each map task has a circular memory buffer that it writes its output to. The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold (io.sort.spill.percent, 0.8 or 80% by default), a background thread starts to spill the contents to disk. Mapper output continues to be written to the buffer while the spill takes place, but if the buffer fills up during this time the map blocks until the spill is complete. Spills are written in a round-robin fashion to a subdirectory of the directories specified by the mapred.local.dir property.

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there can be several spill files before the task is finished. The spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; its default value is 10.

To briefly describe how io.sort.factor works: while the mapper task is running it continuously writes data into the buffer, and the parameter io.sort.spill.percent indicates the point at which data starts being written to disk instead of the buffer that is filling up. All of this spilling to disk is done in a separate thread so that the map can continue running. There may be multiple spill files on the task tracker after the map task finishes, and those files have to be merged into one single sorted file per partition, which is then fetched by a reducer. The property io.sort.factor says how many of those spill files will be merged into one file at a time.
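As an illustrative sketch (not a definitive recommendation), the knobs described above can be set in mapred-site.xml under their current names, which replace the older io.sort.mb, io.sort.spill.percent, and io.sort.factor; the 256 MB buffer below is an example value and assumes the map task heap is large enough to hold it:

<!-- mapred-site.xml: example sort/spill tuning; values are illustrative -->
<property>
  <!-- in-memory sort buffer per map task, in MB (default 100) -->
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <!-- fraction of the buffer that triggers a background spill to disk (default 0.80) -->
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>
<property>
  <!-- maximum number of spill streams merged at once (default 10) -->
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
</property>

After a run, comparing the job counters "Map output records" and "Spilled Records" shows whether the buffer is sized well: when it is, the spilled-record count equals (or is very close to) the map output record count, and no extra merge passes are needed.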