InterviewSolution
1. What is the problem of having lots of small files in HDFS? What is the remediation plan?
Answer» In HDFS, files are stored under directories that are usually named after the application producing them. A "small file" is one that is significantly smaller than the HDFS block size; for example, if the block size is 64 MB or 128 MB, any file well below that size is considered small. Having a large number of small files causes problems at both the HDFS level and the MapReduce level.

At the HDFS level, the NameNode keeps metadata in memory for every file, directory, and block, and each of these objects occupies roughly 150 bytes. With 1 million small files, each fitting in (at most) one block, that is about 2 million objects (one file entry plus one block entry per file), or roughly 300 MB of NameNode memory just for metadata. As the number of small files grows, NameNode memory becomes the bottleneck, the cluster eventually reaches the limit of the current hardware, and performance degrades.

At the MapReduce level, one mapper is launched per input split, which by default corresponds to roughly one block. A large number of small files therefore launches a correspondingly large number of mappers, each processing only a small chunk of data, so task-startup overhead dominates and total processing time increases. Reading and writing many small files also requires many disk seeks, and seeks are expensive operations. Since Hadoop is designed to stream over large datasets, it is best to minimize seeks by working with large files.

Remediation plan:
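Common remediations are to merge small files before they are ingested into HDFS, to pack them into a container format such as a SequenceFile or a Hadoop Archive (HAR), or to use CombineFileInputFormat so that a single mapper reads many small files. As a minimal sketch (the input and output paths are hypothetical, and plain byte contents are assumed), the Java program below packs all files of one directory into a single SequenceFile, using the original file name as the key and the file's raw bytes as the value:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: consolidate the small files of one HDFS directory into a single
// SequenceFile so the NameNode tracks one large file (and its few blocks)
// instead of millions of tiny files. The paths below are hypothetical examples.
public class SmallFilePacker {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/myapp/small-files"); // hypothetical source directory
        Path packed = new Path("/data/myapp/packed.seq");    // hypothetical consolidated output

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue; // skip sub-directories
                }
                // Read the whole small file into memory (safe only because the files are small).
                byte[] contents = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                // Key = original file name, value = the file's raw bytes.
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
            }
        }
    }
}
```

Downstream MapReduce jobs can then read the packed file with SequenceFileInputFormat. Alternatively, CombineTextInputFormat (or a Hadoop Archive created with the `hadoop archive` command) lets one mapper process many small files without rewriting them.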