InterviewSolution
| 1. |
Why should we run the HDFS balancer periodically? Briefly explain how it works. |
|
Answer» HDFS data is not always distributed uniformly across DataNodes. This can happen for several reasons: some DataNodes have less disk space available to HDFS, disk utilization drifts apart during normal or heavy usage, and newly added DataNodes start out nearly empty while the existing nodes stay full. Because these conditions keep recurring, the balancer should be run periodically. The balancer is a tool that analyzes block placement and moves blocks between DataNodes to even out disk space usage across an HDFS cluster. It moves blocks until the cluster is deemed balanced, which means that the utilization of every DataNode is within the configured threshold of the cluster average. The balancer does not balance between individual volumes on a single DataNode. hdfs balancer [-policy <policy>]: the two supported policies are blockpool and datanode. With blockpool, the cluster is considered balanced if each block pool on each DataNode is balanced; with datanode, the cluster is considered balanced if each DataNode is balanced. The default policy is datanode. hdfs balancer [-threshold <threshold>]: specifies a number in [1.0, 100.0] representing a percentage of storage capacity; a DataNode whose utilization falls outside the cluster average +/- the threshold is considered over- or underutilized. The default threshold is 10.0. |
|
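The threshold rule described in the answer can be illustrated with a small Python sketch. This is not the balancer's actual code; it is a simplified model that assumes the cluster average is total used space divided by total capacity. The function name `classify_datanodes` and the node names and sizes are hypothetical, chosen only for illustration.

```python
from typing import Dict, Tuple


def classify_datanodes(
    used: Dict[str, float],
    capacity: Dict[str, float],
    threshold: float = 10.0,
) -> Tuple[float, Dict[str, str]]:
    """Label each DataNode relative to the cluster average utilization.

    Simplified model (an assumption, not the real balancer logic):
    a node is "over" if its utilization exceeds average + threshold,
    "under" if it is below average - threshold, else "balanced".
    All utilizations are percentages of storage capacity.
    """
    # Cluster-wide average utilization, in percent.
    average = 100.0 * sum(used.values()) / sum(capacity.values())

    status = {}
    for node, node_used in used.items():
        utilization = 100.0 * node_used / capacity[node]
        if utilization > average + threshold:
            status[node] = "over"
        elif utilization < average - threshold:
            status[node] = "under"
        else:
            status[node] = "balanced"
    return average, status


# Hypothetical cluster: dn3 was added recently and is nearly empty.
used_gb = {"dn1": 90.0, "dn2": 50.0, "dn3": 10.0}
capacity_gb = {"dn1": 100.0, "dn2": 100.0, "dn3": 100.0}

average, status = classify_datanodes(used_gb, capacity_gb, threshold=10.0)
print(f"cluster average: {average:.1f}%")
for node, label in status.items():
    print(node, label)
```

In this model, a smaller threshold leaves fewer nodes in the "balanced" band, so a real balancer run with a tighter threshold would have more blocks to move before the cluster counts as balanced.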