51.

Explain how the NameNode gets to know all the available DataNodes in the Hadoop cluster.

Answer»

In a Hadoop cluster, the DataNodes are where the actual data is kept. Every DataNode sends a heartbeat message to the NameNode every 3 seconds to confirm that it is active. If the NameNode does not receive a heartbeat from a particular DataNode for 10 minutes, it considers that DataNode dead, and it then initiates replication of the dead DataNode's blocks to other DataNodes which are active. DataNodes can talk to each other to rebalance the data, move and copy data around, and keep replication intact in the cluster. You can get a block report using the HDFS commands below.

Example:

hadoop fsck / ==> Filesystem check on HDFS

# hadoop fsck /hadoop/container/pbibhu

  • Total size: 16666944775310 B <=== see here
  • Total dirs: 3922
  • Total files: 418464
  • Total blocks (validated): 202705 (avg. block size 32953610 B)
  • Minimally replicated blocks: 202705 (100.0 %)
  • Over-replicated blocks: 0 (0.0 %)
  • Under-replicated blocks: 0 (0.0 %)
  • Mis-replicated blocks: 0 (0.0 %)
  • Default replication factor: 3
  • Average block replication: 3.0
  • Corrupt blocks: 0
  • Missing replicas: 0 (0.0 %)
  • Number of data-nodes: 18
  • Number of racks: 1

FSCK ended at Thu Oct 20 20:49:59 CET 2011 in 7516 milliseconds

The filesystem under path '/hadoop/container/pbibhu' is HEALTHY
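
Besides fsck, the NameNode's own view of which DataNodes are alive can be printed directly with dfsadmin. A minimal sketch (the host counts and capacity below are illustrative, output is trimmed, and the exact fields vary by Hadoop version):

# hdfs dfsadmin -report
Configured Capacity: 21990232555520 (20 TB)
...
Live datanodes (18):
...
Dead datanodes (0):

Every DataNode that is still sending heartbeats shows up under "Live datanodes".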

The NameNode is the node which stores the file system metadata. This metadata includes information like the list of file names, owner, permissions, timestamps, size, replication factor, the list of blocks for each file, etc.: which file maps to which block locations and which blocks are stored on which DataNode. When a DataNode stores a block of information, it maintains a checksum for each block as well. Whenever data is written to HDFS, the checksum value is written simultaneously, and on read the same checksum value is verified by default.

The DataNodes update the NameNode with block information periodically and, before updating, verify the checksum values. If the checksum value is not correct for a particular block, that block is considered to have disk-level corruption, and the DataNode skips that block's information while reporting to the NameNode. In this way the NameNode gets to know about the disk-level corruption on that DataNode and takes the necessary steps, e.g. the block can be replicated from its alternate locations to other active DataNodes to bring the replication factor back to the normal level. DataNodes can be listed in the dfs.hosts file, which contains the list of hosts that are permitted to connect to the NameNode.

Example:

Add this property to hdfs-site.xml:

<property>
  <name>dfs.hosts</name>
  <value>/home/hadoop/includes</value>
</property>

where the includes file lists one permitted host name per line:

hostname1
hostname2
hostname3

If the includes file is empty, all hosts are permitted. Note, however, that this is not a definitive list of active DataNodes: the NameNode only considers those DataNodes from which it receives heartbeats.
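
After the includes file is edited, the NameNode can be told to re-read it without a restart. A minimal sketch, reusing the /home/hadoop/includes path from the property above (the added host name is illustrative):

$ echo "hostname4" >> /home/hadoop/includes
$ hdfs dfsadmin -refreshNodes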

52.

What is distributed cache in Hadoop?

Answer»

The distributed cache is a facility provided by the Hadoop MapReduce framework to access small files needed by an application during its execution. These files are small, a few KBs or MBs in size, and are mainly text, archive or jar files. Because the files are small, they are kept in cache memory, which is one of the fastest memories. Applications which need the distributed cache to distribute a file should make sure that the file is available and can be accessed via a URL, which can be either hdfs:// or http://.

Once the file is present at the mentioned URL, the MapReduce framework will copy the necessary files to all the nodes before initiating the tasks on those nodes. If the files provided are archives, they are automatically unarchived on the nodes after the transfer.

Example: in a Hadoop cluster with three DataNodes, we run 30 tasks in the cluster, so each node gets 10 tasks. The nature of our tasks is such that each one needs some information, or a particular jar, before its execution. To fulfil this, we can cache the files which contain the info or the jar files. Before the job executes, the cache files are copied to each slave node's application master; the application master then reads the files and starts the tasks. The tasks can be mappers or reducers, and the cached files are read-only. By default the Hadoop distributed cache is 10 GB; if you want to change that, you have to modify the size in mapred-site.xml. A command-line usage sketch follows this paragraph.

Here a question comes to mind: why is cache memory required to perform the tasks? Why can't we keep the file in HDFS, already present on each DataNode, and have the application read it from there? There are 30 tasks in total here, and in real time there would be more than 100 or 1000 tasks. If we put the files in HDFS, then to perform 30 tasks the application has to access the HDFS location 30 times and read it each time, but HDFS is not very efficient at accessing small files this many times. This is the reason why cache memory is used: it reduces the number of reads from HDFS locations.
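
As a sketch of how files are typically placed in the distributed cache from the command line: Hadoop's generic options (-files, -archives, -libjars) copy the listed files to every node before the tasks start, assuming the job's driver class uses ToolRunner/GenericOptionsParser. The jar name, class name and paths below are illustrative, not from the original answer:

$ hadoop jar wordcount.jar WordCount \
    -files hdfs:///cache/lookup.txt \
    /user/pbibhu/input /user/pbibhu/output

Each mapper or reducer can then open lookup.txt as a local, read-only file in its working directory.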

53.

As per the configuration, HDFS is in High Availability mode with automatic failover. Explain in brief the daemon which takes care of the failover.

Answer»

High Availability of the cluster was introduced in Hadoop 2 to solve the single-point-of-failure problem of the NameNode in Hadoop 1.

The High Availability NameNode architecture provides the opportunity to have two NameNodes: an Active NameNode and a Passive/Standby NameNode. So both NameNodes are running at the same time in a High Availability cluster.

Whenever the Active NameNode goes down, due to a server crash or a graceful failover during a maintenance period, control goes to the Passive/Standby NameNode automatically, which reduces the cluster downtime. There are two problems in maintaining consistency in an HDFS High Availability cluster:

  1. The Active and the Passive/Standby NameNode should always be in sync, because they refer to the same metadata. A group of daemons called JournalNodes helps with this: it allows the Hadoop cluster to be restored to the same namespace state after a crash or failure and gives us fast failover.
  2. Only one NameNode should be active at a time, because two active NameNodes would cause loss or corruption of data. Such a scenario is known as a split-brain scenario, where the cluster gets divided into smaller clusters, each believing that it is the only active cluster. Fencing helps to avoid such scenarios. Fencing is a process that ensures only one NameNode remains active at any particular time: whenever two NameNodes are in the active state, fencing kills one of them.

As discussed above, there are two types of failover:

A. Graceful failover: we manually initiate the failover, e.g. for routine maintenance.
B. Automatic failover: the failover is initiated automatically when the NameNode fails or crashes.
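
As a sketch, a graceful failover can be driven with the hdfs haadmin tool (the service IDs nn1 and nn2 are illustrative; they must match the dfs.ha.namenodes.* entries in hdfs-site.xml):

$ hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
$ hdfs haadmin -failover nn1 nn2       # hand the active role from nn1 to nn2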

In either case of a NameNode failure, the Passive/Standby NameNode can take an exclusive lock in ZooKeeper, indicating that it wants to become the next Active NameNode.

In an HDFS High Availability cluster, Apache ZooKeeper is the service which provides the automatic failover. While a NameNode is active, ZooKeeper maintains a session with it. Whenever the Active NameNode fails, the session expires and ZooKeeper informs the Passive/Standby NameNode to initiate the failover process.

The ZooKeeperFailoverController (ZKFC) is a ZooKeeper client that monitors and manages the NameNode status. Each NameNode also runs a ZKFC, which is responsible for periodically monitoring the health of its NameNode.

When ZooKeeper is installed in your cluster, you should make sure that the following processes/daemons are running on the Active NameNode, the Standby NameNode and the DataNodes (a sample jps listing follows the three lists below).

When you run jps (Java Virtual Machine Process Status Tool) on the Active NameNode, you should see the daemons below:

  • ZooKeeper
  • ZooKeeper Failover Controller (ZKFC)
  • JournalNode
  • NameNode

When you run jps (Java Virtual Machine Process Status Tool) on the Standby NameNode, you should see the daemons below:

  • ZooKeeper
  • ZooKeeper Failover Controller (ZKFC)
  • JournalNode
  • NameNode

When you run jps (Java Virtual Machine Process Status Tool) on a DataNode, you should see the daemons below:

  • ZooKeeper
  • JournalNode
  • DataNode
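
For instance, on the Active NameNode the jps listing might look like the sketch below (the PIDs are illustrative; note that ZooKeeper shows up as QuorumPeerMain and the failover controller as DFSZKFailoverController):

$ jps
2115 QuorumPeerMain
2310 DFSZKFailoverController
2450 JournalNode
2587 NameNode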

54.

What are the differences between the Linux file system and the Hadoop Distributed File System?

Answer»

Linux file system

*****************

  1. A Linux file is stored on a single disk.
  2. You can store a small file on a disk; if the file size exceeds the disk size, the file cannot be stored.
  3. Each block size is 4 KB.
  4. If the machine goes down, the data cannot be accessed, and the chances of losing it on failure are higher.

Hadoop Distributed File System

*****************************

  1. A distributed file system is stored on a logical layer created over one or more disks.
  2. You can store files as large as you want; you just need to add more disks to the logical layer.
  3. Each block size is 64 MB/128 MB/256 MB depending on the Hadoop version, and you can customize the size too (a configuration sketch follows this list).
  4. Data is replicated across different nodes, so clients can still read the data if any node fails; the failover risk is lower.
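
As a sketch, the block size from point 3 and the replication from point 4 are both set in hdfs-site.xml (the values below are common Hadoop 2.x defaults, shown for illustration):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>  <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>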

55.

What is the command to copy a directory from one node in the cluster to another?

Answer» Use the distributed copy tool, distcp:

ubuntu@ubuntu-VirtualBox:~$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
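
As a brief extension (not part of the original answer), distcp also accepts flags such as -update and -overwrite to control how files that already exist at the destination are handled, e.g.:

ubuntu@ubuntu-VirtualBox:~$ hadoop distcp -update hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop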

56.

Explain in detail the difference between NameNode, Checkpoint NameNode and Backup Node.

Answer»

NameNode - It is also known as the master node. It maintains the file system tree and the metadata for all the files and directories present in the system. The NameNode is a highly available server that manages the file system namespace and controls access to files by clients. It records the metadata of all the files stored in the cluster, i.e. the location of stored blocks, the size of the files, the hierarchy, permissions, etc.

The NameNode is the master daemon that manages and maintains all the DataNodes (slave nodes).

There are two files associated with the metadata:

  • FsImage: the snapshot of the file system when the NameNode was started.
  • EditLogs: the sequence of changes made to the file system after the NameNode was started (a sketch for inspecting both files offline follows this list).
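
As a sketch, both files can be dumped to readable XML with Hadoop's offline viewers (the input file names below are illustrative; actual names carry transaction IDs):

$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml   # offline image viewer for FsImage
$ hdfs oev -p xml -i edits_0000000000000000001-0000000000000000042 -o edits.xml   # offline edits viewer for EditLogs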

Checkpoint Node - The Checkpoint Node is the newer implementation of the Secondary NameNode. It creates periodic checkpoints of the file system metadata by merging the edits file with the fsimage file, and finally it uploads the new image back to the active NameNode.

Its storage directory is structured in the same way as the NameNode's, and it stores the latest checkpoint.

Backup Node - The Backup Node is an extended Checkpoint Node that performs checkpointing and also supports online streaming of file system edits.

Its main role is to act as a dynamic backup for the file system namespace (metadata) of the primary NameNode in the Hadoop ecosystem.

The Backup Node keeps an in-memory, up-to-date copy of the file system namespace, which is always synchronized with the active NameNode state.

The Backup Node does not need to download the fsimage and edits files from the active NameNode to create a checkpoint, as it already has an up-to-date state of the namespace in its own main memory. So creating a checkpoint on the Backup Node is just saving a copy of the file system metadata (namespace) from main memory to its local file system.
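
As a sketch, both roles are started through the namenode command with the corresponding flag (assuming the relevant addresses are already configured in hdfs-site.xml):

$ hdfs namenode -checkpoint   # start a Checkpoint Node
$ hdfs namenode -backup       # start a Backup Node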

57.

What are Hadoop's three configuration files?

Answer»

Following are the three configuration files in Hadoop (a sample property from each is sketched after the list):

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml
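
As a sketch of what lives in each file (the property values below are illustrative):

<!-- core-site.xml: the default file system URI -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:9000</value>
</property>

<!-- mapred-site.xml: MapReduce settings, e.g. the execution framework -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- hdfs-site.xml: HDFS settings, e.g. the replication factor -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>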

58.

What are the functionalities of the JobTracker?

Answer»

Below are the main tasks of the JobTracker (a configuration sketch follows the list):

  • Accept jobs from the client.
  • Communicate with the NameNode to determine the location of the data.
  • Locate TaskTracker nodes with available slots.
  • Submit the work to the chosen TaskTracker node and monitor the progress of each task.
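
Clients and TaskTrackers locate the JobTracker through mapred-site.xml (Hadoop 1.x; the host and port below are illustrative):

<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>
</property>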

59.

What are the different hdfs dfs shell commands to perform a copy operation?

Answer»

$ hadoop fs -copyToLocal
$ hadoop fs -copyFromLocal
$ hadoop fs -put
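
As a usage sketch (the paths below are illustrative): -copyFromLocal and -put copy from the local file system into HDFS, while -copyToLocal copies from HDFS back to the local file system.

$ hadoop fs -copyFromLocal /tmp/data.txt /user/pbibhu/data.txt
$ hadoop fs -put /tmp/data.txt /user/pbibhu/data.txt
$ hadoop fs -copyToLocal /user/pbibhu/data.txt /tmp/data.txt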