51.

Explain how the NameNode gets to know all the available DataNodes in the Hadoop cluster.

Answer»

In a Hadoop cluster, the DataNodes are where the actual data is kept. Every DataNode sends a heartbeat message to the NameNode every 3 seconds to confirm that it is active. If the NameNode does not receive a heartbeat from a particular DataNode for 10 minutes, it considers that DataNode dead, and it then initiates replication of the dead DataNode's blocks to other DataNodes which are active. DataNodes can talk to each other to rebalance the data, move and copy data around, and keep replication intact in the cluster. You can get a block report using the HDFS commands below.

Example:

hadoop fsck / ==> Filesystem check on HDFS

# hadoop fsck /hadoop/container/pbibhu

  • Total size: 16666944775310 B <=== see here
  • Total dirs: 3922
  • Total files: 418464
  • Total blocks (validated): 202705 (avg. block size 32953610 B)
  • Minimally replicated blocks: 202705 (100.0 %)
  • Over-replicated blocks: 0 (0.0 %)
  • Under-replicated blocks: 0 (0.0 %)
  • Mis-replicated blocks: 0 (0.0 %)
  • Default replication factor: 3
  • Average block replication: 3.0
  • Corrupt blocks: 0
  • Missing replicas: 0 (0.0 %)
  • Number of data-nodes: 18
  • Number of racks: 1

FSCK ended at Thu Oct 20 20:49:59 CET 2011 in 7516 milliseconds

The filesystem under path '/hadoop/container/pbibhu' is HEALTHY
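
Besides fsck, the NameNode's own view of which DataNodes are alive can be printed directly with dfsadmin. A minimal sketch (the host counts and capacity below are illustrative, output is trimmed, and the exact fields vary by Hadoop version):

# hdfs dfsadmin -report
Configured Capacity: 21990232555520 (20 TB)
...
Live datanodes (18):
...
Dead datanodes (0):

Every DataNode that is still sending heartbeats shows up under "Live datanodes".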

The NameNode is the node which stores the file system metadata. This metadata includes information like the list of file names, owner, permissions, timestamps, size, replication factor, the list of blocks for each file, etc.: which file maps to which block locations and which blocks are stored on which DataNode. When a DataNode stores a block of information, it maintains a checksum for each block as well. Whenever data is written to HDFS, the checksum value is written simultaneously, and on read the same checksum value is verified by default.

The DataNodes update the NameNode with block information periodically and, before updating, verify the checksum values. If the checksum value is not correct for a particular block, that block is considered to have disk-level corruption, and the DataNode skips that block's information while reporting to the NameNode. In this way the NameNode gets to know about the disk-level corruption on that DataNode and takes the necessary steps, e.g. the block can be replicated from its alternate locations to other active DataNodes to bring the replication factor back to the normal level. DataNodes can be listed in the dfs.hosts file, which contains the list of hosts that are permitted to connect to the NameNode.

Example:

Add this property to hdfs-site.xml:

<property>
  <name>dfs.hosts</name>
  <value>/home/hadoop/includes</value>
</property>

where the includes file lists one permitted host name per line:

hostname1
hostname2
hostname3

If the includes file is empty, all hosts are permitted. Note, however, that this is not a definitive list of active DataNodes: the NameNode only considers those DataNodes from which it receives heartbeats.
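
After the includes file is edited, the NameNode can be told to re-read it without a restart. A minimal sketch, reusing the /home/hadoop/includes path from the property above (the added host name is illustrative):

$ echo "hostname4" >> /home/hadoop/includes
$ hdfs dfsadmin -refreshNodes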

52.

What is distributed cache in Hadoop?

Answer»

The distributed cache is a facility provided by the Hadoop MapReduce framework to access small files needed by an application during its execution. These files are small, a few KBs or MBs in size, and are mainly text, archive or jar files. Because the files are small, they are kept in cache memory, which is one of the fastest memories. Applications which need the distributed cache to distribute a file should make sure that the file is available and can be accessed via a URL, which can be either hdfs:// or http://.

Once the file is present at the mentioned URL, the MapReduce framework will copy the necessary files to all the nodes before initiating the tasks on those nodes. If the files provided are archives, they are automatically unarchived on the nodes after the transfer.

Example: in a Hadoop cluster with three DataNodes, we run 30 tasks in the cluster, so each node gets 10 tasks. The nature of our tasks is such that each one needs some information, or a particular jar, before its execution. To fulfil this, we can cache the files which contain the info or the jar files. Before the job executes, the cache files are copied to each slave node's application master; the application master then reads the files and starts the tasks. The tasks can be mappers or reducers, and the cached files are read-only. By default the Hadoop distributed cache is 10 GB; if you want to change that, you have to modify the size in mapred-site.xml. A command-line usage sketch follows this paragraph.

Here a question comes to mind: why is cache memory required to perform the tasks? Why can't we keep the file in HDFS, already present on each DataNode, and have the application read it from there? There are 30 tasks in total here, and in real time there would be more than 100 or 1000 tasks. If we put the files in HDFS, then to perform 30 tasks the application has to access the HDFS location 30 times and read it each time, but HDFS is not very efficient at accessing small files this many times. This is the reason why cache memory is used: it reduces the number of reads from HDFS locations.
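
As a sketch of how files are typically placed in the distributed cache from the command line: Hadoop's generic options (-files, -archives, -libjars) copy the listed files to every node before the tasks start, assuming the job's driver class uses ToolRunner/GenericOptionsParser. The jar name, class name and paths below are illustrative, not from the original answer:

$ hadoop jar wordcount.jar WordCount \
    -files hdfs:///cache/lookup.txt \
    /user/pbibhu/input /user/pbibhu/output

Each mapper or reducer can then open lookup.txt as a local, read-only file in its working directory.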

53.

As per the configuration, HDFS is in High Availability mode with automatic failover. Explain in brief the daemon which takes care of the failover.

Answer»

High Availability of the cluster was introduced in Hadoop 2 to solve the single-point-of-failure problem of the NameNode in Hadoop 1.

The High Availability NameNode architecture provides the opportunity to have two NameNodes: an Active NameNode and a Passive/Standby NameNode. So both NameNodes are running at the same time in a High Availability cluster.

Whenever the Active NameNode goes down, due to a server crash or a graceful failover during a maintenance period, control goes to the Passive/Standby NameNode automatically, which reduces the cluster downtime. There are two problems in maintaining consistency in an HDFS High Availability cluster:

  1. The Active and the Passive/Standby NameNode should always be in sync, because they refer to the same metadata. A group of daemons called JournalNodes helps with this: it allows the Hadoop cluster to be restored to the same namespace state after a crash or failure and gives us fast failover.
  2. Only one NameNode should be active at a time, because two active NameNodes would cause loss or corruption of data. Such a scenario is known as a split-brain scenario, where the cluster gets divided into smaller clusters, each believing that it is the only active cluster. Fencing helps to avoid such scenarios. Fencing is a process that ensures only one NameNode remains active at any particular time: whenever two NameNodes are in the active state, fencing kills one of them.

As discussed above, there are two types of failover:

A. Graceful failover: we manually initiate the failover, e.g. for routine maintenance.
B. Automatic failover: the failover is initiated automatically when the NameNode fails or crashes.
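
As a sketch, a graceful failover can be driven with the hdfs haadmin tool (the service IDs nn1 and nn2 are illustrative; they must match the dfs.ha.namenodes.* entries in hdfs-site.xml):

$ hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
$ hdfs haadmin -failover nn1 nn2       # hand the active role from nn1 to nn2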

In either case of a NameNode failure, the Passive/Standby NameNode can take an exclusive lock in ZooKeeper, indicating that it wants to become the next Active NameNode.

In an HDFS High Availability cluster, Apache ZooKeeper is the service which provides the automatic failover. While a NameNode is active, ZooKeeper maintains a session with it. Whenever the Active NameNode fails, the session expires and ZooKeeper informs the Passive/Standby NameNode to initiate the failover process.

The ZooKeeperFailoverController (ZKFC) is a ZooKeeper client that monitors and manages the NameNode status. Each NameNode also runs a ZKFC, which is responsible for periodically monitoring the health of its NameNode.

When ZooKeeper is installed in your cluster, you should make sure that the following processes/daemons are running on the Active NameNode, the Standby NameNode and the DataNodes (a sample jps listing follows the three lists below).

When you run jps (Java Virtual Machine Process Status Tool) on the Active NameNode, you should see the daemons below:

  • ZooKeeper
  • ZooKeeper Failover Controller (ZKFC)
  • JournalNode
  • NameNode

When you run jps (Java Virtual Machine Process Status Tool) on the Standby NameNode, you should see the daemons below:

  • ZooKeeper
  • ZooKeeper Failover Controller (ZKFC)
  • JournalNode
  • NameNode

When you run jps (Java Virtual Machine Process Status Tool) on a DataNode, you should see the daemons below:

  • ZooKeeper
  • JournalNode
  • DataNode
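
For instance, on the Active NameNode the jps listing might look like the sketch below (the PIDs are illustrative; note that ZooKeeper shows up as QuorumPeerMain and the failover controller as DFSZKFailoverController):

$ jps
2115 QuorumPeerMain
2310 DFSZKFailoverController
2450 JournalNode
2587 NameNode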

54.

What are the differences between the Linux file system and the Hadoop Distributed File System?

Answer»

Linux file system

*****************

  1. A Linux file is stored on a single disk.
  2. You can store a small file on a disk; if the file size exceeds the disk size, the file cannot be stored.
  3. Each block size is 4 KB.
  4. If the machine goes down, the data cannot be accessed, and the chances of losing it on failure are higher.

Hadoop Distributed File System

*****************************

  1. A distributed file system is stored on a logical layer created over one or more disks.
  2. You can store files as large as you want; you just need to add more disks to the logical layer.
  3. Each block size is 64 MB/128 MB/256 MB depending on the Hadoop version, and you can customize the size too (a configuration sketch follows this list).
  4. Data is replicated across different nodes, so clients can still read the data if any node fails; the failover risk is lower.
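
As a sketch, the block size from point 3 and the replication from point 4 are both set in hdfs-site.xml (the values below are common Hadoop 2.x defaults, shown for illustration):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>  <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>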

55.

What is the command to copy a directory from one node in the cluster to another?

Answer» Use the distributed copy tool, distcp:

ubuntu@ubuntu-VirtualBox:~$ hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
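
As a brief extension (not part of the original answer), distcp also accepts flags such as -update and -overwrite to control how files that already exist at the destination are handled, e.g.:

ubuntu@ubuntu-VirtualBox:~$ hadoop distcp -update hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop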

56.

Explain in detail the difference between NameNode, Checkpoint NameNode and Backup Node.

Answer»

NameNode - It is also known as the master node. It maintains the file system tree and the metadata for all the files and directories present in the system. The NameNode is a highly available server that manages the file system namespace and controls access to files by clients. It records the metadata of all the files stored in the cluster, i.e. the location of stored blocks, the size of the files, the hierarchy, permissions, etc.

The NameNode is the master daemon that manages and maintains all the DataNodes (slave nodes).

There are two files associated with the metadata:

  • FsImage: the snapshot of the file system when the NameNode was started.
  • EditLogs: the sequence of changes made to the file system after the NameNode was started (a sketch for inspecting both files offline follows this list).
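
As a sketch, both files can be dumped to readable XML with Hadoop's offline viewers (the input file names below are illustrative; actual names carry transaction IDs):

$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml   # offline image viewer for FsImage
$ hdfs oev -p xml -i edits_0000000000000000001-0000000000000000042 -o edits.xml   # offline edits viewer for EditLogs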

Checkpoint Node - The Checkpoint Node is the newer implementation of the Secondary NameNode. It creates periodic checkpoints of the file system metadata by merging the edits file with the fsimage file, and finally it uploads the new image back to the active NameNode.

Its storage directory is structured in the same way as the NameNode's, and it stores the latest checkpoint.

Backup Node - The Backup Node is an extended Checkpoint Node that performs checkpointing and also supports online streaming of file system edits.

Its main role is to act as a dynamic backup for the file system namespace (metadata) of the primary NameNode in the Hadoop ecosystem.

The Backup Node keeps an in-memory, up-to-date copy of the file system namespace, which is always synchronized with the active NameNode state.

The Backup Node does not need to download the fsimage and edits files from the active NameNode to create a checkpoint, as it already has an up-to-date state of the namespace in its own main memory. So creating a checkpoint on the Backup Node is just saving a copy of the file system metadata (namespace) from main memory to its local file system.
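
As a sketch, both roles are started through the namenode command with the corresponding flag (assuming the relevant addresses are already configured in hdfs-site.xml):

$ hdfs namenode -checkpoint   # start a Checkpoint Node
$ hdfs namenode -backup       # start a Backup Node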

57.

What are Hadoop's three configuration files?

Answer»

Following are the three configuration files in Hadoop (a sample property from each is sketched after the list):

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml
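
As a sketch of what lives in each file (the property values below are illustrative):

<!-- core-site.xml: the default file system URI -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:9000</value>
</property>

<!-- mapred-site.xml: MapReduce settings, e.g. the execution framework -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- hdfs-site.xml: HDFS settings, e.g. the replication factor -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>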

58.

What are the functionalities of the JobTracker?

Answer»

Below are the main tasks of the JobTracker (a configuration sketch follows the list):

  • Accept jobs from the client.
  • Communicate with the NameNode to determine the location of the data.
  • Locate TaskTracker nodes with available slots.
  • Submit the work to the chosen TaskTracker node and monitor the progress of each task.
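
Clients and TaskTrackers locate the JobTracker through mapred-site.xml (Hadoop 1.x; the host and port below are illustrative):

<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>
</property>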

59.

What are the different hdfs dfs shell commands to perform a copy operation?

Answer»

$ hadoop fs -copyToLocal
$ hadoop fs -copyFromLocal
$ hadoop fs -put
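
As a usage sketch (the paths below are illustrative): -copyFromLocal and -put copy from the local file system into HDFS, while -copyToLocal copies from HDFS back to the local file system.

$ hadoop fs -copyFromLocal /tmp/data.txt /user/pbibhu/data.txt
$ hadoop fs -put /tmp/data.txt /user/pbibhu/data.txt
$ hadoop fs -copyToLocal /user/pbibhu/data.txt /tmp/data.txt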