
1.

List Hadoop HDFS Commands.

Answer»

A) version: Shows the version of Hadoop installed.
hadoop version

interviewbit:~$ hadoop version
Hadoop 3.1.2
Source code repository https://github.com/apache/hadoop.git -r
Compiled by sunilg on 2019-01-29T01:39Z
interviewbit:~$

B) mkdir: Used to create a new directory.

interviewbit:~$ hadoop fs -mkdir /interviewbit
interviewbit:~$

C) cat: Displays the contents of a file stored in HDFS.
hadoop fs -cat /path_to_file_in_hdfs

interviewbit:~$ hadoop fs -cat /interviewbit/sample
Hello from InterviewBit…
File in HDFS…
interviewbit:~$

D) mv: Moves files or directories from a source to a destination within HDFS.
hadoop fs -mv <src> <dest>

interviewbit:~$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - interviewbit supergroup 0 2020-01-29 11:11 /Intr1
drwxr-xr-x - interviewbit supergroup 0 2020-01-29 11:11 /Interviewbit
interviewbit:~$ hadoop fs -mv /Intr1 /Interviewbit
interviewbit:~$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - interviewbit supergroup 0 2020-01-29 11:11 /Interviewbit

E) copyToLocal: Copies a file from HDFS to the local file system.
hadoop fs -copyToLocal <hdfs source> <localdst>

interviewbit:~$ hadoop fs -copyToLocal /interviewbit/CopyTest ~/test1
interviewbit:~$

F) get: Copies a file from the Hadoop File System to the local file system.
hadoop fs -get <src> <localdest>

interviewbit:~$ hadoop fs -get /testFile ~/copyFromHadoop
interviewbit:~$
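
A few more HDFS commands that are commonly asked about, shown as a brief sketch (the paths here are illustrative, not taken from the examples above):

hadoop fs -ls /                           # list files and directories at an HDFS path
hadoop fs -put ~/data.txt /interviewbit   # copy a file from the local file system into HDFS
hadoop fs -rm /interviewbit/data.txt      # delete a file from HDFS
hadoop fs -du -h /interviewbit            # show the space consumed under an HDFS path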
2.

Mention the types of Znode.

Answer»
  • Persistent Znodes:
    The default znode in ZooKeeper is the persistent znode. It stays in the ZooKeeper server permanently, until a client explicitly deletes it.
  • Ephemeral Znodes:
    These are temporary znodes. An ephemeral znode is destroyed whenever its creator client disconnects from the ZooKeeper server. For example, assume client1 created eznode1. Once client1 disconnects from the ZooKeeper server, eznode1 gets destroyed.
  • Sequential Znodes:
    A sequential znode is assigned a 10-digit number, in numerical order, at the end of its name. Assume client1 creates sznode1. In the ZooKeeper server, sznode1 will be named like this:
    sznode0000000001
    If client1 creates another sequential znode, it will bear the next number in the sequence, so the subsequent sequential znode is <znode name>0000000002. (A quick CLI sketch of all three types follows below.)
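
As an illustrative sketch (not part of the original answer), all three znode types can be created from the ZooKeeper CLI (zkCli.sh); the znode names and data are made up:

create /pnode "data"      # persistent znode (the default)
create -e /enode "data"   # ephemeral znode, removed when the creating session ends
create -s /snode "data"   # sequential znode, created with a 10-digit suffix such as /snode0000000001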
3.

What are the benefits of using ZooKeeper?

Answer»
  • Simple distributed coordination process: The coordination process among all nodes in ZooKeeper is straightforward.
  • Synchronization: Mutual exclusion and co-operation among server processes.
  • Ordered messages: ZooKeeper stamps each update with a number denoting its order, so all messages are ordered.
  • Serialization: Data is encoded according to specific rules, ensuring the application runs consistently.
  • Reliability: ZooKeeper is very reliable. Once an update has been applied, it persists until a client overwrites it.
  • Atomicity: Data transfer either succeeds or fails completely; no transaction is partial.
4.

What is Apache ZooKeeper?

Answer»

Apache ZooKeeper is an open-source service that supports managing a huge set of hosts. Management and coordination in a distributed environment are complex, so ZooKeeper automates this process and lets developers concentrate on building software features rather than worrying about its distributed nature.

ZooKeeper helps to maintain configuration information, naming, and group services for distributed applications. It implements various protocols on the cluster so that the applications do not have to implement them on their own. It provides a single coherent view of multiple machines.

5.

List the YARN components.

Answer»
  • Resource Manager: It runs on a master daemon and controls the resource allocation in the cluster.
  • Node Manager: It runs on the slave daemons and executes tasks on every DataNode.
  • Application Master: It controls the user job lifecycle and the resource needs of individual applications. It works with the Node Manager and monitors the execution of tasks.
  • Container: It is a package of resources, including RAM, CPU, network, HDD, etc., on a single node. (A few YARN CLI commands for inspecting these components are sketched below.)
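
As a rough illustration of the components above, the Resource Manager and Node Managers can be inspected with the YARN command line (output omitted; the application id is a placeholder):

yarn node -list                              # lists the Node Managers registered with the Resource Manager
yarn application -list                       # lists applications currently tracked by the Resource Manager
yarn application -status <application_id>   # shows the status of one application and its Application Master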
6.

What is YARN?

Answer»

YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and was introduced in Hadoop 2.x. YARN supports many data processing engines, such as graph processing, batch processing, interactive processing, and stream processing, to execute and process data saved in the Hadoop Distributed File System. YARN also offers job scheduling. It extends the capability of Hadoop to other evolving technologies so that they can take good advantage of HDFS and economical clusters.
Apache YARN is the data operating system for Hadoop 2.x. It consists of a master daemon known as the "Resource Manager," a slave daemon called the "Node Manager," and the Application Master.

7.

Explain the Apache Pig architecture.

Answer»

Apache Pig architecture includes a Pig Latin interpreter that uses Pig Latin scripts to process and analyze massive datasets. Programmers use the Pig Latin language to examine huge datasets in the Hadoop environment. Apache Pig has a rich set of operators for different data operations like join, filter, sort, load, group, etc.
Programmers write scripts in the Pig Latin language to perform a particular task. Pig transforms these scripts into a series of MapReduce jobs to reduce programmers' work. Pig Latin programs can be executed via various mechanisms such as UDFs, embedded code, and the Grunt shell.

Apache Pig architecture consists of the following major components:

  • Parser: The Parser handles the Pig scripts and checks the syntax of the script.
  • Optimizer: The optimizer receives the logical plan (DAG) and carries out logical optimizations such as projection and pushdown.
  • Compiler: The compiler converts the logical plan into a series of MapReduce jobs.
  • Execution engine: In the end, the MapReduce jobs get submitted to Hadoop in sorted order.
  • Execution mode: Apache Pig can be executed in local mode or MapReduce mode. The selection of the execution mode depends on where the data is stored and where you want to run the Pig script (a brief command sketch follows this list).
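
A minimal sketch of the two execution modes (the script name is illustrative):

pig -x local myscript.pig       # local mode: runs against the local file system, handy for testing on small data
pig -x mapreduce myscript.pig   # MapReduce mode (the default): runs against data in HDFS as MapReduce jobs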
8.

What is Apache Pig?

Answer»

MapReduce needs programs to be translated into map and reduce stages. As not all data analysts are accustomed to MapReduce, Yahoo researchers introduced Apache Pig to bridge the gap. Apache Pig was built on top of Hadoop, providing a high level of abstraction and enabling programmers to spend less time writing complex MapReduce programs.

9.

What is Apache Hive?

Answer»

Hive is an open-source system that processes structured data in Hadoop. It lives on top of Hadoop for summarizing Big Data and facilitating analysis and queries. In addition, Hive enables SQL developers to write Hive Query Language (HQL) statements, similar to standard SQL statements, for data query and analysis. It was created to make MapReduce programming easier, because you don't have to know or write lengthy Java code.
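
As a minimal sketch, a HiveQL statement can be run non-interactively with the Hive CLI's -e option; the employees table here is hypothetical:

hive -e "SELECT department, COUNT(*) FROM employees GROUP BY department;"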

10.

List the components of Apache Spark.

Answer»

Apache Spark comprises the Spark Core Engine, Spark Streaming, MLlib, GraphX, Spark SQL, and SparkR.

The Spark Core Engine can be used along with any of the other five components listed. It is not required to use all the Spark components together; depending on the use case, one or more of them can be used along with Spark Core.
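
As a rough illustration, most of these components ship with their own entry points under Spark's bin directory (assuming a local Spark installation; my_app.py is a placeholder):

spark-shell                             # interactive Scala shell on top of Spark Core
pyspark                                 # interactive Python shell
spark-sql                               # interactive shell for Spark SQL
sparkR                                  # interactive shell for SparkR
spark-submit --master yarn my_app.py    # submit a packaged application to a YARN cluster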

11.

What is shuffling in MapReduce?

Answer»

In Hadoop MapReduce, shuffling is the process of transferring data from the mappers to the reducers. The system sorts the map output and transfers it as input to the reducers. It is an essential step for the reducers; without it, they would not receive any input. Moreover, since shuffling can begin even before the map phase is complete, it helps to save time and finish the job faster.

12.

Explain Hadoop MapReduce.

Answer»

Hadoop MapReduce is a software framework for processing enormous data sets. It is the main component for data processing in the Hadoop framework. It divides the input data into several parts and runs a program on every part in parallel. The term MapReduce refers to two separate and distinct tasks.

The first is the map operation, which takes a set of data and transforms it into a different collection of data, where individual elements are divided into tuples (key/value pairs). The reduce operation takes the output of the map as its input, consolidates those data tuples based on the key, and aggregates their values.

Let us take an example of a text file called example_data.txt and understand how MapReduce works.

The content of the example_data.txt file is:
coding,jamming,ice,river,man,driving

Now, assume we have to find out the word count on the example_data.txt using MapReduce. So, we will be looking for the unique words and the number of times those unique words appeared.

  • First, we break the input into three splits. This shares the work among all the map nodes.
  • Then, each mapper tokenizes the words in its split and assigns a hardcoded value of 1 to each token. The reason for using the hardcoded value 1 is that every word by itself occurs at least once.
  • Now, a list of key-value pairs is created, where the key is an individual word and the value is one. So, for the first line (Coding Ice Jamming), we have three key-value pairs: Coding, 1; Ice, 1; Jamming, 1.
  • The mapping process remains the same on all the nodes.
  • Next, a partition process takes place, where sorting and shuffling follow so that all the tuples with the same key are sent to the same reducer.
  • After the sorting and shuffling phase, every reducer has a unique key and a list of values corresponding to that key, for example, Coding, [1,1]; Ice, [1,1,1]; etc.
  • Now, each reducer adds up the values present in its list of values. As shown in the example, the reducer gets the list of values [1,1] for the key Jamming. It adds the ones in that list and gives the final output: Jamming, 2.
  • Lastly, all the output key/value pairs are collected and written to the output file. (A way to run this end to end with Hadoop's bundled example is sketched below.)
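
One way to try this end to end is Hadoop's bundled WordCount example; the HDFS paths are illustrative, and the exact jar name depends on the Hadoop version installed:

hadoop fs -mkdir /wordcount_input
hadoop fs -put example_data.txt /wordcount_input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /wordcount_input /wordcount_output
hadoop fs -cat /wordcount_output/part-r-00000   # prints each unique word with its count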
13.

List Hadoop Configuration files.

Answer»
  • core-site.xml: core Hadoop settings, such as the default file system URI (fs.defaultFS).
  • hdfs-site.xml: HDFS settings, such as the replication factor and NameNode/DataNode directories.
  • mapred-site.xml: MapReduce framework settings.
  • yarn-site.xml: ResourceManager and NodeManager settings.
  • hadoop-env.sh: environment variables used by the Hadoop scripts (for example, JAVA_HOME).
  • workers (named slaves in older releases): the list of worker node hostnames.
14.

Compare the main differences between HDFS (Hadoop Distributed File System) and Network Attached Storage (NAS).

Answer»
  • HDFS is a distributed file system that is mainly used to store data on commodity hardware, whereas NAS is a file-level computer data storage server connected to a computer network that provides network access to a heterogeneous group of clients.
  • HDFS is designed to work with the MapReduce paradigm, whereas NAS is not suitable for working with MapReduce.
  • HDFS is cost-effective, whereas NAS is a high-end storage device that is highly expensive.
15.

What are the limitations of Hadoop 1.0?

Answer»
  • Only one NameNode can be configured.
  • The Secondary NameNode only takes an hourly backup of metadata from the NameNode.
  • It is only suitable for batch processing of a vast amount of data that is already in the Hadoop system.
  • It is not ideal for real-time data processing.
  • It supports up to 4000 nodes per cluster.
  • It has a single component, the JobTracker, to perform many activities such as resource management, job scheduling, job monitoring, rescheduling of jobs, etc.
  • The JobTracker is a single point of failure.
  • It supports only one NameNode and one namespace per cluster.
  • It does not support horizontal scalability of the NameNode.
  • It runs only MapReduce jobs.
16.

Mention different Features of HDFS.

Answer»
  • Fault Tolerance
    The Hadoop framework divides data into blocks and creates multiple copies of the blocks on several machines in the cluster. So, when any device in the cluster fails, clients can still access their data from another machine containing the same copy of the data blocks.
  • High Availability
    In the HDFS environment, data is replicated by generating copies of the blocks. So, whenever users want to access this data, or in case of an unfortunate situation, they can simply access it from the other nodes, because duplicate copies of the blocks are already present in the other nodes of the HDFS cluster.
  • High Reliability
    HDFS splits the data into blocks, and the Hadoop framework stores these blocks on the nodes in the cluster. It saves data by generating a replica of every block present in the cluster, and hence provides a fault tolerance facility. By default, it creates 3 replicas of each block on the nodes. Therefore, the data is promptly available to users, and users do not face the problem of data loss. Hence, HDFS is very reliable.
  • Replication
    Replication solves the problem of data loss in adverse conditions like device failure, crashing of nodes, etc. It manages the process of replication at regular intervals of time. Thus, there is a low probability of losing user data. (A short command sketch for inspecting and changing the replication factor follows this list.)
  • Scalability
    HDFS stores the data on multiple nodes. So, in case of an increase in demand, the cluster can be scaled.
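
A brief sketch of inspecting and changing replication from the command line (the path is illustrative); the cluster-wide default is the dfs.replication property in hdfs-site.xml:

hadoop fs -ls /interviewbit/sample            # for files, the second column of the listing is the replication factor
hadoop fs -setrep -w 2 /interviewbit/sample   # change this file's replication factor to 2 and wait for it to finish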
17.

Explain the Storage Unit In Hadoop (HDFS).

Answer»

HDFS, the Hadoop Distributed File System, is the storage layer for Hadoop. The files in HDFS are split into block-sized parts called data blocks. These blocks are saved on the slave nodes in the cluster. By default, the size of a block is 128 MB, which can be configured as per our requirements. HDFS follows a master-slave architecture and contains two daemons: the NameNode and the DataNodes.

NameNode
The NameNode is the master daemon that runs on the master node. It stores the filesystem metadata, that is, file names, information about the blocks of a file, block locations, permissions, etc. It manages the DataNodes.
DataNode
The DataNodes are the slave daemons that run on the slave nodes. They store the actual data and serve client read/write requests based on the NameNode's instructions. The DataNodes store the blocks of the files, while the NameNode stores metadata like block locations, permissions, etc.
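
As a quick illustration of this split between metadata and data, two administrative commands can be run against a live cluster (the file path is illustrative):

hdfs fsck /interviewbit/sample -files -blocks -locations   # asks the NameNode which blocks make up the file and which DataNodes hold them
hdfs dfsadmin -report                                      # lists the DataNodes in the cluster along with their capacity and usage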

18.

Explain Hadoop. List the core components of Hadoop.

Answer»

Hadoop is a famous Big Data tool utilized by many companies globally. A few successful Hadoop users:

  • Uber
  • The Bank of Scotland
  • Netflix
  • The National Security Agency (NSA) of the United States
  • Twitter

The three components of Hadoop are:

  1. Hadoop YARN - It is the resource management unit of Hadoop.
  2. Hadoop Distributed File System (HDFS) - It is the storage unit of Hadoop.
  3. Hadoop MapReduce - It is the processing unit of Hadoop.
19.

Explain big data and list its characteristics.

Answer»

Gartner defined Big Data as:
"Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

Simply put, big data consists of larger, more complex data sets, particularly from new data sources. These data sets are so large that conventional data processing software can't manage them. But these massive volumes of data can be used to address business problems you wouldn't have been able to tackle before.


Characteristics of Big Data are: 

  • Volume: Volume refers to the large amount of data stored in data warehouses.
  • Velocity: Velocity refers to the pace at which data is generated in real time.
  • Variety: Variety of Big Data refers to structured, unstructured, and semi-structured data collected from multiple sources.
  • Veracity: Data veracity generally refers to how accurate the data is.
  • Value: No matter how fast the data is produced or its amount, it has to be reliable and valuable. Otherwise, the information is not good enough for processing or analysis.