
1.

Mention the key design challenges of Distributed Applications.

Answer»
  • Heterogeneity: The design of applications should allow users to access services and run applications over a heterogeneous collection of computers and networks, taking into consideration hardware devices, operating systems, networks, and programming languages.
  • Transparency: Distributed system designers must hide the complexity of the system as much as they can. Some forms of transparency are location, access, migration, relocation, and so on.
  • Openness: It is a characteristic that determines whether the system can be extended and reimplemented in various ways.
  • Security: Distributed system designers must take care of confidentiality, integrity, and availability.
  • Scalability: A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance.


2.

Explain the architecture of Flume.

Answer»

In general, the Apache Flume architecture is composed of the following components (a small client-side sketch follows the list):

  1. Flume Source
  2. Flume Channel
  3. Flume Sink
  4. Flume Agent
  5. Flume Event
  1. Flume Source: A Flume Source receives data from data generators such as web servers or social media platforms like Facebook or Instagram, and then the data is transferred to a Flume Channel in the form of Flume events.
  2. Flume Channel: The data from the Flume Source is sent to an intermediate store that buffers the events until they get transferred into the Sink; this intermediate store is called the Flume Channel. The channel acts as a bridge between a Source and a Sink. Flume supports both a Memory Channel and a File Channel. The File Channel is non-volatile, which means once the data enters the channel, it will never be lost unless you delete it. In contrast, the Memory Channel stores events in memory, so it is volatile and data may be lost, but it is very fast in nature.
  3. Flume Sink: A Flume Sink takes Flume events from the Flume Channel and stores them in the specified destination, such as a data repository like HDFS. It delivers the events either to the final store or to another agent. Various sinks like the HDFS Sink, Hive Sink, Thrift Sink, etc., are supported by Flume.
  4. Flume Agent: A Java (JVM) process that hosts a Source, Channel, Sink combination is called a Flume Agent. A deployment can have one or more agents, and connected Flume agents, which are distributed in nature, can also be collectively called Flume.
  5. Flume Event: An Event is the unit of data transported in Flume. It is the general representation of a data object in Flume and is made up of a byte-array payload with optional headers.
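To make the event flow concrete, here is a minimal client-side sketch using the flume-ng-sdk RPC client. It assumes (hypothetically) a running agent whose Avro source listens on localhost:41414; the host, port, and message body are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSender {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to a Flume agent whose Avro source listens on localhost:41414 (placeholder address)
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // A Flume Event is a byte-array payload plus optional headers
            Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
            client.append(event); // hand the event to the agent's source
        } finally {
            client.close();
        }
    }
}
```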
3.

What is Apache Flume in Hadoop ?

Answer»

Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting huge amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Flume is a very stable, distributed, and configurable tool. It is generally designed to copy streaming data (log data) from various web servers to HDFS.

4.

What is the default file format to import data using Apache Sqoop?

Answer»

Sqoop basically allows data to be imported using two file formats:

  • Delimited Text File Format (the default)
  • Sequence File Format
5.

Where is table data stored in Apache Hive by default?

Answer»

By default, table data in Apache Hive is stored in HDFS under: hdfs://namenode_server/user/hive/warehouse

6.

If the source data gets updated every now and then, how will you synchronize the data in HDFS that is imported by Sqoop?

Answer»

If the source data gets updated at very short intervals, the synchronization of the data imported into HDFS by Sqoop is done with the help of incremental import parameters.

We should use incremental import in append mode when the table is continuously being refreshed with new rows. Here, the value of a check column (typically an incrementing key) is examined, and only rows whose value is greater than the one recorded during the previous import are appended. When the source can also revise existing rows, the lastmodified mode is used instead: a date column is examined for all the records that have been modified after the last import, and the updated values are brought into HDFS.

7.

How do you differentiate between an inner bag and an outer bag in Pig?

Answer»
  • Outer bag: An outer bag, which is also called a relation, is nothing but a bag of tuples. Example: {(park, New York), (Hollywood, Los Angeles)} is a bag of tuples, i.e., an outer bag. In an outer bag, relations are similar to relations in relational databases.
  • Inner bag: An inner bag is just a bag inside a tuple, i.e., a relation inside another bag. Example: in (4,{(4,2,1),(4,3,3)}), the complete relation is an outer bag and {(4,2,1),(4,3,3)} is an inner bag.
8.

Why do we need to perform partitioning in Hive?

Answer»

Apache Hive organizes tables into partitions. Partitioning is the manner in which a table is split into related parts depending on the values of particular columns like date, city, and department.

Every table in Hive can have one or more partition keys to identify a particular partition. With the help of partitions, it is effortless to run queries on slices of the data, as sketched below.
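As a sketch only, the statements below create and query a partitioned table. They are sent through the Hive JDBC driver purely to keep the example in Java; the connection URL, table name, and columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivePartitionExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver; the URL below is a placeholder
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {

            // Split the table into per-city components via a partition column
            stmt.execute("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) "
                    + "PARTITIONED BY (city STRING)");

            // A query restricted to one partition value only scans that slice of the data
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT SUM(amount) FROM sales WHERE city = 'London'")) {
                if (rs.next()) {
                    System.out.println("London total: " + rs.getDouble(1));
                }
            }
        }
    }
}
```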

9.

Are Multiline Comments supported in Hive? Why?

Answer»

No, as of now multiline comments are not supported in Hive; only single-line comments are supported.

10.

Compare the differences between Local Metastore and Remote Metastore.

Answer»
  • Local Metastore: A local metastore is a metastore service that runs in the same JVM in which the Hive service is running. It can also connect to a separate database running in a separate JVM, on the same or on a different machine.
  • Remote Metastore: A remote metastore runs in its own separate JVM process. The main advantage of remote mode over local mode is that it does not require the administrator to share JDBC login information for the metastore database.
11.

Explain the metastore in Hive.

Answer»

The metastore is used to store the metadata information; it is also possible to use an RDBMS together with the open-source ORM layer, converting object representations into a relational schema. It is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database. It gives the client access to this information by using the metastore service API. Disk storage for the Hive metadata is separate from HDFS storage.

12.

What applications are supported by Apache Hive?

Answer»

The applications that are supported by Apache Hive are:

  • Java
  • PHP
  • Python
  • C++
  • Ruby
13.

Give a brief explanation of how Spark is good at low-latency workloads like graph processing and Machine Learning.

Answer»

Apache Spark stores data in memory for faster processing. The development of machine learning models may require many algorithms to run for multiple iterations and several intermediate steps to create an optimized model, and graph algorithms traverse all the nodes and edges while building a graph. For these low-latency workloads that need many iterations, keeping the data in memory greatly enhances performance.
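To make this concrete, here is a minimal sketch of an iterative computation over a cached, in-memory dataset using Spark's Java API; the input numbers and the update rule are invented purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Double> values = Arrays.asList(1.0, 2.0, 3.0, 4.0);
            // cache() keeps the dataset in memory, so repeated passes avoid recomputation
            JavaRDD<Double> data = sc.parallelize(values).cache();

            double factor = 1.0;
            for (int i = 0; i < 10; i++) {
                final double f = factor;
                // each iteration re-reads the cached data instead of rebuilding it from scratch
                double sum = data.map(x -> x * f).reduce(Double::sum);
                factor = sum / values.size();
            }
            System.out.println("final factor: " + factor);
        }
    }
}
```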

14.

Explain the Resilient Distributed Datasets in Spark.

Answer»

Resilient Distributed Datasets (RDDs) are the basic data structure of Apache Spark and are embedded in Spark Core. They are immutable and fault-tolerant. RDDs are generated by transforming already existing RDDs or by loading an external dataset from stable storage like HDFS or HBase.

Since they are distributed collections of objects, they can be operated on in parallel: RDDs are divided into partitions that can be processed on various nodes of a cluster.
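A minimal sketch of creating an RDD from an external dataset and operating on it in parallel with Spark's Java API; the HDFS path and the filter condition are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build an RDD from an external dataset (placeholder HDFS path)
            JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");

            // Transformations return new immutable RDDs; the original RDD is untouched
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

            // The RDD is split into partitions that are processed in parallel across the cluster
            System.out.println("partitions: " + errors.getNumPartitions());
            System.out.println("error lines: " + errors.count());
        }
    }
}
```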

15.

What are the basic parameters of a mapper?

Answer»

The primary parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input key/value types, and the other two signify the intermediate output key/value types.
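For reference, this is how those parameters appear in a typical word-count style mapper; the class name and the tokenizing logic are only illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: LongWritable (byte offset) and Text (the line);
// intermediate output key/value: Text (the word) and IntWritable (a count of 1).
public class TokenCounterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```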

16.

List the actions that happen when a DataNode fails.

Answer»
  • Both the JobTracker and the NameNode detect the failure and determine which blocks were on the failed DataNode.
  • All the tasks on the failed node are rescheduled by locating other DataNodes that hold copies of those blocks.
  • The NameNode replicates the user's data to another node to maintain the configured replication factor.
17.

Explain the distributed Cache in MapReduce framework.

Answer»

Distributed Cache is a significant feature provided by the MapReduce framework, used when you want to share files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files.

Hadoop's MapReduce framework provides the facility to cache small to moderate read-only files such as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes) on which MapReduce tasks are running. Every DataNode gets a local copy of the file, which is sent through the Distributed Cache.
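A minimal sketch of registering a cache file with the newer MapReduce API; the job name and the HDFS path of the lookup file are placeholders.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileSetup {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-demo");

        // Ship a small read-only lookup file (placeholder path) to every node that
        // runs tasks of this job; each node keeps its own local copy.
        job.addCacheFile(URI.create("hdfs:///config/lookup.properties"));

        // Inside a Mapper or Reducer, the local copies are listed via
        // context.getCacheFiles() and can then be opened like ordinary files.
    }
}
```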

18.

Explain the actions performed by a JobTracker in Hadoop.

Answer»
  • The client application submits jobs to the JobTracker.
  • The JobTracker communicates with the NameNode to determine the data location.
  • The JobTracker locates TaskTracker nodes with available slots at or near the data.
  • It submits the work to the selected TaskTracker nodes.
  • When a task fails, the JobTracker is notified and decides how to proceed.
  • The JobTracker monitors the TaskTracker nodes.
19.

Explain the purpose of the dfsadmin tool?

Answer»

The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File System (HDFS). As a bonus, you can use them to perform some administration operations on HDFS as well.

20.

Which Command is used to find the status of the Blocks and File-system health?

Answer»

The command used to find the status of the blocks is: hdfs fsck <path> -files -blocks
The command used to check file-system health is: hdfs fsck / -files -blocks -locations > dfs-fsck.log

21.

What are the two types of metadata that the NameNode server stores?

Answer»

The two types of metadata that the NameNode server stores are kept on disk and in RAM.
The metadata is linked to two files, which are:

  • EditLogs: It contains all the recent changes made to the file system with respect to the most recent FsImage.
  • FsImage: It contains the whole state of the file-system namespace since the creation of the NameNode.

When a file is deleted from HDFS, the NameNode immediately records this change in the EditLog.
All the file-system metadata present in the NameNode's RAM is read by the Secondary NameNode at regular intervals and recorded into the file system or hard disk. The EditLogs are combined with the FsImage in the NameNode: periodically, the Secondary NameNode downloads the EditLogs from the NameNode and applies them to the FsImage. The new FsImage is then copied back to the NameNode and is used only after the NameNode has started the next time.

22.

How can you skip the bad records in Hadoop?

Answer»

Hadoop provides an option where a particular set of bad input records can be skipped when processing map inputs. Applications can manage this feature through the SkipBadRecords class.
This feature can be used when map tasks fail deterministically on a particular input. This usually happens due to faults in the map function, which the user would otherwise have to fix.

23.

What is the default replication factor?

Answer»

By default, the replication factor is 3. No two copies will be on the same DataNode. Usually, the first two copies will be on the same rack, and the third copy will be on a different rack. It is advised to set the replication factor to at least three so that one copy is always safe, even if something happens to the rack.
We can set the default replication factor of the file system as well as of each file and directory individually. For files that are not essential, we can lower the replication factor, while critical files should have a high replication factor.
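For illustration, the replication factor of individual files can be changed through the HDFS FileSystem API; the paths and the values below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Raise the replication factor of a critical file (placeholder path)
            fs.setReplication(new Path("/data/critical/report.csv"), (short) 5);

            // Lower it for a file that is easy to regenerate
            fs.setReplication(new Path("/data/tmp/scratch.csv"), (short) 2);
        }
    }
}
```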

24.

Why are blocks in HDFS huge?

Answer»

By default, the size of the HDFS data block is 128 MB. The reasons for the large block size are (a short example of setting a custom block size follows the list):

  • To reduce the cost of seeks: Because of the large block size, the time taken to transfer the data from the disk is much longer than the time taken to seek to the start of the block. As a result, a file consisting of multiple blocks is transferred at close to the disk transfer rate.
  • If the blocks were small, there would be too many blocks in Hadoop HDFS and too much metadata to store. Managing such a vast number of blocks and their metadata would create overhead and lead to traffic in the network.
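A minimal sketch of writing a file with a non-default block size through the FileSystem API; the output path, 256 MB block size, and payload are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            long blockSize = 256L * 1024 * 1024; // 256 MB instead of the default 128 MB
            short replication = 3;
            int bufferSize = 4096;
            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(
                    new Path("/data/big/output.dat"), true, bufferSize, replication, blockSize)) {
                out.writeBytes("example payload\n");
            }
        }
    }
}
```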