
51.

Can you list down all the collections/complex data types present in Hive?

Answer»

Hive supports the following complex data types:

  • Array
  • Union
  • Map
  • Struct
52.

Can you briefly explain all the available components of Hive Data Model?

Answer»

The available components of the Hive data model are as follows:

  • Tables: Hive tables are similar to RDBMS tables. Joins and unions can be applied to them, filters can be applied as well, and all table data is stored in HDFS.
  • Partitions: We can specify partition keys in each table to determine how the data is stored. Using partitions, queries can scan only the relevant subsets of data rather than the whole table.
  • Buckets: Within each partition, data can be divided into buckets. Buckets make it easy to evaluate queries on a specific sample of the data.
53.

What do you mean by safe mode in Hadoop?

Answer»

In Apache Hadoop, safe mode is a mode used for maintenance. It acts as a read-only mode for the NameNode in order to avoid any modifications to the file system. While HDFS is in safe mode, data blocks can't be replicated or deleted. The NameNode collects block reports and statistics from all the DataNodes during this time.

54.

Why is the Context object used in Hadoop?

Answer»

In Hadoop, the Context object is used with the Mapper class so that the mapper can interact with the remaining parts of the system. Through the Context object, the job and system configuration details can easily be obtained.

Information can easily be passed to methods like setup(), map(), and cleanup() using the Context object, and vital information can be made available to the map operation through it.
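As a rough illustration, here is a minimal word-count style mapper sketch showing how the Context object is typically used to read the job configuration and emit output; the property name tokenmapper.lowercase is a hypothetical custom property, included only to show configuration access.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A word-count style mapper illustrating how the Context object is used.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private boolean toLowerCase;

    @Override
    protected void setup(Context context) {
        // The Context exposes the job configuration to the mapper.
        Configuration conf = context.getConfiguration();
        // "tokenmapper.lowercase" is a hypothetical property, shown for illustration only.
        toLowerCase = conf.getBoolean("tokenmapper.lowercase", false);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = toLowerCase ? value.toString().toLowerCase() : value.toString();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                // Intermediate (key, value) pairs are emitted through the Context.
                context.write(new Text(token), ONE);
            }
        }
    }
}
```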

55.

How do you define the distance between two nodes in Hadoop?

Answer»

The distance between two nodes is equal to the sum of their distances to their closest common ancestor in the network topology. The getDistance() method can be used to calculate the distance between two nodes.
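As a sketch, assuming the org.apache.hadoop.net.NetworkTopology API and a hypothetical two-rack layout, the tree-based distance (one hop per step up or down the topology tree) works out as follows:

```java
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

// Illustrates Hadoop's tree-based node distance on a hypothetical two-rack layout.
public class NodeDistanceExample {
    public static void main(String[] args) {
        NetworkTopology topology = new NetworkTopology();

        // Hypothetical cluster layout: two racks under the default root.
        Node host1 = new NodeBase("/rack1/host1");
        Node host2 = new NodeBase("/rack1/host2");
        Node host3 = new NodeBase("/rack2/host3");
        topology.add(host1);
        topology.add(host2);
        topology.add(host3);

        // Same node -> 0, same rack -> 2, different racks -> 4.
        System.out.println(topology.getDistance(host1, host1));
        System.out.println(topology.getDistance(host1, host2));
        System.out.println(topology.getDistance(host1, host3));
    }
}
```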

56.

What are the primary phases of reducer in Hadoop?

Answer»

In Hadoop, the primary phases of the reducer are as follows:

  • Shuffle: In this phase, the mapper's sorted output becomes the input to the reducer.
  • Sort: In this phase, Hadoop sorts the reducer input by key. The sort and shuffle phases happen concurrently.
  • Reduce: This phase occurs after sort and shuffle. Here, the output values associated with a specific key are reduced to consolidate the data into the final reducer output. The reducer output is not sorted again.
57.

What do you mean by replication factor in Hadoop?

Answer»

In Hadoop, the replication factor is the number of times the framework replicates each data block in the system. The default replication factor in Hadoop is 3, which can be changed to suit system requirements. The main advantage of replication is to ensure data availability.

We can configure the replication factor in the hdfs-site.xml file, setting it lower or higher than 3 according to the requirements.
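Besides editing hdfs-site.xml, the replication factor can also be influenced from client code; the following is a minimal sketch assuming the standard FileSystem API (the path /data/sample.txt is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Two ways to influence the replication factor programmatically.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Overrides dfs.replication (normally set in hdfs-site.xml) for files
        // created through this client; the cluster-wide default is not changed.
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);
        // Changes the replication factor of an existing file.
        // "/data/sample.txt" is a hypothetical path used only for illustration.
        fs.setReplication(new Path("/data/sample.txt"), (short) 2);
        fs.close();
    }
}
```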

58.

What are the different modes in which Hadoop can run?

Answer»

Hadoop can run in three different modes. These are listed below:

  • Standalone mode: The NameNode, DataNode, Secondary NameNode, Job Tracker, and Task Tracker do not run in standalone mode. It is also called local mode, and Hadoop runs in this mode by default.
  • Pseudo-distributed mode: A single node is used in this mode as well, but all the Hadoop daemons and processes run as separate, independent processes on that node.
  • Fully distributed mode: This is the most important mode, as multiple nodes are used. A few nodes host the ResourceManager and NameNode, and the rest of the nodes run the NodeManager and DataNode.
59.

What do you mean by FIFO scheduling in HDFS?

Answer»

FIFO, also known as First In First Out, is the simplest job scheduling algorithm in Hadoop: the tasks or processes that arrive first are served first. FIFO is the default scheduler in Hadoop. All tasks or processes are placed in a queue and executed in their order of submission. The major disadvantage of this type of scheduling is that higher-priority tasks have to wait for their turn, which can impact critical processing.

60.

What are the main methods of Reducer?

Answer»

The main methods of the Reducer are given below (a short sketch follows the list):

  • setup(): This method is used for configuring parameters such as the size of the input data, the distributed cache, etc.
  • reduce(): This is the heart of the reducer; it is called once per key with the associated list of values for that key.
  • cleanup(): This method is used to clear out temporary files and is called only once, at the end of the reduce task.
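A minimal sum-reducer sketch, assuming the standard org.apache.hadoop.mapreduce.Reducer API, showing where setup(), reduce(), and cleanup() fit in:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A sum reducer showing the roles of setup(), reduce(), and cleanup().
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Runs once before any reduce() call; read configuration parameters here if needed.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all values grouped under that key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once after the last reduce() call; release temporary resources here.
    }
}
```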
61.

What are various XML configuration files present in Hadoop?

Answer»

The various configuration files present in Hadoop are as follows (a small example of reading these values follows the list):

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
  • hadoop-env.sh
  • masters
  • slaves

(The last three are shell or plain-text files rather than XML.)
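As a small sketch of how values from these files are consumed, client code typically reads them through the Configuration API; fs.defaultFS and dfs.replication are standard properties that come from core-site.xml and hdfs-site.xml respectively.

```java
import org.apache.hadoop.conf.Configuration;

// Reads configuration values that originate from the Hadoop XML configuration files.
public class ConfigReadExample {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();
        // Additional resources such as hdfs-site.xml can be added explicitly if needed.
        conf.addResource("hdfs-site.xml");

        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}
```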
62.

Can you please list down the four Vs of big data?

Answer»

The four Vs of big data describe four dimensions of big data. These are listed below:

  • Variety
  • Volume
  • Veracity
  • Velocity
63.

What will happen if a user submits a new job while NameNode is down?

Answer»

When the NameNode is down, the entire cluster is down, so the cluster won't be accessible. All the services running on that cluster will also be down. In this scenario, if any user tries to submit a new job, it will error out and fail. All the existing jobs that are running will fail as well.

So briefly, we can say that when the NameNode goes down, all new as well as existing jobs will fail because all services are down. The user has to wait for the NameNode to restart and can run a job once the NameNode is back up.

64.

What is Rack Awareness in HDFS?

Answer»

In Hadoop, Rack Awareness is the concept of choosing DataNodes that are closer, based on rack information. By default, Hadoop assumes that all nodes belong to the same rack.

To reduce network traffic while reading/writing HDFS files, the NameNode chooses DataNodes on the same or a nearby rack to serve read/write requests. To make this possible, the HDFS NameNode maintains the rack ID of each DataNode. This concept in HDFS is known as Rack Awareness.

65.

What are the important languages or fields used by a data engineer?

Answer»

Below are the various fields and languages used by a data engineer:

  • Machine learning, including programming languages like Python, Java, JavaScript, Scala, etc.
  • Knowledge of mathematics (linear algebra and probability) is a must.
  • SQL, NoSQL databases, and HiveQL
  • Apache Airflow, Apache Kafka, and Apache Spark
  • Hadoop Ecosystem
66.

What are the differences between NAS and DAS in Hadoop?

Answer»

The differences between NAS and DAS are as follows:

| NAS | DAS |
| --- | --- |
| NAS stands for Network Attached Storage. | DAS stands for Direct Attached Storage. |
| Storage capacity is between 10^9 and 10^12 bytes. | Storage capacity is around 10^9 bytes. |
| Storage is distributed over distinct servers on a network. | Storage is attached to the node where the computation takes place. |
| It has a moderate storage management cost. | It has a high storage management cost. |
| Data transmission takes place using Ethernet or TCP/IP. | Data transmission takes place using IDE/SCSI. |
67.

What should be the daily responsibilities of a data engineer?

Answer»

This question is asked by interviewers to check your understanding of the role of a data engineer. Typical daily responsibilities include:

  • They use a systematic approach to develop, test, and maintain data architectures.
  • They align the architecture design with business requirements.
  • They help in obtaining data from the right sources and, after formulating data set processes, they store optimized data.
  • They help to deploy machine learning and statistical models.
  • They dive into the data and help to develop pipelines that automate tasks where manual participation can be avoided.
  • They help in simplifying the data cleansing process.
  • They conduct research to address the issues and enhance data reliability, accuracy, flexibility, and quality.
68.

What are the default port numbers for Task Tracker, Job Tracker, and NameNode in Hadoop?

Answer»

The default ports for Task Tracker, Job Tracker, and NameNode in Hadoop are as below:

  • The default port of Job Tracker is: 50030
  • The default port of Task Tracker is: 50060
  • The default port of NameNode is: 50070
69.

How does the NameNode communicate with the DataNode?

Answer»

The NameNode communicates with and gets information from DataNodes via messages or signals.

There are two types of messages/signals used for this communication across the channel:

  • Block report signals: A block report is the list of all HDFS data blocks stored on a DataNode, corresponding to all of its local block files; the DataNode sends this report to the NameNode.
  • Heartbeat signals: These signals, sent from a DataNode to the NameNode, are taken as a sign of vitality. They are used to check whether the DataNode is alive and functional, and they act as a periodic report telling the NameNode that the DataNode can still be used. If the signal is not received, it implies that the DataNode has technical or health issues and has stopped working. The default heartbeat interval is 3 seconds.
70.

Explain the steps to achieve security in Hadoop.

Answer»

Below are the steps to achieve security in Hadoop:

  • The first step in securing an Apache Hadoop cluster is authentication: the channel from the client to the authentication server is secured, and a time-stamped ticket is provided to the client.
  • The client then uses this received time stamp to request the TGS (Ticket Granting Server) to create a service ticket.
  • In the last step, the client uses the already created service ticket to authenticate itself to a specific server.
71.

How does the block scanner handle corrupted DataNode blocks?

Answer»

Following are the steps followed by the block scanner when it detects a corrupted DataNode block:

  • Whenever the block scanner comes across a corrupted block, the DataNode reports this particular block to the NameNode.
  • The NameNode then processes the block and creates a new replica of it from one of the existing healthy replicas.
  • The system does not delete the corrupted block until the replication count of the newly created replicas matches the replication factor, which is 3 by default.

This whole process helps HDFS maintain the integrity of the data during read operations performed by clients.

72.

What is Block and what role does Block Scanner play in HDFS?

Answer»

A block is the smallest unit of data allocated to a file; blocks are created automatically by the Hadoop system so that data can be stored across a different set of nodes in a distributed system. Large files are automatically sliced into small chunks called blocks by Hadoop.

The block scanner, as its name suggests, is used to verify whether the small chunks of files known as blocks created by Hadoop are successfully stored on the DataNode or not. It helps to detect corrupt blocks present on a DataNode.

73.

Can you explain the important features of Hadoop?

Answer»

Some of the important features of Hadoop are as below:

  • Hadoop is an open-source framework that can be used free of cost.
  • Data processing is very fast because Hadoop supports parallel processing of data.
  • In order to avoid data loss, data redundancy is given high priority.
  • It stores data in separate clusters, independent of other operations.
  • It is highly scalable: large amounts of data are divided across multiple (cost-effective) machines in a cluster and processed in parallel.
  • Hadoop provides flexibility, as it handles any kind of dataset very efficiently: structured (MySQL data), semi-structured (JSON, XML), and unstructured (images and videos).
74.

What is Hadoop Streaming?

Answer»

Hadoop Streaming is one of the widely used utilities that comes with the Hadoop distribution. It allows the user to create and run Map/Reduce jobs with the help of various programming languages like Ruby, Perl, Python, C++, etc., which can then be submitted to a specific cluster for execution.

75.

What is NameNode in HDFS?

Answer»

NameNode is the master node in the Hadoop HDFS architecture. It keeps track of the various files across all clusters, but it does not store the actual data, only the metadata of HDFS. The actual data is stored in the DataNodes.

76.

What are the various components of a Hadoop application?

Answer»
  • HDFS: HDFS stands for Hadoop Distributed File System. While working with Hadoop, all the data gets stored in the Hadoop Distributed File System. It is fault-tolerant and provides a distributed file system with very high bandwidth.
  • Hadoop Common: It consists of the set of common utilities and libraries that are utilized by Hadoop.
  • Hadoop YARN: It is used for managing resources in the Hadoop system. Task scheduling for users can also be performed using YARN.
  • Hadoop MapReduce: It is based on the MapReduce algorithm, which provides for large-scale data processing.
77.

What is Hadoop? Can you please explain briefly?

Answer»

In today's world, the majority of big applications generate big data that requires vast storage space and a large amount of processing power. Hadoop is an open-source framework that plays a significant role in providing this capability to the database world.

78.

What are the differences between structured and unstructured data?

Answer»

The differences between structured and unstructured data are as follows:

| Parameter | Structured Data | Unstructured Data |
| --- | --- | --- |
| Storage | DBMS | Unmanaged file structures |
| Standard | ODBC, ADO.NET, and SQL | XML, SMTP, CSV, and SMS |
| Integration tool | ETL (Extract, Transform, Load) | Batch processing or manual data entry |
| Scaling | Schema scaling is difficult | Schema scaling is very easy |
| Version management | Versioning over tuples, rows, and tables | Versioning as a whole is possible |
| Example | An ordered text dataset file | Images, video files, audio files, etc. |
79.

Can you explain the various types of design schemas relevant to data modelling?

Answer»

Companies can ask you questions about design schemas in order to test your knowledge of the fundamentals of data engineering. Data modelling consists of mainly two types of schemas:

  • Star schema: A star schema consists of dimension tables that surround a central fact table.
  • Snowflake schema: A snowflake schema also contains dimension tables surrounding a fact table, but these dimension tables are themselves surrounded by further dimension tables.
80.

What is Data Modelling?

Answer»

Data modelling is the process of converting and transforming complex software data systems into simple diagrams that are easy to understand, thus making the system independent of any prerequisites. You can explain any prior experience with data modelling, if any, in the form of some scenarios.

81.

What is Data Engineering?

Answer»

This may seem like a pretty basic question, but regardless of your skill level, it is one of the most common questions that can come up during your interview. So, what is it? Briefly, Data Engineering is a term used in big data. It is the process of transforming raw data (data generated from various sources) into useful information that can be used for various purposes.