
51.

Can you list down all the collections/complex data types present in Hive?

Answer»

Hive supports the following complex data types:

  • Array
  • Union
  • Map
  • Struct
52.

Can you briefly explain all the available components of Hive Data Model?

Answer»

The available components of the Hive data model are as follows:

  • Tables: Hive tables are similar to RDBMS tables. Joins and unions can be applied to them, filters can be applied as well, and all table data is stored in HDFS.
  • Partitions: We can specify partition keys in each table to determine how the data is stored. Using partitions, queries can scan only the relevant subsets of data rather than the whole table.
  • Buckets: Within each partition, data can be divided into buckets. Buckets make it easy to evaluate queries on a specific sample of the data.
53.

What do you mean by safe mode in Hadoop?

Answer»

In Apache Hadoop, safe mode is a mode used for maintenance. It acts as a read-only mode for the NameNode in order to avoid any modifications to the file system. While HDFS is in safe mode, data blocks can't be replicated or deleted. The NameNode collects block reports and statistics from all the DataNodes during this time.

54.

Why is the Context object used in Hadoop?

Answer»

In Hadoop, the Context object is used with the Mapper class so that the mapper can interact with the remaining parts of the system. Through the Context object, the job and system configuration details can easily be obtained.

Information can easily be passed to methods like setup(), map(), and cleanup() using the Context object, and vital information can be made available to the map operation through it.
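As a rough illustration, here is a minimal word-count style mapper sketch showing how the Context object is typically used to read the job configuration and emit output; the property name tokenmapper.lowercase is a hypothetical custom property, included only to show configuration access.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A word-count style mapper illustrating how the Context object is used.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private boolean toLowerCase;

    @Override
    protected void setup(Context context) {
        // The Context exposes the job configuration to the mapper.
        Configuration conf = context.getConfiguration();
        // "tokenmapper.lowercase" is a hypothetical property, shown for illustration only.
        toLowerCase = conf.getBoolean("tokenmapper.lowercase", false);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = toLowerCase ? value.toString().toLowerCase() : value.toString();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                // Intermediate (key, value) pairs are emitted through the Context.
                context.write(new Text(token), ONE);
            }
        }
    }
}
```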

55.

How do you define the distance between two nodes in Hadoop?

Answer»

The distance between two nodes is equal to the sum of their distances to their closest common ancestor in the network topology. The getDistance() method can be used to calculate the distance between two nodes.
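As a sketch, assuming the org.apache.hadoop.net.NetworkTopology API and a hypothetical two-rack layout, the tree-based distance (one hop per step up or down the topology tree) works out as follows:

```java
import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

// Illustrates Hadoop's tree-based node distance on a hypothetical two-rack layout.
public class NodeDistanceExample {
    public static void main(String[] args) {
        NetworkTopology topology = new NetworkTopology();

        // Hypothetical cluster layout: two racks under the default root.
        Node host1 = new NodeBase("/rack1/host1");
        Node host2 = new NodeBase("/rack1/host2");
        Node host3 = new NodeBase("/rack2/host3");
        topology.add(host1);
        topology.add(host2);
        topology.add(host3);

        // Same node -> 0, same rack -> 2, different racks -> 4.
        System.out.println(topology.getDistance(host1, host1));
        System.out.println(topology.getDistance(host1, host2));
        System.out.println(topology.getDistance(host1, host3));
    }
}
```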

56.

What are the primary phases of reducer in Hadoop?

Answer»

In Hadoop, the primary phases of the reducer are as follows:

  • Shuffle: In this phase, the mapper's sorted output becomes the input to the reducer.
  • Sort: In this phase, Hadoop sorts the reducer input by key. The sort and shuffle phases happen concurrently.
  • Reduce: This phase occurs after sort and shuffle. Here, the output values associated with a specific key are reduced to consolidate the data into the final reducer output. The reducer output is not sorted again.
57.

What do you mean by replication factor in Hadoop?

Answer»

In Hadoop, the replication factor is the number of times the framework replicates each data block in the system. The default replication factor in Hadoop is 3, which can be changed to suit system requirements. The main advantage of replication is to ensure data availability.

We can configure the replication factor in the hdfs-site.xml file, setting it lower or higher than 3 according to the requirements.
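Besides editing hdfs-site.xml, the replication factor can also be influenced from client code; the following is a minimal sketch assuming the standard FileSystem API (the path /data/sample.txt is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Two ways to influence the replication factor programmatically.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Overrides dfs.replication (normally set in hdfs-site.xml) for files
        // created through this client; the cluster-wide default is not changed.
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);
        // Changes the replication factor of an existing file.
        // "/data/sample.txt" is a hypothetical path used only for illustration.
        fs.setReplication(new Path("/data/sample.txt"), (short) 2);
        fs.close();
    }
}
```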

58.

What are the different modes in which Hadoop can run?

Answer»

Hadoop can run in three different modes. These are listed below:

  • Standalone mode: The NameNode, DataNode, Secondary NameNode, Job Tracker, and Task Tracker do not run in standalone mode. It is also called local mode, and Hadoop runs in this mode by default.
  • Pseudo-distributed mode: A single node is used in this mode as well, but all the Hadoop daemons and processes run as separate, independent processes on that node.
  • Fully distributed mode: This is the most important mode, as multiple nodes are used. A few nodes host the ResourceManager and NameNode, and the rest of the nodes run the NodeManager and DataNode.
59.

What do you mean by FIFO scheduling in HDFS?

Answer»

FIFO, also known as First In First Out, is the simplest job scheduling algorithm in Hadoop: the tasks or processes that arrive first are served first. FIFO is the default scheduler in Hadoop. All tasks or processes are placed in a queue and executed in their order of submission. The major disadvantage of this type of scheduling is that higher-priority tasks have to wait for their turn, which can impact critical processing.

60.

What are the main methods of Reducer?

Answer»

The main methods of the Reducer are given below (a short sketch follows the list):

  • setup(): This method is used for configuring parameters such as the size of the input data, the distributed cache, etc.
  • reduce(): This is the heart of the reducer; it is called once per key with the associated list of values for that key.
  • cleanup(): This method is used to clear out temporary files and is called only once, at the end of the reduce task.
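A minimal sum-reducer sketch, assuming the standard org.apache.hadoop.mapreduce.Reducer API, showing where setup(), reduce(), and cleanup() fit in:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A sum reducer showing the roles of setup(), reduce(), and cleanup().
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Runs once before any reduce() call; read configuration parameters here if needed.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all values grouped under that key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once after the last reduce() call; release temporary resources here.
    }
}
```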
61.

What are various XML configuration files present in Hadoop?

Answer»

The various configuration files present in Hadoop are as follows (a small example of reading these values follows the list):

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
  • hadoop-env.sh
  • masters
  • slaves

(The last three are shell or plain-text files rather than XML.)
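As a small sketch of how values from these files are consumed, client code typically reads them through the Configuration API; fs.defaultFS and dfs.replication are standard properties that come from core-site.xml and hdfs-site.xml respectively.

```java
import org.apache.hadoop.conf.Configuration;

// Reads configuration values that originate from the Hadoop XML configuration files.
public class ConfigReadExample {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();
        // Additional resources such as hdfs-site.xml can be added explicitly if needed.
        conf.addResource("hdfs-site.xml");

        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS", "file:///"));
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
    }
}
```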
62.

Can you please list down the four Vs of big data?

Answer»

The four Vs of big data describe four dimensions of big data. These are listed below:

  • Variety
  • Volume
  • Veracity
  • Velocity
63.

What will happen if a user submits a new job while NameNode is down?

Answer»

When the NameNode is down, the entire cluster is down, so the cluster won't be accessible. All the services running on that cluster will also be down. In this scenario, if any user tries to submit a new job, it will error out and fail. All the existing jobs that are running will fail as well.

So briefly, we can say that when the NameNode goes down, all new as well as existing jobs will fail because all services are down. The user has to wait for the NameNode to restart and can run a job once the NameNode is back up.

64.

What is Rack Awareness in HDFS?

Answer»

In Hadoop, Rack Awareness is the concept of choosing DataNodes that are closer, based on rack information. By default, Hadoop assumes that all nodes belong to the same rack.

To reduce network traffic while reading/writing HDFS files, the NameNode chooses DataNodes on the same or a nearby rack to serve read/write requests. To make this possible, the HDFS NameNode maintains the rack ID of each DataNode. This concept in HDFS is known as Rack Awareness.

65.

What are the important languages or fields used by a data engineer?

Answer»

Below are the various fields and languages used by a data engineer:

  • Machine learning, including programming languages like Python, Java, JavaScript, Scala, etc.
  • Knowledge of mathematics (linear algebra and probability) is a must.
  • SQL, NoSQL databases, and HiveQL
  • Apache Airflow, Apache Kafka, and Apache Spark
  • Hadoop Ecosystem
66.

What are the differences between NAS and DAS in Hadoop?

Answer»

The differences between NAS and DAS are as follows:

| NAS | DAS |
| --- | --- |
| NAS stands for Network Attached Storage. | DAS stands for Direct Attached Storage. |
| Storage capacity is between 10^9 and 10^12 bytes. | Storage capacity is around 10^9 bytes. |
| Storage is distributed over distinct servers on a network. | Storage is attached to the node where the computation takes place. |
| It has a moderate storage management cost. | It has a high storage management cost. |
| Data transmission takes place using Ethernet or TCP/IP. | Data transmission takes place using IDE/SCSI. |
67.

What should be the daily responsibilities of a data engineer?

Answer»

This question is asked by interviewers to check your understanding of the role of a data engineer. Typical daily responsibilities include:

  • They use a systematic approach to develop, test, and maintain data architectures.
  • They align the architecture design with business requirements.
  • They help in obtaining data from the right sources and, after formulating data set processes, they store optimized data.
  • They help to deploy machine learning and statistical models.
  • They dive into the data and help to develop pipelines that automate tasks where manual participation can be avoided.
  • They help in simplifying the data cleansing process.
  • They conduct research to address the issues and enhance data reliability, accuracy, flexibility, and quality.
68.

What are the default port numbers for Task Tracker, Job Tracker, and NameNode in Hadoop?

Answer»

The default ports for Task Tracker, Job Tracker, and NameNode in Hadoop are as below:

  • The default port of Job Tracker is: 50030
  • The default port of Task Tracker is: 50060
  • The default port of NameNode is: 50070
69.

How does the NameNode communicate with the DataNode?

Answer»

The NameNode communicates with and gets information from DataNodes via messages or signals.

There are two types of messages/signals used for this communication across the channel:

  • Block report signals: A block report is the list of all HDFS data blocks stored on a DataNode, corresponding to all of its local block files; the DataNode sends this report to the NameNode.
  • Heartbeat signals: These signals, sent from a DataNode to the NameNode, are taken as a sign of vitality. They are used to check whether the DataNode is alive and functional, and they act as a periodic report telling the NameNode that the DataNode can still be used. If the signal is not received, it implies that the DataNode has technical or health issues and has stopped working. The default heartbeat interval is 3 seconds.
70.

Explain the steps to achieve security in Hadoop.

Answer»

Below are the steps to achieve security in Hadoop:

  • The first step in securing an Apache Hadoop cluster is authentication: the channel from the client to the authentication server is secured, and a time-stamped ticket is provided to the client.
  • The client then uses this received time stamp to request the TGS (Ticket Granting Server) to create a service ticket.
  • In the last step, the client uses the already created service ticket to authenticate itself to a specific server.
71.

How does the block scanner handle corrupted DataNode blocks?

Answer»

Following are the steps followed by the block scanner when it detects a corrupted DataNode block:

  • Whenever the block scanner comes across a corrupted block, the DataNode reports this particular block to the NameNode.
  • The NameNode then processes the block and creates a new replica of it from one of the existing healthy replicas.
  • The system does not delete the corrupted block until the replication count of the newly created replicas matches the replication factor, which is 3 by default.

This whole process helps HDFS maintain the integrity of the data during read operations performed by clients.

72.

What is Block and what role does Block Scanner play in HDFS?

Answer»

A block is the smallest unit of data allocated to a file; blocks are created automatically by the Hadoop system so that data can be stored across a different set of nodes in a distributed system. Large files are automatically sliced into small chunks called blocks by Hadoop.

The block scanner, as its name suggests, is used to verify whether the small chunks of files known as blocks created by Hadoop are successfully stored on the DataNode or not. It helps to detect corrupt blocks present on a DataNode.

73.

Can you explain the important features of Hadoop?

Answer»

Some of the important features of Hadoop are as below:

  • Hadoop is an open-source framework that can be used free of cost.
  • Data processing is very fast because Hadoop supports parallel processing of data.
  • In order to avoid data loss, data redundancy is given high priority.
  • It stores data in separate clusters, independent of other operations.
  • It is highly scalable: large amounts of data are divided across multiple (cost-effective) machines in a cluster and processed in parallel.
  • Hadoop provides flexibility, as it handles any kind of dataset very efficiently: structured (MySQL data), semi-structured (JSON, XML), and unstructured (images and videos).
74.

What is Hadoop Streaming?

Answer»

Hadoop Streaming is one of the widely used utilities that comes with the Hadoop distribution. It allows the user to create and run Map/Reduce jobs with the help of various programming languages like Ruby, Perl, Python, C++, etc., which can then be submitted to a specific cluster for execution.

75.

What is NameNode in HDFS?

Answer»

NameNode is the master node in the Hadoop HDFS architecture. It keeps track of the various files across all clusters, but it does not store the actual data, only the metadata of HDFS. The actual data is stored in the DataNodes.

76.

What are the various components of a Hadoop application?

Answer»
  • HDFS: HDFS stands for Hadoop Distributed File System. While working with Hadoop, all the data gets stored in the Hadoop Distributed File System. It is fault-tolerant and provides a distributed file system with very high bandwidth.
  • Hadoop Common: It consists of the set of common utilities and libraries that are utilized by Hadoop.
  • Hadoop YARN: It is used for managing resources in the Hadoop system. Task scheduling for users can also be performed using YARN.
  • Hadoop MapReduce: It is based on the MapReduce algorithm, which provides for large-scale data processing.
77.

What is Hadoop? Can you please explain briefly?

Answer»

In today's world, the majority of big applications generate big data that requires vast storage space and a large amount of processing power. Hadoop is an open-source framework that plays a significant role in providing this capability to the database world.

78.

What are the differences between structured and unstructured data?

Answer»

The differences between structured and unstructured data are as follows:

| Parameter | Structured Data | Unstructured Data |
| --- | --- | --- |
| Storage | DBMS | Unmanaged file structures |
| Standard | ODBC, ADO.NET, and SQL | XML, SMTP, CSV, and SMS |
| Integration tool | ETL (Extract, Transform, Load) | Batch processing or manual data entry |
| Scaling | Schema scaling is difficult | Schema scaling is very easy |
| Version management | Versioning over tuples, rows, and tables | Versioning as a whole is possible |
| Example | An ordered text dataset file | Images, video files, audio files, etc. |
79.

Can you explain the various types of design schemas relevant to data modelling?

Answer»

Companies can ask you questions about design schemas in order to test your knowledge of the fundamentals of data engineering. Data modelling consists of mainly two types of schemas:

  • Star schema: A star schema consists of dimension tables that surround a central fact table.
  • Snowflake schema: A snowflake schema also contains dimension tables surrounding a fact table, but these dimension tables are themselves surrounded by further dimension tables.
80.

What is Data Modelling?

Answer»

Data modelling is the process of converting and transforming complex software data systems into simple diagrams that are easy to understand, thus making the system independent of any prerequisites. You can explain any prior experience with data modelling, if any, in the form of some scenarios.

81.

What is Data Engineering?

Answer»

This may seem like a pretty basic question, but regardless of your skill level, it is one of the most common questions that can come up during your interview. So, what is it? Briefly, Data Engineering is a term used in big data. It is the process of transforming raw data (data generated from various sources) into useful information that can be used for various purposes.