
1.

What is YARN in Spark?

Answer»
  • YARN is one of the key cluster managers that Spark can run on: it provides a central resource management platform for delivering scalable operations across the cluster.
  • YARN is a cluster management technology, while Spark is a tool for distributed data processing.
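
As a rough illustration (the application name and the job itself are made up for this sketch), the snippet below points a SparkSession at YARN as its cluster manager; in practice the master is usually supplied through `spark-submit --master yarn` rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

object YarnExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("yarn-example")   // hypothetical application name
      .master("yarn")            // hand resource management over to YARN
      .getOrCreate()

    // A trivial job so the session actually does some work on the cluster.
    println(spark.range(1, 1000).count())

    spark.stop()
  }
}
```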
2.

What do you understand by Shuffling in Spark?

Answer»

The process of redistributing data across different partitions, which may or may not move data across JVM processes or executors on separate machines, is known as shuffling (or repartitioning). A partition is simply a smaller logical division of the data.

It is worth noting that Spark has no control over which partition the data ends up in.
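
As a minimal sketch (the column names and toy data are made up for illustration), the snippet below shows two common operations that trigger a shuffle:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("sales", 100), ("hr", 80), ("sales", 120)).toDF("dept", "salary")

// groupBy requires all rows with the same key to end up in the same
// partition, so Spark redistributes (shuffles) data across executors.
val totals = df.groupBy("dept").sum("salary")

// repartition also forces a full shuffle into the requested partition count.
val reshuffled = df.repartition(8)

totals.show()
```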

3.

What are the data formats supported by Spark?

Answer»

Spark supports both raw files and structured file formats for efficient reading and processing. File formats such as Parquet, JSON, XML, CSV, RC, Avro, TSV, etc. are supported by Spark.
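
A minimal sketch of reading a few of these formats through the DataFrameReader API (all file paths here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("formats-demo").master("local[*]").getOrCreate()

val parquetDf = spark.read.parquet("/data/events.parquet")
val jsonDf    = spark.read.json("/data/events.json")
val csvDf     = spark.read.option("header", "true").csv("/data/events.csv")
val avroDf    = spark.read.format("avro").load("/data/events.avro") // needs the external spark-avro package
```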

4.

What is the difference between repartition and coalesce?

Answer»
|  | Repartition | Coalesce |
| --- | --- | --- |
| Usage | Repartition can increase or decrease the number of data partitions. | Coalesce can only reduce the number of data partitions. |
|  | Repartition creates new data partitions and performs a full shuffle, producing evenly distributed data. | Coalesce makes use of already existing partitions, reducing the amount of data shuffled, but the resulting partitions may be uneven. |
|  | Repartition internally calls coalesce with the shuffle parameter enabled, thereby making it slower than coalesce. | Coalesce is faster than repartition. However, if there are unequal-sized data partitions, it may be slightly slower. |
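
As a small sketch (the toy dataset is made up for illustration), the snippet below contrasts the two calls:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-demo").master("local[4]").getOrCreate()

val df = spark.range(0, 1000000)   // Dataset[Long] with the default partitioning

// repartition can grow or shrink the partition count and always performs a
// full shuffle, producing evenly sized partitions.
val wider = df.repartition(16)

// coalesce can only shrink the partition count; it merges existing partitions
// without a full shuffle, so it is cheaper but the result may be uneven.
val narrower = df.coalesce(2)

println(wider.rdd.getNumPartitions)    // 16
println(narrower.rdd.getNumPartitions) // 2
```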
5.

What are receivers in Apache Spark Streaming?

Answer»

Receivers are the entities that consume data from different data sources and then move it to Spark for processing. They are created using streaming contexts as long-running tasks that are scheduled to operate in a round-robin fashion, with each receiver configured to use only a single core. The receivers run on various executors to accomplish the task of data streaming. There are two types of receivers, depending on how the data is sent to Spark:

  • Reliable receivers: Here, the receiver sends an acknowledgement to the data source after successfully receiving the data and replicating it in Spark storage.
  • Unreliable receivers: Here, no acknowledgement is sent to the data source.
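
As a minimal sketch of a receiver-based stream, the snippet below uses socketTextStream, which starts a long-running receiver task that pulls lines from a TCP socket (the host and port are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least 2 local cores: one for the receiver, one for processing.
val conf = new SparkConf().setAppName("receiver-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999) // receiver-based source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```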
6.

List the types of Deploy Modes in Spark.

Answer»

There are 2 deploy modes in Spark. They are:

  • Client Mode: The deploy mode is said to be client mode when the Spark driver component runs on the machine node from which the Spark job was submitted.
    • The main disadvantage of this mode is that if the machine node fails, the entire job fails.
    • This mode supports both interactive shells and the job submission commands.
    • This mode has the worst performance and is not preferred in production environments.
  • Cluster Mode: If the Spark driver component does not run on the machine from which the Spark job was submitted, the deploy mode is said to be cluster mode.
    • The Spark job launches the driver component within the cluster as part of a sub-process of the ApplicationMaster.
    • This mode supports deployment only via the spark-submit command (interactive shell mode is not supported).
    • Here, since the driver program runs in the ApplicationMaster, the driver program is re-instantiated if it fails.
    • In this mode, a dedicated cluster manager (such as standalone, YARN, Apache Mesos, or Kubernetes) allocates the resources required for the job to run.

Apart from the above two modes, if the application has to run on a local machine for unit testing and development, the deployment mode is called “Local Mode”. Here, the jobs run on a single JVM on a single machine, which makes it highly inefficient: sooner or later there is a shortage of resources, resulting in the failure of jobs. It is also not possible to scale up resources in this mode because of the restricted memory and space.
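
As a small illustration (the application name is hypothetical), the deploy mode is normally chosen with spark-submit's --deploy-mode flag, but it also surfaces as the spark.submit.deployMode configuration property, which can be set or inspected programmatically:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("deploy-mode-demo")
  .config("spark.submit.deployMode", "client") // "cluster" keeps the driver inside the cluster
  .getOrCreate()

// Inspect the effective deploy mode of the running application.
println(spark.sparkContext.getConf.get("spark.submit.deployMode", "client"))
```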

7.

What does DAG refer to in Apache Spark?

Answer»

DAG stands for Directed Acyclic Graph: a graph with a finite set of vertices and edges and no directed cycles. Each edge is directed from one vertex to another in a sequential manner. The vertices represent the RDDs of Spark and the edges represent the operations to be performed on those RDDs.
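
As a rough sketch (the words and counts are made up for illustration), each transformation below adds to the lineage DAG without executing anything, and toDebugString prints that lineage:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words  = sc.parallelize(Seq("spark", "dag", "spark"))
val pairs  = words.map(w => (w, 1))     // transformation: new RDD, no execution yet
val counts = pairs.reduceByKey(_ + _)   // transformation: adds a shuffle stage to the DAG

println(counts.toDebugString)           // shows the RDD lineage (the DAG)
println(counts.collect().mkString(", ")) // action: triggers execution of the DAG
```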

8.

What is RDD?

Answer»

RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two types of datasets:

  • Parallelized collections: Meant for running in parallel.
  • Hadoop datasets: These perform operations on file records stored on HDFS or other storage systems.
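
As a minimal sketch of the two creation styles above (the HDFS path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Parallelized collection: distribute an in-memory Scala collection.
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Hadoop dataset: an RDD backed by records on HDFS (or any Hadoop-supported storage).
val logLines = sc.textFile("hdfs:///data/app/logs/*.log")

println(numbers.count())
```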
9.

What are the features of Apache Spark?

Answer»
  • High Processing Speed: Apache Spark helps in the achievement of a very high processing speed of data by reducing read-write operations to disk. The speed is almost 100x faster while performing in-memory computation and 10x faster while performing disk computation.
  • Dynamic Nature: Spark provides 80 high-level operators which help in the easy development of parallel applications.
  • In-Memory Computation: The in-memory computation feature of Spark, backed by its DAG execution engine, increases the speed of data processing. It also supports data caching, which reduces the time required to fetch data from disk.
  • Reusability: Spark codes can be reused for batch-processing, data streaming, running ad-hoc queries, etc.
  • Fault Tolerance: Spark supports fault tolerance using RDD. Spark RDDs are the abstractions designed to handle failures of worker nodes which ensures zero data loss.
  • Stream Processing: Spark supports stream processing in real-time. The problem in the earlier MapReduce framework was that it could process only already existing data.
  • Lazy Evaluation: Transformations on Spark RDDs are lazy, meaning they do not generate results right away; instead, they create new RDDs from existing ones. This lazy evaluation increases system efficiency.
  • Support Multiple Languages: Spark supports multiple languages like R, Scala, Python, Java which provides dynamicity and helps in overcoming the Hadoop limitation of application development only using Java.
  • Hadoop Integration: Spark also supports the Hadoop YARN cluster manager thereby making it flexible.
  • Supports Spark GraphX for graph-parallel execution, Spark SQL, libraries for machine learning, etc.
  • Cost Efficiency: Apache Spark is considered a more cost-efficient solution than Hadoop, as Hadoop requires large storage and data centers for data processing and replication.
  • Active Developer’s Community: Apache Spark has a large developer base involved in continuous development. It is considered to be the most important project undertaken by the Apache community.
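
As a small sketch of two of the features above, lazy evaluation and in-memory caching (the column name and toy data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("features-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = spark.range(0, 1000000).withColumn("even", $"id" % 2 === 0)

// Nothing has run yet: transformations are lazy and only describe the plan.
val evens = events.filter($"even")

// cache() marks the result for in-memory reuse; it is materialized by the first action.
evens.cache()
println(evens.count())   // first action: computes and caches
println(evens.count())   // second action: served from memory
```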
10.

Can you tell me what is Apache Spark about?

Answer»

Apache Spark is an open-source framework engine known for its speed and ease of use in the field of big data processing and analysis. It also has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow; it can run either in cluster mode or standalone mode, and it can access diverse data sources such as HBase, HDFS, and Cassandra.