Spark Interview Questions

This section contains frequently asked Apache Spark interview questions with curated answers to sharpen your knowledge and support interview preparation. Choose a question below to get started.
1. What is YARN in Spark?

Answer» YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource-management layer, and it is one of the cluster managers on which Spark can run. When Spark runs on YARN, the YARN ResourceManager allocates containers in which Spark's driver and executors execute, which lets Spark share cluster resources with other YARN applications. YARN is not part of Spark itself; Spark simply supports it as a deployment option alongside standalone mode, Mesos, and Kubernetes.
2. What do you understand by Shuffling in Spark?

Answer» Shuffling (also called repartitioning) is the process of redistributing data across partitions, which may or may not move data across JVM processes or across executors on separate machines. A partition is simply a smaller logical division of the data. Note that Spark has no direct control over which partition a given record ends up in; the partitioner determines the distribution.
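The routing idea behind a shuffle can be sketched in plain Python (no Spark required). The function names here are illustrative, not Spark APIs, but the rule shown — send each key to partition `hash(key) % numPartitions` — is the same idea Spark's hash partitioner uses:

```python
# Plain-Python sketch of hash partitioning during a shuffle.
# Each (key, value) record is routed to the bucket hash(key) % num_partitions,
# so all records with the same key land in the same partition.

def partition_for(key, num_partitions):
    """Return the target partition index for a key."""
    return hash(key) % num_partitions

def shuffle(records, num_partitions):
    """Redistribute (key, value) records into num_partitions buckets by key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = shuffle(records, 2)
# Records with the same key are now co-located in one bucket.
```

In real Spark the buckets live on different executors, which is why a shuffle can mean network transfer between machines.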
3. What are the data formats supported by Spark?

Answer» Spark supports both raw files and structured file formats for efficient reading and processing. File formats such as Parquet, JSON, XML, CSV, RC, Avro, and TSV are supported by Spark.
4. What is the difference between repartition and coalesce?

Answer» repartition performs a full shuffle of the data and can either increase or decrease the number of partitions, producing roughly evenly sized partitions. coalesce, in contrast, merges existing partitions and avoids a full shuffle, so in its default form it can only decrease the number of partitions and may leave them unevenly sized. Because it avoids the shuffle, coalesce is generally cheaper and is preferred when you only need to reduce the partition count.
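The contrast can be illustrated with a minimal plain-Python sketch (these functions are simplified stand-ins, not Spark's actual implementation): `repartition` deals every record round-robin into new partitions, while `coalesce` only merges whole existing partitions, so records never cross partition groups:

```python
# Simplified models of the two operations:
# - repartition: full shuffle, every record redistributed across n partitions
# - coalesce: whole partitions merged into n groups, no record-level shuffle

def repartition(partitions, n):
    """Full shuffle: deal each record round-robin into n new partitions."""
    out = [[] for _ in range(n)]
    for i, rec in enumerate(r for p in partitions for r in p):
        out[i % n].append(rec)
    return out

def coalesce(partitions, n):
    """Merge existing partitions into n groups; records stay with their partition."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

parts = [[1, 2, 3, 4], [5], [6], [7]]
print(repartition(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6]]  -- balanced
print(coalesce(parts, 2))     # [[1, 2, 3, 4, 6], [5, 7]]  -- possibly skewed
```

This mirrors why coalesce is cheap (it only concatenates existing partitions) but can produce skew, while repartition pays for a shuffle to get balance.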
5. What are receivers in Apache Spark Streaming?

Answer» Receivers are the entities that consume data from different data sources and then move it to Spark for processing. They are created by streaming contexts as long-running tasks scheduled to operate in a round-robin fashion, with each receiver configured to use only a single core. The receivers run on various executors to accomplish the task of data streaming. There are two types of receivers, depending on how the data is sent to Spark:

- Reliable receivers: acknowledge the data source only after the data has been received and stored in Spark, so no data is lost if the receiver fails.
- Unreliable receivers: send no acknowledgement; they are used with sources that do not support acknowledgement, or where some data loss is acceptable.
6. List the types of Deploy Modes in Spark.

Answer» There are 2 deploy modes in Spark. They are:

- Client mode: the driver program runs on the machine from which the job is submitted, outside the cluster.
- Cluster mode: the driver program runs inside the cluster, on one of the worker nodes, managed by the cluster manager.

Apart from the above two modes, if we have to run the application on our local machine for unit testing and development, the deployment mode is called "Local Mode". Here, the jobs run on a single JVM on a single machine, which makes it highly inefficient, as at some point there will be a shortage of resources, resulting in the failure of jobs. It is also not possible to scale up resources in this mode due to the restricted memory and space.
7. What does DAG refer to in Apache Spark?

Answer» DAG stands for Directed Acyclic Graph, a graph with no directed cycles. It has a finite set of vertices and edges, where each edge is directed from one vertex to another in a sequential manner. The vertices represent the RDDs of Spark, and the edges represent the operations to be performed on those RDDs.
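A tiny illustration of the idea, in plain Python: a lineage DAG of hypothetical RDD operations can be topologically sorted, which is exactly the property (acyclicity) that lets a scheduler order the work. The operation names below are illustrative, not a real Spark plan:

```python
# A toy lineage DAG: each operation maps to the operations it depends on
# (its parent RDDs). Because the graph is acyclic, a valid execution
# order always exists and can be found by topological sorting.

from graphlib import TopologicalSorter  # Python 3.9+

dag = {
    "textFile": [],            # source RDD, no parents
    "map":      ["textFile"],  # map depends on textFile
    "filter":   ["map"],       # filter depends on map
    "count":    ["filter"],    # the action that triggers execution
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['textFile', 'map', 'filter', 'count']
```

If the graph contained a cycle, `TopologicalSorter` would raise an error — mirroring why Spark's execution plan must be acyclic.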
8. What is RDD?

Answer» RDD stands for Resilient Distributed Datasets. It is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two ways of creating these datasets:

- Parallelized collections: created by parallelizing an existing collection in the driver program.
- External datasets: created from data in external storage such as HDFS, HBase, or a shared file system.
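The two defining properties — partitioned and immutable, with transformations producing new datasets — can be sketched in plain Python. The class below is a toy model for illustration, not Spark's RDD implementation:

```python
# Toy model of RDD-style behavior: the data is held in partitions and is
# immutable; a transformation like map() returns a *new* dataset rather
# than modifying the original (class and method names are illustrative).

class MiniRDD:
    def __init__(self, partitions):
        # store partitions as tuples so the data cannot be mutated in place
        self.partitions = tuple(tuple(p) for p in partitions)

    def map(self, f):
        # transformation: builds a brand-new MiniRDD, leaving self untouched
        return MiniRDD([[f(x) for x in p] for p in self.partitions])

    def collect(self):
        # action: gather all elements from all partitions
        return [x for p in self.partitions for x in p]

rdd = MiniRDD([[1, 2], [3, 4]])
doubled = rdd.map(lambda x: x * 2)
print(rdd.collect())      # [1, 2, 3, 4]  -- original unchanged
print(doubled.collect())  # [2, 4, 6, 8]
```

Immutability is what makes fault tolerance via lineage possible: since an RDD never changes, a lost partition can always be recomputed from its parents.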
9. What are the features of Apache Spark?

Answer» Some of the key features of Apache Spark are:

- Speed: in-memory computation makes Spark much faster than disk-based engines such as Hadoop MapReduce.
- Lazy evaluation: transformations are not executed until an action is triggered, allowing Spark to optimize the execution plan.
- Fault tolerance: lost RDD partitions can be recomputed automatically from their lineage.
- Polyglot: APIs are available in Scala, Java, Python, and R.
- Rich libraries: built-in modules for SQL, streaming, machine learning (MLlib), and graph processing (GraphX).
10. Can you tell me what Apache Spark is about?

Answer» Apache Spark is an open-source framework engine known for its speed and ease of use in the field of big data processing and analysis. It has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow. It can run either in cluster mode or standalone mode, and it can access diverse data sources such as HBase, HDFS, and Cassandra.