Interview Solution
| 1. |
What is PySpark DAGScheduler? |
|
Answer» DAG stands for Directed Acyclic Graph. The DAGScheduler is the scheduling layer of Spark that implements stage-oriented scheduling using jobs and stages. It transforms the logical execution plan (the dependency lineage of transformations applied to RDDs) into a physical execution plan consisting of stages. For each job it computes a DAG of stages, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule for running the job. The stages are then submitted to the TaskScheduler for execution. DAGScheduler performs the following three things in Spark:
1. Computes an execution DAG, i.e. a DAG of stages, for each job.
2. Determines the preferred locations to run each task on.
3. Handles failures due to the loss of shuffle output files.
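The stage-splitting idea above can be sketched in plain Python. This is a conceptual illustration, not Spark's internal API: the names `plan_stages`, `NARROW`, and `WIDE` are assumptions made for the example. It shows the key rule that a shuffle (wide) dependency ends one stage and starts the next.

```python
# Conceptual sketch (NOT Spark's real API): splitting a linear lineage of
# transformations into stages at shuffle (wide) dependency boundaries,
# mirroring how DAGScheduler breaks the RDD graph into stages.

NARROW, WIDE = "narrow", "wide"

def plan_stages(lineage):
    """Split a lineage of (name, dependency_type) pairs into stages.

    A new stage begins after every wide (shuffle) dependency."""
    stages, current = [], []
    for name, dep in lineage:
        current.append(name)
        if dep == WIDE:          # shuffle boundary closes the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

# A word-count-style lineage: reduceByKey introduces a shuffle.
lineage = [
    ("textFile", NARROW),
    ("flatMap", NARROW),
    ("map", NARROW),
    ("reduceByKey", WIDE),   # shuffle -> stage boundary
    ("collect", NARROW),
]
print(plan_stages(lineage))
# [['textFile', 'flatMap', 'map', 'reduceByKey'], ['collect']]
```

In real Spark you can see the same boundaries by calling `rdd.toDebugString()`, which prints the lineage with its stage-forming shuffle dependencies.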
PySpark’s DAGScheduler follows an event-queue architecture: a thread posts events of type DAGSchedulerEvent, such as a job or stage submission, to a queue. The DAGScheduler then reads the resulting stages and executes them sequentially in topological order. |
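A minimal sketch of that event-queue pattern, assuming an illustrative `MiniDAGScheduler` class (this is not PySpark's internal implementation): events carrying stages and their dependencies are posted to a queue, drained by a loop, and the stages are run in topological order.

```python
from collections import deque

# Illustrative sketch of an event-queue scheduler. Class and method names
# are assumptions for this example, not Spark's actual internals.
class MiniDAGScheduler:
    def __init__(self):
        self.events = deque()    # the event queue a posting thread appends to
        self.executed = []       # stages "submitted" for execution, in order

    def post(self, event):
        # A thread posts a DAGSchedulerEvent-like (stages, deps) tuple.
        self.events.append(event)

    def run(self):
        # Drain the queue; run each event's stages in topological order.
        while self.events:
            stages, deps = self.events.popleft()
            for stage in self._topo_order(stages, deps):
                self.executed.append(stage)

    @staticmethod
    def _topo_order(stages, deps):
        # Kahn-style ordering: repeatedly pick a stage whose parents are done.
        done, order, remaining = set(), [], list(stages)
        while remaining:
            for s in remaining:
                if all(p in done for p in deps.get(s, [])):
                    done.add(s)
                    order.append(s)
                    remaining.remove(s)
                    break
        return order

sched = MiniDAGScheduler()
# A job-submitted-style event: stage2 depends on stage0 and stage1,
# and stage1 depends on stage0.
sched.post((["stage2", "stage1", "stage0"],
            {"stage2": ["stage0", "stage1"], "stage1": ["stage0"]}))
sched.run()
print(sched.executed)   # ['stage0', 'stage1', 'stage2']
```

Parent stages always run before their children, which is exactly why stages are executed in topological order rather than submission order.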
|