1.

What is PySpark's DAGScheduler?

Answer»

DAG stands for Directed Acyclic Graph. The DAGScheduler constitutes the scheduling layer of Spark: it implements stage-oriented scheduling of tasks using jobs and stages. The logical execution plan (the lineage of dependencies produced by transformations on RDDs) is transformed into a physical execution plan consisting of stages. For each job, the DAGScheduler computes a DAG of stages, keeps track of which RDDs and stage outputs are already materialized, and finds a minimal schedule for running the job. The resulting stages are then submitted to the TaskScheduler for execution.

DAGScheduler performs the following three things in Spark:

  • Computes an execution DAG, i.e. a DAG of stages, for each job.
  • Determines the preferred locations for running each task.
  • Handles failures caused by shuffle output files being lost.
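The first responsibility, computing a DAG of stages and a minimal schedule, can be illustrated with a small toy model. This is not Spark's implementation; it is a plain-Python sketch (all names here are assumptions) of the underlying idea: order stages so that parents run before children, and skip stages whose output is already materialized.

```python
from collections import deque

def schedule_stages(dependencies, materialized):
    """Toy model of stage scheduling: return stages in a valid execution
    order (parents before children), skipping already-materialized stages.

    dependencies: dict mapping stage -> list of parent stages
    materialized: set of stages whose output already exists
    """
    # Count unmet parent dependencies; materialized parents are satisfied.
    indegree = {s: 0 for s in dependencies}
    children = {s: [] for s in dependencies}
    for stage, parents in dependencies.items():
        for p in parents:
            children[p].append(stage)
            if p not in materialized:
                indegree[stage] += 1

    # Kahn's algorithm: start with stages whose parents are all satisfied.
    ready = deque(s for s, d in indegree.items()
                  if d == 0 and s not in materialized)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for child in children[stage]:
            indegree[child] -= 1
            if indegree[child] == 0 and child not in materialized:
                ready.append(child)
    return order

# Example: S2 depends on S0 and S1; S3 depends on S2; S1 is already
# materialized, so the minimal schedule omits it.
deps = {"S0": [], "S1": [], "S2": ["S0", "S1"], "S3": ["S2"]}
print(schedule_stages(deps, {"S1"}))  # -> ['S0', 'S2', 'S3']
```

The real DAGScheduler determines stage boundaries from shuffle dependencies, but the scheduling principle, topological order over the stage DAG minus materialized stages, is the same.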

PySpark's DAGScheduler follows an event-queue architecture: a thread posts events of type DAGSchedulerEvent, such as a new job or stage being submitted. The DAGScheduler then reads the stages and executes them sequentially in topological order.
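The event-queue pattern itself can be sketched in a few lines of plain Python. The event names and structure below are illustrative assumptions, not Spark's actual classes: producers post events to a queue, and a single loop thread drains and handles them one at a time.

```python
import queue
import threading

# Toy model of an event-queue architecture (assumed names, not Spark's API):
# a single event-loop thread drains a queue of scheduler events in order.
events = queue.Queue()
handled = []

def event_loop():
    while True:
        event = events.get()
        if event is None:       # sentinel value: shut the loop down
            break
        handled.append(event)   # a real scheduler would dispatch on type

t = threading.Thread(target=event_loop)
t.start()

# Any thread may post events; the loop processes them sequentially.
for e in ["JobSubmitted", "StageSubmitted", "StageCompleted"]:
    events.put(e)
events.put(None)
t.join()
print(handled)  # -> ['JobSubmitted', 'StageSubmitted', 'StageCompleted']
```

Serializing all scheduling decisions through one queue-draining thread is what lets the DAGScheduler avoid locking its internal bookkeeping structures.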
