1. What is the working of DAG in Spark?

Answer:

DAG stands for Directed Acyclic Graph, a graph with a finite set of vertices and edges. The vertices represent RDDs and the edges represent the operations to be performed on those RDDs in sequence. The DAG that is built is submitted to the DAG Scheduler, which splits the graph into stages of tasks based on the transformations applied to the data. In the Spark UI, the stage view shows the details of the RDDs belonging to that stage.
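A minimal sketch of how such a graph is built up, assuming a Spark shell with the usual sc SparkContext and a hypothetical input file input.txt:

  // Each transformation adds a vertex (the new RDD) and an edge (the operation)
  // to the DAG; reduceByKey is a wide transformation that requires a shuffle.
  val lines  = sc.textFile("input.txt")        // RDD[String]
  val words  = lines.flatMap(_.split(" "))     // narrow transformation
  val pairs  = words.map(word => (word, 1))    // narrow transformation
  val counts = pairs.reduceByKey(_ + _)        // wide transformation (shuffle)

  // Nothing runs yet; the action below submits the DAG to the DAG Scheduler,
  // which splits it into two stages at the shuffle boundary.
  counts.collect()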

The working of the DAG in Spark proceeds through the following steps:

  • The first step is to interpret the code with the help of an interpreter; if you write Scala code, the Scala interpreter (the Spark shell) interprets it.
  • Spark then creates an operator graph as the code is entered in the Spark console.
  • When an action is called on a Spark RDD, the operator graph is submitted to the DAG Scheduler.
  • The DAG Scheduler divides the operators into stages of tasks. A stage consists of the detailed step-by-step operations on the input data, and operators that can run without a shuffle are pipelined together within a stage (see the sketch after this list).
  • The stages are then passed to the Task Scheduler, which launches the tasks via the cluster manager and works on them independently, without knowledge of the dependencies between stages.
  • The worker nodes then execute the tasks.
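As a rough illustration of this pipelining, toDebugString prints an RDD's lineage with an indentation step at every shuffle boundary, which is exactly where the DAG Scheduler cuts a new stage (this reuses the hypothetical counts RDD from the sketch above; the exact output format varies across Spark versions):

  // flatMap and map are pipelined into the same stage; reduceByKey starts a
  // new stage because it requires shuffling data across partitions.
  println(counts.toDebugString)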

Each RDD keeps a pointer to one or more parent RDDs, along with metadata about its relationship to each parent. For example, given the operation val childB = parentA.map(...) on an RDD, the resulting RDD childB keeps track of its parent parentA; this chain of parent references is called the RDD lineage.
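A minimal sketch of this lineage tracking, where parentA is just a hypothetical RDD built with parallelize:

  val parentA = sc.parallelize(1 to 10)
  val childB  = parentA.map(_ * 2)

  // childB records a narrow (one-to-one) dependency that points back at
  // parentA; this chain of parent pointers is the lineage Spark replays
  // to recompute lost partitions.
  println(childB.dependencies.head.rdd == parentA)   // expected: true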


