Interview Solution
1. What is checkpointing in Spark? How does it help Spark achieve exactly-once semantics? How does checkpointing differ from persistence?
Answer» Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to a fault-tolerant file system such as HDFS. By default, Spark keeps a history of every transformation applied to a DataFrame or RDD. While this lineage is what makes Spark fault-tolerant, it also carries a performance cost: if a fault occurs during application execution, the entire chain of transformations on the RDD/DataFrame must be recomputed. Checkpointing avoids this by truncating the lineage graph and saving the data to HDFS; from that point on, Spark only tracks the transformations applied after the checkpoint.

Checkpointing is also how Spark achieves its exactly-once, fault-tolerant guarantee. Checkpoints and write-ahead logs record the offset range of the data processed in each trigger, so in case of a failure the data can be replayed from the checkpointed offsets.

Persisting an RDD stores it in memory or on disk, but Spark still remembers the RDD's lineage even though it no longer needs to recompute it, and the cache is cleared once the job run is complete. With checkpointing, the RDD is written to HDFS and its lineage is discarded; the checkpoint files are not deleted when the job run is complete.
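To make the contrast concrete, here is a minimal Scala sketch of persisting and checkpointing the same RDD. The local master, checkpoint directory path, and dataset are placeholder assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CheckpointVsPersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-vs-persist")
      .master("local[*]")                       // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint files go to a fault-tolerant store; this path is a placeholder.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val base = sc.parallelize(1 to 1000000)
    val transformed = base.map(_ * 2).filter(_ % 3 == 0)

    // Persistence: keeps the computed data in memory/disk, lineage is retained,
    // and the cache is dropped when the application ends.
    transformed.persist(StorageLevel.MEMORY_AND_DISK)

    // Checkpointing: marks the RDD to be materialized in the checkpoint directory
    // and its lineage truncated; the checkpoint files outlive the application.
    transformed.checkpoint()

    transformed.count()                         // action triggers both the cache and the checkpoint

    println(transformed.toDebugString)          // lineage now ends at the checkpoint
    spark.stop()
  }
}
```

Persisting before checkpointing, as in the sketch, is a common pattern: since the checkpoint is written when an action runs, caching the RDD first avoids recomputing it a second time just to write the checkpoint files.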