1. What is checkpointing in Spark Streaming and when should you enable it?

Answer:

A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system that it can recover from failures. There are two types of data that are checkpointed:

  1. Metadata checkpointing - Saving the information that defines the streaming computation to fault-tolerant storage such as HDFS. This is used to recover from a failure of the node running the driver of the streaming application (discussed below). Metadata includes:
    • Configuration - The configuration that was used to create the streaming application.
    • DStream operations - The set of DStream operations that define the streaming application.
    • Incomplete batches - Batches whose jobs are queued but have not completed yet.
  2. Data checkpointing - Saving the generated RDDs to reliable storage. This is necessary for stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on the RDDs of previous batches, which causes the length of the dependency chain to keep growing over time. To avoid an unbounded increase in recovery time (which is proportional to the length of the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage such as HDFS, cutting off the dependency chains.
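The mechanism above can be sketched in a minimal Scala application. This is an illustrative example, not taken from the original text: the checkpoint path, socket source, and batch interval are placeholder assumptions, but `ssc.checkpoint` and `updateStateByKey` are the standard DStream API calls involved.

```scala
// Sketch: a stateful word count whose per-key state spans batches.
// Because updateStateByKey is used, a checkpoint directory is required;
// Spark periodically saves the state RDDs there to cut the lineage chain.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical path; any fault-tolerant storage (e.g. HDFS) works.
    ssc.checkpoint("hdfs:///tmp/checkpoint-dir")

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        // Running total per word, carried across batches via checkpointed state.
        Some(newValues.sum + state.getOrElse(0))
      }
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Without the `ssc.checkpoint(...)` call, this application would fail at startup, since each batch's state RDD would otherwise depend on every previous batch's RDD.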

To summarize, metadata checkpointing is primarily needed for recovery from driver failures, whereas data (RDD) checkpointing is necessary even for basic functioning if stateful transformations are used.

When to enable checkpointing:

Checkpointing must be enabled for applications with any of the following requirements:

  1. Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
  2. Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
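For requirement 2, the standard pattern is to create the StreamingContext through `StreamingContext.getOrCreate`, so that a restarted driver rebuilds its state from the checkpoint instead of starting fresh. A hedged sketch, in which the checkpoint path and the body of `createContext` are illustrative assumptions:

```scala
// Driver-recovery pattern: on first launch createContext() runs and a new
// context is built; after a driver failure, the context (configuration,
// DStream operations, incomplete batches) is reconstructed from the
// checkpointed metadata and createContext() is not called.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs:///tmp/checkpoint-dir" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define DStream sources and operations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that all DStream operations must be defined inside `createContext`; operations added after `getOrCreate` returns would not be recovered from the checkpoint.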

Note that simple streaming applications without stateful transformations can be run without enabling checkpointing. Recovery from driver failures will be partial in that case (some received but unprocessed data may be lost). This is often acceptable, and many Spark Streaming applications are run this way. Support for non-Hadoop environments is expected to improve in the future.
