InterviewSolution
1. What is checkpointing in Spark Streaming and when should you enable it?
Answer» A streaming application must run 24/7, so it has to be resilient to failures unrelated to the application logic, such as system failures and JVM crashes. To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system that it can recover from failures. Two types of data are checkpointed: metadata (the configuration used to create the application, the DStream operations that define it, and batches that are queued but not yet complete) and data (the generated RDDs, saved to reliable storage).
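To summarize, metadata checkpointing is primarily needed for recovery from driver failures, whereas data (RDD) checkpointing is necessary even for basic functioning if stateful transformations are used. When to enable checkpointing: checkpointing must be enabled for applications with either of the following requirements: (1) use of stateful transformations, such as updateStateByKey or reduceByKeyAndWindow with an inverse function; (2) recovery from failures of the driver running the application.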
Note that simple streaming applications without stateful transformations can run without checkpointing enabled. Recovery from driver failures will then be only partial (some received but unprocessed data may be lost). This is often acceptable, and many Spark Streaming applications are run this way. Support for non-Hadoop environments is expected to improve in the future.
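The two cases that require checkpointing (a stateful transformation and driver recovery) can be sketched together in PySpark. This is a minimal illustration, not a production setup: the checkpoint path, host, port, application name, and batch interval are all assumed values, and the function names `update_count`, `create_context`, and `run` are made up for this example.

```python
# Hypothetical checkpoint location; any fault-tolerant store (e.g. HDFS) would do.
CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"


def update_count(new_values, running_count):
    """State update for updateStateByKey: add this batch's counts to the running total."""
    return sum(new_values) + (running_count or 0)


def create_context():
    """Build a fresh StreamingContext; invoked only when no checkpoint data exists."""
    # Imported lazily so update_count can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CheckpointedWordCount")  # assumed app name
    ssc = StreamingContext(sc, 10)  # assumed 10-second batch interval
    ssc.checkpoint(CHECKPOINT_DIR)  # set the checkpoint directory

    lines = ssc.socketTextStream("localhost", 9999)  # assumed text source
    counts = (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .updateStateByKey(update_count)  # stateful, so checkpointing is required
    )
    counts.pprint()
    return ssc


def run():
    from pyspark.streaming import StreamingContext

    # Rebuild the context from checkpoint data if present, else create a new one.
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

Calling `run()` from a script submitted with `spark-submit` starts the job. After a driver restart, `StreamingContext.getOrCreate` restores the context from the checkpoint directory instead of calling `create_context` again, which is why all of the DStream setup lives inside that function.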