Interview Solution
1. What is persistence and caching in Spark and why do we need them?
Answer» The cache() and persist() methods are optimization techniques in Spark that save the result of an RDD evaluation. By caching or persisting an RDD we keep its intermediate results so they can be reused later if required. Both methods keep RDD data available for reuse, but they differ in where the data is stored: cache() always stores the RDD in memory, whereas persist() can use various storage levels to keep the RDD in memory, on disk, or both. By default, persist() uses MEMORY_ONLY, which makes it equivalent to cache(). Below are the various levels of persist():

- MEMORY_ONLY (the default; same as cache())
- MEMORY_AND_DISK (spills partitions that do not fit in memory to disk)
- MEMORY_ONLY_SER (stores serialized objects in memory, saving space at the cost of CPU)
- MEMORY_AND_DISK_SER
- DISK_ONLY
- OFF_HEAP

Each level also has a replicated variant (e.g. MEMORY_ONLY_2) that stores every partition on two cluster nodes.
Need for persistence: In Spark, we often use the same RDD multiple times. Without persistence, the RDD is re-evaluated from scratch on every use, which costs time and memory, especially in iterative algorithms that scan the same data repeatedly. Persistence solves this problem of repeated computation: the RDD is computed once, stored, and then reused from the cache on subsequent actions.
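The difference between cache() and persist() can be sketched in Scala as follows. This is a minimal illustration, not a production job: the object name PersistExample and the local master URL are assumptions for running it standalone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    // Assumed: a local Spark session for demonstration purposes.
    val spark = SparkSession.builder()
      .appName("PersistExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD we intend to reuse across several actions.
    val squares = sc.parallelize(1 to 1000000).map(n => n.toLong * n)

    // cache() stores the RDD in memory only (equivalent to MEMORY_ONLY).
    squares.cache()

    // persist() accepts an explicit storage level; MEMORY_AND_DISK
    // spills partitions to disk when they do not fit in memory.
    val evens = squares.filter(_ % 2 == 0)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes the RDD and fills the cache ...
    println(evens.count())
    // ... subsequent actions reuse the cached partitions
    // instead of recomputing the whole lineage.
    println(evens.take(5).mkString(", "))

    // Release the cached data once it is no longer needed.
    evens.unpersist()
    spark.stop()
  }
}
```

Without the persist() call, both count() and take() would trigger a full recomputation of the parallelize–map–filter lineage, which is exactly the repeated work that persistence avoids.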