1.

What is persistence and caching in Spark and why do we need them?

Answer»

The cache() and persist() methods are optimization techniques in Spark that save the result of an RDD evaluation. By caching or persisting, we keep intermediate results so that they can be reused later without recomputation.

We can persist an RDD (either in memory or on disk) using the cache() and persist() methods.

Calling cache() on an RDD stores all of its data in memory.

The persist() method also saves an RDD in memory, but with a difference: cache() always stores the RDD in the cluster's memory, whereas persist() can use various storage levels. By default, persist() uses MEMORY_ONLY, which makes it equivalent to cache().

Below are the various storage levels supported by persist().

  • MEMORY_ONLY – Stores the RDD in memory. If the RDD does not fit, some partitions will not be cached and will be recomputed on the fly each time they are needed. This is the default level.
  • MEMORY_AND_DISK – Stores the RDD in memory, spilling partitions that do not fit to disk and reading them from there when needed.
  • MEMORY_ONLY_SER – Stores the RDD in memory as serialized Java objects. This is more space-efficient than storing deserialized objects, especially with a fast serializer, but more CPU-intensive to read.
  • MEMORY_AND_DISK_SER – Like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed.
  • DISK_ONLY – Stores the RDD partitions only on disk.
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2 – Same as the corresponding levels above, but each partition is replicated on two cluster nodes.
  • OFF_HEAP – Like MEMORY_ONLY_SER, but stores the data in off-heap memory. This requires off-heap memory to be enabled.

Need for persistence:

In Spark, we often use the same RDD multiple times. Without persistence, the RDD is re-evaluated from its lineage on every use, which costs time and resources, especially in iterative algorithms that scan the same data many times. Persistence solves this problem of repeated computation.
