1.

What is caching or persistence in Spark? How are the two different from each other? What are the various storage levels for persisting RDDs?

Answer:

Caching or persistence is an optimization technique that saves an RDD's computed results to memory or disk so they can be reused.

An RDD can feed into multiple transformations and actions. Without caching, every action triggers re-evaluation of the same RDD from its lineage, which wastes both time and compute resources. This recomputation is easily avoided by caching or persisting the RDD. The difference between cache() and persist() in Spark is that the former always uses the MEMORY_ONLY storage level, while the latter lets you choose from a host of other storage levels.
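As a minimal sketch (assuming a Scala Spark application with an existing SparkContext named sc; the file path is hypothetical), caching an RDD before reusing it across actions avoids recomputation:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the input path is hypothetical.
val lines  = sc.textFile("hdfs:///data/app.log")
val errors = lines.filter(_.contains("ERROR"))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
errors.cache()

// persist() accepts an explicit storage level instead:
// errors.persist(StorageLevel.MEMORY_AND_DISK)

// Both actions below reuse the cached partitions rather than
// re-reading and re-filtering the input file.
val total  = errors.count()
val sample = errors.take(10)
```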

There are five primary storage levels in Spark (a usage sketch follows the list):

  • MEMORY_ONLY – The RDD is stored as deserialized objects in the JVM. If the RDD does not fit in memory, only the partitions that fit are cached; the remaining partitions are recomputed whenever they are needed.
  • MEMORY_AND_DISK – The RDD is stored as deserialized objects in the JVM. Partitions that do not fit in memory are spilled to disk and fetched from there whenever needed.
  • MEMORY_ONLY_SER – The RDD is stored as serialized objects in the JVM. This is more memory-efficient than storing deserialized objects, but it increases CPU overhead because partitions must be deserialized on access.
  • MEMORY_AND_DISK_SER – The RDD is stored as serialized objects in the JVM. Partitions that do not fit in memory are spilled to disk, where they are also kept in serialized form.
  • DISK_ONLY – The RDD is stored only on disk, not on the JVM heap. This option is useful when heap memory is scarce, but it increases access time considerably since every read involves disk I/O and deserialization.
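As a brief sketch of how these levels are selected in practice (the RDD and its contents here are hypothetical), each level is passed to persist() via the StorageLevel object, and unpersist() releases the space once the RDD is no longer needed:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD used only for illustration
val nums = sc.parallelize(1 to 1000000)

// Serialized in memory; partitions that don't fit spill to disk
nums.persist(StorageLevel.MEMORY_AND_DISK_SER)
println(nums.sum())   // first action materializes the cache
println(nums.count()) // subsequent actions reuse the cached partitions

// A storage level cannot be changed while one is set; release it first
nums.unpersist()
nums.persist(StorageLevel.DISK_ONLY)
```

Calling unpersist() explicitly is good practice once a cached RDD is no longer needed, since Spark otherwise evicts cached blocks only under memory pressure, in LRU order.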

