Answer» Spark automatically persists the intermediate data of some shuffle operations, but it is still recommended to call the persist() method on any RDD that will be reused. There are different persistence levels for storing RDDs in memory, on disk, or both, with different levels of replication. The persistence levels available in Spark are:
- MEMORY_ONLY: This is the default persistence level. The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are not cached and are recomputed as and when needed.
- MEMORY_AND_DISK: The RDD is again stored as deserialized Java objects in the JVM. If memory is insufficient, the partitions that do not fit are stored on disk and read from there as and when needed.
- MEMORY_ONLY_SER: The RDD is stored as serialized Java objects, with one byte array per partition. This is more space-efficient than the deserialized levels but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: This level is similar to MEMORY_ONLY_SER, except that partitions not fitting in memory are saved to disk instead of being recomputed on the fly.
- DISK_ONLY: The RDD partitions are stored only on the disk.
- OFF_HEAP: This level is the same as MEMORY_ONLY_SER, except that the data is stored in off-heap memory.
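As a concrete illustration of the levels above, here is a minimal PySpark sketch; the RDD name nums and the toy data are assumptions made for this example:

```python
# Minimal sketch: persisting an RDD at an explicit storage level.
# The names and data here are illustrative, not part of the original answer.
from pyspark import SparkContext
from pyspark.storagelevel import StorageLevel

sc = SparkContext.getOrCreate()

nums = sc.parallelize(range(1_000_000))

# Keep partitions in memory, spilling to disk when memory runs short.
nums.persist(StorageLevel.MEMORY_AND_DISK)

print(nums.count())  # the first action computes and caches the partitions
print(nums.sum())    # later actions reuse the cached partitions

nums.unpersist()     # release the storage once the RDD is no longer needed
```

One caveat worth noting: PySpark always stores persisted data in serialized (pickled) form, so the deserialized-versus-serialized distinction above applies to the Scala/Java API.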
The syntax for using a persistence level with the persist() method is: df.persist(StorageLevel.<level_value>)
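For example, on a DataFrame (a hedged sketch; df and the app name are assumptions for illustration):

```python
# Sketch of df.persist(StorageLevel.<level_value>) on a DataFrame.
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

df = spark.range(1_000_000)          # toy DataFrame standing in for real data

df.persist(StorageLevel.DISK_ONLY)   # substitute any level from the table below
df.count()                           # an action triggers the actual caching

print(df.storageLevel)               # confirm the level in effect
df.unpersist()
```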
The following table summarizes the details of the persistence levels:

| Persistence Level | Space Consumed | CPU Time | In-memory? | On-disk? |
|---|---|---|---|---|
| MEMORY_ONLY | High | Low | Yes | No |
| MEMORY_ONLY_SER | Low | High | Yes | No |
| MEMORY_AND_DISK | High | Medium | Some | Some |
| MEMORY_AND_DISK_SER | Low | High | Some | Some |
| DISK_ONLY | Low | High | No | Yes |
| OFF_HEAP | Low | High | Yes (off-heap) | No |