Answer» Spark automatically persists the intermediate data of some shuffle operations, but it is still recommended to call the persist() method on any RDD that will be reused. There are different persistence levels for storing RDDs in memory, on disk, or both, with different levels of replication. The persistence levels available in Spark are:
- MEMORY_ONLY: This is the default persistence level. The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are not cached and are recomputed as and when needed.
- MEMORY_AND_DISK: The RDD is again stored as deserialized Java objects in the JVM. If memory is insufficient, the partitions that do not fit are stored on disk and read from there as and when needed.
- MEMORY_ONLY_SER: The RDD is stored as serialized Java objects, with one byte array per partition. This is more space-efficient than the deserialized levels but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: This level is similar to MEMORY_ONLY_SER, except that partitions not fitting in memory are saved to disk instead of being recomputed on the fly.
- DISK_ONLY: The RDD partitions are stored only on the disk.
- OFF_HEAP: This level is the same as MEMORY_ONLY_SER, except that the data is stored in off-heap memory.
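As a concrete illustration of the levels above, here is a minimal PySpark sketch; the RDD name nums and the toy data are assumptions made for this example:

```python
# Minimal sketch: persisting an RDD at an explicit storage level.
# The names and data here are illustrative, not part of the original answer.
from pyspark import SparkContext
from pyspark.storagelevel import StorageLevel

sc = SparkContext.getOrCreate()

nums = sc.parallelize(range(1_000_000))

# Keep partitions in memory, spilling to disk when memory runs short.
nums.persist(StorageLevel.MEMORY_AND_DISK)

print(nums.count())  # the first action computes and caches the partitions
print(nums.sum())    # later actions reuse the cached partitions

nums.unpersist()     # release the storage once the RDD is no longer needed
```

One caveat worth noting: PySpark always stores persisted data in serialized (pickled) form, so the deserialized-versus-serialized distinction above applies to the Scala/Java API.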
The syntax for using a persistence level with the persist() method is: df.persist(StorageLevel.<level_value>)
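For example, on a DataFrame (a hedged sketch; df and the app name are assumptions for illustration):

```python
# Sketch of df.persist(StorageLevel.<level_value>) on a DataFrame.
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

df = spark.range(1_000_000)          # toy DataFrame standing in for real data

df.persist(StorageLevel.DISK_ONLY)   # substitute any level from the table below
df.count()                           # an action triggers the actual caching

print(df.storageLevel)               # confirm the level in effect
df.unpersist()
```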
The following table summarizes the details of the persistence levels:

| Persistence Level | Space Consumed | CPU Time | In-memory? | On-disk? |
|---|---|---|---|---|
| MEMORY_ONLY | High | Low | Yes | No |
| MEMORY_ONLY_SER | Low | High | Yes | No |
| MEMORY_AND_DISK | High | Medium | Some | Some |
| MEMORY_AND_DISK_SER | Low | High | Some | Some |
| DISK_ONLY | Low | High | No | Yes |
| OFF_HEAP | Low | High | Yes (off-heap) | No |