1.

What is coalesce in Spark? How is it different from repartition?

Answer»

Coalesce in Spark provides a way to reduce the number of partitions in an RDD or DataFrame. It works on existing partitions instead of creating new ones, thereby reducing the amount of data that is shuffled.

A good use case for coalesce is after data in an RDD has been filtered. Filtering may leave some of the partitions empty or with very little data, so reducing the number of partitions with coalesce helps optimize any further operations on the RDD. Note that coalesce cannot be used to increase the number of partitions.

Consider the following example:

The data has been read from a CSV file into an RDD with four partitions:

  • Partition A: 11, 12
  • Partition B: 30, 40, 50
  • Partition C: 6, 7, 80
  • Partition D: 9, 10

A filter operation is applied on the RDD to remove all multiples of 10. The resultant RDD will look like this:

  • Partition A: 11, 12
  • Partition B: -
  • Partition C: 6, 7
  • Partition D: 9

As can be seen, the RDD now has some empty partitions and others with very little data, so it makes sense to reduce the number of partitions. Applying coalesce(2) produces the following RDD:

  • Partition A: 11, 12
  • Partition C: 6, 7, 9
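The same flow can be sketched with the RDD API. This is a minimal, illustrative example: the input is built with parallelize as a stand-in for the CSV read, and the four-way parallelism is an assumption to mirror the partitions above.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CoalesceExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the CSV read: an RDD with 4 partitions
    val rdd = sc.parallelize(Seq(11, 12, 30, 40, 50, 6, 7, 80, 9, 10), numSlices = 4)

    // Remove all multiples of 10; some partitions may end up empty or sparse
    val filtered = rdd.filter(_ % 10 != 0)

    // Merge the sparse partitions down to 2 without a full shuffle
    val coalesced = filtered.coalesce(2)

    println(s"Partitions after filter:   ${filtered.getNumPartitions}")   // 4
    println(s"Partitions after coalesce: ${coalesced.getNumPartitions}")  // 2

    spark.stop()
  }
}
```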

Repartition, on the other hand, can be used to either increase or decrease the number of partitions of an RDD. Repartition performs a full shuffle of the data and creates new partitions, which makes it an expensive operation.
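A short sketch of the difference, continuing from the filtered RDD above (the target partition counts are illustrative):

```scala
// coalesce can only merge partitions; asking for more has no effect
val stillFour = filtered.coalesce(8)
println(stillFour.getNumPartitions)   // 4

// repartition performs a full shuffle and can go either way
val eight = filtered.repartition(8)
println(eight.getNumPartitions)       // 8

val two = filtered.repartition(2)
println(two.getNumPartitions)         // 2
```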


