1.

What is shuffling in Spark? When does it occur? What are the various ways by which shuffling of data can be minimized?

Answer:

Shuffling is the process of redistributing data across partitions, which may require moving data between executors and hence across the network.

By default, shuffling doesn't change the number of partitions, only which data each partition holds. Many operations require a shuffle, for instance a join between two tables or byKey operations such as groupByKey and reduceByKey.
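As a quick illustration, here is a minimal Scala sketch (the application name, object name, and sample data are made up for this example) showing a byKey operation that triggers a shuffle while typically preserving the upstream partition count unless an explicit count is requested:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-demo").setMaster("local[*]"))

    // Four input partitions of (key, value) pairs.
    val pairs = sc.parallelize(
      Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), numSlices = 4)

    // reduceByKey triggers a shuffle; with no explicit count it typically
    // keeps the upstream 4 partitions (unless spark.default.parallelism
    // is set and overrides that).
    println(pairs.reduceByKey(_ + _).getNumPartitions)

    // Passing an explicit partition count changes the shuffled output layout.
    println(pairs.reduceByKey(_ + _, 8).getNumPartitions)   // 8

    sc.stop()
  }
}
```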

Shuffling is a costly operation because it moves data between executors, so care must be taken to minimize it. One way is to prefer optimized grouping operations, such as reduceByKey instead of groupByKey. While groupByKey shuffles all the raw records, reduceByKey first aggregates the values within each partition and shuffles only those partial results, which makes it considerably cheaper.
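The following sketch contrasts the two approaches for a word count (assuming a spark-shell session where `sc` is already defined; the sample words are made up). Both produce the same per-key sums, but reduceByKey combines values inside each partition first, so far less data crosses the network:

```scala
val words = sc.parallelize(Seq("spark", "shuffle", "spark", "join", "spark"))
val pairs = words.map(word => (word, 1))

// groupByKey ships every (word, 1) record across the network, then sums.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums locally per partition (map-side combine) before shuffling
// only one partial sum per key per partition.
val countsViaReduce = pairs.reduceByKey(_ + _)

countsViaReduce.collect().foreach(println)   // e.g. (spark,3), (shuffle,1), (join,1)
```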

When joining two tables, opt to use the same partitioner on both. This stores values with the same key in the same partition, so Spark does not have to scan the entire second table for every partition of the first table, which reduces the amount of data shuffled.
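A minimal sketch of this idea (again assuming `sc` from spark-shell; the table contents and the choice of 8 partitions are illustrative) pre-partitions both pair RDDs with the same HashPartitioner so that matching keys are already co-located when the join runs:

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)

val orders = sc.parallelize(Seq((1, "order-A"), (2, "order-B")))
  .partitionBy(partitioner)
  .cache()   // keep the partitioned layout around for reuse

val customers = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))
  .partitionBy(partitioner)
  .cache()

// Because both RDDs share the same partitioner, the join can use the
// existing partitioning instead of triggering another full shuffle.
val joined = orders.join(customers)
joined.collect().foreach(println)   // (1,(order-A,Alice)), (2,(order-B,Bob))
```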

Another optimization is to use a broadcast join when joining a large table with a much smaller one. The smaller table's data is broadcast to all executors, so the large table does not need to be shuffled across the cluster.
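In the DataFrame API this can be expressed with the broadcast hint, sketched below (assuming a spark-shell session where `spark` and its implicits are in scope; the column names and sample rows are made up):

```scala
import org.apache.spark.sql.functions.broadcast
import spark.implicits._   // already imported in spark-shell

// Made-up data standing in for a large fact table and a small dimension table.
val transactions = Seq((1, "US", 100.0), (2, "IN", 42.5), (3, "US", 7.0))
  .toDF("id", "country_code", "amount")
val countries = Seq(("US", "United States"), ("IN", "India"))
  .toDF("country_code", "country_name")

// broadcast() hints Spark to ship the small table to every executor,
// so the large table is joined in place instead of being shuffled.
val enriched = transactions.join(broadcast(countries), Seq("country_code"))
enriched.show()
```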


