1. When submitting a Spark job, I notice that a few tasks take noticeably longer to complete than the rest. What could be the cause of this, and how can I resolve the issue?
Answer» When a Spark job is submitted, each worker node launches an executor. The data is read from the source into RDDs or DataFrames, which can be thought of as large distributed arrays split into multiple partitions. Each executor can launch one or more tasks, with each task mapping to a partition, thereby increasing parallelism. However, if the data is skewed, i.e. some partitions contain much more data than others, the tasks operating on the larger partitions take much longer to complete than those operating on the smaller ones.

Data skew can arise for multiple reasons. For example, suppose the source contains user data for various countries and is partitioned by country: the partition for a country with a large population will hold far more data than the others, producing skew. A better way to handle this situation is to partition the data on a key that spreads it more evenly.

Another way to handle the problem is to repartition. Spark's repartition performs a full shuffle of the data in the RDD or DataFrame and creates new partitions with the data distributed evenly, so the tasks operating on them take roughly equal time to process. Keep in mind that repartitioning is a fairly expensive operation, since it shuffles all of the data across the cluster.

Yet another option is to cache the RDD or DataFrame before heavy operations, since caching avoids recomputing or re-reading the data on every action and can improve performance considerably.
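As a minimal PySpark sketch of the ideas above (the `/data/events` path, the `events` DataFrame, and its `user_id`/`country` columns are hypothetical), you can inspect partition sizes to confirm skew, then repartition on a higher-cardinality key and cache before repeated heavy operations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

# Hypothetical input: user events keyed by country (a skew-prone key).
events = spark.read.parquet("/data/events")  # path is illustrative

# Inspect skew: count rows per partition; a few very large counts
# relative to the rest indicate skewed partitions.
partition_sizes = events.rdd.glom().map(len).collect()
print("partition sizes:", partition_sizes)

# Mitigation 1: repartition on a higher-cardinality key. This triggers
# a full shuffle (expensive) but spreads rows evenly across partitions.
balanced = events.repartition(200, "user_id")

# Mitigation 2: cache the balanced data so that subsequent actions
# reuse it instead of recomputing the shuffle each time.
balanced.cache()

balanced.groupBy("country").count().show()
```

The per-partition count via `glom()` is just a quick diagnostic; on very large datasets you would sample or rely on the Spark UI's task-duration view instead.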