1.	What is a broadcast variable in Spark? What purpose does it serve? How is it different from an accumulator?

Answer»

A broadcast variable keeps a read-only value cached on each executor after it is sent from the driver program. This makes it efficient to share large data sets that workers use as reference data. If a regular variable were used instead, it would be serialized and shipped to the executors with the tasks of every transformation and action that references it. A common use case for a broadcast variable is to store and share a lookup table for a join operation.
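A minimal PySpark sketch of the lookup-table use case, assuming a local Spark installation. The table contents, the order records, and the names `COUNTRY_NAMES`, `add_country_name`, and `run_broadcast_lookup` are hypothetical and chosen only for illustration. The small dictionary is broadcast once, and each task reads it through `.value` instead of receiving its own serialized copy.

```python
# Hypothetical reference data: a small lookup table that lives on the driver.
COUNTRY_NAMES = {"IN": "India", "US": "United States", "DE": "Germany"}


def add_country_name(order, lookup):
    """Map-side join of one (order_id, country_code) record against a plain dict."""
    order_id, code = order
    return (order_id, lookup.get(code, "unknown"))


def run_broadcast_lookup():
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("broadcast-sketch")
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Send the table to the executors once; it is read-only from here on.
    lookup_bc = sc.broadcast(COUNTRY_NAMES)

    orders = sc.parallelize([(1, "IN"), (2, "US"), (3, "FR")])
    # Each task reads the cached copy through .value.
    joined = orders.map(lambda o: add_country_name(o, lookup_bc.value)).collect()

    lookup_bc.unpersist()  # release the cached copies on the executors
    spark.stop()
    return joined
```

With PySpark available, calling `run_broadcast_lookup()` should return the three orders paired with a country name, with the unmatched code `FR` mapped to `"unknown"`.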

When Spark ships a regular variable to executors, each executor works on its own local copy, and updates to that copy are not relayed back to the driver. Accumulators are a special type of variable whose updates on executor nodes are relayed back to the driver. They support only operations that are both associative and commutative, such as sums and counts. A common use case is counting events while analyzing transaction logs. However, the following points need to be considered when using accumulators:

Accumulators used inside transformations are not updated until an action is called, because transformations are evaluated lazily.
If a task is restarted or part of the DAG is recomputed, accumulators inside transformations might be updated more than once.

To be on the safe side, accumulators should be updated inside actions only, where Spark guarantees that each task's update is applied exactly once.
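The advice above can be sketched as follows, again assuming PySpark in local mode. The validity rule and the names `is_malformed` and `count_malformed` are hypothetical. The accumulator is updated inside `foreach`, which is an action, and its value is read on the driver afterwards.

```python
def is_malformed(line):
    """Hypothetical rule: a valid transaction line has exactly three comma-separated fields."""
    return len(line.split(",")) != 3


def count_malformed(lines):
    # Imported here so is_malformed can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("accumulator-sketch")
        .getOrCreate()
    )
    sc = spark.sparkContext

    bad_lines = sc.accumulator(0)  # integer accumulator starting at zero

    def check(line):
        if is_malformed(line):
            bad_lines.add(1)  # tasks may only add to it; they cannot read it

    # foreach is an action, so the updates are applied when this line runs.
    sc.parallelize(lines).foreach(check)

    total = bad_lines.value  # the value is read on the driver
    spark.stop()
    return total
```

For example, `count_malformed(["t1,100,USD", "broken", "t2,50,EUR"])` would be expected to return 1, since only the second line fails the three-field rule.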
