Interview Solution
1. What are the shared variables in Spark?
Answer» In addition to the RDD abstraction, the second kind of low-level API in Spark is shared variables. Spark has two types of distributed shared variables:
These variables can be used in user-defined functions (UDFs).

Broadcast variables: Broadcast variables let you share an immutable value efficiently around the cluster without encapsulating that variable in a function closure. The normal way to use a variable from the driver node inside your tasks is simply to reference it in your function closures (e.g., in a map operation), but this can be inefficient, especially for large variables such as a lookup table or a machine learning model. The reason is that when you use a variable in a closure, it must be deserialized on the worker nodes many times. Moreover, if you use the same variable in multiple Spark actions and jobs, it is re-sent to the workers with every job instead of once. This is where broadcast variables come in. Broadcast variables are shared, immutable variables that are cached on every machine in the cluster instead of being serialized with every single task. The canonical use case is to pass around a large lookup table that fits in memory on the executors and use it inside a function.

Accumulators: Spark's second type of shared variable is a way of updating a value inside a variety of transformations and propagating that value to the driver node in an efficient and fault-tolerant way. Accumulators provide a mutable variable that a Spark cluster can safely update on a per-row basis. We can use them for debugging or for low-level aggregation, for example to implement counters or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator is applied exactly once, meaning that restarted tasks will not update the value again. In transformations, each task's update can be applied more than once if tasks or job stages are re-executed. Accumulators do not change the lazy evaluation model of Spark: accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). Accumulators can be either named or unnamed; named accumulators display their running results in the Spark UI, whereas unnamed ones do not.
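A minimal Scala sketch of both shared variable types, assuming a local Spark session; the names (countryLookup, missingCountryCodes, the sample country codes) are illustrative, not part of the original answer. It broadcasts a small lookup map once per executor and uses a long accumulator to count rows that miss the lookup.

```scala
import org.apache.spark.sql.SparkSession

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SharedVariablesSketch")
      .master("local[*]")          // assumption: local mode for the example
      .getOrCreate()
    val sc = spark.sparkContext

    // Broadcast variable: a small, immutable lookup table cached on every
    // executor instead of being serialized with each task closure.
    val countryLookup = Map("US" -> "United States", "IN" -> "India", "DE" -> "Germany")
    val lookupBc = sc.broadcast(countryLookup)

    // Named accumulator: executors only add to it; the driver reads the
    // total after an action, and it shows up in the Spark UI by name.
    val missingCodes = sc.longAccumulator("missingCountryCodes")

    val codes = sc.parallelize(Seq("US", "IN", "XX", "DE", "YY"))

    val resolved = codes.map { code =>
      lookupBc.value.get(code) match {
        case Some(name) => name
        case None =>
          // Update inside a transformation: may be applied more than once
          // if the task is re-executed, as noted above.
          missingCodes.add(1L)
          "unknown"
      }
    }

    // The action triggers execution; only after it completes is the
    // accumulator value guaranteed to reflect the updates.
    resolved.collect().foreach(println)
    println(s"Codes missing from the broadcast lookup: ${missingCodes.value}")

    spark.stop()
  }
}
```

Reading lookupBc.value on the executors avoids re-shipping the map with every task, while the accumulator flows the per-row counts back to the driver without a separate aggregation job.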