1. What are the types of PySpark’s shared variables and why are they useful?

Answer:

Whenever PySpark performs a transformation operation using filter(), map(), or reduce(), it runs on a remote node using copies of the variables that are shipped with the tasks. These variables are not reusable and cannot be shared across different tasks because they are never returned to the Driver. To solve this problem of reusability and sharing, PySpark provides shared variables. There are two types of shared variables:

Broadcast variables: These are also known as read-only shared variables and are used when data lookups are required. The variables are cached and made available on all the cluster nodes so that tasks can use them. Rather than being sent with every task, they are distributed to the nodes using efficient algorithms that reduce communication cost. When we run an RDD job that makes use of broadcast variables, PySpark does the following:

  • The job is broken into stages with distributed shuffling, and the actions are executed within those stages.
  • The stages are then broken into tasks.
  • The broadcast variables are broadcast to the tasks that need to use them.

Broadcast variables are created in PySpark using the broadcast(variable) method of the SparkContext class. The syntax is as follows:

broadcastVar = sc.broadcast([10, 11, 22, 31])
broadcastVar.value  # access the broadcast variable

An important point about broadcast variables is that they are not sent to the tasks when the broadcast() function is called; they are sent only when they are first required by the executors.
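
For example, a broadcast variable is commonly used as a lookup table inside a transformation. The following is a minimal sketch; the state-code dictionary and sample RDD contents are made-up illustration data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-lookup").getOrCreate()
sc = spark.sparkContext

# Hypothetical lookup data, cached on every node as a read-only copy
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = sc.broadcast(states)

rdd = sc.parallelize([("James", "NY"), ("Anna", "CA"), ("Robert", "FL")])

# Each task reads the cached copy on its node instead of shipping the dict with every task
result = rdd.map(lambda pair: (pair[0], broadcastStates.value[pair[1]])).collect()
print(result)  # [('James', 'New York'), ('Anna', 'California'), ('Robert', 'Florida')]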

Accumulator variables: These are called updatable shared variables. They are updated only through associative and commutative operations and are commonly used to implement counters or sums. PySpark supports numeric accumulators by default and also allows custom accumulator types to be added. Accumulators can be of two kinds:

  • Named Accumulators: These accumulators are visible under the “Accumulator” section of the PySpark web UI. There, the Accumulable table shows the sum of the values contributed by the tasks for each accumulator, and the per-task values are listed in the Accumulator column of the Tasks table.

  • Unnamed Accumulators: These accumulators are not shown in the PySpark web UI. It is therefore recommended to use named accumulators.

Accumulator variables can be created in PySpark using SparkContext.accumulator(initial_value), as shown in the example below:

ac = sc.accumulator(0)
sc.parallelize([2, 23, 1]).foreach(lambda x: ac.add(x))
ac.value  # 26, read back on the driver

Depending on the data type being accumulated - double, long, or collection - Spark provides DoubleAccumulator, LongAccumulator, and CollectionAccumulator respectively.
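
In the Python API, a custom accumulator type can be sketched by subclassing AccumulatorParam and implementing zero() and addInPlace(); the class name and sample values below are only illustrative:

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    # Collects values into a list, similar in spirit to a CollectionAccumulator
    def zero(self, initial_value):
        return []
    def addInPlace(self, v1, v2):
        return v1 + v2  # merge two partial lists

sc = SparkContext.getOrCreate()
collected = sc.accumulator([], ListAccumulatorParam())
sc.parallelize([1, 2, 3]).foreach(lambda x: collected.add([x]))
print(collected.value)  # e.g. [1, 2, 3] (element order is not guaranteed)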


