What are PySpark serializers?


Answer»

Serialization plays an important role in performance tuning on Spark. Any data that is sent over the network, written to disk, or persisted in memory must be serialized. PySpark provides serializers for this purpose. It supports two types of serializers:

PickleSerializer: This serializes objects using Python’s pickle module (class pyspark.PickleSerializer). It is the default serializer and supports almost every Python object.
MarshalSerializer: This serializes objects using Python’s marshal module (class pyspark.MarshalSerializer). It is faster than PickleSerializer, but it supports only a limited set of types.
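
The trade-off between the two can be sketched without Spark. The snippet below is only an illustration: it calls Python's standard pickle and marshal modules directly rather than the PySpark classes, and the helper can_serialize and the sample values are hypothetical names chosen for this example.

```python
import marshal
import pickle
from datetime import date


def can_serialize(dumps, obj):
    """Return True if dumps(obj) succeeds, False if the type is unsupported."""
    try:
        dumps(obj)
        return True
    except (ValueError, TypeError, pickle.PicklingError):
        return False


# Sample values (hypothetical, for illustration only).
samples = {
    "list of ints": [1, 2, 3],
    "dict of strings": {"a": "b"},
    "datetime.date": date(2024, 1, 1),
}

for label, obj in samples.items():
    print(
        label,
        "| pickle:", can_serialize(pickle.dumps, obj),
        "| marshal:", can_serialize(marshal.dumps, obj),
    )
```

Built-in containers of numbers and strings work with both modules, while the date object is accepted only by pickle. This is the kind of limit to keep in mind before switching a job to MarshalSerializer.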

Consider an example of serialization which makes use of MarshalSerializer:

# --serializing.py----
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer

# Initialize the Spark context with MarshalSerializer as the data serializer
sc = SparkContext("local", "Marshal Serialization", serializer=MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 3 * x).take(5))
sc.stop()

When we run the file using the command:

$SPARK_HOME/bin/spark-submit serializing.py

The output is a list of the first five numbers, each multiplied by 3:

[0, 3, 6, 9, 12]
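
To see why the serializer does not change the result, here is a rough plain-Python sketch of the same computation. It assumes nothing about Spark internals: it simply round-trips the input through the standard marshal module, which stands in for the point where data is moved between processes, and then applies the same transformation.

```python
import marshal

# Plain-Python stand-in for the RDD: the same 1000 integers.
data = list(range(1000))

# Serialize and deserialize the data, as a serializer would when
# moving it between processes.
payload = marshal.dumps(data)
restored = marshal.loads(payload)

# Apply the same transformation and keep the first five results.
result = [3 * x for x in restored][:5]
print(result)  # [0, 3, 6, 9, 12]
```

The round trip returns an equal list, so the choice of serializer affects speed and supported types, not the computed values.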
