1. What are RDDs in PySpark?

Answer:

RDD expands to Resilient Distributed Dataset, the fundamental element that PySpark uses to run and operate on multiple nodes for parallel processing on a cluster. RDDs are immutable elements: once an RDD is created, it cannot be modified, which is what makes it safe to process in parallel. RDDs are also fault-tolerant, which means that whenever a failure happens, they can be recovered automatically. Multiple operations can be performed on RDDs to accomplish a task. The operations are of 2 types:

  • Transformation: These operations, when applied on an RDD, result in the creation of a new RDD. filter, groupBy, and map are some examples of transformations. Transformations are evaluated lazily: they run only when an action needs their result. (A further sketch of how transformations preserve the original RDD appears after this list.)
    Let us take an example to demonstrate a transformation by considering the filter() operation:

from pyspark import SparkContext

sc = SparkContext("local", "Transformation Demo")
words_list = sc.parallelize(["pyspark", "interview", "questions", "at", "interviewbit"])
# filter() keeps only the elements for which the lambda returns True
filtered_words = words_list.filter(lambda x: 'interview' in x)
filtered = filtered_words.collect()
print(filtered)

The above code filters out all the elements in the list that contain 'interview'. The output of the above code would be:

['interview', 'interviewbit']
  • Action: These operations instruct Spark to perform a computation on the RDD and return the result to the driver; they send data from the executors to the driver. count(), collect(), and take() are some examples. (A sketch of take() and collect() appears after this list.)
    Let us consider an example to demonstrate an action by making use of the count() function:

from pyspark import SparkContext

sc = SparkContext("local", "Action Demo")
words = sc.parallelize(["pyspark", "interview", "questions", "at", "interviewbit"])
# count() is an action: it triggers evaluation and returns the
# number of elements in the RDD to the driver
counts = words.count()
print("Count of elements in RDD ->", counts)

In this example, we count the number of elements in the RDD. The output of this code is:

Count of elements in RDD -> 5
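As a further illustration of the immutability mentioned above, here is a minimal sketch (the app name and variable names are purely illustrative) showing that a transformation such as map() returns a new RDD and leaves the original one untouched:

from pyspark import SparkContext

sc = SparkContext("local", "Immutability Demo")
nums = sc.parallelize([1, 2, 3, 4, 5])

# map() is a transformation: it returns a *new* RDD rather than
# modifying nums in place
squares = nums.map(lambda x: x * x)

print(nums.collect())     # [1, 2, 3, 4, 5] - the original RDD is unchanged
print(squares.collect())  # [1, 4, 9, 16, 25]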
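Similarly, a minimal sketch of the other actions mentioned above, take() and collect() (reusing the same word list purely for illustration):

from pyspark import SparkContext

sc = SparkContext("local", "More Actions Demo")
words = sc.parallelize(["pyspark", "interview", "questions", "at", "interviewbit"])

# take(n) returns only the first n elements to the driver,
# whereas collect() returns all of them
print(words.take(2))    # ['pyspark', 'interview']
print(words.collect())  # ['pyspark', 'interview', 'questions', 'at', 'interviewbit']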

