1. What are RDDs in PySpark?
Answer» RDD stands for Resilient Distributed Dataset. RDDs are the elements used to run and operate across multiple nodes so that parallel processing can be performed on a cluster. RDDs are immutable, which means that once an RDD is created, it cannot be modified. RDDs are also fault-tolerant, meaning that whenever a failure happens they can be recovered automatically. Multiple operations can be performed on RDDs to accomplish a task. These operations are of 2 types, both illustrated in the sketches below:
Transformation: operations such as filter(), map(), or flatMap() that are applied to an RDD and produce a new RDD.
Action: operations such as count(), collect(), or reduce() that instruct Spark to perform the computation and return the result to the driver.
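The original code for the filter example is not shown here, so the following is a minimal sketch of what it likely looks like: a filter() transformation followed by a collect() action. The local SparkContext setup, the app name, and the sample word list are assumptions made for illustration.

```python
from pyspark import SparkContext

# Assumed local SparkContext for illustration purposes
sc = SparkContext("local", "Filter app")

# Create an RDD from a Python list (sample values are assumptions)
words = sc.parallelize(
    ["scala", "java", "hadoop", "interview", "interviewbit"]
)

# Transformation: keep only the elements containing the substring 'interview'
filtered = words.filter(lambda word: "interview" in word)

# Action: collect() returns the filtered elements to the driver
print("Filtered RDD -> %s" % (filtered.collect()))

sc.stop()
```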
The above code filters all the elements in the list that contain 'interview'. The output of the code would be: ["interview", "interviewbit"]
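The counting example referenced next is also not shown, so here is a minimal sketch of an RDD count using the count() action. Again, the local SparkContext and the five sample elements are assumptions chosen so the output matches the stated result.

```python
from pyspark import SparkContext

# Assumed local SparkContext for illustration purposes
sc = SparkContext("local", "Count app")

# Create an RDD with five elements (sample values are assumptions)
words = sc.parallelize(
    ["scala", "java", "hadoop", "spark", "pyspark"]
)

# Action: count() returns the number of elements in the RDD
print("Count of elements in RDD -> %i" % (words.count()))

sc.stop()
```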
In this sketch, we count the number of elements in the Spark RDD using the count() action. The output of this code is: Count of elements in RDD -> 5