1.

What is an RDD? How does a Spark RDD work? What are the various ways to create an RDD?

Answer»

A Resilient Distributed Dataset (RDD) is the core abstraction of the Spark framework: a fault-tolerant collection of elements that can be operated on in parallel.

Below are the key points on RDDs:

  • An RDD is an immutable, distributed collection of objects.
  • RDDs follow an in-memory computation paradigm.
  • An RDD is divided into logical partitions, which are computed on different worker nodes.
  • An RDD can keep its state in memory as an object across jobs, and that object is shareable between those jobs.
  • Data sharing through RDDs is faster than sharing through network and disk I/O, because RDDs use in-memory computation.
  • The name itself describes how an RDD works:
    • Resilient: fault-tolerant; using lineage information, Spark can recover or recompute partitions that are missing or damaged due to node failures.
    • Distributed: the data resides on multiple nodes in a cluster.
    • Dataset: a collection of partitioned data holding primitive values or objects such as tuples.
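The points above can be sketched in the Spark shell (`spark-shell`), where `sc` is the preconfigured `SparkContext`; the partition count and the lineage Spark uses for recovery are both directly visible (the numbers here are illustrative):

```scala
// In spark-shell, `sc` (SparkContext) is already available.
// An RDD is immutable and partitioned; transformations build a lineage
// that Spark uses to recompute lost partitions after node failures.
val nums = sc.parallelize(1 to 100, numSlices = 4) // 4 logical partitions
println(nums.getNumPartitions)                     // prints 4
val doubled = nums.map(_ * 2)                      // new RDD; `nums` is unchanged
println(doubled.toDebugString)                     // shows the lineage graph
```

Note that `map` does not modify `nums`; it returns a new RDD whose lineage records how to rebuild each partition from its parent.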

We can create an RDD using the approaches below:

By referencing a dataset in external storage:

  • val byTextFile = sc.textFile("hdfs://..." /* or "s3://..." */)

By parallelizing a local collection (note: `sc.parallelize` takes a local Scala collection on the driver, not a DataFrame or Dataset):

  • val byParallelize = sc.parallelize(Seq(1, 2, 3), numSlices = 2)

By converting a DataFrame (or Dataset) to an RDD:

  • val byDF = df.rdd
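All three creation approaches can be seen together in one spark-shell sketch; the file paths and bucket names below are placeholders, and `spark` is the preconfigured `SparkSession`:

```scala
// 1. Reference a dataset in external storage (HDFS, S3, local file):
val byTextFile = sc.textFile("hdfs://namenode:9000/data/input.txt")

// 2. Parallelize a local collection held on the driver:
val byParallelize = sc.parallelize(Seq("a", "b", "c"), numSlices = 2)

// 3. Convert an existing DataFrame to an RDD of Rows:
val df   = spark.read.json("s3a://bucket/events.json") // placeholder path
val byDF = df.rdd
```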

RDDs predominantly support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
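The distinction matters because transformations are lazy: Spark does no work until an action is called. A minimal spark-shell sketch:

```scala
// Transformations (filter, map) are lazy: they only extend the lineage.
val words = sc.parallelize(Seq("spark", "rdd", "lazy", "action"))
val long  = words.filter(_.length > 4)   // transformation: nothing runs yet
val upper = long.map(_.toUpperCase)      // transformation: still nothing runs

// Actions trigger execution and return a value to the driver program.
println(upper.count())                   // action → prints 2
upper.collect().foreach(println)         // action → prints SPARK, ACTION
```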


