1.

What is an RDD? How does a Spark RDD work? What are the various ways to create an RDD?

Answer»

A Resilient Distributed Dataset (RDD) is the core abstraction of the Spark framework: a fault-tolerant collection of elements that can be operated on in parallel.

Below are the key points on RDDs:

  • An RDD is an immutable, distributed collection of objects.
  • RDDs follow an in-memory computation paradigm.
  • An RDD is divided into logical partitions, which are computed on different worker nodes.
  • An RDD can keep its state in memory as an object across jobs, and that object is shareable between those jobs.
  • Data sharing through RDDs is faster than sharing through network and disk I/O, because RDDs use in-memory computation.
  • The name itself describes how an RDD works:
    • Resilient: fault-tolerant; using lineage information, Spark can recover or recompute partitions that are missing or damaged due to node failures.
    • Distributed: the data resides on multiple nodes in a cluster.
    • Dataset: a collection of partitioned data holding primitive values or objects such as tuples.
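The points above can be sketched in the Spark shell (`spark-shell`), where `sc` is the preconfigured `SparkContext`; the partition count and the lineage Spark uses for recovery are both directly visible (the numbers here are illustrative):

```scala
// In spark-shell, `sc` (SparkContext) is already available.
// An RDD is immutable and partitioned; transformations build a lineage
// that Spark uses to recompute lost partitions after node failures.
val nums = sc.parallelize(1 to 100, numSlices = 4) // 4 logical partitions
println(nums.getNumPartitions)                     // prints 4
val doubled = nums.map(_ * 2)                      // new RDD; `nums` is unchanged
println(doubled.toDebugString)                     // shows the lineage graph
```

Note that `map` does not modify `nums`; it returns a new RDD whose lineage records how to rebuild each partition from its parent.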

We can create an RDD using the approaches below:

By referencing a dataset in external storage:

  • val byTextFile = sc.textFile("hdfs://..." /* or "s3://..." */)

By parallelizing a local collection (note: `sc.parallelize` takes a local Scala collection on the driver, not a DataFrame or Dataset):

  • val byParallelize = sc.parallelize(Seq(1, 2, 3), numSlices = 2)

By converting a DataFrame (or Dataset) to an RDD:

  • val byDF = df.rdd
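All three creation approaches can be seen together in one spark-shell sketch; the file paths and bucket names below are placeholders, and `spark` is the preconfigured `SparkSession`:

```scala
// 1. Reference a dataset in external storage (HDFS, S3, local file):
val byTextFile = sc.textFile("hdfs://namenode:9000/data/input.txt")

// 2. Parallelize a local collection held on the driver:
val byParallelize = sc.parallelize(Seq("a", "b", "c"), numSlices = 2)

// 3. Convert an existing DataFrame to an RDD of Rows:
val df   = spark.read.json("s3a://bucket/events.json") // placeholder path
val byDF = df.rdd
```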

RDDs predominantly support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
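The distinction matters because transformations are lazy: Spark does no work until an action is called. A minimal spark-shell sketch:

```scala
// Transformations (filter, map) are lazy: they only extend the lineage.
val words = sc.parallelize(Seq("spark", "rdd", "lazy", "action"))
val long  = words.filter(_.length > 4)   // transformation: nothing runs yet
val upper = long.map(_.toUpperCase)      // transformation: still nothing runs

// Actions trigger execution and return a value to the driver program.
println(upper.count())                   // action → prints 2
upper.collect().foreach(println)         // action → prints SPARK, ACTION
```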


