1.

How can an RDD be created in Spark?

Answer»

RDDs (Resilient Distributed Datasets) are the fundamental data structure in Spark. They are immutable and fault-tolerant. There are multiple ways to create RDDs in Spark:

  • Creating an RDD from a Seq or List using parallelize

           RDDs can be created by taking an existing collection in the driver program and passing it to SparkContext's parallelize() method. Here's an example (a fuller, self-contained sketch follows the output):
      val rdd = spark.sparkContext.parallelize(Seq(("Java", 10000),
        ("Python", 200000), ("Scala", 4000)))
      rdd.foreach(println)
       Output (ordering may vary, since foreach runs on the executors):
      (Java,10000)
      (Python,200000)
      (Scala,4000)
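
      For completeness, here is a minimal self-contained sketch, assuming Spark is on the classpath; the application name and the local master URL are illustrative choices, not requirements:
      import org.apache.spark.sql.SparkSession

      object ParallelizeExample {
        def main(args: Array[String]): Unit = {
          // Local SparkSession; "local[*]" and the app name are illustrative
          val spark = SparkSession.builder()
            .appName("rdd-parallelize-example")
            .master("local[*]")
            .getOrCreate()

          val rdd = spark.sparkContext.parallelize(
            Seq(("Java", 10000), ("Python", 200000), ("Scala", 4000)))

          // collect() brings the data back to the driver, so the printing
          // order is stable, unlike foreach(println) on the executors
          rdd.collect().foreach(println)

          spark.stop()
        }
      }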

  • Creating an RDD from a text file

     In production systems, RDDs are most often created by reading data from files. Let us see how:
     val rdd = spark.sparkContext.textFile("/path/textFile.txt")
     The above line creates an RDD in which each record represents one line of the file (a short sketch of common follow-up operations is shown below).
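
      Here is a short sketch of common follow-up operations on such an RDD; the file path is hypothetical, and the word count at the end is the classic RDD example rather than anything specific to this question:
      // Hypothetical path; each element of `lines` is one line of the file
      val lines = spark.sparkContext.textFile("/path/textFile.txt")

      // Drop blank lines and count what remains
      val nonEmpty = lines.filter(_.trim.nonEmpty)
      println(s"non-empty lines: ${nonEmpty.count()}")

      // Classic word count: split on whitespace, pair each word with 1,
      // and sum the counts per word
      val counts = lines
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.take(10).foreach(println)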

  • Creating RDDs from DataFrames and Datasets

     You can easily convert any DataFrame or Dataset into an RDD using its rdd method (in Scala it is called without parentheses). Here's how:
     val myRdd2 = spark.range(20).toDF().rdd
     In the above line, spark.range(20) creates a Dataset, toDF() converts it to a DataFrame, and .rdd returns the underlying RDD of Row objects.
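
      To make the conversion concrete, here is a brief sketch; the column name "id" and the final aggregation are illustrative:
      import org.apache.spark.sql.Row

      // spark.range(20) yields a Dataset of Longs; toDF("id") turns it into
      // a DataFrame (a Dataset[Row]); .rdd exposes the underlying RDD[Row]
      val df = spark.range(20).toDF("id")
      val rowRdd: org.apache.spark.rdd.RDD[Row] = df.rdd

      // Extract the Long from each Row and aggregate on the RDD side
      val ids = rowRdd.map(_.getLong(0))
      println(ids.sum())   // prints 190.0, i.e. 0 + 1 + ... + 19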


