Answer» The following image represents how we can visualize RDD creation in PySpark: in the image, the data we start with is in list form, and after converting it to an RDD it is stored across different partitions. We have the following approaches for creating a PySpark RDD:
- Using sparkContext.parallelize(): The parallelize() method of the SparkContext can be used to create RDDs. This method loads an existing collection from the driver and parallelizes it. It is the most basic approach to creating an RDD and is used when the data is already present in memory; it therefore requires all of the data to be on the driver before the RDD is created. Code to create an RDD with the parallelize method for the Python list shown in the image above:
num_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # renamed from "list" to avoid shadowing the built-in
rdd = spark.sparkContext.parallelize(num_list)
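As a quick follow-up sketch (illustrative only, reusing the variables above), we can confirm how the data was distributed and also request an explicit partition count:
# Inspect the RDD created above
print(rdd.getNumPartitions())  # partition count; defaults to spark.default.parallelism
print(rdd.collect())           # brings all elements back to the driver as a list
# A specific number of partitions can also be requested:
rdd_4 = spark.sparkContext.parallelize(num_list, 4)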
- Using sparkContext.textFile(): Using this method, we can read a .txt file and convert it into an RDD, where each element of the RDD is one line of the file. Syntax:
rdd_txt = spark.sparkContext.textFile("/path/to/textFile.txt")
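A minimal usage sketch, assuming a file actually exists at the path above:
# Each record of rdd_txt is one line of the file
print(rdd_txt.count())        # total number of lines
first_lines = rdd_txt.take(2) # first two lines as Python strings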
- Using sparkContext.wholeTextFiles(): This method returns a PairRDD (an RDD containing key-value pairs) in which the file path is the key and the file content is the value.
# Reads the entire file into the RDD as a single record
rdd_whole_text = spark.sparkContext.wholeTextFiles("/path/to/textFile.txt")
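A short illustrative sketch (again assuming the path above exists) showing how the key-value structure can be unpacked:
# Each record is a (filePath, fileContent) tuple
for path, content in rdd_whole_text.collect():
    print(path)          # e.g. file:/path/to/textFile.txt
    print(len(content))  # length of the whole file content as one string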
We can also read CSV, JSON, Parquet and various other formats and create RDDs from them.
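One common way to do this, sketched below with hypothetical file paths, is to read the file through the DataFrame reader and then drop down to the underlying RDD via the DataFrame's .rdd attribute:
# Read structured files with the DataFrame API, then convert to an RDD of Row objects
csv_rdd = spark.read.csv("/path/to/data.csv", header=True).rdd
json_rdd = spark.read.json("/path/to/data.json").rdd
parquet_rdd = spark.read.parquet("/path/to/data.parquet").rdd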
- Empty RDD with no partitions using sparkContext.emptyRDD(): An RDD with no data is called an empty RDD. We can create such an RDD, having no partitions, with the emptyRDD() method as shown in the code piece below:
empty_rdd = spark.sparkContext.emptyRDD()  # note the parentheses: emptyRDD is a method
# In Scala an element type can be specified, e.g. sparkContext.emptyRDD[String];
# PySpark RDDs are untyped, so no equivalent annotation is needed
- Empty RDD with partitions using sparkContext.parallelize(): When we do not require data but do require partitions, we create an empty RDD with the parallelize method as shown below:
# Create an empty RDD with 20 partitions
empty_partitioned_rdd = spark.sparkContext.parallelize([], 20)
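As a quick sanity check on both empty RDDs above (illustrative only), getNumPartitions() should report 0 for the emptyRDD() variant and 20 for the parallelized one:
print(empty_rdd.getNumPartitions())              # 0 - no partitions at all
print(empty_partitioned_rdd.getNumPartitions())  # 20 - partitioned but empty
print(empty_partitioned_rdd.isEmpty())           # True - still contains no data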