Interview Solution
1. What is RDD, DataFrame, and Dataset?
Answer» RDD: RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable, partitioned collection of records that can be operated on in parallel. RDDs are the low-level API; they were the primary API in the Spark 1.x series and are still available in 2.x, but they are rarely used directly. However, all Spark code you run, whether written against DataFrames or Datasets, compiles down to RDDs.

DataFrame: A DataFrame is a table-like collection with well-defined rows and columns. Each column has the same number of rows as every other column, and each column carries type information that must be consistent for every row in the collection. To Spark, a DataFrame represents an immutable, lazily evaluated plan that specifies what operations to apply to data residing at some location in order to generate output. When we perform an action on a DataFrame, we instruct Spark to execute the actual transformations and return the result.

Dataset: Datasets are the foundational type of the Structured APIs; a DataFrame is simply a Dataset of type Row. Datasets are like DataFrames, but they are strictly a JVM (Java Virtual Machine) language feature and work only with Java and Scala. A Dataset can be described as a "strongly typed, immutable collection of objects that are mapped to a relational schema" in Spark.
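The three APIs above can be sketched side by side. This is a minimal illustration, assuming a local Spark 2.x+ setup; the `Person` case class and the sample data are made up for this example and are not from the original answer.

```scala
import org.apache.spark.sql.SparkSession

object RddDfDsDemo {
  // Hypothetical schema used only for illustration
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-df-ds-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD: low-level, partitioned collection of records
    val rdd = spark.sparkContext.parallelize(
      Seq(Person("Ana", 34), Person("Bo", 41)))

    // DataFrame: a Dataset[Row] — schema is known, element type is the generic Row
    val df = rdd.toDF()               // columns: name, age
    df.filter($"age" > 35).show()     // untyped column expression, checked at runtime

    // Dataset: strongly typed — the lambda is checked by the Scala compiler
    val ds = rdd.toDS()
    ds.filter(_.age > 35).show()

    spark.stop()
  }
}
```

The practical difference this shows: a typo such as `$"agee"` in the DataFrame filter would only fail at runtime, while `ds.filter(_.age > 35)` is verified at compile time, which is what "strongly typed" means for Datasets.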