1. Differentiate between Spark Datasets, DataFrames, and RDDs.

Answer»
| Criteria | Spark Datasets | Spark DataFrames | Spark RDDs |
| --- | --- | --- | --- |
| Representation of data | A Dataset combines features of DataFrames and RDDs, adding static type safety and an object-oriented interface. | A DataFrame is a distributed collection of data organized into named columns. | An RDD is a distributed collection of data without a schema. |
| Optimization | Datasets use the Catalyst optimizer. | DataFrames also use the Catalyst optimizer. | RDDs have no built-in optimization engine. |
| Schema projection | Datasets find the schema automatically through the SQL engine. | DataFrames also find the schema automatically. | The schema must be defined manually for RDDs. |
| Aggregation speed | Dataset aggregation is faster than RDD aggregation but slower than DataFrame aggregation, because typed lambdas are opaque to the optimizer. | Aggregations are fastest with DataFrames, whose column-based API lets the Catalyst optimizer plan the whole aggregation. | RDDs are slower than both DataFrames and Datasets, even for simple operations such as grouping data. |
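
The differences above are easier to see when the same aggregation is written against each API. The following is a minimal sketch, not a benchmark. It assumes the `spark-sql` library is on the classpath and runs Spark in local mode; the `Sale` case class, the `ApiComparison` object, and the sample values are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, amount: Double)

object ApiComparison {
  // Hypothetical sample data.
  val sales: Seq[Sale] =
    Seq(Sale("north", 10.0), Sale("south", 5.0), Sale("north", 2.5))

  // RDD: no schema and no Catalyst; the grouping logic is written by hand.
  def withRdd(spark: SparkSession): Map[String, Double] = {
    val rdd: RDD[Sale] = spark.sparkContext.parallelize(sales)
    rdd.map(s => (s.region, s.amount)).reduceByKey(_ + _).collect().toMap
  }

  // DataFrame: named columns, schema inferred from the case class,
  // aggregation expressed declaratively so Catalyst can plan it.
  def withDataFrame(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    val df: DataFrame = sales.toDF()
    df.groupBy("region").sum("amount")
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap
  }

  // Dataset: same engine, but fields are checked at compile time.
  // The lambdas below are opaque to Catalyst.
  def withDataset(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    val ds: Dataset[Sale] = sales.toDS()
    ds.groupByKey(_.region)
      .mapGroups((region, rows) => (region, rows.map(_.amount).sum))
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-comparison")
      .master("local[*]")
      .getOrCreate()
    println(withRdd(spark))
    println(withDataFrame(spark))
    println(withDataset(spark))
    spark.stop()
  }
}
```

All three methods compute the same per-region totals. A misspelled field such as `_.regoin` in the Dataset version fails at compile time, whereas a misspelled column name in `groupBy("regoin")` fails only when the query is analyzed at run time.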

