| Criterion | Datasets | DataFrames | RDDs |
|---|---|---|---|
| Representation of data | A Spark Dataset combines features of DataFrames and RDDs, adding static type safety and an object-oriented interface. | A Spark DataFrame is a distributed collection of data organized into named columns. | A Spark RDD is a distributed collection of data without a schema. |
| Optimization | Datasets use the Catalyst optimizer. | DataFrames also use the Catalyst optimizer. | RDDs have no built-in optimization engine. |
| Schema projection | Datasets infer the schema automatically through the Spark SQL engine. | DataFrames also infer the schema automatically. | The schema must be defined manually for RDDs. |
| Aggregation speed | Dataset aggregation is faster than RDD aggregation but slower than DataFrame aggregation. | Aggregations are fastest in DataFrames because of their easy and powerful aggregation APIs. | RDDs are slower than both DataFrames and Datasets, even for simple operations such as grouping data. |
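
The differences in the table can be made concrete by running the same aggregation through all three APIs. The following is a minimal sketch, not a benchmark: the `Sale` case class, the sample rows, the application name, and the local master setting are assumptions introduced only for illustration, and the code assumes a Spark 3.x dependency on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, amount: Double)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would configure this differently.
    val spark = SparkSession.builder()
      .appName("api-comparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("north", 10.0), Sale("south", 5.0), Sale("north", 2.5))

    // RDD: no schema and no Catalyst; the aggregation is written by hand.
    val rddTotals = spark.sparkContext.parallelize(sales)
      .map(s => (s.region, s.amount))
      .reduceByKey(_ + _)

    // DataFrame: named columns, schema inferred from the case class,
    // and the whole aggregation is visible to the Catalyst optimizer.
    val df = sales.toDF()
    val dfTotals = df.groupBy("region").sum("amount")

    // Dataset: rows are typed Sale objects checked at compile time,
    // but the lambdas below are opaque to Catalyst.
    val ds: Dataset[Sale] = sales.toDS()
    val dsTotals = ds.groupByKey(_.region)
      .mapGroups((region, rows) => (region, rows.map(_.amount).sum))

    rddTotals.collect().foreach(println)
    dfTotals.show()
    dsTotals.show()

    // Print the physical plan Spark chose for the DataFrame aggregation.
    dfTotals.explain()

    spark.stop()
  }
}
```

Calling `explain()` on `dsTotals` as well and comparing the two plans is one way to see why the table ranks typed Dataset aggregation below DataFrame aggregation: the typed version has to deserialize rows into `Sale` objects before the user-supplied function can run.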