1.

When to use RDD, DataFrame or Dataset?

Answer»

RDD :

You should generally use RDDs in three situations:

  1. If your business logic needs functionality that the higher-level APIs do not offer. For example, if you need very tight control over physical data placement across the cluster.
  2. If your data is unstructured.
  3. If you want to do custom shared-variable manipulation, such as with broadcast variables and accumulators.
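The three situations above can be sketched together in PySpark. This is a minimal illustration, not a definitive implementation: the log format, the helper names (`parse_line`, `host_partitioner`, `collect_alerts`, `demo`) and the sample lines are all assumptions made for the example. It parses unstructured text lines, places records with a custom partition function, and uses a broadcast variable and an accumulator.

```python
import re

# Hypothetical log format, assumed for illustration: "<LEVEL> <host> <message>"
LOG_PATTERN = re.compile(r"^(\w+)\s+(\S+)\s+(.*)$")


def parse_line(line):
    """Parse one raw line into (host, (level, message)); return None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    level, host, message = match.groups()
    return host, (level, message)


def host_partitioner(host):
    """Custom placement rule: every record of a given host maps to the same partition."""
    # Deterministic across processes, unlike the salted built-in hash() for strings.
    return sum(ord(ch) for ch in host)


def collect_alerts(sc, lines, num_partitions=2):
    """Return (alert records grouped per partition, number of malformed lines)."""
    alert_levels = sc.broadcast({"ERROR", "FATAL"})  # read-only set shipped to executors
    bad_lines = sc.accumulator(0)                    # counter incremented on executors

    def parse_or_count(line):
        parsed = parse_line(line)
        if parsed is None:
            bad_lines.add(1)
        return parsed

    alerts = (
        sc.parallelize(lines)
        .map(parse_or_count)
        .filter(lambda rec: rec is not None and rec[1][0] in alert_levels.value)
        .partitionBy(num_partitions, host_partitioner)  # explicit control over placement
    )
    # glom() turns each partition into a list, so the placement becomes visible.
    return alerts.glom().collect(), bad_lines.value


def demo():
    # Imported lazily so the helpers above stay importable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-sketch")
    try:
        partitions, malformed = collect_alerts(sc, [
            "ERROR web-1 disk full",
            "INFO web-2 started",
            "not-a-log-line",
            "FATAL web-1 out of memory",
            "ERROR web-2 timeout",
        ])
        print(partitions, malformed)
    finally:
        sc.stop()
```

With a local Spark installation, calling `demo()` prints the alert records grouped by partition together with the malformed-line count. Accumulator updates made inside transformations may be applied more than once if a task is retried, so such a count is best treated as diagnostic only.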

DataFrames or Datasets :

  1. When you are dealing with structured data.
  2. When you want Spark to optimize your code automatically and deliver better performance.
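For structured data, the same kind of work can be expressed declaratively and left to Spark's optimizer. The sketch below is only an illustration: the `SALES` rows, the column names and the functions `total_by_region` and `demo` are hypothetical. It uses the DataFrame API, since the typed Dataset API is available only in Scala and Java, not in PySpark.

```python
# Hypothetical sample data, assumed for illustration: (region, product, amount)
SALES = [
    ("east", "widget", 100.0),
    ("east", "gadget", 50.0),
    ("west", "widget", 75.0),
    ("west", "gadget", 5.0),
]


def total_by_region(spark, rows, min_amount=10.0):
    """Sum amounts per region, ignoring rows whose amount is below min_amount."""
    # Imported lazily so this module stays importable without Spark installed.
    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        rows, schema="region: string, product: string, amount: double"
    )
    totals = (
        df.filter(F.col("amount") >= min_amount)
        .groupBy("region")
        .agg(F.sum("amount").alias("total"))
    )
    totals.explain()  # prints the physical plan Spark selected for this query
    return {row["region"]: row["total"] for row in totals.collect()}


def demo():
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("dataframe-sketch")
        .getOrCreate()
    )
    try:
        print(total_by_region(spark, SALES))
    finally:
        spark.stop()
```

The code only states what result is wanted (filter, group, sum). How the work is executed is decided by Spark, and the output of `explain()` shows the plan it chose.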

All in all, the DataFrame/Dataset API is the recommended choice because it is easier to use and better optimized. Backed by the Catalyst optimizer and the Tungsten execution engine, DataFrames and Datasets reduce the time you spend on manual optimization, so you can pay more attention to the data itself.


