1.

Define Spark DataFrames.

Answer»

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database and is optimized for big data operations.
DataFrames can be created from a wide range of data sources, such as external databases, existing RDDs, and Hive tables. The main features of Spark DataFrames are:

  • Spark DataFrames can process data ranging in size from kilobytes to petabytes, on anything from a single node to a large cluster.
  • They support many data formats, such as CSV, Avro, and Elasticsearch, and many storage systems, such as HDFS, Cassandra, and MySQL.
  • Queries are optimized by Spark SQL's Catalyst optimizer, which applies state-of-the-art query optimization.
  • They integrate easily with major big data tools through Spark Core.

