1.

What do you understand about PySpark DataFrames?

Answer»

A PySpark DataFrame is a distributed collection of well-organized data, equivalent to a table in a relational database, with the data placed into named columns. PySpark DataFrames are better optimized than plain R or Python data structures. They can be created from different sources such as Hive tables, structured data files, existing RDDs, and external databases.

The data in a PySpark DataFrame is distributed across the machines in the cluster, and operations performed on it run in parallel on all of those machines. DataFrames can handle large collections of structured or semi-structured data, up to petabytes in size.


