Answer» Despite Spark being a powerful data processing engine, there are certain demerits to using Apache Spark in applications. Some of them are:
- Spark uses more storage space than MapReduce or Hadoop, which may lead to memory-related problems.
- Developers must take care while running their applications: the work should be distributed across the multiple nodes of the cluster instead of running everything on a single node.
- Since Spark relies on “in-memory” computations, the large amount of RAM required can become a bottleneck for cost-efficient big data processing.
- When files on the local filesystem are used in cluster mode, they must be accessible at the same path on all the worker nodes, since task execution can be scheduled on any worker depending on resource availability. The files either need to be copied to every worker node, or a separate network-mounted file-sharing system needs to be in place (a small sketch of shipping such a file with SparkContext.addFile follows this list).
- One of the biggest problems with Spark is handling a large number of small files. When Spark is used with Hadoop, HDFS works best with a limited number of large files rather than a large number of small files. When there is a large number of small gzipped files, Spark has to decompress each of them, holding them in memory and moving them over the network. A large amount of time is then spent burning core capacity on unzipping the files in sequence and repartitioning the resulting RDDs into a manageable format, which requires extensive shuffling overall. This hurts Spark's performance because much of the time goes into preparing the data instead of processing it (see the second sketch after this list for the usual repartitioning mitigation).
- Spark doesn’t work well in multi-user environments as it is not capable of handling many users concurrently.
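
To illustrate the local-file point above, here is a minimal PySpark sketch of one common workaround: instead of copying a file to every worker by hand, the driver ships it with SparkContext.addFile and each task resolves its local copy with SparkFiles.get. The path /tmp/lookup.csv and the lookup logic are hypothetical placeholders for illustration only.

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ship-local-file").getOrCreate()
    sc = spark.sparkContext

    # Ship the driver-local file to the working directory of every executor.
    sc.addFile("/tmp/lookup.csv")

    def tag_with_lookup(lines):
        # SparkFiles.get resolves the local path of the shipped file on
        # whichever worker node this partition happens to run.
        with open(SparkFiles.get("lookup.csv")) as f:
            lookup = dict(line.strip().split(",", 1) for line in f if line.strip())
        for line in lines:
            yield lookup.get(line.strip(), "unknown")

    result = sc.parallelize(["a", "b", "c"]).mapPartitions(tag_with_lookup).collect()
    print(result)
    spark.stop()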
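
For the small-files point, the sketch below (assuming many small .gz files under the hypothetical path hdfs:///data/small-gz/) shows the usual mitigation: since gzipped files are not splittable, each one becomes its own tiny partition, so the data is repartitioned once up front and cached, letting downstream stages work on a handful of reasonably sized partitions instead.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-gz-files").getOrCreate()
    sc = spark.sparkContext

    # Each .gz file is non-splittable, so textFile yields one partition per file.
    raw = sc.textFile("hdfs:///data/small-gz/*.gz")
    print("partitions before:", raw.getNumPartitions())

    # Pay one shuffle up front to consolidate into larger, evenly sized
    # partitions, and cache so the decompression cost is incurred only once.
    balanced = raw.repartition(16).cache()
    print("partitions after:", balanced.getNumPartitions())

    word_counts = (balanced.flatMap(lambda line: line.split())
                           .map(lambda word: (word, 1))
                           .reduceByKey(lambda a, b: a + b))
    print(word_counts.take(5))
    spark.stop()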