1. What Makes Apache Spark Good At Low-latency Workloads Like Graph Processing And Machine Learning?

Answer» Apache Spark stores data in-memory, which makes model building and training faster. Machine learning algorithms require multiple iterations to converge on an optimal model, and graph algorithms repeatedly traverse all the nodes and edges. Keeping the data in memory lets these low-latency, iterative workloads run with far less disk access and controlled network traffic, which makes a huge difference when there is a lot of data to process.

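A minimal sketch of an iterative workload that benefits from in-memory caching (the input path and the iteration logic are illustrative, not from the original answer):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative").setMaster("local[*]"))

    // Cache the input once; every iteration below reuses the in-memory copy
    // instead of re-reading it from disk.
    val points = sc.textFile("data/points.txt")      // hypothetical path
                   .map(_.split(",").map(_.toDouble))
                   .cache()

    var w = 0.0
    for (_ <- 1 to 10) {                             // iterative refinement loop
      w += points.map(p => p(0) * 0.01).sum() / points.count()
    }
    println(s"final value: $w")
    sc.stop()
  }
}
```
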
2. What Does The Spark Engine Do?

Answer» The Spark engine schedules, distributes and monitors the data application across the Spark cluster.

3. What Do You Understand By Executor Memory In A Spark Application?

Answer» Every Spark application has the same fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application has one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.

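A small sketch of setting it programmatically; the 4g value is illustrative:

```scala
import org.apache.spark.SparkConf

// Equivalent to passing --executor-memory 4g to spark-submit (value is illustrative).
val conf = new SparkConf()
  .setAppName("memory-example")
  .set("spark.executor.memory", "4g")
```
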
4. Is It Necessary To Install Spark On All The Nodes Of A Yarn Cluster While Running Apache Spark On Yarn?

Answer» No, it is not necessary because Apache Spark runs on top of YARN.

5. What Are The Disadvantages Of Using Apache Spark Over Hadoop Mapreduce?

Answer» Apache Spark does not scale as well for compute-intensive jobs and consumes a large number of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.

6. What Do You Understand By SchemaRDD?

Answer» An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.

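SchemaRDD later evolved into the DataFrame API. A minimal sketch of building a schema-aware collection from a case class (the names and values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

case class Employee(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("schema-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Each row carries schema information (column names and types).
val df = Seq(Employee("alice", 30), Employee("bob", 25)).toDF()
df.printSchema()
```
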
7. Define A Worker Node.

Answer» A node that can run the Spark application code in a cluster is called a worker node. A worker node can have more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.

8. What Do You Understand By Lazy Evaluation?

Answer» Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget - but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated till you perform an action. This helps optimize the overall data processing workflow.

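A small sketch of lazy evaluation: the filter and map lines only build a plan; only the count() action triggers computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy").setMaster("local[*]"))

val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)    // not executed yet
val squared = evens.map(n => n.toLong * n)  // still not executed

// The action below is what actually runs the whole pipeline.
println(squared.count())
```
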
9. Explain About The Core Components Of A Distributed Spark Application.

Answer» Driver: The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
Executor: The worker processes that run the individual tasks of a Spark job.
Cluster Manager: A pluggable component in Spark used to launch executors and drivers. The cluster manager allows Spark to run on top of external managers like Apache Mesos or YARN.

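A minimal sketch of the driver side: the SparkContext created in main() is what registers with the cluster manager and requests executors (the master URL is a hypothetical standalone master):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext, which registers with the
    // cluster manager (a standalone master here; it could also be YARN or Mesos).
    val conf = new SparkConf()
      .setAppName("driver-example")
      .setMaster("spark://master-host:7077")   // hypothetical master URL
    val sc = new SparkContext(conf)

    // Work defined here is split into tasks and shipped to the executors.
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}
```
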
10. Hadoop Uses Replication To Achieve Fault Tolerance. How Is This Achieved In Apache Spark?

Answer» The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to a failure, lineage helps rebuild only that particular lost partition.

11. How Can You Achieve High Availability In Apache Spark?

Answer» In standalone mode, high availability can be achieved by running standby masters coordinated through ZooKeeper, so that a standby takes over if the active master fails, or by using single-node recovery with the local file system to restore the master state after a restart.

12. How Spark Uses Akka?

Answer» Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master just assigns the task. Here Spark uses Akka for messaging between the workers and masters.

13. How Can You Launch Spark Jobs Inside Hadoop Mapreduce?

Answer» Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.

14. Does Apache Spark Provide Check Pointing?

Answer» Lineage graphs are always useful to recover RDDs from a failure, but this is generally time consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.

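A minimal checkpointing sketch (the checkpoint directory is hypothetical): set a checkpoint directory, mark the RDD, and the lineage is truncated once an action materializes it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory (an HDFS path in a real cluster)

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()          // mark for checkpointing; happens on the next action
println(rdd.count())      // the action triggers both the computation and the checkpoint
```
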
15. How Spark Handles Monitoring And Logging In Standalone Mode?

Answer» Spark has a web-based user interface for monitoring the cluster in standalone mode that shows the cluster and job statistics. The log output for each job is written to the work directory of the slave nodes.

16. What Are The Various Levels Of Persistence In Apache Spark?

Answer» Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on the RDD in case they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both, with different replication levels. The various storage/persistence levels in Spark are:
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP

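A small sketch of choosing a persistence level explicitly (the input path is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist").setMaster("local[*]"))

val lines = sc.textFile("data/input.txt")       // hypothetical path
lines.persist(StorageLevel.MEMORY_AND_DISK)     // spill to disk if it does not fit in memory
println(lines.count())                          // first action materializes and caches the RDD
println(lines.filter(_.nonEmpty).count())       // reuses the persisted data
```
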
17. What Is The Difference Between Persist() And Cache()?

Answer» persist() allows the user to specify the storage level, whereas cache() uses the default storage level, MEMORY_ONLY.

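In code, a two-line comparison on illustrative RDDs:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("cache-vs-persist").setMaster("local[*]"))

val rddA = sc.parallelize(1 to 100)
val rddB = sc.parallelize(1 to 100)

rddA.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
rddB.persist(StorageLevel.MEMORY_AND_DISK_SER)  // explicitly chosen storage level
```
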
18. How Can You Remove The Elements With A Key Present In Any Other Rdd?

Answer» Use the subtractByKey() function.

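A small sketch with illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("subtract").setMaster("local[*]"))

val visits = sc.parallelize(Seq(("alice", 3), ("bob", 1), ("carol", 7)))
val banned = sc.parallelize(Seq(("bob", true)))

// Keeps only the pairs whose key does NOT appear in `banned`.
val allowed = visits.subtractByKey(banned)
allowed.collect().foreach(println)   // (alice,3), (carol,7)
```
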
19. What Is Spark Core?

Answer» It has all the basic functionalities of Spark, like memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.

20. Is Apache Spark A Good Fit For Reinforcement Learning?

Answer» No. Apache Spark works well only for simple machine learning algorithms like clustering, regression and classification.

21. Explain About The Popular Use Cases Of Apache Spark?

Answer» Apache Spark is mainly used for:
Iterative machine learning
Interactive data analytics and processing
Stream processing
Sensor data processing

22. Explain About The Different Types Of Transformations On Dstreams?

Answer» Stateless Transformations: Processing of the batch does not depend on the output of the previous batch. Examples: map(), reduceByKey(), filter().
Stateful Transformations: Processing of the batch depends on the intermediary results of the previous batch. Examples: transformations that depend on sliding windows.

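A sketch of both kinds on a word-count DStream (the socket source, batch interval and checkpoint path are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("dstreams").setMaster("local[2]"), Seconds(5))
ssc.checkpoint("/tmp/stream-checkpoints")   // stateful transformations need a checkpoint dir (hypothetical path)

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map(w => (w, 1))

// Stateless: each 5-second batch is counted independently of earlier batches.
val perBatch = words.reduceByKey(_ + _)

// Stateful: running totals carried across batches.
val runningTotals = words.updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newCounts.sum)
}

perBatch.print()
runningTotals.print()
ssc.start()
ssc.awaitTermination()
```
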
23. What Do You Understand By Pair Rdd?

Answer» Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together based on the elements having the same key.

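A small sketch showing both operations on illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pair-rdd").setMaster("local[*]"))

val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 0.5), ("pears", 0.8)))

val totals = sales.reduceByKey(_ + _)   // (apples,8), (pears,2)
val joined = totals.join(prices)        // (apples,(8,0.5)), (pears,(2,0.8))

joined.collect().foreach(println)
```
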
24. What Are The Key Features Of Apache Spark That You Like?

Answer» Speed through in-memory computation, support for multiple languages (Scala, Java, Python and R), lazy evaluation of transformations, built-in libraries for SQL, streaming, machine learning and graph processing, and the ability to run on Hadoop YARN, Apache Mesos or its own standalone cluster manager.

25. What Are The Various Data Sources Available In Sparksql?

Answer» Parquet files, JSON datasets and Hive tables are the commonly used data sources available in Spark SQL.

26. What Is The Advantage Of A Parquet File?

Answer» A Parquet file is a columnar format file that helps:
Limit I/O operations
Consume less space
Fetch only the required columns

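A minimal read/write sketch (the paths and the userId column are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet").master("local[*]").getOrCreate()

val df = spark.read.json("data/events.json")     // hypothetical input
df.write.parquet("data/events.parquet")          // stored columnar and compressed on disk

// Reading back and selecting a single column only scans that column's data.
spark.read.parquet("data/events.parquet").select("userId").show()
```
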
27. What Are The Common Mistakes Developers Make When Running Spark Applications?

Answer» Developers often make the mistake of:
Hitting the web service several times by using multiple clusters.
Running everything on the local node instead of distributing it.

28. How Can You Compare Hadoop And Spark In Terms Of Ease Of Use?

Answer» Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark, i.e. Spark SQL, for SQL lovers - making it comparatively easier to use than Hadoop.

29. Why Is Blinkdb Used?

Answer» BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data that renders query results marked with meaningful error bars. BlinkDB helps users balance query accuracy with response time.

30. Which Spark Library Allows Reliable File Sharing At Memory Speed Across Different Cluster Frameworks?

Answer» Tachyon

31. Name A Few Companies That Use Apache Spark In Production.

Answer» Pinterest, Conviva, Shopify, OpenTable

32. What Is Catalyst Framework?

Answer» Catalyst is the optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.

33. When Running Spark Applications, Is It Necessary To Install Spark On All The Nodes Of Yarn Cluster?

Answer» Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

34. What Is A Dstream?

Answer» A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two kinds of operations:
Transformations, which produce a new DStream.
Output operations, which write data to an external system.

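A minimal sketch creating a DStream from a socket source (host, port and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("dstream").setMaster("local[2]"), Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)   // each batch becomes one RDD in the DStream
lines.count().print()                                 // a transformation followed by an output operation

ssc.start()
ssc.awaitTermination()
```
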
35. What Is The Significance Of Sliding Window Operation?

Answer» A sliding window controls the transmission of data packets between various computer networks. The Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.

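A windowed word-count sketch: a 30-second window sliding every 10 seconds (the source and durations are illustrative; both must be multiples of the batch interval).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("window").setMaster("local[2]"), Seconds(5))

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// Counts over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()
```
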
36. What Are The Benefits Of Using Spark With Apache Mesos?

Answer» It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

37. Explain About The Major Libraries That Constitute The Spark Ecosystem?

Answer» Spark MLlib: the machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
Spark Streaming: the library used to process real-time streaming data.
Spark GraphX: the Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
Spark SQL: helps execute SQL-like queries on Spark data using standard visualization or BI tools.

38. How Can You Trigger Automatic Clean-ups In Spark To Handle Accumulated Metadata?

Answer» You can trigger the clean-ups by setting the parameter spark.cleaner.ttl or by dividing the long-running jobs into different batches and writing the intermediary results to disk.

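A minimal sketch of the first approach (the TTL value in seconds is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Metadata older than the TTL (here 3600 seconds) becomes eligible for cleanup.
val conf = new SparkConf()
  .setAppName("cleanup")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
```
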
39. What Is Lineage Graph?

Answer» The RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.

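You can inspect an RDD's lineage with toDebugString; a small sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

val derived = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Prints the chain of parent RDDs this RDD was built from.
println(derived.toDebugString)
```
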
40. Is It Possible To Run Spark And Mesos Along With Hadoop?

Answer» Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

41. Why Is There A Need For Broadcast Variables When Working With Apache Spark?

Answer» These are read-only variables, present in the in-memory cache on every machine. When working with Spark, the use of broadcast variables eliminates the need to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances retrieval efficiency when compared to an RDD lookup().

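A small sketch of a broadcast lookup table with illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast").setMaster("local[*]"))

// Shipped to each executor once, instead of once per task.
val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

val codes = sc.parallelize(Seq("US", "IN", "US"))
val named = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
named.collect().foreach(println)
```
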
42. How Can You Minimize Data Transfers When Working With Spark?

Answer» Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
Using broadcast variables to efficiently give every node a copy of a large input dataset.
Using accumulators to update the values of variables in parallel while executing.
Avoiding repartition and ByKey operations that trigger shuffles, wherever possible.

43. How Can Spark Be Connected To Apache Mesos?

Answer» To connect Spark with Mesos:
Configure the Spark driver program to connect to Mesos.
Put the Spark binary package in a location accessible by Mesos, or install Spark in the same location on all Mesos slaves and set the spark.mesos.executor.home property to point to that location.

44. Explain About The Different Cluster Managers In Apache Spark?

Answer» The 3 different cluster managers supported in Apache Spark are:
Standalone deployment mode
Apache Mesos
Hadoop YARN

45. Is It Possible To Run Apache Spark On Apache Mesos?

Answer» Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

46. Can You Use Spark To Access And Analyse Data Stored In Cassandra Databases?

Answer» Yes, it is possible if you use the Spark Cassandra Connector.

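A minimal sketch using the Spark Cassandra Connector (the keyspace, table and host are hypothetical, and the connector package must be on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // provided by the spark-cassandra-connector package

val conf = new SparkConf()
  .setAppName("cassandra-example")
  .setMaster("local[*]")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Reads the table as an RDD of CassandraRow objects.
val rows = sc.cassandraTable("my_keyspace", "users")   // hypothetical keyspace and table
println(rows.count())
```
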
47. What Are The Languages Supported By Apache Spark For Developing Big Data Applications?

Answer» Scala, Java, Python, R and Clojure

48. Explain About Transformations And Actions In The Context Of Rdds.

Answer» Transformations are functions executed on demand to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey. Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.

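A small sketch with both kinds of operations on illustrative data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "mesos"))

// Transformations: lazily describe new RDDs.
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions: trigger computation and return results to the driver.
println(counts.collect().mkString(", "))   // e.g. (spark,2), (hadoop,1), (mesos,1)
println(words.first())
println(words.take(2).mkString(", "))
```
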
49. What Is Rdd?

Answer» RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:
Immutable - RDDs cannot be altered.
Resilient - if a node holding a partition fails, another node takes over the data.

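A minimal sketch of the two common ways to create an RDD (the file path is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-create").setMaster("local[*]"))

// 1. Parallelizing an existing collection in the driver program.
val numbers = sc.parallelize(1 to 1000)

// 2. Referencing a dataset in external storage (local file, HDFS, S3, ...).
val lines = sc.textFile("data/input.txt")   // hypothetical path

println(numbers.count())
```
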
50. What Is A Sparse Vector?

Answer» A sparse vector has two parallel arrays - one for indices and the other for values. These vectors are used for storing non-zero entries to save space.

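A small sketch using Spark MLlib's Vectors factory: a vector of size 5 whose only non-zero entries sit at indices 0 and 3 (the values are illustrative).

```scala
import org.apache.spark.mllib.linalg.Vectors

// Size 5; indices (0, 3) hold the values (1.0, 4.5); everything else is implicitly 0.0.
val sv = Vectors.sparse(5, Array(0, 3), Array(1.0, 4.5))
println(sv)      // (5,[0,3],[1.0,4.5])
println(sv(3))   // 4.5
```
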