
51.

What is an executor in Spark, and how does it support operations on large volumes of data?

Answer»

Executors are worker-node processes in charge of running individual tasks when a Spark job is submitted. They are launched at the beginning of a Spark application and typically run for the entire lifetime of the application. Once they have run their tasks, they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs, through the Block Manager.

Below are the key points on executors:

  • Every Spark application has its own executor processes.
  • Executors perform all the data processing.
  • They read data from and write data to external sources.
  • Executors store computation results in memory, in cache, or on hard disk drives.

An executor also works as a distributed agent responsible for the execution of tasks. When a job is launched, Spark triggers the executors, which act as worker processes responsible for running the individual tasks assigned by the Spark driver.
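As a rough illustration of how executor resources are declared when a job is submitted, the sketch below sets executor memory, cores, and instance count through SparkConf. The application name, resource values, and input size are illustrative assumptions, not values from the original answer; the master URL is assumed to be supplied via spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: executor resources are declared up front; the cluster
    // manager then launches that many executor processes, each of which runs
    // tasks assigned by the driver and caches RDD partitions in memory.
    // The master URL is assumed to come from spark-submit --master.
    val conf = new SparkConf()
      .setAppName("executor-demo")            // example application name
      .set("spark.executor.instances", "4")   // number of executor processes
      .set("spark.executor.cores", "2")       // task slots per executor
      .set("spark.executor.memory", "4g")     // heap available for tasks and cache

    val sc = new SparkContext(conf)

    // Each partition of this RDD becomes a task scheduled onto an executor.
    val data = sc.parallelize(1 to 1000000, numSlices = 8)
    println(data.map(_ * 2).reduce(_ + _))

    sc.stop()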

52.

Why do we need the driver (master) in Spark?

Answer»

The driver is the central point and the entry point of the Spark shell (supported for Scala, Python, and R). Below is the sequential process the driver follows to execute a Spark job.

  • The driver runs the main() function of the application, which creates the SparkContext.
  • The driver program runs on the master node of the Spark cluster and schedules the job execution.
  • It translates the RDDs into an execution graph and splits the graph into multiple stages.
  • The driver stores metadata about all the Resilient Distributed Datasets and their partitions.
  • The driver program converts a user application into smaller execution units known as tasks, which are grouped into stages.
  • Tasks are then executed by the executors, i.e. the worker processes that run individual tasks.

The complete process can be tracked through the cluster manager's user interface. The driver also exposes information about the running Spark application through a web UI at port 4040.
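As a minimal sketch of the driver side of this sequence (the object name and input path are illustrative, and the master URL is assumed to come from spark-submit), the program below is the main() that the driver runs: it creates the SparkContext, builds an RDD lineage that the driver translates into stages, and triggers execution with an action.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of a driver program: main() creates the SparkContext, the
    // transformations below only build the execution graph (lineage), and the
    // action at the end makes the driver split the graph into stages and hand
    // tasks to the executors. Progress is visible in the web UI on port 4040.
    object DriverExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("driver-example")
        val sc   = new SparkContext(conf)

        val lines   = sc.textFile("input.txt")   // illustrative input path
        val lengths = lines.map(_.length)        // transformation: no work yet
        val total   = lengths.reduce(_ + _)      // action: driver schedules the job

        println(s"Total characters: $total")
        sc.stop()
      }
    }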

53.

What are the key components of Spark that it internally requires to execute a job?

Answer»
  • Spark follows a master/slave architecture.
    • Master daemon: the master (driver) process
    • Worker daemon: the slave (executor) processes
  • A Spark cluster has a single master.
  • Any number of slaves work as commodity servers.
  • When we submit a Spark job, it triggers the Spark driver.
  • Through the driver (its SparkContext), the application can:
    • Get the current status of the Spark application
    • Cancel a job
    • Cancel a stage
    • Run a job synchronously
    • Run a job asynchronously
    • Access persistent RDDs
    • Un-persist RDDs
    • Control dynamic allocation programmatically
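A hedged sketch of how several of the operations listed above are exposed through the SparkContext (method availability can vary by Spark version; the job-group name and sample RDD are illustrative, and the master URL is assumed to come from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("context-ops"))

    // Current status of the application: ids of the jobs that are active.
    val activeJobs = sc.statusTracker.getActiveJobIds()

    // Persist an RDD, inspect the persistent RDDs, then un-persist it.
    val cached = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
    cached.count()                       // materialize the cache
    println(sc.getPersistentRDDs.size)   // access persistent RDDs by id
    cached.unpersist()

    // Cancel work: either a named job group or everything.
    sc.setJobGroup("example-group", "illustrative group of jobs")
    sc.cancelJobGroup("example-group")
    sc.cancelAllJobs()

    sc.stop()
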
54.

What does Spark SQL do, and how does it benefit programmers interacting with databases? What is the syntax for creating a SQLContext?

Answer»

Spark SQL provides a programmatic abstraction in the form of DataFrames and Datasets and can work as a distributed SQL query engine. Spark SQL simplifies interaction with large amounts of data through the DataFrame and Dataset APIs.

  • Spark SQL provides relational processing along with Spark's functional programming.
  • It supports querying data using SQL and the Hive Query Language.
  • It includes the Data Source API, the DataFrame API, an interpreter and optimizer, and the SQL service.
  • Spark SQL also provides a newer API called Dataset, which has the capabilities of both DataFrames and core RDDs.
  • Spark SQL is well optimized for SQL query-based operations on flat files and JSON.
  • Spark SQL supports a variety of languages: Java, Scala, Python, and R.
  • Code snippet: val sqlContext = new SQLContext(sc), where sc is an existing SparkContext.
  • A DataFrame can be created from any of the following sources (see the sketch below):
    • Structured data files
    • Tables in Hive
    • External databases
    • Existing RDDs

Spark SQL plays a vital role in optimization through the Catalyst optimizer, and it also supports UDFs, built-in functions, and aggregate functions.
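A minimal sketch tying the above together: creating the SQLContext from an existing SparkContext, building a DataFrame from a structured (JSON) file, and querying it with SQL. The file path, view name, and column names are illustrative assumptions, and in recent Spark versions the SparkSession entry point is generally preferred over SQLContext.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc         = new SparkContext(new SparkConf().setAppName("spark-sql-demo"))
    val sqlContext = new SQLContext(sc)   // the creation syntax asked about above

    // Build a DataFrame from a structured data file -- one of the four
    // sources listed above. The path and schema are illustrative.
    val people = sqlContext.read.json("people.json")

    // Query through plain SQL; the Catalyst optimizer produces the
    // physical plan for both SQL and the DataFrame API.
    people.createOrReplaceTempView("people")
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    sc.stop()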

55.

What are the benefits of using Spark Streaming for real-time processing instead of other frameworks and tools?

Answer»

Spark Streaming provides a micro-batch-oriented stream processing engine. Spark can ingest data from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and process it using complex algorithms expressed with high-level functions like map, reduce, join, and window.

Below are the other key benefits that Spark Streaming provides:

  • Spark Streaming is one of the features of Spark used to process real-time data efficiently.
  • Spark Streaming can be implemented with Kafka and the ZooKeeper messaging APIs, which again provide a fault-tolerant messaging cluster.
  • It provides high-throughput and fault-tolerant stream processing.
  • It provides the DStream data structure, which is basically a stream of RDDs, to process real-time data.
  • Spark Streaming fits scenarios such as Kafka-to-database or Kafka-to-data-science-model pipelines.

Spark Streaming works on batches: it receives an input data stream and divides it into micro-batches, which are then processed by the Spark engine to generate the final stream of results in batches.

[Diagram: Spark Streaming workflow — input data stream divided into micro-batches, processed by the Spark engine into batches of results.]
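As a minimal sketch of the DStream workflow described above (assuming a text source on a local TCP socket; the host, port, and 5-second batch interval are illustrative), each micro-batch of lines becomes an RDD that is processed with ordinary Spark operations:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batch sketch: the input stream is cut into 5-second batches and
    // each batch is processed as an RDD by the Spark engine.
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Illustrative source: a TCP socket (Kafka, Flume, or Kinesis would need
    // their own connector libraries).
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // emit the per-batch result stream

    ssc.start()             // begin receiving and processing micro-batches
    ssc.awaitTermination()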

56.

How does Spark Core fit into the picture for solving big data use cases?

Answer»

Spark Core offers reduce, collection, and aggregation APIs, along with stream-style and parallel processing primitives, which can easily handle use cases where we are dealing with large volumes of data.

Bullet points are as follows:

  • Spark Core is the distributed execution engine for large-scale parallel and distributed data processing.
  • Spark Core provides real-time processing for large data sets.
  • It handles memory management and fault recovery.
  • It handles scheduling, distributing, and monitoring jobs on a cluster.
  • Spark Core comes with map, flatMap, reduce, reduceByKey, and groupByKey, which handle key-value-pair-based data processing for large data sets.
  • Spark Core also supports aggregation operations.
  • Spark Core supports Java, Scala, and Python.
  • Code snippet (a complete, runnable version appears below): val counts = textReader.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)

Spark is primarily used as a data processing framework; however, it can also be used for data analysis and data science.
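To make the code snippet in the list above self-contained, here is a minimal runnable version; the object name, application name, and input path are illustrative assumptions, and the master URL is assumed to come from spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    // Complete word-count sketch built around the snippet above: flatMap splits
    // each line on commas, map pairs every word with 1, and reduceByKey sums
    // the counts per key across the cluster.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word-count"))

        val textReader = sc.textFile("input.csv")   // illustrative comma-separated input
        val counts = textReader.flatMap(line => line.split(","))
                               .map(word => (word, 1))
                               .reduceByKey(_ + _)

        counts.take(10).foreach(println)            // action that triggers the job
        sc.stop()
      }
    }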

57.

What features does Spark provide that are not available in MapReduce?

Answer»

The Spark API provides various key features that are very useful for real-time processing; most of these features have well-supported libraries along with real-time processing capability.

Below are the key features provided by the Spark framework:

  • Spark Core
  • Spark Streaming
  • Spark SQL
  • GraphX
  • MLlib

Spark Core is the heart of the Spark framework and has strong support for functional programming in languages like Java, Scala, and Python; however, most new releases come to the JVM languages first and are only later introduced for Python.
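As a rough illustration, each of the components listed above ships as its own library. In an sbt build they would be pulled in roughly as follows; the Spark version shown is a placeholder, not a value from the original answer.

    // build.sbt sketch: one artifact per Spark component listed above.
    // The version is an illustrative placeholder.
    val sparkVersion = "3.5.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % sparkVersion, // Spark Core
      "org.apache.spark" %% "spark-streaming" % sparkVersion, // Spark Streaming
      "org.apache.spark" %% "spark-sql"       % sparkVersion, // Spark SQL / DataFrames
      "org.apache.spark" %% "spark-graphx"    % sparkVersion, // GraphX
      "org.apache.spark" %% "spark-mllib"     % sparkVersion  // MLlib
    )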