Spark Interview Questions and Answers

This section presents frequently asked Apache Spark interview questions with detailed answers to sharpen your knowledge and support interview preparation. Choose a question below to get started.

1. How can you achieve machine learning in Spark?

Answer» Spark ships with a robust, scalable machine learning library called MLlib. It aims to make common ML algorithms easy to implement at scale and covers classification, regression, clustering, collaborative filtering, dimensionality reduction, and more. Further details are available in Spark's official documentation: https://spark.apache.org/docs/latest/ml-guide.html
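
For illustration, here is a minimal MLlib sketch using the DataFrame-based org.apache.spark.ml API. The tiny dataset is hypothetical and an existing SparkSession named `spark` is assumed:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

import spark.implicits._   // `spark` is an existing SparkSession (assumed)

// Hypothetical toy data: each row carries a feature vector.
val data = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
).map(Tuple1.apply).toDF("features")

// Cluster the points into two groups and print the learned centers.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```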

2. What API is used for Graph Implementation in Spark?

Answer» Spark provides a powerful API called GraphX that extends Spark RDD to support graphs and graph-based computations. The extended abstraction is the Resilient Distributed Property Graph, a directed multigraph that can have multiple parallel edges, where every edge and vertex carries user-defined properties. Parallel edges represent multiple relationships between the same pair of vertices. GraphX offers operators such as subgraph, mapReduceTriplets, and joinVertices for graph computation, along with a large collection of graph builders and algorithms that simplify graph analytics tasks.
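
A minimal GraphX sketch, assuming an existing SparkContext `sc`; the users and "follows" relationships are hypothetical:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Vertex property = user name, edge property = relationship label.
val vertices: RDD[(Long, String)] =
  sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(1L, 3L, "follows")))

val graph = Graph(vertices, edges)

// Out-degree: how many users each vertex follows.
graph.outDegrees.collect().foreach(println)

// A subgraph containing only the edges that point at Carol (id 3).
val toCarol = graph.subgraph(epred = triplet => triplet.dstId == 3L)
println(toCarol.edges.count())
```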

3. Define Piping in Spark.

Answer» Apache Spark provides the pipe() method on RDDs, which makes it possible to compose different parts of a job using any language that can read from and write to the UNIX standard streams. With pipe(), you can write an RDD transformation that passes each element of the RDD, read as a String, to an external process; the manipulated results come back as Strings and form a new RDD.
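
A minimal pipe() sketch, assuming an existing SparkContext `sc` and that the UNIX `tr` command is available on the worker nodes:

```scala
// Each element of the RDD is written to the external process's stdin as one line;
// each line the process writes to stdout becomes an element of the result RDD.
val words = sc.parallelize(Seq("spark", "pipe", "example"))
val upper = words.pipe(Seq("tr", "a-z", "A-Z"))
upper.collect().foreach(println)   // SPARK, PIPE, EXAMPLE
```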

4. How is Caching relevant in Spark Streaming?

Answer» Spark Streaming divides the data of a stream into batches of X seconds, called DStreams. A DStream lets developers cache the data in memory, which is very useful when the same DStream data feeds multiple computations. Caching is done with the cache() method or with the persist() method using an appropriate persistence level. For input streams that receive data over the network, such as Kafka or Flume, the default persistence level replicates the data to two nodes to achieve fault tolerance.
The main advantages of caching are (a short caching sketch follows this list):
- Cost efficiency: data that feeds several computations is not recomputed or re-read from the source each time.
- Lower latency: cached batches are served from memory, which speeds up subsequent operations on them.
- Fault tolerance: replicated persistence levels allow lost partitions to be recovered without going back to the source.
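
As mentioned above, here is a minimal Spark Streaming sketch that persists a DStream which feeds two separate computations. The socket source on localhost:9999 is hypothetical and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))   // 10-second batches (DStreams)

// Hypothetical text source; the default level for receiver-based sources
// already replicates the received data on two nodes.
val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY)   // cached because it is used twice below

words.count().print()
words.map(word => (word, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```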

5. How are automatic clean-ups triggered in Spark for handling the accumulated metadata?

Answer» Clean-up tasks can be triggered automatically either by setting the spark.cleaner.ttl parameter or by dividing long-running jobs into batches and writing the intermediate results to disk.
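
A minimal sketch of setting the parameter. Note that spark.cleaner.ttl is a legacy setting expressed in seconds; recent Spark versions rely on the automatic ContextCleaner instead, so treat this purely as an illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MetadataCleanupSketch")
  .set("spark.cleaner.ttl", "3600")   // forget metadata older than one hour (legacy knob)

val sc = new SparkContext(conf)
```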

6. What are Sparse Vectors? How are they different from dense vectors?

Answer» A sparse vector consists of two parallel arrays, one storing indices and the other storing values, so that only the non-zero entries are kept and space is saved. A dense vector, in contrast, stores every element explicitly, zeros included. For example:

    val sparseVec: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))
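
To make the contrast explicit, here is the same 5-element vector written both ways (a sketch using the org.apache.spark.ml.linalg API):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Dense: every element is stored, zeros included.
val dense: Vector = Vectors.dense(1.0, 0.0, 0.0, 0.0, 2.0)

// Sparse: size 5, with only the non-zero entries at indices 0 and 4 kept.
val sparse: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))

println(dense)    // [1.0,0.0,0.0,0.0,2.0]
println(sparse)   // (5,[0,4],[1.0,2.0])
```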

7. Can Apache Spark be used along with Hadoop? If yes, then how?

Answer» Yes! A key feature of Spark is its compatibility with Hadoop. The combination is powerful because Spark's processing capacity can be leveraged on top of the best of Hadoop's YARN and HDFS features. Hadoop can be integrated with Spark in the following ways (a short sketch follows this list):
- HDFS: Spark can read from and write to HDFS, using it as the underlying distributed storage layer.
- YARN: Spark applications can be submitted to YARN, which handles resource allocation across the cluster.
- MapReduce: Spark can run alongside MapReduce workloads on the same cluster, and Spark in MapReduce (SIMR) allows Spark jobs to be launched without administrative rights.
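
A minimal sketch of the HDFS/YARN combination; the HDFS path is hypothetical and the cluster is assumed to run YARN:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkOnHadoopSketch")
  .master("yarn")                                  // YARN allocates the executors
  .getOrCreate()

// Read a file stored on HDFS and count its lines.
val lines = spark.read.textFile("hdfs:///data/events.txt")
println(lines.count())
```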

8. Differentiate between Spark Datasets, DataFrames and RDDs.

Answer» The three abstractions differ mainly in typing, optimization, and the level of the API they expose:

| Criteria | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Abstraction | Low-level distributed collection of objects | Data organized into named columns, like a relational table | Extension of DataFrames with strongly typed JVM objects |
| Type safety | Compile-time type safety | Types checked only at runtime | Compile-time type safety |
| Optimization | No built-in optimizer | Optimized by the Catalyst optimizer and Tungsten | Also optimized by Catalyst and Tungsten |
| Serialization | Java or Kryo serialization | Off-heap binary (Tungsten) format | Encoders convert JVM objects to the internal binary format |
| Introduced in | Spark 1.0 | Spark 1.3 | Spark 1.6 |
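
A short sketch showing the same data through all three abstractions; an existing SparkSession `spark` is assumed and the Person records are hypothetical:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

// RDD: a low-level distributed collection of Person objects.
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))

// DataFrame: untyped rows organized into named columns.
val df = rdd.toDF()

// Dataset: a typed view over the same columns.
val ds = df.as[Person]

df.filter($"age" > 30).show()   // column expression, checked at runtime
ds.filter(_.age > 30).show()    // lambda on typed objects, checked at compile time
```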

9. Why do we need broadcast variables in Spark?

Answer» Broadcast variables let developers keep a read-only variable cached on each machine instead of shipping a copy of it with every task. They are used to give every node a copy of a large input dataset efficiently. Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication cost.
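
A minimal broadcast sketch, assuming an existing SparkContext `sc`; the lookup table is hypothetical:

```scala
// Read-only lookup data cached once per node instead of shipped with every task.
val countryNames = Map("IN" -> "India", "US" -> "United States", "FR" -> "France")
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "FR", "US", "IN"))
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
resolved.collect().foreach(println)
```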

10. What are the steps to calculate the executor memory?

Answer» Consider a cluster with the following details:
- Number of nodes = 10
- Number of cores in each node = 15
- RAM of each node = 61 GB

First fix the number of cores per executor: this is the number of concurrent tasks an executor can run in parallel, and the general rule of thumb is 5. Then:
- Executors per node = cores per node / cores per executor = 15 / 5 = 3
- Total executors = number of nodes * executors per node = 10 * 3 = 30 executors per Spark job

Finally, the executor memory is the node's RAM divided by the number of executors on that node: 61 GB / 3 ≈ 20 GB per executor (in practice a small fraction of this is set aside as memory overhead).
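
A minimal sketch of how the derived numbers would be applied; the same values are commonly passed to spark-submit as --num-executors, --executor-cores and --executor-memory:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorSizingSketch")
  .set("spark.executor.instances", "30")   // 10 nodes * 3 executors per node
  .set("spark.executor.cores", "5")        // concurrent tasks per executor
  .set("spark.executor.memory", "19g")     // roughly 20 GB per executor minus overhead
```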

11. What are the different persistence levels in Apache Spark?

Answer» Spark automatically persists the intermediary data from some shuffle operations. However, it is recommended to call the persist() method on an RDD that is going to be reused. Spark offers different persistence levels for storing RDDs in memory, on disk, or both, with different degrees of replication. The syntax for using a persistence level with the persist() method is:

    df.persist(StorageLevel.<level_value>)

The following table summarizes the available persistence levels:

| Persistence level | Description |
| --- | --- |
| MEMORY_ONLY (default) | Stores the RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are recomputed when needed. |
| MEMORY_AND_DISK | Stores the RDD as deserialized Java objects in memory; partitions that do not fit are spilled to disk and read from there when required. |
| MEMORY_ONLY_SER | Stores the RDD as serialized Java objects (one byte array per partition); more space-efficient but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER | Like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk. |
| DISK_ONLY | Stores the RDD partitions only on disk. |
| OFF_HEAP | Like MEMORY_ONLY_SER, but the data is stored in off-heap memory. |

Each of these levels also has a _2 variant (for example MEMORY_ONLY_2) that replicates every partition on two cluster nodes.
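
A minimal persistence sketch, assuming an existing SparkSession `spark`:

```scala
import org.apache.spark.storage.StorageLevel

val df = spark.range(0, 1000000).toDF("id")

df.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if needed
println(df.count())                        // first action materializes the cache
println(df.count())                        // second action is served from the cache
df.unpersist()
```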

12. What module is used for implementing SQL in Apache Spark?

Answer» Spark provides a powerful module called SparkSQL that combines relational data processing with Spark's functional programming features. Queries can be expressed either in SQL or in the Hive Query Language. The module supports many different data sources and lets developers mix powerful SQL queries with code transformations.
Spark SQL supports the usage of structured and semi-structured data in the following ways (see the sketch after this list):
- DataFrames and Datasets can be created from existing RDDs, Hive tables, or structured file formats such as JSON, Parquet, and ORC.
- Data can be queried with SQL from inside a Spark program (spark.sql(...)) or from external tools connected through JDBC/ODBC.
- Query results integrate directly with the rest of the Spark program, so they can be processed further with DataFrame, Dataset, or RDD operations.
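
A minimal SparkSQL sketch, assuming an existing SparkSession `spark`; the employee data is hypothetical:

```scala
import spark.implicits._

// Register an in-memory DataFrame as a temporary SQL view.
val employees = Seq(
  ("Alice", "Engineering", 95000.0),
  ("Bob", "Sales", 60000.0)
).toDF("name", "department", "salary")
employees.createOrReplaceTempView("employees")

// The same aggregation expressed as SQL and as DataFrame code.
spark.sql("SELECT department, avg(salary) AS avg_salary FROM employees GROUP BY department").show()
employees.groupBy("department").avg("salary").show()
```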

13. What is SchemaRDD in Spark RDD?

Answer» A SchemaRDD is an RDD of row objects (wrappers around arrays of integers, strings, and so on) that also carries schema information describing the data type of each column. SchemaRDDs were designed to make developers' lives easier when debugging code and running unit tests against the SparkSQL modules. They describe the structure of the RDD's data, similar to the schema of a relational database, and they provide the basic functionality of common RDDs along with some of SparkSQL's relational query interfaces. Consider an example: if you have an RDD named Person that represents a person's data, then the SchemaRDD describes what data each row of the Person RDD contains; if Person has attributes like name and age, those appear in the schema.
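
For illustration, a minimal sketch of the same idea using the current API, where this role is played by the DataFrame schema; an existing SparkSession `spark` is assumed and the Person data is hypothetical:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDF()

// The schema describes what each row of the Person data represents.
people.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)
```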

14. How can the data transfers be minimized while working with Spark?

Answer» Data transfers correspond to the process of shuffling. Minimizing these transfers results in faster and more reliable Spark applications. There are various ways in which they can be minimized:
- Use broadcast variables to distribute large read-only lookup data to every node efficiently.
- Use accumulators to aggregate values in parallel during execution instead of sending data back and forth.
- Avoid, or at least minimize, shuffle-heavy operations such as repartition and the ByKey operations like groupByKey, preferring map-side combining where possible (see the sketch after this list).
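
As an example of avoiding shuffle-heavy operations, here is a small sketch that contrasts groupByKey with reduceByKey, assuming an existing SparkContext `sc`:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey ships every value across the network before the sum is computed.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values on the map side first, so much less data is shuffled.
val viaReduce = pairs.reduceByKey(_ + _)

viaReduce.collect().foreach(println)
```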

15. What are some of the demerits of using Spark in applications?

Answer» Despite Spark being a powerful data processing engine, there are certain demerits to using Apache Spark in applications. Some of them are:
- Spark's heavy reliance on in-memory processing makes it expensive: it needs a lot of RAM, which raises the cost of running large clusters.
- Spark has no file management system of its own, so it depends on other platforms such as HDFS or cloud storage.
- Performance degrades when a job has to deal with a very large number of small files.
- Developers need to be careful to spread work evenly across partitions and nodes; poorly partitioned jobs can overload a single node.

16. What do you understand by worker node?

Answer» Worker nodes are the nodes that run the Spark application in a cluster. A worker node is like a slave node: it receives work from its master node and actually executes it. The Spark driver program coordinates with the cluster manager to launch executors on the worker nodes and then sends tasks to those executors for execution. The worker nodes process the data and report the resources they use back to the master. Based on resource availability, the master decides how many resources to allocate and schedules the tasks on the worker nodes accordingly.

17. What are the functions of SparkCore?

Answer» SparkCore is the main engine meant for large-scale distributed and parallel data processing. It consists of the distributed execution engine and offers APIs in Java, Python, and Scala for developing distributed ETL applications. Its main functions are (a short sketch follows this list):
- Scheduling, distributing, and monitoring jobs on a cluster
- Memory management and fault recovery
- Task dispatching and interaction with storage systems
- Basic I/O functionality
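
A minimal SparkCore sketch: creating a SparkContext and running RDD transformations and actions on it (local master chosen purely for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SparkCoreSketch").setMaster("local[*]")
val sc = new SparkContext(conf)

val numbers = sc.parallelize(1 to 100)
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(s"Sum of squares: $sumOfSquares")

sc.stop()
```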

18. Define Executor Memory in Spark

Answer» Applications developed in Spark have a fixed core count and a fixed heap size defined for their executors. The heap size is the Spark executor memory, controlled through the spark.executor.memory property of the --executor-memory flag. Every Spark application has one executor on each worker node it runs on, and the executor memory is a measure of how much of the worker node's memory the application utilizes.

19. Define Spark DataFrames.

Answer» Spark DataFrames are distributed collections of data organized into named columns, similar to a SQL table. A DataFrame is equivalent to a table in a relational database and is heavily optimized for big data operations.

20. What can you say about Spark Datasets?

Answer» Spark Datasets are the SparkSQL data structures that combine the benefits of RDDs (such as data manipulation through lambda functions) with the Spark SQL optimized execution engine. Datasets were introduced in Spark 1.6.
Datasets have the following features:
- Optimized queries: Datasets benefit from the Catalyst query optimizer and the Tungsten execution engine.
- Compile-time type safety: errors in the query logic are caught at compile time rather than at runtime.
- Encoders: Datasets use encoders to convert JVM objects to and from Spark's efficient internal binary format.
- Interoperability: Datasets can be converted to and from DataFrames and RDDs.

21. Write a Spark program to check if a given keyword exists in a huge text file or not?

Answer» Map each line to 1 if it contains the keyword and 0 otherwise, then reduce with addition (PySpark, using the existing `sparkContext`):

    def keywordExists(line):
        if line.find("my_keyword") > -1:
            return 1
        return 0

    lines = sparkContext.textFile("test_file.txt")
    isExist = lines.map(keywordExists)
    total = isExist.reduce(lambda a, b: a + b)
    print("Found" if total > 0 else "Not Found")

22. What is Spark Streaming and how is it implemented in Spark?

Answer» Spark Streaming is one of the most important features provided by Spark. It is a Spark API extension that supports stream processing of data from different sources. Data from sources such as Kafka, Flume, or TCP sockets is ingested and divided into small micro-batches (DStreams), which are then processed by Spark's engine; the results can be pushed out to file systems, databases, or live dashboards.

23. Under what scenarios do you use Client and Cluster modes for deployment?

Answer»
- Client mode is used when the machine submitting the job is close to (or inside) the cluster and can stay connected for the whole run. The driver runs on the submitting machine, which makes it convenient for interactive work and debugging, for example with spark-shell.
- Cluster mode is preferred for production jobs, especially when the job is submitted from a machine far from the worker nodes. The driver runs inside the cluster, so the application keeps running even if the submitting machine disconnects, and network latency between the driver and the executors is reduced.

24. What is the working of DAG in Spark?

Answer» DAG stands for Directed Acyclic Graph, a graph with a finite set of vertices and edges. The vertices represent RDDs and the edges represent the operations to be performed on those RDDs in sequence. The DAG that gets created is submitted to the DAG Scheduler, which splits the graph into stages of tasks based on the transformations applied to the data; the stage view shows the details of the RDDs belonging to that stage. In short, the user's transformations build up the DAG, the DAG Scheduler splits it into stages at shuffle boundaries, and the tasks of each stage are then handed to the executors through the cluster manager.
Each RDD keeps a pointer to one or more parent RDDs along with metadata about its relationship with the parent. For example, for the operation val childB = parentA.map() on an RDD, childB keeps track of its parent parentA; this chain of dependencies is called the RDD lineage.
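
A small sketch of lineage in practice, assuming an existing SparkContext `sc`; toDebugString prints the lineage that the DAG scheduler turns into stages:

```scala
val parentA = sc.parallelize(1 to 10)
val childB  = parentA.map(_ * 2)      // childB records parentA as its parent
val result  = childB.filter(_ > 10)

println(result.toDebugString)         // shows the chain filter <- map <- parallelize
```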

25. Explain the working of Spark with the help of its architecture.

Answer» Spark applications run as independent processes that are coordinated by the driver program through a SparkSession object. The cluster manager, Spark's resource manager entity, assigns the tasks of running the Spark job to the worker nodes, following the principle of one task per partition. Iterative algorithms benefit from caching datasets across iterations, since the same operations are applied to the data repeatedly. Every task applies its unit of work to the dataset within its partition and produces a new partitioned dataset. The results are sent back to the driver application for further processing or for storing the data on disk.

26. How is Apache Spark different from MapReduce?

Answer» The main differences are summarized below:

| Criteria | Apache Spark | MapReduce |
| --- | --- | --- |
| Processing model | Processes data in memory, spilling to disk only when needed | Reads from and writes to disk between every map and reduce stage |
| Speed | Generally much faster, especially for iterative workloads | Slower because of repeated disk I/O |
| Supported workloads | Batch, interactive, iterative, and streaming processing | Batch processing only |
| Ease of use | High-level APIs in Scala, Java, Python, and R, plus SQL | Lower-level Java API; jobs are comparatively hard to write |
| Caching | Can cache data in memory and reuse it across operations | No built-in caching of intermediate results |