Explore topic-wise interview questions and answers on Apache Spark.

This section includes curated interview questions with answers to sharpen your knowledge and support interview preparation. Choose a question below to get started.

1.

How Sparksql Is Different From Hql And Sql?

Answer»

Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join a SQL table and an HQL table.
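A minimal sketch of such a join in Scala, assuming Spark was built with Hive support and that a Hive (HQL) table named sales_hql and a local products.json file exist (both names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object SqlHqlJoin {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark SQL read Hive (HQL) tables from the metastore
    val spark = SparkSession.builder()
      .appName("SqlHqlJoin")
      .enableHiveSupport()
      .getOrCreate()

    // A "SQL table": any DataFrame registered as a temporary view
    spark.read.json("products.json").createOrReplaceTempView("products")

    // Join the temporary SQL view with an existing Hive table in one query
    val joined = spark.sql(
      """SELECT p.name, s.amount
        |FROM products p
        |JOIN sales_hql s ON p.id = s.product_id""".stripMargin)

    joined.show()
    spark.stop()
  }
}
```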

2.

What Are Benefits Of Spark Over Mapreduce?

Answer»

Due to the availability of in-memory processing, Spark performs processing around 10-100x faster than Hadoop MapReduce, which relies on persistent storage for all of its data processing tasks.

  • Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries; Hadoop only supports batch processing.
  • Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
  • Spark can perform computations multiple times on the same dataset. This is called iterative computation, and there is no built-in support for it in Hadoop (see the sketch below).
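As a small illustration of the iterative-computation point above, the Scala sketch below caches an RDD once and scans it several times; ratings.txt is a hypothetical input file with one numeric value per line:

```scala
import org.apache.spark.sql.SparkSession

object IterativeCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IterativeCaching").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Parse once, keep in memory; subsequent passes reuse the cached data
    val ratings = sc.textFile("ratings.txt").map(_.toDouble).cache()

    // Several passes over the same dataset (iterative computation)
    var threshold = 1.0
    for (_ <- 1 to 5) {
      val aboveThreshold = ratings.filter(_ >= threshold).count()
      println(s"ratings >= $threshold: $aboveThreshold")
      threshold += 1.0
    }
    spark.stop()
  }
}
```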

3.

What Is A “parquet” In Spark?

Answer»

“Parquet” is a columnar format file supported by many data processing systems. Spark SQL performs both read and write operations with “Parquet” files.
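A minimal Scala sketch of reading and writing Parquet with Spark SQL (the people.parquet path is just an example):

```scala
import org.apache.spark.sql.SparkSession

object ParquetReadWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetReadWrite").master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a small DataFrame out in the columnar Parquet format
    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
    people.write.mode("overwrite").parquet("people.parquet")

    // Read it back; the schema is stored inside the Parquet files themselves
    val loaded = spark.read.parquet("people.parquet")
    loaded.filter($"age" > 40).show()

    spark.stop()
  }
}
```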

4.

What Is Hive On Spark?

Answer»

Hive is a component of Hortonworks’ Data Platform (HDP). Hive provides an SQL-like interface to data stored in the HDP. With Hive on Spark, users automatically get the complete set of Hive’s rich features, including any new features that Hive might introduce in the future.

The main task in implementing the Spark execution engine for Hive lies in query planning, where the Hive operator plans produced by the semantic analyzer are translated into task plans that Spark can execute. It also includes query execution, where the generated Spark plan actually gets executed on the Spark cluster.

5.

What Is Spark?

Answer»

Spark is a parallel data processing framework. It allows developers to build fast, unified big data applications that combine batch, streaming, and interactive analytics.

6.

List The Functions Of Spark Sql.

Answer»

Spark SQL is capable of:

  • Loading data from a variety of structured sources.
  • Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), for instance business intelligence tools like Tableau.
  • Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
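A short Scala sketch touching all three capabilities, assuming a hypothetical users.json file with name and country fields:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlFeatures {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlFeatures").master("local[*]").getOrCreate()

    // 1. Load data from a structured source (JSON here)
    val users = spark.read.json("users.json")
    users.createOrReplaceTempView("users")

    // 2. Query it with plain SQL from inside the program
    spark.sql("SELECT country, COUNT(*) AS n FROM users GROUP BY country").show()

    // 3. Expose a custom Scala function to SQL
    spark.udf.register("initials", (name: String) => name.split(" ").map(_.head).mkString)
    spark.sql("SELECT name, initials(name) AS short FROM users").show()

    spark.stop()
  }
}
```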

7.

What Is A Parquet File?

Answer»

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it one of the best big data analytics formats so far.

8.

What Is Spark Sql?

Answer»

Spark SQL, originally known as Shark, is a novel module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports an altogether different RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in the row. It is similar to a table in a relational database.
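SchemaRDD was later renamed DataFrame, but the idea is the same: row objects plus a schema object describing each column's type. A minimal Scala sketch of that structure:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object SchemaRddStyle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaRddStyle").master("local[*]").getOrCreate()

    // Row objects carry the data; the schema object defines each column's type
    val rows = spark.sparkContext.parallelize(Seq(Row(1, "laptop", 899.0), Row(2, "phone", 499.0)))
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("product", StringType, nullable = false),
      StructField("price", DoubleType, nullable = false)))

    // Combining the two gives a table-like, queryable structure
    val df = spark.createDataFrame(rows, schema)
    df.createOrReplaceTempView("products")
    spark.sql("SELECT product FROM products WHERE price > 500").show()

    spark.stop()
  }
}
```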

9.

Can We Do Real-time Processing Using Spark Sql?

Answer»

Not directly, but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.
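A minimal Scala sketch of registering an existing RDD as a SQL table (temporary view) and querying it:

```scala
import org.apache.spark.sql.SparkSession

object RddAsSqlTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddAsSqlTable").master("local[*]").getOrCreate()
    import spark.implicits._

    // An existing RDD of (user, clicks) pairs
    val rdd = spark.sparkContext.parallelize(Seq(("alice", 3), ("bob", 7), ("carol", 2)))

    // Convert it to a DataFrame and register it as a SQL table (temporary view)
    rdd.toDF("user", "clicks").createOrReplaceTempView("clicks")

    // Trigger SQL queries on top of it
    spark.sql("SELECT user FROM clicks WHERE clicks > 2").show()

    spark.stop()
  }
}
```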

10.

What Is “spark Sql”?

Answer»

Spark SQL is a Spark interface to work with structured as well as semi-structured data. It has the capability to load data from multiple structured sources like text files, JSON files, and Parquet files, among others. Spark SQL provides a special type of RDD called SchemaRDD. These are row objects, where each object represents a record.

11.

Name A Few Commonly Used Spark Ecosystems.

Answer»

Spark SQL (Shark)

Spark Streaming

GraphX

MLlib

SparkR

12.

Explain About The Common Workflow Of A Spark Program

Answer»
  • The foremost step in a Spark program involves creating input RDDs from external data.
  • Use various RDD transformations like filter() to create new transformed RDDs based on the business logic.
  • persist() any intermediate RDDs which might have to be reused in the future.
  • Launch various RDD actions like first() and count() to begin parallel computation, which will then be optimized and executed by Spark (a sketch of this workflow is shown below).
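A minimal Scala sketch of this workflow, assuming a hypothetical access.log input file:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WorkflowSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1. Create an input RDD from external data
    val lines = sc.textFile("access.log")

    // 2. Transform it according to the business logic
    val errors = lines.filter(_.contains("ERROR"))

    // 3. Persist an intermediate RDD that will be reused
    errors.persist(StorageLevel.MEMORY_ONLY)

    // 4. Launch actions to kick off the (lazy) parallel computation
    println(s"total errors: ${errors.count()}")
    println(s"first error:  ${errors.first()}")

    spark.stop()
  }
}
```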

13.

What Is The Default Level Of Parallelism In Apache Spark?

Answer»

If the user does not explicitly specify it, then the number of partitions is considered the default level of parallelism in Apache Spark.
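A small Scala sketch that inspects the default level of parallelism and shows that parallelize() falls back to it when no partition count is given:

```scala
import org.apache.spark.sql.SparkSession

object DefaultParallelism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DefaultParallelism").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Number of partitions used when none is specified explicitly
    println(s"default parallelism: ${sc.defaultParallelism}")

    // parallelize() with no partition argument falls back to that default
    val rdd = sc.parallelize(1 to 100)
    println(s"partitions actually used: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```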

14.

Is It Necessary To Start Hadoop To Run Any Apache Spark Application?

Answer»

Starting Hadoop is not mandatory to run any Spark application. Apache Spark has no separate storage of its own; it can use Hadoop HDFS, but this is not mandatory. The data can be stored in the local file system, loaded from the local file system, and processed there.
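A minimal Scala sketch that runs entirely against the local file system (the file:// paths are hypothetical), with no Hadoop cluster started:

```scala
import org.apache.spark.sql.SparkSession

object LocalFileOnly {
  def main(args: Array[String]): Unit = {
    // master("local[*]") runs Spark without any Hadoop/HDFS cluster
    val spark = SparkSession.builder().appName("LocalFileOnly").master("local[*]").getOrCreate()

    // The file:// scheme reads from the local file system instead of HDFS
    val lines = spark.sparkContext.textFile("file:///tmp/input.txt")
    val wordCounts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

    // Results can likewise be written back to the local file system
    wordCounts.saveAsTextFile("file:///tmp/word_counts")
    spark.stop()
  }
}
```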

15.

Define A Worker Node.

Answer»

A node that can run the Spark application code in a cluster is called a worker node. A worker node can host more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.

16.

Explain About The Core Components Of A Distributed Spark Application.

Answer»
  • Driver – The process that runs the main() method of the program to create RDDs and perform transformations and actions on them.
  • Executor – The worker processes that run the individual tasks of a Spark job.
  • Cluster Manager – A pluggable component in Spark used to launch executors and drivers. The cluster manager allows Spark to run on top of external managers like Apache Mesos or YARN.

17.

What Is The Difference Between Persist() And Cache()?

Answer»

persist() allows the user to specify the storage level, whereas cache() uses the default storage level.
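A minimal Scala sketch of the difference; MEMORY_AND_DISK_SER is just one example of a storage level that persist() can take:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistVsCache {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PersistVsCache").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val numbers = sc.parallelize(1 to 1000000)

    // cache() keeps the RDD at the default storage level (MEMORY_ONLY for RDDs)
    val cached = numbers.map(_ * 2).cache()

    // persist() lets the caller pick a storage level explicitly
    val persisted = numbers.map(_ * 3).persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(cached.count())
    println(persisted.count())
    spark.stop()
  }
}
```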

18.

Explain About The Popular Use Cases Of Apache Spark

Answer»

Apache Spark is mainly used for:

  • Iterative machine learning.
  • Interactive data analytics and processing.
  • Stream processing.
  • Sensor data processing.

19.

Which One Will You Choose For A Project – Hadoop Mapreduce Or Apache Spark?

Answer»

It is known that Spark makes use of memory instead of network and disk I/O. However, Spark uses a large amount of RAM and requires a dedicated machine to produce effective results. So the decision to use Hadoop or Spark varies dynamically with the requirements of the project and the budget of the organization.

20.

How Spark Uses Hadoop?

Answer»

Spark has its own cluster management for computation and mainly uses Hadoop for storage.

21.

Name A Few Companies That Use Apache Spark In Production.

Answer»

Pinterest, Conviva, Shopify, OpenTable

22.

Explain About The Major Libraries That Constitute The Spark Ecosystem

Answer»
  • Spark MLlib – The machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
  • Spark Streaming – This library is used to process real-time streaming data.
  • Spark GraphX – The Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
  • Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

23.

Explain About The Different Cluster Managers In Apache Spark

Answer»

The 3 different cluster managers supported in Apache Spark are:

  • YARN
  • Apache Mesos – Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
  • Standalone deployments – Well suited for new deployments which only run Spark and are easy to set up.

24.

Explain About Transformations And Actions In The Context Of Rdds.

Answer»

Transformations are functions executed on demand to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter, and reduceByKey.

Actions are the results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
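A minimal Scala sketch showing transformations building up lazily and actions triggering the computation:

```scala
import org.apache.spark.sql.SparkSession

object TransformationsAndActions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TransformationsAndActions").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive", "spark"))

    // Transformations: lazily build new RDDs (nothing runs yet)
    val pairs  = words.map(w => (w, 1))              // map
    val onlyS  = pairs.filter(_._1.startsWith("s"))  // filter
    val counts = onlyS.reduceByKey(_ + _)            // reduceByKey

    // Actions: trigger the computation and bring results to the driver
    println(counts.collect().mkString(", "))         // collect
    println(counts.count())                          // count
    println(counts.first())                          // first

    spark.stop()
  }
}
```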

25.

Most Of The Data Users Know Only Sql And Are Not Good At Programming. Shark Is A Tool, Developed For People Who Are From A Database Background - To Access Scala Mlib Capabilities Through Hive Like Sql Interface. Shark Tool Helps Data Users Run Hive On Spark - Offering Compatibility With Hive Metastore, Queries And Data.

Answer»
  1. Sensor data processing – Apache Spark’s ‘in-memory computing’ works best here, as data is retrieved and combined from different sources.
  2. Spark is preferred over Hadoop for real-time querying of data.
  3. Stream processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.