1.

What are the industrial benefits of PySpark?

Answer»

These days, almost every industry uses big data to evaluate where it stands and to grow. When you hear the term big data, Apache Spark comes to mind. The following are industry benefits of using PySpark, the Python interface to Spark:

  • Media streaming: Spark can be used for real-time streaming to provide personalized recommendations to subscribers. Netflix, for example, uses Apache Spark to process around 450 billion events every day that flow to its server-side apps.
  • Finance: Banks use Spark to access and analyze social media profiles and gain insights into which strategies would help them make the right decisions regarding customer segmentation, credit risk assessment, early fraud detection, etc.
  • Healthcare: Providers use Spark to analyze patients' past records and identify which health issues the patients might face after their discharge. Spark is also used in genome sequencing to reduce the time required for processing genome data.
  • Travel Industry: Companies like TripAdvisor use Spark to help users plan the perfect trip and to provide personalized recommendations to travel enthusiasts by comparing data and reviews from hundreds of websites regarding places, hotels, etc.
  • Retail and e-commerce: This is an important industry domain that requires big data analysis for targeted advertising. Companies like Alibaba run Spark jobs that analyze petabytes of data to enhance customer experience, provide targeted offers and sales, and optimize overall performance.
2.

What is PySpark UDF?

Answer»

UDF stands for User-Defined Function. In PySpark, a UDF is created by writing a Python function, wrapping it with PySpark SQL’s udf() method, and using it on a DataFrame or in SQL. UDFs are generally created when the required functionality is not available in PySpark’s library and we have to apply our own logic to the data. A UDF can be reused in any number of SQL expressions or DataFrames.
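The wrapping step described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names capitalize_words and demo_udf, the sample rows, and the view name people are all made up for this example, and demo_udf assumes an existing SparkSession is passed in.

```python
def capitalize_words(text):
    """Plain Python logic: upper-case the first letter of each word."""
    if text is None:  # SQL NULL values reach a Python UDF as None
        return None
    return " ".join(word.capitalize() for word in text.split(" "))


def demo_udf(spark):
    """Apply capitalize_words as a UDF; `spark` is an existing SparkSession."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Wrap the Python function so it can be used in DataFrame expressions.
    capitalize_udf = udf(capitalize_words, StringType())
    df = spark.createDataFrame([(1, "john doe"), (2, "jane roe")], ["id", "name"])
    df.select(col("id"), capitalize_udf(col("name")).alias("name")).show()

    # Register the same function under a name so SQL queries can call it.
    spark.udf.register("capitalize_words", capitalize_words, StringType())
    df.createOrReplaceTempView("people")  # hypothetical view name
    spark.sql("SELECT id, capitalize_words(name) AS name FROM people").show()
```

Keeping the logic in an ordinary Python function makes it easy to unit-test without a cluster; only demo_udf needs Spark.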

3.

What are the types of PySpark’s shared variables and why are they useful?

Answer»

Whenever PySpark performs an operation such as filter(), map() or reduce(), the function runs on a remote node using copies of the variables shipped with each task. These variables are not reusable and cannot be shared across different tasks because they are not returned to the driver. To solve the issue of reusability and sharing, PySpark provides shared variables. There are two types of shared variables:

Broadcast variables: These are also known as read-only shared variables and are used for data lookup requirements. These variables are cached and made available on all the cluster nodes so that the tasks can make use of them. The variables are not sent with every task. Instead, they are distributed to the nodes using efficient algorithms that reduce the cost of communication. When we run an RDD job that makes use of broadcast variables, PySpark does the following:

  • The job is broken into stages separated by distributed shuffle boundaries. The actions are executed in those stages.
  • The stages are then broken into tasks.
  • The broadcast variables are broadcast to the tasks that need to use them.

Broadcast variables are created in PySpark by making use of the broadcast(variable) method from the SparkContext class. The syntax for this goes as follows:

broadcastVar = sc.broadcast([10, 11, 22, 31])
broadcastVar.value  # access broadcast variable

An important point of using broadcast variables is that the variables are not sent to the tasks when the broadcast function is called. They will be sent when the variables are first required by the executors.
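A typical lookup use of a broadcast variable might look like the sketch below. The lookup table COUNTRY_NAMES and the helper names are invented for illustration, and demo_broadcast assumes an existing SparkContext is passed in as sc.

```python
# Hypothetical lookup table; any small read-only dataset would do.
COUNTRY_NAMES = {"IN": "India", "US": "United States", "DE": "Germany"}


def expand_code(code, lookup):
    """Map a country code to its full name, leaving unknown codes unchanged."""
    return lookup.get(code, code)


def demo_broadcast(sc):
    """Use a broadcast lookup table inside map(); `sc` is an existing SparkContext."""
    names = sc.broadcast(COUNTRY_NAMES)  # cached on the nodes, not resent per task
    codes = sc.parallelize(["IN", "US", "FR", "DE"])
    # Tasks read the cached copy through .value
    return codes.map(lambda code: expand_code(code, names.value)).collect()
```

With a running context, demo_broadcast(sc) would return the expanded names in the original order, with the unknown code "FR" passed through unchanged.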

Accumulator variables: These variables are called updatable shared variables. They are updated through associative and commutative operations and are used for counter or sum operations. PySpark supports numeric accumulators by default and also allows custom accumulator types to be added. In Spark, accumulators can be named or unnamed:

  • Named Accumulators: These accumulators are visible under the “Accumulator” tab in the Spark web UI as shown in the image below:

Here, we will see the Accumulable section that has the sum of the Accumulator values of the variables modified by the tasks listed in the Accumulator column present in the Tasks table. Note that naming an accumulator is a feature of the Scala and Java APIs; accumulators created from PySpark are unnamed.

  • Unnamed Accumulators: These accumulators are not shown on the Spark web UI page. Where the API supports it, named accumulators are recommended.

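Accumulator variables can be created in PySpark by using SparkContext.accumulator(initial_value) as shown in the example below (longAccumulator() belongs to the Scala and Java APIs and is not available on PySpark's SparkContext):

ac = sc.accumulator(0)
sc.parallelize([2, 23, 1]).foreach(lambda x: ac.add(x))
ac.value  # read the accumulated total on the driver

In the Scala and Java APIs, Spark provides LongAccumulator, DoubleAccumulator and CollectionAccumulator for long, double and collection data respectively. In PySpark, sc.accumulator() handles numeric values such as int and float by default, and other types can be supported by subclassing AccumulatorParam.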

4.

What is SparkSession in Pyspark?

Answer»

SparkSession is the entry point to PySpark and has replaced SparkContext in that role since version 2.0. It acts as the starting point for accessing all PySpark functionality related to RDDs, DataFrames, Datasets, etc. It is also a unified API that replaces SQLContext, StreamingContext, HiveContext and the other contexts.

SparkSession internally creates a SparkContext and a SparkConf based on the details provided to it. A SparkSession is created using the builder pattern (SparkSession.builder).
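The builder pattern mentioned above can be sketched like this. The application name, the local[*] master and the settings in SESSION_CONF are assumptions chosen for a local run, not recommended values.

```python
# Illustrative settings only; the values are assumptions for a local run.
SESSION_CONF = {
    "spark.sql.shuffle.partitions": "4",
    "spark.ui.showConsoleProgress": "false",
}


def build_session(app_name="SparkSessionDemo", master="local[*]", conf=None):
    """Create (or reuse) a SparkSession with the builder pattern."""
    # Imported here so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in (conf or SESSION_CONF).items():
        builder = builder.config(key, value)
    # getOrCreate() returns the active session if one already exists.
    return builder.getOrCreate()
```

After spark = build_session(), the underlying context is available as spark.sparkContext, and spark.stop() releases the resources when the work is done.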

5.

What do you understand about PySpark DataFrames?

Answer»

A PySpark DataFrame is a distributed collection of well-organized data, arranged into named columns, that is equivalent to a table in a relational database. PySpark DataFrames are better optimized than data frames in R or Python. They can be created from different sources like Hive tables, structured data files, existing RDDs, external databases, etc. as shown in the image below:

The data in a PySpark DataFrame is distributed across different machines in the cluster, and the operations performed on it run in parallel on all the machines. DataFrames can handle large collections of structured or semi-structured data, up to the range of petabytes.
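As a small illustration of named columns and column-based operations, the sketch below builds a DataFrame from local rows. The EMPLOYEES data and column names are invented for this example, and demo_dataframe assumes an existing SparkSession is passed in.

```python
# Hypothetical sample rows: (name, department, salary)
EMPLOYEES = [
    ("Asha", "Engineering", 5200),
    ("Ben", "Engineering", 4800),
    ("Chen", "Sales", 3900),
]
COLUMNS = ["name", "department", "salary"]


def demo_dataframe(spark):
    """Build a DataFrame from local rows and run two column-based operations."""
    from pyspark.sql import functions as F

    df = spark.createDataFrame(EMPLOYEES, COLUMNS)
    df.printSchema()  # column names come from COLUMNS, types are inferred

    # Filter rows and project columns, much like a SQL WHERE / SELECT.
    high_paid = df.filter(F.col("salary") > 4000).select("name", "salary")
    # Aggregate per department, much like a SQL GROUP BY.
    avg_by_dept = df.groupBy("department").agg(F.avg("salary").alias("avg_salary"))
    return high_paid.collect(), avg_by_dept.collect()
```

The same createDataFrame() call also accepts an existing RDD, and spark.read offers readers for files and external sources.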

6.

Is PySpark faster than pandas?

Answer»

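PySpark supports parallel execution of statements in a distributed environment, i.e. on different cores and different machines, which pandas does not. This is why PySpark is faster than pandas on large datasets. For small datasets that fit comfortably in the memory of one machine, pandas can be the faster choice because it avoids the overhead of distributing the work.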

7.

What are the advantages of PySpark RDD?

Answer»

PySpark RDDs have the following advantages:

  • In-Memory Processing: PySpark’s RDDs load data from disk into memory. RDDs can also be persisted in memory so that computations can be reused.
  • Immutability: RDDs are immutable, which means that once created, they cannot be modified. Applying a transformation to an RDD creates a new RDD.
  • Fault Tolerance: RDDs are fault-tolerant. Whenever an operation fails, the lost partitions are recomputed automatically from the lineage of the RDD. This results in seamless execution of PySpark applications.
  • Lazy Evaluation: PySpark transformations are not performed as soon as they are encountered. They are recorded in the DAG and evaluated only when the first RDD action is found.
  • Partitioning: Whenever an RDD is created from any data, its elements are partitioned across the available cores by default.
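Several of these properties (immutability, lazy evaluation, in-memory persistence and partitioning) can be seen together in a short sketch. The helper names are made up for illustration, and demo_rdd assumes an existing SparkContext is passed in as sc.

```python
def square(x):
    return x * x


def is_even(x):
    return x % 2 == 0


def demo_rdd(sc):
    """Show laziness, caching and partitioning; `sc` is an existing SparkContext."""
    numbers = sc.parallelize(range(10), 4)   # explicitly request 4 partitions
    squares = numbers.map(square)            # transformation: new RDD, nothing runs yet
    even_squares = squares.filter(is_even)   # still only recorded in the lineage
    even_squares.cache()                     # mark the RDD for in-memory persistence
    total = even_squares.count()             # first action: triggers the computation
    values = even_squares.collect()          # second action: can reuse the cached data
    return numbers.getNumPartitions(), total, values
```

Note that numbers and squares are never changed; each transformation returns a new RDD that refers back to its parent.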
8.

What are the different cluster manager types supported by PySpark?

Answer»

A cluster manager is a cluster-mode platform that helps run Spark by providing resources to the worker nodes based on their requirements.

The above figure shows the position of the cluster manager in the Spark ecosystem. Consider a master node and multiple worker nodes present in the cluster. With the help of the cluster manager, the master node provides the worker nodes with resources such as memory and processor allocation, depending on the nodes' requirements.

PySpark supports the following cluster manager types:

  • Standalone – This is a simple cluster manager that is included with Spark.
  • Apache Mesos – This manager can run Hadoop MapReduce and PySpark apps.
  • Hadoop YARN – This is the resource manager used in Hadoop 2.
  • Kubernetes – This is an open-source cluster manager that helps in automated deployment, scaling and automatic management of containerized apps.
  • local – This is simply a mode for running Spark applications on laptops/desktops.
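In practice, the cluster manager is selected through the master URL given to Spark, for example via the --master option of spark-submit. The helper below is a sketch of the common URL forms; the function name master_url is made up for illustration, host names are placeholders, and the default ports are the conventional ones, which a given cluster may override.

```python
def master_url(manager, host=None, port=None):
    """Return the Spark master URL string for a cluster manager type."""
    if manager == "local":
        return "local[*]"  # run locally, using all available cores
    if manager == "yarn":
        return "yarn"      # cluster location is read from the Hadoop configuration
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if manager == "kubernetes":
        return f"k8s://https://{host}:{port or 443}"
    raise ValueError(f"unknown cluster manager: {manager}")
```

For example, master_url("standalone", "master.example.com") yields a string that could be passed to SparkSession.builder.master() or to spark-submit.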
9.

Does PySpark provide a machine learning API?

Answer»

Similar to Spark, PySpark provides a machine learning API which is known as MLlib that supports various ML algorithms like:

  • mllib.classification − This supports different methods for binary or multiclass classification and regression analysis like Random Forest, Decision Tree, Naive Bayes etc.
  • mllib.clustering − This is used for solving clustering problems that aim to group subsets of entities with one another depending on similarity.
  • mllib.fpm − FPM stands for Frequent Pattern Mining. This library is used to mine frequent items, subsequences or other structures that are used for analyzing large datasets.
  • mllib.linalg − This is used for solving linear algebra problems.
  • mllib.recommendation − This is used for collaborative filtering and in recommender systems.
  • spark.mllib − This is used for supporting model-based collaborative filtering, where a small set of latent factors is identified using the Alternating Least Squares (ALS) algorithm and then used for predicting missing entries.
  • mllib.regression − This is used for solving problems using regression algorithms that find relationships and variable dependencies.
10.

What are RDDs in PySpark?

Answer»

RDD stands for Resilient Distributed Dataset. RDDs are the elements that run and operate on multiple nodes to perform parallel processing on a cluster. Since RDDs are designed for parallel processing, they are immutable: once we create an RDD, we cannot modify it. RDDs are also fault-tolerant, which means that whenever a failure happens, they can be recovered automatically. Multiple operations can be performed on RDDs to accomplish a task. The operations are of 2 types:

  • Transformation: These operations, when applied to RDDs, result in the creation of a new RDD. Examples of transformation operations are filter, groupBy and map.
    Let us demonstrate a transformation by considering the filter() operation:
from pyspark import SparkContext

sc = SparkContext("local", "Transformation Demo")
words_list = sc.parallelize(["pyspark", "interview", "questions", "at", "interviewbit"])
filtered_words = words_list.filter(lambda x: 'interview' in x)
filtered = filtered_words.collect()
print(filtered)

The above code keeps every element of the list that contains ‘interview’. The output of the above code would be:

['interview', 'interviewbit']
  • Action: These operations instruct Spark to perform some computation on the RDD and return the result to the driver, sending data from the executors to the driver. count(), collect() and take() are some examples.
    Let us demonstrate an action by making use of the count() function:
from pyspark import SparkContext

sc = SparkContext("local", "Action Demo")
words = sc.parallelize(["pyspark", "interview", "questions", "at", "interviewbit"])
counts = words.count()
print("Count of elements in RDD ->", counts)

In this example, we count the number of elements in the RDD. The output of this code is:

Count of elements in RDD -> 5
11.

What are PySpark serializers?

Answer»

Serialization plays a role in performance tuning on Spark. Any data that is sent over the network or persisted to disk or memory has to be serialized. PySpark provides serializers for this purpose. It supports two types of serializers:

  • PickleSerializer: This serializes objects using Python’s pickle module (class pyspark.PickleSerializer). It supports almost every Python object.
  • MarshalSerializer: This serializes objects using Python’s marshal module (class pyspark.MarshalSerializer). It is faster than the PickleSerializer but supports only a limited set of types.

Consider an example of serialization which makes use of MarshalSerializer:

# --serializing.py----
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer

# Initialize the Spark context with the Marshal serializer
sc = SparkContext("local", "Marshal Serialization", serializer=MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 3 * x).take(5))
sc.stop()

When we run the file using the command:

$SPARK_HOME/bin/spark-submit serializing.py

The output of the code would be a list of the first 5 numbers, each multiplied by 3:

[0, 3, 6, 9, 12]
12.

Why do we use PySpark SparkFiles?

Answer»

PySpark’s SparkFiles is used for loading files onto the Spark application. Files are added through SparkContext by calling the sc.addFile() method. The SparkFiles.get() method then resolves the path to a file that was added using sc.addFile().
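The addFile() and SparkFiles.get() pairing can be sketched as below. The helper names and the sample file contents are invented for illustration, and demo_sparkfiles assumes an existing SparkContext is passed in as sc.

```python
import os
import tempfile


def write_sample_file(text="alpha\nbeta\n"):
    """Create a small temporary file to distribute; returns its path."""
    handle, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(handle, "w") as f:
        f.write(text)
    return path


def demo_sparkfiles(sc, path):
    """Distribute `path` with addFile() and read it back inside a task."""
    from pyspark import SparkFiles

    sc.addFile(path)  # makes the file available on every node of the job
    file_name = os.path.basename(path)

    def count_lines(_):
        # SparkFiles.get() resolves the local path of the distributed copy.
        with open(SparkFiles.get(file_name)) as f:
            return sum(1 for _line in f)

    return sc.parallelize([0]).map(count_lines).collect()
```

The file is looked up by its base name, not by the original path, because each node stores its copy in its own working directory.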

13.

What is PySpark SparkContext?

Answer»

PySpark SparkContext is the initial entry point to Spark functionality. It also represents the connection to the Spark cluster and can be used for creating RDDs (Resilient Distributed Datasets) and broadcasting variables on the cluster.

The following diagram represents the architectural diagram of PySpark’s SparkContext:

When we want to run a Spark application, a driver program that has the main function is started. From this point, the SparkContext that we defined is initialized. Later on, the driver program performs operations inside the executors of the worker nodes. Additionally, a JVM is launched using Py4J, which in turn creates a JavaSparkContext. In the PySpark shell, a default SparkContext is already available as “sc”, so a new SparkContext should not be created there.

14.

What are the advantages and disadvantages of PySpark?

Answer»

Advantages of PySpark:

  • Simple to use: Parallelized code can be written in a simpler manner.
  • Error Handling: The PySpark framework handles errors easily.
  • Inbuilt Algorithms: PySpark provides many useful algorithms for machine learning and graph processing.
  • Library Support: Compared to Scala, Python has a huge collection of libraries for data science and data visualization.
  • Easy to Learn: PySpark is easy to learn because it builds on Python.

Disadvantages of PySpark:

  • Sometimes, it becomes difficult to express problems using the MapReduce model.
  • Since Spark was originally developed in Scala, PySpark programs can be relatively less efficient, approximately 10 times slower than equivalent Scala programs. This would impact the performance of heavy data processing applications.
  • The Spark Streaming API in PySpark is not mature when compared to Scala. It still requires improvements.
  • PySpark cannot be used for modifying the internal function of the Spark due to the abstractions provided. In such cases, Scala is preferred.
15.

What are the characteristics of PySpark?

Answer»

There are 4 characteristics of PySpark:

  • Abstracted Nodes: This means that the individual worker nodes cannot be addressed.
  • Spark API: PySpark provides APIs for utilizing Spark features.
  • Map-Reduce Model: PySpark is based on Hadoop’s Map-Reduce model, which means that the programmer provides the map and the reduce functions.
  • Abstracted Network: Networks are abstracted in PySpark, which means that the only possible communication is implicit communication.
16.

What is PySpark?

Answer»

PySpark is the Python interface to Apache Spark. It is used for working with Spark through APIs written in Python. It supports Spark’s features such as Spark DataFrame, Spark SQL, Spark Streaming, Spark MLlib and Spark Core. It provides an interactive PySpark shell for analyzing structured and semi-structured data in a distributed environment. PySpark supports reading data from multiple sources and in different formats. It also facilitates the use of RDDs (Resilient Distributed Datasets). PySpark is built on the Py4J library, which lets Python code call into the JVM.

PySpark can be installed from PyPI by using the command:

pip install pyspark