1.

What is Spark streaming and how does it work?

Answer»

Spark Streaming enables real-time processing of streaming data. Through Spark Streaming, we achieve fault-tolerant processing of live data streams. The input data can come from many sources, such as Kafka, Flume, Kinesis, Twitter, or HDFS/S3. Spark includes two streaming APIs:

  1. DStream API.
  2. Structured Streaming API.

DStream API:

Spark’s DStream API has been used broadly for stream processing since its first release in 2012. Many companies use and operate Spark Streaming at scale in production today because of its high-level API and simple exactly-once semantics. Interactions with RDD code, such as joins with static data, are also natively supported in Spark Streaming, and operating Spark Streaming is not much more difficult than operating a normal Spark cluster. However, the DStream API has some limitations.

  1. It is based purely on Java/Python objects and functions, as opposed to the richer concept of structured tables in DataFrames and Datasets. This limits the engine’s opportunity to perform optimizations.
  2. The API is based purely on processing time; to handle event-time operations, applications need to implement them on their own.
  3. Finally, DStreams can operate only in a micro-batch fashion, and they expose the duration of micro-batches in some parts of the API, making it difficult to support alternative execution modes.
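The micro-batch model behind DStreams can be sketched without Spark itself. The following pure-Python simulation (the function and variable names are illustrative, not Spark API) groups incoming records into fixed-duration batches keyed by their arrival (processing) time, then produces one word-count result per batch, mirroring how a DStream transformation runs once per micro-batch:

```python
from collections import Counter

def micro_batch(records, batch_duration):
    """Group (arrival_time, word) records into fixed-duration
    micro-batches keyed by the batch's start time (processing time)."""
    batches = {}
    for arrival, word in records:
        start = (arrival // batch_duration) * batch_duration
        batches.setdefault(start, []).append(word)
    # One word-count result per micro-batch, like a DStream transformation.
    return {start: Counter(words) for start, words in sorted(batches.items())}

# Records that arrive in [0, 3) land in batch 0; the record at t=5 lands
# in the batch starting at t=3.
records = [(0, "spark"), (1, "stream"), (2, "spark"), (5, "spark")]
result = micro_batch(records, batch_duration=3)
```

Note that the batching here depends only on when records arrive, which is exactly why event-time logic must be layered on top by the application.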

Structured Streaming API:

Structured Streaming is a higher-level streaming API built from the ground up on Spark’s Structured APIs. It is available in all the environments where structured processing runs, including Scala, Java, Python, R, and SQL. Like DStreams, it is a declarative API based on high-level operations, but by building on the structured data model, Structured Streaming can perform more types of optimizations automatically. Unlike DStreams, however, Structured Streaming has native support for event-time data.
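What "native event-time support" means can be illustrated with a small pure-Python sketch (again illustrative names, not Spark API): windows are assigned from the timestamp carried inside each record, so a late-arriving record still lands in the correct window, regardless of when it was processed:

```python
from collections import Counter

def event_time_window(records, window):
    """Assign each (event_time, word) record to a tumbling window keyed
    by the event timestamp embedded in the data itself, independent of
    when the record actually arrived for processing."""
    windows = {}
    for event_time, word in records:
        start = (event_time // window) * window
        windows.setdefault(start, Counter())[word] += 1
    return dict(sorted(windows.items()))

# The last record has event time 1 but arrives after the others; it is
# still counted in the window starting at t=0.
records = [(4, "spark"), (5, "stream"), (1, "spark")]
result = event_time_window(records, window=3)
```

In real Structured Streaming this grouping is expressed declaratively over a timestamp column, and the engine handles late data and state management for you.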

More fundamentally, beyond simplifying stream processing, Structured Streaming is also designed to make it easy to build end-to-end continuous applications with Apache Spark that combine streaming, batch, and interactive queries. As new data arrives, Structured Streaming automatically updates the result of a query in an incremental fashion.
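The incremental-update idea can be sketched as follows (a pure-Python simulation, not Spark API): a running result table is maintained as state, and each new batch of data updates only that state rather than recomputing the result from scratch:

```python
from collections import Counter

def run_incremental(batches):
    """Maintain a running word count, updating the result table
    incrementally as each new batch of data arrives; return a
    snapshot of the result after every batch."""
    state = Counter()
    results = []
    for batch in batches:
        state.update(batch)          # only the new rows are processed
        results.append(dict(state))  # snapshot of the updated result
    return results

batches = [["spark", "stream"], ["spark"], ["sql", "spark"]]
snapshots = run_incremental(batches)
```

Each snapshot reflects all data seen so far, which is the behavior a Structured Streaming query with a complete output mode exposes to the user.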


