Skip to main content

Must know things about Apache Spark Structured Streaming before using it!

Spark Structured Streaming in Depth – Introduction

Structured streaming is a new high-level streaming API

It is pure declarative based on automatically incrementalizing a static relational query. (It supports SparkSQL or DataFrames)

It supports end to end Realtime application as well as Batch. (We can even use same Realtime code for batch jobs with just one line of code change)

Once messages are received in structured streaming engine it is directly available in DataFrame and it achieves high performance using Spark SQL’s code generation engine. As per research paper it outperforms Apache Flink 2X and Apache Kafka Streams by 90X.

Structured streaming declarative API also supports Rollbacks, Code updates.

Challenges we faced with older streaming APIs are –

  1. Depending on the workload or messaging queues (Kafka/RabbitMQ/Kinesis) developers need to think on job triggering modes, stage storage in checkpointing and at-least-once message processing.
  2. Focused was more on streaming computation rather than complex joining streaming.

Structured streaming brings nice features like interactive queries, continues stream processing, static and streaming data joins.

Following are the two main characteristics of Structured Stream:

  1. Incremental query model - Once data is received from receiver or read using file stream reader user can use existing Spark Batch API to write streaming query.
  2. End to end support – Structured streaming API provided direct way (API) to communicate with other components like Kafka / Kinesis or other type of messaging queue and storage like S3, HDFC, Cassandra etc.

By default, Structured streaming support “exactly-once” computation.

Structured streaming internally recuses Spark SQL engine to optimize query and run time code generation. By default, engine runs on (Tigger based) micro batch execution model

.trigger(Trigger.ProcessingTime("10 second"))

But we can directly shift existing code to Trigger continuous low-latency model.

.trigger(Trigger.Continuous("1 second"))

Failure recovery – Each application in spark structured streaming maintains a write-ahead event log in JSON format (Readable). It is useful in recovering from an arbitrary point.

Use case where we can use Structured Streaming:

  1. Realtime sensor data analytics where we can join raw sensor value with static data (csv) on realtime streaming query.
  2. Log analytics – Alerts, Custom application metrics
  3. Incremental ETL load etc.

Next post, I will explain more about structured streaming internals, its architecture and code examples. Please subscribe to my blog for latest updates. Cheers!