Stateful Transformations with Windowing in Spark Streaming

Contributed by Prithviraj Bose

In this blog we will discuss the windowing concept of Apache Spark’s stateful transformations.

What is stateful transformation?

Spark streaming uses a micro batch architecture where the incoming data is grouped into micro batches called Discretized Streams (DStreams) which also serves as the basic programming abstraction. The DStreams internally have Resilient Distributed Datasets (RDD) and as a result of this standard RDD transformations and actions can be done.

In streaming if we have a use case to track data across batches then we need state-ful DStreams.

For example we may track a user’s interaction in a website during the user session or we may track a particular twitter hashtag across time and see which users across the globe is talking about it.

Types of state-ful transformation.

State-ful DStreams are of two types – window based tracking and full session tracking.

For stateful tracking all incoming data should be transformed to key-value pairs such that the key states can be tracked across batches. This is a precondition.

Further we should also enable checkpointing, a concept which we will discuss in the later blogs.

> Window based tracking

In window based tracking the incoming batches are grouped in time intervals, i.e. group batches every ‘x’ seconds. Further computations on these batches are done using slide intervals.

For example if the window interval = 3 secs and slide interval = 2 secs, then all incoming data will be grouped in batches every 3 seconds and the computations on these batches will happen every 2 seconds. Alternatively we can say, do computations every 2 seconds on the batches that arrived in the last 3 seconds.

In the above diagram we see that the incoming batches are grouped every 3 units of time (window interval) and the computations are done every 2 units of time (slide interval).
Note: Unlike Apache Flink, Apache Spark does not have a concept of tumbling window, all windows are sliding.

API

A popular API for window based transformations is

PairDStreamFunctions.reduceByKeyAndWindow.

There are several overloaded versions of this API, let’s see the one that has the most number of parameters. After this explanation the rest of the overloaded versions of this API should be self explanatory.

Returns: The transformed DStream[(K, V)]

reduceFunc: The associative reduce function.

invReduceFunc: The inverse of the above reduce function. This is required for efficient computing of incoming and outgoing batches. With the help of this function the value of the batches which are outgoing is deducted from the accumulated value of the above reduce function. For example, if we are computing the sum of the incoming values for the respective keys then for the outgoing batches we will subtract the values for the respective keys (provided they are present in the current batch else ignore).

windowDuration: Units of time for grouping the batches, this should be a multiple of the batch interval.

slideDuration: Units of time for computation, this should be a multiple of the batch interval.partitioner: The partitioner to use for storing the resulting DStream. For more information on partitioning read this.

filterFunc: Function to filter out expired key-value pairs, i.e. for example if we don’t get an update for a key for sometime we may wish to remove it.

Here’s a program to count the words coming from a socket stream. We have used the an overloaded version of the above function with a window interval of 4 secs and a slide interval of 2 secs.

In my next blog I will write about full session tracking and checkpointing.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

Stateful Transformations with Windowing in Spark Streaming

Recommended videos for you

Real-Time Analytics with Apache Storm

Reduce Side Joins With MapReduce

Webinar: Introduction to Big Data & Hadoop

Hadoop Tutorial – A Complete Tutorial For Hadoop

Improve Customer Service With Big Data

Tailored Big Data Solutions Using MapReduce Design Patterns

Is Hadoop A Necessity For Data Science?

Is It The Right Time For Me To Learn Hadoop ? Find out.

Hadoop Cluster With High Availability

What is Apache Storm all about?

What is Big Data and Why Learn Hadoop!!!

MapReduce Design Patterns – Application of Join Pattern

Boost Your Data Career with Predictive Analytics! Learn How ?

Big Data Tutorial – Get Started With Big Data And Hadoop

Power of Python With BigData

Hadoop-A Highly Available And Secure Enterprise Data Warehousing Solution

MapReduce Tutorial – All You Need To Know About MapReduce

Advanced Security In Hadoop Cluster

Apache Spark Redefining Big Data Processing

Apache Kafka With Spark Streaming: Real-Time Analytics Redefined

Recommended blogs for you

Running Scala Application In Eclipse IDE Using Sbteclipse

Is Big Data the Right Move for You?

Top 3 Big Data Certifications : Become a Big Data Hadoop Professional

Top Hadoop Interview Questions To Prepare In 2025 – Apache Hive

Career Advantages of Hadoop Certification

Top Hadoop Interview Questions On Apache PIG For 2025

How Predictive Analysis can Help you Combat Employee Attrition

Steps to Create UDF in Apache Pig

Basics of HBase

Apache Pig UDF: Part 2 – Load Functions

Anatomy of a MapReduce Job in Apache Hadoop

How to become a Hadoop Administrator?

Apache Flink: The Next Gen Big Data Analytics Framework For Stream And Batch Data Processing

How to Run Hive Scripts?

Operators in Apache Pig: Part 2- Diagnostic Operators

Dataframes in Spark: All you need to know about Structured Data Processing

Oozie Tutorial: Learn How to Schedule your Hadoop Jobs

Mastered Hadoop? Time to get started with Apache Spark

What is Azure Data Factory – Here’s Everything You Need to Know

Drilling Down On Apache Drill, the New-Age Query Engine

Join the discussionCancel reply

Trending Courses in Big Data

PySpark Certification Training Course

Applied Data Engineering on Azure Cloud Cours ...

Apache Kafka Certification Training Course

Big Data Hadoop Certification Training Course

Apache Spark and Scala Certification Training ...

Big Data Hadoop Administration Certification ...

Splunk Certification Training: Power User and ...

Comprehensive MapReduce Certification Trainin ...

MapReduce Design Patterns Certification Train ...

ELK Stack Training & Certification

Browse Categories

Subscribe to our Newsletter, and get personalized recommendations.

Stateful Transformations with Windowing in Spark Streaming