Spark Accumulators Explained


Contributed by Prithviraj Bose

This post covers what you need to know about Spark accumulators. With Apache Spark certification being a key skill that many IT recruiters look for, demand for Spark in the industry has grown rapidly since its inception.

What are accumulators?

Accumulators are variables used for aggregating information across the executors. For example, this information can pertain to data or API diagnostics, such as how many records are corrupted or how many times a particular library API was called.

To understand why we need accumulators, let’s see a small example.

Here’s an imaginary log of transactions of a chain of stores around the central Kolkata region.

There are 4 fields:

Field 1 -> City

Field 2 -> Locality

Field 3 -> Category of item sold

Field 4 -> Value of item sold

However, the logs can be corrupted. For example, the second line is blank, the fourth line reports a network issue, and the last line shows a sales value of zero (which cannot happen!).

We can use accumulators to analyse the transaction log and find the number of blank logs (blank lines), the number of times the network failed, the number of products without a category, or even the number of times zero sales were recorded. The full sample log can be found here.

Accumulators are applicable to any operation that is both,
1. Commutative -> f(x, y) = f(y, x), and
2. Associative -> f(f(x, y), z) = f(x, f(y, z))
Together, these conditions mean that partial results can be combined in any order and in any grouping. For example, the sum and max functions satisfy both conditions whereas average does not: it is commutative but not associative.
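To make the two conditions concrete, here is a small plain-Python sketch (no Spark required). The helper name merge_in_every_order and the sample partial results are hypothetical. The idea is that if an operation is safe for an accumulator, merging the executors' partial results in any order must give the same final answer.

```python
from functools import reduce
from itertools import permutations


def merge_in_every_order(op, partials):
    """Fold the partial results with op in every possible order and
    return the set of distinct final answers."""
    return {reduce(op, order) for order in permutations(partials)}


def average(x, y):
    """Pairwise average: commutative, but not associative."""
    return (x + y) / 2


# Hypothetical partial results reported by three executors.
partials = [4, 10, 1]

sum_results = merge_in_every_order(lambda x, y: x + y, partials)
max_results = merge_in_every_order(max, partials)
avg_results = merge_in_every_order(average, partials)

print(sum_results)  # a single value: order does not matter
print(max_results)  # a single value: order does not matter
print(avg_results)  # several values: the answer depends on merge order
```

To compute an average with accumulators, one common workaround is to accumulate a sum and a count separately and divide on the driver.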

Why use Spark Accumulators?

Now, why do we need accumulators? Why not just use ordinary variables, as shown in the code below?
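The original listing is not reproduced here, so the following is only a sketch of the kind of code the question refers to, written in PySpark. The variable name blankLines comes from the text; the function names, the sc parameter (assumed to be an existing SparkContext) and the path argument are assumptions made for illustration.

```python
def is_blank(line):
    """True for an empty or whitespace-only log line."""
    return line.strip() == ""


def count_blank_lines_naively(sc, path):
    """Try to count blank lines with an ordinary Python variable.

    sc is assumed to be an existing SparkContext and path the location
    of the transaction log; both names are hypothetical.
    """
    blankLines = 0  # this variable lives on the driver

    def check(line):
        nonlocal blankLines
        if is_blank(line):
            blankLines += 1  # increments the executor's own copy only

    # foreach() is an action, so check() really does run on the executors.
    sc.textFile(path).foreach(check)

    # The driver's copy was never touched by the executors.
    return blankLines
```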

The problem with the above code is that when the driver prints the variable blankLines, its value will be zero. This is because when Spark ships this code to the executors, each executor gets its own local copy of the variable, and the updated value is not relayed back to the driver. To avoid this problem we need to make blankLines an accumulator, so that the updates made to it in every executor are relayed back to the driver.

So the above code should be written as,
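Since the original listing is not reproduced here, the following PySpark sketch shows one way the accumulator version might look. sc.accumulator(), add() and value are part of the PySpark API; the function names, the sc parameter (assumed to be an existing SparkContext) and the path argument are assumptions made for illustration.

```python
def is_blank(line):
    """True for an empty or whitespace-only log line."""
    return line.strip() == ""


def count_blank_lines(sc, path):
    """Count blank lines in the log at path using an accumulator.

    sc is assumed to be an existing SparkContext; path is hypothetical.
    """
    blankLines = sc.accumulator(0)  # created on the driver, starts at 0

    def check(line):
        if is_blank(line):
            blankLines.add(1)  # tasks can only add to the accumulator

    # foreach() is an action, so the updates are made inside an action.
    sc.textFile(path).foreach(check)

    # Only the driver reads the merged value.
    return blankLines.value
```

Using add() rather than += keeps the inner function free of nonlocal or global declarations.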

This guarantees that the accumulator blankLines is updated across every executor and the updates are relayed back to the driver.

We can implement other counters for network errors or zero sales value, etc. The full source code along with the implementation of the other counters can be found here.
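As a rough sketch of how several counters might sit side by side, the PySpark example below keeps one accumulator per kind of problem. The log format is an assumption: it treats each record as four comma-separated fields (city, locality, category, value) and treats any line containing the word "network" as a network failure. The labels and function names are hypothetical, and sc is assumed to be an existing SparkContext.

```python
def classify(line):
    """Return a label for one log line.

    Assumed (hypothetical) format: city,locality,category,value
    """
    if line.strip() == "":
        return "blank"
    if "network" in line.lower():
        return "network_error"
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        return "malformed"
    city, locality, category, value = fields
    if category == "":
        return "missing_category"
    try:
        amount = float(value)
    except ValueError:
        return "malformed"
    if amount == 0:
        return "zero_sales"
    return "ok"


def count_problems(sc, path):
    """Count each kind of bad record with its own accumulator."""
    labels = ["blank", "network_error", "missing_category",
              "zero_sales", "malformed"]
    counters = {label: sc.accumulator(0) for label in labels}

    def check(line):
        label = classify(line)
        if label in counters:  # "ok" lines are not counted
            counters[label].add(1)

    sc.textFile(path).foreach(check)  # updates happen inside an action
    return {label: acc.value for label, acc in counters.items()}
```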

People familiar with Hadoop MapReduce will notice that Spark’s accumulators are similar to Hadoop’s MapReduce counters.

Caveats

When using accumulators there are some caveats that we as programmers need to be aware of,

Computations inside transformations are evaluated lazily, so unless an action happens on an RDD the transformations are not executed. As a result, accumulators used inside functions like map() or filter() won't be updated unless some action happens on the RDD.
Spark guarantees that accumulator updates made inside actions are applied only once. So even if a task is restarted and the lineage is recomputed, the accumulators will be updated only once.
Spark does not guarantee this for transformations. So if a task is restarted and the lineage is recomputed, there is a chance of undesirable side effects, because the accumulators may be updated more than once.
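The first and third caveats can be seen in a short PySpark sketch. It is only an illustration: sc is assumed to be an existing SparkContext, the function and variable names are hypothetical, and the counts mentioned in the comments assume an uncached RDD with no task retries or speculative execution.

```python
def accumulator_in_transformation(sc):
    """Show lazy evaluation and double counting with map()."""
    seen = sc.accumulator(0)

    def tag(x):
        seen.add(1)  # side effect inside a transformation
        return x

    rdd = sc.parallelize(range(10)).map(tag)  # lazy: nothing has run yet
    before_action = seen.value                # still 0, map() has not executed

    rdd.count()                               # first action runs map()
    after_first_action = seen.value           # normally 10

    rdd.count()                               # uncached lineage is recomputed
    after_second_action = seen.value          # normally 20: each record counted twice

    return before_action, after_first_action, after_second_action
```

Calling rdd.cache() before the first action usually avoids the recomputation, but cached partitions can be evicted, so it is not a guarantee.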

To be on the safe side, always use accumulators inside actions ONLY.
The code here shows a simple yet effective example on how to achieve this.
For more information on accumulators, read this.

