Distributed Caching With Broadcast Variables: Apache Spark

Contributed by Prithviraj Bose

Broadcast variables are useful when large datasets need to be cached on the executors. This blog explains how to get started.

What are Broadcast Variables?

Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to the executors with every task that uses them, which can cause considerable network overhead. With broadcast variables, they are shipped to each executor once and cached there for future reference.

Broadcast Variables Use case

Imagine that while doing a transformation we need to look up a large table of zip codes/pin codes. It is neither feasible to send the large lookup table to the executors every time, nor can we query the database every time. The solution is to convert this lookup table to a broadcast variable, which Spark will cache on every executor for future reference.
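As a rough illustration of this use case, here is a minimal sketch that broadcasts a tiny lookup table and reads it inside a transformation. The object name ZipLookupSketch, the sample pin codes and the order records are made up for illustration, and the job is assumed to run in local mode; only sc.broadcast and the value accessor come from Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object ZipLookupSketch {
  // Hypothetical lookup table; a real one would be far larger and
  // would be loaded from a file or a database.
  val zipToCity: Map[String, String] = Map(
    "110001" -> "New Delhi",
    "400001" -> "Mumbai",
    "700001" -> "Kolkata"
  )

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption made so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("ZipLookupSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Ship the table to the executors once; tasks read it through .value.
      val zipCache: Broadcast[Map[String, String]] = sc.broadcast(zipToCity)

      // Made-up records of the form (orderId, pinCode).
      val orders = sc.parallelize(Seq(
        ("order-1", "110001"),
        ("order-2", "700001"),
        ("order-3", "999999")
      ))

      // The closure captures only the broadcast handle, not the whole map.
      val withCity = orders.map { case (id, zip) =>
        (id, zipCache.value.getOrElse(zip, "unknown"))
      }

      withCity.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Pin codes that are missing from the table fall back to "unknown" here; whether to drop or flag such records is a design choice that depends on the job.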

Let’s take a simple example to understand the above concepts. We have a CSV file with names of countries and their capitals. The CSV file can be found here.

Assume we are processing demographic data of countries and need to get the capital of each country. In this case, we can convert the data in the CSV file to a broadcast variable.

First we load the CSV file into a map. If the file is found, the method returns Some(countries); otherwise it returns None.
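The loading step might look roughly like the sketch below. The names CountryCsv, parseLines and loadCSVFile are hypothetical, and the file is assumed to contain one "country,capital" pair per line with no header row; the original post's loader may differ.

```scala
import scala.io.Source
import scala.util.Try

object CountryCsv {
  // Parses lines of the assumed form "country,capital" into a map.
  // Lines without a comma or with an empty country name are skipped.
  def parseLines(lines: Iterator[String]): Map[String, String] =
    lines
      .map(_.split(",", 2).map(_.trim))
      .collect { case Array(country, capital) if country.nonEmpty => country -> capital }
      .toMap

  // Returns Some(countries) when the file can be read, None otherwise
  // (for example when the file does not exist).
  def loadCSVFile(path: String): Option[Map[String, String]] =
    Try {
      val source = Source.fromFile(path)
      // The map is built fully before the file handle is closed.
      try parseLines(source.getLines()) finally source.close()
    }.toOption
}
```

Returning an Option lets the caller decide what to do when the file is missing, instead of failing inside the loader.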

After successfully loading the CSV file, we convert the map to a broadcast variable and use it in our programme.
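A minimal sketch of this step, under some assumptions: the object name CountriesBroadcastSketch is hypothetical, a small inline map stands in for the data loaded from the CSV file, and Spark runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object CountriesBroadcastSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the result of loading the CSV file; None would model a missing file.
    val loaded: Option[Map[String, String]] =
      Some(Map("India" -> "New Delhi", "Italy" -> "Rome", "France" -> "Paris"))

    loaded match {
      case None =>
        println("Could not load the countries file")

      case Some(countries) =>
        val conf = new SparkConf().setAppName("CountriesBroadcastSketch").setMaster("local[*]")
        val sc = new SparkContext(conf)
        try {
          // Convert the map to a broadcast variable, cached once per executor.
          val countriesCache: Broadcast[Map[String, String]] = sc.broadcast(countries)

          // Create an RDD from the keys (country names) of the map.
          val countryNames: RDD[String] = sc.parallelize(countries.keys.toSeq)

          // Look up each capital on the executors through the broadcast value.
          val withCapitals = countryNames.map(name => (name, countriesCache.value(name)))
          withCapitals.collect().foreach(println)
        } finally {
          sc.stop()
        }
    }
  }
}
```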

Here we load the CSV file into a map, countries, and then convert that map to a broadcast variable, countriesCache. Subsequently, we create an RDD from the keys of countries. In the searchCountryDetails method we search for all the countries starting with a user-defined letter, and the method returns an RDD of countries along with their capitals. The broadcast variable countriesCache is used for looking up the capitals.
This way we need not send the whole CSV data every time we need to search.

The code for the searchCountryDetails method is shown below.
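What follows is a reconstruction sketch rather than the author's original listing. The enclosing object CountrySearch, the parameter names and the sample data are assumptions, and the sketch matches the first letter case-insensitively, which the post does not specify.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object CountrySearch {
  // Returns (country, capital) pairs for every country whose name starts
  // with `letter`. Capitals are read from the broadcast map on the executors.
  def searchCountryDetails(
      countryNames: RDD[String],
      countriesCache: Broadcast[Map[String, String]],
      letter: Char): RDD[(String, String)] =
    countryNames
      .filter(_.headOption.exists(_.toUpper == letter.toUpper))
      .map(name => (name, countriesCache.value.getOrElse(name, "unknown")))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CountrySearch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample data standing in for the map loaded from the CSV file.
      val countries = Map("India" -> "New Delhi", "Italy" -> "Rome", "France" -> "Paris")
      val countriesCache = sc.broadcast(countries)
      val countryNames = sc.parallelize(countries.keys.toSeq)

      searchCountryDetails(countryNames, countriesCache, 'I')
        .collect()
        .foreach { case (country, capital) => println(s"$country: $capital") }
    } finally {
      sc.stop()
    }
  }
}
```

Because the method takes the broadcast handle as a parameter, it can be called repeatedly with different letters without shipping the map again.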

The whole source code can be found here.

