Apache Spark combineByKey Explained


Contributed by Prithviraj Bose

Spark is a fast cluster computing framework, and professionals with Apache Spark and Scala certification are in high demand today. This post looks at one of Spark’s more powerful APIs, combineByKey.

Scala API: org.apache.spark.rdd.PairRDDFunctions.combineByKey.

Python API: pyspark.RDD.combineByKey.

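In its simplest form, the API takes three functions (as lambda expressions in Python or anonymous functions in Scala), namely,

Create combiner function: x, which turns the first value seen for a key in a partition into an initial combiner
Merge value function: y, which folds each further value for that key in the same partition into the combiner
Merge combiners function: z, which merges the combiners built for the same key in different partitions

and the API format is combineByKey(x, y, z).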
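As a rough illustration of the three roles, the sketch below writes x, y and z as typed Scala functions for a running (sum, count) and applies them by hand to plain lists, without Spark. The names createCombiner, mergeValue and mergeCombiners follow the parameter names in Spark’s Scala API; the object name, the helper combinePartition and the sample values are assumptions made up for this example.

```scala
object ThreeFunctions {
  // x: turn the first value seen for a key into a combiner (sum, count)
  val createCombiner: Int => (Int, Int) = v => (v, 1)

  // y: fold one more value from the same partition into the combiner
  val mergeValue: ((Int, Int), Int) => (Int, Int) =
    (acc, v) => (acc._1 + v, acc._2 + 1)

  // z: merge two combiners built for the same key in different partitions
  val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  // Hypothetical helper: x on the first value, then y on the rest.
  // Assumes a non-empty list of values for a single key.
  def combinePartition(values: List[Int]): (Int, Int) =
    values.tail.foldLeft(createCombiner(values.head))(mergeValue)

  def main(args: Array[String]): Unit = {
    // Made-up values of one key, split over two partitions
    val partition1 = List(80, 90)
    val partition2 = List(70)
    val combined =
      mergeCombiners(combinePartition(partition1), combinePartition(partition2))
    println(combined) // prints (240,3)
  }
}
```

The combiner type (here a pair of Int) does not have to match the value type, which is what sets combineByKey apart from simpler per-key reductions.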

Let’s see an example (in Scala). The full Scala source can be found here.

Our objective is to find the average score per student.

Here’s a placeholder class ScoreDetail storing a student’s name along with the score in a subject.
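The original listing is not reproduced here, so the following is only a minimal sketch of what such a class could look like. The field names studentName, subject and score, and the choice of Float for the score, are assumptions.

```scala
// Hypothetical fields: the text only says the class stores a student's
// name along with the score in a subject.
case class ScoreDetail(studentName: String, subject: String, score: Float)
```

A case class is a convenient fit because it is serializable and prints readably, which helps when inspecting partitions.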

Some test data is generated and converted to key-value pairs where key = student’s name and value = ScoreDetail instance.
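A sketch of this step, assuming the hypothetical ScoreDetail fields used for illustration in this post. The student names and scores are made up.

```scala
object ScoreData {
  // Hypothetical fields and scores, for illustration only
  case class ScoreDetail(studentName: String, subject: String, score: Float)

  val scores: List[ScoreDetail] = List(
    ScoreDetail("A", "Math", 98f), ScoreDetail("A", "English", 88f),
    ScoreDetail("B", "Math", 82f), ScoreDetail("B", "English", 64f),
    ScoreDetail("C", "Math", 60f), ScoreDetail("C", "English", 90f),
    ScoreDetail("D", "Math", 75f), ScoreDetail("D", "English", 85f)
  )

  // Key each record by student name: (key, value) = (name, ScoreDetail)
  val scoresWithKey: List[(String, ScoreDetail)] =
    scores.map(s => (s.studentName, s))

  def main(args: Array[String]): Unit = scoresWithKey.foreach(println)
}
```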

Then we create a Pair RDD as shown in the code fragment below. Just for experimentation, I have created a hash partitioner with 3 partitions, so the three partitions will contain 2, 2 and 4 key-value pairs respectively. This is highlighted in the section where we explore each partition.
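One way to build such a Pair RDD is sketched below. It assumes Spark is on the classpath and runs in local mode; the object name, the app name and the test data are placeholders rather than the author’s original code.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PairRddSketch {
  // Hypothetical fields, for illustration only
  case class ScoreDetail(studentName: String, subject: String, score: Float)

  // Made-up test data: four students, two subjects each
  val scoresWithKey: List[(String, ScoreDetail)] = for {
    name    <- List("A", "B", "C", "D")
    subject <- List("Math", "English")
  } yield (name, ScoreDetail(name, subject, 80f))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute the pairs, then redistribute them by key hash into 3 partitions
    val scoresWithKeyRDD = sc.parallelize(scoresWithKey)
      .partitionBy(new HashPartitioner(3))
      .cache()

    println(scoresWithKeyRDD.getNumPartitions) // prints 3
    sc.stop()
  }
}
```

HashPartitioner assigns a key to a partition from the key’s hashCode modulo the number of partitions, so all pairs for one student land in the same partition.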

Now we can explore each partition. The first line prints the length of each partition (number of key value pairs per partition) and the second line prints the contents of each partition.
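The sketch below shows one possible way to inspect the partitions. Unlike printing inside foreachPartition, which writes to the executors’ output, it brings the sizes and contents back to the driver with collect(), which is only sensible for small test data. The data, object name and app name are assumptions.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ExplorePartitions {
  // Hypothetical fields, for illustration only
  case class ScoreDetail(studentName: String, subject: String, score: Float)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("explore-partitions").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up test data: four students, two subjects each
    val scoresWithKey = for {
      name    <- List("A", "B", "C", "D")
      subject <- List("Math", "English")
    } yield (name, ScoreDetail(name, subject, 80f))

    val rdd = sc.parallelize(scoresWithKey).partitionBy(new HashPartitioner(3)).cache()

    // Number of key-value pairs per partition, gathered at the driver
    rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx holds $n pairs") }

    // Contents of each partition; glom() turns every partition into one array
    rdd.glom().collect().zipWithIndex.foreach { case (part, idx) =>
      println(s"partition $idx: ${part.mkString(", ")}")
    }

    sc.stop()
  }
}
```

With these single-letter names, the string hash codes modulo 3 should place B in partition 0, C in partition 1, and both A and D in partition 2, which gives the 2, 2 and 4 split described above.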

And here’s the finale, where we compute the average score per student after combining the scores across the partitions.
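A minimal sketch of this computation follows, assuming Spark in local mode. The ScoreDetail fields, the scores and the object name are made up for illustration; only combineByKey, map and collect are Spark’s own API.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object AverageScores {
  // Hypothetical fields and scores, for illustration only
  case class ScoreDetail(studentName: String, subject: String, score: Float)

  val scores: List[ScoreDetail] = List(
    ScoreDetail("A", "Math", 98f), ScoreDetail("A", "English", 88f),
    ScoreDetail("B", "Math", 82f), ScoreDetail("B", "English", 64f),
    ScoreDetail("C", "Math", 60f), ScoreDetail("C", "English", 90f),
    ScoreDetail("D", "Math", 75f), ScoreDetail("D", "English", 85f)
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("average-scores").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val scoresWithKeyRDD = sc.parallelize(scores.map(s => (s.studentName, s)))
      .partitionBy(new HashPartitioner(3))

    val avgScoresRDD = scoresWithKeyRDD.combineByKey(
      // create combiner: first score seen for a student in a partition
      (x: ScoreDetail) => (x.score, 1),
      // merge value: add a further score from the same partition
      (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
      // merge combiners: add up (total, count) pairs from different partitions
      (a: (Float, Int), b: (Float, Int)) => (a._1 + b._1, a._2 + b._2)
    ).map { case (name, (total, count)) => (name, total / count) }

    // collect() is the action that brings the averages to the driver
    avgScoresRDD.collect().sortBy(_._1).foreach(println)
    // prints (A,93.0), (B,73.0), (C,75.0), (D,80.0), one per line

    sc.stop()
  }
}
```

Because the score is a Float here, total / count is a floating-point division; with an integer score type the same expression would truncate.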

The above code flow is as follows.
First, the create combiner function is called the first time a key is encountered in a partition, and it turns that key’s value into the tuple (value, 1). After this step the first (key, value) seen for each key in a partition has become (key, (value, 1)).

Then the merge value function folds every further value for the same key in that partition into the existing tuple, adding the value to the running total and 1 to the count. After this phase each partition holds one (key, (total, count)) per key.

Finally, the merge combiners function merges the tuples for the same key across the partitions, on the executors. After this phase the (key, (total, count)) pairs from the individual partitions have become
(key, (totalAcrossAllPartitions, countAcrossAllPartitions)).
Nothing is sent to the driver at this point, because combineByKey is a transformation; the result stays distributed until an action is called.
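To make the three phases concrete, here is a plain-Scala model that mimics them on in-memory collections, without Spark. It is a simplified sketch of the semantics, not Spark’s implementation; the partition layout, names and scores are assumptions.

```scala
object CombinePhases {
  type Combiner = (Float, Int) // (total, count)

  // Hypothetical layout: (name, score) pairs already split into two partitions
  val partitions: Seq[Seq[(String, Float)]] = Seq(
    Seq("A" -> 98f, "B" -> 82f, "A" -> 88f),
    Seq("B" -> 64f, "A" -> 90f)
  )

  // Phases 1 and 2, inside one partition: create a combiner on a key's
  // first value, merge every later value for that key into it
  def combinePartition(part: Seq[(String, Float)]): Map[String, Combiner] =
    part.foldLeft(Map.empty[String, Combiner]) { case (acc, (key, value)) =>
      acc.get(key) match {
        case None               => acc + (key -> (value, 1))               // create combiner
        case Some((total, cnt)) => acc + (key -> (total + value, cnt + 1)) // merge value
      }
    }

  // Phase 3, across partitions: merge the combiners that share a key
  def mergeMaps(a: Map[String, Combiner], b: Map[String, Combiner]): Map[String, Combiner] =
    b.foldLeft(a) { case (acc, (key, (total, cnt))) =>
      acc.get(key) match {
        case None         => acc + (key -> (total, cnt))
        case Some((t, c)) => acc + (key -> (t + total, c + cnt)) // merge combiners
      }
    }

  def main(args: Array[String]): Unit = {
    val perPartition = partitions.map(combinePartition)
    // First partition:  A -> (186.0, 2), B -> (82.0, 1)
    // Second partition: B -> (64.0, 1),  A -> (90.0, 1)
    perPartition.foreach(println)

    val merged = perPartition.reduce(mergeMaps)
    // All partitions:   A -> (276.0, 3), B -> (146.0, 2)
    println(merged)
  }
}
```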

The map then converts each
(key, tuple) = (key, (totalAcrossAllPartitions, countAcrossAllPartitions))
into (key, tuple._1/tuple._2), which is the average per key.

The last line prints the average scores for all the students at the driver’s end.

