Apache Spark combineByKey Explained


Contributed by Prithviraj Bose

Spark is a fast cluster computing framework, and demand for professionals with Apache Spark and Scala certification is substantial in the market today. This post looks at one of Spark's powerful APIs: combineByKey.

Scala API: org.apache.spark.rdd.PairRDDFunctions.combineByKey.

Python API: pyspark.RDD.combineByKey.

The API takes three functions (as lambda expressions in Python or anonymous functions in Scala), namely,

Create combiner function: x
Merge value function: y
Merge combiners function: z

and the API format is combineByKey(x, y, z).
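To make the roles of x, y and z concrete, here is a minimal Spark-free sketch in plain Scala. The name `combineByKeyModel` and the use of nested `Seq`s to stand in for partitions are assumptions made purely for illustration; Spark's real implementation is distributed and shuffles data between the two stages.

```scala
object CombineByKeyModel {
  // Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
  def combineByKeyModel[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,      // x: applied to the first value of a key in a partition
      mergeValue: (C, V) => C,     // y: folds a further value into that partition's combiner
      mergeCombiners: (C, C) => C  // z: merges combiners of the same key across partitions
  ): Map[K, C] = {
    // Stage 1: combine inside each partition independently.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v))
          case None    => acc.updated(k, createCombiner(v))
        }
      }
    }
    // Stage 2: merge the per-partition combiners key by key.
    perPartition.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.get(k) match {
        case Some(existing) => acc.updated(k, mergeCombiners(existing, c))
        case None           => acc.updated(k, c)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Count the values per key: "a" appears in both simulated partitions.
    val partitions = Seq(Seq("a" -> 1.0, "a" -> 3.0), Seq("a" -> 5.0, "b" -> 2.0))
    val counts = combineByKeyModel[String, Double, Int](
      partitions,
      _ => 1,
      (count, _) => count + 1,
      _ + _
    )
    println(counts)
  }
}
```

Note that the combiner type C (here an Int count) does not have to match the value type V (here a Double), which is what lets combineByKey build richer aggregates than the input values.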

Let’s see an example (in Scala). The full Scala source can be found here.

Our objective is to find the average score per student.

Here’s a placeholder class ScoreDetail storing a student's name along with the score for a subject.
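The original class definition is not reproduced here. A minimal sketch of such a placeholder class might look like this; the field names and the choice of `Double` for the score are assumptions.

```scala
// Sketch of the placeholder class; field names and types are assumed, not taken from the original source.
case class ScoreDetail(studentName: String, subject: String, score: Double)
```

A case class is a convenient fit because it is serializable by default, which Spark needs when shipping records between executors.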

Some test data is generated and converted to key-value pairs where key = student's name and value = ScoreDetail instance.
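As a sketch of this step, the fragment below builds a small list of scores and keys each record by student name. The students, subjects and scores are hypothetical values chosen for illustration, and `ScoreDetail` is repeated (with assumed field names) so the fragment stands alone.

```scala
// Assumed shape of the placeholder class, repeated so this fragment compiles on its own.
case class ScoreDetail(studentName: String, subject: String, score: Double)

object ScoreData {
  // Hypothetical test data: four students with two subjects each (8 records).
  val scores: List[ScoreDetail] = List(
    ScoreDetail("A", "Math", 98.0),
    ScoreDetail("A", "English", 88.0),
    ScoreDetail("B", "Math", 75.0),
    ScoreDetail("B", "English", 78.0),
    ScoreDetail("C", "Math", 90.0),
    ScoreDetail("C", "English", 80.0),
    ScoreDetail("D", "Math", 91.0),
    ScoreDetail("D", "English", 80.0)
  )

  // (key, value) = (student name, ScoreDetail instance)
  val scoresWithKey: List[(String, ScoreDetail)] =
    scores.map(detail => (detail.studentName, detail))
}
```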

Then we create a Pair RDD as shown in the code fragment below. Just for experimentation, I have created a hash partitioner with 3 partitions, so the three partitions will contain 2, 2 and 4 key-value pairs respectively. This is highlighted in the section where we explore each partition.
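A sketch of how such a Pair RDD could be built is shown below. The helper name `toPartitionedRDD` is hypothetical, and the student names "A" to "D" are assumed test keys. `HashPartitioner` assigns a key to partition `key.hashCode mod numPartitions` (made non-negative), and it can be queried directly without a running cluster.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object PairRddSketch {
  // Hypothetical helper: distribute (key, value) pairs over 3 hash partitions.
  def toPartitionedRDD[K: ClassTag, V: ClassTag](
      sc: SparkContext,
      pairs: Seq[(K, V)]
  ): RDD[(K, V)] =
    sc.parallelize(pairs)
      .partitionBy(new HashPartitioner(3)) // key -> partition by non-negative hashCode mod 3
      .cache()                             // the RDD is reused by the later steps

  def main(args: Array[String]): Unit = {
    // Show where each assumed student name would land.
    val partitioner = new HashPartitioner(3)
    Seq("A", "B", "C", "D").foreach { name =>
      println(s"$name -> partition ${partitioner.getPartition(name)}")
    }
  }
}
```

With these assumed single-letter names, "B" lands in partition 0, "C" in partition 1, and both "A" and "D" in partition 2 (their hash codes are 66, 67, 65 and 68). With two records per student that gives the 2, 2 and 4 split described above.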

Now we can explore each partition. The first line prints the length of each partition (number of key value pairs per partition) and the second line prints the contents of each partition.
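The original two lines are not reproduced here. One way to inspect partitions, sketched below, is to gather the sizes and contents at the driver with `glom` and `mapPartitionsWithIndex`. The helper names `partitionSizes` and `printPartitions` are hypothetical, and the sample uses plain (name, score) pairs with made-up values to stay short.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionExplorer {
  // Number of key-value pairs held by each partition, collected at the driver.
  def partitionSizes[K, V](rdd: RDD[(K, V)]): Array[Int] =
    rdd.glom().map(_.length).collect()

  // Print every pair together with the index of the partition that holds it.
  def printPartitions[K, V](rdd: RDD[(K, V)]): Unit =
    rdd
      .mapPartitionsWithIndex((idx, it) => it.map(pair => s"partition $idx: $pair"))
      .collect()
      .foreach(println)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("explore-partitions"))
    // Hypothetical (student name, score) pairs, two per student.
    val pairs = Seq(
      "A" -> 98.0, "A" -> 88.0, "B" -> 75.0, "B" -> 78.0,
      "C" -> 90.0, "C" -> 80.0, "D" -> 91.0, "D" -> 80.0
    )
    val rdd = sc.parallelize(pairs).partitionBy(new HashPartitioner(3))
    println(partitionSizes(rdd).mkString(", "))
    printPartitions(rdd)
    sc.stop()
  }
}
```

Collecting to the driver is only sensible for small test data like this. Printing from inside `foreachPartition` is an alternative, but on a real cluster that output appears in the executor logs rather than the driver console.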

And here’s the finale, where we compute the average score per student after combining the scores across the partitions.
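The following is a self-contained sketch of this step under stated assumptions, not the author's original listing: the `ScoreDetail` field names, the test data, the local master setting and the helper name `averageScores` are all assumed. The three arguments to `combineByKey` are the create combiner, merge value and merge combiners functions, in that order.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed shape of the placeholder class.
case class ScoreDetail(studentName: String, subject: String, score: Double)

object AverageScores {
  // Hypothetical helper: average score per student via combineByKey.
  def averageScores(rdd: RDD[(String, ScoreDetail)]): RDD[(String, Double)] =
    rdd
      .combineByKey(
        // create combiner: first score of a key in a partition -> (total, count)
        (x: ScoreDetail) => (x.score, 1),
        // merge value: fold another score of the same key into the partition's combiner
        (acc: (Double, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
        // merge combiners: add up (total, count) pairs coming from different partitions
        (acc1: (Double, Int), acc2: (Double, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
      )
      .map { case (key, (total, count)) => (key, total / count) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("combineByKey-sketch"))
    // Hypothetical test data.
    val scores = List(
      ScoreDetail("A", "Math", 98.0), ScoreDetail("A", "English", 88.0),
      ScoreDetail("B", "Math", 75.0), ScoreDetail("B", "English", 78.0),
      ScoreDetail("C", "Math", 90.0), ScoreDetail("C", "English", 80.0),
      ScoreDetail("D", "Math", 91.0), ScoreDetail("D", "English", 80.0)
    )
    val scoresWithKeyRDD = sc
      .parallelize(scores.map(s => (s.studentName, s)))
      .partitionBy(new HashPartitioner(3))

    // collect() is the action that brings the results to the driver.
    averageScores(scoresWithKeyRDD).collect().foreach(println)
    sc.stop()
  }
}
```

With this assumed data the averages are 93.0 for A, 76.5 for B, 85.0 for C and 85.5 for D; the order in which they are printed is not guaranteed.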

The above code flow is as follows…
First, the create combiner function is applied the first time a key is encountered in a partition, turning that key's value into a tuple (value, 1). After this phase, the first (key, value) seen for each key in a partition has become (key, (value, 1)).

Next, every further value for a key that already has a combiner in that partition is folded into it by the merge value function. After this phase, each key in a partition is represented by a single (key, (total, count)).

Finally, the merge combiners function merges the per-partition results for each key across the partitions. This happens on the executors; nothing is sent back to the driver until an action such as collect is called. After this phase, the (key, (total, count)) tuples from the individual partitions have become
(key, (totalAcrossAllPartitions, countAcrossAllPartitions)).
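The three phases can be traced by hand for a single key. The sketch below is plain Scala with no Spark involved; the scores are hypothetical, and the key's values are deliberately split over two simulated partitions so that the merge combiners step has something to do.

```scala
object CombinePhasesTrace {
  type Combiner = (Double, Int) // (total, count)

  val createCombiner: Double => Combiner = value => (value, 1)
  val mergeValue: (Combiner, Double) => Combiner =
    (acc, value) => (acc._1 + value, acc._2 + 1)
  val mergeCombiners: (Combiner, Combiner) => Combiner =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // Hypothetical scores for one student, split over two partitions.
    val partition1 = List(98.0, 88.0)
    val partition2 = List(70.0)

    // Phase 1: the first value seen in a partition becomes (value, 1).
    val first = createCombiner(partition1.head)                       // (98.0, 1)
    // Phase 2: later values in the same partition are folded in.
    val p1 = partition1.tail.foldLeft(first)(mergeValue)              // (186.0, 2)
    val p2 = partition2.tail.foldLeft(createCombiner(partition2.head))(mergeValue) // (70.0, 1)
    // Phase 3: the per-partition combiners for the key are merged.
    val merged = mergeCombiners(p1, p2)                               // (256.0, 3)

    println(s"partition 1: $p1, partition 2: $p2, merged: $merged")
  }
}
```

When the RDD has already been hash partitioned by key, as in this post's example, all values of a key sit in one partition, so the per-partition result for a key is already its final (total, count).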

The map then converts each (key, tuple) = (key, (totalAcrossAllPartitions, countAcrossAllPartitions)) into the average per key, (key, tuple._1/tuple._2).
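One detail worth checking in this step is the numeric type of the total. The plain-Scala sketch below uses hypothetical totals and assumes the total is a `Double`, so `/` is floating-point division; if the total were an `Int`, Scala's integer division would silently truncate the average.

```scala
object AverageFromTotals {
  // (key, (total, count)) -> (key, average). The total is a Double here,
  // so dividing by an Int count yields a Double, not a truncated integer.
  def averages(totals: Map[String, (Double, Int)]): Map[String, Double] =
    totals.map { case (key, (total, count)) => (key, total / count) }

  def main(args: Array[String]): Unit =
    // Hypothetical per-student totals and counts.
    println(averages(Map("A" -> (186.0, 2), "B" -> (153.0, 2))))
}
```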

The last line prints the average scores for all the students at the driver’s end.
