Let's first look at the map-side differences.
Map side of Hadoop MapReduce
- Each map task outputs its data as key-value pairs.
- The output is first stored in an in-memory CIRCULAR BUFFER instead of being written straight to disk.
- The circular buffer is about 100 MB in size. By default, once the buffer is 80% full, the data is spilled to disk; these files are called shuffle spill files.
- Since many map tasks run on a given node, many spill files are created. Hadoop merges each map task's spill files into one big output file that is SORTED and PARTITIONED based on the number of reducers. (The buffer size, spill threshold and merge factor are configurable; see the sketch below.)
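As a concrete illustration, these map-side buffer and spill settings are exposed as job configuration properties. The snippet below is a minimal sketch in Scala against the Hadoop MapReduce API; the property names are real, but the values chosen here and the job name are placeholders, and the defaults can vary between Hadoop versions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
// Size of the in-memory circular (sort) buffer, in MB (default 100).
conf.setInt("mapreduce.task.io.sort.mb", 200)
// Fraction of the buffer at which a background spill to disk starts (default 0.80).
conf.set("mapreduce.map.sort.spill.percent", "0.90")
// How many spill files are merged at once when producing the single sorted,
// partitioned output file of a map task (default 10).
conf.setInt("mapreduce.task.io.sort.factor", 20)

// Hypothetical job using the tuned settings.
val job = Job.getInstance(conf, "map-side-spill-tuning")
```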
Map side of Spark
- Initial Design:
- The map-side output is written to the OS BUFFER CACHE.
- The operating system decides whether the data stays in the buffer cache or gets flushed to DISK.
- Each map task creates as many shuffle spill files as there are reducers.
- Unlike Apache Hadoop, Spark does not merge and partition the shuffle spill files into one big file per map task.
- Example: with 6000 reducers (R) and 2000 map tasks (M), there will be M*R = 2000*6000 = 12 million shuffle files, because each map task creates one file per reducer. This caused performance degradation.
- This was only the initial design of Apache Spark; later releases changed it (see the sketch below).
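As a hedged sketch, the snippet below shows the arithmetic from the example plus the legacy settings that older Spark releases (roughly 0.8 through 1.5) offered to tame the file explosion: consolidating shuffle outputs, and the sort-based shuffle manager that became the default around Spark 1.2. The app name and master URL are placeholders, and these particular properties no longer exist in current Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val numMapTasks = 2000L   // M
val numReducers = 6000L   // R
// Worst-case number of shuffle files with the original hash-based shuffle.
println(numMapTasks * numReducers)   // 12000000

val conf = new SparkConf()
  .setAppName("shuffle-consolidation-sketch")
  .setMaster("local[*]")
  // Legacy option: reuse shuffle files across map tasks running on the same
  // cores, so the file count scales with cores * R instead of M * R.
  .set("spark.shuffle.consolidateFiles", "true")
  // Sort-based shuffle writes a single sorted output file (plus an index)
  // per map task, much like Hadoop's merged map output.
  .set("spark.shuffle.manager", "sort")

val sc = new SparkContext(conf)
```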
- Reduce side of Hadoop MR:
- Reduce tasks FETCH the intermediate (shuffle) files produced on the map side; this copy phase can begin as soon as individual map tasks finish. The fetched data is loaded into an in-memory buffer.
- If the buffer reaches about 70% of its limit, the data is spilled to disk (the thresholds are configurable; see the sketch after this list).
- The spills are then merged to form bigger files.
- Finally, the reduce method is invoked.
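As with the map side, the reduce-side buffering is driven by configuration properties. The sketch below again uses the Hadoop API from Scala; the property names are real, the chosen values are placeholders, and the quoted defaults are approximate and version-dependent.

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Fraction of the reducer heap used to hold fetched map outputs (default ~0.70).
conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.70")
// Usage threshold of that buffer at which the in-memory merge / spill to disk
// kicks in (default ~0.66).
conf.set("mapreduce.reduce.shuffle.merge.percent", "0.66")
// Number of parallel copier threads fetching map outputs (default 5).
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10)
```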
- Reduce side of Apache Spark:
- Reduce tasks PULL the intermediate (shuffle) files from the map side, and only after all map tasks of the stage have finished.
- The fetched data is written directly to memory.
- If the data doesn't fit in memory, it is spilled to disk from Spark 0.9 onwards; before that, an OOM (out of memory) exception would be thrown. (The legacy spill settings are sketched below.)
- Finally, the reducer functionality is invoked.
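A minimal sketch of those legacy spill settings, assuming an older Spark release (they are ignored or removed in newer versions, which spill automatically under unified memory management); the app name, master URL and generated data are just placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduce-side-spill-sketch")
  .setMaster("local[*]")
  // Allow reduce-side aggregation to spill to disk instead of failing with OOM
  // (default true from Spark 0.9 onwards).
  .set("spark.shuffle.spill", "true")
  // Fraction of executor heap usable for shuffle aggregation before spilling
  // (legacy setting, superseded by unified memory management).
  .set("spark.shuffle.memoryFraction", "0.2")

val sc = new SparkContext(conf)

// The reduce side of this reduceByKey spills to disk if its map of partial
// sums outgrows the configured memory fraction.
val sums = sc.parallelize(1 to 10000000)
  .map(i => (i % 1000, 1L))
  .reduceByKey(_ + _)
println(sums.count())
```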
Other important differences are as follows:
- Spark uses "lazy evaluation" to build a directed acyclic graph (DAG) of consecutive computation stages. This lets the execution plan be optimized, e.g. to minimize data shuffling. In MapReduce, by contrast, this has to be done manually by tuning each MR job (see the first sketch below).
- The Spark ecosystem has established a versatile stack of components to handle SQL (Spark SQL), machine learning (MLlib), streaming (Spark Streaming), and graph processing (GraphX) tasks, whereas in the Hadoop ecosystem you have to install separate packages to do these individual things (see the second sketch below).
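A minimal sketch of lazy evaluation and the DAG: the transformations below only describe the computation, and nothing runs until the final action. The input path, threshold and app name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

// Transformations only build up the DAG; no job has run yet.
val frequentWords = sc.textFile("hdfs:///tmp/words.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))                                // narrow transformation
  .map(word => (word, 1))                                  // narrow transformation
  .reduceByKey(_ + _)                                      // wide transformation -> one shuffle
  .filter { case (_, n) => n > 10 }                        // still lazy

// This action triggers the whole pipeline as one job, planned so the data is
// shuffled exactly once, rather than a hand-tuned chain of separate MR jobs.
println(frequentWords.count())
```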
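And a small sketch of the stacked components: Spark SQL is available from the same installation and session as the core API, with no extra packages to install; the table and column names here are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stack-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical in-memory data registered as a SQL view.
val sales = Seq(("US", 100.0), ("DE", 80.0), ("US", 40.0)).toDF("country", "amount")
sales.createOrReplaceTempView("sales")

// Plain SQL running on the same engine as the RDD code above.
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()
```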