Let's first look at the map-side differences.
Map side of Hadoop MapReduce
- Each map task outputs its data as key-value pairs.
- The output is first stored in an in-memory CIRCULAR BUFFER instead of being written straight to disk.
- The circular buffer is about 100 MB in size. By default, once the buffer is 80% full, the data is spilled to disk; these files are called shuffle spill files.
- Since many map tasks run on a given node, many spill files are created. Hadoop merges each map task's spill files into one big output file that is SORTED and PARTITIONED based on the number of reducers. (The buffer size, spill threshold and merge factor are configurable; see the sketch below.)
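As a concrete illustration, these map-side buffer and spill settings are exposed as job configuration properties. The snippet below is a minimal sketch in Scala against the Hadoop MapReduce API; the property names are real, but the values chosen here and the job name are placeholders, and the defaults can vary between Hadoop versions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
// Size of the in-memory circular (sort) buffer, in MB (default 100).
conf.setInt("mapreduce.task.io.sort.mb", 200)
// Fraction of the buffer at which a background spill to disk starts (default 0.80).
conf.set("mapreduce.map.sort.spill.percent", "0.90")
// How many spill files are merged at once when producing the single sorted,
// partitioned output file of a map task (default 10).
conf.setInt("mapreduce.task.io.sort.factor", 20)

// Hypothetical job using the tuned settings.
val job = Job.getInstance(conf, "map-side-spill-tuning")
```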
Map side of Spark
- Initial Design:
- The map-side output is written to the OS BUFFER CACHE.
- The operating system decides whether the data stays in the buffer cache or gets flushed to DISK.
- Each map task creates as many shuffle spill files as there are reducers.
- Unlike Apache Hadoop, Spark does not merge and partition the shuffle spill files into one big file per map task.
- Example: with 6000 reducers (R) and 2000 map tasks (M), there will be M*R = 2000*6000 = 12 million shuffle files, because each map task creates one file per reducer. This caused performance degradation.
- This was only the initial design of Apache Spark; later releases changed it (see the sketch below).
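As a hedged sketch, the snippet below shows the arithmetic from the example plus the legacy settings that older Spark releases (roughly 0.8 through 1.5) offered to tame the file explosion: consolidating shuffle outputs, and the sort-based shuffle manager that became the default around Spark 1.2. The app name and master URL are placeholders, and these particular properties no longer exist in current Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val numMapTasks = 2000L   // M
val numReducers = 6000L   // R
// Worst-case number of shuffle files with the original hash-based shuffle.
println(numMapTasks * numReducers)   // 12000000

val conf = new SparkConf()
  .setAppName("shuffle-consolidation-sketch")
  .setMaster("local[*]")
  // Legacy option: reuse shuffle files across map tasks running on the same
  // cores, so the file count scales with cores * R instead of M * R.
  .set("spark.shuffle.consolidateFiles", "true")
  // Sort-based shuffle writes a single sorted output file (plus an index)
  // per map task, much like Hadoop's merged map output.
  .set("spark.shuffle.manager", "sort")

val sc = new SparkContext(conf)
```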
- Reduce side of Hadoop MR:
- Reduce tasks FETCH the intermediate (shuffle) files produced on the map side; this copy phase can begin as soon as individual map tasks finish. The fetched data is loaded into an in-memory buffer.
- If the buffer reaches about 70% of its limit, the data is spilled to disk (the thresholds are configurable; see the sketch after this list).
- The spills are then merged to form bigger files.
- Finally, the reduce method is invoked.
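As with the map side, the reduce-side buffering is driven by configuration properties. The sketch below again uses the Hadoop API from Scala; the property names are real, the chosen values are placeholders, and the quoted defaults are approximate and version-dependent.

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Fraction of the reducer heap used to hold fetched map outputs (default ~0.70).
conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.70")
// Usage threshold of that buffer at which the in-memory merge / spill to disk
// kicks in (default ~0.66).
conf.set("mapreduce.reduce.shuffle.merge.percent", "0.66")
// Number of parallel copier threads fetching map outputs (default 5).
conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10)
```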
- Reduce side of Apache Spark:
- Reduce tasks PULL the intermediate (shuffle) files from the map side, and only after all map tasks of the stage have finished.
- The fetched data is written directly to memory.
- If the data doesn't fit in memory, it is spilled to disk from Spark 0.9 onwards; before that, an OOM (out of memory) exception would be thrown. (The legacy spill settings are sketched below.)
- Finally, the reducer functionality is invoked.
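A minimal sketch of those legacy spill settings, assuming an older Spark release (they are ignored or removed in newer versions, which spill automatically under unified memory management); the app name, master URL and generated data are just placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduce-side-spill-sketch")
  .setMaster("local[*]")
  // Allow reduce-side aggregation to spill to disk instead of failing with OOM
  // (default true from Spark 0.9 onwards).
  .set("spark.shuffle.spill", "true")
  // Fraction of executor heap usable for shuffle aggregation before spilling
  // (legacy setting, superseded by unified memory management).
  .set("spark.shuffle.memoryFraction", "0.2")

val sc = new SparkContext(conf)

// The reduce side of this reduceByKey spills to disk if its map of partial
// sums outgrows the configured memory fraction.
val sums = sc.parallelize(1 to 10000000)
  .map(i => (i % 1000, 1L))
  .reduceByKey(_ + _)
println(sums.count())
```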
Other important differences are as follows:
- Spark uses "lazy evaluation" to build a directed acyclic graph (DAG) of consecutive computation stages. This lets the execution plan be optimized, e.g. to minimize data shuffling. In MapReduce, by contrast, this has to be done manually by tuning each MR job (see the first sketch below).
- The Spark ecosystem has established a versatile stack of components to handle SQL (Spark SQL), machine learning (MLlib), streaming (Spark Streaming), and graph processing (GraphX) tasks, whereas in the Hadoop ecosystem you have to install separate packages to do these individual things (see the second sketch below).
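A minimal sketch of lazy evaluation and the DAG: the transformations below only describe the computation, and nothing runs until the final action. The input path, threshold and app name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

// Transformations only build up the DAG; no job has run yet.
val frequentWords = sc.textFile("hdfs:///tmp/words.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))                                // narrow transformation
  .map(word => (word, 1))                                  // narrow transformation
  .reduceByKey(_ + _)                                      // wide transformation -> one shuffle
  .filter { case (_, n) => n > 10 }                        // still lazy

// This action triggers the whole pipeline as one job, planned so the data is
// shuffled exactly once, rather than a hand-tuned chain of separate MR jobs.
println(frequentWords.count())
```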
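And a small sketch of the stacked components: Spark SQL is available from the same installation and session as the core API, with no extra packages to install; the table and column names here are made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stack-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical in-memory data registered as a SQL view.
val sales = Seq(("US", 100.0), ("DE", 80.0), ("US", 40.0)).toDF("country", "amount")
sales.createOrReplaceTempView("sales")

// Plain SQL running on the same engine as the RDD code above.
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()
```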