I was trying out a sample MapReduce job written in Python, using Hadoop Streaming on the Cloudera QuickStart VM, but I am stuck partway through.
Here is my mapper code:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
Here is my reducer code:
#!/usr/bin/env python
import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)
source: https://github.com/tomwhite/hadoop-book/tree/master/ch02-mr-intro/src/main/python
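To rule out the scripts themselves, the map, sort, and reduce steps can be checked locally without Hadoop. The following is only a minimal sketch under some assumptions: `make_record`, `map_line`, `reduce_pairs`, `run_local`, and `SAMPLE` are hypothetical names, the records are synthetic lines built to match the offsets the mapper reads (not real weather data), and the reducer uses `itertools.groupby` in place of the `last_key` bookkeeping, which should give the same result on sorted input. It is written to run under both Python 2 and Python 3.

```python
from itertools import groupby
from operator import itemgetter
import re


def make_record(year, temp, quality):
    """Build a synthetic 93-character line (hypothetical test data)."""
    chars = ["0"] * 93
    chars[15:19] = year      # 4 characters, read by the mapper as the year
    chars[87:92] = temp      # 5 characters including the sign, e.g. "+0022"
    chars[92:93] = quality   # 1 character quality code
    return "".join(chars)


def map_line(line):
    """Same filter as the mapper: return (year, temp) or None if rejected."""
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", q):
        return year, temp
    return None


def reduce_pairs(sorted_pairs):
    """Yield (year, max temperature) for pairs already sorted by year."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, max(int(val) for _, val in group)


def run_local(lines):
    """Map, sort by key (standing in for Hadoop's shuffle), then reduce."""
    mapped = [kv for kv in map(map_line, lines) if kv is not None]
    mapped.sort(key=itemgetter(0))
    return list(reduce_pairs(mapped))


SAMPLE = [
    make_record("1950", "+0022", "1"),
    make_record("1950", "-0011", "1"),
    make_record("1949", "+0111", "1"),
    make_record("1949", "+9999", "1"),  # missing reading, filtered out
    make_record("1949", "+0078", "2"),  # quality code not in [01459], filtered out
]

if __name__ == "__main__":
    print(run_local(SAMPLE))  # [('1949', 111), ('1950', 22)]
```

If this produces the expected maxima, the map and reduce logic is probably fine and the problem more likely lies in how the job is submitted.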
This is the command that I am executing in order to run the mapreduce job:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/sample.txt \
    -output /user/cloudera/output \
    -mapper /home/cloudera/streaming-sample/max_temperature_map.py \
    -reducer /home/cloudera/streaming-sample/max_temperature_reduce.py
This is the error log snippet that I am getting:
Please help me understand what I am doing wrong here.