Hey, I'm fairly new to the world of Big Data.
I came across this tutorial:
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Now I'm trying to run it on my own Hadoop cluster. I launched the job with the following command:
python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic
This is what I get:
HADOOP: Running job: job_1369345811890_0245
HADOOP: Job job_1369345811890_0245 running in uber mode : false
HADOOP: map 0% reduce 0%
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_0, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_0, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_1, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Container killed by the ApplicationMaster.
HADOOP:
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_1, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_2, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_2, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: map 100% reduce 0%
HADOOP: Job job_1369345811890_0245 failed with state FAILED due to: Task failed task_1369345811890_0245_m_000001
HADOOP: Job failed as tasks failed. failedMaps:1 failedReduces:0
HADOOP:
HADOOP: Counters: 6
HADOOP: Job Counters
HADOOP: Failed map tasks=7
HADOOP: Launched map tasks=8
HADOOP: Other local map tasks=6
HADOOP: Data-local map tasks=2
HADOOP: Total time spent by all maps in occupied slots (ms)=32379
HADOOP: Total time spent by all reduces in occupied slots (ms)=0
HADOOP: Job not Successful!
HADOOP: Streaming Command Failed!
STDOUT: packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.1.jar] /tmp/streamjob3272348678857116023.jar tmpDir=null
Traceback (most recent call last):
File "density.py", line 34, in <module>
MRDensity.run()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/job.py", line 344, in run
mr_job.run_job()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/job.py", line 381, in run_job
runner.run()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/runner.py", line 316, in run
self._run()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/hadoop.py", line 175, in _run
self._run_job_in_hadoop()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/hadoop.py", line 325, in _run_job_in_hadoop
raise CalledProcessError(step_proc.returncode, streaming_args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'jar', '/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.1.jar', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/input', '-output', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/output', '-cacheFile', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/files/density.py#density.py', '-cacheArchive', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/files/mrjob.tar.gz#mrjob.tar.gz', '-mapper', 'python density.py --step-num=0 --mapper --protocol json --output-protocol json --input-protocol raw_value', '-jobconf', 'mapred.reduce.tasks=0']' returned non-zero exit status 1
Note: As suggested on some other forums, I've included
#!/usr/bin/python
at the beginning of both of my Python files, density.py and track.py. That seems to have worked for most people, but I still keep getting the exceptions above.
Edit: I copied the definition of one of the functions used by density.py, which was originally defined in a separate file, track.py, into density.py itself. The job then ran successfully. But it would really help if someone could explain why this happens.
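The inlining fix above works because Hadoop Streaming only ships the files it is told to ship; a sibling module like track.py never reaches the task nodes, the Python mapper dies on the import before reading any input, and streaming reports only the exit code. This failure mode can be reproduced locally (a demonstration sketch, not the poster's actual job; the module name is made up):

```python
import subprocess
import sys
import textwrap

# A stand-in streaming mapper that imports a module not present on the
# "task node": Python exits with status 1 before processing any input,
# which is exactly what PipeMapRed.waitOutputThreads() surfaces as
# "subprocess failed with code 1".
broken_mapper = textwrap.dedent("""
    import track_helpers_not_on_node  # stand-in for an unshipped track.py
""")
proc = subprocess.run(
    [sys.executable, "-c", broken_mapper],
    input=b"some input line\n",
    capture_output=True,
)
print(proc.returncode)   # non-zero: the mapper died on the import
print(proc.stderr.decode())
```

Running the mapper command by hand like this exposes the real traceback that the Hadoop-side error message hides.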
I have a 4-node cluster with R, Hadoop, and rmr2 installed on all the nodes. Running a sample job produces the following errors. I am not sure where to look; any insights would be very helpful.
Thanks,
Babu
14/05/20 12:02:47 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/RtmpcjltyX/rmr-local-env66c625272a9d, /tmp/RtmpcjltyX/rmr-global-env66c65806b668, /tmp/RtmpcjltyX/rmr-streaming-map66c62b1b7c09, /tmp/RtmpcjltyX/rmr-streaming-reduce66c6674c771a] [/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar] /tmp/streamjob2642347914010493291.jar tmpDir=null
14/05/20 12:02:48 INFO client.RMProxy: Connecting to ResourceManager at HadoopS.dbstraining.local/192.168.100.40:8032
14/05/20 12:02:48 INFO client.RMProxy: Connecting to ResourceManager at HadoopS.dbstraining.local/192.168.100.40:8032
14/05/20 12:02:49 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/20 12:02:49 INFO mapreduce.JobSubmitter: number of splits:2
14/05/20 12:02:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1395320373286_0043
14/05/20 12:02:49 INFO impl.YarnClientImpl: Submitted application application_1395320373286_0043
14/05/20 12:02:49 INFO mapreduce.Job: The url to track the job: http://HadoopS.dbstraining.local:8088/proxy/application_1395320373286_0043/
14/05/20 12:02:49 INFO mapreduce.Job: Running job: job_1395320373286_0043
14/05/20 12:02:56 INFO mapreduce.Job: Job job_1395320373286_0043 running in uber mode : false
14/05/20 12:02:56 INFO mapreduce.Job: map 0% reduce 0%
14/05/20 12:03:00 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:02 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:07 INFO mapreduce.Job: map 50% reduce 0%
14/05/20 12:03:08 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:13 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:19 INFO mapreduce.Job: map 100% reduce 100%
14/05/20 12:03:19 INFO mapreduce.Job: Job job_1395320373286_0043 failed with state FAILED due to: Task failed task_1395320373286_0043_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/05/20 12:03:19 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=93541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5678
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=5
Launched map tasks=6
Other local map tasks=4
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=23439
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
Map input records=1
Map output records=0
Map output bytes=0
Map output materialized bytes=192
Input split bytes=110
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=33
CPU time spent (ms)=850
Physical memory (bytes) snapshot=484114432
Virtual memory (bytes) snapshot=1568194560
Total committed heap usage (bytes)=582287360
File Input Format Counters
Bytes Read=5568
14/05/20 12:03:19 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
14/05/20 12:03:27 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: ‘hdfs://HadoopS.dbstraining.local:8020/tmp/file66c65a3a2c24’ to trash at: hdfs://HadoopS.dbstraining.local:8020/user/bmathew/.Trash/Current
This error occurs when the mapper is unable to read the input file.
Try with a simple input file first and check.
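A quick way to check this is to bypass Hadoop entirely and pipe a few input lines into the mapper by hand; any traceback then shows up directly on stderr instead of being hidden behind "subprocess failed with code 1". A sketch with a stand-in uppercasing mapper (with the real job it would be `head -n 5 tiny.dat | python density.py --step-num=0 --mapper`):

```shell
# Pipe a small input sample straight into the mapper process, exactly as
# Hadoop Streaming would. The stand-in one-liner mapper keeps the sketch
# self-contained; substitute your real mapper command.
printf 'line one\nline two\n' \
  | python3 -c 'import sys; sys.stdout.write(sys.stdin.read().upper())'
echo "mapper exit status: $?"
```

If the exit status printed here is non-zero, the same failure will occur inside the streaming job.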
Please follow the bug reporting guidelines in the wiki. From this log, nobody other than the other user who replied can make a guess as to what went wrong, certainly not I.
Dear all,
I too have a 4-node Hadoop cluster (Cloudera CDH-5.2.1-1.cdh5.2.1.p0.12).
Running the wordcount example in cluster mode, I get the same error as the user who started this thread, but randomly. By "randomly" I mean it can occur once during a run of the wordcount code, or not at all. Sometimes the failing tasks get re-executed properly on another node, leading to a successful job, but sometimes the job fails because of too many failed tasks.
In local mode the code runs as expected.
The file I use is from: http://www.textfiles.com/politics/0814gulf.txt
Before running the code, the file was cleaned using dos2unix and a control-character removal tool. As I said, in local mode everything runs fine, and with the same file the wordcount that ships with Hadoop works as intended.
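The cleanup step described above can be reproduced with standard tools; this sketch uses `tr` in place of dos2unix and the control-character remover, and builds its own small demo file (file names are illustrative):

```shell
# Create a small demo file with a DOS line ending (\r\n) and a stray
# control byte (BEL), standing in for the raw downloaded text file.
printf 'first line\r\nsec\007ond line\n' > gulf_demo.txt

# Strip carriage returns, then delete any remaining non-printable
# control characters. Stray \r or control bytes in the input are a
# common reason a streaming mapper dies with exit code 1.
tr -d '\r' < gulf_demo.txt | tr -cd '[:print:]\n\t' > gulf_demo.clean.txt
cat gulf_demo.clean.txt
```

Since the error here is random rather than deterministic, though, bad input bytes are less likely to be the cause than in the earlier posts.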
The following R packages are installed:
Packages in library /usr/local/lib/R/site-library:
bitops Bitwise Operations
caTools Tools: moving window statistics, GIF, Base64,
ROC AUC, etc.
devtools Tools to make developing R code easier
digest Create Cryptographic Hash Digests of R Objects
evaluate Parsing and evaluation tools that provide more
details than the default.
functional Curry, Compose, and other higher-order
functions
httr Tools for Working with URLs and HTTP
iterators Iterator construct for R
itertools Iterator Tools
jsonlite A Robust, High Performance JSON Parser and
Generator for R
manipulate Interactive Plots for RStudio
memoise Memoise functions
mime Map filenames to MIME types
plyr Tools for splitting, applying and combining
data
R6 Classes with reference semantics
Rcpp Seamless R and C++ Integration
RCurl General network (HTTP/FTP/...) client interface
for R
reshape2 Flexibly Reshape Data: A Reboot of the Reshape
Package.
rhdfs R and Hadoop Distributed Filesystem
rJava Low-level R to Java interface
RJSONIO Serialize R objects to JSON, JavaScript Object
Notation
rmr2 R and Hadoop Streaming Connector
rstudio Tools and Utilities for RStudio
rstudioapi Safely access the RStudio API.
stringr Make it easier to work with strings.
whisker {{mustache}} for R, logicless templating
Here the stderr + syslog from one of the failed jobs:
Log Type: stderr
Log Length: 1130
Loading objects:
wordcount
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
Please review your hadoop settings. See help(hadoop.settings)
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Loading required package: methods
Loading required package: rmr2
Loading required package: rJava
Loading required package: rhdfs
HADOOP_CMD=/usr/bin/hadoop
Be sure to run hdfs.init()
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Log Type: stdout
Log Length: 0
Log Type: syslog
Log Length: 8901
Showing 4096 bytes of 8901 total.
:NA [rec/s]
2015-01-05 15:07:57,389 INFO [main] org.apache.hadoop.streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2015-01-05 15:07:57,407 INFO [main] org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2015-01-05 15:07:58,981 INFO [Thread-12] org.apache.hadoop.streaming.PipeMapRed: Records R/W=8392/1
2015-01-05 15:07:59,017 INFO [Thread-13] org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2015-01-05 15:07:59,071 WARN [Thread-12] org.apache.hadoop.streaming.PipeMapRed: java.io.EOFException
2015-01-05 15:07:59,072 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:211)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:56)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2015-01-05 15:07:59,075 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:211)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:56)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2015-01-05 15:07:59,080 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2015-01-05 15:07:59,087 WARN [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://babar.hadoop:8020/user/eric.falk/gulf.out/_temporary/1/_temporary/attempt_1418911299833_0092_m_000001_0
2015-01-05 15:07:59,192 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2015-01-05 15:07:59,193 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2015-01-05 15:07:59,193 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
Thanks, Eric
I am facing the same issue. Did you find a solution?
I was facing a similar issue while working with a Python script. Just add a shebang line such as #!/usr/bin/env python
at the beginning of your scripts. The same can be done for other scripting languages.
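To illustrate the suggestion above, here is a minimal, hypothetical word-count mapper (not the tutorial's density.py) with the shebang in place; without that first line the worker's shell cannot tell which interpreter to use, the subprocess dies, and Streaming surfaces it as a PipeMapRed.waitOutputThreads() failure:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch (hypothetical example).
# The shebang on the first line tells the worker node which
# interpreter to run when Hadoop execs the script directly.
import sys

def map_line(line):
    """Turn one input line into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

if __name__ == "__main__":
    for line in sys.stdin:
        for key, value in map_line(line):
            # Streaming expects one "key<TAB>value" pair per output line
            print("%s\t%s" % (key, value))
```

Marking the script executable (chmod +x) matters too, since Hadoop Streaming executes it directly.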
I'm trying to run a mapper for co-occurrence on AWS EMR.
Error log:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
System Log:
2019-04-18 00:34:29,518 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-1-199.us-east-2.compute.internal/172.31.1.199:8032
2019-04-18 00:34:29,741 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-1-199.us-east-2.compute.internal/172.31.1.199:8032
2019-04-18 00:34:30,046 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://abhavtwitterdataset/mr/mapper2.py' for reading
2019-04-18 00:34:30,240 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://abhavtwitterdataset/mr/reducer.py' for reading
2019-04-18 00:34:30,815 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2019-04-18 00:34:30,822 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 59c952a855a0301a4f9e1b2736510df04a640bd3]
2019-04-18 00:34:30,919 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input files to process : 4
2019-04-18 00:34:31,403 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2019-04-18 00:34:31,559 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1555547086751_0002
2019-04-18 00:34:31,802 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1555547086751_0002
2019-04-18 00:34:31,876 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-1-199.us-east-2.compute.internal:20888/proxy/application_1555547086751_0002/
2019-04-18 00:34:31,878 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1555547086751_0002
2019-04-18 00:34:41,076 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1555547086751_0002 running in uber mode : false
2019-04-18 00:34:41,078 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2019-04-18 00:34:59,350 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_0, Status : FAILED
2019-04-18 00:35:01,388 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000003_0, Status : FAILED
2019-04-18 00:35:09,627 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000000_0, Status : FAILED
2019-04-18 00:35:11,646 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_0, Status : FAILED
2019-04-18 00:35:12,657 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000004_0, Status : FAILED
2019-04-18 00:35:13,667 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_0, Status : FAILED
2019-04-18 00:35:15,682 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_1, Status : FAILED
2019-04-18 00:35:15,683 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_0, Status : FAILED
2019-04-18 00:35:30,760 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_1, Status : FAILED
2019-04-18 00:35:30,761 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_2, Status : FAILED
2019-04-18 00:35:34,782 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000003_1, Status : FAILED
2019-04-18 00:35:39,816 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000000_1, Status : FAILED
2019-04-18 00:35:40,824 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000004_1, Status : FAILED
2019-04-18 00:35:41,829 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_1, Status : FAILED
2019-04-18 00:35:45,851 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_1, Status : FAILED
2019-04-18 00:35:45,853 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_2, Status : FAILED
2019-04-18 00:36:00,941 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_2, Status : FAILED
2019-04-18 00:36:00,944 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_2, Status : FAILED
2019-04-18 00:36:03,957 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 100%
2019-04-18 00:36:04,966 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1555547086751_0002 failed with state FAILED due to: Task failed task_1555547086751_0002_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
2019-04-18 00:36:05,073 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 17
Job Counters
Failed map tasks=19
Killed map tasks=8
Killed reduce tasks=3
Launched map tasks=24
Other local map tasks=17
Data-local map tasks=7
Total time spent by all maps in occupied slots (ms)=20982096
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=437127
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=437127
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=671427072
Total megabyte-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
2019-04-18 00:36:05,074 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful!
Code of the mapper for co-occurrence:
#!/usr/bin/env python
"""mapper2.py"""
import sys
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer

nltk.data.path.append("/tmp")
nltk.download('punkt', download_dir="/tmp")

for line in sys.stdin:
    # strip URLs
    line = re.sub(r'http\S+', '', line)
    # replace 't with t (e.g. don't -> dont)
    line = re.sub(r"'t", 't', line)
    stopwords = ['about', 'all', 'along', 'also', 'an', 'any', 'and', 'are', 'around', 'after', 'according', 'another',
                 'already', 'because', 'been', 'being', 'but', 'become', 'can', 'could', 'called',
                 'during', 'do', 'dont', 'does', 'doesn', 'did', 'didnt', 'etc', 'for', 'from', 'far',
                 'get', 'going', 'had', 'has', 'have', 'he', 'her', 'here', 'him', 'his', 'how',
                 'into', 'isnt', 'its', 'just', 'let', 'like', 'may', 'more', 'must', 'most',
                 'not', 'now', 'new', 'next', 'one', 'other', 'our', 'out', 'over', 'own', 'put', 'right',
                 'say', 'said', 'should', 'she', 'since', 'some', 'still', 'such',
                 'take', 'that', 'than', 'the', 'their', 'them', 'then', 'there', 'these',
                 'they', 'this', 'those', 'through', 'time', 'told', 'thing',
                 'use', 'until', 'via', 'very', 'under',
                 'was', 'way', 'were', 'what', 'which', 'when', 'where', 'who', 'why', 'will', 'with', 'would', 'wouldnt',
                 'yes', 'you', 'your']
    lines = nltk.sent_tokenize(line)
    for line in lines:
        # remove punctuation
        line = re.sub(r'[^\w\s]', ' ', line)
        # split the line into words
        words = line.split()
        for k in range(len(words) - 1):
            # stemming (disabled)
            # l = ps.stem(words[k])
            l = words[k]
            l = l.lower()
            if l not in stopwords and len(l) > 2 and not l.isdigit():
                for j in words[k + 1:]:
                    r = j.lower()
                    if r in stopwords or l == r or len(r) <= 2 or r.isdigit():
                        continue
                    key = l + "-" + r
                    print("%s\t%s" % (key.lower(), 1))
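One way to catch this class of bug before paying for an EMR run is to exercise the regexes locally. Note that the backslashes in the script must survive pasting (http\S+, [^\w\s], %s\t%s); a pattern that arrives as httpS+ matches nothing useful and a %st%s format emits no tab, which breaks the Streaming key/value contract. A quick local check (the sample strings are made up):

```python
import re

# URL stripping: with the backslash intact, \S+ consumes the rest of the URL.
sample = "check this out http://example.com/abc today"
print(re.sub(r'http\S+', '', sample))

# Punctuation removal: [^\w\s] replaces anything that is neither a word
# character nor whitespace with a space.
print(re.sub(r'[^\w\s]', ' ', "don't stop!"))
```

Piping a few sample lines through the mapper (cat sample.txt | ./mapper2.py) before submitting the job surfaces the same traceback the cluster would otherwise bury in task logs.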
I’m following this tutorial:
http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/
I put cities.txt in /user/root/ and the R script as following :
#!/usr/bin/env Rscript
f <- file("stdin")
open(f)
state_data = read.table(f)
summary(state_data)
and then run the command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming-2.7.1.2.3.4.0-3485.jar -input /user/root/cities.txt -output /user/root/streamer -mapper /bin/cat -reducer script.R -numReduceTasks 2 -file script.R
Map works till 100% and reduce shows this error:
16/03/01 11:06:30 INFO mapreduce.Job: map 100% reduce 50%
16/03/01 11:06:34 INFO mapreduce.Job: Task Id : attempt_1456773989186_0009_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Has anyone encountered this before, or have any idea what is going on?
Thanks.
I’m trying to run my own mapper and reducer Python scripts using Hadoop Streaming on my cluster built on VMware Workstation VMs.
Hadoop version: 2.7, Python: 3.5, OS: CentOS 7.2 on all the VMs.
I have a separate machine that plays the role of a client application host and submits the MapReduce job to the resource manager. The map and reduce scripts are stored there as well.
I’m using the following hadoop command to run a job:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -output result1 -input /user/hadoop/hr/profiles -file /home/hadoop/map.py -mapper map.py -file /home/hadoop/reduce.py -reducer reduce.py
I also tried to insert the python3 interpreter before the -mapper and -reducer scripts:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -output result1 -input /user/hadoop/hr/profiles -file /home/hadoop/map.py -mapper "python3.5 map.py" -file /home/hadoop/reduce.py -reducer "python3.5 reduce.py"
However, the job always fails and I’m still getting the same error messages in the log:
2016-10-07 21:57:10,485 INFO [IPC Server handler 1 on 41498] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1475888525921_0004_m_000001_0 is : 0.0
2016-10-07 21:57:10,520 FATAL [IPC Server handler 2 on 41498] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1475888525921_0004_m_000001_0 - exited : java.lang.RuntimeException: **PipeMapRed.waitOutputThreads(): subprocess failed with code 127**
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-10-07 21:57:10,520 INFO [IPC Server handler 2 on 41498] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1475888525921_0004_m_000001_0: Error: java.lang.RuntimeException: **PipeMapRed.waitOutputThreads(): subprocess failed with code 127**
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-10-07 21:57:10,521 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1475888525921_0004_m_000001_0: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-10-07 21:57:10,523 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1475888525921_0004_m_000001_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2016-10-07 21:57:10,523 INFO [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1475888525921_0004_01_000003 taskAttempt attempt_1475888525921_0004_m_000001_0
2016-10-07 21:57:10,524 INFO [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1475888525921_0004_m_000001_0
2016-10-07 21:57:10,524 INFO [ContainerLauncher #2] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : slave-1:56838
The Python 3.5 interpreter is installed on all VMs across the cluster, and its directory is on the system PATH as well. I can launch the interpreter on all nodes with the python3.5 command.
I tried to run the same command with the same scripts on my NameNode and it worked. So it seems like it’s an HDFS security issue.
I've already read many posts related to this problem and tried everything that was suggested, but still no progress.
I’ve tried the following already:
- disabling dfs permissions
- redirecting stdout to stderr in the map and reduce scripts
- since I’m using VMs, I reduced RAM and CPU requirements for containers: 256MB and 1 core
- adding the python3 interpreter before the -mapper and -reducer options in the hadoop jar command
- I replaced CRLF with Unix LF in my scripts
All my scripts have a #!/opt/rh/rh-python35/root/usr/bin/python3.5 line pointing to the interpreter's location. I've tested my scripts several times; they work fine.
I’m completely new to this topic and now I’m stuck. Please, if you know how to fix this, share your experience. Thanks in advance.
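On the exit code in the logs above: 127 is the POSIX shell's "command not found" status. So when Streaming reports "subprocess failed with code 127", the worker nodes most likely cannot find the interpreter named on the command line or in the shebang, or a CRLF line ending survived somewhere, making the kernel look for an interpreter literally named "python3.5\r". A small sketch showing where that code comes from (the command name is deliberately fake):

```python
import subprocess

# Ask a POSIX shell to run a command that does not exist; the shell
# reports this with exit status 127 -- the same code Hadoop Streaming
# relays in "subprocess failed with code 127".
result = subprocess.run(
    ["/bin/sh", "-c", "no_such_interpreter_xyz mapper.py"],
    capture_output=True,
)
print(result.returncode)  # 127
```

By contrast, exit code 1 (as in the earlier comments) means the interpreter started but the script itself crashed, which is why redirecting the script's stderr or running it locally against sample input is the usual next step.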