Hey, I'm fairly new to the world of Big Data.
I came across this tutorial:
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Now I'm trying to run it on my own Hadoop cluster. I launched the job with the following command:
python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic
This is what I get:
HADOOP: Running job: job_1369345811890_0245
HADOOP: Job job_1369345811890_0245 running in uber mode : false
HADOOP: map 0% reduce 0%
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_0, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_0, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_1, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Container killed by the ApplicationMaster.
HADOOP:
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_1, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000000_2, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: Task Id : attempt_1369345811890_0245_m_000001_2, Status : FAILED
HADOOP: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
HADOOP: at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
HADOOP: at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
HADOOP: at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
HADOOP: at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
HADOOP: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
HADOOP: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
HADOOP: at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
HADOOP: at java.security.AccessController.doPrivileged(Native Method)
HADOOP: at javax.security.auth.Subject.doAs(Subject.java:415)
HADOOP: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
HADOOP: at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
HADOOP:
HADOOP: map 100% reduce 0%
HADOOP: Job job_1369345811890_0245 failed with state FAILED due to: Task failed task_1369345811890_0245_m_000001
HADOOP: Job failed as tasks failed. failedMaps:1 failedReduces:0
HADOOP:
HADOOP: Counters: 6
HADOOP: Job Counters
HADOOP: Failed map tasks=7
HADOOP: Launched map tasks=8
HADOOP: Other local map tasks=6
HADOOP: Data-local map tasks=2
HADOOP: Total time spent by all maps in occupied slots (ms)=32379
HADOOP: Total time spent by all reduces in occupied slots (ms)=0
HADOOP: Job not Successful!
HADOOP: Streaming Command Failed!
STDOUT: packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.2.1.jar] /tmp/streamjob3272348678857116023.jar tmpDir=null
Traceback (most recent call last):
File "density.py", line 34, in <module>
MRDensity.run()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/job.py", line 344, in run
mr_job.run_job()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/job.py", line 381, in run_job
runner.run()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/runner.py", line 316, in run
self._run()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/hadoop.py", line 175, in _run
self._run_job_in_hadoop()
File "/usr/lib/python2.6/site-packages/mrjob-0.2.4-py2.6.egg/mrjob/hadoop.py", line 325, in _run_job_in_hadoop
raise CalledProcessError(step_proc.returncode, streaming_args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'jar', '/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.1.jar', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/input', '-output', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/output', '-cacheFile', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/files/density.py#density.py', '-cacheArchive', 'hdfs:///user/E824259/tmp/mrjob/density.E824259.20130611.053850.343441/files/mrjob.tar.gz#mrjob.tar.gz', '-mapper', 'python density.py --step-num=0 --mapper --protocol json --output-protocol json --input-protocol raw_value', '-jobconf', 'mapred.reduce.tasks=0']' returned non-zero exit status 1
Note: As suggested on some other forums, I've included
#!/usr/bin/python
at the beginning of both of my Python files, density.py and track.py. That seems to have worked for most people, but I still keep getting the exceptions above.
Edit: I copied the definition of one of the functions used by density.py, which was originally defined in a separate file, track.py, into density.py itself. The job then ran successfully. But it would really help if someone could explain why this happens.
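The inlining fix above works because Hadoop Streaming only ships the files it is told to ship; a sibling module like track.py never reaches the task nodes, the Python mapper dies on the import before reading any input, and streaming reports only the exit code. This failure mode can be reproduced locally (a demonstration sketch, not the poster's actual job; the module name is made up):

```python
import subprocess
import sys
import textwrap

# A stand-in streaming mapper that imports a module not present on the
# "task node": Python exits with status 1 before processing any input,
# which is exactly what PipeMapRed.waitOutputThreads() surfaces as
# "subprocess failed with code 1".
broken_mapper = textwrap.dedent("""
    import track_helpers_not_on_node  # stand-in for an unshipped track.py
""")
proc = subprocess.run(
    [sys.executable, "-c", broken_mapper],
    input=b"some input line\n",
    capture_output=True,
)
print(proc.returncode)   # non-zero: the mapper died on the import
print(proc.stderr.decode())
```

Running the mapper command by hand like this exposes the real traceback that the Hadoop-side error message hides.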
I have a 4-node cluster with R, Hadoop, and rmr2 installed on all the nodes. Running a sample job produces the following errors. I am not sure where to look; any insights would be very helpful.
Thanks,
Babu
14/05/20 12:02:47 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/RtmpcjltyX/rmr-local-env66c625272a9d, /tmp/RtmpcjltyX/rmr-global-env66c65806b668, /tmp/RtmpcjltyX/rmr-streaming-map66c62b1b7c09, /tmp/RtmpcjltyX/rmr-streaming-reduce66c6674c771a] [/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar] /tmp/streamjob2642347914010493291.jar tmpDir=null
14/05/20 12:02:48 INFO client.RMProxy: Connecting to ResourceManager at HadoopS.dbstraining.local/192.168.100.40:8032
14/05/20 12:02:48 INFO client.RMProxy: Connecting to ResourceManager at HadoopS.dbstraining.local/192.168.100.40:8032
14/05/20 12:02:49 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/20 12:02:49 INFO mapreduce.JobSubmitter: number of splits:2
14/05/20 12:02:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1395320373286_0043
14/05/20 12:02:49 INFO impl.YarnClientImpl: Submitted application application_1395320373286_0043
14/05/20 12:02:49 INFO mapreduce.Job: The url to track the job: http://HadoopS.dbstraining.local:8088/proxy/application_1395320373286_0043/
14/05/20 12:02:49 INFO mapreduce.Job: Running job: job_1395320373286_0043
14/05/20 12:02:56 INFO mapreduce.Job: Job job_1395320373286_0043 running in uber mode : false
14/05/20 12:02:56 INFO mapreduce.Job: map 0% reduce 0%
14/05/20 12:03:00 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:02 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:07 INFO mapreduce.Job: map 50% reduce 0%
14/05/20 12:03:08 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:13 INFO mapreduce.Job: Task Id : attempt_1395320373286_0043_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)
14/05/20 12:03:19 INFO mapreduce.Job: map 100% reduce 100%
14/05/20 12:03:19 INFO mapreduce.Job: Job job_1395320373286_0043 failed with state FAILED due to: Task failed task_1395320373286_0043_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/05/20 12:03:19 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=93541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5678
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=5
Launched map tasks=6
Other local map tasks=4
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=23439
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
Map input records=1
Map output records=0
Map output bytes=0
Map output materialized bytes=192
Input split bytes=110
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=33
CPU time spent (ms)=850
Physical memory (bytes) snapshot=484114432
Virtual memory (bytes) snapshot=1568194560
Total committed heap usage (bytes)=582287360
File Input Format Counters
Bytes Read=5568
14/05/20 12:03:19 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
14/05/20 12:03:27 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: ‘hdfs://HadoopS.dbstraining.local:8020/tmp/file66c65a3a2c24’ to trash at: hdfs://HadoopS.dbstraining.local:8020/user/bmathew/.Trash/Current
This error occurs when the mapper is unable to read the input file.
Try with a simple input file first and check.
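A quick way to check this is to bypass Hadoop entirely and pipe a few input lines into the mapper by hand; any traceback then shows up directly on stderr instead of being hidden behind "subprocess failed with code 1". A sketch with a stand-in uppercasing mapper (with the real job it would be `head -n 5 tiny.dat | python density.py --step-num=0 --mapper`):

```shell
# Pipe a small input sample straight into the mapper process, exactly as
# Hadoop Streaming would. The stand-in one-liner mapper keeps the sketch
# self-contained; substitute your real mapper command.
printf 'line one\nline two\n' \
  | python3 -c 'import sys; sys.stdout.write(sys.stdin.read().upper())'
echo "mapper exit status: $?"
```

If the exit status printed here is non-zero, the same failure will occur inside the streaming job.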
Please follow the bug reporting guidelines in the wiki. From this log, nobody other than the other user who replied can make a guess as to what went wrong, certainly not I.
Dear all,
I too have a 4-node Hadoop cluster (Cloudera CDH-5.2.1-1.cdh5.2.1.p0.12).
Running the wordcount example in cluster mode, I get the same error as the user who started this thread, but randomly. By "randomly" I mean it can occur once during a run of the wordcount code, or not at all. Sometimes the failing tasks get re-executed properly on another node, leading to a successful job, but sometimes the job fails because of too many failed tasks.
In local mode the code runs as expected.
The file I use is from: http://www.textfiles.com/politics/0814gulf.txt
Before running the code, the file was cleaned using dos2unix and a control-character removal tool. As I said, in local mode everything runs fine, and with the same file the wordcount that ships with Hadoop works as intended.
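The cleanup step described above can be reproduced with standard tools; this sketch uses `tr` in place of dos2unix and the control-character remover, and builds its own small demo file (file names are illustrative):

```shell
# Create a small demo file with a DOS line ending (\r\n) and a stray
# control byte (BEL), standing in for the raw downloaded text file.
printf 'first line\r\nsec\007ond line\n' > gulf_demo.txt

# Strip carriage returns, then delete any remaining non-printable
# control characters. Stray \r or control bytes in the input are a
# common reason a streaming mapper dies with exit code 1.
tr -d '\r' < gulf_demo.txt | tr -cd '[:print:]\n\t' > gulf_demo.clean.txt
cat gulf_demo.clean.txt
```

Since the error here is random rather than deterministic, though, bad input bytes are less likely to be the cause than in the earlier posts.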
The following R packages are installed:
Packages in library /usr/local/lib/R/site-library:
bitops Bitwise Operations
caTools Tools: moving window statistics, GIF, Base64,
ROC AUC, etc.
devtools Tools to make developing R code easier
digest Create Cryptographic Hash Digests of R Objects
evaluate Parsing and evaluation tools that provide more
details than the default.
functional Curry, Compose, and other higher-order
functions
httr Tools for Working with URLs and HTTP
iterators Iterator construct for R
itertools Iterator Tools
jsonlite A Robust, High Performance JSON Parser and
Generator for R
manipulate Interactive Plots for RStudio
memoise Memoise functions
mime Map filenames to MIME types
plyr Tools for splitting, applying and combining
data
R6 Classes with reference semantics
Rcpp Seamless R and C++ Integration
RCurl General network (HTTP/FTP/...) client interface
for R
reshape2 Flexibly Reshape Data: A Reboot of the Reshape
Package.
rhdfs R and Hadoop Distributed Filesystem
rJava Low-level R to Java interface
RJSONIO Serialize R objects to JSON, JavaScript Object
Notation
rmr2 R and Hadoop Streaming Connector
rstudio Tools and Utilities for RStudio
rstudioapi Safely access the RStudio API.
stringr Make it easier to work with strings.
whisker {{mustache}} for R, logicless templating
Here the stderr + syslog from one of the failed jobs:
Log Type: stderr
Log Length: 1130
Loading objects:
wordcount
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
Please review your hadoop settings. See help(hadoop.settings)
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Loading required package: methods
Loading required package: rmr2
Loading required package: rJava
Loading required package: rhdfs
HADOOP_CMD=/usr/bin/hadoop
Be sure to run hdfs.init()
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Log Type: stdout
Log Length: 0
Log Type: syslog
Log Length: 8901
Showing 4096 bytes of 8901 total.
:NA [rec/s]
2015-01-05 15:07:57,389 INFO [main] org.apache.hadoop.streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
2015-01-05 15:07:57,407 INFO [main] org.apache.hadoop.streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
2015-01-05 15:07:58,981 INFO [Thread-12] org.apache.hadoop.streaming.PipeMapRed: Records R/W=8392/1
2015-01-05 15:07:59,017 INFO [Thread-13] org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2015-01-05 15:07:59,071 WARN [Thread-12] org.apache.hadoop.streaming.PipeMapRed: java.io.EOFException
2015-01-05 15:07:59,072 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:211)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:56)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2015-01-05 15:07:59,075 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.EOFException
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:334)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:211)
at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:56)
at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:376)
2015-01-05 15:07:59,080 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2015-01-05 15:07:59,087 WARN [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://babar.hadoop:8020/user/eric.falk/gulf.out/_temporary/1/_temporary/attempt_1418911299833_0092_m_000001_0
2015-01-05 15:07:59,192 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2015-01-05 15:07:59,193 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2015-01-05 15:07:59,193 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
Thanks, Eric
I am facing the same issue. Did you find a solution?
I was facing a similar issue while working with a Python script. Just add a shebang line such as #!/usr/bin/env python
at the beginning of your scripts. The same can be done for other scripting languages.
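To illustrate the suggestion above, here is a minimal, hypothetical word-count mapper (not the tutorial's density.py) with the shebang in place; without that first line the worker's shell cannot tell which interpreter to use, the subprocess dies, and Streaming surfaces it as a PipeMapRed.waitOutputThreads() failure:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch (hypothetical example).
# The shebang on the first line tells the worker node which
# interpreter to run when Hadoop execs the script directly.
import sys

def map_line(line):
    """Turn one input line into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

if __name__ == "__main__":
    for line in sys.stdin:
        for key, value in map_line(line):
            # Streaming expects one "key<TAB>value" pair per output line
            print("%s\t%s" % (key, value))
```

Marking the script executable (chmod +x) matters too, since Hadoop Streaming executes it directly.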
I'm trying to run a mapper for co-occurrence on AWS EMR.
Error log:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
System Log:
2019-04-18 00:34:29,518 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-1-199.us-east-2.compute.internal/172.31.1.199:8032
2019-04-18 00:34:29,741 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-1-199.us-east-2.compute.internal/172.31.1.199:8032
2019-04-18 00:34:30,046 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://abhavtwitterdataset/mr/mapper2.py' for reading
2019-04-18 00:34:30,240 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://abhavtwitterdataset/mr/reducer.py' for reading
2019-04-18 00:34:30,815 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2019-04-18 00:34:30,822 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 59c952a855a0301a4f9e1b2736510df04a640bd3]
2019-04-18 00:34:30,919 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input files to process : 4
2019-04-18 00:34:31,403 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2019-04-18 00:34:31,559 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1555547086751_0002
2019-04-18 00:34:31,802 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1555547086751_0002
2019-04-18 00:34:31,876 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-1-199.us-east-2.compute.internal:20888/proxy/application_1555547086751_0002/
2019-04-18 00:34:31,878 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1555547086751_0002
2019-04-18 00:34:41,076 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1555547086751_0002 running in uber mode : false
2019-04-18 00:34:41,078 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2019-04-18 00:34:59,350 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_0, Status : FAILED
2019-04-18 00:35:01,388 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000003_0, Status : FAILED
2019-04-18 00:35:09,627 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000000_0, Status : FAILED
2019-04-18 00:35:11,646 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_0, Status : FAILED
2019-04-18 00:35:12,657 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000004_0, Status : FAILED
2019-04-18 00:35:13,667 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_0, Status : FAILED
2019-04-18 00:35:15,682 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_1, Status : FAILED
2019-04-18 00:35:15,683 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_0, Status : FAILED
2019-04-18 00:35:30,760 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_1, Status : FAILED
2019-04-18 00:35:30,761 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_2, Status : FAILED
2019-04-18 00:35:34,782 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000003_1, Status : FAILED
2019-04-18 00:35:39,816 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000000_1, Status : FAILED
2019-04-18 00:35:40,824 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000004_1, Status : FAILED
2019-04-18 00:35:41,829 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_1, Status : FAILED
2019-04-18 00:35:45,851 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_1, Status : FAILED
2019-04-18 00:35:45,853 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_2, Status : FAILED
2019-04-18 00:36:00,941 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_2, Status : FAILED
2019-04-18 00:36:00,944 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_2, Status : FAILED
2019-04-18 00:36:03,957 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 100%
2019-04-18 00:36:04,966 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1555547086751_0002 failed with state FAILED due to: Task failed task_1555547086751_0002_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
2019-04-18 00:36:05,073 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 17
Job Counters
Failed map tasks=19
Killed map tasks=8
Killed reduce tasks=3
Launched map tasks=24
Other local map tasks=17
Data-local map tasks=7
Total time spent by all maps in occupied slots (ms)=20982096
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=437127
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=437127
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=671427072
Total megabyte-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
2019-04-18 00:36:05,074 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful!
Code of the mapper for co-occurrence:
#!/usr/bin/env python
"""mapper2.py"""
import sys
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer

nltk.data.path.append("/tmp")
nltk.download('punkt', download_dir="/tmp")

for line in sys.stdin:
    # strip URLs
    line = re.sub(r'http\S+', '', line)
    # replace 't with t (e.g. don't -> dont)
    line = re.sub(r"'t", 't', line)
    stopwords = ['about', 'all', 'along', 'also', 'an', 'any', 'and', 'are', 'around', 'after', 'according', 'another',
                 'already', 'because', 'been', 'being', 'but', 'become', 'can', 'could', 'called',
                 'during', 'do', 'dont', 'does', 'doesn', 'did', 'didnt', 'etc', 'for', 'from', 'far',
                 'get', 'going', 'had', 'has', 'have', 'he', 'her', 'here', 'him', 'his', 'how',
                 'into', 'isnt', 'its', 'just', 'let', 'like', 'may', 'more', 'must', 'most',
                 'not', 'now', 'new', 'next', 'one', 'other', 'our', 'out', 'over', 'own', 'put', 'right',
                 'say', 'said', 'should', 'she', 'since', 'some', 'still', 'such',
                 'take', 'that', 'than', 'the', 'their', 'them', 'then', 'there', 'these',
                 'they', 'this', 'those', 'through', 'time', 'told', 'thing',
                 'use', 'until', 'via', 'very', 'under',
                 'was', 'way', 'were', 'what', 'which', 'when', 'where', 'who', 'why', 'will', 'with', 'would', 'wouldnt',
                 'yes', 'you', 'your']
    lines = nltk.sent_tokenize(line)
    for line in lines:
        # remove punctuation
        line = re.sub(r'[^\w\s]', ' ', line)
        # split the line into words
        words = line.split()
        for k in range(len(words) - 1):
            # stemming (disabled)
            # l = ps.stem(words[k])
            l = words[k]
            l = l.lower()
            if l not in stopwords and len(l) > 2 and not l.isdigit():
                for j in words[k + 1:]:
                    r = j.lower()
                    if r in stopwords or l == r or len(r) <= 2 or r.isdigit():
                        continue
                    key = l + "-" + r
                    print("%s\t%s" % (key.lower(), 1))
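One way to catch this class of bug before paying for an EMR run is to exercise the regexes locally. Note that the backslashes in the script must survive pasting (http\S+, [^\w\s], %s\t%s); a pattern that arrives as httpS+ matches nothing useful and a %st%s format emits no tab, which breaks the Streaming key/value contract. A quick local check (the sample strings are made up):

```python
import re

# URL stripping: with the backslash intact, \S+ consumes the rest of the URL.
sample = "check this out http://example.com/abc today"
print(re.sub(r'http\S+', '', sample))

# Punctuation removal: [^\w\s] replaces anything that is neither a word
# character nor whitespace with a space.
print(re.sub(r'[^\w\s]', ' ', "don't stop!"))
```

Piping a few sample lines through the mapper (cat sample.txt | ./mapper2.py) before submitting the job surfaces the same traceback the cluster would otherwise bury in task logs.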
I’m following this tutorial:
http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/
I put cities.txt in /user/root/ and the R script as following :
#!/usr/bin/env Rscript
f <- file("stdin")
open(f)
state_data = read.table(f)
summary(state_data)
and then run the command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming-2.7.1.2.3.4.0-3485.jar -input /user/root/cities.txt -output /user/root/streamer -mapper /bin/cat -reducer script.R -numReduceTasks 2 -file script.R
Map works till 100% and reduce shows this error:
16/03/01 11:06:30 INFO mapreduce.Job: map 100% reduce 50%
16/03/01 11:06:34 INFO mapreduce.Job: Task Id : attempt_1456773989186_0009_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Has anyone encountered this before, or have any idea what is going on?
Thanks.
I’m trying to run my own mapper and reducer Python scripts using Hadoop Streaming on my cluster built on VMware Workstation VMs.
Hadoop version: 2.7, Python: 3.5, OS: CentOS 7.2 on all the VMs.
I have a separate machine that plays the role of a client application host and submits the MapReduce job to the resource manager. The map and reduce scripts are stored there as well.
I’m using the following hadoop command to run a job:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -output result1 -input /user/hadoop/hr/profiles -file /home/hadoop/map.py -mapper map.py -file /home/hadoop/reduce.py -reducer reduce.py
I also tried to insert the python3 interpreter before the -mapper and -reducer scripts:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -output result1 -input /user/hadoop/hr/profiles -file /home/hadoop/map.py -mapper "python3.5 map.py" -file /home/hadoop/reduce.py -reducer "python3.5 reduce.py"
However, the job always fails and I’m still getting the same error messages in the log:
2016-10-07 21:57:10,485 INFO [IPC Server handler 1 on 41498] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1475888525921_0004_m_000001_0 is : 0.0
2016-10-07 21:57:10,520 FATAL [IPC Server handler 2 on 41498] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1475888525921_0004_m_000001_0 - exited : java.lang.RuntimeException: **PipeMapRed.waitOutputThreads(): subprocess failed with code 127**
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-10-07 21:57:10,520 INFO [IPC Server handler 2 on 41498] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1475888525921_0004_m_000001_0: Error: java.lang.RuntimeException: **PipeMapRed.waitOutputThreads(): subprocess failed with code 127**
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-10-07 21:57:10,521 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1475888525921_0004_m_000001_0: Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-10-07 21:57:10,523 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1475888525921_0004_m_000001_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
2016-10-07 21:57:10,523 INFO [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1475888525921_0004_01_000003 taskAttempt attempt_1475888525921_0004_m_000001_0
2016-10-07 21:57:10,524 INFO [ContainerLauncher #2] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1475888525921_0004_m_000001_0
2016-10-07 21:57:10,524 INFO [ContainerLauncher #2] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : slave-1:56838
The Python 3.5 interpreter is installed on all VMs across the cluster, and its directory is on the system PATH as well. I can launch the interpreter on all nodes with the python3.5 command.
I tried to run the same command with the same scripts on my NameNode and it worked. So it seems like it’s an HDFS security issue.
I've already read many posts related to this problem and tried everything that was suggested, but still no progress.
I’ve tried the following already:
- disabling dfs permissions
- redirecting stdout to stderr in the map and reduce scripts
- since I’m using VMs, I reduced RAM and CPU requirements for containers: 256MB and 1 core
- adding the python3 interpreter before the -mapper and -reducer options in the hadoop jar command
- I replaced CRLF with Unix LF in my scripts
All my scripts have a #!/opt/rh/rh-python35/root/usr/bin/python3.5 line pointing to the interpreter's location. I've tested my scripts several times; they work fine.
I’m completely new to this topic and now I’m stuck. Please, if you know how to fix this, share your experience. Thanks in advance.
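On the exit code in the logs above: 127 is the POSIX shell's "command not found" status. So when Streaming reports "subprocess failed with code 127", the worker nodes most likely cannot find the interpreter named on the command line or in the shebang, or a CRLF line ending survived somewhere, making the kernel look for an interpreter literally named "python3.5\r". A small sketch showing where that code comes from (the command name is deliberately fake):

```python
import subprocess

# Ask a POSIX shell to run a command that does not exist; the shell
# reports this with exit status 127 -- the same code Hadoop Streaming
# relays in "subprocess failed with code 127".
result = subprocess.run(
    ["/bin/sh", "-c", "no_such_interpreter_xyz mapper.py"],
    capture_output=True,
)
print(result.returncode)  # 127
```

By contrast, exit code 1 (as in the earlier comments) means the interpreter started but the script itself crashed, which is why redirecting the script's stderr or running it locally against sample input is the usual next step.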