反垃圾rd那邊有一個hql,在執行過程中出現錯誤退出,報java.io.IOException: Broken pipe異常,hql中使用到了python腳本,hql和python腳本近期沒有人改過,在10.1號時還運行正常,但是在10.4號之後運行就老是出現相同的錯誤,而且錯誤出現在stage-2的reduce階段,gateway上面的錯誤提示如下:
2014-10-10 15:05:32,724 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201406171104_4019895 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
jobtracker頁面job報錯信息:
2014-10-10 15:00:29,614 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"1000390355","reducesinkkey1":"14"},"value":{"_col0":"1000390355","_col1":25,"_col2":"Infinity","_col3":"14","_col4":17},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:268)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:518)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:419)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1061)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"1000390355","reducesinkkey1":"14"},"value":{"_col0":"1000390355","_col1":25,"_col2":"Infinity","_col3":"14","_col4":17},"alias":0}
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:256)
... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Broken pipe
at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:348)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:247)
... 7 more
Caused by: java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:43)
at org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:331)
... 15 more
stderr logs:
Traceback (most recent call last):
File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 86, in <module>
pranalysis(cols[0],pr,cols[1],cols[4],prnum)
File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 60, in pranalysis
print '%s\t%d\t%d\t%d'%(uid,v[14]-20,type,rank)
TypeError: %d format: a number is required, not float
從以上job的錯誤信息初步判斷,問題原因應該是10.1之後的數據出現問題,導致python腳本執行的時候退出,數據流通道被關閉,而ExecReducer.reduce()方法不知道往python寫數據的通道已經因爲異常而關閉,還繼續往裏寫數據,這時就會出現java.io.IOException: Broken pipe異常。
以下是分析過程:
1、hql和python
hql內容如下:
add file /usr/home/wbdata_anti/shell/sass_offline/pranalysis.py;
select transform(BS.*) using 'pranalysis.py' as uid,prvalue,trend,prlevel
from
(
select B1.uid,B1.flws,B1.pr,iter,B2.alivefans from tmp_anti_user_pagerank1 B1
join
mds_anti_user_flwpr B2
on B1.uid=B2.uid
where iter>'00' and iter<='14' and dt='lowrlfans20141001'
distribute by uid sort by uid,iter
)BS;
python腳本內容如下:
#!/usr/bin/python
#coding=utf-8
import sys,time
import re,math
from optparse import OptionParser
import ConfigParser
reload(sys)
sys.setdefaultencoding('utf-8')
parser = OptionParser(usage="usage:%prog [optinos] filepath")
parser.add_option("-i", "--iter",action = "store",type = 'string', dest = "iter", default = '14',
help="how many iterators" )
(options, args) = parser.parse_args()
def pranalysis(uid,prs,flw,fans,prnum):
tasc=tdesc=0
try:
v=[float(pr)*100000000000 for pr in prs]
fans=int(fans)
interval=fans/100
except:
#rst=sys.exc_info()
#sys.excepthook(rst[0],rst[1],rst[2])
return
for i in range(1,prnum-1) :
if i==1:
if v[i+1]-v[i]>interval and v>fans: tasc += 1
elif v[i]-v[i+1]>interval and v[i+1]<fans: tdesc += 1
continue
if v[i+1]-v[i]>interval: tasc += 1
elif v[i]-v[i+1]>interval: tdesc += 1
# rank indicate the rate between pr and fans. higher rank(big number) mean more possible negative user
rate=v[prnum-1]/fans
rank=4
if rate>3.0: rank=0
elif rate>2.0: rank=1
elif rate>1.3: rank=2
elif rate>0.7: rank=3
elif rate>0.5: rank=4
elif rate>0.3: rank=5
elif rate>0.2: rank=6
else: rank=7
# 0 for stable trend. 1 for round trend, 2, for positive user, 3 for negative user.
type=0
if tasc>0 and tdesc>0:
type=1
elif tasc>0:
type=2
elif tdesc>0:
type=3
else: # tdesc=0 and tasc=0
type=0
#if fans<60:
# type=0
print '%s\t%d\t%d\t%d'%(uid,v[14]-20,type,rank)
#format sort by uid, iter
#uid follow pr iter fans
#1642909335 919 0.00070398898 04 68399779
prnum=int(options.iter)+1
pr=[0]*prnum
idx=1
lastiter='00'
lastuid=''
for line in sys.stdin:
line=line.rstrip('\n')
cols=line.split('\t')
if len(cols)<5: continue
if cols[3]>options.iter or cols[3]=='00': continue
if cols[3]<=lastiter:
print '%s\t%d\t%d\t%d'%(lastuid,2,0,7)
pr=[0]*prnum
idx=1
lastiter=cols[3]
lastuid=cols[0]
pr[idx]=cols[2]
idx+=1
if cols[3]==options.iter:
pranalysis(cols[0],pr,cols[1],cols[4],prnum)
pr=[0]*prnum
lastiter='00'
idx=1
2、stage-2 reduce階段的執行計劃:
Reduce Operator Tree:
Extract
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
expr: _col2
type: string
expr: _col3
type: string
expr: _col4
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Transform Operator
command: pranalysis.py
output info:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
根據執行計劃,可以看出,stage-2 的reduce階段其實很簡單,就是將map階段拿到的數據使用pranalysis.py腳本進行計算,由5列轉換成4列,python輸出的時候有數據格式要求:
print '%s\t%d\t%d\t%d'%(uid,v[14]-20,type,rank)
根據執行計劃定位到的結果,在結合job的stderr logs信息:
Traceback (most recent call last):
File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 86, in <module>
pranalysis(cols[0],pr,cols[1],cols[4],prnum)
File "/data10/hadoop/local/taskTracker/liangjun/jobcache/job_201406171104_4019895/attempt_201406171104_4019895_r_000000_0/work/./pranalysis.py", line 60, in pranalysis
print '%s\t%d\t%d\t%d'%(uid,v[14]-20,type,rank)
TypeError: %d format: a number is required, not float
可以看出,hql確實是在執行python的時候由於數據出現異常,python計算完成之後的有一個數據的格式是float型的,而我們對該數據預期的格式應該是number型的,導致python腳本異常退出,退出的時候關閉了數據流通道,但是ExecReducer.reduce()方法其實是不知道往python寫數據的通道已經因爲異常而關閉,還繼續往裏寫數據,這時就出現了java.io.IOException:
Broken pipe的異常。
參考:
http://fgh2011.iteye.com/blog/1684544
http://blog.csdn.net/churylin/article/details/11969925