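This post walks through an example of using RegexSerDe to process Apache web logs in the standard format and run queries over them. My Hive version is apache-hive-0.13.1-bin.
1. Create the table serde_regex in Hive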
CREATE TABLE serde_regex(
host STRING,
identity STRING ,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([\\d|.]+)\\s+([^ ]+)\\s+([^ ]+)\\s+\\[(.+)\\]\\s+\"([^ ]+)\\s(.+)\\s([^ ]+)\"\\s+([^ ]+)\\s+([^ ]+)\\s+\"(.+)\"\\s+\"(.+)\"?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE ;
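2. Load data into the table
The sample Apache web log was provided as a downloadable attachment to the original post.
Load the sample data into the table: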
hive> LOAD DATA LOCAL INPATH "./data/apache_log.txt" INTO TABLE serde_regex;
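Before loading a file, it can help to check offline which lines the pattern accepts. As far as I know, RegexSerDe matches each line as a whole and returns NULL columns for lines that do not match. The sketch below is a rough Python approximation of that check, not Hive's own code. Hive uses java.util.regex, but the constructs in this pattern behave the same way in Python's re module. The sample lines and the helper name split_by_match are made up for illustration.

```python
import re

# input.regex from the CREATE TABLE statement, with the HiveQL string
# escaping removed (\\d becomes \d, \" becomes ").
INPUT_REGEX = re.compile(
    r'([\d|.]+)\s+([^ ]+)\s+([^ ]+)\s+\[(.+)\]\s+'
    r'"([^ ]+)\s(.+)\s([^ ]+)"\s+([^ ]+)\s+([^ ]+)\s+"(.+)"\s+"(.+)"?'
)

# Made-up lines for illustration; they are not from the author's log.
SAMPLE_LINES = [
    '192.0.2.10 - - [23/Dec/2014:14:46:15 +0800] '
    '"GET /index.html HTTP/1.1" 200 7519 '
    '"http://example.com/" "Mozilla/5.0"',
    'this line is not an access log entry',
]


def split_by_match(lines):
    """Return (matched, unmatched) lines, using a whole-line match."""
    matched, unmatched = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if INPUT_REGEX.fullmatch(line):
            matched.append(line)
        else:
            unmatched.append(line)
    return matched, unmatched


if __name__ == "__main__":
    ok, bad = split_by_match(SAMPLE_LINES)
    print(len(ok), "matched,", len(bad), "unmatched")
```

To check the real file, pass an open file object such as open("./data/apache_log.txt") instead of SAMPLE_LINES.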
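3. Query the data
The query fails with the error below. The root cause is that the RegexSerDe class cannot be found: the stack trace shows the ClassNotFoundException being thrown inside the map task, so the jar that contains the class is not available to jobs launched from this Hive session.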
hive> select host,request,agent from serde_regex limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0001, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0001/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-23 14:46:15,389 Stage-1 map = 0%, reduce = 0%
2014-12-23 14:47:02,249 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1419317102229_0001 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://secondmgt:8088/proxy/application_1419317102229_0001/
Examining task ID: task_1419317102229_0001_m_000000 (and more) from job job_1419317102229_0001
Task with the most failures(4):
-----
Task ID:
task_1419317102229_0001_m_000000
URL:
http://secondmgt:8088/taskdetails.jsp?jobid=job_1419317102229_0001&tipid=task_1419317102229_0001_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:425)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:154)
... 22 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:335)
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:353)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:123)
... 22 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.RegexSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:305)
... 24 more
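4. Resolving the exception
Add hive-contrib-0.13.1.jar to the Hive session. The jar is in the lib directory of the Hive installation, and ADD JAR also ships it to the tasks of jobs started afterwards. The command is: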
hive> add jar /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar;
Added /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar to class path
Added resource: /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/hive-contrib-0.13.1.jar
Run the SELECT query from step 3 again. The result is:
hive> select host,request,agent from serde_regex limit 10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0002, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0002/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-23 14:48:50,163 Stage-1 map = 0%, reduce = 0%
2014-12-23 14:49:01,666 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.31 sec
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1419317102229_0002
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 3.31 sec HDFS Read: 4321 HDFS Write: 238 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
61.160.224.138 GET 7519
61.160.224.138 GET 709
61.160.224.138 GET 815
113.17.174.44 POST 653
61.160.224.138 GET 1670
61.160.224.144 GET 2887
61.160.224.143 GET 2947
61.160.224.145 GET 2581
61.160.224.145 GET 2909
61.160.224.144 GET 15879
Time taken: 26.811 seconds, Fetched: 10 row(s)
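One detail in this output is worth a closer look: the request column holds only the HTTP method and the agent column holds a number. This is consistent with the pattern having 11 capture groups while the table has 9 columns, with RegexSerDe apparently assigning groups to columns in order and ignoring the surplus. The Python sketch below reproduces that mapping on a made-up log line and compares it with a hypothetical 9-group variant of the pattern. The variant, the sample line and the helper to_row are my own illustrations and have not been tried in Hive or against the author's data.

```python
import re

COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

# Pattern from the CREATE TABLE statement, with HiveQL string escaping removed.
ORIGINAL = re.compile(
    r'([\d|.]+)\s+([^ ]+)\s+([^ ]+)\s+\[(.+)\]\s+'
    r'"([^ ]+)\s(.+)\s([^ ]+)"\s+([^ ]+)\s+([^ ]+)\s+"(.+)"\s+"(.+)"?'
)

# Hypothetical variant: the quoted request is a single group, 9 groups in total.
ADJUSTED = re.compile(
    r'([\d.]+)\s+([^ ]+)\s+([^ ]+)\s+\[([^\]]+)\]\s+'
    r'"([^"]*)"\s+([^ ]+)\s+([^ ]+)\s+"([^"]*)"\s+"([^"]*)"'
)

# Made-up line in combined log format (not taken from the author's data).
SAMPLE = ('192.0.2.10 - - [23/Dec/2014:14:46:15 +0800] '
          '"GET /index.html HTTP/1.1" 200 7519 '
          '"http://example.com/" "Mozilla/5.0"')


def to_row(pattern, line):
    """Assign capture groups to columns in order, dropping surplus groups."""
    m = pattern.fullmatch(line)
    if m is None:
        return None
    return dict(zip(COLUMNS, m.groups()))


if __name__ == "__main__":
    for name, pattern in (("original", ORIGINAL), ("adjusted", ADJUSTED)):
        print(name, pattern.groups, to_row(pattern, SAMPLE))
```

With the original pattern the sample line gives request = GET and agent = 7519, the same shape as the rows above. With the variant, request holds the full request line and agent holds the user agent string. To use such a pattern in input.regex, the backslashes must be doubled and the double quotes escaped, as in the CREATE TABLE statement.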
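This resolves the ClassNotFoundException, but only for the current Hive session: after leaving the CLI with exit and starting it again, the error returns until the jar is added again.
Open question: I suspect that copying the jar into hadoop/lib and restarting the cluster would fix this permanently, but I have not tried it, so it remains to be verified. Hive also offers ways to register the jar without touching the Hadoop installation: setting hive.aux.jars.path in hive-site.xml, starting the CLI with --auxpath, or putting the ADD JAR command in $HOME/.hiverc, which the Hive CLI runs at startup. These are untested in this setup as well.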