Steps for analyzing logs with Hive + HDFS + Sqoop

Original post

2020-02-24 09:59

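Part of my current work is log analysis. Each day's logs come to a little over 80 GB before compression and about 10 GB after compression with lzop. Computing the statistics directly with shell tools takes a very long time, and the request URLs also have to be transformed with a Java function, so I adopted a Hive + HDFS + Sqoop setup for the log statistics and analysis.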

I won't go into the details of the Hadoop + Hive + HDFS + Sqoop architecture; all of it can be installed directly from Cloudera's repo.

Log analysis steps

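1. Download the logs from the servers. Because the application runs on several servers, the logs have to be merged and tidied up first, then compressed with lzop.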
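The post does not show how the merge is done, so the following is only a minimal Python sketch of one way to do it. It assumes each server's access log is already in chronological order and carries the bracketed timestamp of the common log format; `log_time`, `merge_logs`, `merge_and_compress` and all file names are hypothetical. Compression is delegated to the `lzop` command-line tool.

```python
import heapq
import re
import subprocess
from datetime import datetime

# Bracketed access-log timestamp, e.g. [22/Sep/2011:10:00:01 +0800]
_TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def log_time(line):
    """Return the timestamp of one log line (raises ValueError if absent)."""
    match = _TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(sources):
    """Lazily merge several time-ordered iterables of log lines into one."""
    return heapq.merge(*sources, key=log_time)


def merge_and_compress(paths, merged_path):
    """Merge the per-server logs into merged_path, then compress it with lzop."""
    files = [open(p, encoding="utf-8", errors="replace") for p in paths]
    try:
        with open(merged_path, "w", encoding="utf-8") as out:
            out.writelines(merge_logs(files))
    finally:
        for f in files:
            f.close()
    # lzop writes merged_path + ".lzo" next to the input file.
    subprocess.run(["lzop", merged_path], check=True)
```

A call such as `merge_and_compress(["web1_access.log", "web2_access.log"], "merged.log")` would then produce `merged.log.lzo`. Because `heapq.merge` streams its inputs, the full day's logs never have to fit in memory.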

2. Create a table in Hive

hive>CREATE TABLE maptile (ipaddress STRING,identity STRING,user STRING,time STRING,method STRING,request STRING,protocol STRING,status STRING,size STRING,referer STRING,agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\"[^ ]*) ([^ ]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?","output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s")STORED AS TEXTFILE;
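To see what the `input.regex` above extracts, here is a small Python sketch that applies the same pattern to a single line. The sample log line in the usage note is made up, and `parse_line` is a hypothetical helper; Python's `re.fullmatch` is used as a stand-in for the whole-line match that RegexSerDe performs, so small differences from Java's regex engine are possible.

```python
import re

# Same pattern as input.regex, with Hive's string escaping removed.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) '
    r'("[^ ]*) ([^ ]*) ([^ ]*") (-|[0-9]*) (-|[0-9]*)'
    r'(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)

# Column names in the order of the CREATE TABLE statement.
COLUMNS = ["ipaddress", "identity", "user", "time", "method", "request",
           "protocol", "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a column -> value dict, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line.rstrip("\n"))
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))
```

For a line such as `192.168.1.10 - - [22/Sep/2011:10:00:01 +0800] "GET /tile/3/4/5.png HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux)"`, the `method` column keeps its leading quote (`"GET`) and `protocol` keeps its trailing quote (`HTTP/1.1"`), because the pattern splits the quoted request string on spaces. The referer and agent groups are optional, so lines without them still match.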

3. Load the log data

hive>load data local inpath '/home/log/1.lzo' overwrite into table maptile;

4. Create a Hive table to hold the aggregated results

hive>create table result (ip string,num int) partitioned by (dt string);

5. Aggregate the logs and insert the results into the new table

hive>insert overwrite table result partition (dt='2011-09-22') select ipaddress,count(1) as numrequest from maptile group by ipaddress sort by numrequest desc;
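As a rough illustration of what this query computes, the Python sketch below counts requests per IP address and orders the result by count, descending. `count_requests` is a hypothetical name, and the sketch ignores distribution: in Hive, `sort by` orders rows within each reducer only, so a global ordering would need `order by`.

```python
from collections import Counter


def count_requests(ip_addresses):
    """Return (ip, request_count) pairs, highest count first."""
    counts = Counter(ip_addresses)
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
```

Since the rows end up in a MySQL table anyway, their order in the Hive output matters little for the export step.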

6. Export the results to MySQL

sqoop export --connect jdbc:mysql://localhost:3306/result --username root --password admin --table ip_info --export-dir /user/hive/warehouse/result/dt=2011-09-22 --input-fields-terminated-by '\001'
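The `'\001'` delimiter is Hive's default field separator (Ctrl-A) for text tables, which is why it has to be passed to Sqoop. The partition value `dt` lives in the directory name rather than in the files, so each exported row carries only `ip` and `num`. The Python sketch below shows how one such row splits; `parse_hive_row` is a hypothetical helper, and the assumption that the MySQL table `ip_info` has exactly these two columns is mine, since the post does not show its schema.

```python
def parse_hive_row(line, delimiter="\x01"):
    """Split one line of the result table's data file into (ip, num)."""
    ip, num = line.rstrip("\n").split(delimiter)
    return ip, int(num)
```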

The steps above can be put into a shell script and scheduled as a recurring job so that they run automatically.
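A minimal sketch of such a driver follows, written in Python rather than the shell script mentioned above. It reuses the statements from steps 3, 5 and 6 and assumes the `hive` and `sqoop` commands are on the PATH; `build_commands` and `run_daily` are hypothetical names, and the connection details are simply the ones used earlier in this post.

```python
import subprocess


def build_commands(dt, log_path):
    """Return the hive and sqoop command lines for one day (dt as YYYY-MM-DD)."""
    hql = (
        f"load data local inpath '{log_path}' overwrite into table maptile; "
        f"insert overwrite table result partition (dt='{dt}') "
        "select ipaddress, count(1) as numrequest from maptile "
        "group by ipaddress sort by numrequest desc;"
    )
    hive_cmd = ["hive", "-e", hql]
    sqoop_cmd = [
        "sqoop", "export",
        "--connect", "jdbc:mysql://localhost:3306/result",
        "--username", "root", "--password", "admin",
        "--table", "ip_info",
        "--export-dir", f"/user/hive/warehouse/result/dt={dt}",
        # Literal backslash-001, exactly as the shell passes '\001'.
        "--input-fields-terminated-by", "\\001",
    ]
    return [hive_cmd, sqoop_cmd]


def run_daily(dt, log_path):
    """Run the load, aggregation and export; stop at the first failing step."""
    for cmd in build_commands(dt, log_path):
        subprocess.run(cmd, check=True)
```

Calling `run_daily("2011-09-22", "/home/log/1.lzo")` from a small script, scheduled through cron, would repeat the manual steps once a day. The table-creation statements are left out because they only need to run once.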
