Using Hadoop to Extract KPIs from Massive Web Logs
Web logs contain some of a website's most important information. Through log analysis we can learn how much traffic the site gets, which pages are visited the most, which pages are the most valuable, and so on. A typical mid-sized site (100,000+ PV per day) generates more than 1 GB of web log files every day; a large or very large site may generate 10 GB of log data every hour.
- Overview of web log analysis
- Requirements analysis: KPI indicator design
- Algorithm model: parallel algorithms on Hadoop
- Architecture design: the log KPI system architecture
- Program development: the MapReduce implementation
1. Overview of Web Log Analysis
Web logs are produced by web servers such as Nginx, Apache, or Tomcat. From web logs we can obtain the PV (PageView, page view count) for each class of page and the number of unique IPs; with a little more effort we can derive things like the ranking of the keywords users search for or the pages with the longest dwell time; going further still, we can build ad click models and analyze user behavior patterns.
In a web log, each record normally represents one user visit. For example, here is one nginx log line:
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939
"http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It breaks down into the following 8 fields:
- remote_addr: the client IP address, 222.68.172.190
- remote_user: the client user name, -
- time_local: the access time and timezone, [18/Sep/2013:06:49:57 +0000]
- request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
- status: the request status (200 means success), 200
- body_bytes_sent: the size of the response body sent to the client, 19939
- http_referer: the page from which the request was linked, "http://www.angularjs.cn/A00n"
- http_user_agent: information about the client browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
Note: to capture information beyond this, other techniques are needed, such as sending extra requests from JavaScript or recording visit details in cookies.
With this log information we can start digging into the site's secrets.
The small-data case
With small amounts of data (10 MB, 100 MB, or even 10 GB), as long as a single machine can still cope, we can work directly with the usual Unix/Linux tools: awk, grep, sort, join and the like are all excellent for log analysis, and together with perl, python and regular expressions they handle almost every problem.
For example, to get the 10 IPs with the most requests from the nginx log above:
~ cat access.log.10 | awk '{a[$1]++} END {for(b in a) print b"\t"a[b]}' | sort -k2 -nr | head -n 10
163.177.71.12 972
101.226.68.137 972
183.195.232.138 971
50.116.27.194 97
14.17.29.86 96
61.135.216.104 94
61.135.216.105 91
61.186.190.41 9
59.39.192.108 9
220.181.51.212 9
The massive-data case
When the data grows by 10 GB or 100 GB per day, a single machine can no longer keep up, and we have to accept more system complexity and turn to computer clusters and storage arrays. Before Hadoop appeared, storing massive data and analyzing massive logs were both very hard; only a few companies held the core technology for efficient parallel computing, distributed computing, and distributed storage.
The arrival of Hadoop dramatically lowered the barrier to processing massive data, putting it within reach of small companies and even individuals, and it is especially well suited to log analysis systems.
2. Requirements Analysis: KPI Indicator Design
Below we start from a company case study and walk through how to analyze massive web logs with Hadoop and extract KPI data.
Case description
An e-commerce website runs an online group-buying business. It sees 1,000,000 PV and 50,000 unique IPs per day. Traffic peaks on workdays between 10:00-12:00 and 15:00-18:00. During the day most visits come from PC browsers; on rest days and at night most come from mobile devices. Search traffic accounts for 80% of the site's visits; fewer than 1% of PC users make a purchase, while about 5% of mobile users do.
From this short description we can roughly judge how the business is doing, see where the paying users come from, which potential users could be tapped, and whether the site is at risk of failing.
KPI indicator design
- PV (PageView): page view counts
- IP: unique IPs per page
- Time: PV per hour
- Source: referring domains
- Browser: client devices and browsers
From a business point of view, a personal website behaves differently from an e-commerce site: there is no conversion rate and the bounce rate is higher. From a technical point of view, however, both lead to the same kind of KPI indicator design.
3. Algorithm Model: Parallel Algorithms on Hadoop
Design of the parallel algorithms:
PV (PageView): page view counts
Map: {key: request, value: 1}
Reduce: {key: request, value: sum}
IP: unique IPs per page
Map: {key: request, value: remote_addr}
Reduce: {key: request, value: count of distinct values (count(distinct))}
Time: PV per hour
Map: {key: time_local, value: 1}
Reduce: {key: time_local, value: sum}
Source: referring domains
Map: {key: http_referer, value: 1}
Reduce: {key: http_referer, value: sum}
Browser: client devices and browsers
Map: {key: http_user_agent, value: 1}
Reduce: {key: http_user_agent, value: sum}
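For illustration, take the PV indicator: if the mappers see three valid requests, say /about/, /about/ and /product/ (hypothetical paths), they emit (/about/, 1), (/about/, 1) and (/product/, 1); after the shuffle groups pairs by key, the reducer sums the values and outputs (/about/, 2) and (/product/, 1). The other indicators follow the same pattern, except that IP replaces the sum with a distinct count of remote_addr values.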
4. Architecture Design: Log KPI System Architecture
The architecture has two halves: the application (business) system on one side, and Hadoop (HDFS and MapReduce) on the other.
1. The logs are produced by the business system. We can configure the web server to start a new directory every day, containing multiple log files of roughly 64 MB each.
2. A system cron job runs shortly after midnight and imports the previous day's log files into HDFS (a sketch follows at the end of this section).
3. Once the import completes, another scheduled task starts the MapReduce programs that extract and compute the KPI indicators.
4. Once the computation completes, a scheduled task exports the indicator data from HDFS into a database for convenient querying later.
The data flow is therefore: logs are generated and collected on the application side, while storage and computation (the HDFS and MapReduce stages) happen inside Hadoop. Our remaining task is to implement the MapReduce programs.
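As a minimal sketch of step 2, and assuming the HDFS address used later in this article, the import could be done with the HDFS Java API. The class name and the local log path below are hypothetical; in practice this small program would be invoked from the cron job.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogUploader {
    public static void main(String[] args) throws Exception {
        String localDir = "/var/log/nginx/2013-09-18";                    // yesterday's log directory (assumption)
        String hdfsDir = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi";  // target directory in HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(hdfsDir), conf);
        // copyFromLocalFile(delSrc, overwrite, src, dst): copy the whole directory, overwriting if present
        fs.copyFromLocalFile(false, true, new Path(localDir), new Path(hdfsDir));
        fs.close();
    }
}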
5. Program Development: the MapReduce Implementation
Development steps:
- Parse each log line
- Implement the Map functions
- Implement the Reduce functions
- Implement the job drivers
1). Parsing a log line
Create the class KPI.java in a new package org.apache.hadoop.mr.kpi.
Full code:
package org.apache.hadoop.mr.kpi;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
public class KPI {
/**
* 20160512
* @author yue
*/
private String remote_addr; //client IP address
private String remote_user; //client user name; "-" when not recorded
private String time_local; //access time and timezone
private String request; //requested URL and HTTP protocol
private String status; //request status; 200 means success
private String body_bytes_sent; //size of the response body sent to the client
private String http_referer; //the page from which the request was linked
private String http_user_agent; //client browser information
private boolean valid = true ; //whether this record is valid
private static KPI parser(String line){
System.out.println(line);
KPI kpi = new KPI();
String[] arr = line.split(" ");
if (arr.length>11){
kpi.setRemote_addr(arr[0]);
kpi.setRemote_user(arr[1]);
kpi.setTime_local(arr[3].substring(1));
kpi.setRequest(arr[6]);
kpi.setStatus(arr[8]);
kpi.setBody_bytes_sent(arr[9]);
kpi.setHttp_referer(arr[10]);
if(arr.length>12){
kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
} else {
kpi.setHttp_user_agent(arr[11]);
}
if(Integer.parseInt(kpi.getStatus()) >= 400){
//status >= 400 means an HTTP error
kpi.setValid(false);
}
}else{
kpi.setValid(false);
}
return kpi;
}
/**
* Filter for the per-page PV statistic
* pageview: page view counts
* @return
*/
public static KPI filterPVs(String line){
KPI kpi = parser(line);
Set<String> pages = new HashSet<String>();
pages.add("/about/");
pages.add("/black-ip-clustor/");
pages.add("/cassandra-clustor/");
pages.add("/finance-rhive-repurchase/");
pages.add("/hadoop-familiy-roadmap/");
pages.add("/hadoop-hive-intro/");
pages.add("/hadoop-zookeeper-intro/");
pages.add("/hadoop-mahout-roadmap/");
if(!pages.contains(kpi.getRequest())){
kpi.setValid(false);
}
return kpi;
}
/**
* Filter for the per-page unique-IP statistic
* @return
*/
public static KPI filterIPs(String line){
KPI kpi = parser(line);
Set<String> pages = new HashSet<String>();
pages.add("/about/");
pages.add("/black-ip-clustor/");
pages.add("/cassandra-clustor/");
pages.add("/finance-rhive-repurchase/");
pages.add("/hadoop-familiy-roadmap/");
pages.add("/hadoop-hive-intro/");
pages.add("/hadoop-zookeeper-intro/");
pages.add("/hadoop-mahout-roadmap/");
if (!pages.contains(kpi.getRequest())){
kpi.setValid(false);
}
return kpi;
}
/**
* Filter for PV by browser
* @return
*/
public static KPI filterBroswer(String line){
return parser(line);
}
/**
* Filter for PV by hour
* @return
*/
public static KPI filterTime(String line){
return parser(line);
}
/**
* Filter for PV by referring domain
* @return
*/
public static KPI filterDomain(String line){
return parser(line);
}
public String getRemote_addr() {
return remote_addr;
}
public void setRemote_addr(String remote_addr) {
this.remote_addr = remote_addr;
}
public String getRemote_user() {
return remote_user;
}
public void setRemote_user(String remote_user) {
this.remote_user = remote_user;
}
public String getTime_local() {
return time_local;
}
public Date getTime_local_Date() throws ParseException{
SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss",Locale.US);
return df.parse(this.time_local);
}
public String getTime_local_Date_hour() throws ParseException{
SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
return df.format(this.getTime_local_Date());
}
public void setTime_local(String time_local) {
this.time_local = time_local;
}
public String getRequest() {
return request;
}
public void setRequest(String request) {
this.request = request;
}
public String getStatus() {
return status;
}
public void setStatus(String status) {
this.status = status;
}
public String getBody_bytes_sent() {
return body_bytes_sent;
}
public void setBody_bytes_sent(String body_bytes_sent) {
this.body_bytes_sent = body_bytes_sent;
}
public String getHttp_referer() {
return http_referer;
}
public String getHttp_referer_domain(){
if(http_referer.length()<8){
return http_referer;
}
String str = this.http_referer.replace("\\", "").replace("http://", "").replace("https://", "");
return str.indexOf("/")>0?str.substring(0, str.indexOf("/")):str;
}
public void setHttp_referer(String http_referer) {
this.http_referer = http_referer;
}
public String getHttp_user_agent() {
return http_user_agent;
}
public void setHttp_user_agent(String http_user_agent) {
this.http_user_agent = http_user_agent;
}
public boolean isValid() {
return valid;
}
public void setValid(boolean valid) {
this.valid = valid;
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
sb.append("valid:" + this.valid);
sb.append("\nremote_addr:" + this.remote_addr);
sb.append("\nremote_user:" + this.remote_user);
sb.append("\ntime_local:" + this.time_local);
sb.append("\nrequest:" + this.request);
sb.append("\nstatus:" + this.status);
sb.append("\nbody_bytes_sent:" + this.body_bytes_sent);
sb.append("\nhttp_referer:" + this.http_referer);
sb.append("\nhttp_user_agent:" + this.http_user_agent);
return sb.toString();
}
public static void main(String[] args) {
String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
System.out.println(line);
KPI kpi = new KPI();
String[] arr = line.split(" ");
kpi.setRemote_addr(arr[0]);
kpi.setRemote_user(arr[1]);
kpi.setTime_local(arr[3].substring(1));
kpi.setRequest(arr[6]);
kpi.setStatus(arr[8]);
kpi.setBody_bytes_sent(arr[9]);
kpi.setHttp_referer(arr[10]);
kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
System.out.println(kpi);
try {
SimpleDateFormat df = new SimpleDateFormat("yyyy.MM.dd:HH:mm:ss",Locale.US);
System.out.println(df.format(kpi.getTime_local_Date()));
System.out.println(kpi.getTime_local_Date_hour());
System.out.println(kpi.getHttp_referer_domain());
} catch (ParseException e) {
e.printStackTrace();
}
}
}
Take a line from the log file and write a simple parsing test via the main method. The console output shows the log line correctly parsed into the fields of the KPI object. We then pull the parsing logic out into a method of its own:
private static KPI parser(String line) {
System.out.println(line);
KPI kpi = new KPI();
String[] arr = line.split(" ");
if (arr.length > 11) {
kpi.setRemote_addr(arr[0]);
kpi.setRemote_user(arr[1]);
kpi.setTime_local(arr[3].substring(1));
kpi.setRequest(arr[6]);
kpi.setStatus(arr[8]);
kpi.setBody_bytes_sent(arr[9]);
kpi.setHttp_referer(arr[10]);
if (arr.length > 12) {
kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
} else {
kpi.setHttp_user_agent(arr[11]);
}
if (Integer.parseInt(kpi.getStatus()) >= 400) {// status >= 400 means an HTTP error
kpi.setValid(false);
}
} else {
kpi.setValid(false);
}
return kpi;
}
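Note that split(" ") is fragile: the quoted fields (request, referer, user agent) themselves contain spaces, which is why the parser can only stitch the user agent back together from arr[11] and arr[12] and truncates anything longer. As a more robust, hypothetical variant that could live inside the KPI class (with the two java.util.regex imports added at the top of the file), the whole line can be matched against a pattern for the combined log format shown earlier:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

private static KPI parseWithRegex(String line) {
    KPI kpi = new KPI();
    Matcher m = LOG_PATTERN.matcher(line);
    if (!m.find()) {
        kpi.setValid(false);
        return kpi;
    }
    kpi.setRemote_addr(m.group(1));
    kpi.setRemote_user(m.group(3));
    // keep only "dd/MMM/yyyy:HH:mm:ss" so getTime_local_Date() still parses it
    kpi.setTime_local(m.group(4).split(" ")[0]);
    // group(5) is the full request line, e.g. "GET /images/my.jpg HTTP/1.1"; keep only the URL part
    String[] req = m.group(5).split(" ");
    kpi.setRequest(req.length > 1 ? req[1] : m.group(5));
    kpi.setStatus(m.group(6));
    kpi.setBody_bytes_sent(m.group(7));
    kpi.setHttp_referer(m.group(8));
    kpi.setHttp_user_agent(m.group(9));
    if (Integer.parseInt(kpi.getStatus()) >= 400) {
        kpi.setValid(false);
    }
    return kpi;
}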
We implement the map functions, reduce functions, and job drivers in separate classes.
The MapReduce implementation classes are introduced below (the remaining Source indicator, the referring-domain statistic, is sketched at the end of this section):
- PV:org.apache.hadoop.mr.kpi.KPIPV.java
- IP: org.apache.hadoop.mr.kpi.KPIIP.java
- Time: org.apache.hadoop.mr.kpi.KPITime.java
- Browser: org.apache.hadoop.mr.kpi.KPIBrowser.java
1). PV:org.apache.hadoop.mr.kpi.KPIPV.java
package org.apache.hadoop.mr.kpi;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class KPIPV {
/**
* @author yue
* 20160512
*/
public static class KPIPVMapper extends MapReduceBase implements Mapper<Object ,Text ,Text,IntWritable>{
private IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
KPI kpi = KPI.filterPVs(value.toString());
if(kpi.isValid()){
word.set(kpi.getRequest());
output.collect(word, one);
}
}
}
public static class KPIPVReducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while(values.hasNext()){
sum += values.next().get();
}
result.set(sum);
output.collect(key, result);
}
}
public static void main(String[] args) throws Exception{
String input = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi";
String output = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi/pv";
JobConf conf = new JobConf(KPIPV.class);
conf.setJobName("KPIPV");
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(KPIPVMapper.class);
conf.setCombinerClass(KPIPVReducer.class);
conf.setReducerClass(KPIPVReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
System.exit(0);
}
}
The mapper calls the filter method of the KPI class:
KPI kpi = KPI.filterPVs(value.toString());
Run KPIPV.java, then inspect the result file in HDFS with the hadoop command:
~ hadoop fs -cat /user/hdfs/log_kpi/pv/part-00000
/about 5
/black-ip-list/ 2
/cassandra-clustor/ 3
/finance-rhive-repurchase/ 13
/hadoop-family-roadmap/ 13
/hadoop-hive-intro/ 14
/hadoop-mahout-roadmap/ 20
/hadoop-zookeeper-intro/ 6
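The classes in this article use the older org.apache.hadoop.mapred API. For reference, a minimal sketch of the same PV job on the newer org.apache.hadoop.mapreduce API (assuming Hadoop 2.x; the class name KPIPVNewAPI and the use of command-line arguments for the paths are my own choices, not part of the original) could look like this:

package org.apache.hadoop.mr.kpi;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KPIPVNewAPI {
    public static class PVMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            KPI kpi = KPI.filterPVs(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                context.write(word, one);
            }
        }
    }

    public static class PVReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "KPIPV");
        job.setJarByClass(KPIPVNewAPI.class);
        job.setMapperClass(PVMapper.class);
        job.setCombinerClass(PVReducer.class);   // summing is safe to combine
        job.setReducerClass(PVReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}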
2). IP: org.apache.hadoop.mr.kpi.KPIIP.java
package org.apache.hadoop.mr.kpi;
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class KPIIP {
/**
* @author yue
* 20160512
*/
public static class KPIIPMapper extends MapReduceBase implements Mapper<Object,Text,Text,Text>{
private Text word = new Text();
private Text ips = new Text();
public void map(Object key, Text value,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
KPI kpi = KPI.filterIPs(value.toString());
if(kpi.isValid()){
word.set(kpi.getRequest());
ips.set(kpi.getRemote_addr());
output.collect(word, ips);
}
}
}
public static class KPIIPReducer extends MapReduceBase implements Reducer<Text,Text,Text,Text>{
private Text result = new Text();
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
// The set of IPs must be created per reduce() call; as a shared field it would
// accumulate IPs across keys and inflate every count after the first key.
Set<String> count = new HashSet<String>();
while(values.hasNext()){
count.add(values.next().toString());
}
result.set(String.valueOf(count.size()));
output.collect(key, result);
}
}
public static void main(String[] args) throws Exception{
String input = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi";
String output = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi/ip";
JobConf conf = new JobConf(KPIIP.class);
conf.setJobName("KPIIP");
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(KPIIPMapper.class);
// No combiner here: KPIIPReducer emits a count rather than IP values, so reusing it
// as a combiner would make the reducer count distinct count-strings instead of distinct IPs.
conf.setReducerClass(KPIIPReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
System.exit(0);
}
}
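If the map output volume of this job ever becomes a problem, a combiner can still help, but it has to keep emitting (page, ip) pairs so that the final reducer continues to see IP addresses. A hypothetical local-deduplication combiner (added inside KPIIP and registered with conf.setCombinerClass(KPIIPCombiner.class)) could look like this:

public static class KPIIPCombiner extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Deduplicate IPs locally but still emit the IPs themselves, not a count,
        // so the real KPIIPReducer can compute the global distinct count.
        Set<String> ips = new HashSet<String>();
        while (values.hasNext()) {
            ips.add(values.next().toString());
        }
        for (String ip : ips) {
            output.collect(key, new Text(ip));
        }
    }
}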
3). Time: org.apache.hadoop.mr.kpi.KPITime.java
package org.apache.hadoop.mr.kpi;
import java.io.IOException;
import java.text.ParseException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class KPITime {
/**
* @author yue 20160512
*/
public static class KPITimeMapper extends MapReduceBase implements
Mapper<Object, Text, Text, IntWritable> {
private IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
KPI kpi = KPI.filterTime(value.toString());
if(kpi.isValid()){
try {
word.set(kpi.getTime_local_Date_hour());
output.collect(word, one);
} catch (ParseException e) {
e.printStackTrace();
}
}
}
}
public static class KPITimeReducer extends MapReduceBase implements
Reducer<Text,IntWritable,Text,IntWritable>{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while(values.hasNext()){
sum+=values.next().get();
}
result.set(sum);
output.collect(key, result);
}
}
public static void main(String[] args) throws Exception{
String input = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi";
String output = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi/time";
JobConf conf = new JobConf(KPITime.class);
conf.setJobName("KPITime");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(KPITimeMapper.class);
conf.setCombinerClass(KPITimeReducer.class);
conf.setReducerClass(KPITimeReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
System.exit(0);
}
}
4). Browser: org.apache.hadoop.mr.kpi.KPIBrowser.java
package org.apache.hadoop.mr.kpi;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class KPIBrowser {
/**
* 20160512
* @author yue
*/
public static class KPIBrowserMapper extends MapReduceBase implements Mapper<Object,Text,Text,IntWritable>{
private IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key,Text value,OutputCollector<Text,IntWritable> output , Reporter reporter) throws IOException{
KPI kpi = KPI.filterBroswer(value.toString());
if(kpi.isValid()){
word.set(kpi.getHttp_user_agent());
output.collect(word, one);
}
}
}
public static class KPIBrowserReducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable>{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterator<IntWritable> values,OutputCollector<Text, IntWritable> output, Reporter reporter)throws IOException {
int sum = 0;
while(values.hasNext()){
sum+= values.next().get();
}
result.set(sum);
output.collect(key, result);
}
}
public static void main(String[] args) throws Exception{
String input = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi";
String output = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi/browser";
JobConf conf = new JobConf(KPIBrowser.class);
conf.setJobName("KPIBrowser");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(KPIBrowserMapper.class);
conf.setCombinerClass(KPIBrowserReducer.class);
conf.setReducerClass(KPIBrowserReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
System.exit(0);
}
}
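The KPI design in sections 2 and 3 also lists a Source indicator (referring domains), which is not implemented above even though KPI.filterDomain() and KPI.getHttp_referer_domain() already exist. A minimal sketch following the same pattern as the other jobs (the class name KPIDomain and the output path ending in /domain are my own choices) could look like this:

package org.apache.hadoop.mr.kpi;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class KPIDomain {
    public static class KPIDomainMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
        private IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            KPI kpi = KPI.filterDomain(value.toString());
            if (kpi.isValid()) {
                // reuse the existing helper to reduce the referer URL to its domain
                word.set(kpi.getHttp_referer_domain());
                output.collect(word, one);
            }
        }
    }
    public static class KPIDomainReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            output.collect(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        String input = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi";
        String output = "hdfs://192.168.37.134:9000/user/hdfs/log_kpi/domain"; // hypothetical output directory
        JobConf conf = new JobConf(KPIDomain.class);
        conf.setJobName("KPIDomain");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(KPIDomainMapper.class);
        conf.setCombinerClass(KPIDomainReducer.class); // summing is safe to combine
        conf.setReducerClass(KPIDomainReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));
        JobClient.runJob(conf);
        System.exit(0);
    }
}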