I have recently been studying offline big-data analysis, so in this post I will build a simple offline clickstream analysis system for a website as a way of walking through the architecture of an offline analysis pipeline. This architecture is only an entry-level model: a production offline analysis system involves many more details and a high-availability design. The goal here is simply to give you a first look at offline analysis techniques and a basis for discussion.
Website Clickstream Data
Collecting User Data
There are several ways to collect the data. One is to write your own shell scripts or Java programs, but that is labor-intensive and hard to maintain. The other is to use a third-party log-collection framework; these are generally robust, fault-tolerant, easy to use and easy to maintain. This article uses Flume, a distributed, efficient log-collection system that gathers massive log files scattered across different servers into one central store. Flume is an Apache top-level project and integrates well with Hadoop. Note, however, that Flume itself is not a highly available framework; that part is left for the user to handle.
A Flume agent runs inside a JVM, so every server involved needs a JVM. One Flume agent is deployed per server. Flume collects the logs produced by the web server, wraps them into events and delivers them to the agent's source; the source consumes these events and puts them into the agent's channel; the sink then takes the events from the channel and either stores them in the local file system or forwards them as input to the next Flume agent on another server in the cluster. Flume provides a point-to-point delivery guarantee: an event is removed from an agent's channel only after it has been confirmed to arrive in the next agent's channel or to be safely written to local storage.
The Flume configuration file on the FTP server is as follows:
agent.sources = origin
agent.channels = memorychannel
agent.sinks = target
agent.sources.origin.type = spooldir
agent.sources.origin.spoolDir = /export/data/trivial/weblogs
agent.sources.origin.channels = memorychannel
agent.sources.origin.deserializer.maxLineLength = 2048
agent.sources.origin.interceptors = i2
agent.sources.origin.interceptors.i2.type = host
agent.sources.origin.interceptors.i2.hostHeader = hostname
agent.sinks.loggerSink.type = logger
agent.sinks.loggerSink.channel = memorychannel
agent.channels.memorychannel.type = memory
agent.channels.memorychannel.capacity = 10000
agent.sinks.target.type = avro
agent.sinks.target.channel = memorychannel
agent.sinks.target.hostname = 172.16.124.130
agent.sinks.target.port = 4545
A few parameters here deserve explanation. The Flume agent source's deserializer.maxLineLength property sets the maximum size of each event; the default is 2048 bytes. The memory channel's capacity in bytes defaults to 80% of the memory available to the local JVM, and can be tuned through the byteCapacity and byteCapacityBufferPercentage parameters. Note in particular that log files placed into the spooling directory watched by Flume must not share the same file name, or Flume will raise an error and stop working; the simplest fix is to append a timestamp to each log file's name.
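For illustration, the channel-sizing knobs mentioned above could be set like this; the numbers are made-up examples for a small agent, not recommendations:

```properties
# illustrative memory-channel sizing (example values only)
agent.channels.memorychannel.type = memory
agent.channels.memorychannel.capacity = 10000
# cap the channel at roughly 100 MB instead of 80% of the JVM heap
agent.channels.memorychannel.byteCapacity = 104857600
# reserve 20% of byteCapacity for event headers (this is the default)
agent.channels.memorychannel.byteCapacityBufferPercentage = 20
```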
The configuration file on the Hadoop server is as follows:
agent.sources = origin
agent.channels = memorychannel
agent.sinks = target
agent.sources.origin.type = avro
agent.sources.origin.channels = memorychannel
agent.sources.origin.bind = 0.0.0.0
agent.sources.origin.port = 4545
#agent.sources.origin.interceptors = i1 i2
#agent.sources.origin.interceptors.i1.type = timestamp
#agent.sources.origin.interceptors.i2.type = host
#agent.sources.origin.interceptors.i2.hostHeader = hostname
agent.sinks.loggerSink.type = logger
agent.sinks.loggerSink.channel = memorychannel
agent.channels.memorychannel.type = memory
agent.channels.memorychannel.capacity = 5000000
agent.channels.memorychannel.transactionCapacity = 1000000
agent.sinks.target.type = hdfs
agent.sinks.target.channel = memorychannel
agent.sinks.target.hdfs.path = /flume/events/%y-%m-%d/%H%M%S
agent.sinks.target.hdfs.filePrefix = data-%{hostname}
agent.sinks.target.hdfs.rollInterval = 60
agent.sinks.target.hdfs.rollSize = 1073741824
agent.sinks.target.hdfs.rollCount = 1000000
agent.sinks.target.hdfs.round = true
agent.sinks.target.hdfs.roundValue = 10
agent.sinks.target.hdfs.roundUnit = minute
agent.sinks.target.hdfs.useLocalTimeStamp = true
agent.sinks.target.hdfs.minBlockReplicas=1
agent.sinks.target.hdfs.writeFormat=Text
agent.sinks.target.hdfs.fileType=DataStream
The round, roundValue and roundUnit parameters together make HDFS create a new directory every 10 minutes for the data pulled down from the FTP server.
Cleaning the Log Files with MapReduce
package com.guludada.clickstream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.guludada.dataparser.WebLogParser;
public class logClean {
public static class cleanMap extends Mapper<Object,Text,Text,NullWritable> {
private NullWritable v = NullWritable.get();
private Text word = new Text();
WebLogParser webLogParser = new WebLogParser();
public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
//convert the line of input to a String
String line = value.toString();
String cleanContent = webLogParser.parser(line);
//compare string content with isEmpty()/equals(), not ==
if(!cleanContent.isEmpty()) {
word.set(cleanContent);
context.write(word,v);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://ymhHadoop:9000");
Job job = Job.getInstance(conf);
job.setJarByClass(logClean.class);
//specify the mapper class this job uses
job.setMapperClass(cleanMap.class);
//specify the key/value types of the mapper output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
//directory of the job's raw input files, written by Flume
Date curDate = new Date();
SimpleDateFormat sdf = new SimpleDateFormat("yy-MM-dd");
String dateStr = sdf.format(curDate);
FileInputFormat.setInputPaths(job, new Path("/flume/events/" + dateStr + "/*/*"));
//directory of the job's output
FileOutputFormat.setOutputPath(job, new Path("/clickstream/cleandata/"+dateStr+"/"));
//submit the job configuration and the jar containing its classes to YARN
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
package com.guludada.dataparser;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.guludada.javabean.WebLogBean;
/**
 * Matches valid log records with regular expressions
 */
public class WebLogParser {
public String parser(String weblog_origin) {
WebLogBean weblogbean = new WebLogBean();
// extract the IP address (escape the dots so "." is matched literally)
Pattern IPPattern = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");
Matcher IPMatcher = IPPattern.matcher(weblog_origin);
if(IPMatcher.find()) {
String IPAddr = IPMatcher.group(0);
weblogbean.setIP_addr(IPAddr);
} else {
return "";
}
// extract the timestamp between the square brackets
Pattern TimePattern = Pattern.compile("\\[(.+)\\]");
Matcher TimeMatcher = TimePattern.matcher(weblog_origin);
if(TimeMatcher.find()) {
String time = TimeMatcher.group(1);
String[] cleanTime = time.split(" ");
weblogbean.setTime(cleanTime[0]);
} else {
return "";
}
// extract the remaining fields: request line, status code, response size,
// referrer and user agent
Pattern InfoPattern = Pattern.compile(
"(\\\"(?:POST|GET).+?\\\") (\\d+) (\\d+).+?(\\\".+?\\\") (\\\".+?\\\")");
Matcher InfoMatcher = InfoPattern.matcher(weblog_origin);
if(InfoMatcher.find()) {
String requestInfo = InfoMatcher.group(1).replace('\"',' ').trim();
String[] requestInfoArry = requestInfo.split(" ");
weblogbean.setMethod(requestInfoArry[0]);
weblogbean.setRequest_URL(requestInfoArry[1]);
weblogbean.setRequest_protocol(requestInfoArry[2]);
String status_code = InfoMatcher.group(2);
weblogbean.setRespond_code(status_code);
String respond_data = InfoMatcher.group(3);
weblogbean.setRespond_data(respond_data);
String request_come_from = InfoMatcher.group(4).replace('\"',' ').trim();
weblogbean.setRequst_come_from(request_come_from);
String browserInfo = InfoMatcher.group(5).replace('\"',' ').trim();
weblogbean.setBrowser(browserInfo);
} else {
return "";
}
return weblogbean.toString();
}
}
package com.guludada.javabean;
public class WebLogBean {
String IP_addr;
String time;
String method;
String request_URL;
String request_protocol;
String respond_code;
String respond_data;
String requst_come_from;
String browser;
public String getIP_addr() {
return IP_addr;
}
public void setIP_addr(String iP_addr) {
IP_addr = iP_addr;
}
public String getTime() {
return time;
}
public void setTime(String time) {
this.time = time;
}
public String getMethod() {
return method;
}
public void setMethod(String method) {
this.method = method;
}
public String getRequest_URL() {
return request_URL;
}
public void setRequest_URL(String request_URL) {
this.request_URL = request_URL;
}
public String getRequest_protocol() {
return request_protocol;
}
public void setRequest_protocol(String request_protocol) {
this.request_protocol = request_protocol;
}
public String getRespond_code() {
return respond_code;
}
public void setRespond_code(String respond_code) {
this.respond_code = respond_code;
}
public String getRespond_data() {
return respond_data;
}
public void setRespond_data(String respond_data) {
this.respond_data = respond_data;
}
public String getRequst_come_from() {
return requst_come_from;
}
public void setRequst_come_from(String requst_come_from) {
this.requst_come_from = requst_come_from;
}
public String getBrowser() {
return browser;
}
public void setBrowser(String browser) {
this.browser = browser;
}
@Override
public String toString() {
return IP_addr + " " + time + " " + method + " "
+ request_URL + " " + request_protocol + " " + respond_code
+ " " + respond_data + " " + requst_come_from + " " + browser;
}
}
After the first cleaning pass, each record keeps the following fields in order: IP address, timestamp, request method, requested URL, protocol, status code, response size, referrer and browser.
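As a self-contained sketch of the same regex-based extraction, run outside MapReduce; the class name, the log line and the returned field set are illustrative, not part of the system above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebLogParserDemo {
    // returns {ip, timestamp, requestedURL, statusCode}, or null when the
    // line does not look like an Apache combined-log record
    public static String[] parse(String line) {
        // escape the dots so "." is matched literally
        Matcher ip = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+").matcher(line);
        // timestamp between square brackets
        Matcher time = Pattern.compile("\\[(.+)\\]").matcher(line);
        // request line, status code, byte count, referrer and user agent
        Matcher info = Pattern.compile(
                "\"((?:POST|GET).+?)\" (\\d+) (\\d+).+?\"(.+?)\" \"(.+?)\"").matcher(line);
        if (!ip.find() || !time.find() || !info.find()) return null;
        return new String[] {
                ip.group(0),
                time.group(1).split(" ")[0],      // drop the timezone offset
                info.group(1).split(" ")[1],      // URL inside the request line
                info.group(2)                     // HTTP status code
        };
    }

    public static void main(String[] args) {
        // a made-up Apache combined-log record, for illustration only
        String[] fields = parse("192.168.12.130 - - [30/May/2015:19:38:00 +0800] "
                + "\"GET /blog/me HTTP/1.1\" 200 2048 "
                + "\"http://www.baidu.com\" \"Mozilla/5.0\"");
        for (String f : fields) System.out.println(f);
    }
}
```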
package com.guludada.clickstream;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.guludada.clickstream.logClean.cleanMap;
import com.guludada.dataparser.SessionParser;
import com.guludada.dataparser.WebLogParser;
import com.guludada.javabean.WebLogSessionBean;
public class logSession {
public static class sessionMapper extends Mapper<Object,Text,Text,Text> {
private Text IPAddr = new Text();
private Text content = new Text();
private NullWritable v = NullWritable.get();
WebLogParser webLogParser = new WebLogParser();
public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
//the first token of each cleaned record is the visitor's IP address,
//which becomes the grouping key
String line = value.toString();
String[] weblogArry = line.split(" ");
IPAddr.set(weblogArry[0]);
content.set(line);
context.write(IPAddr,content);
}
}
static class sessionReducer extends Reducer<Text, Text, Text, NullWritable>{
private Text IPAddr = new Text();
private Text content = new Text();
private NullWritable v = NullWritable.get();
WebLogParser webLogParser = new WebLogParser();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
SessionParser sessionParser = new SessionParser();
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
Date sessionStartTime = null;
String sessionID = UUID.randomUUID().toString();
//sort all browsing records of the user behind this IP address by time
ArrayList<WebLogSessionBean> sessionBeanGroup = new ArrayList<WebLogSessionBean>();
for(Text browseHistory : values) {
WebLogSessionBean sessionBean = sessionParser.loadBean(browseHistory.toString());
sessionBeanGroup.add(sessionBean);
}
Collections.sort(sessionBeanGroup,new Comparator<WebLogSessionBean>() {
public int compare(WebLogSessionBean sessionBean1, WebLogSessionBean sessionBean2) {
Date date1 = sessionBean1.getTimeWithDateFormat();
Date date2 = sessionBean2.getTimeWithDateFormat();
if(date1 == null && date2 == null) return 0;
if(date1 == null) return -1;
if(date2 == null) return 1;
return date1.compareTo(date2);
}
});
for(WebLogSessionBean sessionBean : sessionBeanGroup) {
if(sessionStartTime == null) {
//the user's first visit of the day opens the first session
sessionStartTime = timeTransform(sessionBean.getTime());
} else {
Date sessionEndTime = timeTransform(sessionBean.getTime());
long sessionStayTime = timeDiffer(sessionStartTime,sessionEndTime);
if(sessionStayTime > 30 * 60 * 1000) {
//more than 30 minutes since the session opened: this record
//becomes the first record of a new session (and must still be written)
sessionStartTime = timeTransform(sessionBean.getTime());
sessionID = UUID.randomUUID().toString();
}
}
content.set(sessionParser.parser(sessionBean, sessionID));
context.write(content,v);
}
}
private Date timeTransform(String time) {
Date standard_time = null;
try {
standard_time = sdf.parse(time);
} catch (ParseException e) {
e.printStackTrace();
}
return standard_time;
}
private long timeDiffer(Date start_time,Date end_time) {
long diffTime = 0;
diffTime = end_time.getTime() - start_time.getTime();
return diffTime;
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://ymhHadoop:9000");
Job job = Job.getInstance(conf);
job.setJarByClass(logSession.class);
//specify the mapper and reducer classes this job uses
job.setMapperClass(sessionMapper.class);
job.setReducerClass(sessionReducer.class);
//specify the key/value types of the mapper output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//specify the key/value types of the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
Date curDate = new Date();
SimpleDateFormat sdf = new SimpleDateFormat("yy-MM-dd");
String dateStr = sdf.format(curDate);
//directory of the job's input files
FileInputFormat.setInputPaths(job, new Path("/clickstream/cleandata/"+dateStr+"/*"));
//directory of the job's output
FileOutputFormat.setOutputPath(job, new Path("/clickstream/sessiondata/"+dateStr+"/"));
//submit the job configuration and the jar containing its classes to YARN
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
package com.guludada.dataparser;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import com.guludada.javabean.WebLogSessionBean;
public class SessionParser {
SimpleDateFormat sdf_origin = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss",Locale.ENGLISH);
SimpleDateFormat sdf_final = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
public String parser(WebLogSessionBean sessionBean,String sessionID) {
sessionBean.setSession(sessionID);
return sessionBean.toString();
}
public WebLogSessionBean loadBean(String sessionContent) {
WebLogSessionBean weblogSession = new WebLogSessionBean();
String[] contents = sessionContent.split(" ");
weblogSession.setTime(timeTransform(contents[1]));
weblogSession.setIP_addr(contents[0]);
weblogSession.setRequest_URL(contents[3]);
weblogSession.setReferal(contents[7]);
return weblogSession;
}
private String timeTransform(String time) {
Date standard_time = null;
try {
standard_time = sdf_origin.parse(time);
} catch (ParseException e) {
e.printStackTrace();
//fall back to the raw string when the timestamp cannot be parsed
return time;
}
return sdf_final.format(standard_time);
}
}
package com.guludada.javabean;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
public class WebLogSessionBean {
String time;
String IP_addr;
String session;
String request_URL;
String referal;
public String getTime() {
return time;
}
public void setTime(String time) {
this.time = time;
}
public String getIP_addr() {
return IP_addr;
}
public void setIP_addr(String iP_addr) {
IP_addr = iP_addr;
}
public String getSession() {
return session;
}
public void setSession(String session) {
this.session = session;
}
public String getRequest_URL() {
return request_URL;
}
public void setRequest_URL(String request_URL) {
this.request_URL = request_URL;
}
public String getReferal() {
return referal;
}
public void setReferal(String referal) {
this.referal = referal;
}
public Date getTimeWithDateFormat() {
SimpleDateFormat sdf_final = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
if(this.time != null && !this.time.isEmpty()) {
try {
return sdf_final.parse(this.time);
} catch (ParseException e) {
e.printStackTrace();
}
}
return null;
}
@Override
public String toString() {
return time + " " + IP_addr + " " + session + " "
+ request_URL + " " + referal;
}
}
The session records produced by the second cleaning pass have the following structure:
Time | IP | SessionID | Requested page URL | Referral URL |
2015-05-30 19:38:00 | 192.168.12.130 | Session1 | /blog/me | www.baidu.com |
2015-05-30 19:39:00 | 192.168.12.130 | Session1 | /blog/me/details | www.mysite.com/blog/me |
2015-05-30 19:38:00 | 192.168.12.40 | Session2 | /blog/me | www.baidu.com |
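The reducer's session-cutting rule, stripped of the Hadoop plumbing, can be sketched as follows. The class name and sample timestamps are hypothetical; like the reducer above, it measures the gap from the first visit of the current session rather than from the previous visit:

```java
import java.util.Arrays;

public class SessionizeDemo {
    static final long TIMEOUT_MS = 30 * 60 * 1000;  // 30-minute session window

    // assign a session number to each visit timestamp (epoch millis, sorted
    // ascending): a gap larger than the timeout measured from the session's
    // first visit starts a new session
    public static int[] sessionize(long[] sortedVisits) {
        int[] sessionOf = new int[sortedVisits.length];
        int session = 0;
        long sessionStart = sortedVisits.length > 0 ? sortedVisits[0] : 0;
        for (int i = 0; i < sortedVisits.length; i++) {
            if (sortedVisits[i] - sessionStart > TIMEOUT_MS) {
                session++;                      // too long since the session opened
                sessionStart = sortedVisits[i]; // this visit opens the next session
            }
            sessionOf[i] = session;
        }
        return sessionOf;
    }

    public static void main(String[] args) {
        // hypothetical visits at 0, 1, 40 and 41 minutes
        long[] visits = {0, 60_000, 2_400_000, 2_460_000};
        System.out.println(Arrays.toString(sessionize(visits)));  // [0, 0, 1, 1]
    }
}
```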
Step 3: process the session records generated in step 2 to produce the PageViews table.
package com.guludada.clickstream;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.guludada.clickstream.logClean.cleanMap;
import com.guludada.clickstream.logSession.sessionMapper;
import com.guludada.clickstream.logSession.sessionReducer;
import com.guludada.dataparser.PageViewsParser;
import com.guludada.dataparser.SessionParser;
import com.guludada.dataparser.WebLogParser;
import com.guludada.javabean.PageViewsBean;
import com.guludada.javabean.WebLogSessionBean;
public class PageViews {
public static class pageMapper extends Mapper<Object,Text,Text,Text> {
private Text word = new Text();
public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] webLogContents = line.split(" ");
//group by session ID: tokens [0] and [1] are the date and time,
//[2] is the IP address, so the session ID is token [3]
word.set(webLogContents[3]);
context.write(word,value);
}
}
public static class pageReducer extends Reducer<Text, Text, Text, NullWritable>{
private Text session = new Text();
private Text content = new Text();
private NullWritable v = NullWritable.get();
PageViewsParser pageViewsParser = new PageViewsParser();
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//sort all of this session's browsing records by time
ArrayList<PageViewsBean> pageViewsBeanGroup = new ArrayList<PageViewsBean>();
for(Text pageView : values) {
PageViewsBean pageViewsBean = pageViewsParser.loadBean(pageView.toString());
pageViewsBeanGroup.add(pageViewsBean);
}
Collections.sort(pageViewsBeanGroup,new Comparator<PageViewsBean>() {
public int compare(PageViewsBean pageViewsBean1, PageViewsBean pageViewsBean2) {
Date date1 = pageViewsBean1.getTimeWithDateFormat();
Date date2 = pageViewsBean2.getTimeWithDateFormat();
if(date1 == null && date2 == null) return 0;
if(date1 == null) return -1;
if(date2 == null) return 1;
return date1.compareTo(date2);
}
});
//each page's stay time is the gap to the next record in the same session;
//the last page has no following record, so it keeps the default of 0.
//the stay time must be set before the record is written out
for(int step = 0; step < pageViewsBeanGroup.size(); step++) {
PageViewsBean pageViewsBean = pageViewsBeanGroup.get(step);
pageViewsBean.setStep(String.valueOf(step + 1));
if(step + 1 < pageViewsBeanGroup.size()) {
Date curVisitTime = pageViewsBean.getTimeWithDateFormat();
Date nextVisitTime = pageViewsBeanGroup.get(step + 1).getTimeWithDateFormat();
//difference between consecutive visits, in seconds
long timeDiff = (nextVisitTime.getTime() - curVisitTime.getTime()) / 1000;
pageViewsBean.setStayTime(String.valueOf(timeDiff));
}
content.set(pageViewsParser.parser(pageViewsBean));
context.write(content,v);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://ymhHadoop:9000");
Job job = Job.getInstance(conf);
job.setJarByClass(PageViews.class);
//specify the mapper and reducer classes this job uses
job.setMapperClass(pageMapper.class);
job.setReducerClass(pageReducer.class);
//specify the key/value types of the mapper output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//specify the key/value types of the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
Date curDate = new Date();
SimpleDateFormat sdf = new SimpleDateFormat("yy-MM-dd");
String dateStr = sdf.format(curDate);
//directory of the job's input files
FileInputFormat.setInputPaths(job, new Path("/clickstream/sessiondata/"+dateStr+"/*"));
//directory of the job's output
FileOutputFormat.setOutputPath(job, new Path("/clickstream/pageviews/"+dateStr+"/"));
//submit the job configuration and the jar containing its classes to YARN
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
package com.guludada.dataparser;
import com.guludada.javabean.PageViewsBean;
import com.guludada.javabean.WebLogSessionBean;
public class PageViewsParser {
/**
 * Loads a PageViewsBean from a line of logSession output
 */
public PageViewsBean loadBean(String sessionContent) {
PageViewsBean pageViewsBean = new PageViewsBean();
String[] contents = sessionContent.split(" ");
pageViewsBean.setTime(contents[0] + " " + contents[1]);
pageViewsBean.setIP_addr(contents[2]);
pageViewsBean.setSession(contents[3]);
pageViewsBean.setVisit_URL(contents[4]);
pageViewsBean.setStayTime("0");
pageViewsBean.setStep("0");
return pageViewsBean;
}
public String parser(PageViewsBean pageBean) {
return pageBean.toString();
}
}
package com.guludada.javabean;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
public class PageViewsBean {
String session;
String IP_addr;
String time;
String visit_URL;
String stayTime;
String step;
public String getSession() {
return session;
}
public void setSession(String session) {
this.session = session;
}
public String getIP_addr() {
return IP_addr;
}
public void setIP_addr(String iP_addr) {
IP_addr = iP_addr;
}
public String getTime() {
return time;
}
public void setTime(String time) {
this.time = time;
}
public String getVisit_URL() {
return visit_URL;
}
public void setVisit_URL(String visit_URL) {
this.visit_URL = visit_URL;
}
public String getStayTime() {
return stayTime;
}
public void setStayTime(String stayTime) {
this.stayTime = stayTime;
}
public String getStep() {
return step;
}
public void setStep(String step) {
this.step = step;
}
public Date getTimeWithDateFormat() {
SimpleDateFormat sdf_final = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
if(this.time != null && !this.time.isEmpty()) {
try {
return sdf_final.parse(this.time);
} catch (ParseException e) {
e.printStackTrace();
}
}
return null;
}
@Override
public String toString() {
return session + " " + IP_addr + " " + time + " "
+ visit_URL + " " + stayTime + " " + step;
}
}
The PageViews data produced by the third cleaning pass has the following structure:
SessionID | IP | Visit time | Visited page | Stay time | Step |
Session1 | 192.168.12.130 | 2016-05-30 15:17:30 | /blog/me | 30000 | 1 |
Session1 | 192.168.12.130 | 2016-05-30 15:18:00 | /blog/me/admin | 30000 | 2 |
Session1 | 192.168.12.130 | 2016-05-30 15:18:30 | /home | 30000 | 3 |
Session2 | 192.168.12.150 | 2016-05-30 15:16:30 | /products | 30000 | 1 |
Session2 | 192.168.12.150 | 2016-05-30 15:17:00 | /products/details | 30000 | 2 |
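The stay-time calculation behind this table can be sketched independently of MapReduce; the class name and sample timestamps are made up. Each page's stay time is the gap to the next record in the same session, and the last page keeps a stay time of 0 because nothing follows it:

```java
public class StayTimeDemo {
    // given one session's visit times (epoch millis, sorted ascending), return
    // the stay time of each page in seconds; the last page has no next record
    public static long[] stayTimes(long[] sortedVisits) {
        long[] stay = new long[sortedVisits.length];
        for (int i = 0; i + 1 < sortedVisits.length; i++) {
            stay[i] = (sortedVisits[i + 1] - sortedVisits[i]) / 1000;
        }
        return stay;  // stay[last] stays at the default of 0
    }

    public static void main(String[] args) {
        // hypothetical visits 30 seconds apart, echoing the sample table above
        long[] visits = {0, 30_000, 60_000};
        for (long s : stayTimes(visits)) {
            System.out.println(s);  // 30, 30, 0 on separate lines
        }
    }
}
```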
Step 4: clean the session records once more to produce the Visits table.
package com.guludada.clickstream;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.guludada.clickstream.PageViews.pageMapper;
import com.guludada.clickstream.PageViews.pageReducer;
import com.guludada.clickstream.logClean.cleanMap;
import com.guludada.dataparser.PageViewsParser;
import com.guludada.dataparser.VisitsInfoParser;
import com.guludada.javabean.PageViewsBean;
public class VisitsInfo {
public static class visitMapper extends Mapper<Object,Text,Text,Text> {
private Text word = new Text();
public void map(Object key,Text value,Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] webLogContents = line.split(" ");
//group by session ID: tokens [0] and [1] are the date and time,
//[2] is the IP address, so the session ID is token [3]
word.set(webLogContents[3]);
context.write(word,value);
}
}
public static class visitReducer extends Reducer<Text, Text, Text, NullWritable>{
private Text content = new Text();
private NullWritable v = NullWritable.get();
VisitsInfoParser visitsParser = new VisitsInfoParser();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
PageViewsParser pageViewsParser = new PageViewsParser();
Map<String,Integer> viewedPagesMap = new HashMap<String,Integer>();
String entry_URL = "";
String leave_URL = "";
int total_visit_pages = 0;
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
//reset the per-session page set, then sort this session's records by time
viewedPagesMap.clear();
ArrayList<String> browseInfoGroup = new ArrayList<String>();
for(Text browseInfo : values) {
browseInfoGroup.add(browseInfo.toString());
}
Collections.sort(browseInfoGroup,new Comparator<String>() {
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
public int compare(String browseInfo1, String browseInfo2) {
String dateStr1 = browseInfo1.split(" ")[0] + " " + browseInfo1.split(" ")[1];
String dateStr2 = browseInfo2.split(" ")[0] + " " + browseInfo2.split(" ")[1];
Date date1;
Date date2;
try {
date1 = sdf.parse(dateStr1);
date2 = sdf.parse(dateStr2);
return date1.compareTo(date2);
} catch (ParseException e) {
e.printStackTrace();
return 0;
}
}
});
//count the distinct pages visited in this session (the entry and exit
//pages are derived later in VisitsInfoParser)
for(String browseInfo : browseInfoGroup) {
String[] browseInfoStrArr = browseInfo.split(" ");
//token [4] is the requested page URL ([0],[1] timestamp, [2] IP, [3] session)
String curVisitURL = browseInfoStrArr[4];
if(viewedPagesMap.get(curVisitURL) == null) {
viewedPagesMap.put(curVisitURL, 1);
}
}
total_visit_pages = viewedPagesMap.size();
String visitsInfo = visitsParser.parser(browseInfoGroup, total_visit_pages+"");
content.set(visitsInfo);
context.write(content,v);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://ymhHadoop:9000");
Job job = Job.getInstance(conf);
job.setJarByClass(VisitsInfo.class);
//specify the mapper and reducer classes this job uses
job.setMapperClass(visitMapper.class);
job.setReducerClass(visitReducer.class);
//specify the key/value types of the mapper output
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//specify the key/value types of the final output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
Date curDate = new Date();
SimpleDateFormat sdf = new SimpleDateFormat("yy-MM-dd");
String dateStr = sdf.format(curDate);
//directory of the job's input files
FileInputFormat.setInputPaths(job, new Path("/clickstream/sessiondata/"+dateStr+"/*"));
//directory of the job's output
FileOutputFormat.setOutputPath(job, new Path("/clickstream/visitsinfo/"+dateStr+"/"));
//submit the job configuration and the jar containing its classes to YARN
boolean res = job.waitForCompletion(true);
System.exit(res?0:1);
}
}
package com.guludada.dataparser;
import java.util.ArrayList;
import com.guludada.javabean.PageViewsBean;
import com.guludada.javabean.VisitsInfoBean;
import com.guludada.javabean.WebLogSessionBean;
public class VisitsInfoParser {
public String parser(ArrayList<String> pageViewsGroup,String totalVisitNum) {
VisitsInfoBean visitsBean = new VisitsInfoBean();
String entryPage = pageViewsGroup.get(0).split(" ")[4];
String leavePage = pageViewsGroup.get(pageViewsGroup.size()-1).split(" ")[4];
String startTime = pageViewsGroup.get(0).split(" ")[0] + " " + pageViewsGroup.get(0).split(" ")[1];
String endTime = pageViewsGroup.get(pageViewsGroup.size()-1).split(" ")[0] +
" " +pageViewsGroup.get(pageViewsGroup.size()-1).split(" ")[1];
String session = pageViewsGroup.get(0).split(" ")[3];
String IP = pageViewsGroup.get(0).split(" ")[2];
String referal = pageViewsGroup.get(0).split(" ")[5];
visitsBean.setSession(session);
visitsBean.setStart_time(startTime);
visitsBean.setEnd_time(endTime);
visitsBean.setEntry_page(entryPage);
visitsBean.setLeave_page(leavePage);
visitsBean.setVisit_page_num(totalVisitNum);
visitsBean.setIP_addr(IP);
visitsBean.setReferal(referal);
return visitsBean.toString();
}
}
package com.guludada.javabean;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
public class VisitsInfoBean {
String session;
String start_time;
String end_time;
String entry_page;
String leave_page;
String visit_page_num;
String IP_addr;
String referal;
public String getSession() {
return session;
}
public void setSession(String session) {
this.session = session;
}
public String getStart_time() {
return start_time;
}
public void setStart_time(String start_time) {
this.start_time = start_time;
}
public String getEnd_time() {
return end_time;
}
public void setEnd_time(String end_time) {
this.end_time = end_time;
}
public String getEntry_page() {
return entry_page;
}
public void setEntry_page(String entry_page) {
this.entry_page = entry_page;
}
public String getLeave_page() {
return leave_page;
}
public void setLeave_page(String leave_page) {
this.leave_page = leave_page;
}
public String getVisit_page_num() {
return visit_page_num;
}
public void setVisit_page_num(String visit_page_num) {
this.visit_page_num = visit_page_num;
}
public String getIP_addr() {
return IP_addr;
}
public void setIP_addr(String iP_addr) {
IP_addr = iP_addr;
}
public String getReferal() {
return referal;
}
public void setReferal(String referal) {
this.referal = referal;
}
@Override
public String toString() {
return session + " " + start_time + " " + end_time
+ " " + entry_page + " " + leave_page + " " + visit_page_num
+ " " + IP_addr + " " + referal;
}
}
The visit-record table produced by the fourth cleaning pass has the following structure:
SessionID | Start time | End time | Entry page | Exit page | Pages visited | IP | Referal |
Session1 | 2016-05-30 15:17:00 | 2016-05-30 15:19:00 | /blog/me | /blog/others | 5 | 192.168.12.130 | www.baidu.com |
Session2 | 2016-05-30 14:17:00 | 2016-05-30 15:19:38 | /home | /profile | 10 | 192.168.12.140 | www.178.com |
Session3 | 2016-05-30 12:17:00 | 2016-05-30 15:40:00 | /products | /detail | 6 | 192.168.12.150 | www.78dm.net |
MapReduce Troubleshooting
How to feed every file under a directory to a MapReduce job as input:
1. HDFS paths support glob patterns, so a wildcard path matches all files under a directory.
2. Alternatively, call FileInputFormat.setInputDirRecursive(job, true) and pass the directory itself as the input path.
How to assign each user a SessionID in a distributed environment
A UUID works well here: it is a 128-bit identifier designed to be unique across machines. The classic version-1 scheme combines a timestamp, a clock sequence and a machine identifier (usually the network card's MAC address); Java's UUID.randomUUID(), used in the code above, instead generates a random version-4 UUID, which is equally suitable. Either way, each user's SessionID is unique for all practical purposes.
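A quick sketch of this behavior (the class name is illustrative); java.util.UUID.randomUUID() returns a fresh random identifier on every call, independent of host or time:

```java
import java.util.UUID;

public class SessionIdDemo {
    public static void main(String[] args) {
        // each call yields a new 128-bit identifier
        UUID a = UUID.randomUUID();
        UUID b = UUID.randomUUID();
        System.out.println(a);             // e.g. 36-character hex-and-dash form
        System.out.println(a.version());   // 4: randomly generated
        System.out.println(a.equals(b));   // false (collisions are astronomically unlikely)
    }
}
```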
Building the Data Warehouse with Hive
After cleaning the log files with MapReduce, we use Hive to build the corresponding data warehouse and analyze the data with HiveQL. In this system we use a star schema to build the ODS (Operational Data Store) layer of the warehouse. The statements below can be run through the beeline client against a running hiveserver2 instance, or put into scripts for scheduled execution.
PageViews Analysis
Fact and dimension table structure for PageViews
使用HIVE在數據倉庫中創建PageViews的貼源數據表:
>> create table pageviews(session string,ip string,requestdate string,requesttime string,visitpage string, staytime string,step string) comment 'this is the table for pageviews' partitioned by(inputDate string) clustered by(session) sorted by(requestdate,requesttime) into 4 buckets row format delimited fields terminated by ' ';
Load the data from HDFS into the PageViews staging table:
>> load data inpath '/clickstream/pageviews' overwrite into table pageviews partition(inputDate='2016-05-17');
If the statement does not say 'local', Hive loads the data from HDFS rather than from the local file system.
Create the ODS-layer PageViews fact table according to the concrete analysis logic, and populate it from the staging table.
Bucketing (clustered by) on the requested page URL here makes it convenient to count each page's PV later:
>> create table ods_pageviews(session string,ip string,viewtime string,visitpage string, staytime string,step string) partitioned by(inputDate string) clustered by(visitpage) sorted by(viewtime) into 4 buckets row format delimited fields terminated by ' ';
>> insert into table ods_pageviews partition(inputDate='2016-05-17') select pv.session,pv.ip,concat(pv.requestdate,"-",pv.requesttime),pv.visitpage,pv.staytime,pv.step from pageviews as pv where pv.inputDate='2016-05-17';
Create the time dimension table for the PageViews fact table and populate it from the day's data:
>>create table ods_dim_pageviews_time(time string,year string,month string,day string,hour string,minutes string,seconds string) partitioned by(inputDate String) clustered by(year,month,day) sorted by(time) into 4 buckets row format delimited fields terminated by ' ';
>> insert overwrite table ods_dim_pageviews_time partition(inputDate='2016-05-17') select distinct pv.viewtime, substring(pv.viewtime,0,4),substring(pv.viewtime,6,2),substring(pv.viewtime,9,2),substring(pv.viewtime,12,2),substring(pv.viewtime,15,2),substring(pv.viewtime,18,2) from ods_pageviews as pv;
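Hive's substring is 1-based (and treats position 0 as 1), so the offsets in the statement above slice viewtime = "2016-05-17-15:17:00" into its date and time components. A small Java check of the same 1-based slices (the hiveSubstring helper is ours, written only to mirror Hive's semantics):

```java
public class ViewtimeSliceDemo {

    // Emulate Hive's 1-based substring(str, pos, len); Hive treats pos 0 as 1.
    static String hiveSubstring(String s, int pos, int len) {
        int start = Math.max(pos, 1) - 1;                 // 1-based -> 0-based
        return s.substring(start, Math.min(start + len, s.length()));
    }

    public static void main(String[] args) {
        // viewtime = concat(requestdate, "-", requesttime), e.g.:
        String viewtime = "2016-05-17-15:17:00";
        System.out.println(hiveSubstring(viewtime, 0, 4));   // year:   2016
        System.out.println(hiveSubstring(viewtime, 6, 2));   // month:  05
        System.out.println(hiveSubstring(viewtime, 9, 2));   // day:    17
        System.out.println(hiveSubstring(viewtime, 12, 2));  // hour:   15
        System.out.println(hiveSubstring(viewtime, 15, 2));  // minute: 17
        System.out.println(hiveSubstring(viewtime, 18, 2));  // second: 00
    }
}
```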
Create the URL dimension table for the PageViews fact table and populate it from the day's data:
>> create table ods_dim_pageviews_url(visitpage string,host string,path string,query string) partitioned by(inputDate string) clustered by(visitpage) sorted by(visitpage) into 4 buckets row format delimited fields terminated by ' ';
>> insert into table ods_dim_pageviews_url partition(inputDate='2016-05-17') select distinct pv.visitpage,b.host,b.path,b.query from pageviews pv lateral view parse_url_tuple(concat('https://localhost',pv.visitpage),'HOST','PATH','QUERY') b as host,path,query;
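parse_url_tuple splits a complete URL into its parts, which is why the statement above prepends a dummy 'https://localhost' to the stored page path. The same decomposition can be sketched in plain Java with java.net.URI (the sample path and query string are invented):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ParseUrlDemo {
    public static void main(String[] args) throws URISyntaxException {
        // visitpage is stored as a bare path, so prepend a dummy scheme and host,
        // exactly as the Hive statement does with concat('https://localhost', ...).
        String visitpage = "/blog/me?from=feed";
        URI u = new URI("https://localhost" + visitpage);
        System.out.println(u.getHost());   // localhost
        System.out.println(u.getPath());   // /blog/me
        System.out.println(u.getQuery());  // from=feed
    }
}
```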
Query the top 20 pages by total PV for a given day:
>> select op.visitpage as path,count(*) as num from ods_pageviews as op join ods_dim_pageviews_url as opurl on (op.visitpage = opurl.visitpage) join ods_dim_pageviews_time as optime on (optime.time = op.viewtime) where optime.year='2013' and optime.month='09' and optime.day='19' group by op.visitpage sort by num desc limit 20;
The query results:
Visits analysis
Fact-table and dimension-table structure for the per-session visit records (Visits)
Create the Visits staging table in the warehouse with Hive:
>> create table visitsinfo(session string,startdate string,starttime string,enddate string,endtime string,entrypage string,leavepage string,viewpagenum string,ip string,referal string) partitioned by(inputDate string) clustered by(session) sorted by(startdate,starttime) into 4 buckets row format delimited fields terminated by ' ';
Load the data from HDFS into the Visits staging table:
>> load data inpath '/clickstream/visitsinfo' overwrite into table visitsinfo partition(inputDate='2016-05-18');
Create the ODS-layer Visits fact table according to the concrete analysis logic, and populate it from the visitsinfo staging table:
>> create table ods_visits(session string,entrytime string,leavetime string,entrypage string,leavepage string,viewpagenum string,ip string,referal string) partitioned by(inputDate string) clustered by(session) sorted by(entrytime) into 4 buckets row format delimited fields terminated by ' ';
>> insert into table ods_visits partition(inputDate='2016-05-18') select vi.session,concat(vi.startdate,"-",vi.starttime),concat(vi.enddate,"-",vi.endtime),vi.entrypage,vi.leavepage,vi.viewpagenum,vi.ip,vi.referal from visitsinfo as vi where vi.inputDate='2016-05-18';
Create the time dimension table for the Visits fact table and populate it from the day's data.
>>create table ods_dim_visits_time(time string,year string,month string,day string,hour string,minutes string,seconds string) partitioned by(inputDate String) clustered by(year,month,day) sorted by(time) into 4 buckets row format delimited fields terminated by ' ';
Merge the entry-time and leave-time columns into one before loading them into the time dimension table, to reduce redundancy:
>>insert overwrite table ods_dim_visits_time partition(inputDate='2016-05-18') select distinct ov.timeparam, substring(ov.timeparam,0,4),substring(ov.timeparam,6,2),substring(ov.timeparam,9,2),substring(ov.timeparam,12,2),substring(ov.timeparam,15,2),substring(ov.timeparam,18,2) from (select ov1.entrytime as timeparam from ods_visits as ov1 union select ov2.leavetime as timeparam from ods_visits as ov2) as ov;
Create the URL dimension table for the Visits fact table and populate it from the day's data.
>> create table ods_dim_visits_url(pageurl string,host string,path string,query string) partitioned by(inputDate string) clustered by(pageurl) sorted by(pageurl) into 4 buckets row format delimited fields terminated by ' ';
Merge each session's entry-page and leave-page URLs before storing them in the URL dimension table:
>>insert into table ods_dim_visits_url partition(inputDate='2016-05-18') select distinct ov.pageurl,b.host,b.path,b.query from (select ov1.entrypage as pageurl from ods_visits as ov1 union select ov2.leavepage as pageurl from ods_visits as ov2 ) as ov lateral view parse_url_tuple(concat('https://localhost',ov.pageurl),'HOST','PATH','QUERY') b as host,path,query;
Also store, for each session, the external site (referal) it entered the current site from:
>>insert into table ods_dim_visits_url partition(inputDate='2016-05-18') select distinct ov.referal,b.host,b.path,b.query from ods_visits as ov lateral view parse_url_tuple(ov.referal,'HOST','PATH','QUERY') b as host,path,query;
Count the number of exits per page (in practice the more valuable metric is the exit/bounce rate, but to keep the demonstration simple the author only counts exits):
>> select ov.leavepage as jumpPage, count(*) as jumpNum from ods_visits as ov group by ov.leavepage order by jumpNum desc;
Business page conversion analysis (the funnel model)
Hive has no keyword for an auto-incrementing column at table-creation time; that would need a user-defined function (UDF). In queries, however, row numbers can be produced with row_number(), which is a windowing function and must therefore be used together with an over() clause, as in the example below. To keep things simple, we create a temporary table and manually insert the business page URLs and each page's total PV; from these figures we compute the conversion ratio between business pages, i.e. the funnel model.
Assume the conversion flow "/index" -> "/detail" -> "/createOrder" -> "/confirmOrder".
First create a temporary table for the business pages' PV; a temporary table and its data are cleaned up automatically when the session ends.
>> create temporary table transactionpageviews(url string,views int) row format delimited fields terminated by ' ';
Count each business page's total PV, then insert the pages into transactionpageviews in conversion-step order:
>> insert into table transactionpageviews select opurl.path as path,count(*) as num from ods_pageviews as op join ods_dim_pageviews_url as opurl on (op.visitpage = opurl.visitpage) join ods_dim_pageviews_time as optime on (optime.time = op.viewtime) where optime.year='2013' and optime.month='09' and optime.day='19' and opurl.path='/index' group by opurl.path;
>> insert into table transactionpageviews select opurl.path as path,count(*) as num from ods_pageviews as op join ods_dim_pageviews_url as opurl on (op.visitpage = opurl.visitpage) join ods_dim_pageviews_time as optime on (optime.time = op.viewtime) where optime.year='2013' and optime.month='09' and optime.day='19' and opurl.path='/detail' group by opurl.path;
>> insert into table transactionpageviews select opurl.path as path,count(*) as num from ods_pageviews as op join ods_dim_pageviews_url as opurl on (op.visitpage = opurl.visitpage) join ods_dim_pageviews_time as optime on (optime.time = op.viewtime) where optime.year='2013' and optime.month='09' and optime.day='19' and opurl.path='/createOrder' group by opurl.path;
>> insert into table transactionpageviews select opurl.path as path,count(*) as num from ods_pageviews as op join ods_dim_pageviews_url as opurl on (op.visitpage = opurl.visitpage) join ods_dim_pageviews_time as optime on (optime.time = op.viewtime) where optime.year='2013' and optime.month='09' and optime.day='19' and opurl.path='/confirmOrder' group by opurl.path;
Compute the conversion ratio between consecutive business pages. Each row is joined to the previous step (b.rownum = a.rownum - 1) so that transferRatio = pageViews / lastPageViews:
>> select row_number() over() as rownum,a.url as url, a.views as pageViews,b.views as lastPageViews,a.views/b.views as transferRatio from (select row_number() over() as rownum,views,url from transactionpageviews) as a left join (select row_number() over() as rownum,views,url from transactionpageviews) as b on (a.rownum = b.rownum+1);
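The intended computation is simply this: a step's conversion ratio is its PV divided by the PV of the step before it in the funnel. A standalone Java sketch with hypothetical PV counts makes the arithmetic explicit:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class FunnelDemo {
    public static void main(String[] args) {
        // Hypothetical PV counts per funnel step, in conversion order.
        Map<String, Integer> funnel = new LinkedHashMap<>();
        funnel.put("/index", 1000);
        funnel.put("/detail", 400);
        funnel.put("/createOrder", 100);
        funnel.put("/confirmOrder", 80);

        // Conversion ratio of each step = its PV / previous step's PV;
        // the first step has no predecessor, so report 1.00 for it.
        Integer prev = null;
        for (Map.Entry<String, Integer> e : funnel.entrySet()) {
            double ratio = (prev == null) ? 1.0 : e.getValue() / (double) prev;
            System.out.printf(Locale.ROOT, "%s %.2f%n", e.getKey(), ratio);
            prev = e.getValue();
        }
    }
}
```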
Scheduling the tasks with shell scripts and crontab
Run the initialEnv.sh script to initialize the environment. For a simple test the author uses a single server, so the script below assumes Hadoop running in standalone (single-node) mode with Hive installed on the same machine.
#!/bin/bash
export HADOOP_HOME=/home/ymh/apps/hadoop-2.6.4
#start hdfs
/home/ymh/apps/hadoop-2.6.4/sbin/start-dfs.sh
#start yarn
if [[ 0 == $? ]]
then
/home/ymh/apps/hadoop-2.6.4/sbin/start-yarn.sh
fi
#start flume
#if [[ 0 == $? ]]
#then
#start flume
#nohup ~/apache-flume-1.6.0-bin/bin/flume-ng agent -n agent -c conf -f ~/apache-flume-1.6.0-bin/conf/flume-conf.properties &
#fi
#start mysql
if [ 0 = $? ]
then
service mysqld start
fi
#start HIVE SERVER
if [ 0 = $? ]
then
nohup /apps/apache-hive-1.2.1-bin/bin/hiveserver2 &
fi
Run the dataAnalyseTask.sh script: it first runs the MapReduce jobs that clean the day's log data, then uses Hive to build the day's ODS data. Note that the script assumes the ODS-layer fact and dimension tables already exist (their create statements are in the previous section), so it contains no CREATE TABLE statements; to save space, only the PageViews part of the script is shown.
#!/bin/bash
CURDATE=$(date +%y-%m-%d)
CURDATEHIVE=$(date +%Y-%m-%d)
/home/ymh/apps/hadoop-2.6.4/bin/hdfs dfs -df /flume/events/$CURDATE
if [[ 0 -eq $? ]]
then
/home/ymh/apps/hadoop-2.6.4/bin/hadoop jar /export/data/mydata/clickstream.jar com.guludada.clickstream.logClean
fi
if [[ 0 -eq $? ]]
then
/home/ymh/apps/hadoop-2.6.4/bin/hadoop jar /export/data/mydata/clickstream.jar com.guludada.clickstream.logSession
fi
if [[ 0 -eq $? ]]
then
/home/ymh/apps/hadoop-2.6.4/bin/hadoop jar /export/data/mydata/clickstream.jar com.guludada.clickstream.PageViews
fi
#Load today's data
if [[ 0 -eq $? ]]
then
/home/ymh/apps/hadoop-2.6.4/bin/hdfs dfs -chmod 777 /clickstream/pageviews/$CURDATE/
echo "load data inpath '/clickstream/pageviews/$CURDATE/' into table pageviews partition(inputDate='$CURDATEHIVE');" | /apps/apache-hive-1.2.1-bin/bin/beeline -u jdbc:hive2://localhost:10000
fi
#Create fact table and its dimension tables
if [[ 0 -eq $? ]]
then
echo "insert into table ods_pageviews partition(inputDate='$CURDATEHIVE') select pv.session,pv.ip,concat(pv.requestdate,'-',pv.requesttime) as viewtime,pv.visitpage,pv.staytime,pv.step from pageviews as pv where pv.inputDate='$CURDATEHIVE';" | /apps/apache-hive-1.2.1-bin/bin/beeline -u jdbc:hive2://localhost:10000
fi
if [[ 0 -eq $? ]]
then
echo "insert into table ods_dim_pageviews_time partition(inputDate='$CURDATEHIVE') select distinct pv.viewtime, substring(pv.viewtime,0,4),substring(pv.viewtime,6,2),substring(pv.viewtime,9,2),substring(pv.viewtime,12,2),substring(pv.viewtime,15,2),substring(pv.viewtime,18,2) from ods_pageviews as pv;" | /apps/apache-hive-1.2.1-bin/bin/beeline -u jdbc:hive2://localhost:10000
fi
if [[ 0 -eq $? ]]
then
echo "insert into table ods_dim_pageviews_url partition(inputDate='$CURDATEHIVE') select distinct pv.visitpage,b.host,b.path,b.query from pageviews pv lateral view parse_url_tuple(concat('https://localhost',pv.visitpage),'HOST','PATH','QUERY') b as host,path,query;" | /apps/apache-hive-1.2.1-bin/bin/beeline -u jdbc:hive2://localhost:10000
fi
Create a crontab file that runs dataAnalyseTask.sh at 01:00 every day (the script performs the two jobs described above: cleaning the log files with MapReduce and building the ODS-layer data with HiveQL), then register this user-defined crontab file with the scheduler:
$vi root_crontab_hadoop
$echo "0 1 * * * /myShells/dataAnalyseTask.sh" >> root_crontab_hadoop
$crontab root_crontab_hadoop
This concludes the walkthrough of a simple offline-computation architecture and example built on Hadoop. How to export the Hive data into MySQL with Sqoop is, for reasons of space, not covered here. The author has only just started with distributed offline computing, so the article no doubt has many shortcomings; comments and further discussion are very welcome. This piece was written as a summary of what I have learned recently and to share it with others; I hope it helps, and thank you for reading!