Flume is a highly available, reliable, distributed system for collecting, aggregating, and moving large volumes of log data. It supports pluggable data senders for collecting data from a logging system, can apply simple processing in flight, and writes to a variety of data receivers. A Flume agent consists of three parts: source, channel, and sink.
In this example we use Flume to collect log data and write it both to Kafka (topic: logs) and to HBase (table: logs). The logs come from Sogou Labs; a program simulates a web server by writing them to a file (webserverlogs).
For installing and configuring the Kafka cluster, see: Kafka configuration and distributed deployment. For installing and configuring the HBase database, see: HBase configuration and distributed deployment.
Note: the hostnames of the three machines are bigdata.centos01, bigdata.centos02, and bigdata.centos03.
I. Architecture Diagram
Note: centos02 and centos03 act as Flume log-collection nodes; centos01 aggregates the logs collected by those two machines and then writes them to Kafka and to the HBase database.
II. Installing and Configuring Flume
1. Download and Install
wget http://archive.apache.org/dist/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz
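Then extract the archive. The target directory below (/opt/modules, consistent with the JAVA_HOME path used later) is just an example:
tar -zxf apache-flume-1.7.0-bin.tar.gz -C /opt/modules/
cd /opt/modules/apache-flume-1.7.0-bin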
2. Configuration
2.1 Log-Collection Node Configuration
As shown in the diagram above, modify the Flume configuration on centos02 and centos03.
- Edit conf/flume-env.sh
export JAVA_HOME=/opt/modules/jdk8
- Edit conf/flume-conf.properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source
# exec source: runs a Unix command that tails the webserverlogs file in real time and feeds new lines into Flume
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/datas/webserverlogs
a1.sources.r1.channels = c1
# channel
a1.channels.c1.type = memory
# sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = bigdata.centos01
a1.sinks.k1.port = 5555
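The memory channel above runs with Flume's defaults (a capacity of only 100 events). If tail produces lines faster than the avro sink drains them, the limits can be raised the same way the aggregation node does below, for example:
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 100000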
2.2 Log-Aggregation Node Configuration
As shown in the diagram above, modify the Flume configuration on centos01.
- Edit conf/flume-env.sh
export JAVA_HOME=/opt/modules/jdk8
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
export HBASE_HOME=/opt/modules/hbase-0.98.6-cdh5.3.9
- Edit conf/flume-conf.properties
agent.sources = avroSource
agent.channels = kafkaC hbaseC
agent.sinks = kafkaSink hbaseSink
#********************* Flume + HBase integration *********************
# source
agent.sources.avroSource.type = avro
agent.sources.avroSource.channels = hbaseC kafkaC
agent.sources.avroSource.bind = bigdata.centos01
agent.sources.avroSource.port = 5555
# channel
agent.channels.hbaseC.type = memory
agent.channels.hbaseC.capacity = 100000
agent.channels.hbaseC.transactionCapacity = 100000
agent.channels.hbaseC.keep-alive = 10
# sink
agent.sinks.hbaseSink.type = asynchbase
agent.sinks.hbaseSink.table = logs
agent.sinks.hbaseSink.columnFamily = info
agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
agent.sinks.hbaseSink.channel = hbaseC
# custom columns: the field names the customized serializer (section 2.3) splits each log line into
agent.sinks.hbaseSink.serializer.payloadColumn = time,userid,searchname,retorder,cliorder,url
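# Each event body must be a comma-separated line whose fields map one-to-one
# onto the columns above, e.g. (an illustrative record, not from the real dataset):
# 20111230000005,57375476989eea12893c0c3811607bcf,qiyi,1,1,http://www.qiyi.com/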
#********************* Flume + Kafka integration *********************
# channel
agent.channels.kafkaC.type = memory
agent.channels.kafkaC.capacity = 100000
agent.channels.kafkaC.transactionCapacity = 100000
agent.channels.kafkaC.keep-alive = 10
# sink
agent.sinks.kafkaSink.channel = kafkaC
agent.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkaSink.kafka.topic = logs
agent.sinks.kafkaSink.kafka.bootstrap.servers = bigdata.centos01:9092,bigdata.centos02:9092,bigdata.centos03:9092
agent.sinks.kafkaSink.kafka.producer.acks = 1
agent.sinks.kafkaSink.kafka.flumeBatchSize = 50
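With kafka.producer.acks = 1 the producer waits only for the partition leader's acknowledgement before considering a write successful, and kafka.flumeBatchSize controls how many events the sink hands to Kafka in a single batch.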
2.3 Customizing the Flume HBase Sink
- Download the source
http://archive.apache.org/dist/flume/1.7.0/apache-flume-1.7.0-src.tar.gz
- Code changes
The SimpleAsyncHbaseEventSerializer class configured for the centos01 hbaseSink above does not meet our needs: by default it writes each event into a single HBase column, while we need to split the log line and store the fields in the separate columns defined above. We therefore customize the class, mainly the getActions method.
package org.apache.flume.sink.hbase;

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.SimpleHbaseEventSerializer.KeyType;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;

import java.util.ArrayList;
import java.util.List;

public class SimpleAsyncHbaseEventSerializer implements AsyncHbaseEventSerializer {
  private byte[] table;
  private byte[] cf;
  private byte[] payload;
  private byte[] payloadColumn;
  private byte[] incrementColumn;
  private String rowPrefix;
  private byte[] incrementRow;
  private KeyType keyType;

  @Override
  public void initialize(byte[] table, byte[] cf) {
    this.table = table;
    this.cf = cf;
  }

  @Override
  public List<PutRequest> getActions() {
    List<PutRequest> actions = new ArrayList<PutRequest>();
    if (payloadColumn != null) {
      try {
        // Split the configured column names and the event body on commas.
        String[] columns = new String(payloadColumn, Charsets.UTF_8).split(",");
        String[] values = new String(payload, Charsets.UTF_8).split(",");
        // Skip malformed lines whose field count does not match the column list.
        if (columns.length != values.length) {
          return actions;
        }
        // Row key: time + userid + current timestamp, which keeps rows for the
        // same user and second distinct.
        String rowKey = values[0] + values[1] + System.currentTimeMillis();
        // One PutRequest per field, so each value lands in its own column.
        for (int i = 0; i < columns.length; i++) {
          PutRequest putRequest = new PutRequest(table, rowKey.getBytes(Charsets.UTF_8), cf,
              columns[i].getBytes(Charsets.UTF_8), values[i].getBytes(Charsets.UTF_8));
          actions.add(putRequest);
        }
      } catch (Exception e) {
        throw new FlumeException("Could not get row key!", e);
      }
    }
    return actions;
  }

  @Override
  public List<AtomicIncrementRequest> getIncrements() {
    List<AtomicIncrementRequest> actions = new ArrayList<AtomicIncrementRequest>();
    if (incrementColumn != null) {
      AtomicIncrementRequest inc = new AtomicIncrementRequest(table,
          incrementRow, cf, incrementColumn);
      actions.add(inc);
    }
    return actions;
  }

  @Override
  public void cleanUp() {
    // nothing to clean up
  }

  @Override
  public void configure(Context context) {
    String pCol = context.getString("payloadColumn", "pCol");
    String iCol = context.getString("incrementColumn", "iCol");
    rowPrefix = context.getString("rowPrefix", "default");
    String suffix = context.getString("suffix", "uuid");
    if (pCol != null && !pCol.isEmpty()) {
      // The rowPrefix/suffix key-type handling is kept from the original class,
      // but getActions() above builds its own row key and does not use it.
      if (suffix.equals("timestamp")) {
        keyType = KeyType.TS;
      } else if (suffix.equals("random")) {
        keyType = KeyType.RANDOM;
      } else if (suffix.equals("nano")) {
        keyType = KeyType.TSNANO;
      } else {
        keyType = KeyType.UUID;
      }
      payloadColumn = pCol.getBytes(Charsets.UTF_8);
    }
    if (iCol != null && !iCol.isEmpty()) {
      incrementColumn = iCol.getBytes(Charsets.UTF_8);
    }
    incrementRow = context.getString("incrementRow", "incRow").getBytes(Charsets.UTF_8);
  }

  @Override
  public void setEvent(Event event) {
    this.payload = event.getBody();
  }

  @Override
  public void configure(ComponentConfiguration conf) {
    // not used
  }
}
Delete flume-ng-hbase-sink-1.7.0.jar from Flume's lib directory, rebuild the flume-ng-hbase-sink module, rename the resulting jar to flume-ng-hbase-sink-1.7.0.jar, and upload it to the lib directory. My prebuilt jar: flume-ng-hbase-sink.jar
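One way to rebuild just the sink module (a sketch, assuming Maven is installed; -am also builds the modules it depends on):
cd apache-flume-1.7.0-src
mvn clean package -DskipTests -pl flume-ng-sinks/flume-ng-hbase-sink -am
# the jar is produced under flume-ng-sinks/flume-ng-hbase-sink/target/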
III. Developing the Simulator
The simulator writes the Sogou Labs log file line by line into the webserverlogs file, for the log-collection Flume agents to pick up.
package com.wmh.writeread;

import java.io.*;

public class ReadWrite {
  public static void main(String[] args) {
    String readFileName = args[0];
    String writeFileName = args[1];
    try {
      System.out.println("Running...");
      process(readFileName, writeFileName);
      System.out.println("Done!");
    } catch (Exception e) {
      System.out.println("Failed: " + e.getMessage());
    }
  }

  private static void process(String readFileName, String writeFileName) throws Exception {
    File readFile = new File(readFileName);
    if (!readFile.exists()) {
      System.out.println(readFileName + " does not exist, please check the path!");
      System.exit(1);
    }
    File writeFile = new File(writeFileName);
    BufferedReader br = null;
    FileOutputStream fos = null;
    long count = 1L;
    try {
      br = new BufferedReader(new InputStreamReader(new FileInputStream(readFile), "utf-8"));
      fos = new FileOutputStream(writeFile);
      String line;
      while ((line = br.readLine()) != null) {
        fos.write((line + "\n").getBytes("utf-8"));
        fos.flush();
        // Sleep 100 ms per line to simulate a web server appending logs.
        Thread.sleep(100);
        System.out.println(String.format("row:[%d]>>>>>>>>>> %s", count++, line));
      }
    } finally {
      // Close resources whether or not an exception occurred; closing the
      // reader also closes the underlying input stream.
      if (br != null) br.close();
      if (fos != null) fos.close();
    }
  }
}
IV. Starting the Services and Testing
1. Starting the Services
1.1 ZooKeeper
Start ZooKeeper on centos01, centos02, and centos03:
bin/zkServer.sh start
1.2 HDFS
Start the NameNode and DataNode on centos01, and the DataNodes on centos02 and centos03:
sbin/start-dfs.sh
1.3 HBase
- Start the service
Start the HMaster and RegionServer on centos01, and the RegionServers on centos02 and centos03:
bin/start-hbase.sh
- Create the table
# enter the HBase shell
bin/hbase shell
# create the table
> create "logs","info"
1.4 Kafka
- Start the service
Start Kafka on centos01, centos02, and centos03:
bin/kafka-server-start.sh config/server.properties
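Run the command on each of the three nodes; to keep the broker running after the shell exits, the script's -daemon flag can be used:
bin/kafka-server-start.sh -daemon config/server.properties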
- Create the topic
Create the topic named logs that Flume is configured to write to:
# --replication-factor: number of replicas; set it to the cluster size so every broker can serve the topic on its own
bin/kafka-topics.sh --create --zookeeper bigdata.centos01:2181,bigdata.centos02:2181,bigdata.centos03:2181 --replication-factor 3 --partitions 1 --topic logs
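You can confirm the topic exists with the same ZooKeeper connection string:
bin/kafka-topics.sh --describe --zookeeper bigdata.centos01:2181,bigdata.centos02:2181,bigdata.centos03:2181 --topic logs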
- Start a consumer
Start a Kafka console consumer on any one of the machines, for testing:
bin/kafka-console-consumer.sh --zookeeper bigdata.centos01:2181,bigdata.centos02:2181,bigdata.centos03:2181 --topic logs --from-beginning
1.5 Flume
- Start the log-aggregation Flume agent on centos01:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties -n agent -Dflume.root.logger=INFO,console
- Start the log-collection Flume agents on centos02 and centos03:
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties -n a1 -Dflume.root.logger=INFO,console
The main difference is the agent name: the centos01 agent is named agent, while the centos02 and centos03 agents are named a1.
1.6 Start the Simulator
Package the simulator class above into a jar, then run it:
# formatedSougouLog is the pre-formatted data, with \t and spaces replaced by commas (see the sketch after this command)
java -jar ReadWrite.jar formatedSougouLog webserverlogs
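One possible way to produce formatedSougouLog from the raw Sogou file (a sketch; SogouQ.sample is a placeholder filename, and this assumes GNU sed and that the query text itself contains no commas):
sed -e 's/\t/,/g' -e 's/ \+/,/g' SogouQ.sample > formatedSougouLog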
2. Testing
- Data inserted into HBase
- Data fetched from Kafka
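To spot-check the HBase side, scan a few rows from the HBase shell:
bin/hbase shell
> scan 'logs', {LIMIT => 5}
The console consumer started in step 1.4 should print the same comma-separated lines as they arrive.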