Learning and Using Flume
This article covers learning and using Flume on a CentOS 7.3 system environment:
- CentOS 7.3
1. Introduction to Flume
1.1 Basic Concepts
(1) What is Flume
Flume, provided by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data.
(2) What Flume is for
Flume's primary job is to read data from a server's local disk in real time and write that data to HDFS.
1.2 Core Components
(0) Flume workflow
The Source collects data and wraps it into Events, which are buffered in the Channel; the Sink continuously pulls Events from the Channel, unpacks them back into data, and finally writes the data to a storage or indexing system.
(1) Agent
An Agent is a JVM process that moves data from a source to a destination in the form of events.
An Agent is made up of three main parts: the Source, the Channel, and the Sink.
(2) Source
The Source is the component that receives data into the Flume Agent, collecting it and wrapping it into Events. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
(3) Sink
The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.
Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.
(4) Channel
The Channel is a buffer that sits between the Source and the Sink, which allows them to run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Flume ships with two Channel types: Memory Channel and File Channel.
- Memory Channel is an in-memory queue. It is appropriate when losing data is acceptable; if data loss matters, Memory Channel should not be used, because a process crash, a machine failure, or a restart will lose whatever it holds
- File Channel writes every event to disk, so no data is lost if the process exits or the machine goes down
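For reference, a File Channel needs a checkpoint directory and one or more data directories on disk; a minimal configuration sketch (the paths below are illustrative, not from the original):
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume/data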
(5) Event
The Event is Flume's basic unit of data transfer; data travels from source to destination as Events. An Event has two parts, a Header and a Body: the Header stores the event's attributes as K-V pairs, and the Body stores the payload as a byte array.
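As a small illustration using Flume's own API (SimpleEvent is the same class the custom-component examples in part 5 use; imports are org.apache.flume.Event, org.apache.flume.event.SimpleEvent, and java.util.HashMap), an Event can be assembled by hand; a sketch, not part of the original setup:
Event event = new SimpleEvent();
event.setHeaders(new HashMap<String, String>()); // Header: K-V attributes
event.getHeaders().put("type", "demo");
event.setBody("hello flume".getBytes());         // Body: byte-array payload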
2. Installing Flume and Getting Started
2.1 Installing Flume
(1) Extract the Flume tarball
tar -xzvf apache-flume-1.7.0-bin.tar.gz -C /opt/module/
(2) Rename the Flume directory
cd /opt/module/
mv apache-flume-1.7.0-bin flume
(3) Edit the Flume configuration files
cd /opt/module/flume/conf
mv flume-env.sh.template flume-env.sh
vi flume-env.sh
# Make the following change
export JAVA_HOME=/opt/module/jdk1.8.0_201
cd /opt/module/flume/conf
vi log4j.properties
# Make the following change
flume.log.dir=/opt/module/flume/logs
2.2 Flume Example: Listening on a Network Port
(1) Install nc
yum install -y nc
(2) Install net-tools
yum install -y net-tools
(3) Check whether the port is already in use
netstat -nltp | grep 44444
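Step (4) below starts the agent with job/flume-netcat-logger.conf, which the original never lists. A minimal netcat-to-logger configuration consistent with that command (a sketch; the file path /opt/module/flume/job/flume-netcat-logger.conf is assumed) would be:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1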
(4) Start the Flume agent
cd /opt/module/flume
bin/flume-ng agent --name a1 --conf conf/ --conf-file job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
(5) Open another terminal and send a message
nc localhost 44444
aaa
2.3 Flume Example: Monitoring a Single Appended File in Real Time
(1) Copy the following jars into /opt/module/flume/lib
commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar
(2) Create the flume-file-hdfs.conf file
vi flume-file-hdfs.conf
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r2.shell = /bin/bash -c
# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to round the folder path down by time
a2.sinks.k2.hdfs.round = true
# How many time units between new folders
a2.sinks.k2.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# File type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How many seconds before rolling a new file
a2.sinks.k2.hdfs.rollInterval = 60
# Roll size for each file (bytes)
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a2.sinks.k2.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
(3) Start the Flume agent
bin/flume-ng agent -n a2 -c conf/ -f job/flume-file-hdfs.conf
(4) Open another terminal and run the hive command
hive
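To confirm that events are reaching HDFS, list the target directory from another terminal (assuming the Hadoop client is on the PATH):
hdfs dfs -ls -R /flume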
2.4 Flume Example: Monitoring a Directory for New Files in Real Time
(1) Create the flume-dir-hdfs.conf file
vim flume-dir-hdfs.conf
# Add the following
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the folder path down by time
a3.sinks.k3.hdfs.round = true
# How many time units between new folders
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How many seconds before rolling a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size for each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
(2) Start the Flume agent
bin/flume-ng agent -n a3 -c conf/ -f job/flume-dir-hdfs.conf
(3) Open another terminal
cd /opt/module/flume/
mkdir upload
cp NOTICE upload/
2.5 Flume Example: Monitoring Multiple Appended Files in a Directory in Real Time
The Exec source is suited to tailing a single file that is appended to in real time, but it cannot resume where it left off after a failure; the Spooldir source is suited to syncing new files, but not to tailing files that are still being appended to; the Taildir source is suited to tailing multiple appended files and supports resuming from a checkpoint.
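The resume capability comes from the positionFile configured below: Taildir periodically records, for each tracked file, its inode, the last read position, and its path as a JSON array, and continues from those positions after a restart. A sample tail_dir.json (values illustrative):
[{"inode":2496272,"pos":1024,"file":"/opt/module/flume/files/file1.txt"},{"inode":2496275,"pos":2048,"file":"/opt/module/flume/files/file2.log"}]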
(1) Create the flume-taildir-hdfs.conf file
vi flume-taildir-hdfs.conf
# Add the following
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files/.*log.*
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:9000/flume/upload2/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the folder path down by time
a3.sinks.k3.hdfs.round = true
# How many time units between new folders
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How many seconds before rolling a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size for each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
(2) Create the directory and files
cd /opt/module/flume
mkdir files
cp CHANGELOG files/CHANGELOG.log
cp LICENSE files/LICENSE.log
(3) Start the Flume agent
bin/flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf
(4) Open another terminal
cd /opt/module/flume/files
vi CHANGELOG.log
# Add the following
xxxxx
sssss
wwwww
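You can then check both the HDFS output and the updated checkpoint (assuming the Hadoop client is on the PATH):
hdfs dfs -ls -R /flume/upload2
cat /opt/module/flume/tail_dir.json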
3. Advanced Flume
3.1 Flume Transactions
(1) Put transaction flow
- doPut: first writes the batch of data into the temporary buffer putList
- doCommit: checks whether the channel's memory queue has enough room, then merges the putList into it
- doRollback: if the channel's memory queue does not have enough room, rolls the data back
(2) Take transaction flow
- doTake: pulls data into the temporary buffer takeList and sends it on to HDFS
- doCommit: if all the data was sent successfully, clears the temporary buffer takeList
- doRollback: if an exception occurs while sending, rollback returns the data in takeList to the channel's memory queue
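These steps map directly onto Flume's Transaction API (the same calls the custom sink in section 5.8 uses). A schematic take-side sketch, not tied to any particular sink:
// Schematic take-side transaction, mirroring doTake / doCommit / doRollback
Transaction transaction = channel.getTransaction();
transaction.begin();
try {
    Event event = channel.take();  // doTake: pull an event into takeList
    // ... deliver the event to the destination (e.g., HDFS) ...
    transaction.commit();          // doCommit: takeList is cleared
} catch (Throwable t) {
    transaction.rollback();        // doRollback: events go back to the channel
    throw new EventDeliveryException(t);
} finally {
    transaction.close();
}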
3.2 Inside a Flume Agent
(1) ChannelSelector
The ChannelSelector decides which Channel an Event is sent to. There are two types:
- Replicating: the ReplicatingSelector sends the same Event to every Channel
- Multiplexing: sends different Events to different Channels according to configured rules
(2) SinkProcessor
There are three types of SinkProcessor:
- DefaultSinkProcessor: works with a single sink and delivers events to it
- LoadBalancingSinkProcessor: works with a Sink Group and provides load balancing
- FailoverSinkProcessor: works with a Sink Group and provides failover
4. Flume Topologies
4.1 Simple Chain
In this mode several Flume agents are connected in sequence, from the very first source through to the storage system fed by the final sink.
- Advantage: chaining multiple agents increases the total event buffering capacity
- Drawback: bridging too many agents is not recommended; too many hops hurt throughput, and if any agent in the chain goes down, the whole pipeline is affected
4.2 Replication and Multiplexing
Flume can send an event stream to one or more destinations. In this mode the same data can be replicated into multiple channels, or different data can be routed to different channels, with each sink delivering to its own destination.
4.3 Load Balancing and Failover
Flume supports logically grouping multiple sinks into a sink group; combined with the appropriate SinkProcessor, the group provides load balancing or failover.
4.4 Aggregation
This is the most common and most practical mode. Everyday web applications are typically spread across hundreds of servers, sometimes thousands or even tens of thousands, and the logs they produce are painful to handle individually. This topology solves the problem well: each server runs a Flume agent that collects its logs and forwards them to a central collecting agent, which then uploads to HDFS, Hive, HBase, and so on for analysis.
5. Enterprise Development Examples
5.1 Replication and Multiplexing
(1) Create the flume-file-avro.conf file
vi flume-file-avro.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data stream to all channels
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# On the sink side, avro acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop101
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop101
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
(2) Create the flume-avro-hdfs.conf file
vi flume-avro-hdfs.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# On the source side, avro acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop101
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to round the folder path down by time
a2.sinks.k1.hdfs.round = true
# How many time units between new folders
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How many seconds before rolling a new file
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size for each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
(3) Create the flume-avro-dir.conf file
vi flume-avro-dir.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop101
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/flume/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
(4) Run the configuration files (start the downstream agents a3 and a2 before the upstream a1, in the order shown)
bin/flume-ng agent -n a3 -c conf/ -f job/group1/flume-avro-dir.conf
bin/flume-ng agent -n a2 -c conf/ -f job/group1/flume-avro-hdfs.conf
bin/flume-ng agent -n a1 -c conf/ -f job/group1/flume-file-avro.conf
(5) Start Hadoop and Hive
sbin/start-dfs.sh
sbin/start-yarn.sh
bin/hive
5.2 Failover
(1) Create the a1.conf file
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop101
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop101
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
(2) Create the a2.conf file
vi a2.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop101
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
(3) Create the a3.conf file
vi a3.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop101
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
(4) Run the configuration files
bin/flume-ng agent -n a3 -c conf/ -f job/group2/a3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a2 -c conf/ -f job/group2/a2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group2/a1.conf
(5) Open another terminal and send a message
nc localhost 44444
aaa
(6) Kill a3; thanks to failover, a2 keeps working (k2, which feeds a3 on port 4142, has the higher priority, so a3 is preferred while it is alive)
kill -9 a3-pid
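To find the pid, jps from the JDK is handy; the agent shows up as org.apache.flume.node.Application with its config file among the arguments (grep pattern assumed):
jps -ml | grep a3.conf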
5.3 Load Balancing
(1) Create the a1.conf file
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop101
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop101
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
(2) Create the a2.conf file
vi a2.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop101
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
(3) Create the a3.conf file
vi a3.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop101
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
(4) Run the configuration files
bin/flume-ng agent -n a3 -c conf/ -f job/group2/a3.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a2 -c conf/ -f job/group2/a2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n a1 -c conf/ -f job/group2/a1.conf
(5) Open another terminal and keep sending messages
nc localhost 44444
aaa
5.4 Aggregation
(1) Create the a1.conf file
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Create the a2.conf file
vi a2.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop103
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
(3) Create the a3.conf file
vi a3.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop103
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
(4) Run the configuration files
- hadoop103
bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group4/a3.conf -Dflume.root.logger=INFO,console
- hadoop102
bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group4/a2.conf
- hadoop101
bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group4/a1.conf
(5) Open another terminal and keep sending messages
- hadoop101
nc hadoop102 44444
aaa
(6) Append content to the group.log file
- hadoop101
cd /opt/module/flume
echo 222 >> group.log
5.5 Custom Interceptor Example
Split the log stream by type: route events into different sinks according to the value of their type header.
(1) Implement the Interceptor interface
package com.inspur.flume.interceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.List;
import java.util.Map;

public class MyInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    // Tag each event with a "type" header, depending on whether the
    // first byte of the body is a digit
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        byte[] body = event.getBody();
        if (body[0] <= '9' && body[0] >= '0') {
            headers.put("type", "number");
        } else {
            headers.put("type", "not_number");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    // Referenced from the agent configuration as
    // com.inspur.flume.interceptor.MyInterceptor$MyBuilder
    public static class MyBuilder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
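The original skips how this class reaches Flume's classpath. Assuming a Maven project, a typical step is to package the class into a jar and copy it into Flume's lib directory (the jar name below is illustrative); the same applies to the custom Source and Sink in sections 5.6-5.8:
mvn clean package
cp target/flume-custom-1.0.jar /opt/module/flume/lib/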
(2) Create the a1.conf file on hadoop101
- hadoop101
cd /opt/module/flume/job/interceptor
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.inspur.flume.interceptor.MyInterceptor$MyBuilder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.not_number = c1
a1.sources.r1.selector.mapping.number = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
(3) Create the a1.conf file on hadoop102
- hadoop102
cd /opt/module/flume/job/interceptor
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 4141
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
(4) Create the a1.conf file on hadoop103
- hadoop103
cd /opt/module/flume/job/interceptor
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4242
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
(5) Start the Flume processes, one per host
- hadoop103
bin/flume-ng agent -n a1 -c conf/ -f job/interceptor/a1.conf -Dflume.root.logger=INFO,console
- hadoop102
bin/flume-ng agent -n a1 -c conf/ -f job/interceptor/a1.conf -Dflume.root.logger=INFO,console
- hadoop101
bin/flume-ng agent -n a1 -c conf/ -f job/interceptor/a1.conf -Dflume.root.logger=INFO,console
(6) Open another terminal and keep sending messages; lines beginning with a digit should show up on hadoop103, everything else on hadoop102
- hadoop101
nc hadoop102 44444
aaa
111
1ss
s11
5.6 Custom Source Example
(1) Implement a Source class
package com.inspur.flume.source;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import java.util.HashMap;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    private String prefix;
    private long interval;

    // Called repeatedly by the framework; each call emits five
    // numbered events ("Log1" ... "Log5" with the defaults below)
    @Override
    public Status process() throws EventDeliveryException {
        Status status = null;
        try {
            for (int i = 1; i <= 5; i++) {
                Event e = new SimpleEvent();
                e.setHeaders(new HashMap<String, String>());
                e.setBody((prefix + i).getBytes());
                getChannelProcessor().processEvent(e); // hand the event to the channel
                Thread.sleep(interval);
            }
            status = Status.READY;
        } catch (InterruptedException e) {
            status = Status.BACKOFF; // tell the framework to back off before retrying
        }
        return status;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 2000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 20000;
    }

    // Read the prefix and emit interval from the agent configuration
    @Override
    public void configure(Context context) {
        prefix = context.getString("source.prefix", "Log");
        interval = context.getLong("source.interval", 1000L);
    }
}
(2) Create the a1.conf file on hadoop101
- hadoop101
cd /opt/module/flume/job/source
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = com.inspur.flume.source.MySource
a1.sources.r1.source.prefix= Log
a1.sources.r1.source.interval= 1000
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Start the Flume process
- hadoop101
bin/flume-ng agent -n a1 -c conf/ -f job/source/a1.conf -Dflume.root.logger=INFO,console
5.7 Custom File Source Example
(1) Implement a Source class
package com.inspur.flume.source;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import java.io.*;
import java.util.HashMap;

// Note: this reuses the class name MySource from section 5.6;
// only one version can be on the classpath at a time
public class MySource extends AbstractSource implements Configurable, PollableSource {

    private long interval;
    private String file;

    // Read the configured file line by line and emit each line as an event.
    // Note: every invocation of process() re-reads the file from the beginning.
    @Override
    public Status process() throws EventDeliveryException {
        Status status = null;
        ChannelProcessor channelProcessor = getChannelProcessor();
        BufferedReader bufferedReader = null;
        try {
            bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                Event event = new SimpleEvent();
                event.setHeaders(new HashMap<String, String>());
                event.setBody(line.getBytes());
                channelProcessor.processEvent(event); // hand the event to the channel
                try {
                    Thread.sleep(interval);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            status = Status.READY;
        } catch (IOException e) {
            status = Status.BACKOFF;
        } finally {
            if (bufferedReader != null) {
                try {
                    bufferedReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return status;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 2000;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 20000;
    }

    // Read the file path and emit interval from the agent configuration
    @Override
    public void configure(Context context) {
        file = context.getString("source.file", null);
        interval = context.getLong("source.interval", 1000L);
    }
}
(2) Create the a1.conf file on hadoop101
- hadoop101
cd /opt/module/flume/job/source
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = com.inspur.flume.source.MySource
a1.sources.r1.source.file= /opt/module/flume/group.log
a1.sources.r1.source.interval= 1000
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Start the Flume process
- hadoop101
bin/flume-ng agent -n a1 -c conf/ -f job/source/a1.conf -Dflume.root.logger=INFO,console
5.8 Custom Sink Example
(1) Implement a Sink class
package com.inspur.flume.sink;
import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class MySink extends AbstractSink implements Configurable {

    private long interval;
    private String prefix;
    private String suffix;

    @Override
    public Status process() throws EventDeliveryException {
        Status status = null;
        Channel channel = this.getChannel();
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {
            // Poll until an event is available, sleeping between attempts
            Event event = null;
            while ((event = channel.take()) == null) {
                Thread.sleep(interval);
            }
            byte[] body = event.getBody();
            String line = new String(body, "UTF-8");
            System.out.println(prefix + line + suffix); // "write" the event to stdout
            status = Status.READY;
            transaction.commit();
        } catch (Exception e) {
            transaction.rollback(); // return the event to the channel on failure
            status = Status.BACKOFF;
        } finally {
            transaction.close();
        }
        return status;
    }

    // Note: the original keeps the source.* property prefix even for
    // this sink; the names match the a1.conf below
    @Override
    public void configure(Context context) {
        prefix = context.getString("source.prefix", "start:");
        suffix = context.getString("source.suffix", ":end");
        interval = context.getLong("source.interval", 1000L);
    }
}
(2) Create the a1.conf file on hadoop101
- hadoop101
cd /opt/module/flume/job/sink
vi a1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = com.inspur.flume.sink.MySink
a1.sinks.k1.source.prefix = xuzheng:
a1.sinks.k1.source.suffix = :xuzheng
a1.sinks.k1.source.interval = 1000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(3) Start the Flume process
- hadoop101
bin/flume-ng agent -n a1 -c conf/ -f job/sink/a1.conf -Dflume.root.logger=INFO,console
6. Monitoring Flume Data Flows
6.1 Ganglia
Ganglia consists of three parts: gmond, gmetad, and gweb.
- gmond (Ganglia Monitoring Daemon): a lightweight service installed on every node whose metrics should be collected. With gmond you can easily gather many system metrics, such as CPU, memory, disk, network, and active-process data
- gmetad (Ganglia Meta Daemon): the service that aggregates all of this information and stores it on disk in RRD format
- gweb (Ganglia Web): Ganglia's visualization tool, a PHP front end that displays the data stored by gmetad in a browser, presenting the cluster's collected metrics as charts in the web UI
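To have Flume report its metrics to Ganglia, start the agent with Flume's built-in monitoring properties; flume.monitoring.type and flume.monitoring.hosts are Flume's standard Ganglia reporting options (the gmond host and port below are illustrative):
bin/flume-ng agent -n a1 -c conf/ -f job/flume-netcat-logger.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hadoop101:8649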