Versions used:
apache-flume: 1.8.0
jdk: 1.8
hadoop: 2.7.5
zookeeper: 3.4.9
Flume User Guide (official documentation)
Flume Overview and Architecture
(1) Overview
Flume is a tool for collecting, aggregating, and moving large volumes of data in big-data systems. It is all about the flow of data, that is, moving data from one storage medium to another through Flume.
(2) Core components
source: connects to the various data sources
sink: connects to the various destinations where the data is written (where the data "sinks" to)
channel: buffers data in transit between source and sink
(3) Flume collection system structure
(4) Runtime mechanism
Flume itself is a Java program. On every machine that needs to collect data, an agent process is started.
The agent process contains a source, a channel, and a sink.
In Flume, data is wrapped into events; the actual payload is stored in the event body. The event is the smallest unit of data in Flume.
(5) Deployment architectures
Simple: a single agent is deployed.
Complex: multiple agents chained together; in a chained topology there is no master/slave relationship, all agents are peers.
Flume Installation and Deployment
(1) Upload and extract
Upload the tarball to the node where the data source lives (node02 in my setup).
tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /export/servers/
(2) Edit the configuration files
Copy the configuration template and edit the copy (I connect to the node with Notepad++):
cp flume-env.sh.template flume-env.sh
In conf/flume-env.sh, export the Java environment variable,
so that Flume can always resolve the Java environment when it runs.
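For example (a minimal sketch; the JDK install path below is an assumption from my own environment, adjust it to yours):
export JAVA_HOME=/export/servers/jdk1.8.0_65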
The file does not have execute permission by default, so set it:
chmod 755 flume-env.sh
(3) A small demo to verify that the environment works
cd /export/servers/apache-flume-1.8.0-bin/conf
touch netcat-logger.conf
Edit the netcat-logger.conf file with Notepad++ connected to node02.
- Content (the Flume user guide has a ready-made version you can paste and use directly):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start
Because Flume is not on the PATH, run the following from the Flume root directory:
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng marks this as the new-generation Flume launcher; --conf conf points at the configuration directory; --conf-file conf/netcat-logger.conf gives the path of the collection job file;
--name a1 names the agent process a1; -Dflume.root.logger=INFO,console enables logging to the console.
- Short form of the same command
bin/flume-ng agent -c ./conf -f ./conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c gives the configuration directory, -f the collection job file, -n the agent name; the -D flag enables console logging.
- Connect to port 44444 and send some data
First install the client tool:
yum -y install telnet
telnet localhost 44444
Type hello world in the telnet window.
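On the Flume console you should then see the logger sink print the event, something like the line below (the exact bytes depend on what was typed; telnet appends a carriage return):
Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }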
Flume Collection Configuration Template
A job configuration file breaks down into five parts:
# Name the components on this agent              1. name the components
# Describe/configure the source                  2. configure the source
# Describe the sink                              3. configure the sink
# Use a channel which buffers events in memory   4. configure the channel
# Bind the source and sink to the channel        5. bind the source and sink to the channel
Flume Collection Examples
1. Full collection of a directory to HDFS
(1) Requirement:
A server keeps producing new files in a specific directory; whenever a new file appears, it must be collected into HDFS.
- Source (monitors a directory): Spooling Directory Source
- Sink (HDFS file system): HDFS Sink
- Channel between them: Memory Channel (as used in the configuration below)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## Note: never drop a file whose name was already used into the monitored directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs1
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Parameter breakdown:
The roll* properties control how the HDFS sink rolls over to a new file:
a1.sinks.k1.hdfs.rollInterval = 3    roll by time interval (seconds)
a1.sinks.k1.hdfs.rollSize = 20       roll by file size (bytes)
a1.sinks.k1.hdfs.rollCount = 5       roll by number of events
If all three are configured, whichever condition is met first triggers the roll.
To disable rolling on a given property, set it to 0. In the configuration above, rollInterval = 60 with rollSize and rollCount at 0 rolls a new file purely by time, once a minute.
The round* properties control whether timestamps are rounded down, which controls how often a new directory is created (one directory can hold several files). With the values below, a new directory is generated every 10 minutes:
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
capacity: the maximum number of events the channel can hold
transactionCapacity: the maximum number of events taken from the source or given to the sink in a single transaction
(3) Start the agent
bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
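A quick smoke test, assuming the /root/logs1 spool directory from the configuration above: create the directory, drop in a uniquely named file, and watch the agent pick it up.
mkdir -p /root/logs1
cp /etc/profile /root/logs1/profile-$(date +%s).log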
2. The most important pitfall
- Never drop a file whose name was already collected into the monitored directory. If a duplicate name arrives, Flume throws an error and stops working, and no further files are collected.
- How to guarantee unique names: in production, a timestamp is usually appended to each file name before it is handed to the spool directory, as sketched below.
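A minimal sketch of that hand-off (the file name and spool directory here are illustrative):
mv web.log web.log.$(date +%Y%m%d%H%M%S)
mv web.log.* /root/logs1/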
3. Incremental collection of a file to HDFS
(1) Requirement:
The business system writes its log with log4j, and the log file keeps growing. We need to ship the newly appended log lines to HDFS in real time.
- The three key components:
- Source (monitors file appends): Exec Source
- Sink (HDFS): HDFS Sink
- Channel between source and sink: Memory Channel
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test.log
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume1/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Write a shell script that simulates incoming data
vim addDate.sh
with the following content:
#!/bin/bash
# Append the current date to the tailed log file, twice a second.
while true
do
  date >> /root/log/test.log
  sleep 0.5
done
(3) Start the Flume agent
bin/flume-ng agent -c ./conf -f ./conf/tail-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Then run the data-generating script:
sh addDate.sh &
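If everything is wired correctly, rolled files should start appearing under the HDFS path configured in the sink above:
hdfs dfs -ls -R /flume1/tailout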
Flume Load Balancing and Failover
1. Flume load balancing
(1) Use case:
When several Flume agents are chained and one downstream node cannot keep up, large amounts of data pile up. The fix is to run several downstream agents in parallel. A parallel fan-out needs a load-balancing algorithm (round-robin, random, or weighted) to distribute the load, and each event must be sent to exactly one downstream sink so that no data is duplicated.
(2) Distribute Flume to the other two nodes
scp -r /export/servers/flume node01:$PWD
scp -r /export/servers/flume node03:$PWD
(3) Chaining Flume agents across the network
- avro sink
- avro source
Binding these two components to a host and port lets data travel across the network; they are the standard pair in chained Flume architectures.
(4) Chained configuration
- node01, configuration file name: exec-avro.conf
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set load balance
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut=10000
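To distribute at random instead of round-robin, the selector is a one-line swap (round_robin and random are the two built-in load-balancing selectors):
agent1.sinkgroups.g1.processor.selector = random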
- node02, configuration file name: avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node02
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- node03, configuration file name: avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node03
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(5) Starting the chain
Start from the node furthest from the data source, so that no data is lost while a downstream agent is not yet listening.
Start node03:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node02:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node01:
bin/flume-ng agent -c conf -f conf/exec-avro.conf -n agent1 -Dflume.root.logger=INFO,console
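To watch the balancing in action, append lines to the tailed file on node01 (the path comes from exec-avro.conf above) and watch events alternate between the node02 and node03 consoles:
while true; do echo "$(date)" >> /root/logs/123.log; sleep 1; done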
2. Failover
Building on the setup above, only the sink-group processor settings on node01 change; everything else stays the same. The sink with the higher priority (k1, priority 10 below) receives all traffic; if it goes down, events fail over to the lower-priority sink k2. maxpenalty caps how long, in milliseconds, a failed sink is kept blacklisted.
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Flume Interceptor Example
1. Static interceptor
(1) Use case:
An interceptor hooks into the point where data enters a source. The static interceptor adds a fixed key/value pair to each event's headers, and when the data is later written to HDFS that tag can be used to separate the events back out into different locations.
Without the static interceptor:
Event: { headers:{} body: 36 Sun Jun 2 18:26 }
With the static interceptor, each source adds its own k/v tag:
Event: { headers:{type=access} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=nginx} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=web} body: 36 Sun Jun 2 18:26 }
Downstream, the header value added by the interceptor can be referenced with Flume's escape-sequence syntax:
%{type}
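For example, a header-driven HDFS path (this is the pattern the sink configuration later in this section relies on):
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d/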
(2) Example
- Scenario
Two log servers, A and B, produce three kinds of logs in real time: access.log, nginx.log, and web.log.
Requirement:
Collect access.log, nginx.log, and web.log from machines A and B onto machine C, then ship everything to HDFS.
The required HDFS layout is:
/source/logs/access/20160101/**
/source/logs/nginx/20160101/**
/source/logs/web/20160101/**
- On node01, create exec_source_avro_sink.conf:
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/access.log
# add an interceptor
a1.sources.r1.interceptors = i1
# make it a static interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
- On node02, create avro_source_hdfs_sink.conf:
# name the source, channel, and sink on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port =41414
# add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# the timestamp header is set by the TimestampInterceptor above, so local timestamps are not needed
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
# do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# roll a new file every 20 seconds
a1.sinks.k1.hdfs.rollInterval = 20
# roll files by size, here 10 MB
a1.sinks.k1.hdfs.rollSize = 10485760
# number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
# number of threads Flume uses for HDFS operations (open, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# timeout for HDFS calls, in milliseconds
a1.sinks.k1.hdfs.callTimeout = 30000
# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start Flume on node02 first, so that it is listening before node01 connects:
bin/flume-ng agent -c ./conf -f ./conf/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Then start Flume on node01:
bin/flume-ng agent -c ./conf -f ./conf/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Simulate the data
Run one shell loop per log file:
while true; do echo "access access....." >> /root/logs/access.log;sleep 0.5;done
while true; do echo "web web....." >> /root/logs/web.log;sleep 0.5;done
while true; do echo "nginx nginx....." >> /root/logs/nginx.log;sleep 0.5;done