Versions used:
apache-flume: 1.8.0
jdk: 1.8
hadoop: 2.7.5
zookeeper: 3.4.9
Flume User Guide (official documentation)
Flume Overview and Architecture
(1) Overview
Flume is a tool for collecting, aggregating, and moving large volumes of data in big-data systems. It is all about the flow of data, that is, moving data from one storage medium to another through Flume.
(2) Core components
source: connects to the various data sources
sink: connects to the various destinations where the data is written (where the data "sinks" to)
channel: buffers data in transit between source and sink
(3) Flume collection system structure
(4) Runtime mechanism
Flume itself is a Java program. On every machine that needs to collect data, an agent process is started.
The agent process contains a source, a channel, and a sink.
In Flume, data is wrapped into events; the actual payload is stored in the event body. The event is the smallest unit of data in Flume.
(5) Deployment architectures
Simple: a single agent is deployed.
Complex: multiple agents chained together; in a chained topology there is no master/slave relationship, all agents are peers.
Flume Installation and Deployment
(1) Upload and extract
Upload the tarball to the node where the data source lives (node02 in my setup).
tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /export/servers/
(2) Edit the configuration files
Copy the configuration template and edit the copy (I connect to the node with Notepad++):
cp flume-env.sh.template flume-env.sh
In conf/flume-env.sh, export the Java environment variable,
so that Flume can always resolve the Java environment when it runs.
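For example (a minimal sketch; the JDK install path below is an assumption from my own environment, adjust it to yours):
export JAVA_HOME=/export/servers/jdk1.8.0_65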
The file does not have execute permission by default, so set it:
chmod 755 flume-env.sh
(3) A small demo to verify that the environment works
cd /export/servers/apache-flume-1.8.0-bin/conf
touch netcat-logger.conf
Edit the netcat-logger.conf file with Notepad++ connected to node02.
- Content (the Flume user guide has a ready-made version you can paste and use directly):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start
Because Flume is not on the PATH, run the following from the Flume root directory:
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng marks this as the new-generation Flume launcher; --conf conf points at the configuration directory; --conf-file conf/netcat-logger.conf gives the path of the collection job file;
--name a1 names the agent process a1; -Dflume.root.logger=INFO,console enables logging to the console.
- Short form of the same command
bin/flume-ng agent -c ./conf -f ./conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c gives the configuration directory, -f the collection job file, -n the agent name; the -D flag enables console logging.
- Connect to port 44444 and send some data
First install the client tool:
yum -y install telnet
telnet localhost 44444
Type hello world in the telnet window.
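On the Flume console you should then see the logger sink print the event, something like the line below (the exact bytes depend on what was typed; telnet appends a carriage return):
Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }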
Flume Collection Configuration Template
A job configuration file breaks down into five parts:
# Name the components on this agent              1. name the components
# Describe/configure the source                  2. configure the source
# Describe the sink                              3. configure the sink
# Use a channel which buffers events in memory   4. configure the channel
# Bind the source and sink to the channel        5. bind the source and sink to the channel
Flume Collection Examples
1. Full collection of a directory to HDFS
(1) Requirement:
A server keeps producing new files in a specific directory; whenever a new file appears, it must be collected into HDFS.
- Source (monitors a directory): Spooling Directory Source
- Sink (HDFS file system): HDFS Sink
- Channel between them: Memory Channel (as used in the configuration below)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## Note: never drop a file whose name was already used into the monitored directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs1
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Parameter breakdown:
The roll* properties control how the HDFS sink rolls over to a new file:
a1.sinks.k1.hdfs.rollInterval = 3    roll by time interval (seconds)
a1.sinks.k1.hdfs.rollSize = 20       roll by file size (bytes)
a1.sinks.k1.hdfs.rollCount = 5       roll by number of events
If all three are configured, whichever condition is met first triggers the roll.
To disable rolling on a given property, set it to 0. In the configuration above, rollInterval = 60 with rollSize and rollCount at 0 rolls a new file purely by time, once a minute.
The round* properties control whether timestamps are rounded down, which controls how often a new directory is created (one directory can hold several files). With the values below, a new directory is generated every 10 minutes:
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
capacity: the maximum number of events the channel can hold
transactionCapacity: the maximum number of events taken from the source or given to the sink in a single transaction
(3) Start the agent
bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
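A quick smoke test, assuming the /root/logs1 spool directory from the configuration above: create the directory, drop in a uniquely named file, and watch the agent pick it up.
mkdir -p /root/logs1
cp /etc/profile /root/logs1/profile-$(date +%s).log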
2. The most important pitfall
- Never drop a file whose name was already collected into the monitored directory. If a duplicate name arrives, Flume throws an error and stops working, and no further files are collected.
- How to guarantee unique names: in production, a timestamp is usually appended to each file name before it is handed to the spool directory, as sketched below.
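A minimal sketch of that hand-off (the file name and spool directory here are illustrative):
mv web.log web.log.$(date +%Y%m%d%H%M%S)
mv web.log.* /root/logs1/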
3. Incremental collection of a file to HDFS
(1) Requirement:
The business system writes its log with log4j, and the log file keeps growing. We need to ship the newly appended log lines to HDFS in real time.
- The three key components:
- Source (monitors file appends): Exec Source
- Sink (HDFS): HDFS Sink
- Channel between source and sink: Memory Channel
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test.log
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume1/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(2) Write a shell script that simulates incoming data
vim addDate.sh
with the following content:
#!/bin/bash
# Append the current date to the tailed log file, twice a second.
while true
do
  date >> /root/log/test.log
  sleep 0.5
done
(3) Start the Flume agent
bin/flume-ng agent -c ./conf -f ./conf/tail-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Then run the data-generating script:
sh addDate.sh &
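If everything is wired correctly, rolled files should start appearing under the HDFS path configured in the sink above:
hdfs dfs -ls -R /flume1/tailout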
Flume Load Balancing and Failover
1. Flume load balancing
(1) Use case:
When several Flume agents are chained and one downstream node cannot keep up, large amounts of data pile up. The fix is to run several downstream agents in parallel. A parallel fan-out needs a load-balancing algorithm (round-robin, random, or weighted) to distribute the load, and each event must be sent to exactly one downstream sink so that no data is duplicated.
(2) Distribute Flume to the other two nodes
scp -r /export/servers/flume node01:$PWD
scp -r /export/servers/flume node03:$PWD
(3) Chaining Flume agents across the network
- avro sink
- avro source
Binding these two components to a host and port lets data travel across the network; they are the standard pair in chained Flume architectures.
(4) Chained configuration
- node01, configuration file name: exec-avro.conf
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log
# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020
# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set load balance
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut=10000
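To distribute at random instead of round-robin, the selector is a one-line swap (round_robin and random are the two built-in load-balancing selectors):
agent1.sinkgroups.g1.processor.selector = random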
- node02, configuration file name: avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node02
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- node03, configuration file name: avro-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node03
a1.sources.r1.port = 52020
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
(5) Starting the chain
Start from the node furthest from the data source, so that no data is lost while a downstream agent is not yet listening.
Start node03:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node02:
bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console
Start node01:
bin/flume-ng agent -c conf -f conf/exec-avro.conf -n agent1 -Dflume.root.logger=INFO,console
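To watch the balancing in action, append lines to the tailed file on node01 (the path comes from exec-avro.conf above) and watch events alternate between the node02 and node03 consoles:
while true; do echo "$(date)" >> /root/logs/123.log; sleep 1; done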
2. Failover
Building on the setup above, only the sink-group processor settings on node01 change; everything else stays the same. The sink with the higher priority (k1, priority 10 below) receives all traffic; if it goes down, events fail over to the lower-priority sink k2. maxpenalty caps how long, in milliseconds, a failed sink is kept blacklisted.
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
Flume Interceptor Example
1. Static interceptor
(1) Use case:
An interceptor hooks into the point where data enters a source. The static interceptor adds a fixed key/value pair to each event's headers, and when the data is later written to HDFS that tag can be used to separate the events back out into different locations.
Without the static interceptor:
Event: { headers:{} body: 36 Sun Jun 2 18:26 }
With the static interceptor, each source adds its own k/v tag:
Event: { headers:{type=access} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=nginx} body: 36 Sun Jun 2 18:26 }
Event: { headers:{type=web} body: 36 Sun Jun 2 18:26 }
Downstream, the header value added by the interceptor can be referenced with Flume's escape-sequence syntax:
%{type}
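For example, a header-driven HDFS path (this is the pattern the sink configuration later in this section relies on):
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d/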
(2) Example
- Scenario
Two log servers, A and B, produce three kinds of logs in real time: access.log, nginx.log, and web.log.
Requirement:
Collect access.log, nginx.log, and web.log from machines A and B onto machine C, then ship everything to HDFS.
The required HDFS layout is:
/source/logs/access/20160101/**
/source/logs/nginx/20160101/**
/source/logs/web/20160101/**
- On node01, create exec_source_avro_sink.conf:
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/access.log
# add an interceptor
a1.sources.r1.interceptors = i1
# make it a static interceptor
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 41414
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
- On node02, create avro_source_hdfs_sink.conf:
# name the source, channel, and sink on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port =41414
# add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# the timestamp header is set by the TimestampInterceptor above, so local timestamps are not needed
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
# do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# roll a new file every 20 seconds
a1.sinks.k1.hdfs.rollInterval = 20
# roll files by size, here 10 MB
a1.sinks.k1.hdfs.rollSize = 10485760
# number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
# number of threads Flume uses for HDFS operations (open, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# timeout for HDFS calls, in milliseconds
a1.sinks.k1.hdfs.callTimeout = 30000
# bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start Flume on node02 first, so that it is listening before node01 connects:
bin/flume-ng agent -c ./conf -f ./conf/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Then start Flume on node01:
bin/flume-ng agent -c ./conf -f ./conf/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
- Simulate the data
Run one shell loop per log file:
while true; do echo "access access....." >> /root/logs/access.log;sleep 0.5;done
while true; do echo "web web....." >> /root/logs/web.log;sleep 0.5;done
while true; do echo "nginx nginx....." >> /root/logs/nginx.log;sleep 0.5;done