Big Data Components: Apache Flume Introduction, Architecture, Installation and Deployment, Collecting a Full Directory / an Incrementally Growing File to HDFS, Load Balancing, Failover, and the Static Interceptor

Versions used throughout:
apache-flume: 1.8.0
jdk: 1.8
hadoop: 2.7.5
zk: 3.4.9
Reference: the Flume User Guide

Flume Introduction and Architecture

(1) Overview

Flume is a tool for collecting, transporting, and aggregating massive volumes of data in big data systems. It is concerned specifically with the movement of data: taking data from one storage medium and delivering it, through Flume, to another.

(2) Core components

source: connects to the various data sources
sink: connects to the various destinations where data is written (where the data "sinks" to)
channel: temporarily buffers data between source and sink

(3) Flume collection system structure

(figure: structure of a Flume collection system, an agent wiring source, channel, and sink)

(4) How it runs

Flume itself is a Java program. On each machine where data needs to be collected, you start an agent process.
The agent process contains the source, sink, and channel.
Inside Flume, data is wrapped into events; the actual payload lives in the event body. The event is the smallest unit of data in Flume.
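For example, a one-line log message entering a source gets wrapped like this (the rendering below mirrors what a logger sink prints later in this article; the hex bytes are just the encoding of the body):

Event: { headers:{} body: 68 65 6C 6C 6F    hello }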

(5) Deployment architectures

Simple architecture: deploy a single agent.
Complex architecture: multiple agents chained together. In a chained setup there is no master/slave distinction; every agent has equal standing.

Flume Installation and Deployment

(1) Upload and extract

Upload the installation package to the node where the data source lives (node02 in my setup).
Download link for the installation package

tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /export/servers/

(2) Modify the configuration file

Copy the template configuration file, then edit the copy (I connect with Notepad++ to edit it):

cp flume-env.sh.template flume-env.sh

In conf/flume-env.sh, set the Java environment variable.
This guarantees that Flume can always resolve the environment variable correctly at runtime.
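A minimal sketch of the line to add (the JDK install path here is an assumption; point it at wherever your JDK 1.8 actually lives):

# flume-env.sh: tell Flume where the JDK is (path below is illustrative)
export JAVA_HOME=/export/servers/jdk1.8.0_141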
This configuration file has no execute permission, so set it:

chmod 755 flume-env.sh

(3) Test the environment with a small demo

cd /export/servers/apache-flume-1.8.0/conf
touch netcat-logger.conf

Edit the netcat-logger.conf file (again connecting to node02 with Notepad++):

  1. Contents (the Flume User Guide has a ready-made example you can paste in as-is):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  2. Start it
    Since we have not set up environment variables for Flume, run the following from the Flume root directory:
bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

flume-ng indicates this is the next-generation Flume. --conf conf specifies the configuration directory; --conf-file conf/netcat-logger.conf gives the path to the collection job file; --name a1 names this agent process a1; -Dflume.root.logger=INFO,console enables logging and prints it to the console.
3. Short-form command

bin/flume-ng agent -c ./conf -f ./conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

-c gives the configuration directory, -f gives the collection job file, -n names the agent, and the final flag turns on console logging.


  4. Connect to port 44444 and send some data
    First install a telnet client:
yum -y install telnet
telnet localhost 44444

Type hello world into the telnet window; the agent's console prints the received event.
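Roughly what to expect (illustrative; exact prompts, byte dump, and log prefix will vary):

telnet localhost 44444
Trying ::1...
Connected to localhost.
Escape character is '^]'.
hello world
OK

And on the agent's console:

INFO sink.LoggerSink: Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64 0D hello world. }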

Flume Collection Configuration Template

Every collection configuration breaks down into five parts:

# Name the components on this agent              1. Name the components
# Describe/configure the source                  2. Configure the source
# Describe the sink                              3. Configure the sink
# Use a channel which buffers events in memory   4. Configure the channel
# Bind the source and sink to the channel        5. Bind the source and sink to the channel

Flume Collection Examples

1. Collecting an entire directory to HDFS

(1) Requirement:

A server keeps producing new files under a particular directory. Whenever a new file appears, it must be collected into HDFS.

  • Source (monitors a directory for new files): Spooling Directory Source
  • Sink (the destination, here the HDFS file system): HDFS Sink
  • Channel between source and sink: Memory Channel (matching the configuration below)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
## Note: never drop two files with the same name into the monitored directory
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs1
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type. The default is SequenceFile; set DataStream for plain text.
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(2) Parameter walkthrough:

The roll* properties control how the file currently being written in HDFS rolls over (file rolling):
a1.sinks.k1.hdfs.rollInterval = 3   roll by time interval (seconds)
a1.sinks.k1.hdfs.rollSize = 20      roll by file size (bytes)
a1.sinks.k1.hdfs.rollCount = 5      roll by event count
If all three are set, whichever condition is met first triggers the roll.
To disable rolling on a given property, set it to 0 (so the config above rolls purely on a 60-second timer).

The round* properties enable time rounding and control how often a new directory is created (directory rolling; one directory can hold many files).
With the settings below, a new directory is created every 10 minutes, so an event arriving at 18:26 lands under .../1820/:
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

capacity: the maximum number of events the channel can hold
transactionCapacity: the maximum number of events taken from the source, or handed to the sink, per transaction

(3) Start monitoring

bin/flume-ng agent -c ./conf -f ./conf/spooldir-hdfs.conf -n a1 -Dflume.root.logger=INFO,console


2. The most important pitfall

  • Never drop a file with a name that already exists into the monitored directory. If a same-named file arrives, Flume throws an error and stops working; it will not monitor or collect anything afterwards.
  • How to guarantee unique names: in production, a timestamp is usually appended to each file name so no two files ever collide, as in the sketch below.
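A minimal sketch of that timestamp convention, assuming files are staged in /root/staging before being handed to the monitored /root/logs1 directory (both paths are illustrative):

#!/bin/bash
# Move staged log files into the spooldir with a timestamp suffix,
# so no two files dropped into /root/logs1 ever share a name.
for f in /root/staging/*.log; do
    mv "$f" /root/logs1/"$(basename "$f").$(date +%Y%m%d%H%M%S%N)"
done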

3. Incremental file collection to HDFS

(1) Requirement:

A business system writes its logs with log4j, and the log file keeps growing. We need to collect the newly appended log lines into HDFS in real time.

  • The three key components:
    1. Source (monitors a file for appended content): Exec Source
    2. Sink (HDFS file system): HDFS Sink
    3. Channel between source and sink: Memory Channel
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/test.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume1/tailout/%y-%m-%d/%H-%M/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type. The default is SequenceFile; set DataStream for plain text.
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(2) Write a shell script that simulates incoming data

vim addDate.sh

Contents:

#!/bin/bash
# append the current date to the tailed log file, twice per second
while true
do
	date >> /root/log/test.log
	sleep 0.5
done

(3) Start the Flume monitor

bin/flume-ng agent -c ./conf -f ./conf/tail-hdfs.conf -n a1 -Dflume.root.logger=INFO,console

Then start the data-generating script:
sh addDate.sh
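To verify the pipeline end to end, list what lands in HDFS (the path mirrors the hdfs.path configured above; the date-based subdirectories depend on when you run it):

hdfs dfs -ls -R /flume1/tailout/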

Flume Load Balancing (load_balance) and Failover

1. Flume load balancing

(1) When to use it:

When several Flume agents are chained in series and one node lacks the processing power to keep up, data can pile up behind it. The fix is to run several downstream agents in parallel. A parallel tier needs a load-balancing algorithm to distribute the work (round robin, random, or weighted), and each event must be routed to exactly one downstream agent so that no data is duplicated.

(2) Copy the Flume installation to the other two nodes

scp -r /export/servers/flume node01:$PWD
scp -r /export/servers/flume node03:$PWD

(3) Chaining Flume agents across the network

  • avro sink
  • avro source
    Bind these two components to an IP and port and data can travel across the network between agents; this is the standard pairing in chained Flume architectures.

(4) Chained configuration

  • node01, configuration file named exec-avro.conf:
#agent1 name
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2

#set group
agent1.sinkgroups = g1

#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /root/logs/123.log

# set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 52020

# set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = node03
agent1.sinks.k2.port = 52020

#set sink group
agent1.sinkgroups.g1.sinks = k1 k2

#set load balancing
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.selector.maxTimeOut=10000
  • node02, configuration file named avro-logger.conf:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node02
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • node03, configuration file named avro-logger.conf:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = node03
a1.sources.r1.port = 52020

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1



(5) Starting the chain

Start from the agents farthest from the data source and work back toward it; this avoids losing data while downstream receivers are not yet up.
Start node03:

bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

Start node02:

bin/flume-ng agent -c conf -f conf/avro-logger.conf -n a1 -Dflume.root.logger=INFO,console

Start node01:

bin/flume-ng agent -c conf -f conf/exec-avro.conf -n agent1 -Dflume.root.logger=INFO,console
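A quick way to watch the round-robin selector at work (the log path matches the exec source configured above):

# on node01: append a few numbered lines to the tailed file
for i in $(seq 1 10); do echo "line $i" >> /root/logs/123.log; sleep 1; done

The logger-sink consoles on node02 and node03 should now show the events split between the two nodes.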

2. Failover

Building on the setup above, change the sink group section of the node01 configuration to failover; everything else stays the same:

#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
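With these priorities, k1 (node02) receives all events while it is healthy; if node02 dies, traffic fails over to k2 (node03), and maxpenalty caps the failed sink's penalty backoff at 10 seconds. A simple way to watch the failover (the grep pattern is illustrative; adjust it to match your process list):

# on node02: stop the agent to simulate a failure
kill -9 $(ps -ef | grep 'avro-logger' | grep -v grep | awk '{print $2}')

Keep appending to /root/logs/123.log on node01; events should now arrive on node03's console instead of node02's.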

Flume Interceptor Example

1. The static interceptor

(1) When to use it:

An interceptor lets Flume intercept events as they enter the source and tag them with a key-value pair in the event headers. Downstream, when the data is written to HDFS, that tag can be used to separate the data back out and store each kind in its own place.

Without the static interceptor:
Event: { headers:{} body:  36 Sun Jun  2 18:26 }

With the static interceptor, each source adds its own key-value tag:
Event: { headers:{type=access} body:  36 Sun Jun  2 18:26 }
Event: { headers:{type=nginx} body:  36 Sun Jun  2 18:26 }
Event: { headers:{type=web} body:  36 Sun Jun  2 18:26 }

Later, when writing the data out, Flume's escape-sequence syntax can reference the header the interceptor added:

%{type}
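For example, an HDFS sink path can embed the header so that tagged events fan out into per-type directories (this is how the node02 configuration below routes the data):

a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d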

(2) The worked example

  1. Scenario
    Machines A and B both generate logs in real time, mainly of three types: access.log, nginx.log, and web.log.
    Requirement:

Collect access.log, nginx.log, and web.log from machines A and B onto machine C, then write everything to HDFS.
HDFS must use the following directory layout:

/source/logs/access/20160101/**
/source/logs/nginx/20160101/**
/source/logs/web/20160101/**

  2. On node01, configure exec_source_avro_sink.conf:
# Name the components on this agent
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/access.log
# add an interceptor
a1.sources.r1.interceptors = i1
# make it a static interceptor
a1.sources.r1.interceptors.i1.type = static

a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node02
a1.sinks.k1.port = 41414

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
  3. On node02, configure avro_source_hdfs_sink.conf:
# name the agent's source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1


# define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = node02
a1.sources.r1.port =41414

# add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder


# define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# define the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix =events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# timestamp handling: the timestamp interceptor above stamps each event,
# so useLocalTimeStamp stays disabled
#a1.sinks.k1.hdfs.useLocalTimeStamp = true
# do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# roll every 20 seconds
a1.sinks.k1.hdfs.rollInterval = 20
# roll files once they reach this size (10 MB)
a1.sinks.k1.hdfs.rollSize  = 10485760
# number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20
# number of threads Flume uses for HDFS operations (create, write, etc.)
a1.sinks.k1.hdfs.threadsPoolSize=10
# timeout for HDFS operations
a1.sinks.k1.hdfs.callTimeout=30000

# wire source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  4. Start Flume on node02
bin/flume-ng agent -c ./conf -f ./conf/avro_source_hdfs_sink.conf -n a1 -Dflume.root.logger=INFO,console
  5. Start Flume on node01
bin/flume-ng agent -c ./conf -f ./conf/exec_source_avro_sink.conf -n a1 -Dflume.root.logger=INFO,console
  6. Simulate data generation
    A few shell loops produce the three log streams:
while true; do echo "access access....." >> /root/logs/access.log;sleep 0.5;done
while true; do echo "web web....." >> /root/logs/web.log;sleep 0.5;done
while true; do echo "nginx nginx....." >> /root/logs/nginx.log;sleep 0.5;done

