- Flume Overview
- Apache Flume is a distributed, reliable, and resilient system for efficiently collecting, aggregating, and moving large volumes of log data from many different data sources into a centralized data store (HDFS, HBase)
- Supports many types of input sources and output destinations
- Supports multi-path flows, multi-channel fan-in, multi-channel fan-out, contextual routing, and more
- Flume External Architecture
Data generators (e.g. Facebook, Twitter) produce data that is collected by an individual agent running on the generator's server; a collector then aggregates the data from the various agents and stores it in HDFS or HBase
- Event: Flume's Basic Unit of Data Transfer
- Flume uses the Event object as its data-transfer format; it is the most basic unit of internal data transfer
- An Event consists of two parts: a byte array carrying the data, plus optional headers
- Headers are key/value pairs that can drive routing decisions or carry other structured information (such as the event's timestamp or the hostname of the originating server). Think of them as serving the same purpose as HTTP headers: transmitting extra information beyond the body. Different Flume sources add different headers to the events they generate
- The Body is a byte array containing the actual payload
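The Event structure described above can be sketched as a simple data class (a conceptual sketch, not Flume's actual Java API):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """Conceptual sketch of a Flume Event: a byte-array body plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)  # key/value metadata, e.g. timestamp, host

# Example: an event carrying a log line, tagged with its origin host
evt = Event(body=b"GET /index.html 200",
            headers={"host": "web01", "timestamp": "1620000000000"})
```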
- The Agent: Flume's Core
The core of Flume is the Agent. A Flume deployment contains one or more Agents, each of which is an independent daemon process. As shown in the diagram above, an Agent is made up of three components: source, channel, and sink:
- source
The source component collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources
- channel
1) A channel stores events passively: it buffers each event until the sink component has processed it. The channel is thus a transient store that caches event data received from the source until it is consumed by the sinks, acting as a bridge between source and sink. Channel operations are fully transactional, which guarantees consistency of data on send and receive. A channel can be connected to any number of sources and sinks
2) Channel storage options include file, memory, jdbc, and others
3) In practice, FileChannel is usually chosen over Memory Channel
–Memory Channel: transactions held in memory; very high throughput, but data can be lost
–File Channel: a transactional implementation backed by local disk; guarantees no data loss (implemented via a write-ahead log, WAL)
- sink
1) The sink component sends data to its destination; destinations include hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, and custom sinks
2) Once a sink has successfully taken an Event, it removes that Event from the Channel
3) A sink must be bound to exactly one Channel
- How Flume Runs
The core of Flume is the agent. An agent interacts with the outside world in two places: the input that receives data (the source) and the output (the sink), which is responsible for sending data to the specified external destination. When the source receives data, it hands it to the channel; the channel acts as a buffer that temporarily stores the data, and the sink then delivers the channel's data to the specified destination, such as HDFS or HBase.
Note: only after the sink has successfully sent the data out of the channel does the channel delete its temporary copy. This mechanism guarantees reliable and safe data transfer.
With that said, where exactly does Flume's reliability come from?
- If a node fails and transfer is interrupted mid-way, the data is recovered by rollback or by resending
- On a single node, the source writes to the channel batch by batch. If any data within a batch is abnormal, the whole batch is kept out of the channel, including the batch's other, normal data (the partially received data is simply discarded), and the upstream node resends it. Writing from the channel to the sink works the same way: data is removed from the channel only once the sink has actually consumed it.
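The all-or-nothing batch semantics above can be sketched as follows (a conceptual Python sketch; real Flume implements this with Channel transactions in Java, and the names here are made up):

```python
def deliver_batch(batch, channel):
    """Sketch of Flume's per-batch transaction: either every event in the
    batch is committed to the channel, or none is and upstream must resend."""
    staged = []
    for event in batch:
        if event is None:          # stand-in for a corrupt/abnormal event
            return False           # rollback: staged events are discarded, nothing reaches the channel
        staged.append(event)
    channel.extend(staged)         # commit: the whole batch becomes visible at once
    return True

channel = []
ok1 = deliver_batch([b"a", b"b"], channel)   # commits both events
ok2 = deliver_batch([b"c", None], channel)   # whole batch rejected; b"c" is discarded too
```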
- Agent Interceptor
- Interceptors are a chain of processors attached to a Source; in a preset order they filter events and apply custom processing logic where needed
- They sit between the app (application logs) and the source, intercepting and processing log data; that is, before logs enter the source, interceptors can wrap, clean, and filter them
- Interceptors provided out of the box include:
– Timestamp Interceptor: adds a header key named timestamp whose value is the current timestamp
– Host Interceptor: adds a header key named host whose value is the current machine's hostname or IP
– Static Interceptor: adds a custom key/value pair to the event's headers
– Regex Filtering Interceptor: uses a regular expression to include or exclude matching events
– Regex Extractor Interceptor: uses a regular expression to add a specified key to the headers, with the matched text as the value
- Flume interceptors form a chain: a source can be assigned multiple interceptors, which process each event one after another in order
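As a sketch, chaining a timestamp and a host interceptor on a source is configured like this (the agent/source names a1/r1 are illustrative; the interceptor types and properties are from the Flume user guide):

```properties
a1.sources.r1.interceptors = i1 i2
# i1 runs first: adds header "timestamp" with the event time in milliseconds
a1.sources.r1.interceptors.i1.type = timestamp
# i2 runs second: adds header "host"; useIP = false records the hostname instead of the IP
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.useIP = false
```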
- Agent Selector
There are two types of channel selectors:
- Replicating Channel Selector (default): sends events arriving from the source to every channel, like a broadcast
- Multiplexing Channel Selector: chooses which channel(s) each event should be sent to
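The two selector behaviors can be sketched as routing functions (conceptual only, not Flume's API):

```python
def replicating(event, channels):
    """Replicating selector: every channel receives a copy of the event (broadcast)."""
    return list(channels)

def multiplexing(event, mapping, default, header="areyouok"):
    """Multiplexing selector: route by a header's value; fall back to the default channels."""
    return mapping.get(event["headers"].get(header), default)

evt = {"headers": {"areyouok": "OK"}, "body": b"..."}
```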
- Installing Flume
- Download the package
wget http://www.apache.org/dist/flume/stable/apache-flume-1.9.0-bin.tar.gz
tar -zxvf apache-flume-1.9.0-bin.tar.gz
- Set environment variables
vim .bash_profile
Add the Flume environment variables
export FLUME_HOME=/app/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin
After saving the file, source it so the changes take effect
source .bash_profile
- Configure JAVA_HOME
cp flume-env.sh.template flume-env.sh
vim flume-env.sh
In flume-env.sh, set JAVA_HOME to your JDK path, for example export JAVA_HOME=/usr/lib/jvm/java-8-openjdk (adjust to your installation)
- Copy the Flume installation to every worker node and repeat the configuration above on each
scp -r apache-flume-1.9.0-bin/ hongqiang@slaver1:/app/
scp -r apache-flume-1.9.0-bin/ hongqiang@slaver2:/app/
- Flume in Practice
- netcat
vim flume_netcat.conf
# Name the components on this agent
# Define an agent named a1; its source is r1, its sink is k1, its channel is c1
# (comments must sit on their own lines: in a Java properties file, a trailing
#  "#comment" after a value becomes part of the value)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# netcat means data is fed in over the network; listen on the master node, port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 44444
# Describe the sink: the logger sink writes events to the console/log
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# capacity: the maximum number of events (messages) the channel can hold
a1.channels.c1.capacity = 1000
# transactionCapacity: the maximum number of events taken from the source per transaction
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent --conf conf --conf-file ./flume_netcat.conf --name a1 -Dflume.root.logger=INFO,console
Each line sent to master:44444 now shows up as an INFO-level event in the agent's console.
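To exercise the agent you can connect with nc/telnet, or send lines programmatically; the netcat source treats each newline-terminated line as one event. A minimal sketch (the host/port come from the config above, and the agent must be running for the send to succeed):

```python
import socket

def encode_lines(lines):
    """The netcat source is newline-delimited: each line becomes one event."""
    return "".join(line + "\n" for line in lines).encode("utf-8")

def send_lines(host, port, lines):
    """Open a TCP connection to the netcat source and send the lines."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(encode_lines(lines))

# Example (requires the agent above to be running):
# send_lines("master", 44444, ["hello flume", "second event"])
```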
- exec
vim flume_exec.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /app/apache-flume-1.9.0-bin/test_data/1.log
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent --conf conf --conf-file ./flume_exec.conf --name a1 -Dflume.root.logger=INFO,console
Lines appended to 1.log are now picked up by the exec source and printed by the logger sink.
- Writing sink output to HDFS
vim flume_hdfs_webpy.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## exec means Flume runs the given command and takes its data from that command's output
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /app/apache-flume-1.9.0-bin/test_data/1.log
a1.sources.r1.channels = c1
# Describe the sink
## sink to hdfs; the chosen type determines which of the parameters below apply
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
## Tells the HDFS sink where to write. The path below is not hard-coded: the time
## escapes make the output directory name vary dynamically
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
## use the local timestamp to resolve the time escapes in the path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type of the generated files; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
## use an in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent --conf conf --conf-file ./flume_hdfs_webpy.conf --name a1 -Dflume.root.logger=INFO,console
When a change to 1.log is detected, the new data is stored at the corresponding HDFS location (on startup, tail -F first emits the file's last 10 lines, then follows new appends).
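The %y-%m-%d/%H%M escapes in hdfs.path are filled in from the event timestamp (here the local clock, since useLocalTimeStamp = true). These particular escapes behave like the strftime directives of the same name, so the expansion can be sketched as:

```python
from datetime import datetime

def expand_hdfs_path(template, ts):
    """Approximate how the HDFS sink expands time escapes in hdfs.path:
    Flume's %y %m %d %H %M escapes match strftime directives of the same name."""
    return ts.strftime(template)

ts = datetime(2021, 5, 3, 14, 30)
path = expand_hdfs_path("/flume/tailout/%y-%m-%d/%H%M/", ts)  # "/flume/tailout/21-05-03/1430/"
```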
- Failover (a form of high availability): sink output goes to a primary and a standby node; when the primary node fails, Flume automatically switches to the standby
The Flume configuration on the master node:
vim flume-client.properties_failover
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /app/apache-flume-1.9.0-bin/test_data/1.log
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 52020
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 52020
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# The sink with the higher priority value is primary, so k1 (slaver1) is the primary here
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
a1.sinkgroups.g1.processor.maxpenalty = 10000
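The failover processor's priority rule can be sketched as: deliver to the highest-priority sink that is not currently penalized, and fall through to the next one on failure (a conceptual sketch; real Flume penalizes a failed sink with a growing backoff, capped by maxpenalty):

```python
def pick_sink(priorities, failed):
    """Choose the highest-priority sink that is not currently failed/penalized."""
    live = {sink: prio for sink, prio in priorities.items() if sink not in failed}
    if not live:
        raise RuntimeError("no live sinks in the group")
    return max(live, key=live.get)

prio = {"k1": 10, "k2": 1}   # from the config above: k1 (slaver1) has the higher priority
```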
The Flume configuration on the slaver1 node:
vim flume-server-failover.conf
# agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, slave to master
a1.sources.r1.type = avro
# listen on slaver1 here
a1.sources.r1.bind = slaver1
a1.sources.r1.port = 52020
# set sink to hdfs
a1.sinks.k1.type = logger
# a1.sinks.k1.type = hdfs
# a1.sinks.k1.hdfs.path=/flume_data_pool
# a1.sinks.k1.hdfs.fileType=DataStream
# a1.sinks.k1.hdfs.writeFormat=TEXT
# a1.sinks.k1.hdfs.rollInterval=1
# a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1
The Flume configuration on the slaver2 node:
vim flume-server-failover.conf
# agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# other node, slave to master
a1.sources.r1.type = avro
# listen on slaver2 here
a1.sources.r1.bind = slaver2
a1.sources.r1.port = 52020
# set sink to hdfs
a1.sinks.k1.type = logger
# a1.sinks.k1.type = hdfs
# a1.sinks.k1.hdfs.path=/flume_data_pool
# a1.sinks.k1.hdfs.fileType=DataStream
# a1.sinks.k1.hdfs.writeFormat=TEXT
# a1.sinks.k1.hdfs.rollInterval=1
# a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1
Once configuration is done, first start Flume on slaver1 and slaver2, then start Flume on the master node. When we write data to the 1.log file, the primary node slaver1 receives it. If Flume on slaver1 is killed manually and another message is sent, the standby slaver2 receives the data; once Flume on slaver1 is restarted, slaver1 receives the data again.
- Agent Selector
– replicating (broadcast)
Configuration on the master node:
vim flume_client_replicating.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
a1.sources.r1.host = master
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 50000
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 50000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
The slaver1 and slaver2 node configurations are the same as in the failover practice above
– multiplexing (chooses which node to send to based on the Event's header information)
vim flume_client_multiplexing.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 50000
a1.sources.r1.host = master
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.header = areyouok
a1.sources.r1.selector.mapping.OK = c1
a1.sources.r1.selector.mapping.NO = c2
a1.sources.r1.selector.default = c1
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 50000
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 50000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
The slaver1 and slaver2 node configurations are the same as in the failover practice above
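The HTTPSource's default JSON handler accepts a JSON array of {headers, body} objects, so the routing above can be exercised by POSTing events whose areyouok header is OK or NO. A sketch of building such a request body (actually sending it requires the agent to be running):

```python
import json

def make_payload(body, **headers):
    """Build the JSON body that Flume's HTTPSource (with its default JSONHandler)
    accepts: a JSON array of events, each with string headers and a string body."""
    return json.dumps([{"headers": headers, "body": body}])

payload = make_payload("hello", areyouok="OK")   # this event is routed to c1 (slaver1)

# Sending with curl while the agent is running:
#   curl -X POST http://master:50000 -d '[{"headers":{"areyouok":"NO"},"body":"hi"}]'
```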
More examples are available in the Flume user guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
If you spot any problems, feel free to leave a comment!