Getting Started with Flume: Introduction, Installation, and Practice

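- Introduction to Flume

Apache Flume is a distributed, reliable, and resilient system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources into a centralized data store such as HDFS or HBase.
It supports many types of input sources and output destinations.
It supports multi-hop flows, fan-in and fan-out flows, contextual routing, and more.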

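- Flume's external architecture

Data produced by data generators (such as Facebook or Twitter) is collected by individual agents running on the generators' own servers. A data collector then gathers the data from those agents and stores it in HDFS or HBase.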

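- Event, Flume's basic unit of data transfer

Flume passes data around as Event objects; the event is the most basic unit of data transfer inside Flume.
An event has two parts: a byte array that carries the payload, plus optional headers.
Headers are key/value pairs that can be used to make routing decisions or to carry other structured information (such as the event's timestamp or the host name of the originating server). You can think of them as serving the same purpose as HTTP headers: they carry extra information alongside the body. Different Flume sources add different headers to the events they generate.
The body is a byte array that holds the actual content.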
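
To make the headers-plus-body shape concrete, here is a minimal Python sketch. The Event class below is a hypothetical stand-in written for illustration, not Flume's Java Event interface, and the header values are made-up examples.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: a byte-array body plus optional headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# A log line travels as raw bytes; metadata rides alongside it in the headers.
event = Event(
    body="GET /index.html 200".encode("utf-8"),
    headers={"timestamp": "1700000000000", "host": "master"},
)

# Routing logic can look at the headers without parsing the body.
print(event.headers["host"])
# The payload is opaque bytes until a consumer decodes it.
print(event.body.decode("utf-8"))
```

Keeping the body as bytes means the transport never has to understand the log format; only the headers need to be structured.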

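- Agent, the core of Flume

The core of Flume is the agent. A Flume deployment has one or more agents, and each agent is an independent daemon process. An agent is made up of three components: source, channel, and sink.

source
The source component collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
channel
1) A channel is a passive store: it buffers each event until a sink has processed it. It is therefore a temporary container that holds the events received from the source until sinks consume them, acting as a bridge between source and sink. Channel operations are transactional, which keeps data consistent as it is sent and received. A channel can be connected to any number of sources and sinks.
2) Channel types include file, memory, jdbc, and others.
3) File Channel is usually preferred over Memory Channel:
– Memory Channel: transactions are held in memory; throughput is very high, but data can be lost if the agent fails
– File Channel: transactions are persisted to local disk through a write-ahead log (WAL), so data is not lost
sink
1) The sink component sends data to its destination. Destinations include hdfs, logger, avro, thrift, irc, file, null, hbase, solr, and custom sinks.
2) After a sink has successfully taken an event, the event is removed from the channel.
3) Each sink reads from exactly one channel.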

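- How Flume works
The core of Flume is the agent. An agent interacts with the outside world at two points: the source, which accepts incoming data, and the sink, which sends data on to an external destination. After the source receives data, it hands the data to the channel, which acts as a buffer and stores it temporarily; the sink then delivers the data in the channel to the configured destination, such as HDFS or HBase.
Note: the channel deletes its temporary copy only after the sink has successfully sent the data out. This mechanism is what makes the transfer reliable.

So where does Flume's reliability come from?

If a node fails and a transfer is interrupted, the gap is covered by rolling back the transaction or resending the data.
Within a node, the source writes to the channel batch by batch. If any event in a batch fails, nothing from that batch is written to the channel: the events already received for that batch are discarded, and the upstream node is relied on to resend them. The channel-to-sink direction works the same way: data is deleted from the channel only once the sink has actually consumed it.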
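
The all-or-nothing batch behaviour described above can be sketched in a few lines of Python. SketchChannel and its methods are hypothetical names invented for this illustration; Flume's real Channel and Transaction classes are Java and more involved, and a None entry is used here only as a stand-in for a bad event.

```python
from collections import deque


class SketchChannel:
    """Toy in-memory channel; illustrative only, not Flume's Channel API."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.queue = deque()

    def put_batch(self, events):
        """Source side: the whole batch is committed, or none of it is."""
        staged = []
        for event in events:
            if event is None:  # stand-in for an event that fails validation
                return False   # roll back: staged events are simply discarded
            staged.append(event)
        if len(self.queue) + len(staged) > self.capacity:
            return False       # not enough room: nothing is written
        self.queue.extend(staged)
        return True

    def take_batch(self, size, deliver):
        """Sink side: events leave the channel only if deliver() succeeds."""
        count = min(size, len(self.queue))
        taken = [self.queue.popleft() for _ in range(count)]
        try:
            deliver(taken)
        except Exception:
            # Roll back: return the events to the front in their original order.
            self.queue.extendleft(reversed(taken))
            return False
        return True


def failing_sink(batch):
    raise IOError("destination unavailable")


channel = SketchChannel()
channel.put_batch(["a", "b"])      # committed
channel.put_batch(["c", None])     # rejected as a whole, "c" is not written
channel.take_batch(10, failing_sink)
print(list(channel.queue))         # both events are still buffered
```

Because a failed delivery puts the batch back, the same events can be delivered again later, which is why downstream consumers may occasionally see duplicates rather than gaps.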

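- Agent Interceptor

Interceptors are a chain attached to a source. They run in the configured order and filter events or apply custom processing where needed.
An interceptor sits between the source and the channel: after the source has turned incoming application logs into events, and before those events are written to the channel, interceptors can decorate, clean, or filter them.
The interceptors provided out of the box include:
– Timestamp Interceptor: adds a header with the key timestamp whose value is the current time in milliseconds
– Host Interceptor: adds a header with the key host whose value is the host name or IP address of the current machine
– Static Interceptor: adds a header with a user-defined key and value
– Regex Filtering Interceptor: uses a regular expression to include or exclude matching events
– Regex Extractor Interceptor: uses a regular expression to extract matched groups and add them as headers under configured keys
Interceptors form a chain: you can assign several interceptors to one source, and they run one after another in order.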
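
The chain behaviour can be sketched as plain Python functions applied in order. Everything here (the function names, the dict-based event, the datacenter header) is hypothetical and only mimics what the built-in interceptors do; it is not how Flume implements them.

```python
import re
import socket
import time


def timestamp_interceptor(event):
    """Add the current time in milliseconds under the 'timestamp' header."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event


def host_interceptor(event):
    """Add this machine's host name under the 'host' header."""
    event["headers"]["host"] = socket.gethostname()
    return event


def static_interceptor(key, value):
    """Build an interceptor that adds one fixed header."""
    def intercept(event):
        event["headers"][key] = value
        return event
    return intercept


def regex_filter(pattern, exclude=False):
    """Build an interceptor that keeps (or, with exclude=True, drops) matching events."""
    compiled = re.compile(pattern)

    def intercept(event):
        matched = compiled.search(event["body"].decode("utf-8")) is not None
        return event if matched != exclude else None
    return intercept


def run_chain(interceptors, event):
    """Apply interceptors in order; a None result means the event was dropped."""
    for intercept in interceptors:
        event = intercept(event)
        if event is None:
            return None
    return event


chain = [
    timestamp_interceptor,
    host_interceptor,
    static_interceptor("datacenter", "dc1"),
    regex_filter(r"ERROR"),
]
kept = run_chain(chain, {"headers": {}, "body": b"ERROR disk full"})
dropped = run_chain(chain, {"headers": {}, "body": b"INFO all good"})
print(sorted(kept["headers"]))
print(dropped)
```

Order matters: placing the filter first would avoid decorating events that are about to be dropped.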

- Agent Selector
There are two types of channel selector:

Replicating Channel Selector (default): sends every event from the source to all channels, like a broadcast
Multiplexing Channel Selector: chooses which channels each event goes to, based on a header value
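
The difference between the two selectors comes down to how the list of target channels is computed for each event. The following Python sketch uses hypothetical function names and a dict-based event; the header name areyouok and the OK/NO values are borrowed from the multiplexing configuration later in this post.

```python
def replicating_select(event, channels):
    """Replicating: every event goes to every configured channel."""
    return list(channels)


def multiplexing_select(event, header, mapping, default):
    """Multiplexing: look up one header value to decide the target channels."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


channels = ["c1", "c2"]
mapping = {"OK": ["c1"], "NO": ["c2"]}
event = {"headers": {"areyouok": "NO"}, "body": b"hello"}

print(replicating_select(event, channels))
print(multiplexing_select(event, "areyouok", mapping, default=["c1"]))
```

An event with a missing or unmapped header value falls back to the default channel list, which is why a default is normally configured.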

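- Installing Flume

Download and unpack the release

wget http://www.apache.org/dist/flume/stable/apache-flume-1.9.0-bin.tar.gz
tar -zxvf apache-flume-1.9.0-bin.tar.gz

Set the environment variables

vim ~/.bash_profile

Add the Flume environment variables

export FLUME_HOME=/app/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin

After saving the file, source it so that the changes take effect

source ~/.bash_profile

Configure JAVA_HOME: in Flume's conf directory, copy the template and set JAVA_HOME in the new file

cd $FLUME_HOME/conf
cp flume-env.sh.template flume-env.sh
vim flume-env.sh

Copy the Flume installation to each worker node and repeat the configuration above on each one

scp -r apache-flume-1.9.0-bin/ hongqiang@slaver1:/app/
scp -r apache-flume-1.9.0-bin/ hongqiang@slaver2:/app/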

- Flume in practice

netcat

vim flume_netcat.conf

# Name the components on this agent
# The agent is named a1, with source r1, sink k1, and channel c1.
# Comments must sit on their own lines: a properties file treats
# a trailing "#..." as part of the value.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# netcat source: data is fed in over the network
a1.sources.r1.type = netcat
# Listen on the master node
a1.sources.r1.bind = master
# Port to listen on
a1.sources.r1.port = 44444

# Describe the sink
# logger sink: events are written to the log
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Source r1 writes to channel c1; sink k1 reads from channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the command

flume-ng agent --conf conf --conf-file ./flume_netcat.conf --name a1 -Dflume.root.logger=INFO,console

Lines sent to port 44444 on master (for example with telnet or nc) then show up in the agent's console log.

exec

vim flume_exec.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /app/apache-flume-1.9.0-bin/test_data/1.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Run the command

flume-ng agent --conf conf --conf-file ./flume_exec.conf --name a1 -Dflume.root.logger=INFO,console

Lines appended to 1.log then show up in the agent's console log.

Writing sink output to HDFS

vim flume_hdfs_webpy.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
## exec source: Flume runs the given command and reads events from its output
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /app/apache-flume-1.9.0-bin/test_data/1.log

# Describe the sink
## hdfs sink; the sink type determines which parameters below apply
a1.sinks.k1.type = hdfs
## Where the HDFS sink writes its files. The path is not hard-coded: the escape
## sequences are expanded from a timestamp, so the directory name changes over time
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
## Use the local time when expanding the escape sequences
a1.sinks.k1.hdfs.useLocalTimeStamp = true
## File type to write: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
## Buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

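Run the command

flume-ng agent --conf conf --conf-file ./flume_hdfs_webpy.conf --name a1 -Dflume.root.logger=INFO,console

tail -F first emits the last 10 lines of 1.log and then every line appended afterwards; these events are written to the corresponding location in HDFS.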

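Failover (a form of high availability): sink output goes to a primary node and a standby node. When the primary node fails, traffic switches automatically to the standby.

Flume configuration file on the master node: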

vim flume-client.properties_failover

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /app/apache-flume-1.9.0-bin/test_data/1.log

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 52020

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 52020

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
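# The sink with the larger priority value is the primary, so slaver1 (k1) is the primary here
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
# Upper limit on the backoff period for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000

Flume configuration file on the slaver1 node: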

vim flume-server-failover.conf

# agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# other node, slave to master
a1.sources.r1.type = avro
# Listen on this node (slaver1)
a1.sources.r1.bind = slaver1
a1.sources.r1.port = 52020

# set sink to hdfs
a1.sinks.k1.type = logger
# a1.sinks.k1.type = hdfs
# a1.sinks.k1.hdfs.path=/flume_data_pool
# a1.sinks.k1.hdfs.fileType=DataStream
# a1.sinks.k1.hdfs.writeFormat=TEXT
# a1.sinks.k1.hdfs.rollInterval=1
# a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d

a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1

Flume configuration file on the slaver2 node:

vim flume-server-failover.conf

# agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1

#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# other node, slave to master
a1.sources.r1.type = avro
# Listen on this node (slaver2)
a1.sources.r1.bind = slaver2
a1.sources.r1.port = 52020

# set sink to hdfs
a1.sinks.k1.type = logger
# a1.sinks.k1.type = hdfs
# a1.sinks.k1.hdfs.path=/flume_data_pool
# a1.sinks.k1.hdfs.fileType=DataStream
# a1.sinks.k1.hdfs.writeFormat=TEXT
# a1.sinks.k1.hdfs.rollInterval=1
# a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d

a1.sources.r1.channels = c1
a1.sinks.k1.channel=c1

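Once the configuration is complete, start Flume on slaver1 and slaver2 first, then start Flume on the master node. When data is written to 1.log, the primary node slaver1 receives it. If Flume on slaver1 is stopped manually and more messages are sent, the standby node slaver2 receives them. Once Flume on slaver1 is restarted, slaver1 receives the data again.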
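
The switching behaviour can be modelled as "try sinks from highest to lowest priority". This Python sketch is a simplification written for illustration: deliver_with_failover and make_sink are hypothetical names, and the real failover processor additionally keeps failed sinks in a timed penalty box, which is omitted here.

```python
def deliver_with_failover(event, sinks, priorities):
    """Try sinks from highest to lowest priority; return the name that accepted."""
    ordered = sorted(sinks, key=lambda name: priorities[name], reverse=True)
    for name in ordered:
        try:
            sinks[name](event)
            return name
        except ConnectionError:
            continue  # this sink is down, fall through to the next one
    raise RuntimeError("all sinks failed")


def make_sink(log, up=True):
    """Build a fake sink that records events, or fails when it is 'down'."""
    def sink(event):
        if not up:
            raise ConnectionError("sink unreachable")
        log.append(event)
    return sink


priorities = {"k1": 10, "k2": 1}
slaver1_log, slaver2_log = [], []

# Both nodes up: the higher-priority sink k1 (slaver1) gets the event.
healthy = {"k1": make_sink(slaver1_log), "k2": make_sink(slaver2_log)}
print(deliver_with_failover("event-1", healthy, priorities))

# slaver1 stopped: the same call falls over to k2 (slaver2).
degraded = {"k1": make_sink(slaver1_log, up=False), "k2": make_sink(slaver2_log)}
print(deliver_with_failover("event-2", degraded, priorities))
```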

Agent Selector
– replicating (broadcast)
Configuration on the master node

vim flume_client_replicating.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
a1.sources.r1.host = master
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 50000

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 50000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

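For slaver1 and slaver2, reuse the server configuration from the failover example above, changing the Avro source port to 50000 to match the sinks here.

– multiplexing (routes each event to a node according to a header on the event)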

vim flume_client_multiplexing.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type= org.apache.flume.source.http.HTTPSource
a1.sources.r1.port= 50000
# The HTTP source takes its listen address from "bind"
a1.sources.r1.bind= master
a1.sources.r1.selector.type= multiplexing
a1.sources.r1.channels= c1 c2

a1.sources.r1.selector.header= areyouok
a1.sources.r1.selector.mapping.OK = c1
a1.sources.r1.selector.mapping.NO = c2
a1.sources.r1.selector.default= c1

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 50000

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 50000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

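For slaver1 and slaver2, reuse the server configuration from the failover example above, changing the Avro source port to 50000 to match the sinks here.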
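
To exercise the multiplexing setup, a test client has to send events that carry the areyouok header. The sketch below assumes the HTTP source is using its default JSON handler, which accepts a JSON array of objects with headers and body fields; build_payload and post_event are hypothetical helper names, and the URL simply reuses the master host and port 50000 from the configuration above.

```python
import json
import urllib.request


def build_payload(body, areyouok):
    """Encode one event as the JSON array shape expected by the JSON handler."""
    events = [{"headers": {"areyouok": areyouok}, "body": body}]
    return json.dumps(events).encode("utf-8")


def post_event(body, areyouok, url="http://master:50000"):
    """POST one event to the HTTP source and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(body, areyouok),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Inspect what would be sent without touching the network.
print(build_payload("message for slaver2", "NO").decode("utf-8"))
```

With the agents running, calling post_event("message for slaver1", "OK") should surface on slaver1 and post_event("message for slaver2", "NO") on slaver2; any other header value falls back to the default channel c1.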

For more examples, see the official Flume user guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
If you spot any problems, corrections are welcome in the comments!
