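Collecting Data with Flume Using exec and avro

Overview:
This post covers three things: using an exec source to collect data into HDFS, using avro to collect data, and combining exec and avro to collect data.

Official Flume documentation: http://flume.apache.org/FlumeUserGuide.html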

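1. Collecting data into HDFS with exec

Requirement: monitor a file and collect the content appended to it into HDFS
Agent selection: exec source + memory channel + hdfs sink

Create flume-exec-hdfs.conf with the following content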

# Name the components on this agent
exec-hdfs-agent.sources = exec-source
exec-hdfs-agent.sinks = hdfs-sink
exec-hdfs-agent.channels = memory-channel

# Describe/configure the source
exec-hdfs-agent.sources.exec-source.type = exec
exec-hdfs-agent.sources.exec-source.command = tail -F ~/data/data.log
exec-hdfs-agent.sources.exec-source.shell = /bin/bash -c

# Describe the sink
exec-hdfs-agent.sinks.hdfs-sink.type = hdfs
exec-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://Master:9000/data/flume/tail
exec-hdfs-agent.sinks.hdfs-sink.hdfs.fileType=DataStream
exec-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat=Text
exec-hdfs-agent.sinks.hdfs-sink.hdfs.batchSize=10

# Use a channel which buffers events in memory
exec-hdfs-agent.channels.memory-channel.type = memory

# Bind the source and sink to the channel
exec-hdfs-agent.sources.exec-source.channels = memory-channel
exec-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

Create the directory and the file

$ mkdir -p ~/data
$ touch ~/data/data.log

Start HDFS

[hadoop@Master ~]$ start-dfs.sh 
[hadoop@Master ~]$ jps
3728 NameNode
3920 SecondaryNameNode
4035 Jps

Create a directory in HDFS to store the data collected by Flume

$ hadoop fs -mkdir -p /data/flume/tail

At this point data.log is empty and /data/flume/tail in HDFS contains no files

Start the Agent

flume-ng agent \
--name exec-hdfs-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume-exec-hdfs.conf \
-Dflume.root.logger=INFO,console

Write data to data.log

[hadoop@Master data]$ echo aaa >> data.log 
[hadoop@Master data]$ echo bbb >> data.log 
[hadoop@Master data]$ echo ccc >> data.log 
[hadoop@Master data]$ echo eee >> data.log 
[hadoop@Master data]$ echo fff >> data.log 
[hadoop@Master data]$ cat data.log 
aaa
bbb
ccc
eee
fff

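Check whether a file has been created under /data/flume/tail in HDFS. A file named FlumeData.1508875804298 is generated; while it is still being written it carries a .tmp suffix, and FlumeData is the default file name prefix.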

[hadoop@Master ~]$ hadoop fs -ls /data/flume/tail
Found 1 items
-rw-r--r--   1 hadoop supergroup         20 2017-9-25 04:10 /data/flume/tail/FlumeData.1508875804298

Content of FlumeData.1508875804298

[hadoop@Master ~]$ hadoop fs -text /data/flume/tail/FlumeData.1508875804298
aaa
bbb
ccc
eee
fff

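We have now met the requirement of monitoring a file and collecting its appended content into HDFS, but there is a problem: too many small files. The NameNode keeps the metadata of every file and every block in memory, and each file occupies at least one block no matter how small it is, so a large number of small files puts heavy pressure on the NameNode. How can this be solved?

The hdfs sink offers three parameters for this, described in the official documentation as follows

hdfs.rollInterval: roll the file after the given number of seconds; 0 disables time-based rolling
hdfs.rollSize: roll the file once it reaches the given size in bytes. I am on Hadoop 2.x, where the default block size is 128 MB, so the value used here is 134217728
hdfs.rollCount: roll the file after the given number of events (with the exec source each line is one event); the value used here is 1000000
Note: if all three parameters are set, the file is rolled as soon as any one of them is reached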
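
The way the three thresholds interact can be sketched in a few lines of shell. This is a simplified model for illustration only, not Flume's implementation: the function name should_roll and its arguments are made up for this example, and the threshold values are the ones used in this post. A value of 0 disables the corresponding trigger.

```shell
#!/usr/bin/env bash
# Thresholds used in this post; 0 disables a trigger.
ROLL_INTERVAL=0                     # seconds
ROLL_SIZE=$((128 * 1024 * 1024))    # bytes, evaluates to 134217728
ROLL_COUNT=1000000                  # events

# should_roll ELAPSED_SECONDS BYTES_WRITTEN EVENTS_WRITTEN
# Prints the first trigger that fires, or "keep" if none does.
should_roll() {
  if [ "$ROLL_INTERVAL" -gt 0 ] && [ "$1" -ge "$ROLL_INTERVAL" ]; then
    echo "roll:interval"
  elif [ "$ROLL_SIZE" -gt 0 ] && [ "$2" -ge "$ROLL_SIZE" ]; then
    echo "roll:size"
  elif [ "$ROLL_COUNT" -gt 0 ] && [ "$3" -ge "$ROLL_COUNT" ]; then
    echo "roll:count"
  else
    echo "keep"
  fi
}

should_roll 3600 20 5             # time trigger disabled, file stays open
should_roll 10 134217728 900000   # size limit reached first
should_roll 10 5000 1000000       # event limit reached first
```

With rollInterval set to 0, a tiny file is never closed just because time has passed, which is what keeps small files from piling up.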

Add the following three settings to flume-exec-hdfs.conf and restart the Agent

exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 0
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 1000000

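Test: append the content of test.log to data.log three times (test.log has a little over 300,000 lines, so together with the 5 lines written earlier data.log ends up with 3 × 338030 + 5 = 1014095 lines)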

[hadoop@Master data]$ wc -l test.log 
338030 test.log
[hadoop@Master data]$ cat test.log >> data.log 
[hadoop@Master data]$ cat test.log >> data.log 
[hadoop@Master data]$ cat test.log >> data.log 
[hadoop@Master data]$ wc -l data.log 
1014095 data.log

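Check whether the data was collected into HDFS (a file is rolled as soon as either the 1,000,000-event limit or the 128 MB size limit is reached; below, the second file was closed just past 134217728 bytes)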

[hadoop@Master config]$ hadoop fs -ls /data/flume/tail
Found 3 items
-rw-r--r--   1 hadoop supergroup         20 2017-9-25 07:10 /data/flume/tail/FlumeData.1508875804298
-rw-r--r--   1 hadoop supergroup  134246595 2017-9-25 07:49 /data/flume/tail/FlumeData.1508877550270
-rw-r--r--   1 hadoop supergroup       3808 2017-9-25 07:49 /data/flume/tail/FlumeData.1508877550271.tmp

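2. Collecting data with avro

Requirement: use avro-client to transfer a file from one machine to another over avro
Agent selection: avro client => avro-source
Limitation: avro-client sends the file once only; it cannot deliver newly appended content in real time

About avro

Avro is a data serialization system that also provides RPC (Remote Procedure Call), a protocol for invoking a service on a remote machine. The RPC call flow, using the example of fetching a user by user id, is as follows

Description:
The client issues the request and serializes the parameters, which travel over the network to the server. The server deserializes them, invokes the service, serializes the result, and sends it back to the client, which deserializes it to obtain the result.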

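Transferring a file from one machine to another over avro requires the avro-client command; run flume-ng help to see it listed

The avro data transfer flow is as follows

Description: machine A sends the file to machine B with avro-client. In the Agent on machine B, the Source receives the data, the Channel buffers it, and the Sink finally writes it out on machine B

Implementation

Write flume-avroclient.conf (machine B)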

# Name the components on this agent
avroclient-agent.sources = avro-source
avroclient-agent.sinks = logger-sink
avroclient-agent.channels = memory-channel

# Describe/configure the source
avroclient-agent.sources.avro-source.type = avro
avroclient-agent.sources.avro-source.bind = Master
avroclient-agent.sources.avro-source.port = 44444

# Describe the sink
avroclient-agent.sinks.logger-sink.type = logger

# Use a channel which buffers events in memory
avroclient-agent.channels.memory-channel.type = memory

# Bind the source and sink to the channel
avroclient-agent.sources.avro-source.channels = memory-channel
avroclient-agent.sinks.logger-sink.channel = memory-channel

Start the Agent (machine B)

flume-ng agent \
--name avroclient-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume-avroclient.conf \
-Dflume.root.logger=INFO,console

On machine A, create a file named avro-access.log and add a few lines

[hadoop@Master data]$ touch avro-access.log
[hadoop@Master data]$ echo hello >> avro-access.log 
[hadoop@Master data]$ echo world >> avro-access.log 
[hadoop@Master data]$ echo hello flume >> avro-access.log 
[hadoop@Master data]$ cat avro-access.log 
hello
world
hello flume

Use the avro-client command to send the file to machine B. If you are unsure how to use avro-client, the following command prints its help

[hadoop@Master ~]$ flume-ng avro-client --help
usage: flume-ng avro-client [--dirname <arg>] [-F <arg>] [-H <arg>] [-h]
       [-P <arg>] [-p <arg>] [-R <arg>]
    --dirname <arg>      directory to stream to avro source
 -F,--filename <arg>     file to stream to avro source
 -H,--host <arg>         hostname of the avro source
 -h,--help               display help text
 -P,--rpcProps <arg>     RPC client properties file with server connection
                         params
 -p,--port <arg>         port of the avro source
 -R,--headerFile <arg>   file containing headers as key/value pairs on
                         each new line
The --dirname option assumes that a spooling directory exists where
immutable log files are dropped.

On machine A, run the following command to send avro-access.log to machine B

flume-ng avro-client \
-H Master -p 44444 \
--conf $FLUME_HOME/conf \
-F ~/data/avro-access.log

The console output on machine B shows the content of avro-access.log from machine A

Event: { headers:{} body: 68 65 6C 6C 6F                                  hello }
Event: { headers:{} body: 77 6F 72 6C 64                                  world }
Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65                hello flume }
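
The logger sink prints each event body as hex bytes followed by the readable text. As a small sketch for checking such a dump by hand, the helper below turns a string of space-separated hex bytes back into text. The name decode_body is made up for this example and is not part of Flume.

```shell
#!/usr/bin/env bash
# decode_body "HH HH ...": print the text encoded by space-separated hex bytes.
decode_body() {
  for byte in $1; do
    # Convert the hex byte to a 3-digit octal escape, which printf understands.
    printf "\\$(printf '%03o' "0x$byte")"
  done
  printf '\n'
}

decode_body "68 65 6C 6C 6F"
decode_body "68 65 6C 6C 6F 20 66 6C 75 6D 65"
```

The two calls use the bodies of the first and third events above and print hello and hello flume.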

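3. Combining exec and avro to collect data

Requirement

Requirement: stream the logs on server A to server B in real time
Agent selection:
Machine A: exec source => memory channel => avro sink
Machine B: avro source => memory channel => logger sink

This requirement takes two Agents chained together; the official documentation includes a diagram of Agents chained in this way

Configure the Agents on machine A and machine B

Note: for machine A and machine B to exchange data, the hostname and port the sender targets must match the hostname and port the listener binds to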

Agent configuration for machine A (Master)

# Name the components on this agent
flume-avro-sink-agent.sources = exec-source
flume-avro-sink-agent.sinks = avro-sink
flume-avro-sink-agent.channels = memory-channel

# Describe/configure the source
flume-avro-sink-agent.sources.exec-source.type = exec
flume-avro-sink-agent.sources.exec-source.command = tail -F ~/data/data.log 
flume-avro-sink-agent.sources.exec-source.shell = /bin/bash -c

# Describe the sink
flume-avro-sink-agent.sinks.avro-sink.type = avro
flume-avro-sink-agent.sinks.avro-sink.hostname = dn1
flume-avro-sink-agent.sinks.avro-sink.port = 44444

# Use a channel which buffers events in memory
flume-avro-sink-agent.channels.memory-channel.type = memory

# Bind the source and sink to the channel
flume-avro-sink-agent.sources.exec-source.channels = memory-channel
flume-avro-sink-agent.sinks.avro-sink.channel = memory-channel

Agent configuration for machine B (dn1)

# Name the components on this agent
flume-avro-source-agent.sources = avro-source
flume-avro-source-agent.sinks = logger-sink
flume-avro-source-agent.channels = memory-channel

# Describe/configure the source
flume-avro-source-agent.sources.avro-source.type = avro
flume-avro-source-agent.sources.avro-source.bind = dn1
flume-avro-source-agent.sources.avro-source.port = 44444

# Describe the sink
flume-avro-source-agent.sinks.logger-sink.type = logger

# Use a channel which buffers events in memory
flume-avro-source-agent.channels.memory-channel.type = memory

# Bind the source and sink to the channel
flume-avro-source-agent.sources.avro-source.channels = memory-channel
flume-avro-source-agent.sinks.logger-sink.channel = memory-channel
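
The hostname and port rule can be checked before starting anything. The sketch below reads the avro sink's hostname and port from one properties file and the avro source's bind and port from another, then compares them as plain strings. The helpers get_prop and endpoints_match and the temporary file names are made up for this example; a plain string comparison is also a simplification, since a source bound to an IP address or to 0.0.0.0 would need a different check.

```shell
#!/usr/bin/env bash
# get_prop CONF_FILE KEY: print the trimmed value of the first "KEY = value" line.
get_prop() {
  awk -v key="$2" '
    {
      eq = index($0, "=")
      if (eq == 0) next                 # skip comments and blank lines
      k = substr($0, 1, eq - 1)
      v = substr($0, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      if (k == key) { print v; exit }
    }
  ' "$1"
}

# endpoints_match SINK_CONF SINK_PREFIX SOURCE_CONF SOURCE_PREFIX
# Succeeds when sink hostname/port equal source bind/port and are not empty.
endpoints_match() {
  sink_host=$(get_prop "$1" "$2.hostname"); sink_port=$(get_prop "$1" "$2.port")
  src_host=$(get_prop "$3" "$4.bind");      src_port=$(get_prop "$3" "$4.port")
  [ -n "$sink_host" ] && [ -n "$sink_port" ] &&
    [ "$sink_host" = "$src_host" ] && [ "$sink_port" = "$src_port" ]
}

# Demo with the relevant lines from the two configurations above.
dir=$(mktemp -d)
cat > "$dir/a.conf" <<'EOF'
flume-avro-sink-agent.sinks.avro-sink.type = avro
flume-avro-sink-agent.sinks.avro-sink.hostname = dn1
flume-avro-sink-agent.sinks.avro-sink.port = 44444
EOF
cat > "$dir/b.conf" <<'EOF'
flume-avro-source-agent.sources.avro-source.type = avro
flume-avro-source-agent.sources.avro-source.bind = dn1
flume-avro-source-agent.sources.avro-source.port = 44444
EOF

if endpoints_match "$dir/a.conf" flume-avro-sink-agent.sinks.avro-sink \
                   "$dir/b.conf" flume-avro-source-agent.sources.avro-source; then
  echo "endpoints match"
else
  echo "endpoints differ"
fi
rm -rf "$dir"
```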

Start the Agents

Start the Agent on machine B first

flume-ng agent \
--name flume-avro-source-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume-avro-source-agent.conf \
-Dflume.root.logger=INFO,console

Then start the Agent on machine A

flume-ng agent \
--name flume-avro-sink-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/config/flume-avro-sink-agent.conf \
-Dflume.root.logger=INFO,console

I ran into a problem when starting the Agent on machine A: startup hung at the following point

Error message:
Post-validation flume configuration contains configuration for agents: [flume-avro-agent]
No configuration found for this host:flume-avro-sink-agent

Cause: the Agent name was wrong
Fix: the Agent name passed to flume-ng via --name must match the Agent name used in the configuration file
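
This mismatch is easy to catch before launching. The sketch below lists the Agent names a properties file actually declares and checks whether the name intended for --name is among them. The helpers list_agents and has_agent are made up for this example, and the temporary file reproduces the situation from the error message above, where the file defined flume-avro-agent.

```shell
#!/usr/bin/env bash
# list_agents CONF_FILE: print the distinct Agent names declared in the file,
# taken from the "<agent>.sources|sinks|channels = ..." lines.
list_agents() {
  grep -E '^[A-Za-z0-9_-]+\.(sources|sinks|channels)[[:space:]]*=' "$1" \
    | cut -d. -f1 | sort -u
}

# has_agent CONF_FILE AGENT_NAME: succeed if the file declares that Agent.
has_agent() {
  list_agents "$1" | grep -qxF "$2"
}

# Demo: the file declares flume-avro-agent, but we plan to pass another name.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Name the components on this agent
flume-avro-agent.sources = exec-source
flume-avro-agent.sinks = avro-sink
flume-avro-agent.channels = memory-channel
flume-avro-agent.sources.exec-source.channels = memory-channel
EOF

name=flume-avro-sink-agent
if has_agent "$conf" "$name"; then
  echo "ok: $name is defined"
else
  echo "mismatch: $name not found, file defines: $(list_agents "$conf" | tr '\n' ' ')"
fi
rm -f "$conf"
```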

Test

Clear data.log, write new data to it, and watch the log output on machine B's console (this data could also be collected into HDFS)

[hadoop@Master data]$ echo "" > data.log 
[hadoop@Master data]$ echo hello flume >> data.log
[hadoop@Master data]$ echo hello spark >> data.log

The Agent console on machine B shows the following

Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C 75 6D 65                hello flume }
Event: { headers:{} body: 68 65 6C 6C 6F 20 73 70 61 72 6B                hello spark }
