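1 Flume Overview

1.1 What Is Flume

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It supports custom data senders for collecting data, and it can apply simple processing to the data before writing it to a variety of data receivers. Flume is built on a streaming architecture and is flexible and simple.

Flume's architecture changed substantially between 0.9.x and 1.x. The 0.9.x line is known as Flume OG. The 1.x line, after its core components, core configuration, and code architecture were refactored, was renamed Flume NG and moved to Apache, where it is now a top-level project.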

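1.2 Flume Architecture

[Figure: Flume architecture]

Components in Flume:

1. Agent

An Agent is a JVM process that moves data from source to destination in the form of events. It is the basic unit that carries out data transfer in Flume. Once started, the Agent's process name is Application.

An Agent consists of three main parts: Source, Channel, and Sink.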

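2. Source

A Source is the component that receives data into the Flume Agent. Sources can handle log data of many types and formats, including avro, thrift, exec (execute), jms, spooling directory, netcat (reads data from a port), sequence generator, syslog, http, and legacy.

3. Channel

A Channel is a buffer that sits between the Source and the Sink, which lets the Source and the Sink run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Two of the Channels that ship with Flume are used most often: Memory Channel and File Channel.

Memory Channel is an in-memory queue. It is suitable when data loss is acceptable. If data loss matters, Memory Channel should not be used, because a process crash, a machine failure, or a restart will lose the buffered data.

File Channel writes every event to disk, so data is not lost when the process shuts down or the machine goes down.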

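4. Sink

A Sink continuously polls the Channel for events, removes them in batches, and either writes each batch to a storage system or sends it to another Flume Agent.

A Sink is fully transactional. Before removing a batch of data from the Channel, each Sink starts a transaction with the Channel. Once the batch has been written successfully to the storage system or to the next Flume Agent, the Sink commits the transaction through the Channel. After the commit, the Channel deletes those events from its internal buffer.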
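The take, write, commit cycle described above can be pictured with a small Python model. This is only a sketch under stated assumptions: MemoryChannelSketch, take_batch, and drain_once are hypothetical names invented for illustration and are not Flume's API; the real Channel and Sink are Java classes with their own transaction objects.

```python
from collections import deque


class MemoryChannelSketch:
    """Toy in-memory channel (hypothetical): a bounded queue with take/commit/rollback."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._queue = deque()
        self._in_flight = []  # events taken by the currently open transaction

    def put(self, event: bytes) -> None:
        # A source write is rejected once the buffer is full.
        if len(self._queue) + len(self._in_flight) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take_batch(self, batch_size: int) -> list:
        # Start of a transaction: events leave the queue but are not deleted yet.
        while self._queue and len(self._in_flight) < batch_size:
            self._in_flight.append(self._queue.popleft())
        return list(self._in_flight)

    def commit(self) -> None:
        # The sink wrote the batch successfully, so the events are dropped for good.
        self._in_flight.clear()

    def rollback(self) -> None:
        # The write failed: put the batch back at the head in its original order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self) -> int:
        return len(self._queue) + len(self._in_flight)


def drain_once(channel, write, batch_size: int = 100) -> bool:
    """One sink cycle: take a batch, try to write it, then commit or roll back."""
    batch = channel.take_batch(batch_size)
    try:
        write(batch)
    except Exception:
        channel.rollback()
        return False
    channel.commit()
    return True
```

If the write callback raises, the batch stays in the channel and can be retried on the next cycle, which is the property that lets a transactional Sink avoid losing events when the downstream system is temporarily unavailable.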

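Sink destinations include hdfs, logger, avro (passes data to the next Flume), thrift, ipc, file, null, HBase, solr, and custom sinks.

5. Event

An Event is the transport unit, the basic unit of data that Flume transfers. Data travels from source to destination in the form of events.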
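A Flume event carries a byte-array body plus a map of string headers. The Python model below is a sketch for illustration only: EventSketch and line_to_event are hypothetical names, and the "file" header key is used here merely as an example of the kind of metadata a source may attach.

```python
from dataclasses import dataclass, field


@dataclass
class EventSketch:
    """Toy stand-in for a Flume event: string headers plus a byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)


def line_to_event(line: str, source_file: str = "") -> EventSketch:
    """Wrap one log line as an event, optionally recording where it came from."""
    # The "file" key is an illustrative choice, not a required header.
    headers = {"file": source_file} if source_file else {}
    return EventSketch(body=line.rstrip("\n").encode("utf-8"), headers=headers)
```

For example, line_to_event("hello flume\n", source_file="/tmp/app.log") yields an event whose body is the bytes of "hello flume" and whose headers record the originating path.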

2 Installing Flume

2.1 Download

1) Flume website: http://flume.apache.org/

2) Documentation: http://flume.apache.org/FlumeUserGuide.html

3) Downloads: http://archive.apache.org/dist/flume/

2.2 Installation and Deployment

1) Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory on the Linux host

2) Extract apache-flume-1.7.0-bin.tar.gz into the /opt/module/ directory

[root@hadoop102 software]$ tar -zxvf apache-flume-1.7.0-bin.tar.gz -C /opt/module/

3) Rename apache-flume-1.7.0-bin to flume

[root@hadoop102 module]$ mv apache-flume-1.7.0-bin flume

4) Rename flume-env.sh.template under flume/conf to flume-env.sh, then set JAVA_HOME in flume-env.sh

[root@hadoop102 conf]$ mv flume-env.sh.template flume-env.sh

[root@hadoop102 conf]$ vim flume-env.sh

export JAVA_HOME=/opt/module/jdk1.8.0_144

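3 Examples

3.1 Reading Files from a Directory into HDFS in Real Time

Requirement: use Flume to watch all the files in a directory

1. Create the configuration file flume-dir-hdfs.conf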

[root@hadoop101 job]$ pwd

/opt/module/flume/conf/job

[root@hadoop101 job]$ vim flume-dir-hdfs.conf

a3.sources = r3

a3.sinks = k3

a3.channels = c3

# Describe/configure the source

a3.sources.r3.type = spooldir

a3.sources.r3.spoolDir = /opt/module/flume/upload

a3.sources.r3.fileSuffix = .COMPLETED

a3.sources.r3.fileHeader = true

# Ignore all files ending in .tmp; do not upload them

a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink

a3.sinks.k3.type = hdfs

a3.sinks.k3.hdfs.path = hdfs://hadoop101:9000/flume/upload/%Y%m%d/%H

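# Prefix for uploaded files

a3.sinks.k3.hdfs.filePrefix = upload-

# Whether to roll folders by time

a3.sinks.k3.hdfs.round = true

# Number of time units per new folder

a3.sinks.k3.hdfs.roundValue = 1

# Time unit used for rounding

a3.sinks.k3.hdfs.roundUnit = hour

# Whether to use the local timestamp

a3.sinks.k3.hdfs.useLocalTimeStamp = true

# Number of Events to accumulate before flushing to HDFS

a3.sinks.k3.hdfs.batchSize = 100

# File type; compression is supported

a3.sinks.k3.hdfs.fileType = DataStream

# How often to roll to a new file, in seconds

a3.sinks.k3.hdfs.rollInterval = 30

# Roll size for each file, roughly 128 MB

a3.sinks.k3.hdfs.rollSize = 134217700

# Do not roll files based on Event count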

a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory

a3.channels.c3.type = memory

a3.channels.c3.capacity = 1000

a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel

a3.sources.r3.channels = c3

a3.sinks.k3.channel = c3
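To see which file names the settings above would leave for ingestion, the following Python sketch applies the same regular expression and suffix. It assumes the ignore pattern has to match the whole file name, and is_candidate is a hypothetical helper, not part of Flume; any other filtering the real source may perform is not modeled.

```python
import re

# Same regex as a3.sources.r3.ignorePattern in the configuration above.
IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")
# Same value as a3.sources.r3.fileSuffix in the configuration above.
COMPLETED_SUFFIX = ".COMPLETED"


def is_candidate(file_name: str) -> bool:
    """Return True if the file would still be picked up under the assumptions above."""
    if file_name.endswith(COMPLETED_SUFFIX):
        return False  # already ingested and renamed
    if IGNORE_PATTERN.fullmatch(file_name):
        return False  # the whole name matches the ignore pattern
    return True


names = ["app.txt", "app.tmp", "app.log", "app.log.COMPLETED", "my file.tmp"]
print([n for n in names if is_candidate(n)])
```

Because the character class [^ ] excludes spaces, a name such as "my file.tmp" is not matched by this particular pattern under the whole-name assumption, which is worth keeping in mind if temporary files may contain spaces.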

2. Start the agent that watches the folder

[root@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/job/flume-dir-hdfs.conf

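Notes on using the Spooling Directory Source:

Do not create a file in the monitored directory and then keep modifying it
Files that have been uploaded are renamed with a .COMPLETED suffix
The monitored folder is scanned for file changes every 500 milliseconds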

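3. Add files to the upload folder

Create the upload folder under /opt/module/flume (the Spooling Directory Source expects this directory to exist when the agent starts, so create it before step 2 if it is missing)

[root@hadoop102 flume]$ mkdir upload

Add files to the upload folder

[root@hadoop102 upload]$ touch bigdata.txt

[root@hadoop102 upload]$ touch bigdata.tmp

[root@hadoop102 upload]$ touch bigdata.log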

4. Check the data on HDFS

5. Wait 1 s, then list the upload folder again

[root@hadoop102 upload]$ ll

total 0

-rw-rw-r--. 1 hadoop hadoop 0 May 20 22:31 bigdata.log.COMPLETED

-rw-rw-r--. 1 hadoop hadoop 0 May 20 22:31 bigdata.tmp

-rw-rw-r--. 1 hadoop hadoop 0 May 20 22:31 bigdata.txt.COMPLETED

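3.2 Reading New Content Appended to a Local File into HDFS in Real Time (Commonly Used)

Requirement: monitor the Hive log in real time and upload it to HDFS

1. Create the flume-file-hdfs.conf file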

[root@hadoop101 job]$ vim flume-file-hdfs.conf

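Note: to read a file on a Linux system, Flume has to run a command that follows Linux command rules. Because the Hive log lives on the Linux file system, the source type is exec (short for execute), meaning that a Linux command is run to read the file.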

# Name the components on this agent

a2.sources = r2

a2.sinks = k2

a2.channels = c2

# Describe/configure the source

a2.sources.r2.type = exec

a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log

a2.sources.r2.shell = /bin/bash -c

# Describe the sink

a2.sinks.k2.type = hdfs

a2.sinks.k2.hdfs.path = hdfs://hadoop101:9000/flume/%Y%m%d/%H

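# Prefix for uploaded files

a2.sinks.k2.hdfs.filePrefix = logs-

# Whether to roll folders by time

a2.sinks.k2.hdfs.round = true

# Number of time units per new folder

a2.sinks.k2.hdfs.roundValue = 1

# Time unit used for rounding

a2.sinks.k2.hdfs.roundUnit = hour

# Whether to use the local timestamp

a2.sinks.k2.hdfs.useLocalTimeStamp = true

# Number of Events to accumulate before flushing to HDFS

a2.sinks.k2.hdfs.batchSize = 1000

# File type; compression is supported

a2.sinks.k2.hdfs.fileType = DataStream

# How often to roll to a new file, in seconds

a2.sinks.k2.hdfs.rollInterval = 600

# Roll size for each file

a2.sinks.k2.hdfs.rollSize = 134217700

# Do not roll files based on Event count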

a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory

a2.channels.c2.type = memory

a2.channels.c2.capacity = 1000

a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel

a2.sources.r2.channels = c2

a2.sinks.k2.channel = c2
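The time escapes in hdfs.path, together with round, roundValue, and roundUnit, decide which folder an event lands in. The Python sketch below uses strftime as a stand-in and assumes only that %Y, %m, %d, and %H behave the same way in both; bucket_path is a hypothetical helper, and the timestamp is made up for the example.

```python
from datetime import datetime


def bucket_path(template: str, ts: datetime, round_hours: int = 1) -> str:
    """Round ts down to a multiple of round_hours, then expand %Y %m %d %H."""
    rounded = ts.replace(hour=ts.hour - ts.hour % round_hours,
                         minute=0, second=0, microsecond=0)
    return rounded.strftime(template)


# Illustrative timestamp; with useLocalTimeStamp = true the agent's clock is used.
print(bucket_path("hdfs://hadoop101:9000/flume/%Y%m%d/%H",
                  datetime(2019, 5, 20, 22, 31, 7)))
```

With roundValue = 1 and roundUnit = hour, every event received during the same clock hour resolves to the same folder, so a new folder appears once per hour.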

2. Start the agent with this configuration

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/job/flume-file-hdfs.conf

3. Start Hadoop and Hive, then run Hive operations to generate log output

[root@hadoop101 hadoop-2.7.2]$ sbin/start-dfs.sh

[root@hadoop101 hadoop-2.7.2]$ sbin/start-yarn.sh

[root@hadoop101 hive]$ bin/hive

hive (default)>

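4. Check the files on HDFS

3.3 Single Source, Multiple Outputs

Requirement: flume-1 monitors a file for changes and passes the changes to flume-2, which stores them in HDFS. At the same time, flume-1 passes the changes to flume-3, which writes them to the local filesystem.

1. Preparation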

Create a group1 folder under the job directory and change into it

[root@hadoop101 job]$ mkdir group1

[root@hadoop101 job]$ cd group1/

Create a flume3 folder under /opt/module/datas/

[root@hadoop101 datas]$ mkdir flume3

2. Create flume-file-flume.conf

[root@hadoop101 group1]$ vim flume-file-flume.conf

Configure one source that receives the log file, plus two channels and two sinks that feed flume-flume-hdfs and flume-flume-dir respectively.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to multiple channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/apache-hive-1.2.1-bin/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop101
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop101
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

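Note: Avro is a language-neutral data serialization and RPC framework created by Doug Cutting, the founder of Hadoop.

Note: RPC (Remote Procedure Call) is a protocol for requesting a service from a program on a remote computer over the network without needing to understand the underlying network technology.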
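The replicating selector configured for a1 in flume-file-flume.conf can be pictured as follows: every event the source receives is written to each of its channels, so each downstream sink sees a full copy. This Python model is a sketch only; replicate is a hypothetical function, and plain lists stand in for the channels c1 and c2.

```python
def replicate(event, channels):
    """Append the same event to every channel and return how many received it."""
    for channel in channels:
        channel.append(event)
    return len(channels)


# Plain lists stand in for the two memory channels of agent a1.
c1, c2 = [], []
replicate(b"a new line from hive.log", [c1, c2])
print(len(c1), len(c2))
```

Since both channels hold their own copy, the sink on c1 (sending to port 4141) and the sink on c2 (sending to port 4142) drain independently of each other.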

3. Create flume-flume-hdfs.conf

[root@hadoop101 group1]$ vim flume-flume-hdfs.conf

Configure a source that receives the output of the upstream Flume and a sink that writes to HDFS.

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop101
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop101:9000/flume2/%Y%m%d/%H
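# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders by time
a2.sinks.k1.hdfs.round = true
# Number of time units per new folder
a2.sinks.k1.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size for each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Do not roll files based on Event count
a2.sinks.k1.hdfs.rollCount = 0
# Minimum number of block replicas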
a2.sinks.k1.hdfs.minBlockReplicas = 1

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
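The three roll settings in the HDFS sink work as independent triggers, and a value of 0 switches a trigger off. The Python sketch below illustrates that reading using the values from flume-flume-hdfs.conf as defaults; should_roll is a hypothetical helper, and the exact boundary comparison (>= rather than >) is an assumption.

```python
def should_roll(age_seconds: float, bytes_written: int, events_written: int,
                roll_interval: int = 600, roll_size: int = 134217700,
                roll_count: int = 0) -> bool:
    """Return True when any enabled roll trigger has been reached; 0 disables a trigger."""
    if roll_interval > 0 and age_seconds >= roll_interval:
        return True  # the file has been open long enough
    if roll_size > 0 and bytes_written >= roll_size:
        return True  # the file has grown large enough
    if roll_count > 0 and events_written >= roll_count:
        return True  # enough events were written
    return False


# 128 MiB is 134217728 bytes; the configured rollSize sits 28 bytes below that.
print(128 * 1024 * 1024 - 134217700)
```

With rollCount = 0, the number of events never triggers a roll, so a file is closed either after 600 seconds or once it approaches 128 MB, whichever comes first.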

4. Create flume-flume-dir.conf

[root@hadoop101 group1]$ vim flume-flume-dir.conf

Configure a source that receives the output of the upstream Flume and a sink that writes to a local directory.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop101
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/datas/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Tip: the local output directory must already exist. If it does not, Flume will not create it.

5. Run the configuration files

Start the agents in this order: flume-flume-dir, flume-flume-hdfs, flume-file-flume.

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/job/group1/flume-flume-dir.conf

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/job/group1/flume-flume-hdfs.conf

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/job/group1/flume-file-flume.conf

6. Start Hadoop and Hive

[root@hadoop101 hadoop-2.7.2]$ sbin/start-dfs.sh

[root@hadoop101 hadoop-2.7.2]$ sbin/start-yarn.sh

[root@hadoop101 hive]$ bin/hive

hive (default)>

7. Check the data on HDFS

8. Check the data in the /opt/module/datas/flume3 directory

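3.4 Aggregating Multiple Sources

Requirement:

flume-1 on hadoop101 monitors the file hive.log,

flume-2 on hadoop101 monitors the data stream on a port,

flume-1 and flume-2 send their data to flume-3 on hadoop101, and flume-3 prints the final data to the console

1. Preparation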

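Create a group2 folder under /opt/module/flume/conf/job

2. Create flume1.conf

Create the configuration file on hadoop101 and open it

[root@hadoop101 group2]$ vim flume1.conf

Configure a source that monitors the hive.log file and a sink that forwards the data to the next-level Flume.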

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/apache-hive-1.2.1-bin/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop101
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Create flume2.conf

[root@hadoop101 group2]$ vim flume2.conf

Configure a source that monitors the data stream on port 44444 and a sink that forwards the data to the next-level Flume:

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop101
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop101
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

4. Create flume3.conf

[root@hadoop101 group2]$ vim flume3.conf

Configure a source that receives the data streams sent by flume1 and flume2, and a sink that prints the merged stream to the console.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop101
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

5. Run the configuration files

Start the agents in this order: flume3.conf, flume2.conf, flume1.conf.

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/job/group2/flume3.conf -Dflume.root.logger=INFO,console

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/job/group2/flume2.conf

[root@hadoop101 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/job/group2/flume1.conf

6. On hadoop101, append content to the monitored hive.log

[root@hadoop101 module]$ echo 'hello' >> /opt/module/apache-hive-1.2.1-bin/logs/hive.log

7. From hadoop103, send data to port 44444 on hadoop101

[root@hadoop103 flume]$ telnet hadoop101 44444

8. Check the data

