Table of Contents
(1) Download flume-ng-1.6.0-cdh5.7.0.tar.gz
(4) Configure the Java JDK path in flume-env.sh
(5) Write a startup script for flume (recommended for production)
(7) Install the jq tool, which makes JSON output easy to read
(3) Write a startup script for the agent (recommended for production)
Hands-on 03: Stream a local file to HDFS in real time (the flume node needs a Hadoop cluster environment)
Hands-on 04: Collect data from server A to server B and upload it to HDFS (server B needs a Hadoop cluster environment)
Hands-on 05: Aggregate data from multiple flume agents into a single flume (the aggregation node needs a Hadoop cluster environment)
Before anything else, the application scenario:
In big data systems, flume plays the role of data collector; once data is collected, a computation framework takes over and processes it. Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It supports pluggable data senders for collecting logs, can perform simple processing on the data in flight, and can write to a variety of (customizable) data receivers.
Flume is a very handy log-collection tool in the Hadoop ecosystem, but when you research how to install it you will find conflicting advice: some say installation is trivial (just unpack and run), others say it is complex and depends on ZooKeeper, so ZooKeeper must be installed first. Why the contradiction? Both camps are right; they are simply talking about different versions of Flume.
Background: Flume, the distributed log-collection system developed by Cloudera, is one of the components in the Hadoop ecosystem. It can collect logs scattered across different nodes and machines into hdfs in real time. Flume's initial releases are collectively known as Flume OG (original generation) and belonged to Cloudera. As Flume's feature set grew, Flume OG's drawbacks became apparent: a bloated code base, poorly designed core components, and non-standard core configuration. In OG's final release, 0.94.0, unstable log transport was an especially serious problem (this can be seen in the troubleshooting section of the BigInsights product documentation). To address these issues, on October 22, 2011 Cloudera completed Flume-728, a milestone overhaul of Flume that refactored the core components, core configuration, and code architecture; the refactored versions are collectively called Flume NG (next generation). Another reason for the change was Flume's move under the Apache umbrella: Cloudera Flume was renamed Apache Flume.
FLUME OG has three node roles: agent, collector, and master.
FLUME NG has only one node role: the agent.
Flume OG vs Flume NG:
- In the OG versions, stable operation of Flume depends on ZooKeeper. ZooKeeper is needed to manage the work of OG's multiple node types (agent, collector, master), especially when several masters are configured in a cluster. OG can also manage node configuration purely in memory, but then configuration is lost whenever a machine fails. So stable use of OG really does depend on ZooKeeper.
- In the NG versions, the number of node roles shrinks from 3 to 1. With only one role there is nothing for ZooKeeper to coordinate, so the ZooKeeper dependency disappears. Because OG's ZooKeeper dependency runs through its entire configuration and usage, OG users had to know how to build and operate a ZooKeeper cluster.
- Installing OG: set $JAVA_HOME in flume-env.sh and provide a flume-conf.xml configuration file whose most important, mandatory settings concern the master. Every Flume node in the cluster must configure master-related properties (such as flume.master.servers, flume.master.store, flume.master.serverid). For stable operation you also need a ZooKeeper cluster, which requires fairly deep ZooKeeper knowledge, plus the related flume-conf.xml properties (flume.master.zk.use.external, flume.master.zk.servers). Before transporting data with OG, both the master and the agents must be started.
- Installing NG: just set $JAVA_HOME in flume-env.sh.
For these reasons, Flume NG is what we normally use.
I. Flume Architecture and Core Components
1. The Event concept
Flume's core job is to collect data from a data source (source) and deliver it to a specified destination (sink). To guarantee delivery, the data is buffered (in a channel) before it is sent to the sink, and only after the data has truly reached the sink does flume delete its buffered copy. What flows through this whole pipeline is the event; that is, transactions are guaranteed at the event level.
So what is an event?
An event wraps the data being transported and is flume's basic unit of data transfer; for a text file it is usually one line of a record. The event is also the basic unit of a transaction. An event flows from source to channel to sink; it is itself a byte array and can carry headers. An event is the smallest complete unit of data, arriving from an external source and departing to an external destination.
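The logger sink used in Hands-on 01 below prints each event body as raw hex, which makes the "an event is just a byte array" point concrete. A minimal sketch you can run locally (od stands in for the sink here; the payload string is arbitrary):

```shell
# An event body is nothing more than the bytes of the record (one line of
# text, for a text source). od reproduces the hex view the logger sink prints:
printf 'Python' | od -An -tx1 | tr -d ' \n'
# → 507974686f6e, i.e. the "50 79 74 68 6f 6e" byte dump seen in the sink logs
```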
2. Flume architecture
What makes flume work is one central design decision, the agent: a Java process that runs on a log-collection node (that is, a server node).
An agent contains 3 core components: source➡️channel➡️sink, an architecture similar to producer, warehouse, consumer.
- source: the component that actually collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
- channel: once the source has collected data, it is staged temporarily in the channel; the channel is the agent's temporary data store (a simple buffer for collected data, which can live in memory, jdbc, file, and so on).
- sink: the component that sends data on to its destination, which can be hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, or a custom sink.
3. How Flume runs
The core of flume is the agent, which interacts with the outside world in two places: data input (the source) and data output (the sink); the sink is responsible for delivering data to the designated external destination. After the source receives data, it passes it to the channel, which acts as a buffer and stores it temporarily; the sink then delivers the channel's data to the target (HDFS, for example).
Note: only after the sink has successfully delivered the data does the channel delete its temporary copy; this mechanism guarantees the reliability and safety of the transfer.
4. Flume in the large
Flume supports multi-level agents: agents can be chained one after another, e.g. a sink can write its data into the source of the next agent, so agents can be strung together and the chain processed as a whole. Flume also supports fan-in and fan-out: fan-in means a source can accept multiple inputs; fan-out means a sink can write data out to multiple destinations.
Note that flume ships with a large number of built-in source, channel, and sink types, and different types can be combined freely; the combination is driven by the user's configuration file, which makes it very flexible.
For example: a channel can stage events in memory or persist them to local disk, and a sink can write logs to HDFS, HBase, or even into another source, and so on.
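The in-memory vs on-disk choice is a one-line change in the agent configuration. A hedged sketch, reusing the a1/c1 names adopted later in this article (the checkpoint and data paths are made-up examples):

```properties
# Durable alternative to the memory channel: events survive an agent restart,
# at the cost of disk I/O.
a1.channels.c1.type = file
# where the channel keeps its checkpoint and data files (example paths)
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/filechannel-data
```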
Flume lets users build multi-level flows: multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes.
II. Setting Up a Flume Environment
1. Prerequisites
- flume requires Java 1.7 or later (1.8 recommended)
- sufficient memory and disk space
- read/write permission on the directories the agent monitors
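A quick way to check the Java prerequisite is to parse the `java -version` banner. A small sketch (the `java_major` helper name is ours; it takes the banner line as an argument so it can also be exercised offline):

```shell
# Extract the major version from a `java -version` banner line.
# Handles both the legacy "1.8.0_241" and the modern "11.0.2" schemes.
java_major() {
  printf '%s\n' "$1" | sed 's/.*"\(1\.\)\{0,1\}\([0-9][0-9]*\).*/\2/'
}

java_major 'java version "1.8.0_241"'   # → 8
java_major 'openjdk version "11.0.2"'   # → 11
# In practice: java_major "$(java -version 2>&1 | head -n 1)"
```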
2. Setup
(1) Download flume-ng-1.6.0-cdh5.7.0.tar.gz
Download option 01: https://download.csdn.net/download/weixin_42018518/12314171. Flume-ng-1.6.0-cdh.zip bundles 3 archives: flume-ng-1.6.0-cdh5.5.0.tar.gz, flume-ng-1.6.0-cdh5.7.0.tar.gz, and flume-ng-1.6.0-cdh5.10.1.tar.gz; pick the version you need. Here we use cdh5.7.0.
Download option 02: wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
(2) Upload to the server and unpack
[root@yz-sre-backup019 ~]# cd
[root@yz-sre-backup019 ~]# mkdir apps
[root@yz-sre-backup019 ~]# cd /data/soft
[root@yz-sre-backup019 soft]# tar -zxvf flume-ng-1.6.0-cdh5.7.0.tar.gz -C /root/apps/
(3) Configure the environment variables
[root@yz-sre-backup019 apps]# vim ~/.bash_profile
# Set Flume's path; adjust to wherever you installed it
export FLUME_HOME=/root/apps/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
# reload the profile so the change takes effect
[root@yz-sre-backup019 apps]# source ~/.bash_profile
(4) Configure the Java JDK path in flume-env.sh
a. First download the latest stable JDK:
Note: the JDK is usable by whichever user account it is installed under
Latest version download page: https://www.oracle.com/java/technologies/javase-downloads.html
b. Upload the downloaded JDK to /data/soft and fix its file permissions:
[root@yz-sre-backup019 soft]# rz
[root@yz-sre-backup019 soft]# chmod 755 jdk-8u241-linux-x64.tar.gz
c. Unpack the JDK into /usr/:
[root@yz-sre-backup019 soft]# tar -zxvf jdk-8u241-linux-x64.tar.gz -C /usr/
d. Configure the JDK environment variables:
[root@yz-sre-backup019 soft]# vim /etc/profile
# Java environment
export JAVA_HOME=/usr/jdk1.8.0_241
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin
# reload the profile so the JDK settings take effect
[root@yz-sre-backup019 soft]# source /etc/profile
e. Verify that the JDK installed correctly:
[root@yz-sre-backup019 soft]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
f. Configure the Java JDK path in flume-env.sh:
[root@yz-sre-backup019 apps]# cd $FLUME_HOME/conf
# copy the template
[root@yz-sre-backup019 conf]# cp flume-env.sh.template flume-env.sh
[root@yz-sre-backup019 conf]# vim flume-env.sh
# set the Java directory by appending a new line at the end
export JAVA_HOME=/usr/jdk1.8.0_241
g. Verify
# run flume-ng version in flume's bin directory to print the version
[root@yz-sre-backup019 bin]# cd $FLUME_HOME/bin
[root@yz-sre-backup019 bin]# flume-ng version
# output like the following means the installation succeeded
Flume 1.6.0-cdh5.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 8f5f5143ae30802fe79f9ab96f893e6c54a105d1
Compiled by jenkins on Wed Mar 23 11:38:48 PDT 2016
From source with checksum 50b533f0ffc32db9246405ac4431872e
III. Flume in Practice
The key to using flume is writing the configuration file, which boils down to four steps:
- configure the source
- configure the channel
- configure the sink
- wire the three components together
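Those four steps map one-to-one onto lines of the configuration file. As a bare skeleton (the a1/r1/c1/k1 names are placeholders matching the conventions in the exercises below; the <...> values are to be filled in):

```properties
# Step 0: name the components on this agent
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1
# Step 1: configure the source (its type-specific properties go here too)
a1.sources.r1.type = <source type>
# Step 2: configure the channel
a1.channels.c1.type = <channel type>
# Step 3: configure the sink
a1.sinks.k1.type = <sink type>
# Step 4: wire them together (source -> channel -> sink)
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```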
Hands-on 01: Collect data from a network port and print it to the console
(1) Create a directory layout for flume's configuration files
[root@yz-sre-backup019 data]# mkdir -pv /data/flume/{log,job,bin}
mkdir: created directory `/data/flume'
mkdir: created directory `/data/flume/log'
mkdir: created directory `/data/flume/job'
mkdir: created directory `/data/flume/bin'
[root@yz-sre-backup019 data]# cd flume/
[root@yz-sre-backup019 flume]# ll
total 12
drwxr-xr-x 2 root root 4096 Apr 10 10:34 bin # startup scripts
drwxr-xr-x 2 root root 4096 Apr 10 10:34 job # agent configuration files used when starting flume
drwxr-xr-x 2 root root 4096 Apr 10 10:34 log # log files
(2) Configure the agent
Agent selection: netcat source + memory channel + logger sink
Create a flume-netcat.conf file under /data/flume/job (the directory and file name are up to you; you only need them later when starting the agent).
# flume-netcat.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
# Describe/configure the source
# source type
a1.sources.r1.type = netcat
# host the source binds to
a1.sources.r1.bind = localhost
# port the source binds to
a1.sources.r1.port = 8888
# sink type: logger, i.e. print events to the console
a1.sinks.k1.type = logger
# channel type: memory
a1.channels.c1.type = memory
# maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# maximum number of events the channel accepts from a source or gives to a sink per transaction
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
# the property is plural (channels) because one source can feed several channels
a1.sources.r1.channels = c1
# a sink, by contrast, can drain only one channel
a1.sinks.k1.channel = c1
(3) Start the agent
[root@yz-sre-backup019 ~]# flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-netcat.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console
# after starting, open another window and run the tests below
(4) Connection test
# if telnet is not available, install it first
[root@yz-sre-backup019 ~]# yum -y install telnet net-tools
[root@yz-sre-backup019 ~]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
LISTEN 0 50 *:10501 *:*
[root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飛花點點輕!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK
(5) Write a startup script for flume (recommended for production)
[root@yz-sre-backup019 flume]# cd /data/flume/bin
[root@yz-sre-backup019 bin]# vim start-netcat.sh
[root@yz-sre-backup019 bin]# chmod +x start-netcat.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Fri Apr 10 11:13:11 CST 2020
# Start flume with its built-in monitoring enabled; by default run the line below
# --conf: flume's configuration directory
# --conf-file: your custom agent configuration file
# --name: the agent name, matching the one in the agent configuration file, i.e. a1
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file /data/flume/job/flume-netcat.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-netcat.log 2>&1 &
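The same launch line recurs in every exercise with only the agent name, conf file, and monitoring port changing, so a small helper can factor it out. A sketch, not part of the original scripts (the flume_cmd name and argument order are ours):

```shell
# Print the flume-ng launch line used throughout this article.
flume_cmd() {
  # $1 = agent name, $2 = agent conf file, $3 = monitoring port
  echo "flume-ng agent --conf ${FLUME_HOME}/conf --name $1 --conf-file $2" \
       "-Dflume.monitoring.type=http -Dflume.monitoring.port=$3" \
       "-Dflume.root.logger=INFO,console"
}

flume_cmd a1 /data/flume/job/flume-netcat.conf 10501
```

A start script then reduces to `nohup $(flume_cmd a1 /data/flume/job/flume-netcat.conf 10501) >> /data/flume/log/flume-netcat.log 2>&1 &`.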
(6) Start flume and inspect the startup log
[root@yz-sre-backup019 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
[root@yz-sre-backup019 bin]# bash start-netcat.sh
[root@yz-sre-backup019 bin]#
[root@yz-sre-backup019 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
LISTEN 0 50 *:10501 *:*
[root@yz-sre-backup019 log]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飛花點點輕!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK
[root@yz-sre-backup019 log]# tail -f flume-netcat.log
2020-04-10 15:31:36,708 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel c1
2020-04-10 15:31:36,832 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2020-04-10 15:31:36,833 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2020-04-10 15:31:36,837 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2020-04-10 15:31:36,838 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source r1
2020-04-10 15:31:36,839 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2020-04-10 15:31:36,865 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:8888]
2020-04-10 15:31:36,883 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-10 15:31:36,944 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-10 15:31:36,979 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started [email protected]:10501
2020-04-10 15:32:30,851 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: E9 A3 9E E8 8A B1 E7 82 B9 E7 82 B9 E8 BD BB 21 ...............! }
2020-04-10 15:32:39,853 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 65 6C 63 6F 6D 65 20 74 6F 20 42 65 69 6A 69 Welcome to Beiji }
2020-04-10 15:32:55,060 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 50 79 74 68 6F 6E 0D Python. }
2020-04-10 15:33:00,781 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 53 48 4F 57 75 66 65 69 0D SHOWufei. }
(7) Install the jq tool, which makes JSON-formatted output easy to read
[root@yz-sre-backup019 soft]# wget -O jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
[root@yz-sre-backup019 soft]# chmod +x ./jq
[root@yz-sre-backup019 soft]# cp jq /usr/bin
(8) Inspect flume's metrics
[root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
129 259 0 259 0 0 25951 0 --:--:-- --:--:-- --:--:-- 37000
{
"CHANNEL.c1": { # monitoring data for channel c1; the name c1 was defined in flume-netcat.conf
"ChannelCapacity": "1000", # channel capacity; currently only the File and Memory channels report this
"ChannelFillPercentage": "0.4", # percentage of the channel that is filled
"Type": "CHANNEL", # the type of this entry: CHANNEL
"EventTakeSuccessCount": "0", # total number of events the sink successfully took from the channel
"ChannelSize": "4", # current number of events in the channel; only the File and Memory channels report this
"EventTakeAttemptCount": "0", # total number of times the sink tried to take events; not every attempt returns events, since the channel may be empty when the sink polls
"StartTime": "1586489375175", # channel start time, in milliseconds
"EventPutAttemptCount": "4", # total number of times the source tried to put events into the channel
"EventPutSuccessCount": "4", # total number of events successfully put into the channel and committed
"StopTime": "0" # channel stop time, in milliseconds; 0 means still running
}
}
Tip: for more metrics, see the official documentation: http://flume.apache.org/FlumeUserGuide.html#monitoring
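If jq is unavailable on a host, the flat key/value shape of these metrics means a grep/sed one-liner can pull out a single counter. A hedged sketch (the metric helper is ours, and it relies on the values always being quoted strings, as above):

```shell
# Extract one counter from Flume's /metrics JSON without jq.
metric() {
  # $1 = metrics JSON text, $2 = counter name
  printf '%s' "$1" | tr ',' '\n' | grep "\"$2\"" | sed 's/.*: *"\([^"]*\)".*/\1/'
}

json='{"CHANNEL.c1":{"ChannelSize":"4","EventPutSuccessCount":"4"}}'
metric "$json" ChannelSize    # → 4
# In practice: metric "$(curl -s http://127.0.0.1:10501/metrics)" ChannelSize
```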
(9) Kill the corresponding flume process
[root@yz-sre-backup019 ~]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
LISTEN 0 50 *:10501 *:*
[root@yz-sre-backup019 ~]# netstat -untalp | grep 8888
tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN 4565/java
[root@yz-sre-backup019 ~]# kill 4565
[root@yz-sre-backup019 ~]# netstat -untalp | grep 8888
[root@yz-sre-backup019 ~]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
Hands-on 02: Monitor a file and write newly appended data to local files in real time
(1) Agent selection
exec source + memory channel + file_roll sink
(2) Configure the agent
Create a flume-file.conf file under /data/flume/job (the directory and file name are up to you; you only need them later when starting the agent).
# flume-file.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
# Describe/configure the source
# source type
a1.sources.r1.type = exec
# command the source runs
a1.sources.r1.command = tail -F /data/data.log
# run the command through bash -c so the string is executed as a complete shell command
a1.sources.r1.shell = /bin/bash -c
# sink type: file_roll, i.e. write to local files; the output directory must be set
a1.sinks.k1.type = file_roll
# local directory the sink writes to
a1.sinks.k1.sink.directory = /data/flume/data
# channel type
a1.channels.c1.type = memory
# maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# maximum number of events the channel accepts from a source or gives to a sink per transaction
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
# the property is plural (channels) because one source can feed several channels
a1.sources.r1.channels = c1
# a sink can drain only one channel
a1.sinks.k1.channel = c1
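The file_roll sink rolls to a new output file every 30 seconds by default, even if nothing arrived, which is why the data directory fills up with many small numbered files (such as the 1586513526197-4 seen below). A hedged fragment to tune that (property name per the Flume user guide; 600 is an arbitrary example):

```properties
# roll a new output file every 10 minutes instead of the 30-second default;
# 0 disables time-based rolling entirely (one file per agent run)
a1.sinks.k1.sink.rollInterval = 600
```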
(3) Write a startup script for the agent (recommended for production)
[root@yz-sre-backup019 flume]# cd /data/flume/bin
[root@yz-sre-backup019 bin]# vim start-file.sh
[root@yz-sre-backup019 bin]# chmod +x start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Fri Apr 10 15:05:56 CST 2020
# Start flume with its built-in monitoring enabled; by default run the line below
# --conf: flume's configuration directory
# --conf-file: your custom agent configuration file
# --name: the agent name, matching the one in the agent configuration file, i.e. a1
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &
(4) Start the agent and append test data to the monitored file,
then check the event data in the file generated under /data/flume/data; kill the flume process when the test is done.
[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 bin]# cd /data
[root@yz-sre-backup019 data]# echo "帥飛飛!!!" >> data.log
[root@yz-sre-backup019 bin]# cd /data/flume/data
[root@yz-sre-backup019 data]# tail 1586513526197-4
帥飛飛!!!
[root@yz-sre-backup019 data]# ps -ef | grep flume
root 17682 1 1 18:12 pts/1 00:00:03 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root 17970 956 0 18:16 pts/0 00:00:00 grep flume
[root@yz-sre-backup019 data]# kill 17682
[root@yz-sre-backup019 data]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
Hands-on 03: Stream a local file to HDFS in real time (the flume node needs a Hadoop cluster environment)
Printing data to the console, as in the two exercises above, is of little practical use; real requirements usually mean writing to hdfs. Only the agent configuration changes: set the sink type to hdfs and give it the hdfs url and write path.
(1) Agent selection
exec source + memory channel + hdfs sink
(2) Configure the agent
# flume-hdfs.conf: A single-node Flume configuration
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent name; file_source: source name; hdfs_sink: sink name; memory_channel: channel name
# Describe/configure the source
# source type
wufei03.sources.file_source.type = exec
# command the source runs
# wufei03.sources.file_source.command = tail -F /data/messages
wufei03.sources.file_source.command = tail -F /data/data.log
# run the command through bash -c so the string is executed as a complete shell command
wufei03.sources.file_source.shell = /bin/bash -c
# sink type: hdfs, i.e. ship data to the HDFS cluster
wufei03.sinks.hdfs_sink.type = hdfs
# hdfs url and write path the sink outputs to
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# prefix for uploaded files
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.2.24-
# whether to round down timestamps, i.e. roll directories by time
wufei03.sinks.hdfs_sink.hdfs.round = true
# how many time units per directory
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# the time unit for rounding
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# whether to use the local timestamp (needed here because exec-source events carry no timestamp header)
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# how many events to accumulate before flushing to hdfs
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# file type; compression is supported
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# how many seconds before rolling a new file
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# roll size per file, in bytes
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 0 = rolling is independent of the event count
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# minimum number of block replicas
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
# channel type
wufei03.channels.memory_channel.type = memory
# maximum number of events stored in the channel
wufei03.channels.memory_channel.capacity = 1000
# maximum number of events the channel accepts from a source or gives to a sink per transaction
wufei03.channels.memory_channel.transactionCapacity = 1000
# Bind the source and sink to the channel
# the property is plural (channels) because one source can feed several channels
wufei03.sources.file_source.channels = memory_channel
# a sink can drain only one channel
wufei03.sinks.hdfs_sink.channel = memory_channel
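The config above writes plain text (fileType = DataStream). If HDFS space matters, the same sink can compress its output; a hedged fragment (gzip as an example codec; property names per the Flume user guide):

```properties
# write compressed files instead of plain text
wufei03.sinks.hdfs_sink.hdfs.fileType = CompressedStream
# codec to use, e.g. gzip, bzip2, lzop, snappy (cluster support required)
wufei03.sinks.hdfs_sink.hdfs.codeC = gzip
```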
(3) Write the startup script and start flume
[root@yz-sre-backup019 flume]# cd /data/flume/bin
[root@yz-sre-backup019 bin]# vim start-hdfs.sh
[root@yz-sre-backup019 bin]# chmod +x start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Tue Apr 14 11:56:51 CST 2020
# Start flume with its built-in monitoring enabled; by default run the line below
# --conf: flume's configuration directory
# --conf-file: your custom agent configuration file
# --name: the agent name, matching the one in the agent configuration file, i.e. wufei03
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file /data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-hdfs.log 2>&1 &
[root@yz-bi-web01 bin]# bash start-hdfs.sh
[root@yz-bi-web01 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
.....
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 50 *:10502 *:*
LISTEN 0 128 *:80 *:*
[root@yz-bi-web01 bin]#
[root@yz-bi-web01 data]# cd /data/
[root@yz-bi-web01 data]# echo "SHOWufei" >> data.log
[root@yz-bi-web01 data]# echo "帥飛飛!!!" >> data.log
(4) Inspect flume's log-collection output
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
...
2020-04-14 12:16:15,648 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:169)] Exec source starting with command:tail -F /data/data.log
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: hdfs_sink: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: file_source: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: hdfs_sink started
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: file_source started
2020-04-14 12:16:15,683 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-14 12:16:15,725 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-14 12:16:15,761 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started [email protected]:10502
2020-04-14 12:16:53,678 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 12:16:53,977 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,533 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:363)] Closing hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,571 (hdfs-hdfs_sink-call-runner-6) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:629)] Renaming hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp to hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679
2020-04-14 12:26:55,580 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:394)] Writer callback called.
2020-04-14 17:47:30,793 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 17:47:30,839 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/17/gz_10.20.2.24-.1586857650794.tmp
(5) Check that the corresponding log files were created in hdfs
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14/12
Found 1 items
-rw-r--r-- 3 root hadoop 9 2020-04-14 12:16 /flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14
Found 2 items
drwxrwxrwx - root hadoop 0 2020-04-14 12:26 /flume/dt=2020-04-14/12
drwxrwxrwx - root hadoop 0 2020-04-14 17:47 /flume/dt=2020-04-14/17
(6) Browsing HDFS
(7) Inspect flume's metrics
[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
105 841 0 841 0 0 163k 0 --:--:-- --:--:-- --:--:-- 273k
{
"SOURCE.file_source": { # source name
"OpenConnectionCount": "0", # current number of connections held open to clients or sinks; currently only the avro source reports this
"Type": "SOURCE", # the type of this entry: SOURCE
"AppendBatchAcceptedCount": "0", # total number of batches successfully committed to the channel
"AppendBatchReceivedCount": "0", # total number of event batches received
"EventAcceptedCount": "3", ## total number of events successfully written out to the channel
"AppendReceivedCount": "0", # total number of events arriving in single-event batches (equivalent to one append call in RPC terms)
"StopTime": "0", # source stop time, in milliseconds; 0 means still running
"StartTime": "1586837775650", # source start time, in milliseconds
"EventReceivedCount": "3", ## total number of events the source has received so far
"AppendAcceptedCount": "0" # events appended one at a time: total number of single events accepted into the channel and acknowledged
},
"SINK.hdfs_sink": { # sink name
"BatchCompleteCount": "0", # number of batches whose event count equaled the configured batch size
"ConnectionFailedCount": "0", # number of failed connections
"EventDrainAttemptCount": "3", ## total number of events the sink attempted to write to storage
"ConnectionCreatedCount": "2", # number of connections created to the next stage or storage system (e.g. creating a file on HDFS)
"Type": "SINK", # the type of this entry: SINK
"BatchEmptyCount": "2551", # number of empty batches (0 events); a large value means the source produces data much more slowly than the sink can drain it
"ConnectionClosedCount": "1", # number of closed connections
"EventDrainSuccessCount": "3", ## total number of events the sink successfully wrote to storage
"StopTime": "0", # sink stop time, in milliseconds
"StartTime": "1586837775650", # sink start time, in milliseconds
"BatchUnderflowCount": "3" # number of batches smaller than the configured maximum batch size; a high value also means the sink is faster than the source
},
"CHANNEL.memory_channel": { # channel name
"EventPutSuccessCount": "3", ## total number of events successfully put into the channel and committed
"ChannelFillPercentage": "0.0", # percentage of the channel that is filled
"Type": "CHANNEL", # the type of this entry: CHANNEL
"StopTime": "0", # channel stop time, in milliseconds
"EventPutAttemptCount": "3", ## total number of times the source attempted to put events into the channel
"ChannelSize": "0", # current number of events in the channel; currently only the File and Memory channels report this
"StartTime": "1586837775646", # channel start time, in milliseconds
"EventTakeSuccessCount": "3", ## total number of events the sink successfully took from the channel
"ChannelCapacity": "1000", # channel capacity; currently only the File and Memory channels report this
"EventTakeAttemptCount": "2558" # total number of times the sink tried to take events from the channel; not every attempt returns events, since the channel may be empty when the sink polls
}
}
(8) Kill the flume process when testing is done
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 19768 14759 0 17:57 pts/11 00:00:00 grep flume
root 26653 1 0 12:16 pts/0 00:00:26 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share
/hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
mon/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
[root@yz-bi-web01 ~]# kill 26653
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 19777 14759 0 17:58 pts/11 00:00:00 grep flume
(9) Remove the data from hdfs
[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$
Hands-on 04: Collect data from server A to server B and upload it to HDFS (server B needs a Hadoop cluster environment)
Key point: server A's sink type is avro, while server B's source type is avro.
Flow:
- machine A monitors a file and appends log records to data.log
- the avro sink ships newly produced log lines to the specified hostname and port
- the agent behind the matching avro source writes the logs out to the console, kafka, hdfs, etc.
(1) Machine A configuration
Agent selection: exec source + memory channel + avro sink
# flume-file.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
# Describe/configure the source
# Source type
a1.sources.r1.type = exec
# Command the source runs
a1.sources.r1.command = tail -F /data/data.log
# Have bash execute the command string as a whole
a1.sources.r1.shell = /bin/bash -c
# Sink type: avro, i.e. send the data to a host/port, so hostname and port are required
a1.sinks.k1.type = avro
# Target hostname for the sink
a1.sinks.k1.hostname = 10.20.2.24
# Target port for the sink
a1.sinks.k1.port = 8888
# Channel type
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can fan out to several channels
a1.sources.r1.channels = c1
# Attach the sink to the channel; a sink can read from only one channel
a1.sinks.k1.channel = c1
(2) Machine B configuration
Agent layout: avro source + memory channel + hdfs sink
# flume-hdfs.conf: A single-node Flume configuration
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent name; file_source: source name; hdfs_sink: sink name; memory_channel: channel name
# Describe/configure the source
# Source type
wufei03.sources.file_source.type = avro
# Host the source binds to
wufei03.sources.file_source.bind = 10.20.2.24
# Port the source binds to
wufei03.sources.file_source.port = 8888
# Sink type: hdfs, i.e. write the data into the HDFS cluster
wufei03.sinks.hdfs_sink.type = hdfs
# HDFS URL and target path the sink writes to
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# Prefix for uploaded file names
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.3.36-
# Whether to round down the timestamp used in the directory layout
wufei03.sinks.hdfs_sink.hdfs.round = true
# How many time units to round down to
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# The time unit used for rounding
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# Use the local timestamp instead of one taken from the event header
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# File type; compression is also supported
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# Roll to a new file after this many seconds
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# Roll to a new file after this many bytes
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 0 = never roll based on the number of events
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# Minimum number of block replicas
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
# Channel type
wufei03.channels.memory_channel.type = memory
# Maximum number of events stored in the channel
wufei03.channels.memory_channel.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
wufei03.channels.memory_channel.transactionCapacity = 100
# Bind the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can fan out to several channels
wufei03.sources.file_source.channels = memory_channel
# Attach the sink to the channel; a sink can read from only one channel
wufei03.sinks.hdfs_sink.channel = memory_channel
(3) Write the startup scripts
// Machine A startup script
[root@yz-sre-backup019 bin]# vim start-file.sh
[root@yz-sre-backup019 bin]# cat start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 11:22:24 CST 2020
# Start the agent with Flume's built-in monitoring enabled
# --conf: Flume configuration directory
# --conf-file: custom agent configuration file
# --name: agent name, matching the one in the configuration file, i.e. a1
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &
// Machine B startup script
[root@yz-bi-web01 bin]# vim start-hdfs.sh
[root@yz-bi-web01 bin]# cat start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 11:22:24 CST 2020
# Start the agent with Flume's built-in monitoring enabled
# --conf: Flume configuration directory
# --conf-file: custom agent configuration file
# --name: agent name, matching the one in the configuration file, i.e. wufei03
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file /data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-hdfs.log 2>&1 &
(4) Run the scripts and check the corresponding logs
// Machine A
[root@yz-sre-backup019 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
// Machine B
[root@yz-bi-web01 bin]# ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*
// Start the agent on machine B first
[root@yz-bi-web01 bin]# bash start-hdfs.sh
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
// Then start the agent on machine A
[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 log]# tail -f flume-file.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
2020-04-15 11:55:35,409 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 11:55:35,416 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started
// Write some test data
[root@yz-sre-backup019 ~]# cd /data/
[root@yz-sre-backup019 data]# echo "帥飛飛!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 11:55:51,147 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859.tmp
[root@yz-sre-backup019 data]# echo "SHOWufei!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 12:03:14,139 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp
(5) Verify that the expected log files appear under the HDFS path
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15
Found 2 items
drwxrwxrwx - root hadoop 0 2020-04-15 12:05 /flume/dt=2020-04-15/11
drwxrwxrwx - root hadoop 0 2020-04-15 12:03 /flume/dt=2020-04-15/12
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/11
Found 1 items
-rw-r--r-- 3 root hadoop 19 2020-04-15 12:05 /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
帥飛飛!!!
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/12
Found 1 items
-rw-r--r-- 3 root hadoop 18 2020-04-15 12:03 /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969
SHOWufei!!!
(6) Check the Flume metrics
// Machine A
[root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
101 811 0 811 0 0 4924 0 --:--:-- --:--:-- --:--:-- 4975
{
"SINK.k1": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "109",
"EventDrainAttemptCount": "2",
"StartTime": "1586922935645",
"EventDrainSuccessCount": "2",
"BatchUnderflowCount": "2",
"StopTime": "0",
"ConnectionFailedCount": "0"
},
"CHANNEL.c1": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"ChannelSize": "0",
"EventTakeSuccessCount": "2",
"EventTakeAttemptCount": "114",
"StartTime": "1586922935643",
"EventPutAttemptCount": "2",
"EventPutSuccessCount": "2",
"StopTime": "0"
},
"SOURCE.r1": {
"EventReceivedCount": "2",
"AppendBatchAcceptedCount": "0",
"Type": "SOURCE",
"EventAcceptedCount": "2",
"AppendReceivedCount": "0",
"StartTime": "1586922935652",
"AppendAcceptedCount": "0",
"OpenConnectionCount": "0",
"AppendBatchReceivedCount": "0",
"StopTime": "0"
}
}
// Machine B
[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
104 839 0 839 0 0 7163 0 --:--:-- --:--:-- --:--:-- 7295
{
"SOURCE.file_source": {
"OpenConnectionCount": "1",
"Type": "SOURCE",
"AppendBatchReceivedCount": "2",
"AppendBatchAcceptedCount": "2",
"EventAcceptedCount": "2",
"AppendReceivedCount": "0",
"StopTime": "0",
"StartTime": "1586922913313",
"EventReceivedCount": "2",
"AppendAcceptedCount": "0"
},
"SINK.hdfs_sink": {
"BatchCompleteCount": "0",
"ConnectionFailedCount": "0",
"EventDrainAttemptCount": "2",
"ConnectionCreatedCount": "2",
"Type": "SINK",
"BatchEmptyCount": "117",
"ConnectionClosedCount": "1",
"EventDrainSuccessCount": "2",
"StopTime": "0",
"StartTime": "1586922912838",
"BatchUnderflowCount": "2"
},
"CHANNEL.memory_channel": {
"EventPutSuccessCount": "2",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"EventPutAttemptCount": "2",
"ChannelSize": "0",
"StopTime": "0",
"StartTime": "1586922912835",
"EventTakeSuccessCount": "2",
"ChannelCapacity": "1000",
"EventTakeAttemptCount": "121"
}
}
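The two JSON blobs above are easy to eyeball, but on a busy box you usually only care about a handful of counters. A minimal post-processing sketch in Python (the function name and the pared-down sample are mine, not part of Flume; in practice you would feed it the output of `curl http://127.0.0.1:10502/metrics`):

```python
import json

def summarize_metrics(metrics: dict) -> dict:
    """Reduce Flume's /metrics JSON to the counters worth watching:
    events accepted by sources, events drained by sinks, and the
    channel backlog (size and fill percentage)."""
    summary = {}
    for name, stats in metrics.items():
        kind = stats.get("Type")
        if kind == "SOURCE":
            summary[name] = {"accepted": int(stats["EventAcceptedCount"])}
        elif kind == "SINK":
            summary[name] = {"drained": int(stats["EventDrainSuccessCount"])}
        elif kind == "CHANNEL":
            summary[name] = {
                "size": int(stats["ChannelSize"]),
                "fill_pct": float(stats["ChannelFillPercentage"]),
            }
    return summary

# Sample pared down from machine B's /metrics output above.
sample = json.loads("""
{
  "SOURCE.file_source": {"Type": "SOURCE", "EventAcceptedCount": "2"},
  "SINK.hdfs_sink": {"Type": "SINK", "EventDrainSuccessCount": "2"},
  "CHANNEL.memory_channel": {"Type": "CHANNEL", "ChannelSize": "0",
                             "ChannelFillPercentage": "0.0"}
}
""")
print(summarize_metrics(sample))
```

A sink `drained` count lagging far behind the source `accepted` count, or a rising `fill_pct`, is the usual sign of a sink that cannot keep up.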
(7) After testing, kill the Flume processes and clean up the data on HDFS
// First kill the Flume process on machine A
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root 10492 6728 0 11:54 pts/2 00:00:00 tail -f flume-file.log
root 10500 1 0 11:55 pts/0 00:00:09 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root 11084 5377 0 12:12 pts/0 00:00:00 grep flume
[root@yz-sre-backup019 bin]# kill 10500
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root 10492 6728 0 11:54 pts/2 00:00:00 tail -f flume-file.log
root 11092 5377 0 12:12 pts/0 00:00:00 grep flume
// Then kill the Flume process on machine B
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 5725 16077 0 11:54 pts/11 00:00:00 tail -f flume-hdfs.log
root 5735 1 1 11:55 pts/0 00:00:20 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share/
hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/comm
on/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
root 18949 9025 0 12:12 pts/12 00:00:00 grep flume
[root@yz-bi-web01 ~]# kill 5735
[root@yz-bi-web01 ~]# ps -ef | grep flume
root 5725 16077 0 11:54 pts/11 00:00:00 tail -f flume-hdfs.log
root 18963 9025 0 12:12 pts/12 00:00:00 grep flume
// Clean up the data on HDFS
[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$ exit
logout
[root@yz-bi-web01 ~]#
Hands-on 05: aggregating multiple Flume agents into a single Flume agent (the aggregation node must have the Hadoop cluster environment configured)
(1) Flow
- Agent1 monitors the file /data/data.log (exec source - memory channel - avro sink)
- Agent2 listens on a network port for a data stream (netcat source - memory channel - avro sink)
- Agent3 watches a given directory for new file content in real time (spooldir source - memory channel - avro sink)
- Agent1, Agent2, and Agent3 send their data to Agent4
- Agent4 writes the combined data to HDFS (avro source - memory channel - hdfs sink)
(2) Write the agent configuration files
[root@yz-sre-backup019 job]# vim agent1-exec.conf
# Name the components on this agent
# Describe/configure the source
# Sink type: avro, i.e. send the data to a host/port, so hostname and port are required
# Channel type
# Bind the source and sink to the channel
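Only the comment skeleton of agent1-exec.conf survives above. Judging by the flow in (1) and the flume-file.conf shown in full in Hands-on 04, it presumably looked roughly like this (the agent/component names and the aggregator address 10.20.2.24:8888 are assumptions carried over from Hands-on 04):

```properties
# agent1-exec.conf (sketch): exec source -> memory channel -> avro sink
agent1.sources = exec_source
agent1.sinks = avro_sink
agent1.channels = memory_channel

# Tail the monitored file
agent1.sources.exec_source.type = exec
agent1.sources.exec_source.command = tail -F /data/data.log
agent1.sources.exec_source.shell = /bin/bash -c

# Forward events to Agent4's avro source (host/port assumed)
agent1.sinks.avro_sink.type = avro
agent1.sinks.avro_sink.hostname = 10.20.2.24
agent1.sinks.avro_sink.port = 8888

agent1.channels.memory_channel.type = memory
agent1.channels.memory_channel.capacity = 1000
agent1.channels.memory_channel.transactionCapacity = 100

agent1.sources.exec_source.channels = memory_channel
agent1.sinks.avro_sink.channel = memory_channel
```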
[root@yz-bi-web01 job]# vim agent4-hdfs.conf
# Name the components on this agent
# Describe/configure the source
# Sink type: hdfs
# Channel type
# Bind the source and sink to the channel
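The body of agent4-hdfs.conf was likewise collapsed. Modeled on flume-hdfs.conf from Hands-on 04, a plausible sketch (agent/component names, bind address, and port are assumptions; a single avro source accepts connections from all three upstream agents):

```properties
# agent4-hdfs.conf (sketch): avro source -> memory channel -> hdfs sink
agent4.sources = avro_source
agent4.sinks = hdfs_sink
agent4.channels = memory_channel

# Listen for the three upstream agents (bind address/port assumed)
agent4.sources.avro_source.type = avro
agent4.sources.avro_source.bind = 10.20.2.24
agent4.sources.avro_source.port = 8888

agent4.sinks.hdfs_sink.type = hdfs
agent4.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
agent4.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
agent4.sinks.hdfs_sink.hdfs.fileType = DataStream
agent4.sinks.hdfs_sink.hdfs.rollInterval = 600
agent4.sinks.hdfs_sink.hdfs.rollSize = 134217700
agent4.sinks.hdfs_sink.hdfs.rollCount = 0

agent4.channels.memory_channel.type = memory
agent4.channels.memory_channel.capacity = 1000
agent4.channels.memory_channel.transactionCapacity = 100

agent4.sources.avro_source.channels = memory_channel
agent4.sinks.hdfs_sink.channel = memory_channel
```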
[root@yz-sre-backup019 job]# vim agent2-netcat.conf
# Name the components on this agent
# Describe/configure the source
# Sink type: avro, i.e. send the data to a host/port, so hostname and port are required
# Channel type: memory, with a capacity of 1000 and a transaction capacity of 100
# Bind the source and sink to the channel
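A sketch of what agent2-netcat.conf would have contained, following the same pattern (agent/component names, the local listening port 44444, and the aggregator address are assumptions):

```properties
# agent2-netcat.conf (sketch): netcat source -> memory channel -> avro sink
agent2.sources = netcat_source
agent2.sinks = avro_sink
agent2.channels = memory_channel

# Listen on a local port for test input (port assumed)
agent2.sources.netcat_source.type = netcat
agent2.sources.netcat_source.bind = 127.0.0.1
agent2.sources.netcat_source.port = 44444

# Forward events to Agent4's avro source (host/port assumed)
agent2.sinks.avro_sink.type = avro
agent2.sinks.avro_sink.hostname = 10.20.2.24
agent2.sinks.avro_sink.port = 8888

agent2.channels.memory_channel.type = memory
agent2.channels.memory_channel.capacity = 1000
agent2.channels.memory_channel.transactionCapacity = 100

agent2.sources.netcat_source.channels = memory_channel
agent2.sinks.avro_sink.channel = memory_channel
```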
[root@yz-sre-backup019 job]# vim agent3-dir.conf
# Name the components on this agent
# Describe/configure the source
# Sink type: avro, i.e. send the data to a host/port, so hostname and port are required
# Channel type
# Bind the source and sink to the channel
[root@yz-sre-backup019 job]# mkdir -pv /data/flume/upload    // create the directory to be watched for the test
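A sketch of agent3-dir.conf in the same style (agent/component names and the aggregator address are assumptions; the spoolDir matches the directory created above):

```properties
# agent3-dir.conf (sketch): spooldir source -> memory channel -> avro sink
agent3.sources = spooldir_source
agent3.sinks = avro_sink
agent3.channels = memory_channel

# Watch the upload directory; fully ingested files get a .COMPLETED suffix
agent3.sources.spooldir_source.type = spooldir
agent3.sources.spooldir_source.spoolDir = /data/flume/upload

# Forward events to Agent4's avro source (host/port assumed)
agent3.sinks.avro_sink.type = avro
agent3.sinks.avro_sink.hostname = 10.20.2.24
agent3.sinks.avro_sink.port = 8888

agent3.channels.memory_channel.type = memory
agent3.channels.memory_channel.capacity = 1000
agent3.channels.memory_channel.transactionCapacity = 100

agent3.sources.spooldir_source.channels = memory_channel
agent3.sinks.avro_sink.channel = memory_channel
```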
(3) Write the corresponding agent startup scripts
#!/bin/bash
# Start the agent with Flume's built-in monitoring enabled
#!/bin/bash
# Start the agent with Flume's built-in monitoring enabled
#!/bin/bash
# Start the agent with Flume's built-in monitoring enabled
#!/bin/bash
# Start the agent with Flume's built-in monitoring enabled
(4) Start each agent and check its log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh ...
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh ...
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh ...
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh ...
(5) Run a test against each agent and check the corresponding logs
// agent1 test
2020-04-15 17:16:04,150 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
// agent2 test
Trying 127.0.0.1...
2020-04-15 17:17:07,096 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
// agent3 test
2020-04-15 17:26:37,864 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:27:18,969 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
// Check that the expected log files were created under the HDFS path
Found 4 items
帥飛飛!!!
飛花點點輕!
https://showufei.blog.csdn.net
(6) Check the Flume metrics
// agent1
% Total % Received % Xferd Average Speed Time Time Time Current
// agent3
% Total % Received % Xferd Average Speed Time Time Time Current
// agent2
% Total % Received % Xferd Average Speed Time Time Time Current
// agent4
% Total % Received % Xferd Average Speed Time Time Time Current
(7) After testing, kill the Flume processes and clean up the data on HDFS
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
Hands-on 06: channel selector example
Channel selector: decides which channel(s) a given event is sent to.
- Replicating Channel Selector: the default; each event is replicated to every configured channel, i.e. one copy per channel
- Multiplexing Channel Selector: routes events to different channels depending on their content (header values)
- See the official documentation for details: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-channel-selectors
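The multiplexing selector is the one that needs explicit configuration. A sketch along the lines of the official user guide (the header name `state`, its values, and the component names are illustrative, not taken from this article):

```properties
# Route by the value of the "state" header: CZ -> c1, US -> c2,
# anything else falls back to the default channel c2.
a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c2
a1.sources.r1.channels = c1 c2
```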
Flow diagram:
Hands-on 07: host interceptor example
Interceptor: a source-side component that can modify or drop events as they pass through.
Common interceptors include:
- host interceptor: adds a host-name (or IP) header to each outgoing event
- timestamp interceptor: adds a timestamp header to each outgoing event
- More interceptors are described in the official documentation: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-interceptors
(1) Host interceptor configuration (example 1)
Agent layout: netcat source + memory channel + logger sink
[root@yz-sre-backup019 job]# vim flume-host_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-host_interceptor.conf
# flume-host_interceptor.conf: A single-node Flume configuration
# Name the components on this agent
wf_host_interceptor.sources = netcat_source
wf_host_interceptor.sinks = logger_sink
wf_host_interceptor.channels = memory_channel
# wf_host_interceptor: agent name; netcat_source: source name; logger_sink: sink name; memory_channel: channel name
# Describe/configure the source
# Source type
wf_host_interceptor.sources.netcat_source.type = netcat
# Host the source binds to
wf_host_interceptor.sources.netcat_source.bind = 127.0.0.1
# Port the source binds to
wf_host_interceptor.sources.netcat_source.port = 8888
# Attach the interceptor
wf_host_interceptor.sources.netcat_source.interceptors = host_interceptor
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.preserveExisting = false
# Header key to set
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.hostHeader = hostname
# Use the host IP as the header value
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.useIP = true
# Sink type: logger, i.e. write events to the console
wf_host_interceptor.sinks.logger_sink.type = logger
# Channel type
wf_host_interceptor.channels.memory_channel.type = memory
# Maximum number of events stored in the channel
wf_host_interceptor.channels.memory_channel.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
wf_host_interceptor.channels.memory_channel.transactionCapacity = 100
# Bind the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can fan out to several channels
wf_host_interceptor.sources.netcat_source.channels = memory_channel
# Attach the sink to the channel; a sink can read from only one channel
wf_host_interceptor.sinks.logger_sink.channel = memory_channel
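Conceptually, what the host interceptor does to each event can be sketched in a few lines of Python (the event-as-dict shape and the function name are illustrative, not Flume's Java API; `useIP = true` corresponds to `use_ip=True` below):

```python
import socket

def host_intercept(event: dict, host_header: str = "hostname",
                   use_ip: bool = True, preserve_existing: bool = False) -> dict:
    """Mimic Flume's HostInterceptor: stamp each event's headers
    with this machine's IP (use_ip=True) or host name."""
    headers = event.setdefault("headers", {})
    if preserve_existing and host_header in headers:
        return event  # keep a header that was set upstream
    name = socket.gethostname()
    headers[host_header] = socket.gethostbyname(name) if use_ip else name
    return event

# Demo with use_ip=False to avoid depending on local DNS resolution.
event = host_intercept({"body": b"hello"}, use_ip=False)
print(event["headers"]["hostname"])
```

With `preserveExisting = false` (as in the config above), the header is overwritten even if an upstream hop already set it.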
(2) Write the startup script
[root@yz-sre-backup019 bin]# vim start-host_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-host_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-host_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Thu Apr 16 11:33:39 CST 2020
# Start the agent with Flume's built-in monitoring enabled
# --conf: Flume configuration directory
# --conf-file: custom agent configuration file
# --name: agent name, matching the one in the configuration file, i.e. wf_host_interceptor
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file /data/flume/job/flume-host_interceptor.conf --name wf_host_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10520 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-host_interceptor.log 2>&1 &
(3) Start the agent, connect to the port, and send some test data
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
...
2020-04-16 12:08:55,492 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started [email protected]:10520
Trying 127.0.0.1...
(4) Timestamp interceptor configuration (example 2)
Agent layout: netcat source + memory channel + logger sink
[root@yz-sre-backup019 job]# vim flume-timestamp_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-timestamp_interceptor.conf
# flume-timestamp_interceptor.conf: A single-node Flume configuration
# Name the components on this agent
wf_timestamp_interceptor.sources = netcat_source
wf_timestamp_interceptor.sinks = logger_sink
wf_timestamp_interceptor.channels = memory_channel
# wf_timestamp_interceptor: agent name; netcat_source: source name; logger_sink: sink name; memory_channel: channel name
# Describe/configure the source
# Source type
wf_timestamp_interceptor.sources.netcat_source.type = netcat
# Host the source binds to
wf_timestamp_interceptor.sources.netcat_source.bind = 127.0.0.1
# Port the source binds to
wf_timestamp_interceptor.sources.netcat_source.port = 8888
# Attach the interceptor
wf_timestamp_interceptor.sources.netcat_source.interceptors = timestamp_interceptor
wf_timestamp_interceptor.sources.netcat_source.interceptors.timestamp_interceptor.type = timestamp
# Sink type: logger, i.e. write events to the console
wf_timestamp_interceptor.sinks.logger_sink.type = logger
# Channel type
wf_timestamp_interceptor.channels.memory_channel.type = memory
# Maximum number of events stored in the channel
wf_timestamp_interceptor.channels.memory_channel.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
wf_timestamp_interceptor.channels.memory_channel.transactionCapacity = 100
# Bind the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can fan out to several channels
wf_timestamp_interceptor.sources.netcat_source.channels = memory_channel
# Attach the sink to the channel; a sink can read from only one channel
wf_timestamp_interceptor.sinks.logger_sink.channel = memory_channel
(5) Write the startup script
[root@yz-sre-backup019 bin]# vim start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-timestamp_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Thu Apr 16 12:26:26 CST 2020
# Start the agent with Flume's built-in monitoring enabled
# --conf: Flume configuration directory
# --conf-file: custom agent configuration file
# --name: agent name, matching the one in the configuration file, i.e. wf_timestamp_interceptor
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file /data/flume/job/flume-timestamp_interceptor.conf --name wf_timestamp_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10521 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-timestamp_interceptor.log 2>&1 &
(6) Start the agent, connect to the port, and send some test data
State Recv-Q Send-Q Local Address:Port Peer Address:Port
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
...
2020-04-16 12:28:55,062 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started [email protected]:10521
2020-04-16 12:30:15,386 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=1587011415381} body: 53 48 4F 57 75 66 65 69 E3 80 82 2E 2E 2E E3 80 SHOWufei........ }
Trying 127.0.0.1...
4. Real-world application in production environments
To be implemented...