Flume NG in Practice: A Distributed Log Collection Framework

Contents

Before We Start: Application Scenarios
I. Flume Architecture and Core Components
  1. The Concept of an Event
  2. Flume Architecture
  3. How Flume Works
  4. Flume's Extended Usage
II. Setting Up the Flume Environment
  1. Prerequisites
  2. Installation
    (1) Download flume-ng-1.6.0-cdh5.7.0.tar.gz
    (2) Upload to the server and extract
    (3) Configure environment variables
    (4) Configure the Java JDK path in flume-env.sh
III. Flume in Practice
  Practice 01: Collect data from a given network port and output it to the console
    (1) Create custom directories for the Flume configuration files
    (2) Configure the agent
    (3) Start the agent
    (4) Connection test
    (5) Write a Flume startup script (recommended for production)
    (6) Start Flume and check the startup log
    (7) Install the jq tool (if it is not already present) to make the JSON output easier to read
    (8) View Flume metrics
    (9) Kill the corresponding Flume process
  Practice 02: Monitor a file and collect newly appended data into a local file in real time
    (1) Agent component selection
    (2) Configure the agent
    (3) Write the agent startup script (recommended for production)
    (4) Start the agent and append test data to the monitored file
  Practice 03: Read a local file into HDFS in real time (the Flume node must have the Hadoop cluster environment configured)
    (1) Agent component selection
    (2) Configure the agent
    (3) Write the startup script and start Flume
    (4) Check the Flume collection log
    (5) Verify that the corresponding log data has been generated in HDFS
    (6) Browsing HDFS
    (7) View Flume metrics
    (8) Kill the Flume process after testing
    (9) Clean up the data on HDFS
  Practice 04: Collect data from server A to server B and upload it to HDFS (server B must have the Hadoop cluster environment configured)
    (1) Machine A configuration
    (2) Machine B configuration
    (3) Write the startup scripts
    (4) Run the scripts and check the corresponding logs
    (5) Verify that the corresponding log data has been generated in HDFS
    (6) View Flume metrics
    (7) Kill the Flume processes and clean up the HDFS data after testing
  Practice 05: Aggregate data from multiple Flume agents into a single Flume agent (the aggregation node must have the Hadoop cluster environment configured)
    (1) Workflow
    (2) Write the corresponding agent configuration files
    (3) Write the corresponding agent startup scripts
    (4) Start each agent and check its log
    (5) Test each path and check the corresponding logs
    (6) View Flume metrics
    (7) Kill the Flume processes and clean up the HDFS data after testing
  Practice 06: Channel selector example
  Practice 07: Host interceptor example
    (1) Edit the host interceptor configuration file (case 1)
    (2) Write the startup script
    (3) Start it, connect to the given port, and send test data
    (4) Edit the timestamp interceptor configuration file (case 2)
    (5) Write the startup script
    (6) Start it, connect to the given port, and send test data
IV. Real-World Use in Production


Before We Start: Application Scenarios

In the big-data stack, Flume plays the role of data collector; once the data has been collected, it is handed off to a computing framework for processing. Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It lets you plug custom data senders into a logging system to collect data, and it can also perform simple processing on the data and write it to a variety of (customizable) data receivers.

As the log collection tool of the Hadoop ecosystem, Flume is very convenient to use, but when you look up how to install it you will find conflicting advice: some sources say installation is trivial (just unpack and go), while others say it is complicated and depends on ZooKeeper, so ZooKeeper must be installed first. Why the contradiction? Both are right; they are simply describing different Flume versions.

Background: Flume, the distributed log collection system developed by Cloudera, is one of the components surrounding Hadoop. It can collect logs scattered across different nodes and machines into HDFS in real time. Flume's initial releases are now collectively referred to as Flume OG (original generation) and belonged to Cloudera. As Flume's functionality expanded, Flume OG's weaknesses surfaced: a bloated code base, poorly designed core components, and non-standard core configuration. In particular, log transport was notoriously unstable in Flume OG's final release, 0.94.0, a problem documented in the troubleshooting section of the BigInsights product documentation. To address these issues, on October 22, 2011 Cloudera completed FLUME-728, a milestone overhaul of Flume that rewrote its core components, core configuration, and code architecture; the reworked version is collectively known as Flume NG (next generation). Another reason for the change was moving Flume under the Apache umbrella, where Cloudera Flume was renamed Apache Flume.

Flume OG has three node roles: agent, collector, and master.

Flume NG has a single node role: the agent.

Flume OG vs. Flume NG:

  • In the OG version, stable operation of Flume depends on ZooKeeper. ZooKeeper is needed to coordinate OG's several node types (agent, collector, master), especially when a cluster runs multiple masters. OG can also manage node configuration purely in memory, but then the user has to accept losing configuration data whenever a machine fails. In short, stable use of OG depends on ZooKeeper.
  • In the NG version, the number of node roles drops from three to one, so there is no multi-role coordination problem and ZooKeeper is no longer needed; NG is therefore free of the ZooKeeper dependency. Because OG's reliance on ZooKeeper runs through its entire configuration and usage, using OG effectively requires knowing how to build and operate a ZooKeeper cluster.
  • Installing OG: set $JAVA_HOME in flume-env.sh and provide a flume-conf.xml, whose most important and mandatory settings concern the master. Every Flume node in the cluster needs the master-related properties (such as flume.master.servers, flume.master.store, and flume.master.serverid). For stable operation you also need a ZooKeeper cluster, which requires a fairly deep understanding of ZooKeeper; after installing ZooKeeper you must set the related properties in flume-conf.xml, such as flume.master.zk.use.external and flume.master.zk.servers. Finally, before transporting data with OG, both the master and the agents have to be started.
  • Installing NG: just set $JAVA_HOME in flume-env.sh.

So when we use Flume today, we generally choose Flume NG.

I. Flume Architecture and Core Components

1. The Concept of an Event

At its core, Flume collects data from a data source (the source) and delivers it to a specified destination (the sink). To guarantee that delivery succeeds, the data is buffered in a channel before it is sent on; only after the data has actually arrived at the sink does Flume delete its buffered copy. What flows through this whole pipeline is the event, and the transactional guarantee is provided at the event level.

So what exactly is an event?

An event wraps the data being transported and is Flume's basic unit of data transfer; for a text file, an event is usually one line (one record). The event is also the basic unit of a transaction. An event flows from the source into the channel and on to the sink; it is essentially a byte array that can carry headers (metadata). An event is the smallest complete unit of data, arriving from an external data source and heading to an external destination.

 

2. Flume Architecture

 

What makes Flume so effective is a single core design decision: the agent. An agent is a Java process that runs on a log collection node (that is, a server node).

An agent contains three core components, source ➡️ channel ➡️ sink, analogous to a producer, a warehouse, and a consumer.

  • source: the component that collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
  • channel: once the source has collected data, it is staged temporarily in the channel. The channel is the agent's temporary data store, a simple buffer for the collected data, which can live in memory, JDBC, a file, and so on.
  • sink: the component that delivers data to its destination, which can be hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, or a custom sink.

3. How Flume Works

The heart of Flume is the agent. An agent interacts with the outside world at two points: the data input (source) and the data output (sink); the sink is responsible for delivering data to the configured external destination. After the source receives data, it hands the data to the channel; the channel acts as a buffer and holds the data temporarily, and the sink then delivers the data from the channel to its destination (such as HDFS).

Note: the channel removes its buffered data only after the sink has successfully delivered it. This mechanism is what guarantees reliable, safe data transfer.

4. Flume's Extended Usage

Flume supports multi-level agents, meaning agents can be chained one after another: a sink can write its data to the source of the next agent, so agents can be strung together and the whole flow processed end to end. Flume also supports fan-in and fan-out. Fan-in means a source can accept multiple inputs; fan-out means one flow can be written out to multiple destinations (a source can replicate its events into multiple channels, each drained by its own sink).

Note that Flume ships with a large number of built-in source, channel, and sink types. Different sources, channels, and sinks can be combined freely, and the combination is driven entirely by the user's configuration file, which makes Flume very flexible.

For example, a channel can stage events in memory or persist them to local disk, and a sink can write logs to HDFS, HBase, or even to another agent's source.

Flume lets users build multi-level flows: multiple agents can cooperate, with support for fan-in, fan-out, contextual routing, and backup routes. A minimal fan-out configuration sketch is given below, and the figure below 👇 illustrates these multi-agent topologies.
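To make fan-out concrete, here is a minimal configuration sketch (the agent and component names are made up purely for this illustration): one netcat source replicates every event into two memory channels, and each channel is drained by its own sink. The replicating channel selector shown is Flume's default, so that line is optional.

# Hypothetical fan-out sketch: one source, two channels, two sinks
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# The source listens on a local port (any source type fans out the same way)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 9999
# replicating (the default) copies every event into all attached channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# One sink per channel; each sink drains exactly one channel
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /tmp/flume-fanout
a1.sinks.k2.channel = c2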

 

II. Setting Up the Flume Environment

1. Prerequisites

  • Flume requires Java 1.7 or later (1.8 recommended)
  • Sufficient memory and disk space
  • Read/write permission on the directories the agent monitors

2. Installation

(1) Download flume-ng-1.6.0-cdh5.7.0.tar.gz

Download option 01: https://download.csdn.net/download/weixin_42018518/12314171 — the Flume-ng-1.6.0-cdh.zip archive contains three tarballs: flume-ng-1.6.0-cdh5.5.0.tar.gz, flume-ng-1.6.0-cdh5.7.0.tar.gz, and flume-ng-1.6.0-cdh5.10.1.tar.gz. Pick the version you need; here we use cdh5.7.0.

Download option 02: wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz

(2) Upload to the server and extract

Tip: on a Mac with iTerm2, install lrzsz to get the rz/sz commands.

[root@yz-sre-backup019 ~]# cd
[root@yz-sre-backup019 ~]# mkdir apps
[root@yz-sre-backup019 ~]# cd /data/soft
[root@yz-sre-backup019 soft]# tar -zxvf flume-ng-1.6.0-cdh5.7.0.tar.gz -C /root/apps/

(3) Configure environment variables

[root@yz-sre-backup019 apps]# vim ~/.bash_profile
    # Set the Flume path; adjust it to match your own installation directory
    export FLUME_HOME=/root/apps/apache-flume-1.6.0-cdh5.7.0-bin
    export PATH=$FLUME_HOME/bin:$PATH
  
// Reload the profile so the changes take effect
[root@yz-sre-backup019 apps]# source ~/.bash_profile

(4) Configure the Java JDK path in flume-env.sh

a. First, download the latest stable JDK:

Note: the JDK is usable by whichever user it is installed under.

Current download page: https://www.oracle.com/java/technologies/javase-downloads.html

 

b. Upload the downloaded JDK to /data/soft and adjust its file permissions:

[root@yz-sre-backup019 soft]# rz
[root@yz-sre-backup019 soft]# chmod 755 jdk-8u241-linux-x64.tar.gz

c. Extract the JDK to /usr/:

[root@yz-sre-backup019 soft]# tar -zxvf jdk-8u241-linux-x64.tar.gz -C /usr/

d. Configure the JDK environment variables:

[root@yz-sre-backup019 soft]# vim /etc/profile
# Java environment
    export JAVA_HOME=/usr/jdk1.8.0_241
    export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
    export PATH=$PATH:${JAVA_HOME}/bin
  
// Reload the profile so the JDK settings take effect
[root@yz-sre-backup019 soft]# source /etc/profile

e. Check that the JDK is installed correctly:

[root@yz-sre-backup019 soft]# java -version
java version "1.8.0_241"
Java(TM) SE Runtime Environment (build 1.8.0_241-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)

f. Configure the Java JDK path in flume-env.sh:

[root@yz-sre-backup019 apps]# cd $FLUME_HOME/conf
// Copy the template
[root@yz-sre-backup019 conf]# cp flume-env.sh.template flume-env.sh
[root@yz-sre-backup019 conf]# vim flume-env.sh
// Set the Java home by appending a new line at the end
    export JAVA_HOME=/usr/jdk1.8.0_241

g. Verify

// Run flume-ng version in Flume's bin directory to check the version
[root@yz-sre-backup019 bin]# cd $FLUME_HOME/bin
[root@yz-sre-backup019 bin]# flume-ng version
// Output like the following means the installation succeeded
Flume 1.6.0-cdh5.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 8f5f5143ae30802fe79f9ab96f893e6c54a105d1
Compiled by jenkins on Wed Mar 23 11:38:48 PDT 2016
From source with checksum 50b533f0ffc32db9246405ac4431872e

III. Flume in Practice

The key to using Flume is writing the configuration file, which comes down to four steps (a generic skeleton sketch follows this list):

  1. Configure the source
  2. Configure the channel
  3. Configure the sink
  4. Wire the three components together
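The skeleton below maps those four steps onto configuration lines. It is only an illustrative sketch with placeholder names in angle brackets; the concrete netcat example in Practice 01 follows exactly this shape.

# Step 0: name the components of the agent
<agent>.sources = <source>
<agent>.channels = <channel>
<agent>.sinks = <sink>

# Step 1: configure the source
<agent>.sources.<source>.type = ...

# Step 2: configure the channel
<agent>.channels.<channel>.type = ...

# Step 3: configure the sink
<agent>.sinks.<sink>.type = ...

# Step 4: wire the three components together
<agent>.sources.<source>.channels = <channel>
<agent>.sinks.<sink>.channel = <channel>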

Practice 01: Collect data from a given network port and output it to the console

(1) Create custom directories for the Flume configuration files

[root@yz-sre-backup019 data]# mkdir -pv /data/flume/{log,job,bin}
mkdir: created directory `/data/flume'
mkdir: created directory `/data/flume/log'
mkdir: created directory `/data/flume/job'
mkdir: created directory `/data/flume/bin'
[root@yz-sre-backup019 data]# cd flume/
[root@yz-sre-backup019 flume]# ll
total 12
drwxr-xr-x 2 root root 4096 Apr 10 10:34 bin    # holds the startup scripts
drwxr-xr-x 2 root root 4096 Apr 10 10:34 job    # holds the agent configuration files used to start Flume
drwxr-xr-x 2 root root 4096 Apr 10 10:34 log    # holds the startup/run logs

(2) Configure the agent

Agent component selection: netcat source + memory channel + logger sink

Create a file named flume-netcat.conf under /data/flume/job (the directory and file name can be anything you like; they are only referenced later when the agent is started).

# flume-netcat.conf: A single_node Flume configuration
  
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
  
# Describe/configure the source
# Source type
a1.sources.r1.type = netcat
# Host the source binds to
a1.sources.r1.bind = localhost
# Port the source binds to
a1.sources.r1.port = 8888
  
# Sink type; here we use logger, which prints events to the console
a1.sinks.k1.type = logger
  
# Channel type: memory, with a capacity of 1000 events and a transaction size of 100
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
a1.channels.c1.transactionCapacity = 100
  
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
a1.sources.r1.channels = c1
# Attach the sink to the channel; a sink can drain only one channel
a1.sinks.k1.channel = c1

(3) Start the agent

[root@yz-sre-backup019 ~]# flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-netcat.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console
// After it starts, open another terminal window and run the tests below

(4) Connection test

// If telnet is not available, install it first
[root@yz-sre-backup019 ~]# yum -y install telnet net-tools
[root@yz-sre-backup019 ~]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      50                                                    127.0.0.1:8888                                                         *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
LISTEN     0      50                                                            *:10501                                                        *:*
[root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飛花點點輕!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK

(5) Write a Flume startup script (recommended for production)

  1. [root@yz-sre-backup019 flume]# cd /data/flume/bin
  2. [root@yz-sre-backup019 bin]# vim start-netcat.sh
  3. [root@yz-sre-backup019 bin]# chmod +x start-netcat.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Fri Apr 10 11:13:11 CST 2020
 
# Start Flume with its built-in HTTP monitoring enabled, using the command below
# --conf: Flume's configuration directory
# --conf-file: the custom agent configuration file
# --name: the agent name, matching the name used in the configuration file (a1 here)
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --conf-file=/data/flume/job/flume-netcat.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-netcat.log 2>&1 &

(6) Start Flume and check the startup log

[root@yz-sre-backup019 bin]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
[root@yz-sre-backup019 bin]# bash start-netcat.sh
[root@yz-sre-backup019 bin]#
[root@yz-sre-backup019 bin]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      50                                                    127.0.0.1:8888                                                         *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
LISTEN     0      50                                                            *:10501                                                        *:*
[root@yz-sre-backup019 log]# telnet 127.0.0.1 8888
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飛花點點輕!
OK
Welcome to Beijing...。。
OK
Python
OK
SHOWufei
OK
[root@yz-sre-backup019 log]# tail -f flume-netcat.log
2020-04-10 15:31:36,708 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel c1
2020-04-10 15:31:36,832 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2020-04-10 15:31:36,833 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: CHANNEL, name: c1 started
2020-04-10 15:31:36,837 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink k1
2020-04-10 15:31:36,838 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source r1
2020-04-10 15:31:36,839 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2020-04-10 15:31:36,865 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:169)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:8888]
2020-04-10 15:31:36,883 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-10 15:31:36,944 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-10 15:31:36,979 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started [email protected]:10501
2020-04-10 15:32:30,851 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: E9 A3 9E E8 8A B1 E7 82 B9 E7 82 B9 E8 BD BB 21 ...............! }
2020-04-10 15:32:39,853 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 57 65 6C 63 6F 6D 65 20 74 6F 20 42 65 69 6A 69 Welcome to Beiji }
2020-04-10 15:32:55,060 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 50 79 74 68 6F 6E 0D                            Python. }
2020-04-10 15:33:00,781 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 53 48 4F 57 75 66 65 69 0D                      SHOWufei. }

(7) Install the jq tool (if it is not already present) to make the JSON output easier to read

[root@yz-sre-backup019 soft]# wget -O jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
[root@yz-sre-backup019 soft]# chmod +x ./jq
[root@yz-sre-backup019 soft]# cp jq /usr/bin

(8) View Flume metrics

 [root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
129   259    0   259    0     0  25951      0 --:--:-- --:--:-- --:--:-- 37000
{
  "CHANNEL.c1": {                           # 這是 c1 的 CHANEL 監控數據,c1 該名稱在 flume-netcat.conf 中配置文件中定義的
    "ChannelCapacity": "1000",              # channel 的容量,目前僅支持 File Channel、Memory channel 的統計數據
    "ChannelFillPercentage": "0.4",         # channel 已填入的百分比
    "Type": "CHANNEL",                      # 很顯然,這裏是CHANNEL監控項,類型爲 CHANNEL
    "EventTakeSuccessCount": "0",           # sink 成功從 channel 讀取事件的總數量
    "ChannelSize": "4",                     # 目前channel 中事件的總數量,目前僅支持 File Channel、Memory channel 的統計數據
    "EventTakeAttemptCount": "0",           # sink 嘗試從 channel 拉取事件的總次數。這不意味着每次時間都被返回,因爲 sink 拉取的時候 channel 可能沒有任何數據
    "StartTime": "1586489375175",           # channel 啓動時的毫秒值時間
    "EventPutAttemptCount": "4",            # Source 嘗試寫入 Channe 的事件總次數
    "EventPutSuccessCount": "4",            # 成功寫入 channel 且提交的事件總次數
    "StopTime": "0"                         # channel 停止時的毫秒值時間,爲 0 表示一直在運行
  }
}

Tip: many more metrics are available; see the official documentation: http://flume.apache.org/FlumeUserGuide.html#monitoring
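If you only need one field out of that JSON, jq can pull it out directly; for example (an illustrative command, assuming the agent from this exercise is still running with its monitoring port on 10501):

# Print just the number of events currently sitting in channel c1
[root@yz-sre-backup019 ~]# curl -s http://127.0.0.1:10501/metrics | jq '."CHANNEL.c1".ChannelSize'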

(9) Kill the corresponding Flume process

[root@yz-sre-backup019 ~]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      50                                                    127.0.0.1:8888                                                         *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
LISTEN     0      50                                                            *:10501                                                        *:*
[root@yz-sre-backup019 ~]# netstat -untalp  | grep 8888
tcp        0      0 127.0.0.1:8888              0.0.0.0:*                   LISTEN      4565/java
[root@yz-sre-backup019 ~]# kill 4565
[root@yz-sre-backup019 ~]# netstat -untalp  | grep 8888
[root@yz-sre-backup019 ~]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*
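If you would rather not look up the port and PID by hand, the agent can usually be matched by the name of its configuration file instead; a small shortcut, assuming pkill (from the procps package) is available on the machine:

# Kill the agent whose command line references flume-netcat.conf
[root@yz-sre-backup019 ~]# pkill -f flume-netcat.conf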

Practice 02: Monitor a file and collect newly appended data into a local file in real time

(1) Agent component selection

exec source + memory channel + file_roll sink

(2) Configure the agent

Create a file named flume-file.conf under /data/flume/job (the directory and file name can be anything you like; they are only referenced later when the agent is started).

# flume-file.conf: A single_node Flume configuration
  
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
  
# Describe/configure the source
# Source type
a1.sources.r1.type = exec
# Command the source runs
a1.sources.r1.command = tail -F /data/data.log
# Run the command through bash so the string is executed as a complete shell command
a1.sources.r1.shell = /bin/bash -c
  
# Sink type; here we use file_roll, which writes events to local files and needs an output directory
a1.sinks.k1.type = file_roll
# Local directory the sink writes to
a1.sinks.k1.sink.directory = /data/flume/data
  
# Channel type
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
a1.channels.c1.transactionCapacity = 100
  
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
a1.sources.r1.channels = c1
# Attach the sink to the channel; a sink can drain only one channel
a1.sinks.k1.channel = c1
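One preparation step worth doing before the first start: make sure the sink's output directory actually exists (the configuration above points at /data/flume/data, which was not among the directories created earlier), for example:

[root@yz-sre-backup019 ~]# mkdir -p /data/flume/data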

(3) Write the agent startup script (recommended for production)

  1. [root@yz-sre-backup019 flume]# cd /data/flume/bin
  2. [root@yz-sre-backup019 bin]# vim start-file.sh
  3. [root@yz-sre-backup019 bin]# chmod +x start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Fri Apr 10 15:05:56 CST 2020
 
# Start Flume with its built-in HTTP monitoring enabled, using the command below
# --conf: Flume's configuration directory
# --conf-file: the custom agent configuration file
# --name: the agent name, matching the name used in the configuration file (a1 here)
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &

(4) Start the agent and append test data to the monitored file,

then check the events in the file generated under /data/flume/data, and kill the Flume process when the test is done.

[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 bin]# cd /data
[root@yz-sre-backup019 data]# echo "帥飛飛!!!" >> data.log
[root@yz-sre-backup019 bin]# cd /data/flume/data
[root@yz-sre-backup019 data]# tail 1586513526197-4
帥飛飛!!!
[root@yz-sre-backup019 data]# ps -ef | grep flume
root     17682     1  1 18:12 pts/1    00:00:03 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root     17970   956  0 18:16 pts/0    00:00:00 grep flume
[root@yz-sre-backup019 data]# kill 17682
[root@yz-sre-backup019 data]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      128                                                           *:1988                                                         *:*

Practice 03: Read a local file into HDFS in real time (the Flume node must have the Hadoop cluster environment configured)

In the two exercises above, sending the data to the console or a local file has little practical value; a real requirement usually means writing to HDFS. Only the agent configuration needs to change: set the sink type to hdfs and specify the HDFS URL and write path.

 

(1) Agent component selection

exec source - memory channel - hdfs sink

(2) Configure the agent

# flume-hdfs.conf: A single_node Flume configuration
  
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent name; file_source: source name; hdfs_sink: sink name; memory_channel: channel name
  
# Describe/configure the source
# Source type
wufei03.sources.file_source.type = exec
# Command the source runs
# wufei03.sources.file_source.command = tail -F /data/messages
wufei03.sources.file_source.command = tail -F /data/data.log
# Run the command through bash so the string is executed as a complete shell command
wufei03.sources.file_source.shell = /bin/bash -c
  
# Sink type; here we use hdfs, which writes events to the HDFS cluster
wufei03.sinks.hdfs_sink.type = hdfs
# HDFS URL and write path for the sink
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# Prefix for uploaded files
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.2.24-
# Whether to roll directories based on time
wufei03.sinks.hdfs_sink.hdfs.round = true
# How many time units per directory
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# The time unit used for rounding
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# Whether to use the local timestamp
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# File type; compression is supported
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# Roll the file once it reaches this size (bytes)
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# Rolling is independent of the number of events (0 disables count-based rolling)
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# Minimum number of block replicas
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
  
# Channel type
wufei03.channels.memory_channel.type = memory
# Maximum number of events stored in the channel
wufei03.channels.memory_channel.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
wufei03.channels.memory_channel.transactionCapacity = 1000
  
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
wufei03.sources.file_source.channels = memory_channel
# Attach the sink to the channel; a sink can drain only one channel
wufei03.sinks.hdfs_sink.channel = memory_channel

(3) Write the startup script and start Flume

  1. [root@yz-sre-backup019 flume]# cd /data/flume/bin
  2. [root@yz-sre-backup019 bin]# vim start-hdfs.sh
  3. [root@yz-sre-backup019 bin]# chmod +x start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Tue Apr 14 11:56:51 CST 2020
 
# Start Flume with its built-in HTTP monitoring enabled, using the command below
# --conf: Flume's configuration directory
# --conf-file: the custom agent configuration file
# --name: the agent name, matching the name used in the configuration file (wufei03 here)
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file=/data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http  -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console  >> /data/flume/log/flume-hdfs.log  2>&1 &
  1. [root@yz-bi-web01 bin]# bash start-hdfs.sh
  2. [root@yz-bi-web01 bin]# ss -ntl
State      Recv-Q Send-Q                                            Local Address:Port                                              Peer Address:Port
.....
LISTEN     0      128                                                           *:22                                                           *:*
LISTEN     0      100                                                   127.0.0.1:25                                                           *:*
LISTEN     0      50                                                            *:10502                                                        *:*
LISTEN     0      128                                                           *:80                                                           *:*
[root@yz-bi-web01 bin]#
[root@yz-bi-web01 data]# cd /data/
[root@yz-bi-web01 data]# echo "SHOWufei" >> data.log
[root@yz-bi-web01 data]# echo "帥飛飛!!!" >> data.log

(4) Check the Flume collection log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
  
  ...
 
2020-04-14 12:16:15,648 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:169)] Exec source starting with command:tail -F /data/data.log
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: hdfs_sink: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: file_source: Successfully registered new MBean.
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: hdfs_sink started
2020-04-14 12:16:15,650 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: file_source started
2020-04-14 12:16:15,683 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2020-04-14 12:16:15,725 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] jetty-6.1.26.cloudera.4
2020-04-14 12:16:15,761 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started [email protected]:10502
2020-04-14 12:16:53,678 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 12:16:53,977 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,533 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:363)] Closing hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
2020-04-14 12:26:55,571 (hdfs-hdfs_sink-call-runner-6) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:629)] Renaming hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp to hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679
2020-04-14 12:26:55,580 (hdfs-hdfs_sink-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:394)] Writer callback called.
2020-04-14 17:47:30,793 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-14 17:47:30,839 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-14/17/gz_10.20.2.24-.1586857650794.tmp

(5) Verify that the corresponding log data has been generated in HDFS

[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14/12
Found 1 items
-rw-r--r--   3 root hadoop          9 2020-04-14 12:16 /flume/dt=2020-04-14/12/gz_10.20.2.24-.1586837813679.tmp
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-14
Found 2 items
drwxrwxrwx   - root hadoop          0 2020-04-14 12:26 /flume/dt=2020-04-14/12
drwxrwxrwx   - root hadoop          0 2020-04-14 17:47 /flume/dt=2020-04-14/17

(6)Browsing HDFS

 

(7) View Flume metrics

[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
105   841    0   841    0     0   163k      0 --:--:-- --:--:-- --:--:--  273k
{
  "SOURCE.file_source": {               # source 的名稱
    "OpenConnectionCount": "0",         # 目前與客戶端或 sink 保持連接的總數量,目前僅支持 avro source 展現該度量
    "Type": "SOURCE",                   # 當前類型爲 SOURRCE
    "AppendBatchAcceptedCount": "0",    # 成功提交到 channel 的批次的總數量
    "AppendBatchReceivedCount": "0",    # 接收到事件批次的總數量
    "EventAcceptedCount": "3",          ## 成功寫出到channel的事件總數量
    "AppendReceivedCount": "0",         # 每批只有一個事件的事件總數量(與 RPC 調用的一個 append 調用相等)
    "StopTime": "0",                    # SOURCE 停止時的毫秒值時間,0 代表一直運行着
    "StartTime": "1586837775650",       # SOURCE 啓動時的毫秒值時間
    "EventReceivedCount": "3",          ## 目前爲止 source 已經接收到的事件總數量
    "AppendAcceptedCount": "0"          # 逐條錄入的次數,單獨傳入的事件到 Channel 且成功返回的事件總數量
  },
  "SINK.hdfs_sink": {                   # sink 的名稱
    "BatchCompleteCount": "0",          # 批量處理event的個數等於批處理大小的數量
    "ConnectionFailedCount": "0",       # 連接失敗的次數
    "EventDrainAttemptCount": "3",      ## sink 嘗試寫出到存儲的事件總數量
    "ConnectionCreatedCount": "2",      # 下一個階段(或存儲系統)創建鏈接的數量(如HDFS創建一個文件)
    "Type": "SINK",                     # 當前類型爲 SINK
    "BatchEmptyCount": "2551",          # 批量處理 event 的個數爲 0 的數量(空的批量的數量),如果數量很大表示 source 寫入數據的速度比 sink 處理數據的速度慢很多
    "ConnectionClosedCount": "1",       # 連接關閉的次數
    "EventDrainSuccessCount": "3",      ## sink成功寫出到存儲的事件總數量
    "StopTime": "0",                    # SINK 停止時的毫秒值時間
    "StartTime": "1586837775650",       # SINK 啓動時的毫秒值時間
    "BatchUnderflowCount": "3"          # 批量處理 event 的個數小於批處理大小的數量(比 sink 配置使用的最大批量尺寸更小的批量的數量),如果該值很高也表示 sink 比 source 更快
  },
  "CHANNEL.memory_channel": {           # channel 的名稱
    "EventPutSuccessCount": "3",        ## 成功寫入channel且提交的事件總次數
    "ChannelFillPercentage": "0.0",     # channel已填入的百分比
    "Type": "CHANNEL",                  # 當前類型爲 CHANNEL
    "StopTime": "0",                    # CHANNEL 停止時的毫秒值時間
    "EventPutAttemptCount": "3",        ## Source 嘗試寫入 Channe 的事件總次數
    "ChannelSize": "0",                 # 目前 channel 中事件的總數量,目前僅支持 File Channel,Memory channel 的統計數據
    "StartTime": "1586837775646",       # CHANNEL 啓動時的毫秒值時間
    "EventTakeSuccessCount": "3",       ## sink 成功從 channel 讀取事件的總數量
    "ChannelCapacity": "1000",          # channel 的容量,目前僅支持 File Channel,Memory channel 的統計數據
    "EventTakeAttemptCount": "2558"     # sink 嘗試從 channel 拉取事件的總次數。這不意味着每次時間都被返回,因爲 sink 拉取的時候 channel 可能沒有任何數據
  }
}

(8) Kill the Flume process after testing

[root@yz-bi-web01 ~]# ps -ef | grep flume
root     19768 14759  0 17:57 pts/11   00:00:00 grep flume
root     26653     1  0 12:16 pts/0    00:00:26 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1
/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
[root@yz-bi-web01 ~]# kill 26653
[root@yz-bi-web01 ~]# ps -ef | grep flume
root     19777 14759  0 17:58 pts/11   00:00:00 grep flume

(9) Clean up the data on HDFS

[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$

Practice 04: Collect data from server A to server B and upload it to HDFS (server B must have the Hadoop cluster environment configured)

Key point: server A's sink type is avro, while server B's source type is avro.

 

Workflow:

  • Machine A monitors a file; log records are appended to data.log
  • The avro sink sends each newly produced log line to the configured hostname and port
  • The agent with the matching avro source then writes the logs to the console, Kafka, HDFS, and so on

(1) Machine A configuration

Agent component selection: exec source + memory channel + avro sink

# flume-file.conf: A single_node Flume configuration
 
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# a1: agent name; r1: source name; k1: sink name; c1: channel name
 
# Describe/configure the source
# Source type
a1.sources.r1.type = exec
# Command the source runs
a1.sources.r1.command = tail -F /data/data.log
# Run the command through bash so the string is executed as a complete shell command
a1.sources.r1.shell = /bin/bash -c
 
# Sink type; here we use avro, which sends events to a remote host and port, so both must be configured
a1.sinks.k1.type = avro
# Hostname the sink sends to
a1.sinks.k1.hostname = 10.20.2.24
# Port the sink sends to
a1.sinks.k1.port = 8888
 
# Channel type
a1.channels.c1.type = memory
# Maximum number of events stored in the channel
a1.channels.c1.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
a1.channels.c1.transactionCapacity = 100
 
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
a1.sources.r1.channels = c1
# Attach the sink to the channel; a sink can drain only one channel
a1.sinks.k1.channel = c1

(2) Machine B configuration

Agent component selection: avro source + memory channel + hdfs sink

# flume-hdfs.conf: A single_node Flume configuration
 
# Name the components on this agent
wufei03.sources = file_source
wufei03.sinks = hdfs_sink
wufei03.channels = memory_channel
# wufei03: agent name; file_source: source name; hdfs_sink: sink name; memory_channel: channel name
 
# Describe/configure the source
# Source type
wufei03.sources.file_source.type = avro
# Host the source binds to
wufei03.sources.file_source.bind = 10.20.2.24
# Port the source binds to
wufei03.sources.file_source.port = 8888
 
# Sink type; here we use hdfs, which writes events to the HDFS cluster
wufei03.sinks.hdfs_sink.type = hdfs
# HDFS URL and write path for the sink
wufei03.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/dt=%Y-%m-%d/%H
# Prefix for uploaded files
wufei03.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.3.36-
# Whether to roll directories based on time
wufei03.sinks.hdfs_sink.hdfs.round = true
# How many time units per directory
wufei03.sinks.hdfs_sink.hdfs.roundValue = 1
# The time unit used for rounding
wufei03.sinks.hdfs_sink.hdfs.roundUnit = hour
# Whether to use the local timestamp
wufei03.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
wufei03.sinks.hdfs_sink.hdfs.batchSize = 1000
# File type; compression is supported
wufei03.sinks.hdfs_sink.hdfs.fileType = DataStream
# How often (seconds) to roll a new file
wufei03.sinks.hdfs_sink.hdfs.rollInterval = 600
# Roll the file once it reaches this size (bytes)
wufei03.sinks.hdfs_sink.hdfs.rollSize = 134217700
# Rolling is independent of the number of events (0 disables count-based rolling)
wufei03.sinks.hdfs_sink.hdfs.rollCount = 0
# Minimum number of block replicas
wufei03.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
 
# Channel type
wufei03.channels.memory_channel.type = memory
# Maximum number of events stored in the channel
wufei03.channels.memory_channel.capacity = 1000
# Maximum number of events the channel takes from a source or gives to a sink per transaction
wufei03.channels.memory_channel.transactionCapacity = 100
 
# Wire the source and sink to the channel
# Attach the source to the channel; the property is plural (channels) because a source can feed multiple channels
wufei03.sources.file_source.channels = memory_channel
# Attach the sink to the channel; a sink can drain only one channel
wufei03.sinks.hdfs_sink.channel = memory_channel

(3) Write the startup scripts

// Machine A startup script
[root@yz-sre-backup019 bin]# vim start-file.sh
[root@yz-sre-backup019 bin]# cat start-file.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 11:22:24 CST 2020
 
# Start Flume with its built-in HTTP monitoring enabled, using the command below
# --conf: Flume's configuration directory
# --conf-file: the custom agent configuration file
# --name: the agent name, matching the name used in the configuration file (a1 here)
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name a1 --conf-file /data/flume/job/flume-file.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-file.log 2>&1 &
  
// Machine B startup script
[root@yz-bi-web01 bin]# vim start-hdfs.sh
[root@yz-bi-web01 bin]# cat start-hdfs.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 11:22:24 CST 2020
 
# Start Flume with its built-in HTTP monitoring enabled, using the command below
# --conf: Flume's configuration directory
# --conf-file: the custom agent configuration file
# --name: the agent name, matching the name used in the configuration file (wufei03 here)
# -Dflume.root.logger: log level and output target
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name wufei03 --conf-file=/data/flume/job/flume-hdfs.conf -Dflume.monitoring.type=http  -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console  >> /data/flume/log/flume-hdfs.log  2>&1 &

(4) Run the scripts and check the corresponding logs

// Machine A
[root@yz-sre-backup019 bin]# ss -ntl
State       Recv-Q Send-Q                                   Local Address:Port                                     Peer Address:Port
LISTEN      0      128                                                  *:22                                                  *:*
LISTEN      0      100                                          127.0.0.1:25                                                  *:*
LISTEN      0      128                                                  *:1988                                                *:*
// Machine B
[root@yz-bi-web01 bin]# ss -ntl
State       Recv-Q Send-Q                                   Local Address:Port                                     Peer Address:Port
LISTEN      0      128                                                  *:22                                                  *:*
LISTEN      0      100                                          127.0.0.1:25                                                  *:*
LISTEN      0      128                                                  *:1988                                                *:*
  
// Start machine B's agent first
[root@yz-bi-web01 bin]# bash start-hdfs.sh
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access
// Then start machine A's agent
[root@yz-sre-backup019 bin]# bash start-file.sh
[root@yz-sre-backup019 log]# tail -f flume-file.log
Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
2020-04-15 11:55:35,409 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 11:55:35,416 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started
  
// Insert test data
[root@yz-sre-backup019 ~]# cd /data/
[root@yz-sre-backup019 data]# echo "帥飛飛!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 11:55:51,147 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859.tmp
[root@yz-sre-backup019 data]# echo "SHOWufei!!!" >> data.log
[root@yz-bi-web01 log]# tail -f flume-hdfs.log
2020-04-15 12:03:14,139 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp

(5) Verify that the corresponding log data has been generated in HDFS

[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15
Found 2 items
drwxrwxrwx   - root hadoop          0 2020-04-15 12:05 /flume/dt=2020-04-15/11
drwxrwxrwx   - root hadoop          0 2020-04-15 12:03 /flume/dt=2020-04-15/12
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/11
Found 1 items
-rw-r--r--   3 root hadoop         19 2020-04-15 12:05 /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/11/gz_10.20.3.36-.1586922950859
帥飛飛!!!
[root@yz-bi-web01 ~]# hdfs dfs -ls /flume/dt=2020-04-15/12
Found 1 items
-rw-r--r--   3 root hadoop         18 2020-04-15 12:03 /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969.tmp
[root@yz-bi-web01 ~]# hadoop fs -cat /flume/dt=2020-04-15/12/gz_10.20.3.36-.1586923393969
SHOWufei!!!

(6) View Flume metrics

// Machine A
[root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
101   811    0   811    0     0   4924      0 --:--:-- --:--:-- --:--:--  4975
{
  "SINK.k1": {
    "ConnectionCreatedCount": "1",
    "ConnectionClosedCount": "0",
    "Type": "SINK",
    "BatchCompleteCount": "0",
    "BatchEmptyCount": "109",
    "EventDrainAttemptCount": "2",
    "StartTime": "1586922935645",
    "EventDrainSuccessCount": "2",
    "BatchUnderflowCount": "2",
    "StopTime": "0",
    "ConnectionFailedCount": "0"
  },
  "CHANNEL.c1": {
    "ChannelCapacity": "1000",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "ChannelSize": "0",
    "EventTakeSuccessCount": "2",
    "EventTakeAttemptCount": "114",
    "StartTime": "1586922935643",
    "EventPutAttemptCount": "2",
    "EventPutSuccessCount": "2",
    "StopTime": "0"
  },
  "SOURCE.r1": {
    "EventReceivedCount": "2",
    "AppendBatchAcceptedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "2",
    "AppendReceivedCount": "0",
    "StartTime": "1586922935652",
    "AppendAcceptedCount": "0",
    "OpenConnectionCount": "0",
    "AppendBatchReceivedCount": "0",
    "StopTime": "0"
  }
}
  
// Machine B
[root@yz-bi-web01 ~]# curl http://127.0.0.1:10502/metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
104   839    0   839    0     0   7163      0 --:--:-- --:--:-- --:--:--  7295
{
  "SOURCE.file_source": {
    "OpenConnectionCount": "1",
    "Type": "SOURCE",
    "AppendBatchReceivedCount": "2",
    "AppendBatchAcceptedCount": "2",
    "EventAcceptedCount": "2",
    "AppendReceivedCount": "0",
    "StopTime": "0",
    "StartTime": "1586922913313",
    "EventReceivedCount": "2",
    "AppendAcceptedCount": "0"
  },
  "SINK.hdfs_sink": {
    "BatchCompleteCount": "0",
    "ConnectionFailedCount": "0",
    "EventDrainAttemptCount": "2",
    "ConnectionCreatedCount": "2",
    "Type": "SINK",
    "BatchEmptyCount": "117",
    "ConnectionClosedCount": "1",
    "EventDrainSuccessCount": "2",
    "StopTime": "0",
    "StartTime": "1586922912838",
    "BatchUnderflowCount": "2"
  },
  "CHANNEL.memory_channel": {
    "EventPutSuccessCount": "2",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "EventPutAttemptCount": "2",
    "ChannelSize": "0",
    "StopTime": "0",
    "StartTime": "1586922912835",
    "EventTakeSuccessCount": "2",
    "ChannelCapacity": "1000",
    "EventTakeAttemptCount": "121"
  }
}

(7) Kill the Flume processes and clean up the HDFS data after testing

// Kill machine A's Flume process first
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root     10492  6728  0 11:54 pts/2    00:00:00 tail -f flume-file.log
root     10500     1  0 11:55 pts/0    00:00:09 /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/* -Djava.library.path= org.apache.flume.node.Application --name a1 --conf-file /data/flume/job/flume-file.conf
root     11084  5377  0 12:12 pts/0    00:00:00 grep flume
[root@yz-sre-backup019 bin]# kill 10500
[root@yz-sre-backup019 bin]# ps -ef | grep flume
root     10492  6728  0 11:54 pts/2    00:00:00 tail -f flume-file.log
root     11092  5377  0 12:12 pts/0    00:00:00 grep flume
  
// Then kill machine B's Flume process
[root@yz-bi-web01 ~]# ps -ef | grep flume
root      5725 16077  0 11:54 pts/11   00:00:00 tail -f flume-hdfs.log
root      5735     1  1 11:55 pts/0    00:00:20 /usr/local/jdk1.7.0_76/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console -cp /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/hadoop/hadoop-2.7.1/etc/hadoop:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/activation-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/asm-3.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/avro-1.7.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-cli-1.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-codec-1.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-collections-3.2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-compress-1.4.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-configuration-1.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-httpclient-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-io-2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-lang-2.6.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-math3-3.1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/commons-net-3.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-client-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-framework-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/gson-2.2.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/guava-11.0.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-annotations-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-auth-2.7.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/hamcrest-core-1.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpclient-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/httpcore-4.2.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-core-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-json-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jersey-server-1.9.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jets3t-0.9.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jettison-1.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-6.1.26.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jetty-util-6.1.26.jar:/hadoop/hadoop-2.7.1
/share/hadoop/common/lib/jsch-0.1.42.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsp-api-2.1.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/jsr305-3.0.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/junit-4.11.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/log4j-1.2.17.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/mockito-all-1.8.5.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/netty-3.6.2.Final.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/paranamer-2.3.jar:/hadoop/hadoop-2.7.1/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/hadoop/hadoop-2.7.1/share/hadoop/com
root     18949  9025  0 12:12 pts/12   00:00:00 grep flume
[root@yz-bi-web01 ~]# kill 5735
[root@yz-bi-web01 ~]# ps -ef | grep flume
root      5725 16077  0 11:54 pts/11   00:00:00 tail -f flume-hdfs.log
root     18963  9025  0 12:12 pts/12   00:00:00 grep flume
  
// Clean up the data on HDFS
[root@yz-bi-web01 ~]# su - hadoop
[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume
[hadoop@yz-bi-web01 ~]$ exit
logout
[root@yz-bi-web01 ~]#

Practice 05: Aggregate data from multiple Flume agents into a single Flume agent (the aggregation node must have the Hadoop cluster environment configured)

(1) Workflow

  • Agent1 monitors the file /data/data.log (exec source - memory channel - avro sink)
  • Agent2 monitors a network port (netcat source - memory channel - avro sink)
  • Agent3 monitors the contents of a given directory in real time (spooldir source - memory channel - avro sink)
  • Agent1, Agent2, and Agent3 send their data to Agent4
  • Agent4 writes the final data to HDFS (avro source - memory channel - hdfs sink)

(2) Write the corresponding agent configuration files

[root@yz-sre-backup019 job]# vim agent1-exec.conf
[root@yz-sre-backup019 job]# cat agent1-exec.conf
# agent1-exec.conf: Clusterde_node Flume configuration

# Name the components on this agent
agent1.sources = exec_source
agent1.sinks = avro_sink
agent1.channels = memory_channel
# agent1: agent 的名稱; exec_source: source 的名稱; avro_sink: sink 的名稱; memory_channel: channel 的名稱

# Describe/configure the source
# 配置 source 的類型
agent1.sources.exec_source.type = exec
# 配置 source 執行的命令
agent1.sources.exec_source.command = tail -F /data/data.log
# 配置 source 讓 bash 將一個字符串作爲完整的命令來執行
agent1.sources.exec_source.shell = /bin/bash -c

# 指定 sink 的類型,我們這裏指定的爲 avro,即將數據發送到端口,需要設置端口名稱、端口號
agent1.sinks.avro_sink.type = avro
# 配置 sink 主機名稱
agent1.sinks.avro_sink.hostname = 10.20.2.24
# 配置 sink 主機端口
agent1.sinks.avro_sink.port = 6666

# 配置 channel 的類型
agent1.channels.memory_channel.type = memory
# 配置通道中存儲的最大 event 數
agent1.channels.memory_channel.capacity = 1000
# 配置 channel 每次事務中從 source 接收或向 sink 傳遞的最大 event 數
agent1.channels.memory_channel.transactionCapacity = 100

# 綁定 source 和 sink
# 把 source 和 channel 做關聯,其中屬性是 channels,說明 sources 可以和多個 channel 做關聯
agent1.sources.exec_source.channels = memory_channel
# 把 sink 和 channel 做關聯,只能輸出到一個 channel
agent1.sinks.avro_sink.channel = memory_channel

[root@yz-bi-web01 job]# vim agent4-hdfs.conf
[root@yz-bi-web01 job]# cat agent4-hdfs.conf
# agent4-hdfs.conf: Clustered_node Flume configuration

# Name the components on this agent
agent4.sources = avro_source
agent4.sinks = hdfs_sink
agent4.channels = memory_channel
# agent4: agent 的名稱; avro_source: source 的名稱; hdfs_sink: sink 的名稱; memory_channel: channel 的名稱

# Describe/configure the source
# 配置 source 的類型
agent4.sources.avro_source.type = avro
# 配置 source 綁定主機
agent4.sources.avro_source.bind = 10.20.2.24
# 配置 source 綁定主機端口
agent4.sources.avro_source.port = 6666

# 指定 sink 的類型,我們這裏指定的爲 hdfs
# 配置 sink 的類型,將數據傳輸到 HDFS 集羣
agent4.sinks.hdfs_sink.type = hdfs
# 配置 sink 輸出到本 hdfs 的 url 和寫入路徑
agent4.sinks.hdfs_sink.hdfs.path = hdfs://yz-higo-nn1:9000/flume/%Y-%m-%d/%H
# 上傳文件的前綴
agent4.sinks.hdfs_sink.hdfs.filePrefix = gz_10.20.2.24
# 是否按照時間滾動文件夾
agent4.sinks.hdfs_sink.hdfs.round = true
# 多少時間單位創建一個文件夾
agent4.sinks.hdfs_sink.hdfs.roundValue = 1
# 重新定義時間單位
agent4.sinks.hdfs_sink.hdfs.roundUnit = hour
# 是否使用本地時間戳
agent4.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
# 積攢多少個 event 才 flush 到 hdfs 一次
agent4.sinks.hdfs_sink.hdfs.batchSize = 100
# 設置文件類型,可支持壓縮
agent4.sinks.hdfs_sink.hdfs.fileType = DataStream
# 多久生成一個新文件
agent4.sinks.hdfs_sink.hdfs.rollInterval = 60
# 設置每個文件的滾動大小
agent4.sinks.hdfs_sink.hdfs.rollSize = 134217700
# 文件的滾動與 event 數量無關
agent4.sinks.hdfs_sink.hdfs.rollCount = 0
# 最小副本數
agent4.sinks.hdfs_sink.hdfs.minBlockReplicas = 1
# 和 agent3 的 basenameHeader,basenameHeaderKey 兩個屬性一起用可以保持原文件名稱上傳
agent4.sinks.hdfs_sink.hdfs.filePrefix = %{fileName}

# 配置 channel 的類型
agent4.channels.memory_channel.type = memory
# 配置通道中存儲的最大 event 數
agent4.channels.memory_channel.capacity = 1000
# 配置 channel 每次事務中從 source 接收或向 sink 傳遞的最大 event 數
agent4.channels.memory_channel.transactionCapacity = 100

# 綁定 source 和 sink
# 把 source 和 channel 做關聯,其中屬性是 channels,說明 sources 可以和多個 channel 做關聯
agent4.sources.avro_source.channels = memory_channel
# 把 sink 和 channel 做關聯,只能輸出到一個 channel
agent4.sinks.hdfs_sink.channel = memory_channel
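
需要注意,這份配置中 filePrefix 出現了兩次,後面的 %{fileName} 會覆蓋前面的 gz_10.20.2.24;從下文的測試結果可以看到:帶 fileName header 的 event(即 agent3 經 spooldir 採集的數據)會以原文件名作前綴,而 agent1、agent2 的 event 沒有該 header,前綴爲空。結合時間轉義路徑 %Y-%m-%d/%H,落地效果大致如下(可與後面 agent4 的日誌對照):

hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/.1586942164150              // 來自 agent1/agent2,前綴爲空
hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/wufei.csdn.1586942797865    // 來自 agent3,以原文件名作前綴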

[root@yz-sre-backup019 job]# vim agent2-netcat.conf
[root@yz-sre-backup019 job]# cat agent2-netcat.conf
# agent2-netcat.conf: Clustered_node Flume configuration

# Name the components on this agent
agent2.sources = netcat_source
agent2.sinks = avro_sink
agent2.channels = memory_channel
# agent2: agent 的名稱; netcat_source: source 的名稱; avro_sink: sink 的名稱; memory_channel: channel 的名稱

# Describe/configure the source
# 配置 source 的類型
agent2.sources.netcat_source.type = netcat
# 配置 source 綁定的主機
agent2.sources.netcat_source.bind = 127.0.0.1
# 配置 source 綁定的主機端口
agent2.sources.netcat_source.port = 8888

# 指定 sink 的類型,我們這裏指定的爲 avro,即將數據發送到端口,需要設置端口名稱、端口號
agent2.sinks.avro_sink.type = avro
# 配置 sink 主機名稱
agent2.sinks.avro_sink.hostname = 10.20.2.24
# 配置 sink 主機端口
agent2.sinks.avro_sink.port = 6666

# 指定 channel 的類型爲 memory,指定 channel 的容量是 1000,每次傳輸的容量是 100
# 配置 channel 的類型
agent2.channels.memory_channel.type = memory
# 配置通道中存儲的最大 event 數
agent2.channels.memory_channel.capacity = 1000
# 配置 channel 每次事務中從 source 接收或向 sink 傳遞的最大 event 數
agent2.channels.memory_channel.transactionCapacity = 100

# 綁定 source 和 sink
# 把 source 和 channel 做關聯,其中屬性是 channels,說明 sources 可以和多個 channel 做關聯
agent2.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做關聯,只能輸出到一個 channel
agent2.sinks.avro_sink.channel = memory_channel

[root@yz-sre-backup019 job]# vim agent3-dir.conf
[root@yz-sre-backup019 job]# cat agent3-dir.conf
# agent3-dir.conf: Clustered_node Flume configuration

# Name the components on this agent
agent3.sources = spooldir_source
agent3.sinks = avro_sink
agent3.channels = memory_channel
# agent3: agent 的名稱; spooldir_source: source 的名稱; avro_sink: sink 的名稱; memory_channel: channel 的名稱

# Describe/configure the source
# 配置 source 的類型,監視一個文件夾,需要文件夾路徑
agent3.sources.spooldir_source.type = spooldir
# 配置 source 監視文件夾路徑
agent3.sources.spooldir_source.spoolDir = /data/flume/upload
# 配置文件採集完成後追加的後綴
agent3.sources.spooldir_source.fileSuffix = .COMPLETED
# 是否在 event 的 header 中記錄文件的絕對路徑
agent3.sources.spooldir_source.fileHeader = true
# 忽略所有以.tmp 結尾的文件,不上傳
agent3.sources.spooldir_source.ignorePattern = ([^ ]*\.tmp)
# 獲取源文件名稱,方便下面的 sink 調用變量 fileName
agent3.sources.spooldir_source.basenameHeader = true
agent3.sources.spooldir_source.basenameHeaderKey = fileName

# 指定 sink 的類型,我們這裏指定的爲 avro,即將數據發送到端口,需要設置端口名稱、端口號
agent3.sinks.avro_sink.type = avro
# 配置 sink 主機名稱
agent3.sinks.avro_sink.hostname = 10.20.2.24
# 配置 sink 主機端口
agent3.sinks.avro_sink.port = 6666

# 配置 channel 的類型
agent3.channels.memory_channel.type = memory
# 配置通道中存儲的最大 event 數
agent3.channels.memory_channel.capacity = 1000
# 配置 channel 每次事務中從 source 接收或向 sink 傳遞的最大 event 數
agent3.channels.memory_channel.transactionCapacity = 100

# 綁定 source 和 sink
# 把 source 和 channel 做關聯,其中屬性是 channels,說明 sources 可以和多個 channel 做關聯
agent3.sources.spooldir_source.channels = memory_channel
# 把 sink 和 channel 做關聯,只能輸出到一個 channel
agent3.sinks.avro_sink.channel = memory_channel

[root@yz-sre-backup019 job]# mkdir -pv  /data/flume/upload // 創建測試監視文件夾

(3)編寫相應的 agent 啓動腳本

  • [root@yz-sre-backup019 bin]# vim start-agent1.sh
  • [root@yz-sre-backup019 bin]# cat start-agent1.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 16:29:37 CST 2020

# 啓動 flume 自身的監控參數,默認執行以下腳本
# --conf: flume 的配置目錄
# --conf-file: 自定義 flume 的 agent 配置文件
# --name: 指定 agent 的名稱,與自定義 agent 配置文件中對應,即 agent1
# -Dflume.root.logger: 日誌級別和輸出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent1 --conf-file /data/flume/job/agent1-exec.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger=INFO,console >> /data/flume/log/agent1-exec.log 2>&1 &

  • [root@yz-sre-backup019 bin]# vim start-agent3.sh
  • [root@yz-sre-backup019 bin]# cat start-agent3.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 16:29:37 CST 2020

# 啓動 flume 自身的監控參數,默認執行以下腳本
# --conf: flume 的配置目錄
# --conf-file: 自定義 flume 的 agent 配置文件
# --name: 指定 agent 的名稱,與自定義 agent 配置文件中對應,即 agent3
# -Dflume.root.logger: 日誌級別和輸出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent3 --conf-file /data/flume/job/agent3-dir.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10503 -Dflume.root.logger=INFO,console >> /data/flume/log/agent3-dir.log 2>&1 &

  • [root@yz-sre-backup019 bin]# vim start-agent2.sh
  • [root@yz-sre-backup019 bin]# cat start-agent2.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 16:29:37 CST 2020

# 啓動 flume 自身的監控參數,默認執行以下腳本
# --conf: flume 的配置目錄
# --conf-file: 自定義 flume 的 agent 配置文件
# --name: 指定 agent 的名稱,與自定義 agent 配置文件中對應,即 agent2
# -Dflume.root.logger: 日誌級別和輸出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent2 --conf-file /data/flume/job/agent2-netcat.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger=INFO,console >> /data/flume/log/agent2-netcat.log 2>&1 &

  • [root@yz-bi-web01 bin]# vim start-agent4.sh
  • [root@yz-bi-web01 bin]# cat start-agent4.sh

#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Wed Apr 15 16:36:36 CST 2020

# 啓動 flume 自身的監控參數,默認執行以下腳本
# --conf: flume 的配置目錄
# --conf-file: 自定義 flume 的 agent 配置文件
# --name: 指定 agent 的名稱,與自定義 agent 配置文件中對應,即 agent4
# -Dflume.root.logger: 日誌級別和輸出形式
nohup flume-ng agent --conf ${FLUME_HOME}/conf --name agent4 --conf-file=/data/flume/job/agent4-hdfs.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=10504 -Dflume.root.logger=INFO,console >> /data/flume/log/agent4-hdfs.log 2>&1 &

(4)分別啓動各 agent 並查看對應的日誌信息

  • [root@yz-bi-web01 bin]# bash start-agent4.sh
  • [root@yz-bi-web01 bin]# tail -f /data/flume/log/agent4-hdfs.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/hadoop/hadoop/bin/hadoop) for HDFS access
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including HBASE libraries found via (/hadoop/hbase/bin/hbase) for HBASE access
Info: Excluding /hadoop/hbase/lib/slf4j-api-1.7.7.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-api-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar from classpath
Info: Excluding /hadoop/hadoop-2.7.1/lib/tez/lib/slf4j-api-1.7.10.jar from classpath
Info: Including Hive libraries found via (/hadoop/hive) for Hive access

……

  • [root@yz-sre-backup019 bin]# bash start-agent2.sh
  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/agent2-netcat.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10502 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name agent2 --conf-file /data/flume/job/agent2-netcat.conf
2020-04-15 16:55:45,585 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 16:55:45,592 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started

……

  • [root@yz-sre-backup019 bin]# bash start-agent1.sh
  • [root@yz-sre-backup019 bin]# tail -f /data/flume/log/agent1-exec.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10501 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name agent1 --conf-file /data/flume/job/agent1-exec.conf
2020-04-15 16:54:29,551 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 16:54:29,558 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started

……

  • [root@yz-sre-backup019 bin]# bash start-agent3.sh
  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/agent3-dir.log

Info: Sourcing environment configuration script /root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10503 -Dflume.root.logger==INFO,console -cp '/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --name agent3 --conf-file /data/flume/job/agent3-dir.conf
2020-04-15 16:57:24,138 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting
2020-04-15 16:57:24,145 (lifecycleSupervisor-1-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:78)] Configuration provider started

……
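
四個 agent 都啓動後,可以順手確認各自的 HTTP 監控端口已處於監聽狀態(示例命令,端口與前面啓動腳本中 -Dflume.monitoring.port 的取值一一對應):

[root@yz-sre-backup019 ~]# ss -ntl | grep -E '1050[123]'
[root@yz-bi-web01 ~]# ss -ntl | grep 10504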

(5)分別進行測試並查看對應的日誌信息

// agent1 測試

  • [root@yz-sre-backup019 ~]# echo "帥飛飛!!!" >> /data/data.log
  • [root@yz-bi-web01 log]# tail -f /data/flume/log/agent4-hdfs.log

2020-04-15 17:16:04,150 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:16:04,444 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/.1586942164150.tmp

// agent2 測試

  • [root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
飛花點點輕!
OK

  • [root@yz-bi-web01 log]# tail -f /data/flume/log/agent4-hdfs.log

2020-04-15 17:17:07,096 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:17:07,143 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/.1586942227097.tmp

// agent3 測試
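
agent3 的測試方式是向被監控目錄 /data/flume/upload 中放入新文件,大致操作如下(示例命令,文件名與內容是按下文 HDFS 中的結果推斷的):

  • [root@yz-sre-backup019 ~]# echo "https://showufei.blog.csdn.net" > /data/flume/upload/wufei.csdn
  • [root@yz-sre-backup019 ~]# echo "https://showufei.blog.csdn.net" > /data/flume/upload/wufei.py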

 

  • [root@yz-bi-web01 log]# tail -f /data/flume/log/agent4-hdfs.log

2020-04-15 17:26:37,864 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:26:37,902 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/wufei.csdn.1586942797865.tmp

2020-04-15 17:27:18,969 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:58)] Serializer = TEXT, UseRawLocalFileSystem = false
2020-04-15 17:27:19,009 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://yz-higo-nn1:9000/flume/2020-04-15/17/wufei.py.1586942838970.tmp

// 查看 hdfs 對應目錄是否生成相應的日誌信息

  • [root@yz-bi-web01 ~]# su - hadoop
  • [hadoop@yz-bi-web01 ~]$ hdfs dfs -ls /flume/2020-04-15/17

Found 4 items
-rw-r--r-- 3 root hadoop 33 2020-04-15 17:17 /flume/2020-04-15/17/.1586942164150
-rw-r--r-- 3 root hadoop 18 2020-04-15 17:18 /flume/2020-04-15/17/.1586942227097
-rw-r--r-- 3 root hadoop 31 2020-04-15 17:27 /flume/2020-04-15/17/wufei.csdn.1586942797865
-rw-r--r-- 3 root hadoop 31 2020-04-15 17:28 /flume/2020-04-15/17/wufei.py.1586942838970

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/.1586942164150

帥飛飛!!!

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/.1586942227097

飛花點點輕!

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/wufei.csdn.1586942797865

https://showufei.blog.csdn.net

  • [hadoop@yz-bi-web01 ~]$ hadoop fs -cat /flume/2020-04-15/17/wufei.py.1586942838970

https://showufei.blog.csdn.net

 

(6)查看 flume 度量值
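
下面的度量值是通過 Flume 自帶的 HTTP JSON 監控接口獲取的;以 agent1(監控端口 10501)爲例,大致命令如下(主機與端口按各啓動腳本中 -Dflume.monitoring.port 的取值替換;加上 -s 參數可以去掉輸出前面的那幾行下載進度統計):

  • [root@yz-sre-backup019 ~]# curl http://127.0.0.1:10501/metrics | jq .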

// agent1

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
104 839 0 839 0 0 5109 0 --:--:-- --:--:-- --:--:-- 5179
{
"CHANNEL.memory_channel": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"EventTakeSuccessCount": "2",
"ChannelSize": "0",
"EventTakeAttemptCount": "409",
"StartTime": "1586940869786",
"EventPutAttemptCount": "2",
"EventPutSuccessCount": "2",
"StopTime": "0"
},
"SOURCE.exec_source": {
"EventReceivedCount": "2",
"AppendBatchAcceptedCount": "0",
"Type": "SOURCE",
"EventAcceptedCount": "2",
"AppendReceivedCount": "0",
"StartTime": "1586940869797",
"AppendAcceptedCount": "0",
"OpenConnectionCount": "0",
"AppendBatchReceivedCount": "0",
"StopTime": "0"
},
"SINK.avro_sink": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "404",
"EventDrainAttemptCount": "2",
"StartTime": "1586940869789",
"EventDrainSuccessCount": "2",
"BatchUnderflowCount": "2",
"StopTime": "0",
"ConnectionFailedCount": "0"
}
}

 

// agent3

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
105 843 0 843 0 0 5025 0 --:--:-- --:--:-- --:--:-- 5109
{
"CHANNEL.memory_channel": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"EventTakeSuccessCount": "2",
"ChannelSize": "0",
"EventTakeAttemptCount": "188",
"StartTime": "1586942644042",
"EventPutAttemptCount": "2",
"EventPutSuccessCount": "2",
"StopTime": "0"
},
"SINK.avro_sink": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "184",
"EventDrainAttemptCount": "2",
"StartTime": "1586942644045",
"EventDrainSuccessCount": "2",
"BatchUnderflowCount": "2",
"StopTime": "0",
"ConnectionFailedCount": "0"
},
"SOURCE.spooldir_source": {
"EventReceivedCount": "2",
"AppendBatchAcceptedCount": "2",
"Type": "SOURCE",
"AppendReceivedCount": "0",
"EventAcceptedCount": "2",
"StartTime": "1586942644126",
"AppendAcceptedCount": "0",
"OpenConnectionCount": "0",
"AppendBatchReceivedCount": "2",
"StopTime": "0"
}
}

 

// agent2

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
113 567 0 567 0 0 3419 0 --:--:-- --:--:-- --:--:-- 3478
{
"CHANNEL.memory_channel": {
"ChannelCapacity": "1000",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"ChannelSize": "0",
"EventTakeSuccessCount": "3",
"EventTakeAttemptCount": "357",
"StartTime": "1586941307761",
"EventPutAttemptCount": "3",
"EventPutSuccessCount": "3",
"StopTime": "0"
},
"SINK.avro_sink": {
"ConnectionCreatedCount": "1",
"ConnectionClosedCount": "0",
"Type": "SINK",
"BatchCompleteCount": "0",
"BatchEmptyCount": "351",
"EventDrainAttemptCount": "3",
"StartTime": "1586941307764",
"EventDrainSuccessCount": "3",
"BatchUnderflowCount": "3",
"StopTime": "0",
"ConnectionFailedCount": "0"
}
}

 

// agent4

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
104 839 0 839 0 0 7874 0 --:--:-- --:--:-- --:--:-- 8067
{
"SOURCE.avro_source": {
"OpenConnectionCount": "3",
"Type": "SOURCE",
"AppendBatchReceivedCount": "6",
"AppendBatchAcceptedCount": "6",
"EventAcceptedCount": "6",
"AppendReceivedCount": "0",
"StopTime": "0",
"StartTime": "1586942110599",
"EventReceivedCount": "6",
"AppendAcceptedCount": "0"
},
"SINK.hdfs_sink": {
"BatchCompleteCount": "0",
"ConnectionFailedCount": "0",
"EventDrainAttemptCount": "6",
"ConnectionCreatedCount": "5",
"Type": "SINK",
"BatchEmptyCount": "255",
"ConnectionClosedCount": "5",
"EventDrainSuccessCount": "6",
"StopTime": "0",
"StartTime": "1586942110125",
"BatchUnderflowCount": "6"
},
"CHANNEL.memory_channel": {
"EventPutSuccessCount": "6",
"ChannelFillPercentage": "0.0",
"Type": "CHANNEL",
"StopTime": "0",
"EventPutAttemptCount": "6",
"ChannelSize": "0",
"StartTime": "1586942110121",
"EventTakeSuccessCount": "6",
"ChannelCapacity": "1000",
"EventTakeAttemptCount": "268"
}
}

 

(7)測試完刪掉 flume 進程並清除 hdfs 上數據

[hadoop@yz-bi-web01 ~]$ hdfs dfs -rm -r -f -skipTrash /flume
Deleted /flume

實戰 06:挑選器案例

channel selector:通道挑選器,選擇指定的 event 發送到指定的 channel

  1. Replicating Channel Selector:默認副本挑選器,事件均以副本方式輸出,換句話說就是有幾個 channel 就發送幾個副本
  2. multiplexing selector:多路複用挑選器,作用就是可以將不同的內容發送到指定的 channel
  3. 詳情參考官方文檔:http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-channel-selectors

流程圖:(圖略)
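
以 multiplexing 挑選器爲例,參考官方文檔,配置方式大致如下(僅爲示意片段,其中 a1、r1、c1、c2 以及 header 的取值都是假設的,需結合實際的 source、channel 定義使用):

a1.sources.r1.channels = c1 c2
# 指定挑選器類型爲 multiplexing
a1.sources.r1.selector.type = multiplexing
# 根據 event header 中 state 字段的值決定進入哪個 channel
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# 沒有匹配到時進入默認 channel
a1.sources.r1.selector.default = c1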

實戰 07:主機攔截器案例

攔截器(interceptor):是 source 端的在處理過程中能夠對數據(event)進行修改或丟棄的組件。

常見攔截器有:

  1. host interceptor:爲發送的 event 添加主機名的 header
  2. timestamp interceptor:爲發送的 event 添加時間戳的 header
  3. 更多攔截器可參考官方文檔:http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#flume-interceptors

(1)編輯主機攔截器配置文件(案例一)

agent 選型:netcat source + memory channel + logger sink

[root@yz-sre-backup019 job]# vim flume-host_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-host_interceptor.conf
# flume-host_interceptor.conf: A single_node Flume configuration
 
# Name the components on this agent
wf_host_interceptor.sources = netcat_source
wf_host_interceptor.sinks = logger_sink
wf_host_interceptor.channels = memory_channel
# wf_host_interceptor: agent 的名稱; netcat_source: source 的名稱; logger_sink: sink 的名稱; memory_channel: channel 的名稱
 
# Describe/configure the source
# 配置 source 的類型
wf_host_interceptor.sources.netcat_source.type = netcat
# 配置 source 綁定的主機
wf_host_interceptor.sources.netcat_source.bind = 127.0.0.1
# 配置 source 綁定的主機端口
wf_host_interceptor.sources.netcat_source.port = 8888
 
# 指定添加攔截器
wf_host_interceptor.sources.netcat_source.interceptors = host_interceptor
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.preserveExisting = false
# 指定 header 的 key
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.hostHeader = hostname
# 指定 header 的 value 爲主機 IP
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.useIP = true
 
# 指定 sink 的類型,我們這裏指定的爲 logger,即輸出到控制檯
wf_host_interceptor.sinks.logger_sink.type = logger
 
# 指定 channel 的類型爲 memory,指定 channel 的容量是 1000,每次傳輸的容量是 100
# 配置 channel 的類型
wf_host_interceptor.channels.memory_channel.type = memory
# 配置通道中存儲的最大 event 數
wf_host_interceptor.channels.memory_channel.capacity = 1000
# 配置 channel 每次事務中從 source 接收或向 sink 傳遞的最大 event 數
wf_host_interceptor.channels.memory_channel.transactionCapacity = 100
 
# 綁定 source 和 sink
# 把 source 和 channel 做關聯,其中屬性是 channels,說明 sources 可以和多個 channel 做關聯
wf_host_interceptor.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做關聯,只能輸出到一個 channel
wf_host_interceptor.sinks.logger_sink.channel = memory_channel

(2)編寫啓動腳本

[root@yz-sre-backup019 bin]# vim start-host_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-host_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-host_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Thu Apr 16 11:33:39 CST 2020
 
# 啓動 flume 自身的監控參數,默認執行以下腳本
# --conf: flume 的配置目錄
# --conf-file: 自定義 flume 的 agent 配置文件
# --name: 指定 agent 的名稱,與自定義 agent 配置文件中對應,即 wf_host_interceptor
# -Dflume.root.logger: 日誌級別和輸出形式
nohup flume-ng agent --conf /${FLUME_HOME}/conf --conf-file=/data/flume/job/flume-host_interceptor.conf --name wf_host_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10520 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-host_interceptor.log 2>&1 &

(3)啓動並連接到指定端口發送測試數據

  • [root@yz-sre-backup019 bin]# bash start-host_interceptor.sh
  • [root@yz-sre-backup019 bin]# ss -ntl

State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 *:10520 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*

  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/flume-host_interceptor.log

Info: Sourcing environment configuration script //root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10520 -Dflume.root.logger=INFO,console -cp '//root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --conf-file=/data/flume/job/flume-host_interceptor.conf --name wf_host_interceptor
2020-04-16 12:08:55,064 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting

……

2020-04-16 12:08:55,492 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10520
2020-04-16 12:09:41,350 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{hostname=10.20.3.36} body: 53 48 4F 57 75 66 65 69 0D SHOWufei. }
2020-04-16 12:09:50,352 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{hostname=10.20.3.36} body: 41 6E 20 69 6E 74 65 72 63 65 70 74 6F 72 20 69 An interceptor i }

  • [root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SHOWufei
OK
An interceptor is an aircraft or ground-based missile system designed to intercept and attack enemy planes.
OK

(4)編輯時間戳攔截器配置文件(案例二)

agent 選型:netcat source + memory channel + logger sink

[root@yz-sre-backup019 job]# vim flume-timestamp_interceptor.conf
[root@yz-sre-backup019 job]# cat flume-timestamp_interceptor.conf
# flume-timestamp_interceptor.conf: A single_node Flume configuration
 
# Name the components on this agent
wf_timestamp_interceptor.sources = netcat_source
wf_timestamp_interceptor.sinks = logger_sink
wf_timestamp_interceptor.channels = memory_channel
# wf_timestamp_interceptor: agent 的名稱; netcat_source: source 的名稱; logger_sink: sink 的名稱; memory_channel: channel 的名稱
 
# Describe/configure the source
# 配置 source 的類型
wf_timestamp_interceptor.sources.netcat_source.type = netcat
# 配置 source 綁定的主機
wf_timestamp_interceptor.sources.netcat_source.bind = 127.0.0.1
# 配置 source 綁定的主機端口
wf_timestamp_interceptor.sources.netcat_source.port = 8888
 
# 指定添加攔截器
wf_timestamp_interceptor.sources.netcat_source.interceptors = timestamp_interceptor
wf_timestamp_interceptor.sources.netcat_source.interceptors.timestamp_interceptor.type = timestamp
 
# 指定 sink 的類型,我們這裏指定的爲 logger,即輸出到控制檯
wf_timestamp_interceptor.sinks.logger_sink.type = logger
 
# 指定 channel 的類型爲 memory,指定 channel 的容量是 1000,每次傳輸的容量是 100
# 配置 channel 的類型
wf_timestamp_interceptor.channels.memory_channel.type = memory
# 配置通道中存儲的最大 event 數
wf_timestamp_interceptor.channels.memory_channel.capacity = 1000
# 配置 channel 每次事務中從 source 接收或向 sink 傳遞的最大 event 數
wf_timestamp_interceptor.channels.memory_channel.transactionCapacity = 100
 
# 綁定 source 和 sink
# 把 source 和 channel 做關聯,其中屬性是 channels,說明 sources 可以和多個 channel 做關聯
wf_timestamp_interceptor.sources.netcat_source.channels = memory_channel
# 把 sink 和 channel 做關聯,只能輸出到一個 channel
wf_timestamp_interceptor.sinks.logger_sink.channel = memory_channel

(5)編寫啓動腳本

[root@yz-sre-backup019 bin]# vim start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# chmod +x start-timestamp_interceptor.sh
[root@yz-sre-backup019 bin]# cat start-timestamp_interceptor.sh
#!/bin/bash
# @author: wufei
# @wiki: http://wiki.inf.lehe.com/pages/viewpage.action?pageId=39096985
# @email: [email protected]
# @Date: Thu Apr 16 12:26:26 CST 2020
 
# 啓動 flume 自身的監控參數,默認執行以下腳本
# --conf: flume 的配置目錄
# --conf-file: 自定義 flume 的 agent 配置文件
# --name: 指定 agent 的名稱,與自定義 agent 配置文件中對應,即 wf_timestamp_interceptor
# -Dflume.root.logger: 日誌級別和輸出形式
nohup flume-ng agent --conf /${FLUME_HOME}/conf --conf-file=/data/flume/job/flume-timestamp_interceptor.conf --name wf_timestamp_interceptor -Dflume.monitoring.type=http -Dflume.monitoring.port=10521 -Dflume.root.logger=INFO,console >> /data/flume/log/flume-timestamp_interceptor.log 2>&1 &

(6)啓動並連接到指定端口發送測試數據

  • [root@yz-sre-backup019 bin]# bash start-timestamp_interceptor.sh
  • [root@yz-sre-backup019 bin]# ss -ntl

State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 *:22 *:*
LISTEN 0 50 127.0.0.1:8888 *:*
LISTEN 0 50 *:10521 *:*
LISTEN 0 100 127.0.0.1:25 *:*
LISTEN 0 128 *:1988 *:*

  • [root@yz-sre-backup019 log]# tail -f /data/flume/log/flume-timestamp_interceptor.log

Info: Sourcing environment configuration script //root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf/flume-env.sh
Info: Including Hive libraries found via () for Hive access
+ exec /usr/jdk1.8.0_241/bin/java -Xmx20m -Dflume.monitoring.type=http -Dflume.monitoring.port=10521 -Dflume.root.logger=INFO,console -cp '//root/apps/apache-flume-1.6.0-cdh5.7.0-bin/conf:/root/apps/apache-flume-1.6.0-cdh5.7.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --conf-file=/data/flume/job/flume-timestamp_interceptor.conf --name wf_timestamp_interceptor
2020-04-16 12:28:54,666 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:61)] Configuration provider starting

……

2020-04-16 12:28:55,062 (conf-file-poller-0) [INFO - org.mortbay.log.Slf4jLog.info(Slf4jLog.java:67)] Started SelectChannelConnector@0.0.0.0:10521

2020-04-16 12:30:15,386 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=1587011415381} body: 53 48 4F 57 75 66 65 69 E3 80 82 2E 2E 2E E3 80 SHOWufei........ }
2020-04-16 12:30:24,945 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{timestamp=1587011424945} body: 41 6E 20 69 6E 74 65 72 63 65 70 74 6F 72 20 69 An interceptor i }

  •  [root@yz-sre-backup019 ~]# telnet 127.0.0.1 8888

Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SHOWufei...。。
OK
An interceptor is an aircraft or ground-based missile system designed to intercept and attack enemy planes.
OK
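
兩個案例中的攔截器也可以串聯到同一個 source 上,按聲明順序依次生效;示意片段如下(沿用案例一的 agent 與組件命名,僅作演示):

wf_host_interceptor.sources.netcat_source.interceptors = host_interceptor timestamp_interceptor
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
wf_host_interceptor.sources.netcat_source.interceptors.host_interceptor.hostHeader = hostname
wf_host_interceptor.sources.netcat_source.interceptors.timestamp_interceptor.type = timestamp

這樣每條 event 的 header 中會同時帶上 hostname 和 timestamp 兩個字段。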

四、在生產環境的實際應用

待實施……
