Getting Started with Flume

Original

2020-07-02 03:26

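Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. It was originally developed by Cloudera and is now an Apache project. Flume lets you plug custom data senders into a logging system to collect data, and it can apply simple processing to that data before writing it to a variety of (customizable) receivers.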

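There are currently two generations of Flume: the 0.9.x releases, collectively called Flume-OG, and the 1.x releases, collectively called Flume-NG. Flume-NG was a major rewrite and differs substantially from Flume-OG, so take care to distinguish between them.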

The overall flow is as follows.

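At its core, Flume collects data from a data source (source) and delivers it to a specified destination (sink). To make delivery reliable, the data is first buffered in a channel before it is sent to the sink, and Flume deletes its buffered copy only after the data has actually arrived at the sink.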
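The buffering behaviour described above can be sketched in a few lines of Python. This is only a toy model and not Flume's API: `ToyChannel` and `deliver` are hypothetical names made up for the illustration. The point it shows is that events leave the buffer only after the sink has accepted them.

```python
from collections import deque


class ToyChannel:
    """Toy stand-in for a Flume channel (hypothetical, not the real API)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        # The source side appends events to the buffer.
        self._queue.append(event)

    def take_batch(self, size):
        # Hand out up to `size` events WITHOUT removing them yet.
        return list(self._queue)[:size]

    def commit(self, count):
        # Delivery succeeded: only now drop the events from the buffer.
        for _ in range(count):
            self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def deliver(channel, sink, batch_size=10):
    """Try to move one batch to the sink; keep the events if the sink fails."""
    batch = channel.take_batch(batch_size)
    try:
        sink(batch)
    except Exception:
        return False  # rollback: nothing was removed from the channel
    channel.commit(len(batch))
    return True


if __name__ == "__main__":
    channel = ToyChannel()
    for line in ["a", "b", "c"]:
        channel.put(line)

    def broken_sink(batch):
        raise IOError("destination unavailable")

    delivered = []
    print(deliver(channel, broken_sink), len(channel))       # failed attempt
    print(deliver(channel, delivered.extend), len(channel))  # successful attempt
```

After the failed attempt the three events are still in the channel; they are removed only once the second sink call returns normally.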

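What flows through the whole pipeline is the event, and the transactional guarantee is made at the event level. An event wraps the data being transported and is the basic unit of transfer in Flume; for a text file, an event is usually one line. An event travels from the source to the channel and then to the sink. Its body is a byte array, and it can carry headers. An event represents the smallest complete unit of data, arriving from an external source and leaving for an external destination. A complete event consists of the event headers and the event body (for a text file, the body is the single line record).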
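To make the shape of an event concrete, here is a small Python sketch. It is not Flume's own Event interface: `ToyEvent`, `events_from_text`, and the `source` header key are all hypothetical names chosen for the illustration. It turns each line of a text log into one event with a byte-array body and a header map.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyEvent:
    """Toy model of an event: a byte-array body plus optional string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def events_from_text(text: str, source_name: str) -> List[ToyEvent]:
    """Turn each non-empty line of a text log into one event."""
    events = []
    for line in text.splitlines():
        if line:
            events.append(
                ToyEvent(body=line.encode("utf-8"),
                         headers={"source": source_name})  # hypothetical header
            )
    return events


if __name__ == "__main__":
    for event in events_from_text("first line\nsecond line\n", "hive.log"):
        print(event.headers, event.body)
```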

That is enough of an introduction; let's move on to the configuration.

Configuration is simple: just set the Java path in flume-env.sh.

 
# Environment variables can be set here.
export JAVA_HOME=/opt/jdk1.8.0_11

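To collect data into HDFS, the relevant Hadoop jars have to be added to the lib directory under the Flume installation. They can be found in the Hadoop directory (mine is a CDH build).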

First, look at the main options of the flume-ng command.

 
bin/flume-ng
global options:
  --conf,-c <conf>          use configs in <conf> directory
  -Dproperty=value          sets a Java system property value
agent options:
  --name,-n <name>          the name of this agent (required)
  --conf-file,-f <file>     specify a config file (required if -z missing)

We need to write the agent ourselves; see the Flume User Guide on the official site for the available properties.

The first agent is flume-hive_tail.conf (I created this file under Flume's conf directory).

(I changed the location of hive.log to a logs folder that I created under the Hive directory.)

 
Collect logs
	Hive's log
	/opt/cdh-5.5.0/hive-1.1.0-cdh5.5.0/logs/hive.log
source:      exec source, reads by running a command such as "tail -f"
channel:     memory channel, buffers events in memory
sink:        HDFS sink, writes to HDFS under /user/bpf/flume/hive_logs/

 
Write the agent
# define agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2

# define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/cdh-5.5.0/hive-1.1.0-cdh5.5.0/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# define sinks
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://BPF:9000/user/bpf/flume/hive_logs/
a2.sinks.k2.hdfs.fileType = DataStream
a2.sinks.k2.hdfs.writeFormat = Text
a2.sinks.k2.hdfs.batchSize = 10

# bind channels to sources and sinks
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

 
bin/flume-ng agent \
-c conf \
-n a2 \
-f conf/flume-hive_tail.conf \
-Dflume.root.logger=DEBUG,console
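The three numbers in the a2 agent (capacity, transactionCapacity, and hdfs.batchSize) are related. As far as I understand it, the memory channel holds at most capacity events, moves at most transactionCapacity events per transaction, and the HDFS sink takes up to hdfs.batchSize events in one transaction, so the batch size should not exceed transactionCapacity, which in turn should not exceed capacity. The Python sketch below checks that relationship for a config file. `parse_properties` and `check_batch_sizes` are hypothetical helpers written for this post, and the parser is deliberately minimal compared with real Java properties handling.

```python
CONFIG = """
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.batchSize = 10
"""


def parse_properties(text):
    """Minimal parser: 'key = value' lines; blank and '#' lines are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_batch_sizes(props, agent, channel, sink):
    """Return a list of problems with the capacity and batch-size settings."""
    capacity = int(props[f"{agent}.channels.{channel}.capacity"])
    tx_capacity = int(props[f"{agent}.channels.{channel}.transactionCapacity"])
    batch_size = int(props[f"{agent}.sinks.{sink}.hdfs.batchSize"])
    problems = []
    if tx_capacity > capacity:
        problems.append("transactionCapacity exceeds capacity")
    if batch_size > tx_capacity:
        problems.append("hdfs.batchSize exceeds transactionCapacity")
    return problems


if __name__ == "__main__":
    # An empty list means the three settings are consistent with each other.
    print(check_batch_sizes(parse_properties(CONFIG), "a2", "c2", "k2"))
```

With the values used in this post (1000, 100, 10) the check reports no problems.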

A brief introduction to the Spooling Directory Source

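The Spooling Directory Source ingests data from a "spooling" directory on disk. It watches the specified directory for new files and parses each new file as it appears; the logic that parses a file into events is pluggable. Once a file's contents have been read completely into the channel, the source renames the file or deletes it to mark it as done. Unlike the Exec Source, this source is reliable and will not lose data, even if Flume is restarted or killed. In exchange, files placed in the directory must not be modified afterwards, and file names must not be reused.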
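The read-then-rename cycle can be imitated with a short Python sketch. This is a simplified model under my own assumptions and not how Flume implements it: `drain_spool_dir` is a hypothetical function, it makes a single pass instead of watching continuously, and it has none of the channel transactions that make the real source reliable. The default arguments mirror what I believe are Flume's defaults for fileSuffix and ignorePattern.

```python
import os
import re
import tempfile


def drain_spool_dir(spool_dir, suffix=".COMPLETED", ignore_pattern=r"^$"):
    """Read every pending file in spool_dir once, then rename it with suffix.

    Returns the lines that were read, in file-name order.
    """
    ignore = re.compile(ignore_pattern)
    lines = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        # Skip files already marked as done and files matching the ignore pattern.
        if name.endswith(suffix) or ignore.fullmatch(name):
            continue
        with open(path, encoding="utf-8") as handle:
            lines.extend(handle.read().splitlines())
        # Renaming marks the file as fully ingested so it is not read again.
        os.rename(path, path + suffix)
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        with open(os.path.join(spool, "app.txt"), "w", encoding="utf-8") as f:
            f.write("line 1\nline 2\n")
        with open(os.path.join(spool, "app.log"), "w", encoding="utf-8") as f:
            f.write("ignored\n")
        print(drain_spool_dir(spool, ".finish", r"^(.)*\.log$"))
        print(sorted(os.listdir(spool)))
```

Running the pass a second time on the same directory returns nothing, because the ingested file now carries the suffix and the .log file is still ignored.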

Now write another agent: flume-app.conf

 
a3.sources = r3
a3.channels = c3
a3.sinks = k3


a3.sources.r3.type = spooldir
# Regular expression: ignore files whose names end in .log
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.spoolDir = /opt/cdh-5.5.0/flume-1.6.0-cdh5.5.0/spool_logs
# Once a file has been fully collected, the original file gets a .finish suffix
a3.sources.r3.fileSuffix = .finish


a3.channels.c3.type = file
# Create the following two directories yourself
a3.channels.c3.checkpointDir = /opt/cdh-5.5.0/flume-1.6.0-cdh5.5.0/filechannel/chkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.5.0/flume-1.6.0-cdh5.5.0/filechannel/data


a3.sinks.k3.type = hdfs
# The next two properties create subdirectories automatically from the current date (year, month, day)
a3.sinks.k3.hdfs.useLocalTimeStamp = true
a3.sinks.k3.hdfs.path = hdfs://BPF:9000/user/bpf/flume/spooling_logs/%Y%m%d
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10


a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

 
bin/flume-ng agent \
-c conf \
-n a3 \
-f conf/flume-app.conf \
-Dflume.root.logger=DEBUG,console
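The date-based subdirectory in hdfs.path deserves a quick illustration. With hdfs.useLocalTimeStamp = true the sink uses the agent's local clock to fill in the escape sequences, instead of a timestamp header on the event. The Python sketch below imitates only the three escapes used in this post (%Y, %m, %d); `resolve_hdfs_path` is a hypothetical helper, and Flume supports more escape sequences than this.

```python
from datetime import datetime


def resolve_hdfs_path(path_template, now=None):
    """Expand %Y, %m and %d in path_template using the given or local time."""
    now = now or datetime.now()
    return (path_template
            .replace("%Y", f"{now.year:04d}")
            .replace("%m", f"{now.month:02d}")
            .replace("%d", f"{now.day:02d}"))


if __name__ == "__main__":
    template = "hdfs://BPF:9000/user/bpf/flume/spooling_logs/%Y%m%d"
    # Fixed time for a reproducible result; omit `now` to use the local clock.
    print(resolve_hdfs_path(template, datetime(2020, 7, 2, 3, 26)))
    # prints hdfs://BPF:9000/user/bpf/flume/spooling_logs/20200702
```

Files collected on different days therefore end up in different directories without any change to the config.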
