(1) An Introduction to Flume with Simple Examples

Original

牵梦u

2020-02-23 17:54

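I. Introduction to Flume

1. Definition

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It is built on a streaming architecture and is simple and flexible.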

2. Architecture

(1) Agent

An agent is a JVM process that moves data from its origin to its destination in the form of events.

An agent has three components: source, channel, and sink.

(2) Source

A source is the component that receives data into the Flume agent. Sources can handle log data of many types and formats, including avro, thrift, exec, jms, syslog, and http.

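(3) Sink

A sink continuously polls the channel for events, removes them in batches, and either writes those batches to a storage or indexing system or forwards them to another Flume agent.

Sink destinations include hdfs, logger, kafka, hbase, solr, and others.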

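(4) Channel

A channel is the buffer that sits between a source and a sink. It lets the source and the sink run at different rates, which helps avoid a backlog. A channel is thread-safe and can handle writes from several sources and reads from several sinks at the same time.

Flume ships with two commonly used channels: the memory channel and the file channel.

The memory channel is an in-memory queue. Its advantage is speed. Its drawback is weaker durability: if the machine crashes or restarts, every event still in the channel is lost and cannot be recovered.

The file channel writes all events to disk. Compared with the memory channel it is slower, but it is safer: events are not lost on a crash or restart. It can also hold more data, since disk capacity is larger than memory capacity.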
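The buffering role of a channel can be illustrated with a small Python sketch. This is a toy model and not Flume's implementation: the class name `MemoryChannelSketch` and its methods are hypothetical, and only the two parameters `capacity` and `transaction_capacity` mirror the memory channel settings used later in this post.

```python
import queue


class MemoryChannelSketch:
    """Toy stand-in for a memory channel: a bounded, thread-safe queue."""

    def __init__(self, capacity=1000, transaction_capacity=100):
        if transaction_capacity > capacity:
            raise ValueError("transaction_capacity must not exceed capacity")
        self._queue = queue.Queue(maxsize=capacity)
        self.transaction_capacity = transaction_capacity

    def put(self, event):
        """Called by a source; raises queue.Full when the channel is full."""
        self._queue.put_nowait(event)

    def take_batch(self):
        """Called by a sink; removes up to transaction_capacity events."""
        batch = []
        while len(batch) < self.transaction_capacity:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


channel = MemoryChannelSketch(capacity=5, transaction_capacity=2)
for i in range(5):
    channel.put(f"event-{i}".encode("utf-8"))

print(channel.take_batch())  # the first two events, in arrival order
print(channel.take_batch())  # the next two
```

Because the queue lives only in process memory, everything still inside it disappears when the process exits, which is the durability trade-off described above.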

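(5) Event

The event is the basic unit of data transfer in Flume; data travels from its origin to its destination as events.

An event consists of a header and a body. The header stores attributes of the event, similar to metadata, as key-value pairs. The body stores the data itself as a byte array.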
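As a rough picture of that structure, an event can be modelled in Python as a dictionary of string headers plus a bytes body. The `Event` class, the `make_event` helper, and the header names `host` and `source` are illustrative assumptions, not part of Flume.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Toy model of an event: key-value headers plus a byte-array body."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def make_event(line: str, **headers: str) -> Event:
    """Wrap one log line as an event, encoding the text to bytes."""
    return Event(headers=dict(headers), body=line.encode("utf-8"))


event = make_event("user login ok", host="web01", source="app.log")
print(event.headers)
print(event.body)
```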

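II. Common Examples

1. Official example: monitoring data on a port

(1) Requirement

Use Flume to listen on a port, collect the data arriving on that port, and print it to the console.

(2) Configuration file (netcat-flume-conf.conf)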

# netcat-flume-conf.conf: A single-node Flume configuration

# Step 1: name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Step 2: describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Step 3: describe the sink
a1.sinks.k1.type = logger

# Step 4: use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction; must not exceed the capacity
a1.channels.c1.transactionCapacity = 100

# Step 5: bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
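The last step, binding, is easy to get wrong: a source uses the plural key `channels` and a sink uses the singular key `channel`. A minimal Python sketch for checking that every binding points at a declared channel might look like this. The helpers `parse_properties` and `check_bindings` are hypothetical, Flume does not ship them, and the parser only handles simple `key = value` lines.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_bindings(props, agent="a1"):
    """Return a list of problems found in the source/sink channel bindings."""
    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source}: no channel bound")
        for name in bound:
            if name not in channels:
                problems.append(f"source {source}: unknown channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel", "")
        if bound not in channels:
            problems.append(f"sink {sink}: unknown or missing channel {bound!r}")
    return problems


SAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(check_bindings(parse_properties(SAMPLE)))  # an empty list means no problems
```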

(3) Start the agent

bin/flume-ng agent --conf conf --conf-file netcat-flume-conf.conf --name a1 -Dflume.root.logger=INFO,console
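Once the agent is running, any TCP client can feed it lines on port 44444. The following Python sketch is one way to do that without a netcat or telnet binary. The function name `send_lines` is hypothetical, and the sketch assumes the agent listens on `localhost:44444` as configured above. As far as I understand the netcat source, it acknowledges each line by default, so the function also collects whatever the server sends back.

```python
import socket


def send_lines(lines, host="localhost", port=44444, timeout=5.0):
    """Send newline-terminated lines over one TCP connection.

    Returns whatever text the server sent back before closing.
    """
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    replies = bytearray()
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        conn.shutdown(socket.SHUT_WR)  # signal that no more data follows
        while True:
            try:
                chunk = conn.recv(4096)
            except socket.timeout:
                break  # server kept the connection open; stop waiting
            if not chunk:
                break
            replies += chunk
    return replies.decode("utf-8", errors="replace")
```

Calling `send_lines(["hello", "flume"])` while the agent is up should make each line appear as an event in the agent's console output, since the sink type is logger.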

2. Monitoring a single appended file in real time

(1) Requirement

Monitor a log file in real time and upload its contents to HDFS.

(2) Configuration file

# A single-node Flume configuration: exec source, memory channel, HDFS sink

# Step 1: name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Step 2: describe/configure the source
a1.sources.r1.type = exec
# Command that follows the log file
a1.sources.r1.command = tail -F xxxx.log

# Step 3: describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = xxxx
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = log-
# Whether to round the timestamp down, so that directories roll over by time
a1.sinks.k1.hdfs.round = true
# Create a new directory every this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit for roundValue
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; DataStream writes uncompressed output (CompressedStream supports compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Roll when the file reaches this size in bytes; keep it just under 128 MB (the HDFS block size)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Step 4: use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction; must not exceed the capacity
a1.channels.c1.transactionCapacity = 100

# Step 5: bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
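The three roll settings in the sink section act as independent triggers, and a value of 0 switches a trigger off. A simplified Python sketch of that decision is shown below. The function `should_roll` is hypothetical and only models the rule; in the real sink the time-based roll is driven by a timer rather than checked per event.

```python
def should_roll(elapsed_seconds, file_bytes, event_count,
                roll_interval=60, roll_size=134217700, roll_count=0):
    """Return True when any enabled trigger is reached; 0 disables a trigger."""
    if roll_interval > 0 and elapsed_seconds >= roll_interval:
        return True
    if roll_size > 0 and file_bytes >= roll_size:
        return True
    if roll_count > 0 and event_count >= roll_count:
        return True
    return False


# With the settings above, the event count never triggers a roll.
print(should_roll(elapsed_seconds=10, file_bytes=1024, event_count=5000))  # False
# After 60 seconds the file rolls regardless of its size.
print(should_roll(elapsed_seconds=61, file_bytes=1024, event_count=10))    # True
```

With `rollCount = 0`, a file is closed either after 60 seconds or when it approaches 128 MB, whichever comes first.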

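4. Resumable transfer: monitoring multiple appended files in a directory in real time (Taildir)

(1) Analysis

The Exec source is suited to monitoring a single file that is appended to in real time, but it cannot guarantee that no data is lost.
The Spooldir source guarantees that no data is lost and can resume from where it left off, but its latency is somewhat higher, so it cannot monitor in real time.
The Taildir source can resume from where it left off, guarantees that no data is lost, and also monitors in real time.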
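Resuming works because the read offset of each file is persisted outside the process. The Python sketch below shows the idea with a JSON position file. It is a simplification under stated assumptions: the function `read_new_lines` is hypothetical, and it keys offsets by file path only, whereas Flume's Taildir position file, as far as I know, also records the inode so that renamed files can be recognised.

```python
import json
import os


def read_new_lines(log_path, position_path):
    """Return complete lines appended to log_path since the saved offset."""
    positions = {}
    if os.path.exists(position_path):
        with open(position_path, "r", encoding="utf-8") as fh:
            positions = json.load(fh)

    offset = positions.get(log_path, 0)
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()

    # Consume only complete lines; a partial last line waits for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()

    # Persist the new offset so a restarted process continues from here.
    positions[log_path] = offset + end
    with open(position_path, "w", encoding="utf-8") as fh:
        json.dump(positions, fh)
    return lines
```

Calling the function repeatedly, even from a freshly started process, returns each line once, because the offset survives in the position file rather than in memory.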

(2) Configuration file

# Step 1: name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Step 2: describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = xxxx/.*.txt
# File that records how far each monitored file has been read
a1.sources.r1.positionFile = xxx

# Step 3: describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = xxxx
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = log-
# Whether to round the timestamp down, so that directories roll over by time
a1.sinks.k1.hdfs.round = true
# Create a new directory every this many time units
a1.sinks.k1.hdfs.roundValue = 1
# Time unit for roundValue
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# File type; DataStream writes uncompressed output (CompressedStream supports compression)
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Roll when the file reaches this size in bytes; keep it just under 128 MB (the HDFS block size)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 means file rolling does not depend on the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Step 4: use a channel which buffers events in memory
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events per transaction; must not exceed the capacity
a1.channels.c1.transactionCapacity = 100

# Step 5: bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
