Flume Summary (source | channel | sink configuration)

Original

花名无缺

2020-05-24 20:19


agent

An agent is made up of three components: a source, a channel, and a sink.
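The snippets in this post only show per-component settings, so here is a minimal sketch of how the three components are declared and wired together in one agent file. The component names (r1, c1, k1) are arbitrary labels, and the choice of exec source, memory channel, and logger sink is only for illustration.

```properties
# Declare the components that belong to agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: run a command and turn its output lines into events
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/flumeData/tail.log
# A source can feed several channels, hence the plural "channels"
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: log each event; a sink reads from exactly one channel, hence "channel"
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Assuming the file is saved as example.conf (a hypothetical name), the agent can be started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.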

source

Connects to the data source.

exec source

a1.sources.r1.type = exec
# Command to run; each line of its standard output becomes an event
a1.sources.r1.command = tail -F /data/flumeData/tail.log
a1.sources.r1.channels = c1

spooldir source

a1.sources.r1.type = spooldir
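# Watch this directory for new files. A fully ingested file is renamed with a
# .COMPLETED suffix by default, so it is not collected again after a restart.
a1.sources.r1.spoolDir = /data/flumeData/files
# fileHeader = true adds a header (key "file" by default) holding the file's absolute path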
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1
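To make the rename-on-completion behaviour and the file header explicit, here is a sketch of the same source with the related options spelled out. The values shown for fileSuffix, deletePolicy, and fileHeaderKey are the documented defaults as far as I know; the .tmp naming convention in ignorePattern is an assumption about how files are dropped into the directory.

```properties
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/flumeData/files
# Suffix appended to a file once it has been fully ingested
a1.sources.r1.fileSuffix = .COMPLETED
# never keeps completed files; immediate deletes them instead of renaming
a1.sources.r1.deletePolicy = never
# Store the absolute path of the originating file in the "file" header
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file
# Skip files that are still being written (assumes writers use a .tmp suffix);
# the backslash is doubled because this is a Java properties file
a1.sources.r1.ignorePattern = ^.*\\.tmp$
a1.sources.r1.channels = c1
```

Files placed in the spooling directory should be complete and must not be modified afterwards; writing them elsewhere and moving them in is the usual approach.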

avro source: receives the output of an upstream agent's avro sink

a1.sources.r1.type = avro
# IP address the Avro source binds to and listens on
a1.sources.r1.bind = 192.168.200.110
# Port the Avro source listens on
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

taildir source: can tail several files at once, including all matching files in a directory

a1.sources.s1.type = taildir
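# Records each file's read offset as JSON so that a restart resumes where it left off instead of re-reading from the start
a1.sources.s1.positionFile = /data/flumeData/index/taildir_position.json
a1.sources.s1.filegroups = f1 f2 f3
# A whole directory can be covered here: the directory part is a literal path and the file name part may be a regular expression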
a1.sources.s1.filegroups.f1 = /home/hadoop/taillogs/access.log
a1.sources.s1.filegroups.f2 = /home/hadoop/taillogs/nginx.log
a1.sources.s1.filegroups.f3 = /home/hadoop/taillogs/web.log
a1.sources.s1.headers.f1.headerKey = access
a1.sources.s1.headers.f2.headerKey = nginx
a1.sources.s1.headers.f3.headerKey = web
a1.sources.s1.fileHeader = true
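As a sketch of the directory case, a single file group can match every log file in a directory by using a regular expression for the file name. The group name f1 and the header value taillogs are illustrative choices, not part of the original setup.

```properties
a1.sources.s1.type = TAILDIR
a1.sources.s1.positionFile = /data/flumeData/index/taildir_position.json
a1.sources.s1.filegroups = f1
# Directory is literal; only the file name part is treated as a regex.
# The backslash is doubled because this is a Java properties file.
a1.sources.s1.filegroups.f1 = /home/hadoop/taillogs/.*\\.log
# Tag every event from this group so downstream components can tell groups apart
a1.sources.s1.headers.f1.headerKey = taillogs
# Also record the absolute path of the tailed file in a header
a1.sources.s1.fileHeader = true
a1.sources.s1.channels = c1
```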

source interceptors

a1.sources.r1.interceptors = i1 i2
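# i1 is a timestamp interceptor: it adds a timestamp header so that escapes such as %Y-%m-%d can be resolved
a1.sources.r1.interceptors.i1.type = timestamp
# i2 is a host interceptor: it adds the agent's hostname or IP address to each event's headers
a1.sources.r1.interceptors.i2.type = host
# Name of the header that stores the host value
a1.sources.r1.interceptors.i2.hostHeader = hostname
# Store the server's hostname rather than its IP address
a1.sources.r1.interceptors.i2.useIP = false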
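The headers written by these two interceptors become useful on the sink side. A minimal sketch, assuming the interceptors above are attached to the source and that a per-host, per-day directory layout is wanted (the layout itself is my assumption):

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# %{hostname} expands to the value of the "hostname" header set by i2;
# %Y-%m-%d is resolved from the "timestamp" header set by i1
a1.sinks.k1.hdfs.path = hdfs://node01:8020/tailFile/%{hostname}/%Y-%m-%d
```

With the timestamp header present, the sink does not need hdfs.useLocalTimeStamp to resolve the date escapes.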

channel

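Flume's data safety is provided by the channel, which buffers events between the source and the sink.

file channel: persists events to disk; generally used where data must not be lost, such as bank transaction records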

a1.channels.c1.type = file
# Checkpoint directory: records where each event is located in the data directories
a1.channels.c1.checkpointDir = /data/flume_checkpoint
# Directory where the event data is stored
a1.channels.c1.dataDirs=/data/flume_data
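A slightly fuller sketch of a file channel, showing that dataDirs accepts a comma-separated list and that capacity limits can be set explicitly. The /disk1 and /disk2 mount points are hypothetical; spreading data directories over separate disks is a common way to improve throughput.

```properties
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume_checkpoint
# Comma-separated list of data directories (hypothetical mount points)
a1.channels.c1.dataDirs = /disk1/flume_data,/disk2/flume_data
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000000
# Maximum number of events per put or take transaction
a1.channels.c1.transactionCapacity = 10000
```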

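memory channel: buffers events in memory; faster, but events are lost if the agent process dies

a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 1000
# Maximum number of events a source may put into, or a sink may take from, the channel in one transaction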
a1.channels.c1.transactionCapacity = 100
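Besides the event-count limits, a memory channel can also be bounded by bytes, which helps keep it from exhausting the JVM heap. A sketch with illustrative sizes (the numbers are assumptions to be tuned against the agent's heap setting):

```properties
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# Upper bound on the total bytes of event bodies held in memory (64 MB here)
a1.channels.c1.byteCapacity = 67108864
# Percentage of byteCapacity set aside to account for event headers
a1.channels.c1.byteCapacityBufferPercentage = 20
# Seconds to wait for space (put) or for an event (take) before giving up
a1.channels.c1.keep-alive = 3
```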

sink

Connects to the final destination of the data.

logger sink: prints events to the console

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

hdfs sink: writes events to HDFS

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# HDFS directory that the collected data is written to
a1.sinks.k1.hdfs.path = hdfs://node01:8020/tailFile/%Y-%m-
# Prefix for generated file names
a1.sinks.k1.hdfs.filePrefix = events-
# Suffix for generated file names
a1.sinks.k1.hdfs.fileSuffix = .log

# Whether to round the timestamp down; in production one directory per day is typical
a1.sinks.k1.hdfs.round = true
# Value to round down to
a1.sinks.k1.hdfs.roundValue = 10
# Unit of the rounding value
a1.sinks.k1.hdfs.roundUnit = minute

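# A new file is rolled as soon as any enabled trigger fires: elapsed time, file size, or event count.
# A value of 0 disables a trigger, so with the values below only the size trigger is active.
# Roll by size: start a new file once the current one reaches 50 bytes
a1.sinks.k1.hdfs.rollSize = 50
# Do not roll by event count
a1.sinks.k1.hdfs.rollCount = 0
# Do not roll by time
a1.sinks.k1.hdfs.rollInterval = 0
# Number of events written per batch
a1.sinks.k1.hdfs.batchSize = 100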

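# Use the agent's local time, so that escapes such as %Y-%m-%d can be resolved without a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Output file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Number of threads Flume uses for HDFS operations (create, write, and so on)
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Timeout for HDFS operations, in milliseconds
a1.sinks.k1.hdfs.callTimeout = 30000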
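For the "one directory per day" layout mentioned above, here is a sketch of a more production-like variant. The directory pattern ending in %Y-%m-%d and the roll thresholds (one hour or 128 MB) are my assumptions, not values from the original configuration. Rounding is left out because hdfs.roundUnit only goes up to hour; a daily directory comes from the path escape alone.

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# One directory per day (assumed layout)
a1.sinks.k1.hdfs.path = hdfs://node01:8020/tailFile/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Roll a new file every hour ...
a1.sinks.k1.hdfs.rollInterval = 3600
# ... or when the file reaches 128 MB, whichever comes first
a1.sinks.k1.hdfs.rollSize = 134217728
# Never roll on event count
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
```

Very small roll thresholds, such as the 50 bytes used in the demo values, produce large numbers of tiny HDFS files, which is why larger values are usually preferred outside of testing.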

avro sink

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.200.110
a1.sinks.k1.port = 4141
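The avro sink and the avro source are two ends of the same connection: the sink's hostname and port must match the address the downstream source is bound to. A sketch of the pairing, where the downstream agent name a2 and its memory channel and logger sink are hypothetical choices for illustration:

```properties
# --- Agent a1 (upstream machine): forwards events over Avro RPC ---
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 192.168.200.110
a1.sinks.k1.port = 4141

# --- Agent a2 (runs on 192.168.200.110; the name a2 is hypothetical) ---
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Listens on the address that a1's avro sink connects to
a2.sources.r1.type = avro
a2.sources.r1.bind = 192.168.200.110
a2.sources.r1.port = 4141
a2.sources.r1.channels = c1

a2.channels.c1.type = memory

a2.sinks.k1.type = logger
a2.sinks.k1.channel = c1
```

Each agent section lives in the configuration file on its own machine. Starting the downstream agent first lets the upstream sink connect on its first attempt.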

Interceptors

Tuning

High availability

Put transaction and Take transaction

source -> channel is a put transaction

channel -> sink is a take transaction
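Both transactions are bounded by the channel's transactionCapacity: a source puts one batch per put transaction, and a sink takes one batch per take transaction. If a take transaction fails, the events are rolled back into the channel rather than lost. A sketch of settings kept consistent with each other, assuming an exec source r1 and an hdfs sink k1 as used earlier:

```properties
a1.channels.c1.type = memory
# Total events the channel can buffer
a1.channels.c1.capacity = 1000
# Upper bound for a single put or take transaction
a1.channels.c1.transactionCapacity = 100

# Put side: events the source writes to the channel per transaction
a1.sources.r1.batchSize = 100

# Take side: events the sink reads from the channel per transaction
a1.sinks.k1.hdfs.batchSize = 100
```

The rule of thumb is batchSize <= transactionCapacity <= capacity. As far as I know, a batch larger than transactionCapacity makes the transaction fail with a ChannelException.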
