Using Flume's Taildir Source to Collect Directory Data into HDFS

I. Notes

1. This approach is suitable for production environments;

2. Taildir Source was introduced in Apache Flume 1.7, but CDH Flume 1.6 already ships with it;

3. Taildir Source is a reliable source: it continuously writes the file offsets to a JSON file saved on disk. When Flume restarts, it reads that JSON file to recover the offsets and resumes reading from the previous positions, guaranteeing zero data loss (see the example after this list);

4. Taildir Source can monitor multiple directories and files at the same time, even while those files are still being written to;

5. Taildir Source cannot collect data from files in nested subdirectories (no recursion); supporting that requires modifying the source code;

6. To monitor all files under a directory, you must use the .* regex;
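
For reference, a minimal sketch of what the position file typically contains (the inode, pos and file values below are illustrative, taken from the console output later in this post; the exact layout may vary between Flume versions):

[root@master ~]# cat /home/taildir/taildir_position.json
[{"inode":34409843,"pos":43,"file":"/home/flume_data/student"},{"inode":34409846,"pos":50,"file":"/home/flume_data/teacher"}]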

II. Create the configuration file under conf

1. Create a new hdfs-taildir-logger.conf file under the conf directory

# Name the components on this agent
taildir-hdfs-agent.sources = taildir-source
taildir-hdfs-agent.sinks = hdfs-sink
taildir-hdfs-agent.channels = memory-channel

# Describe/configure the source
taildir-hdfs-agent.sources.taildir-source.type = TAILDIR
taildir-hdfs-agent.sources.taildir-source.filegroups = f1
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/flume_data/.*
taildir-hdfs-agent.sources.taildir-source.positionFile = /home/taildir/taildir_position.json

# Describe the sink
taildir-hdfs-agent.sinks.hdfs-sink.type = hdfs
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/taildir/%Y%m%d%H%M
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
taildir-hdfs-agent.channels.memory-channel.type = memory
taildir-hdfs-agent.channels.memory-channel.capacity = 1000
taildir-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
taildir-hdfs-agent.sources.taildir-source.channels = memory-channel
taildir-hdfs-agent.sinks.hdfs-sink.channel = memory-channel
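
Before starting the agent it is worth making sure that the local directory holding the position file and the monitored directory exist; the HDFS sink will normally create its target path on its own, but creating it up front does no harm. A minimal preparation sketch using the paths from the config above (adjust to your environment):

mkdir -p /home/taildir /home/flume_data
hdfs dfs -mkdir -p /flume/taildir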

2. Start the agent

[root@master bin]# ./flume-ng agent --conf conf --conf-file ../conf/hdfs-taildir-logger.conf --name taildir-hdfs-agent -Dflume.root.logger=INFO,console
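
The command above runs the agent in the foreground with console logging, which is handy for testing. For a long-running setup, a common pattern (a sketch only; the log file path is an assumption) is to run it in the background and redirect the output:

[root@master bin]# nohup ./flume-ng agent --conf conf --conf-file ../conf/hdfs-taildir-logger.conf --name taildir-hdfs-agent > /home/flume-taildir.log 2>&1 &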

3. Send data to the monitored directory

[root@master flumeData]# cp student /home/flume_data/
[root@master flumeData]# cp teacher /home/flume_data/
[root@master flumeData]# scp -r test /home/flume_data/
[root@master flumeData]# cat teacher 
chenlaoshi
malaoshi
haolaoshi
weilaoshi
hualaoshi
[root@master flumeData]# cat student 
zhangsan
lisi
wangwu
xiedajiao
xieguangkun

[root@master test]# cat woker 
laowang
laoli
laohao
laoxu
laochen

4. Console output

20/04/21 10:53:04 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: taildir-source started
20/04/21 10:53:14 INFO taildir.ReliableTaildirEventReader: Opening file: /home/flume_data/student, inode: 34409843, pos: 0
20/04/21 10:53:14 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:53:14 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz.tmp
20/04/21 10:53:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
20/04/21 10:53:16 INFO compress.CodecPool: Got brand-new compressor [.gz]
20/04/21 10:53:46 INFO hdfs.HDFSEventSink: Writer callback called.
20/04/21 10:53:46 INFO hdfs.BucketWriter: Closing hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz.tmp
20/04/21 10:53:47 INFO hdfs.BucketWriter: Renaming hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz.tmp to hdfs://master:9000/flume/taildir/202004211053/wsk.1587437594134.gz
20/04/21 10:54:09 INFO taildir.ReliableTaildirEventReader: Opening file: /home/flume_data/teacher, inode: 34409846, pos: 0
20/04/21 10:54:09 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:54:09 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz.tmp
20/04/21 10:54:39 INFO hdfs.HDFSEventSink: Writer callback called.
20/04/21 10:54:39 INFO hdfs.BucketWriter: Closing hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz.tmp
20/04/21 10:54:39 INFO hdfs.BucketWriter: Renaming hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz.tmp to hdfs://master:9000/flume/taildir/202004211054/wsk.1587437649147.gz
20/04/21 10:55:19 INFO taildir.TaildirSource: Closed file: /home/flume_data/student, inode: 34409843, pos: 43
20/04/21 10:56:14 INFO taildir.TaildirSource: Closed file: /home/flume_data/teacher, inode: 34409846, pos: 50

5. HDFS storage

Downloading and opening the generated files, there is no data from the worker file under the test directory, because Taildir Source cannot collect data from files in nested subdirectories; supporting that would require modifying the source code. The output can be checked as shown below.
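
A quick way to verify the results from the command line, using the HDFS paths and file names from the console output above (the timestamps will differ in your own run); hdfs dfs -text decompresses the gzip files when printing, so the first file should show the contents of student:

[root@master ~]# hdfs dfs -ls /flume/taildir/202004211053/
[root@master ~]# hdfs dfs -text /flume/taildir/202004211053/wsk.1587437594134.gz
zhangsan
lisi
wangwu
xiedajiao
xieguangkun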

III. Monitoring multiple directories

1. This configuration monitors two directories as an example

# Name the components on this agent
taildir-hdfs-agent.sources = taildir-source
taildir-hdfs-agent.sinks = hdfs-sink
taildir-hdfs-agent.channels = memory-channel

# Describe/configure the source
taildir-hdfs-agent.sources.taildir-source.type = TAILDIR
taildir-hdfs-agent.sources.taildir-source.filegroups = f1 f2
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/flume_data1/.*
taildir-hdfs-agent.sources.taildir-source.filegroups.f2 = /home/flume_data/.*
taildir-hdfs-agent.sources.taildir-source.positionFile = /home/taildir/taildir_position.json

# Describe the sink
taildir-hdfs-agent.sinks.hdfs-sink.type = hdfs
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/taildir1/%Y%m%d%H%M
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
taildir-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
taildir-hdfs-agent.channels.memory-channel.type = memory
taildir-hdfs-agent.channels.memory-channel.capacity = 1000
taildir-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
taildir-hdfs-agent.sources.taildir-source.channels = memory-channel
"hdfs-taildir-logger.conf" 32L, 1601C   

2. Other notes

To monitor only files with a specific suffix, the filegroups can be written like this (a quick check follows the snippet):

# Only monitor files ending in log under this directory
taildir-hdfs-agent.sources.taildir-source.filegroups.f1 = /home/flume_data1/.*log
# Only monitor files ending in txt under this directory
taildir-hdfs-agent.sources.taildir-source.filegroups.f2 = /home/flume_data/.*txt
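
As a quick sanity check (the file names here are illustrative, not from the original post), data appended to a file with a matching suffix is collected, while a non-matching suffix is ignored:

# matches f1 (.*log), will be picked up by the agent
echo "hello" >> /home/flume_data1/app.log
# matches neither filegroup, will be ignored
echo "hello" >> /home/flume_data1/app.tmp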

 
