Using Flume's Spooling Directory Source to Collect Directory Data into HDFS

I. Requirements

Flume monitors a directory on Linux (/home/flume_data) for newly arriving files and writes their contents to the corresponding HDFS directory (hdfs://master:9000/flume/spool/%Y%m%d%H%M).

II. Create the configuration file

1. Create a new configuration file, hdfs-logger.conf, under the conf directory:

# Name the components on this agent
spool-hdfs-agent.sources = spool-source
spool-hdfs-agent.sinks = hdfs-sink
spool-hdfs-agent.channels = memory-channel

# Describe/configure the source
spool-hdfs-agent.sources.spool-source.type = spooldir
spool-hdfs-agent.sources.spool-source.spoolDir = /home/flume_data

# Describe the sink
spool-hdfs-agent.sinks.hdfs-sink.type = hdfs
spool-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/spool/%Y%m%d%H%M
spool-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
spool-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
spool-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
spool-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
spool-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
spool-hdfs-agent.channels.memory-channel.type = memory
spool-hdfs-agent.channels.memory-channel.capacity = 1000
spool-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
spool-hdfs-agent.sources.spool-source.channels = memory-channel
spool-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

2. Notes

(1) spool-hdfs-agent is the agent name; it must match the --name option of the Flume startup command (see Section III);

(2) /home/flume_data is the directory Flume monitors and collects from;

(3) hdfs://master:9000/flume/spool/%Y%m%d%H%M is the HDFS output path; %Y%m%d%H%M is the timestamp pattern used for the output directories;
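For example, with hdfs.useLocalTimeStamp = true as configured above, events written at 2020-04-21 10:09 land under hdfs://master:9000/flume/spool/202004211009/, as the console log in Section III shows.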

(4) Flume has three rolling policies (see the sketch after this note):
By time: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
By size: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
By event count: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

Rolling means that when any one of the configured rolling conditions is met, the HDFS sink commits the current file to HDFS (i.e., a new file appears in HDFS).
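As a minimal sketch using this article's agent and sink names: to roll purely by size, disable the other two policies by setting them to 0 (the ~128 MB target below is an illustrative choice, not from the original config):

# Roll a new HDFS file every ~128 MB; never roll by time or by event count
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 0
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0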

rollInterval

Default: 30
How long, in seconds, the HDFS sink waits before rolling the temporary file into the final target file;
Set to 0 to disable time-based rolling;
Note: rolling means the HDFS sink renames the temporary (.tmp) file to its final target name and opens a new temporary file for writing;

rollSize

Default: 1024
When the temporary file reaches this size (in bytes), it is rolled into a target file;
Set to 0 to disable size-based rolling;

rollCount

Default: 10
When this many events have been written, the temporary file is rolled into a target file;
Set to 0 to disable count-based rolling;

(5) rollSize measures the size before compression, so if the HDFS files are compressed, rollSize should be increased accordingly.
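For instance, assuming (hypothetically) a gzip compression ratio of about 5:1, targeting roughly 128 MB compressed files means letting about 640 MB of raw data accumulate per file:

# Hypothetical 5:1 gzip ratio: ~640 MB uncompressed ≈ ~128 MB compressed on HDFS
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 671088640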

(6) Once a file in the directory has been collected into HDFS, it is renamed with a .COMPLETED suffix.
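The suffix is set by the source's fileSuffix property (default .COMPLETED); alternatively, deletePolicy can remove files once they are fully ingested. A sketch:

# Rename fully ingested files with a custom suffix (default: .COMPLETED)
spool-hdfs-agent.sources.spool-source.fileSuffix = .COMPLETED
# ...or delete them immediately instead of renaming:
# spool-hdfs-agent.sources.spool-source.deletePolicy = immediate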

(7) With the Spooling Directory Source, modifying a file after it has been collected causes an error and stops the source; likewise, dropping in a file with the same name as one already collected causes an error and stops the source.

(8) Data written to HDFS can be partitioned by time; note that if no data arrives within a given time bucket, that time directory is not created.
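As a sketch, the HDFS sink's rounding options can coarsen these time buckets, e.g. rounding the path timestamp down to 10-minute intervals:

# Round the timestamp used in hdfs.path down to the nearest 10 minutes
spool-hdfs-agent.sinks.hdfs-sink.hdfs.round = true
spool-hdfs-agent.sinks.hdfs-sink.hdfs.roundValue = 10
spool-hdfs-agent.sinks.hdfs-sink.hdfs.roundUnit = minute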

(9) Generated file names default to prefix + timestamp; this is configurable.
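In this article's config, hdfs.filePrefix = wsk produces names like wsk.1587434947074.gz (see the console log in Section III). The sink also has an hdfs.fileSuffix property; when it is unset and compression is enabled, the codec's extension (.gz here) is appended automatically:

# Prefix defaults to FlumeData; overridden to wsk in this article
spool-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
# Optional explicit suffix; leave unset to let the codec extension apply
# spool-hdfs-agent.sinks.hdfs-sink.hdfs.fileSuffix = .log.gz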

III. Start Flume

1. Command

[root@master bin]# ./flume-ng agent --conf conf --conf-file ../conf/hdfs-logger.conf --name spool-hdfs-agent -Dflume.root.logger=INFO,console
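Here --conf points at Flume's configuration directory (flume-env.sh and logging settings), --conf-file at the agent definition created above, and --name must match the agent name used inside that file (spool-hdfs-agent).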

2. Send files to the monitored directory

[root@master flumeData]# cp teacher /home/flume_data/
[root@master flumeData]# cp student /home/flume_data/
[root@master flumeData]# cat teacher 
chenlaoshi
malaoshi
haolaoshi
weilaoshi
hualaoshi
[root@master flumeData]# cat student 
zhangsan
lisi
wangwu
xiedajiao
xieguangkun

3. Console log output

20/04/21 10:08:56 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: spool-source started
20/04/21 10:09:07 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/04/21 10:09:07 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/flume_data/teacher to /home/flume_data/teacher.COMPLETED
20/04/21 10:09:07 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:09:07 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/spool/202004211009/wsk.1587434947074.gz.tmp
20/04/21 10:09:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
20/04/21 10:09:08 INFO compress.CodecPool: Got brand-new compressor [.gz]
20/04/21 10:09:17 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/04/21 10:09:17 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/flume_data/student to /home/flume_data/student.COMPLETED

4. The monitored directory

[root@master flume_data]# ls
student.COMPLETED  teacher.COMPLETED

5. Result in HDFS

Download the output file, decompress it, and open it to verify the contents.
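For example, using the path from the console log above, the gzip output can also be inspected in place, since hdfs dfs -text decompresses known codecs:

[root@master ~]# hdfs dfs -text hdfs://master:9000/flume/spool/202004211009/wsk.1587434947074.gz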

IV. Drawbacks of this approach

1. It can monitor a single directory, but it cannot recursively monitor files in subdirectories;

2. If Flume crashes during collection, there is no guarantee that on restart it resumes from the line of the file it had reached.
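For reference, later Flume releases (1.7+) mitigate both issues: the spooling source gained a recursiveDirectorySearch option, and the Taildir Source records its read positions in a JSON file so it can resume after a restart. A sketch of the former:

# Flume 1.7+: also watch subdirectories of the spool directory
spool-hdfs-agent.sources.spool-source.recursiveDirectorySearch = true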
