Using Flume's Spooling Directory Source to Collect Directory Data into HDFS

I. Requirements

Flume monitors a directory on Linux (/home/flume_data) for newly arriving files and writes their contents to the corresponding HDFS directory (hdfs://master:9000/flume/spool/%Y%m%d%H%M).

II. Create the configuration file

1. Create a new configuration file, hdfs-logger.conf, under the conf directory:

# Name the components on this agent
spool-hdfs-agent.sources = spool-source
spool-hdfs-agent.sinks = hdfs-sink
spool-hdfs-agent.channels = memory-channel

# Describe/configure the source
spool-hdfs-agent.sources.spool-source.type = spooldir
spool-hdfs-agent.sources.spool-source.spoolDir = /home/flume_data

# Describe the sink
spool-hdfs-agent.sinks.hdfs-sink.type = hdfs
spool-hdfs-agent.sinks.hdfs-sink.hdfs.path = hdfs://master:9000/flume/spool/%Y%m%d%H%M
spool-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
spool-hdfs-agent.sinks.hdfs-sink.hdfs.fileType = CompressedStream
spool-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat = Text
spool-hdfs-agent.sinks.hdfs-sink.hdfs.codeC = gzip
spool-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
spool-hdfs-agent.channels.memory-channel.type = memory
spool-hdfs-agent.channels.memory-channel.capacity = 1000
spool-hdfs-agent.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
spool-hdfs-agent.sources.spool-source.channels = memory-channel
spool-hdfs-agent.sinks.hdfs-sink.channel = memory-channel

2. Notes

(1) spool-hdfs-agent is the agent name; it must match the --name option of the Flume startup command (see Section III);

(2) /home/flume_data is the directory Flume monitors and collects from;

(3) hdfs://master:9000/flume/spool/%Y%m%d%H%M is the HDFS output path; %Y%m%d%H%M is the timestamp pattern used for the output directories;
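For example, with hdfs.useLocalTimeStamp = true as configured above, events written at 2020-04-21 10:09 land under hdfs://master:9000/flume/spool/202004211009/, as the console log in Section III shows.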

(4) Flume has three rolling policies (see the sketch after this note):
By time: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 30
By size: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 1024
By event count: spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0

Rolling means that when any one of the configured rolling conditions is met, the HDFS sink commits the current file to HDFS (i.e., a new file appears in HDFS).
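As a minimal sketch using this article's agent and sink names: to roll purely by size, disable the other two policies by setting them to 0 (the ~128 MB target below is an illustrative choice, not from the original config):

# Roll a new HDFS file every ~128 MB; never roll by time or by event count
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval = 0
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 134217728
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount = 0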

rollInterval

Default: 30
How long, in seconds, the HDFS sink waits before rolling the temporary file into the final target file;
Set to 0 to disable time-based rolling;
Note: rolling means the HDFS sink renames the temporary (.tmp) file to its final target name and opens a new temporary file for writing;

rollSize

Default: 1024
When the temporary file reaches this size (in bytes), it is rolled into a target file;
Set to 0 to disable size-based rolling;

rollCount

Default: 10
When this many events have been written, the temporary file is rolled into a target file;
Set to 0 to disable count-based rolling;

(5) rollSize measures the size before compression, so if the HDFS files are compressed, rollSize should be increased accordingly.
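For instance, assuming (hypothetically) a gzip compression ratio of about 5:1, targeting roughly 128 MB compressed files means letting about 640 MB of raw data accumulate per file:

# Hypothetical 5:1 gzip ratio: ~640 MB uncompressed ≈ ~128 MB compressed on HDFS
spool-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize = 671088640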

(6) Once a file in the directory has been collected into HDFS, it is renamed with a .COMPLETED suffix.
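The suffix is set by the source's fileSuffix property (default .COMPLETED); alternatively, deletePolicy can remove files once they are fully ingested. A sketch:

# Rename fully ingested files with a custom suffix (default: .COMPLETED)
spool-hdfs-agent.sources.spool-source.fileSuffix = .COMPLETED
# ...or delete them immediately instead of renaming:
# spool-hdfs-agent.sources.spool-source.deletePolicy = immediate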

(7) With the Spooling Directory Source, modifying a file after it has been collected causes an error and stops the source; likewise, dropping in a file with the same name as one already collected causes an error and stops the source.

(8) Data written to HDFS can be partitioned by time; note that if no data arrives within a given time bucket, that time directory is not created.
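As a sketch, the HDFS sink's rounding options can coarsen these time buckets, e.g. rounding the path timestamp down to 10-minute intervals:

# Round the timestamp used in hdfs.path down to the nearest 10 minutes
spool-hdfs-agent.sinks.hdfs-sink.hdfs.round = true
spool-hdfs-agent.sinks.hdfs-sink.hdfs.roundValue = 10
spool-hdfs-agent.sinks.hdfs-sink.hdfs.roundUnit = minute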

(9) Generated file names default to prefix + timestamp; this is configurable.
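In this article's config, hdfs.filePrefix = wsk produces names like wsk.1587434947074.gz (see the console log in Section III). The sink also has an hdfs.fileSuffix property; when it is unset and compression is enabled, the codec's extension (.gz here) is appended automatically:

# Prefix defaults to FlumeData; overridden to wsk in this article
spool-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix = wsk
# Optional explicit suffix; leave unset to let the codec extension apply
# spool-hdfs-agent.sinks.hdfs-sink.hdfs.fileSuffix = .log.gz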

III. Start Flume

1. Command

[root@master bin]# ./flume-ng agent --conf conf --conf-file ../conf/hdfs-logger.conf --name spool-hdfs-agent -Dflume.root.logger=INFO,console
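Here --conf points at Flume's configuration directory (flume-env.sh and logging settings), --conf-file at the agent definition created above, and --name must match the agent name used inside that file (spool-hdfs-agent).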

2. Send files to the monitored directory

[root@master flumeData]# cp teacher /home/flume_data/
[root@master flumeData]# cp student /home/flume_data/
[root@master flumeData]# cat teacher 
chenlaoshi
malaoshi
haolaoshi
weilaoshi
hualaoshi
[root@master flumeData]# cat student 
zhangsan
lisi
wangwu
xiedajiao
xieguangkun

3. Console log output

20/04/21 10:08:56 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: spool-source started
20/04/21 10:09:07 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/04/21 10:09:07 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/flume_data/teacher to /home/flume_data/teacher.COMPLETED
20/04/21 10:09:07 INFO hdfs.HDFSCompressedDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/04/21 10:09:07 INFO hdfs.BucketWriter: Creating hdfs://master:9000/flume/spool/202004211009/wsk.1587434947074.gz.tmp
20/04/21 10:09:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
20/04/21 10:09:08 INFO compress.CodecPool: Got brand-new compressor [.gz]
20/04/21 10:09:17 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
20/04/21 10:09:17 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/flume_data/student to /home/flume_data/student.COMPLETED

4. The monitored directory

[root@master flume_data]# ls
student.COMPLETED  teacher.COMPLETED

5. Result in HDFS

Download the output file, decompress it, and open it to verify the contents.
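For example, using the path from the console log above, the gzip output can also be inspected in place, since hdfs dfs -text decompresses known codecs:

[root@master ~]# hdfs dfs -text hdfs://master:9000/flume/spool/202004211009/wsk.1587434947074.gz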

IV. Drawbacks of this approach

1. It can monitor a single directory, but it cannot recursively monitor files in subdirectories;

2. If Flume crashes during collection, there is no guarantee that on restart it resumes from the line of the file it had reached.
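For reference, later Flume releases (1.7+) mitigate both issues: the spooling source gained a recursiveDirectorySearch option, and the Taildir Source records its read positions in a JSON file so it can resume after a restart. A sketch of the former:

# Flume 1.7+: also watch subdirectories of the spool directory
spool-hdfs-agent.sources.spool-source.recursiveDirectorySearch = true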
