A First Look at Flume

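Flume (a log collection system), originally developed by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. Flume supports customizing various data senders within a logging system in order to collect data. It can also perform simple processing on the data and write it to a variety of (customizable) data receivers.

Flume currently comes in two version lines: the 0.9.x releases, collectively called Flume-og, and the 1.x releases, collectively called Flume-ng. Flume-ng went through a major refactoring and differs greatly from Flume-og, so take care to tell them apart when using it.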

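The process works as follows.

At its core, Flume collects data from a data source (source) and then delivers the collected data to a specified destination (sink). To make sure delivery succeeds, the data is first buffered (channel) before it is sent to the sink. Only after the data has actually reached the sink does Flume delete its buffered copy.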
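The "buffer first, delete only after delivery" idea can be sketched in a few lines of Python. This is a toy model rather than Flume's implementation: the names MemoryChannel, take_batch, commit, and drain are made up for illustration, and a failed delivery is modeled as an IOError.

```python
from collections import deque


class MemoryChannel:
    """Toy stand-in for a Flume channel: a bounded in-memory buffer."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.queue = deque()

    def put(self, event):
        if len(self.queue) >= self.capacity:
            raise RuntimeError("channel is full")
        self.queue.append(event)

    def take_batch(self, n):
        # Look at up to n events without removing them yet.
        return list(self.queue)[:n]

    def commit(self, n):
        # Delivery succeeded, so the buffered copies can be dropped.
        for _ in range(n):
            self.queue.popleft()


def drain(channel, deliver, batch_size=2):
    """Move events to the destination; keep them buffered if delivery fails."""
    while channel.queue:
        batch = channel.take_batch(batch_size)
        try:
            deliver(batch)
        except IOError:
            return False  # "rollback": events stay in the channel for a retry
        channel.commit(len(batch))
    return True


channel = MemoryChannel()
for line in ["line 1", "line 2", "line 3"]:
    channel.put(line)  # the source side

destination = []  # stands in for the sink's destination
ok = drain(channel, destination.extend)
print(ok, destination, len(channel.queue))
```

If deliver raises, drain returns False and the events are still in the channel, which is the property the paragraph above describes.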

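Throughout the transfer, what flows through the system is the event, and transactional guarantees are made at the event level. An event wraps the data being transferred and is both the basic unit of data transfer and the basic unit of a transaction in Flume. For a text file, an event is usually one line. An event flows from the source to the channel and then to the sink. Its payload is a byte array, and it can carry headers. An event represents the smallest complete unit of data, arriving from an external data source and heading to an external destination. A complete event consists of event headers and an event body, where the body holds the event data (for a text file, a single line).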
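To make the structure concrete, here is a minimal Python sketch of an event as "headers plus a byte-array body", with one event per line of text. The Event class, the events_from_text helper, and the "source" header key are illustrative assumptions, not Flume's own API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    body: bytes                                             # payload as raw bytes
    headers: Dict[str, str] = field(default_factory=dict)   # optional metadata


def events_from_text(text: str, source_name: str) -> List[Event]:
    """Wrap each line of a text file in its own event."""
    return [
        Event(body=line.encode("utf-8"), headers={"source": source_name})
        for line in text.splitlines()
    ]


events = events_from_text("first record\nsecond record\n", "hive.log")
for e in events:
    print(e.headers, e.body)
```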

That is enough of an introduction. Let's move on to the configuration.

Configuration is simple: just set the Java path in flume-env.sh.

 
# Environment variables can be set here.
export JAVA_HOME=/opt/jdk1.8.0_11

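To collect data into HDFS, you need to add the following jars to the lib directory under the Flume directory (they can be found in the Hadoop directory; mine is the CDH version).

First, let's look at some of Flume's core command-line options.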

 
bin/flume-ng
global options:
  --conf,-c <conf>          use configs in <conf> directory
  -Dproperty=value          sets a Java system property value
agent options:
  --name,-n <name>          the name of this agent (required)
  --conf-file,-f <file>     specify a config file (required if -z missing)

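We need to write the agent ourselves. See the official website for the specific properties.

The first agent is flume-hive_tail.conf (I created this file under Flume's conf directory).

(I changed the location of hive.log to a logs folder that I created under the Hive directory.)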

 
Collect logs
	Hive's log
	/opt/cdh-5.5.0/hive-1.1.0-cdh5.5.0/logs/hive.log
source:      exec (runs a command)    e.g. reads with "tail -f"
channel:     memory channel           buffers in memory
sink:        HDFS sink                writes to HDFS   /user/bpf/flume/hive_logs/

 
Write the agent:
# define agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2

# define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/cdh-5.5.0/hive-1.1.0-cdh5.5.0/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# define sinks
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://BPF:9000/user/bpf/flume/hive_logs/
a2.sinks.k2.hdfs.fileType = DataStream 
a2.sinks.k2.hdfs.writeFormat = Text
a2.sinks.k2.hdfs.batchSize = 10

# bind channels to sources and sinks
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

 
bin/flume-ng agent \
-c conf \
-n a2 \
-f conf/flume-hive_tail.conf \
-Dflume.root.logger=DEBUG,console
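Before launching, it can help to sanity-check the size settings in the config. The sketch below assumes, based on my understanding of Flume, that a sink batch has to fit inside one channel transaction and that a transaction has to fit inside the channel, that is batchSize <= transactionCapacity <= capacity. The parse_properties and check_sizes helpers are hypothetical and only handle plain key = value lines.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_sizes(props, agent, channel, sink):
    """Return True when batchSize <= transactionCapacity <= capacity."""
    capacity = int(props[f"{agent}.channels.{channel}.capacity"])
    tx_capacity = int(props[f"{agent}.channels.{channel}.transactionCapacity"])
    batch_size = int(props[f"{agent}.sinks.{sink}.hdfs.batchSize"])
    return batch_size <= tx_capacity <= capacity


# Values copied from flume-hive_tail.conf above.
CONF = """
# define channels
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# define sinks
a2.sinks.k2.hdfs.batchSize = 10
"""

sizes_ok = check_sizes(parse_properties(CONF), "a2", "c2", "k2")
print(sizes_ok)
```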

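A brief introduction to the Spooling Directory Source

The Spooling Directory Source ingests data from a "spooling" directory on disk. It watches the specified directory for new files and parses each new file as it appears. The event parsing logic is pluggable. Once the entire contents of a file have been read into the channel, the Spooling Directory Source renames or deletes the file to mark it as fully read. Unlike the Exec Source, this source is reliable and does not lose data.

Now let's write another agent: flume-app.conf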
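The "read the whole file, then rename it" behavior can be imitated with a short Python sketch. It is only a simulation of the idea in a temporary directory, not Flume code: poll_spool_dir and emit are made-up names, and the default suffix .COMPLETED is assumed here to mirror Flume's default fileSuffix.

```python
import os
import tempfile


def poll_spool_dir(spool_dir, emit, suffix=".COMPLETED"):
    """Read each new file line by line, then rename it to mark it as done."""
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if name.endswith(suffix) or not os.path.isfile(path):
            continue  # already processed, or not a regular file
        with open(path, encoding="utf-8") as f:
            for line in f:
                emit(line.rstrip("\n"))  # one event per line
        os.rename(path, path + suffix)


with tempfile.TemporaryDirectory() as spool_dir:
    with open(os.path.join(spool_dir, "app.txt"), "w", encoding="utf-8") as f:
        f.write("a\nb\n")

    collected = []
    poll_spool_dir(spool_dir, collected.append)
    poll_spool_dir(spool_dir, collected.append)  # second pass finds nothing new
    remaining = sorted(os.listdir(spool_dir))

print(collected, remaining)
```

Because the rename happens only after the whole file has been read, a second pass over the directory does not pick the same file up again.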

 
a3.sources = r3
a3.channels = c3
a3.sinks = k3


a3.sources.r3.type = spooldir
# Regular expression: ignore files whose names end in .log
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.spoolDir = /opt/cdh-5.5.0/flume-1.6.0-cdh5.5.0/spool_logs
# Once a file has been collected, the original file gets a .finish suffix
a3.sources.r3.fileSuffix = .finish


a3.channels.c3.type = file
# Create the following two directories yourself
a3.channels.c3.checkpointDir = /opt/cdh-5.5.0/flume-1.6.0-cdh5.5.0/filechannel/chkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.5.0/flume-1.6.0-cdh5.5.0/filechannel/data


a3.sinks.k3.type = hdfs
# The next two properties automatically create a subdirectory for the current date (year, month, day)
a3.sinks.k3.hdfs.useLocalTimeStamp = true
a3.sinks.k3.hdfs.path = hdfs://BPF:9000/user/bpf/flume/spooling_logs/%Y%m%d
#
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10


a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
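To see which file names the ignorePattern in this config would skip, the same regular expression can be tried out in Python. Two assumptions here: the doubled backslash in the properties file is read as a single backslash, so the effective pattern is ^(.)*\.log$, and Python's re module treats this particular pattern the same way Java's regex engine does. The is_ignored helper is hypothetical.

```python
import re

# Effective pattern after properties-file unescaping (assumed): ^(.)*\.log$
IGNORE_PATTERN = re.compile(r"^(.)*\.log$")


def is_ignored(filename):
    """Return True if the spooling source would skip this file name."""
    return IGNORE_PATTERN.match(filename) is not None


for name in ["app.log", "app.log.1", "data.txt", "applog"]:
    print(name, is_ignored(name))
```

Only names that end in a literal ".log" are skipped. A rotated file such as app.log.1 would still be collected.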

 
bin/flume-ng agent \
-c conf \
-n a3 \
-f conf/flume-app.conf \
-Dflume.root.logger=DEBUG,console
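As a rough preview of the directory that the %Y%m%d escape in hdfs.path produces, the expansion can be imitated with Python's strftime. This is only an analogy: it assumes these three escapes mean the same thing in Flume as in strftime (four-digit year, two-digit month, two-digit day), and dated_path is a made-up helper.

```python
from datetime import datetime


def dated_path(base, when):
    """Append a yyyymmdd subdirectory to the base path."""
    return base + when.strftime("%Y%m%d")


example = dated_path(
    "hdfs://BPF:9000/user/bpf/flume/spooling_logs/",
    datetime(2016, 3, 7),  # arbitrary example date
)
print(example)  # hdfs://BPF:9000/user/bpf/flume/spooling_logs/20160307
```

With useLocalTimeStamp = true, the date comes from the agent machine's local clock rather than from a timestamp header on the event.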
