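Preface

In a complete big data processing system, HDFS plus MapReduce (or Spark) plus Hive form the core of the analysis stack, but the system also needs auxiliary subsystems for data collection, export of result data, and job scheduling. The Hadoop ecosystem offers convenient open source frameworks for each of these auxiliary tools, as shown in the figure: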

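Flume, a Log Collection Framework

Introduction to Flume

Overview

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.
Flume can collect source data in many forms, such as files and socket packets, and can deliver what it collects to many external storage systems, including HDFS, HBase, and Kafka.
Ordinary collection requirements can be met with a simple Flume configuration.
For special scenarios Flume can also be extended with custom components, so it fits most log collection use cases.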

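How It Works

The central role in a distributed Flume deployment is the agent; a Flume collection system is a set of agents connected to one another.
Each agent acts as a data courier and contains three components:

a) Source: the collection end, which connects to the data source and pulls data in

b) Channel: the transport path inside the agent, which carries data from the source to the sink

c) Sink: the delivery end, which passes the collected data on to the next agent or to the final storage system

==Data moves from Source to Channel to Sink in the form of Events; an Event is the unit of data flow.==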
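
The three roles can be pictured with a toy model. The sketch below is not Flume code: the `Event`, `Channel`, `run_source`, and `run_sink` names are made up for illustration, and a real agent adds transactions, batching, and threading on top of this idea.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Unit of data flow: a byte payload plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Channel:
    """In-agent buffer between source and sink (a plain FIFO queue here)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        return self._queue.popleft() if self._queue else None


def run_source(lines, channel):
    """Source side: wrap each incoming line in an Event and put it on the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8")))


def run_sink(channel, destination):
    """Sink side: drain the channel and hand each event body to the destination."""
    while (event := channel.take()) is not None:
        destination.append(event.body.decode("utf-8"))


channel = Channel()
stored = []  # stands in for HDFS, Kafka, or the next agent
run_source(["login ok", "page view"], channel)
run_sink(channel, stored)
print(stored)
```

The source and sink never talk to each other directly; the channel is the only thing they share, which is what lets an agent absorb short bursts from the source side.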

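Flume Collection System Topologies

1. Simple topology

A single agent collects the data.

2. Complex topology

Multiple tiers of agents chained in series.
![image](https://note.youdao.com/yws/public/resource/2a1973505ed147e8a34da5b2c1189f29/xmlnote/65014D778B014DB8A4A8EE25077053F6/13190)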

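Flume Architecture

The Flume architecture has three parts: the data source, Flume itself, and the destination.

Data sources come in many kinds, for example a directory, HTTP, or Kafka. Flume provides source components to collect from them.

1. Source: collects logs

Source types:

1. Spooling directory source: collects logs from files in a directory

2. HTTP source: collects logs received over HTTP

3. Kafka source: collects logs from Kafka

...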

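Collected logs need to be buffered. Flume provides channel components for this.

2. Channel: buffers logs

Channel types:

1. Memory channel: buffers events in memory (the most commonly used)

2. File channel: buffers events in files on local disk

3. JDBC channel: buffers events in a relational database through JDBC

4. Kafka channel: buffers events in Kafka

...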

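For example:

    # Describe and configure the channel; this one buffers events in memory
    a1.channels.c1.type=memory
    # Maximum number of events the channel can hold
    a1.channels.c1.capacity=1000
    # Maximum number of events taken from the source or handed to the sink per transaction
    a1.channels.c1.transactionCapacity=100

Alternatively, use files as the temporary buffer, which is safer:

    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /home/uplooking/data/flume/checkpoint
    a1.channels.c1.dataDirs = /home/uplooking/data/flume/data

==In production the memory channel is generally used. It is faster, but events still in memory are lost if the agent process dies, whereas the file channel persists them to disk.==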
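
To make the two numbers concrete, here is a simplified model of how `capacity` and `transactionCapacity` constrain a channel. It is a sketch under stated assumptions rather than Flume's implementation: `BoundedChannel` and `ChannelFullError` are hypothetical names, and real channels also handle rollback and concurrency.

```python
class ChannelFullError(Exception):
    """Raised when a batch cannot be committed to the channel."""


class BoundedChannel:
    def __init__(self, capacity=1000, transaction_capacity=100):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self._events = []

    def put_batch(self, batch):
        """Commit a batch of events atomically: all of them or none."""
        if len(batch) > self.transaction_capacity:
            raise ChannelFullError("batch exceeds transactionCapacity")
        if len(self._events) + len(batch) > self.capacity:
            raise ChannelFullError("channel is at capacity")
        self._events.extend(batch)

    def take_batch(self, max_events):
        """Remove and return up to max_events events, capped by transactionCapacity."""
        n = min(max_events, self.transaction_capacity, len(self._events))
        taken, self._events = self._events[:n], self._events[n:]
        return taken

    def __len__(self):
        return len(self._events)


channel = BoundedChannel(capacity=5, transaction_capacity=3)
channel.put_batch(["e1", "e2", "e3"])
try:
    channel.put_batch(["e4", "e5", "e6"])  # 3 + 3 would exceed the capacity of 5
    overflowed = False
except ChannelFullError:
    overflowed = True
drained = channel.take_batch(10)  # capped at 3 events per transaction
print(overflowed, drained)
```

The practical reading: `capacity` bounds how far the sink may fall behind the source, while `transactionCapacity` bounds the size of any single batch moved in or out.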

The buffered data ultimately has to be stored. Flume provides sink components for this.

3. Sink: stores logs

Sink types:

1. HDFS sink: writes to HDFS

2. HBase sink: writes to HBase

3. Hive sink: writes to Hive

4. Kafka sink: writes to Kafka

...

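Flume in Practice

Installing and Deploying Flume

1. Installing Flume is simple: unpacking the archive is enough, provided a Hadoop environment is already in place. Upload the installation package to the node where the data source lives, unpack it with `tar -zxvf apache-flume-1.6.0-bin.tar.gz`, then go into the Flume directory and set JAVA_HOME in conf/flume-env.sh.
2. Describe the collection plan that matches your requirements in a configuration file (the file name is up to you).
3. Start a Flume agent on the relevant node, pointing it at that configuration file.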

Example

Start with the simplest possible example to check that the environment works.

Step 1: create a new file in Flume's conf directory

    vi netcat-logger.conf
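```
# Name the components of this agent
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe and configure the source r1
a1.sources.r1.type=netcat
# Binding to localhost (the loopback address) means only the local machine can connect;
# binding to the hostname server1 lets other machines connect as well
a1.sources.r1.bind=localhost
a1.sources.r1.port=8888

# Describe and configure the sink k1
a1.sinks.k1.type=logger

# Describe and configure the channel; here events are buffered in memory
a1.channels.c1.type=memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity=1000
# Maximum number of events taken from the source or handed to the sink per transaction
a1.channels.c1.transactionCapacity=100

# Wire the source, channel, and sink together.
# Note: a source uses "channels" (plural, with an s), a sink uses "channel" (singular)
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
```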
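
The `channels` versus `channel` distinction is easy to get wrong, so a small check before starting the agent can save a restart. The following is a rough sketch of such a check, not a Flume tool: `parse_properties` and `check_wiring` are hypothetical helpers that only understand the plain `key=value` lines used in this article.

```python
def parse_properties(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent="a1"):
    """Return a list of wiring problems for one agent (an empty list means none found)."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    for src in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{src}.channels", "").split()
        if not bound:
            problems.append(f"source {src}: missing 'channels' (plural)")
        problems += [f"source {src}: unknown channel {c}" for c in bound if c not in channels]
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel")
        if bound is None:
            problems.append(f"sink {sink}: missing 'channel' (singular)")
        elif bound not in channels:
            problems.append(f"sink {sink}: unknown channel {bound}")
    return problems


good = """
a1.sources=r1
a1.sinks=k1
a1.channels=c1
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
"""
# Simulate the classic mistake: dropping the "s" on the source side.
bad = good.replace("a1.sources.r1.channels", "a1.sources.r1.channel")

print(check_wiring(parse_properties(good)))
print(check_wiring(parse_properties(bad)))
```

The asymmetry reflects the data model: a source may fan out to several channels, while a sink always drains exactly one.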

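Step 2: start the agent to begin collecting

    bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

- `-c conf`: the directory that holds Flume's own configuration files
- `-f conf/netcat-logger.conf`: the collection plan we just wrote
- `-n a1`: the name of this agent, matching the `a1` prefix used in the configuration file
- `-Dflume.root.logger=INFO,console`: log at INFO level to the console

Step 3: test

Send some data to the port the agent is listening on so that it has something to collect. From any machine that can reach the agent node over the network, run the command below. With the configuration above the source is bound to localhost, so that means running it on the agent machine itself against port 8888. Each line typed is then printed by the logger sink on the agent's console.

    telnet agent-hostname port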
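
If telnet is not installed, the same test can be scripted. The sketch below assumes the netcat source is running with its default behaviour of acknowledging every event with a line of text; `frame_events` and `send_lines` are names chosen for this example.

```python
import socket


def frame_events(lines):
    """Encode lines as newline-terminated UTF-8; each line becomes one event."""
    return b"".join((line + "\n").encode("utf-8") for line in lines)


def send_lines(host, port, lines, timeout=5.0):
    """Send the lines over TCP and return one acknowledgement line per event sent."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(frame_events(lines))
        reader = conn.makefile("r", encoding="utf-8")
        return [reader.readline().strip() for _ in lines]


# Example call, assuming the agent above is running on this machine:
# send_lines("localhost", 8888, ["hello flume", "second event"])
```

The newline is what separates events, which is why a line typed into telnet only shows up on the agent console after Enter is pressed.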


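==Aside: a common question is how hard links and soft links differ on Linux. A hard link is simply another reference to the same file data, much like a second reference to the same object in Java, and the data is only freed when the last link is removed. A soft (symbolic) link is a separate small file that stores the path of its target; deleting the link itself with rm leaves the target file untouched.==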

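## Collection Examples
### Collecting from a Directory

Requirement: new files keep appearing in a particular directory on a server, and each new file must be collected as soon as it shows up.
Based on that requirement, first define the three key elements:
- Source: monitor a file directory with spooldir
- Sink: logger
- Channel between source and sink: either a file channel or a memory channel

Write the configuration file: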

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the source
    a1.sources.r1.type = spooldir
    # Directory to watch
    a1.sources.r1.spoolDir = /home/hadoop/flumespool
    # Whether to add the file name to the header of each event built from the file
    a1.sources.r1.fileHeader = true

    # Describe the sink
    a1.sinks.k1.type = logger

    # Buffer events in memory
    a1.channels.c1.type = memory
    # The channel holds at most 1000 events
    a1.channels.c1.capacity = 1000
    # At most 100 events are moved into or out of the channel per transaction
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Start the agent:

    bin/flume-ng agent -c conf -f conf/spooldir-logger.conf -n a1 -Dflume.root.logger=INFO,console
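
One thing to watch with the spooling directory source: a file must be complete before it lands in the watched directory and must not be modified afterwards, and a file name must not be reused. A common way to respect that is to write the file elsewhere and move it in with a single rename. The sketch below illustrates the idea with throwaway directories; `drop_into_spool` and the staging directory are assumptions of this example, not part of Flume.

```python
import os
import tempfile


def drop_into_spool(spool_dir, staging_dir, name, content):
    """Write the file completely in staging_dir, then move it into spool_dir in one step."""
    staged = os.path.join(staging_dir, name)
    with open(staged, "w", encoding="utf-8") as handle:
        handle.write(content)
    final = os.path.join(spool_dir, name)
    if os.path.exists(final):
        raise FileExistsError(f"{name} already exists in the spool directory")
    os.replace(staged, final)  # atomic when both directories are on the same filesystem
    return final


# Demo with temporary directories standing in for /home/hadoop/flumespool.
base = tempfile.mkdtemp()
spool = os.path.join(base, "flumespool")
staging = os.path.join(base, "staging")
os.makedirs(spool)
os.makedirs(staging)
path = drop_into_spool(spool, staging, "access-0001.log", "line one\nline two\n")
print(path)
```

Because the rename is a single step, the source never sees a half-written file.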

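### Collecting a File into HDFS

Requirement: a business system writes logs with log4j, the log file keeps growing, and the data appended to it must be collected into HDFS in real time.

Based on that requirement, first define the three key elements:
- Source: watch the file for new content with exec 'tail -F file'
- Sink: the HDFS file system, using the hdfs sink
- Channel between source and sink: either a file channel or a memory channel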

Step 1: write the configuration file

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the source; exec means "run a command"
    a1.sources.r1.type = exec
    # tail -F follows the file by name and reopens it after rotation;
    # tail -f follows the open file descriptor (the inode)
    a1.sources.r1.command = tail -F /home/hadoop/log/test.log

    # Describe the sink: the delivery target is HDFS
    a1.sinks.k1.type = hdfs
    # Target directory; Flume substitutes the time escape sequences
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
    # Prefix for the generated file names
    a1.sinks.k1.hdfs.filePrefix = events-
    # Round the timestamp down to 10 minutes, so the directory changes every 10 minutes
    a1.sinks.k1.hdfs.round = true
    a1.sinks.k1.hdfs.roundValue = 10
    a1.sinks.k1.hdfs.roundUnit = minute
    # Seconds to wait before rolling the current file
    a1.sinks.k1.hdfs.rollInterval = 3
    # File size that triggers a roll, in bytes
    a1.sinks.k1.hdfs.rollSize = 500
    # Number of events written before rolling. Taken together: the file rolls
    # after 20 events, or 500 bytes, or 3 seconds, whichever comes first
    a1.sinks.k1.hdfs.rollCount = 20
    # Flush to HDFS after every 5 events
    a1.sinks.k1.hdfs.batchSize = 5
    # Use the local time to expand the escape sequences in the path
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # Output file type; the default is SequenceFile, DataStream writes plain text
    a1.sinks.k1.hdfs.fileType = DataStream

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
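
How `hdfs.path` and the three `round*` settings combine is easier to see with a worked example. The sketch below mimics the behaviour in plain Python, relying on the fact that the escape sequences used here (`%y`, `%m`, `%d`, `%H`, `%M`) mean the same thing in `strftime`. `bucket_path` is a made-up helper that only handles rounding by minutes.

```python
from datetime import datetime


def bucket_path(pattern, ts, round_value=10):
    """Round ts down to a multiple of round_value minutes, then expand the pattern."""
    rounded = ts.replace(minute=ts.minute - ts.minute % round_value, second=0, microsecond=0)
    return rounded.strftime(pattern)


pattern = "/flume/events/%y-%m-%d/%H%M/"
print(bucket_path(pattern, datetime(2024, 5, 17, 14, 37, 22)))
# prints /flume/events/24-05-17/1430/
```

Every event arriving between 14:30:00 and 14:39:59 therefore lands in the same directory, and a new directory appears once the clock reaches 14:40.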

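Step 2: write a script that imitates log generation (makelog.sh)

    #!/bin/bash
    # Append one line to the log every second
    while true
    do
        echo iamkris >> /home/hadoop/log/test.log
        sleep 1
    done

Step 3: start the agent, then run the script

    bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1

    ./makelog.sh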
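
While this runs, new files keep appearing in HDFS because the sink rolls the current file as soon as any one of the three `roll*` thresholds is reached. The decision can be sketched as follows; this is a simplified illustration with the values from this example as defaults, not the sink's actual code, and it relies on the convention that a threshold of 0 disables that trigger.

```python
def should_roll(elapsed_s, size_bytes, event_count,
                roll_interval=3, roll_size=500, roll_count=20):
    """Return True when any enabled threshold is reached; a threshold of 0 is disabled."""
    by_time = roll_interval > 0 and elapsed_s >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count


print(should_roll(elapsed_s=1, size_bytes=16, event_count=2))  # prints False
print(should_roll(elapsed_s=3, size_bytes=24, event_count=3))  # prints True
```

With one short line written per second, the 3-second interval is reached long before 500 bytes or 20 events, so these settings produce many tiny files. They are convenient for a demo; larger thresholds are the usual choice for real workloads on HDFS.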


### Configuring Avro

When there are several agents, they can communicate with each other over Avro.

Step 1: write the configuration file for the Avro client

    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the source
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /home/hadoop/log/test.log

    # Describe the sink.
    # The address is not the local machine but the service address of another machine:
    # the Avro sink is the sending side (the Avro client) and sends to server2
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = server2
    a1.sinks.k1.port = 4141
    a1.sinks.k1.batch-size = 2

    # Use a channel which buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1


Step 2: write the configuration file for the Avro server

    a1.sources=r1
    a1.sinks=k1
    a1.channels=c1

    # Avro server: the receiving side
    a1.sources.r1.type=avro
    # Listen on every address of this machine
    a1.sources.r1.bind=0.0.0.0
    a1.sources.r1.port=4141

    a1.sinks.k1.type=hdfs
    a1.sinks.k1.hdfs.path=/flume/avrotohdfs/%y-%m-%d/%H-%M
    a1.sinks.k1.hdfs.filePrefix=events-
    a1.sinks.k1.hdfs.round=true
    a1.sinks.k1.hdfs.roundValue=10
    a1.sinks.k1.hdfs.roundUnit=minute
    a1.sinks.k1.hdfs.rollInterval=60
    a1.sinks.k1.hdfs.rollSize=500
    a1.sinks.k1.hdfs.rollCount=20
    a1.sinks.k1.hdfs.batchSize=5
    a1.sinks.k1.hdfs.useLocalTimeStamp=true
    a1.sinks.k1.hdfs.fileType=DataStream

    a1.channels.c1.type=memory
    a1.channels.c1.capacity=1000
    a1.channels.c1.transactionCapacity=100

    a1.sources.r1.channels=c1
    a1.sinks.k1.channel=c1


Step 3: start each agent

Avro server:

    bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1

Avro client:

    bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1
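
The server is listed first because the client's Avro sink needs something listening on server2:4141 to deliver to. Starting the receiving side first avoids connection errors at startup. If the two agents are launched from a script, a small wait loop along these lines could sit between the two commands. `wait_for_port` is a hypothetical helper, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, poll_s=0.5):
    """Return True once a TCP connection to host:port succeeds, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            time.sleep(poll_s)
    return False


# Example call before starting the Avro client agent:
# wait_for_port("server2", 4141)
```

This only confirms that the port accepts connections; it does not verify that the listener is actually a Flume Avro source.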


### Collecting into Kafka
#### Configuration

    a1.sources=r1
    a1.channels=c1
    a1.sinks=k1

    # Follow the log file with tail -F
    a1.sources.r1.type=exec
    a1.sources.r1.command=tail -F /export/servers/logs/data/data.log

    a1.channels.c1.type=memory
    a1.channels.c1.capacity=1000
    a1.channels.c1.transactionCapacity=100

    # Kafka sink: publish each event to the topic flumetokafka
    a1.sinks.k1.type=org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.topic=flumetokafka
    a1.sinks.k1.brokerList=server1:9092
    a1.sinks.k1.requiredAcks=1

    a1.sources.r1.channels=c1
    a1.sinks.k1.channel=c1

#### Start

    bin/flume-ng agent -n a1 -c conf -f conf/catdata.conf -Dflume.root.logger=INFO,console
