Installing and Deploying the Flume Log Collection System

Flume was developed by Cloudera and later contributed to Apache; it is now a top-level Apache open-source project.

Basic introduction: according to the official documentation, Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its architecture, based on streaming data flows, is simple and flexible; it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

Typical use: the industry mainly uses Flume to collect massive volumes of distributed logs. A common pattern is loading the full log stream into Hadoop for offline analysis while feeding the real-time stream to online analysis.

Official documentation: flume

Installing and Running Flume

Prerequisites:

  1. A Java runtime - Java 1.6 or later (Java 1.7 recommended)
  2. Miscellaneous: sufficient memory, disk space, and read/write permissions on the directories to be collected

Installation and running:
Extremely simple: download, extract, edit the configuration file, and run.
下載:

$: wget http://apache.dataguru.cn/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
$: tar -xzvf apache-flume-1.6.0-bin.tar.gz

Run:
Start the agent via the flume-ng shell script in the bin directory, for example:

$: bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

(-n sets the agent name, -c the configuration directory, and -f the properties file to load.)

A simple example (from the official documentation)

  1. Configure Java
$: cd conf
$: cp flume-env.sh.template flume-env.sh
$: vim flume-env.sh
# fill in JAVA_HOME here, for example:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

  2. Write the configuration file

$: cp flume-conf.properties.template example.conf
$: vim example.conf
# fill in the following content
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  3. Start Flume
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
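Once the agent is running, any newline-terminated text sent to port 44444 becomes an event and is written to the console by the logger sink. A minimal test client (a sketch; it assumes the agent from example.conf is listening on localhost:44444 — the netcat source, with its default acknowledgement setting, answers "OK" per event):

```python
import socket

def send_event(line, host="localhost", port=44444):
    """Send one line to the agent's netcat source and return its reply.

    The netcat source treats each newline-terminated line as one event
    and, with its default ack behavior, replies "OK" for each event.
    """
    with socket.create_connection((host, port)) as s:
        s.sendall(line.encode("utf-8") + b"\n")
        return s.recv(16)
```

For example, send_event("Hello Flume") should make the line appear in the agent's console output.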

Common Flume Architectures

  • Cascading (waterfall) architecture

flume agent1.1 -> flume agent 2 -> flume agent3
flume agent1.2
This architecture handles simple data relaying. We once used it to route data from our Alibaba Cloud servers back to the internal network for testing: the agent1 instances are deployed on each Alibaba Cloud host to collect data; agent2, also on Alibaba Cloud, aggregates that data and forwards it over an SSH tunnel, so everything flows to agent3 on the internal network for storage and use.

NOTE: In this architecture the hop from agent2 to agent3 has no load balancing, so there is a potential single point of failure.

flume agent1: responsible for data collection; example configuration

NOTE: The com.jfz.cdp.flume.source.CustomSpoolDirectorySource used in this configuration is our custom source, covered later.
### Main
a1.sources = src-exec1 src-cdir
a1.channels = ch-file1 ch-file2
a1.sinks = sink-avro1 sink-avro2

### Source ###
#exec source
a1.sources.src-exec1.type = exec
a1.sources.src-exec1.command = tail -F /data/java_logs/java1/bbs/mc/info.log
a1.sources.src-exec1.channels = ch-file1

#exec interceptor set
a1.sources.src-exec1.interceptors = i1-1 i1-2
a1.sources.src-exec1.interceptors.i1-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-exec1.interceptors.i1-1.preserveExisting = false
a1.sources.src-exec1.interceptors.i1-1.hostHeader = clct-host
a1.sources.src-exec1.interceptors.i1-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

#custom spooldir
a1.sources.src-cdir.type = com.jfz.cdp.flume.source.CustomSpoolDirectorySource
a1.sources.src-cdir.channels = ch-file2
a1.sources.src-cdir.spoolDir = ../data/spoolDir_in
a1.sources.src-cdir.fileHeader = true
a1.sources.src-cdir.basenameHeader=true
a1.sources.src-cdir.decodeErrorPolicy = IGNORE
a1.sources.src-cdir.deletePolicy = immediate
a1.sources.src-cdir.skipReadFileModifyTimeLessThanMillis = 60000

#custom spooldir interceptor set
a1.sources.src-cdir.interceptors = i2-1 i2-2
a1.sources.src-cdir.interceptors.i2-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-cdir.interceptors.i2-1.preserveExisting = false
a1.sources.src-cdir.interceptors.i2-1.hostHeader = clct-host
a1.sources.src-cdir.interceptors.i2-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

### Channel ###
#file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data

#file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

### Sink ###
#sink1
a1.sinks.sink-avro1.type = avro
a1.sinks.sink-avro1.channel = ch-file1
a1.sinks.sink-avro1.hostname = 10.162.95.96
a1.sinks.sink-avro1.port = 50001
a1.sinks.sink-avro1.threads = 150

#sink2
a1.sinks.sink-avro2.type = avro
a1.sinks.sink-avro2.channel = ch-file2
a1.sinks.sink-avro2.hostname = 10.162.95.96
a1.sinks.sink-avro2.port = 50002
a1.sinks.sink-avro2.threads = 150
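The two interceptors attached to each source stamp every event with the collecting host and an epoch-millis timestamp before it enters the channel; downstream agents set preserveExisting = true so the values written by agent1 survive. A rough sketch of that effect (illustrative only; the host header name follows this configuration's hostHeader = clct-host, and the timestamp interceptor's header is named "timestamp"):

```python
import socket
import time

def apply_interceptors(event_body, preserve_existing=False, headers=None):
    """Sketch of the Host + Timestamp interceptor pair: attach the
    collecting host under 'clct-host' (the configured hostHeader) and
    an epoch-millis 'timestamp' header to an event.

    With preserve_existing=True (as on agent2/agent3), headers already
    present on the event are kept instead of being overwritten.
    """
    headers = dict(headers or {})
    if not preserve_existing or "clct-host" not in headers:
        headers["clct-host"] = socket.gethostname()
    if not preserve_existing or "timestamp" not in headers:
        headers["timestamp"] = str(int(time.time() * 1000))
    return {"headers": headers, "body": event_body}
```

This is why an event arriving at the final agent still reports the host it was originally collected on rather than the relay's hostname.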

flume agent2: relays data as the intermediate tier; example configuration

### Main
a1.sources = src-avro1 src-avro2
a1.channels = ch-file1 ch-file2
a1.sinks = sink-avro1 sink-avro2

### Source ###
#avro source 1 for real-time stream
a1.sources.src-avro1.type = avro
a1.sources.src-avro1.channels = ch-file1
a1.sources.src-avro1.bind = 0.0.0.0
a1.sources.src-avro1.port = 50001
a1.sources.src-avro1.threads = 150

#avro interceptor 1
a1.sources.src-avro1.interceptors = i1-1 i1-2
a1.sources.src-avro1.interceptors.i1-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-1.preserveExisting = true
a1.sources.src-avro1.interceptors.i1-1.hostHeader = clct-host
a1.sources.src-avro1.interceptors.i1-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-2.preserveExisting = true

#avro source 2 from spooldir
a1.sources.src-avro2.type = avro
a1.sources.src-avro2.channels = ch-file2
a1.sources.src-avro2.bind = 0.0.0.0
a1.sources.src-avro2.port = 50002
a1.sources.src-avro2.threads = 150

#avro interceptor 2
a1.sources.src-avro2.interceptors = i2-1 i2-2
a1.sources.src-avro2.interceptors.i2-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-1.preserveExisting = true
a1.sources.src-avro2.interceptors.i2-1.hostHeader = clct-host
a1.sources.src-avro2.interceptors.i2-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-2.preserveExisting = true

### Channel ###
#file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data

#file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

### Sink ###
#sink1
a1.sinks.sink-avro1.type = avro
a1.sinks.sink-avro1.channel = ch-file1
a1.sinks.sink-avro1.hostname = 127.0.0.1
a1.sinks.sink-avro1.port = 60001
a1.sinks.sink-avro1.threads = 150

#sink2
a1.sinks.sink-avro2.type = avro
a1.sinks.sink-avro2.channel = ch-file2
a1.sinks.sink-avro2.hostname = 127.0.0.1
a1.sinks.sink-avro2.port = 60002
a1.sinks.sink-avro2.threads = 150

flume agent3: prepares data for storage; example configuration

### Main ###
a1.sources = src-avro1 src-avro2
a1.channels = ch-file1 ch-file2
a1.sinks = sink-rfm sink-hdfs2

### Source ###
#avro source 1 for real-time stream
a1.sources.src-avro1.type = avro
a1.sources.src-avro1.channels = ch-file1
a1.sources.src-avro1.bind = 0.0.0.0
a1.sources.src-avro1.port = 60001
a1.sources.src-avro1.threads = 150

#avro interceptor 1
a1.sources.src-avro1.interceptors = i1-1 i1-2
a1.sources.src-avro1.interceptors.i1-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-1.preserveExisting = true
a1.sources.src-avro1.interceptors.i1-1.hostHeader = clct-host
a1.sources.src-avro1.interceptors.i1-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-2.preserveExisting = true

#avro source 2 from spooldir
a1.sources.src-avro2.type = avro
a1.sources.src-avro2.channels = ch-file2
a1.sources.src-avro2.bind = 0.0.0.0
a1.sources.src-avro2.port = 60002
a1.sources.src-avro2.threads = 150

#avro interceptor 2
a1.sources.src-avro2.interceptors = i2-1 i2-2
a1.sources.src-avro2.interceptors.i2-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-1.preserveExisting = true
a1.sources.src-avro2.interceptors.i2-1.hostHeader = clct-host
a1.sources.src-avro2.interceptors.i2-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-2.preserveExisting = true

### Channel ###
#file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data

#file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

### Sink ###
# improved rolling file sink1
a1.sinks.sink-rfm.type = com.jfz.cdp.flume.sinks.ImprovedRollingFileSink
a1.sinks.sink-rfm.channel = ch-file1
a1.sinks.sink-rfm.sink.directory = ../data/logs/%Y-%m-%d
a1.sinks.sink-rfm.sink.fileName = %H-%M-%S
a1.sinks.sink-rfm.sink.rollInterval = 3600
a1.sinks.sink-rfm.sink.useLocalTime = false


#sink2 to hdfs
a1.sinks.sink-hdfs2.type = hdfs
a1.sinks.sink-hdfs2.channel = ch-file2
a1.sinks.sink-hdfs2.hdfs.path = /user/dadeng/flume_logs/%{category}/dt=%Y-%m-%d
a1.sinks.sink-hdfs2.hdfs.filePrefix = %{clct-host}_%{basename}
a1.sinks.sink-hdfs2.hdfs.fileType = DataStream
a1.sinks.sink-hdfs2.hdfs.rollSize = 102400000
a1.sinks.sink-hdfs2.hdfs.rollCount = 500000
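The %{...} escapes in hdfs.path and hdfs.filePrefix are filled in from event headers, which is why the upstream interceptors matter: %{clct-host} and %{basename} resolve to headers set on agent1, and the time escapes (%Y-%m-%d) are resolved from the timestamp header. A sketch of the header substitution only (time escapes omitted):

```python
import re

def expand_header_escapes(pattern, headers):
    """Replace %{name} escapes with the matching event-header values,
    as the HDFS sink does for hdfs.path and hdfs.filePrefix.
    Time escapes such as %Y-%m-%d (resolved from the 'timestamp'
    header) are not handled in this sketch."""
    return re.sub(r"%\{([^}]+)\}",
                  lambda m: headers.get(m.group(1), ""), pattern)
```

So an event carrying category=bbs, clct-host=web01, and basename=info.log would land under .../flume_logs/bbs/ in a file prefixed web01_info.log.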
  • Three-tier architecture with a control tier in the middle that load balances and avoids single points of failure; suitable for reliable, full-volume data delivery.

Example of agent1 sending its data to two control agents at the same time:

## source configuration omitted here

a1.channels = ch-file1 ch-file2
a1.sinks = sink-avro1-1 sink-avro1-2 sink-avro2-1 sink-avro2-2

#file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data

#file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

#sink groups with load balancing (declare all groups in a single assignment;
#a second "a1.sinkgroups =" line would overwrite this one)
a1.sinkgroups = sg-avro1 sg-avro2
a1.sinkgroups.sg-avro1.sinks = sink-avro1-1 sink-avro1-2
a1.sinkgroups.sg-avro1.processor.type = load_balance
a1.sinkgroups.sg-avro1.processor.backoff = true

#sink1 to 10.1.2.51:41414
a1.sinks.sink-avro1-1.type = avro
a1.sinks.sink-avro1-1.channel = ch-file1
a1.sinks.sink-avro1-1.hostname = 10.1.2.51
a1.sinks.sink-avro1-1.port = 41414

#sink2 to 10.1.2.52:41414
a1.sinks.sink-avro1-2.type = avro
a1.sinks.sink-avro1-2.channel = ch-file1
a1.sinks.sink-avro1-2.hostname = 10.1.2.52
a1.sinks.sink-avro1-2.port = 41414

#sink group with load balancing
a1.sinkgroups.sg-avro2.sinks = sink-avro2-1 sink-avro2-2
a1.sinkgroups.sg-avro2.processor.type = load_balance
a1.sinkgroups.sg-avro2.processor.backoff = true

#sink1 to 10.1.2.51:41415
a1.sinks.sink-avro2-1.type = avro
a1.sinks.sink-avro2-1.channel = ch-file2
a1.sinks.sink-avro2-1.hostname = 10.1.2.51
a1.sinks.sink-avro2-1.port = 41415

#sink2 to 10.1.2.52:41415
a1.sinks.sink-avro2-2.type = avro
a1.sinks.sink-avro2-2.channel = ch-file2
a1.sinks.sink-avro2-2.hostname = 10.1.2.52
a1.sinks.sink-avro2-2.port = 41415
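With processor.type = load_balance, each delivery attempt is routed to one sink in the group, and backoff = true temporarily blacklists a sink that fails so traffic shifts to its peer. A minimal sketch of that selection logic (names and the fixed backoff window are illustrative; Flume's actual processor supports round_robin/random selectors and grows the backoff exponentially):

```python
import random
import time

class LoadBalancingSinkRunner:
    """Sketch of a load-balancing sink processor with backoff.

    `sinks` maps a sink name to a zero-argument callable that raises on
    failure; a failed sink is blacklisted for `backoff_seconds`.
    """
    def __init__(self, sinks, backoff_seconds=2.0):
        self.sinks = dict(sinks)
        self.backoff = backoff_seconds
        self.blacklist = {}  # sink name -> time it becomes usable again

    def process(self):
        now = time.time()
        candidates = [n for n in self.sinks
                      if self.blacklist.get(n, 0) <= now]
        random.shuffle(candidates)  # random selection among usable sinks
        for name in candidates:
            try:
                return self.sinks[name]()
            except Exception:
                # back off this sink and fall through to the next one
                self.blacklist[name] = now + self.backoff
        raise RuntimeError("all sinks failed or are backing off")
```

As long as one control agent stays reachable, events keep flowing, which is the point of this tier.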

Common Flume Configuration Issues

Source
Flume ships with many source types; the most commonly used are "spooldir", "exec", and "avro".
spooldir: suited to important log transfers, where the data has generally already been written out to files before transfer.
NOTE: spooldir has two pitfalls. 1. If an undecodable byte sequence appears during transfer, Flume stops serving, so it is best to add the setting "a1.sources.src-cdir.decodeErrorPolicy = IGNORE". 2. Files placed in the spool directory must not be modified afterward; if you cp a large file into the spool directory, Flume may start reading it, notice it is still changing, and stop serving. To work around this we developed a CustomSpoolDirectorySource, which temporarily skips any file modified within the window given by the "skipReadFileModifyTimeLessThanMillis" setting.
In addition, spooldir can approach near-real-time transfer by splitting the log into new files at short intervals.
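The skip rule of our CustomSpoolDirectorySource can be sketched as follows (a hypothetical standalone illustration, not the actual source code): only files whose last modification lies outside the configured window are eligible for reading, so a file still being copied into the directory is left alone until it settles.

```python
import os
import time

def files_safe_to_read(spool_dir, skip_if_modified_within_millis=60000):
    """Return the files in spool_dir whose last modification is older
    than the configured window, mirroring the idea behind the
    skipReadFileModifyTimeLessThanMillis setting."""
    now = time.time()
    safe = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        age_millis = (now - os.path.getmtime(path)) * 1000
        if age_millis >= skip_if_modified_within_millis:
            safe.append(path)  # stable long enough to read
    return safe
```

A skipped file is simply picked up on a later scan once it has been quiet for the full window.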

exec: mainly runs tail -F xxx.log to pick up changes in real time.

avro: receives data on a specified port of a specified host and is mainly used for transporting data between agents. It can also be integrated with log4j, so that log data streams to Flume in real time over Avro.

Streaming log4j logs to Flume automatically

Maven: add one dependency

<dependency>
    <groupId>org.apache.flume.flume-ng-clients</groupId>
    <artifactId>flume-ng-log4jappender</artifactId>
    <version>${flume.version}</version>
</dependency>

Add the following to the log4j configuration

log4j.logger.flume=INFO, flume


log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = 10.1.2.50
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%m%n


