Developed by Cloudera and later contributed to Apache, Flume is now a top-level Apache open-source project.
Overview: According to the official documentation, Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its architecture is simple and flexible, based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.
Typical use cases: In industry, Flume is mainly used to collect massive volumes of distributed logs. A common pattern is shipping the full log stream into Hadoop for offline analysis while feeding a real-time stream to online analytics.
Official documentation: flume
Installing and running Flume
Prerequisites:
- Java runtime environment: Java 1.6 or later (Java 1.7 recommended)
- Others: sufficient memory and disk space, plus read/write permissions on the directories to be collected
Installation and running:
Extremely simple: download, extract, edit a configuration file, and run.
Download:
$: wget http://apache.dataguru.cn/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
$: tar -xzvf apache-flume-1.6.0-bin.tar.gz
Run:
Start it via the shell script under the bin directory, for example:
$: bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
A simple example (from the official docs)
1. Configure Java
$: cd conf
$: cp flume-env.sh.template flume-env.sh
$: vim flume-env.sh
# fill in JAVA_HOME, for example:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
2. Write the configuration file
$: cp flume-conf.properties.template example.conf
$ vim example.conf
# fill in the following content
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3. Start Flume
$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
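Once the agent is up, the netcat source can be exercised by sending newline-terminated lines to port 44444 (the official guide does this with telnet). Below is a minimal Python sketch; the throwaway stand-in server only exists so the snippet runs on its own and mimics the source's default "OK" acknowledgement. In practice you would point send_event at localhost:44444 of the running a1 agent:

```python
import socket
import threading

def send_event(host, port, line):
    """Send one newline-terminated event, as the netcat source expects."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("utf-8") + b"\n")
        # The netcat source acks each event with "OK" by default.
        return sock.recv(16).decode("utf-8").strip()

# Demo against a throwaway stand-in server (replace with localhost:44444
# once the a1 agent above is running; the stand-in only mimics the ack).
received = []

def _stand_in(server):
    conn, _ = server.accept()
    with conn:
        received.append(conn.recv(1024).rstrip(b"\n").decode("utf-8"))
        conn.sendall(b"OK\n")

server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=_stand_in, args=(server,), daemon=True).start()

ack = send_event("127.0.0.1", port, "Hello world!")
print(received[0], ack)  # Hello world! OK
```

With the real agent running, each line sent this way shows up in the console via the logger sink.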
Common Flume architectures
- Cascading (waterfall) architecture
flume agent1.1 --\
                  -> flume agent2 -> flume agent3
flume agent1.2 --/
This architecture handles simple data relay. We once used it in a test of relaying data from our Alibaba Cloud servers back to the internal network. The agent1 instances were deployed on each Alibaba Cloud host to collect data; agent2, also on Alibaba Cloud, aggregated the data and forwarded it over SSH, so that all data flowed to agent3 on the internal network for storage and use.
flume agent1: responsible for data collection. Example configuration:
### Main ###
a1.sources = src-exec1 src-cdir
a1.channels = ch-file1 ch-file2
a1.sinks = sink-avro1 sink-avro2

### Source ###
# exec source
a1.sources.src-exec1.type = exec
a1.sources.src-exec1.command = tail -F /data/java_logs/java1/bbs/mc/info.log
a1.sources.src-exec1.channels = ch-file1
# exec interceptor set
a1.sources.src-exec1.interceptors = i1-1 i1-2
a1.sources.src-exec1.interceptors.i1-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-exec1.interceptors.i1-1.preserveExisting = false
a1.sources.src-exec1.interceptors.i1-1.hostHeader = clct-host
a1.sources.src-exec1.interceptors.i1-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# custom spooldir
a1.sources.src-cdir.type = com.jfz.cdp.flume.source.CustomSpoolDirectorySource
a1.sources.src-cdir.channels = ch-file2
a1.sources.src-cdir.spoolDir = ../data/spoolDir_in
a1.sources.src-cdir.fileHeader = true
a1.sources.src-cdir.basenameHeader = true
a1.sources.src-cdir.decodeErrorPolicy = IGNORE
a1.sources.src-cdir.deletePolicy = immediate
a1.sources.src-cdir.skipReadFileModifyTimeLessThanMillis = 60000
# custom spooldir interceptor set
a1.sources.src-cdir.interceptors = i2-1 i2-2
a1.sources.src-cdir.interceptors.i2-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-cdir.interceptors.i2-1.preserveExisting = false
a1.sources.src-cdir.interceptors.i2-1.hostHeader = clct-host
a1.sources.src-cdir.interceptors.i2-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

### Channel ###
# file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data
# file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

### Sink ###
# sink1
a1.sinks.sink-avro1.type = avro
a1.sinks.sink-avro1.channel = ch-file1
a1.sinks.sink-avro1.hostname = 10.162.95.96
a1.sinks.sink-avro1.port = 50001
a1.sinks.sink-avro1.threads = 150
# sink2
a1.sinks.sink-avro2.type = avro
a1.sinks.sink-avro2.channel = ch-file2
a1.sinks.sink-avro2.hostname = 10.162.95.96
a1.sinks.sink-avro2.port = 50002
a1.sinks.sink-avro2.threads = 150
flume agent2: relays data as the middle tier. Example configuration:
### Main ###
a1.sources = src-avro1 src-avro2
a1.channels = ch-file1 ch-file2
a1.sinks = sink-avro1 sink-avro2

### Source ###
# avro source 1 for real-time stream
a1.sources.src-avro1.type = avro
a1.sources.src-avro1.channels = ch-file1
a1.sources.src-avro1.bind = 0.0.0.0
a1.sources.src-avro1.port = 50001
a1.sources.src-avro1.threads = 150
# avro interceptor 1
a1.sources.src-avro1.interceptors = i1-1 i1-2
a1.sources.src-avro1.interceptors.i1-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-1.preserveExisting = true
a1.sources.src-avro1.interceptors.i1-1.hostHeader = clct-host
a1.sources.src-avro1.interceptors.i1-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-2.preserveExisting = true
# avro source 2 from spooldir
a1.sources.src-avro2.type = avro
a1.sources.src-avro2.channels = ch-file2
a1.sources.src-avro2.bind = 0.0.0.0
a1.sources.src-avro2.port = 50002
a1.sources.src-avro2.threads = 150
# avro interceptor 2
a1.sources.src-avro2.interceptors = i2-1 i2-2
a1.sources.src-avro2.interceptors.i2-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-1.preserveExisting = true
a1.sources.src-avro2.interceptors.i2-1.hostHeader = clct-host
a1.sources.src-avro2.interceptors.i2-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-2.preserveExisting = true

### Channel ###
# file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data
# file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

### Sink ###
# sink1
a1.sinks.sink-avro1.type = avro
a1.sinks.sink-avro1.channel = ch-file1
a1.sinks.sink-avro1.hostname = 127.0.0.1
a1.sinks.sink-avro1.port = 60001
a1.sinks.sink-avro1.threads = 150
# sink2
a1.sinks.sink-avro2.type = avro
a1.sinks.sink-avro2.channel = ch-file2
a1.sinks.sink-avro2.hostname = 127.0.0.1
a1.sinks.sink-avro2.port = 60002
a1.sinks.sink-avro2.threads = 150
flume agent3: prepares data for storage. Example configuration:
### Main ###
a1.sources = src-avro1 src-avro2
a1.channels = ch-file1 ch-file2
a1.sinks = sink-rfm sink-hdfs2

### Source ###
# avro source 1 for real-time stream
a1.sources.src-avro1.type = avro
a1.sources.src-avro1.channels = ch-file1
a1.sources.src-avro1.bind = 0.0.0.0
a1.sources.src-avro1.port = 60001
a1.sources.src-avro1.threads = 150
# avro interceptor 1
a1.sources.src-avro1.interceptors = i1-1 i1-2
a1.sources.src-avro1.interceptors.i1-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-1.preserveExisting = true
a1.sources.src-avro1.interceptors.i1-1.hostHeader = clct-host
a1.sources.src-avro1.interceptors.i1-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro1.interceptors.i1-2.preserveExisting = true
# avro source 2 from spooldir
a1.sources.src-avro2.type = avro
a1.sources.src-avro2.channels = ch-file2
a1.sources.src-avro2.bind = 0.0.0.0
a1.sources.src-avro2.port = 60002
a1.sources.src-avro2.threads = 150
# avro interceptor 2
a1.sources.src-avro2.interceptors = i2-1 i2-2
a1.sources.src-avro2.interceptors.i2-1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-1.preserveExisting = true
a1.sources.src-avro2.interceptors.i2-1.hostHeader = clct-host
a1.sources.src-avro2.interceptors.i2-2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sources.src-avro2.interceptors.i2-2.preserveExisting = true

### Channel ###
# file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data
# file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

### Sink ###
# improved rolling file sink1
a1.sinks.sink-rfm.type = com.jfz.cdp.flume.sinks.ImprovedRollingFileSink
a1.sinks.sink-rfm.channel = ch-file1
a1.sinks.sink-rfm.sink.directory = ../data/logs/%Y-%m-%d
a1.sinks.sink-rfm.sink.fileName = %H-%M-%S
a1.sinks.sink-rfm.sink.rollInterval = 3600
a1.sinks.sink-rfm.sink.useLocalTime = false
# sink2 to hdfs
a1.sinks.sink-hdfs2.type = hdfs
a1.sinks.sink-hdfs2.channel = ch-file2
a1.sinks.sink-hdfs2.hdfs.path = /user/dadeng/flume_logs/%{category}/dt=%Y-%m-%d
a1.sinks.sink-hdfs2.hdfs.filePrefix = %{clct-host}_%{basename}
a1.sinks.sink-hdfs2.hdfs.fileType = DataStream
a1.sinks.sink-hdfs2.hdfs.rollSize = 102400000
a1.sinks.sink-hdfs2.hdfs.rollCount = 500000
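The hdfs.path and hdfs.filePrefix above mix strftime-style time escapes (%Y-%m-%d, resolved from the event's timestamp header) with header escapes such as %{category} and %{clct-host}. A rough sketch of how such a template expands; the function and its logic are illustrative, not Flume's actual implementation (which also handles useLocalTime, rounding, and more escapes):

```python
import re
import time

def expand_path(template, headers):
    """Expand Flume-style escapes: %{key} from event headers, and
    strftime escapes from the 'timestamp' header (epoch milliseconds)."""
    # %{key} -> header value
    out = re.sub(r"%\{(.+?)\}", lambda m: headers[m.group(1)], template)
    # Remaining %X escapes are strftime-style (using UTC here for brevity)
    ts = time.gmtime(int(headers["timestamp"]) / 1000)
    return time.strftime(out, ts)

headers = {"category": "bbs", "timestamp": "1461110400000"}  # 2016-04-20 UTC
print(expand_path("/user/dadeng/flume_logs/%{category}/dt=%Y-%m-%d", headers))
# -> /user/dadeng/flume_logs/bbs/dt=2016-04-20
```

This is also why the TimestampInterceptor is attached upstream: without a timestamp header, the time escapes in the HDFS sink path cannot be resolved.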
- Three-tier architecture, with a control tier in the middle that load-balances and avoids a single point of failure; suitable for reliable full-volume data delivery.
Example of agent1 sending its data to two control agents simultaneously:
## source settings omitted here
a1.channels = ch-file1 ch-file2
a1.sinks = sink-avro1-1 sink-avro1-2 sink-avro2-1 sink-avro2-2
# declare both sink groups in one assignment (a second "a1.sinkgroups ="
# line would overwrite the first)
a1.sinkgroups = sg-avro1 sg-avro2

# file channel 1 set
a1.channels.ch-file1.type = file
a1.channels.ch-file1.checkpointDir = ../data/fileChannels/ch-file1/checkpoint
a1.channels.ch-file1.dataDirs = ../data/fileChannels/ch-file1/data
# file channel 2 set
a1.channels.ch-file2.type = file
a1.channels.ch-file2.checkpointDir = ../data/fileChannels/ch-file2/checkpoint
a1.channels.ch-file2.dataDirs = ../data/fileChannels/ch-file2/data

# sink group 1 with load balancing
a1.sinkgroups.sg-avro1.sinks = sink-avro1-1 sink-avro1-2
a1.sinkgroups.sg-avro1.processor.type = load_balance
a1.sinkgroups.sg-avro1.processor.backoff = true
# sink1 to 10.1.2.51:41414
a1.sinks.sink-avro1-1.type = avro
a1.sinks.sink-avro1-1.channel = ch-file1
a1.sinks.sink-avro1-1.hostname = 10.1.2.51
a1.sinks.sink-avro1-1.port = 41414
# sink2 to 10.1.2.52:41414
a1.sinks.sink-avro1-2.type = avro
a1.sinks.sink-avro1-2.channel = ch-file1
a1.sinks.sink-avro1-2.hostname = 10.1.2.52
a1.sinks.sink-avro1-2.port = 41414

# sink group 2 with load balancing
a1.sinkgroups.sg-avro2.sinks = sink-avro2-1 sink-avro2-2
a1.sinkgroups.sg-avro2.processor.type = load_balance
a1.sinkgroups.sg-avro2.processor.backoff = true
# sink1 to 10.1.2.51:41415
a1.sinks.sink-avro2-1.type = avro
a1.sinks.sink-avro2-1.channel = ch-file2
a1.sinks.sink-avro2-1.hostname = 10.1.2.51
a1.sinks.sink-avro2-1.port = 41415
# sink2 to 10.1.2.52:41415
a1.sinks.sink-avro2-2.type = avro
a1.sinks.sink-avro2-2.channel = ch-file2
a1.sinks.sink-avro2-2.hostname = 10.1.2.52
a1.sinks.sink-avro2-2.port = 41415
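With processor.type = load_balance and backoff = true, the sink processor spreads batches across the group's sinks and temporarily blacklists a sink that fails. A toy Python sketch of that behavior; the class name, round-robin selection, and fixed backoff window are simplifications for illustration, not Flume's actual code:

```python
import itertools
import time

class LoadBalancingProcessor:
    """Toy model of Flume's load_balance sink processor with backoff."""

    def __init__(self, sinks, backoff_seconds=30.0):
        self.sinks = sinks                        # list of callables: sink(batch)
        self.backoff_seconds = backoff_seconds
        self.blacklisted_until = {s: 0.0 for s in sinks}
        self._rr = itertools.cycle(range(len(sinks)))

    def process(self, batch, now=None):
        now = time.monotonic() if now is None else now
        for _ in range(len(self.sinks)):
            sink = self.sinks[next(self._rr)]
            if self.blacklisted_until[sink] > now:
                continue  # still backing off after a recent failure
            try:
                sink(batch)
                return sink
            except IOError:
                self.blacklisted_until[sink] = now + self.backoff_seconds
        raise IOError("all sinks failed or backing off")

# Two fake avro sinks; the first one is down.
delivered = []
def sink_51(batch): raise IOError("10.1.2.51 unreachable")
def sink_52(batch): delivered.append(batch)

proc = LoadBalancingProcessor([sink_51, sink_52])
proc.process(["event-1"], now=0.0)   # sink_51 fails, falls over to sink_52
proc.process(["event-2"], now=1.0)   # sink_51 still blacklisted, skipped
print(delivered)  # [['event-1'], ['event-2']]
```

Because the batch stays in the file channel until some sink accepts it, a control-agent outage costs retries, not data.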
Common Flume configuration issues
Source
Flume offers many source types; the most commonly used are "spooldir", "exec", and "avro".
spooldir: suitable for important log transfers, where the data has already been written to files before the transfer.
NOTE: spooldir has two pitfalls. 1. If an undecodable byte stream shows up during transfer, Flume stops serving, so it is best to add the setting "a1.sources.src-cdir.decodeErrorPolicy = IGNORE". 2. Files placed in the spoolDir must not be modified afterwards; if you cp a large file into the spoolDir, Flume may start reading it while the copy is still in progress, detect that it is changing, and stop serving. To work around this we developed a CustomSpoolDirectorySource that temporarily skips any file modified within the time window given by the "skipReadFileModifyTimeLessThanMillis" setting.
In addition, spooldir can achieve near-real-time transfer by splitting the log into new files at short intervals.
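The second pitfall can also be avoided on the producer side: stage the file elsewhere on the same filesystem, then move it into the spoolDir with a single rename, which is atomic on POSIX, so Flume never sees a half-written file. A minimal sketch (the sibling staging-directory convention is illustrative):

```python
import os
import tempfile

def deliver_to_spooldir(spool_dir, name, data):
    """Stage the file next to the spoolDir (same filesystem), then move it
    in with one atomic rename so Flume never reads a half-written file."""
    work_dir = spool_dir + ".tmp"  # illustrative staging dir, same filesystem
    os.makedirs(work_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=work_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)              # the slow writing happens outside spoolDir
    os.rename(tmp_path, os.path.join(spool_dir, name))  # atomic on POSIX

# Demo
spool = tempfile.mkdtemp()
deliver_to_spooldir(spool, "info.log.1", b"line1\nline2\n")
print(sorted(os.listdir(spool)))  # ['info.log.1']
```

mv on the command line gives the same guarantee as os.rename, provided source and destination are on the same filesystem; across filesystems it degrades to a copy and the race returns.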
exec: mainly tail -F xxx.log, to pick up changes in real time.
avro: receives data sent to a specified host and port, and is mainly used to relay data between agents. It can also be integrated with log4j so that log data streams to Flume in real time over Avro.
Configuring log4j to stream logs to Flume automatically
Maven: add a dependency:
<dependency>
    <groupId>org.apache.flume.flume-ng-clients</groupId>
    <artifactId>flume-ng-log4jappender</artifactId>
    <version>${flume.version}</version>
</dependency>
Add the following to the log4j configuration:
log4j.logger.flume=INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = 10.1.2.50
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%m%n