Flume

Big Data: Flume

Environment
Flume NG 1.9.0
source: TAILDIR
channel: memory
sink: hdfs

1. Performance Testing

Flume was benchmarked against the following scenarios:
Scenario 1: one channel, LZO compression
Scenario 2: one channel, no compression
Scenario 3: two channels, LZO compression
Scenario 4: two channels, no compression

Configuration file


# Name the components on this agent
exec-hdfs-agent.sources = r1
exec-hdfs-agent.sinks = s1 s2
exec-hdfs-agent.channels = c1 c2

# Describe/configure the source
exec-hdfs-agent.sources.r1.selector.type = com.sjj.ParityChannelSelector 
exec-hdfs-agent.sources.r1.type = TAILDIR
exec-hdfs-agent.sources.r1.channels = c1 c2
exec-hdfs-agent.sources.r1.positionFile = /tmp/flume/logs/taildir_position.json
exec-hdfs-agent.sources.r1.filegroups = test
exec-hdfs-agent.sources.r1.filegroups.test= /tmp/flume/data/.*.test.log

# Describe the sink1
exec-hdfs-agent.sinks.s1.channel = c1
exec-hdfs-agent.sinks.s1.type = hdfs
exec-hdfs-agent.sinks.s1.hdfs.path = hdfs://namenode/test/%y-%m-%d
exec-hdfs-agent.sinks.s1.hdfs.fileType= DataStream
#exec-hdfs-agent.sinks.s1.hdfs.fileType= CompressedStream
#exec-hdfs-agent.sinks.s1.hdfs.codeC= com.hadoop.compression.lzo.LzopCodec
exec-hdfs-agent.sinks.s1.hdfs.writeFormat= Text
# hdfs.batchSize must not exceed the channel's transactionCapacity (10000 below)
exec-hdfs-agent.sinks.s1.hdfs.batchSize= 10000
exec-hdfs-agent.sinks.s1.hdfs.rollSize= 128000000
exec-hdfs-agent.sinks.s1.hdfs.rollCount= 0
exec-hdfs-agent.sinks.s1.hdfs.rollInterval=0
exec-hdfs-agent.sinks.s1.hdfs.minBlockReplicas=1
exec-hdfs-agent.sinks.s1.hdfs.callTimeout=20000
exec-hdfs-agent.sinks.s1.hdfs.useLocalTimeStamp=true
# the .lzo suffix only makes sense in the compressed scenarios; drop it when fileType = DataStream
exec-hdfs-agent.sinks.s1.hdfs.fileSuffix=.lzo
exec-hdfs-agent.sinks.s1.hdfs.filePrefix=c1

# Describe the sink2
exec-hdfs-agent.sinks.s2.channel = c2
exec-hdfs-agent.sinks.s2.type = hdfs 
exec-hdfs-agent.sinks.s2.hdfs.path = hdfs://namenode/test/%y-%m-%d
exec-hdfs-agent.sinks.s2.hdfs.fileType= DataStream
exec-hdfs-agent.sinks.s2.hdfs.writeFormat= Text
exec-hdfs-agent.sinks.s2.hdfs.batchSize= 10000
exec-hdfs-agent.sinks.s2.hdfs.rollSize= 128000000
exec-hdfs-agent.sinks.s2.hdfs.rollCount= 0
exec-hdfs-agent.sinks.s2.hdfs.rollInterval=0
exec-hdfs-agent.sinks.s2.hdfs.minBlockReplicas=1
exec-hdfs-agent.sinks.s2.hdfs.callTimeout=20000
exec-hdfs-agent.sinks.s2.hdfs.useLocalTimeStamp=true
exec-hdfs-agent.sinks.s2.hdfs.fileSuffix=.lzo
exec-hdfs-agent.sinks.s2.hdfs.filePrefix=c2

# Use a channel c1 which buffers events in memory
exec-hdfs-agent.channels.c1.type = memory
exec-hdfs-agent.channels.c1.capacity=50000
exec-hdfs-agent.channels.c1.transactionCapacity=10000

# Use a channel c2 which buffers events in memory
exec-hdfs-agent.channels.c2.type = memory
exec-hdfs-agent.channels.c2.capacity=50000
exec-hdfs-agent.channels.c2.transactionCapacity=10000
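Before launching the agent, a short script can parse the properties file and verify channel sizing invariants: `transactionCapacity` must not exceed `capacity`, and a sink's batch size must not exceed its channel's `transactionCapacity`. This is a sketch; the helper names are mine, not part of Flume:

```python
# Minimal sanity check for a Flume agent properties file (sketch).
# parse_props / channel_issues are hypothetical helpers, not Flume APIs.

def parse_props(text: str) -> dict[str, str]:
    """Parse java-properties-style 'key = value' lines, skipping comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def channel_issues(props: dict[str, str], agent: str, channel: str) -> list[str]:
    """Report a problem if transactionCapacity exceeds capacity for a channel."""
    cap = int(props[f"{agent}.channels.{channel}.capacity"])
    txn = int(props[f"{agent}.channels.{channel}.transactionCapacity"])
    if txn > cap:
        return [f"{channel}: transactionCapacity {txn} > capacity {cap}"]
    return []
```

The same kind of check can be extended to compare each sink's `hdfs.batchSize` against its channel's `transactionCapacity`, which is the mismatch that silently throttles or fails transactions.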

2. Writing to HDFS with Kerberos Enabled

(1) Add configuration
Add the following to the Flume configuration file, replacing [email protected] and /usr/local/flume/conf/test.keytab with your own principal and keytab path:

[email protected]
exec-hdfs-agent.sinks.s2.hdfs.kerberosKeytab=/usr/local/flume/conf/test.keytab

(2) Copy files

  1. Copy the keytab file into Flume's conf directory (see a separate post for creating the keytab and related steps).
  2. Copy core-site.xml and hdfs-site.xml from the Hadoop cluster into Flume's conf directory.

3. Writing to HDFS with LZO Compression

(1) Add configuration
Setting fileType = CompressedStream enables compression:

exec-hdfs-agent.sinks.s2.hdfs.fileType= CompressedStream
exec-hdfs-agent.sinks.s2.hdfs.codeC= com.hadoop.compression.lzo.LzopCodec

(2) Copy JAR packages

  1. If you built lzo and hadoop-lzo by hand, simply place the jars under plugins.d.
  2. If you installed hadoop-lzo as a Cloudera parcel, place both the jars and the links under native into plugins.d.
     For why this is, see the difference between hadoop-lzo.jar and hadoop-gpl-compression.jar: http://guoyunsky.iteye.com/blog/1289475

(3) Copy configuration files
Pull core-site.xml from the Hadoop cluster into flume/conf. The property that matters most is:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
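A quick way to confirm the copied core-site.xml actually lists the LZO codecs is to read the `io.compression.codecs` property back out of it. This is a sketch; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

def configured_codecs(core_site_xml: str) -> list[str]:
    """Extract the codec class names from core-site.xml's io.compression.codecs (sketch)."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "io.compression.codecs":
            return [c.strip() for c in prop.findtext("value").split(",")]
    return []
```

If `com.hadoop.compression.lzo.LzopCodec` is missing from the returned list, the CompressedStream sink above will fail to load the codec.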

4. HA

Because of the HDFS cluster's HA mechanism, when the NameNode state changes, Flume throws an exception on upload: "Operation category READ (WRITE) is not supported in state standby", since a standby NameNode does not serve requests. Flume then becomes unusable: you have to manually change the NameNode in the sink's hdfs.path and restart Flume. With many log-collecting servers this adds significant manual effort; worse, if upload monitoring is not in place, you may only notice the failure when the logs are actually needed.

The fix is simple: copy the cluster's hdfs-site.xml into Flume's conf directory. When the NameNode state switches, Flume still delivers logs to HDFS correctly, and the hdfs.path can also omit the host:

exec-hdfs-agent.sinks.s1.hdfs.path = /test/%y-%m-%d
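For reference, the entries in the copied hdfs-site.xml that make client-side failover work look roughly like the following. The nameservice ID `mycluster` and the hostnames are placeholders; use the values your cluster actually defines:

```xml
<!-- Hypothetical HA client configuration; substitute your own nameservice and hosts -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With these in place, the path can also be written against the nameservice (e.g. hdfs://mycluster/test/%y-%m-%d) rather than a specific NameNode host.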