Connecting Kafka to Flume to push data to both HDFS and Spark Streaming

Rather than running multiple Kafka consumers that each re-consume the same topic, I hang Flume off the back of Kafka and use a replicating channel selector to copy the data to two destinations: HDFS for storage and Spark Streaming for real-time analysis. Below is my Flume configuration file.

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Source configuration: Kafka source with a replicating channel selector
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.selector.type = replicating
a1.sources.r1.batchSize = 800
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = node03:9092,node04:9092,node05:9092
a1.sources.r1.kafka.topics = startup,event
a1.sources.r1.kafka.consumer.group.id = custom.g.id
# ----------------------------------------------------------------------------------
# Sink 1: HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://node02:9000/flume/%Y%m%d/%H
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.inUsePrefix=_
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.roundValue = 5
# Output file prefix
a1.sinks.k1.hdfs.filePrefix = logs
# Output file suffix
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.batchSize = 800
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Sink 2: Spark sink (disabled for now; a logger sink is used instead)
# a1.sinks.k2.type = org.apache.spark.streaming.flume.sink.SparkSink
# a1.sinks.k2.hostname = 192.168.64.1   # IP/hostname that the Spark Streaming side will poll
# a1.sinks.k2.port = 8888
# a1.sinks.k2.batchSize = 2000          # batch size
a1.sinks.k2.type = logger
# ----------------------------------------------------------------------------------
# Channel c1 feeds the HDFS sink, channel c2 feeds the Spark/logger sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000
#
a1.channels.c2.type = memory
a1.channels.c2.capacity = 2000
a1.channels.c2.transactionCapacity = 1000
# ----------------------------------------------------------------------------------
# Bind the source and sinks to the channels
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

For now, sink k2 just writes to a logger; when the real-time side is needed, un-comment the k2 SparkSink settings above to push the data into the real-time analysis system (a sketch of the Spark Streaming consumer follows the jar list below). One more note: to write to HDFS, Flume needs the relevant Hadoop jars in its lib directory. I haven't pinned down exact versions, but jars matching your Hadoop version should work:

commons-configuration-1.6.jar
commons-io-2.4.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
htrace-core-3.1.0-incubating.jar
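
For reference, this is roughly what the Spark Streaming side could look like once the k2 SparkSink is enabled. It is only a minimal sketch, assuming the spark-streaming-flume dependency is on the classpath and reusing the 192.168.64.1:8888 address from the commented-out k2 settings; the app name, master and batch interval are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flume-polling").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Pull events from the Flume SparkSink (host/port taken from the k2 config above)
    val stream = FlumeUtils.createPollingStream(ssc, "192.168.64.1", 8888)

    // Each record is a SparkFlumeEvent; the payload is the Avro event body
    stream.map(e => new String(e.event.getBody.array()))
          .count()
          .print()

    ssc.start()
    ssc.awaitTermination()
  }
}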

Start Flume:

flume-ng agent --conf conf/ --name a1 --conf-file jobs/kafka-hdfs_sparkstreaming.config -Dflume.root.logger=INFO,console

The output is partitioned into directories by year/month/day, with an hour-level subdirectory below that. How often files roll is set in the Flume config above; here I roll at roughly 128 MB or every 5 minutes, whichever comes first, to keep small files out of HDFS as much as possible.
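
Purely for illustration (these paths and timestamps are made up), the resulting layout on HDFS looks roughly like this, with an underscore-prefixed .tmp name while a file is still being written:

/flume/20190401/10/logs.1554084000001.log
/flume/20190401/10/logs.1554084300002.log
/flume/20190401/11/_logs.1554087600003.log.tmp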
