This article builds on the earlier post "Apache Flume introduction and use case 3" and changes the logger sink from that setup into a Kafka sink, i.e., it integrates Flume with Kafka to complete real-time data collection.
1. First, why use Flume + Kafka?
Take a real-time stream processing project as an example: the volume of collected data has peaks and troughs. In an e-commerce project, the peak typically arrives during a flash sale. If the data aggregated by Flume were fed directly into a distributed computing framework such as Storm, the load could exceed the cluster's processing capacity. Putting Kafka in between smooths out those peaks: Kafka was designed for big data scenarios, offers high throughput, and can absorb bursts of peak traffic well.
2. The overall flow: the exec-memory-avro agent tails a log file with an exec source and forwards events through a memory channel to an avro sink; the avro-memory-kafka agent receives those events on an avro source and writes them through a memory channel to a Kafka sink, from which a Kafka consumer reads the data.
Take the original configuration file avro-memory-logger.conf:
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = hadoop000
avro-memory-logger.sources.avro-source.port = 44444
avro-memory-logger.sinks.logger-sink.type = logger
avro-memory-logger.channels.memory-channel.type = memory
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel
and modify it into avro-memory-kafka.conf:
avro-memory-kafka.sources = avro-source
avro-memory-kafka.sinks = kafka-sink
avro-memory-kafka.channels = memory-channel
avro-memory-kafka.sources.avro-source.type = avro
avro-memory-kafka.sources.avro-source.bind = hadoop000
avro-memory-kafka.sources.avro-source.port = 44444
avro-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
avro-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
avro-memory-kafka.sinks.kafka-sink.topic = hello_topic
avro-memory-kafka.sinks.kafka-sink.batchSize = 5
avro-memory-kafka.sinks.kafka-sink.requiredAcks = 1
avro-memory-kafka.channels.memory-channel.type = memory
avro-memory-kafka.sources.avro-source.channels = memory-channel
avro-memory-kafka.sinks.kafka-sink.channel = memory-channel
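A note on property names: brokerList, topic, batchSize, and requiredAcks are the legacy Kafka sink properties. Flume 1.9 (the version whose user guide is linked at the end) still accepts them for backward compatibility, but its documentation uses the newer names instead. An equivalent sink section with the current property names (same broker, topic, and values as above) would look like this:

```properties
avro-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
avro-memory-kafka.sinks.kafka-sink.kafka.bootstrap.servers = hadoop000:9092
avro-memory-kafka.sinks.kafka-sink.kafka.topic = hello_topic
avro-memory-kafka.sinks.kafka-sink.flumeBatchSize = 5
avro-memory-kafka.sinks.kafka-sink.kafka.producer.acks = 1
```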
1. Start ZooKeeper:
zkServer.sh start
2. Start Kafka (the topic hello_topic must already exist, or automatic topic creation must be enabled, which is the Kafka default):
kafka-server-start.sh $KAFKA_HOME/config/server.properties
3. Start the Flume agent avro-memory-kafka:
flume-ng agent \
--name avro-memory-kafka \
--conf conf \
--conf-file $FLUME_HOME/conf/avro-memory-kafka.conf \
-Dflume.root.logger=INFO,console
4. Start the Flume agent exec-memory-avro:
flume-ng agent \
--name exec-memory-avro \
--conf conf \
--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \
-Dflume.root.logger=INFO,console
Finally, start a consumer:
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic hello_topic # hello_topic here must match the topic configured in avro-memory-kafka.conf
(On newer Kafka versions, where the console consumer's --zookeeper option has been removed, use --bootstrap-server hadoop000:9092 instead.)
Test: append two lines to the log file to trigger Flume collection:
[hadoop@hadoop000 data]$ echo hello hadoop >> data.log
[hadoop@hadoop000 data]$ echo hello spark >> data.log
and verify that the Kafka consumer receives them:
[hadoop@hadoop000 ~]$ kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic hello_topic
hello hadoop
hello spark
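Two lines are enough to confirm end-to-end delivery, but since the sink batches 5 events at a time, a short loop that appends a burst of test lines is useful for observing batching. This is a sketch: the path data.log and the message format are assumptions, so adjust them to whatever file your exec source tails.

```shell
#!/bin/sh
# Append ten numbered test lines to the file tailed by the exec source.
# LOG is an assumed path; point it at the file your exec source actually tails.
LOG=data.log
for i in $(seq 1 10); do
  echo "test message $i" >> "$LOG"
done
# Show what was appended (the Kafka consumer should print the same lines).
tail -n 10 "$LOG"
```

With batchSize = 5, these ten events should reach the consumer as two batches of five.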
Reference: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#kafka-sink