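This article builds on the earlier post "Apache Flume: Introduction and Use Case 3" and replaces its logger sink with a Kafka sink (that is, it integrates Flume with Kafka to collect data in real time).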
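1. First, why use Flume + Kafka?
Take a real-time stream processing project as an example. The volume of collected data can swing between peaks and troughs; in an e-commerce project, the peak usually arrives during flash sales. If the data aggregated by Flume were fed directly into a distributed computing framework such as Storm at that moment, it could exceed the cluster's processing capacity. Placing Kafka between the two smooths out the peak. Kafka was designed for big-data scenarios and offers high throughput, so it can absorb bursts of peak data well.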
2. The overall flow is shown in the figure:
Start from the configuration file avro-memory-logger.conf:
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = hadoop000
avro-memory-logger.sources.avro-source.port = 44444
avro-memory-logger.sinks.logger-sink.type = logger
avro-memory-logger.channels.memory-channel.type = memory
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel
and change it to avro-memory-kafka.conf:
avro-memory-kafka.sources = avro-source
avro-memory-kafka.sinks = kafka-sink
avro-memory-kafka.channels = memory-channel
avro-memory-kafka.sources.avro-source.type = avro
avro-memory-kafka.sources.avro-source.bind = hadoop000
avro-memory-kafka.sources.avro-source.port = 44444
avro-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
avro-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
avro-memory-kafka.sinks.kafka-sink.topic = hello_topic
avro-memory-kafka.sinks.kafka-sink.batchSize = 5
avro-memory-kafka.sinks.kafka-sink.requiredAcks = 1
avro-memory-kafka.channels.memory-channel.type = memory
avro-memory-kafka.sources.avro-source.channels = memory-channel
avro-memory-kafka.sinks.kafka-sink.channel = memory-channel
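The property names above (brokerList, topic, batchSize, requiredAcks) are the older spelling. The Flume 1.9.0 user guide referenced at the end of this article lists them as deprecated in favor of kafka.bootstrap.servers, kafka.topic, flumeBatchSize and kafka.producer.acks. As a rough sketch, assuming a Flume version that supports the newer names, the same agent could be written out like this. The output path CONF_FILE is a hypothetical name chosen for illustration, not something from the original setup.

```shell
# Sketch only: writes an equivalent agent config using the newer Kafka sink
# property names. CONF_FILE is a hypothetical output path (assumption).
CONF_FILE="${CONF_FILE:-./avro-memory-kafka-new.conf}"

cat > "$CONF_FILE" <<'EOF'
avro-memory-kafka.sources = avro-source
avro-memory-kafka.sinks = kafka-sink
avro-memory-kafka.channels = memory-channel

avro-memory-kafka.sources.avro-source.type = avro
avro-memory-kafka.sources.avro-source.bind = hadoop000
avro-memory-kafka.sources.avro-source.port = 44444

avro-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
# Newer name for brokerList
avro-memory-kafka.sinks.kafka-sink.kafka.bootstrap.servers = hadoop000:9092
# Newer name for topic
avro-memory-kafka.sinks.kafka-sink.kafka.topic = hello_topic
# Newer name for batchSize
avro-memory-kafka.sinks.kafka-sink.flumeBatchSize = 5
# Newer name for requiredAcks
avro-memory-kafka.sinks.kafka-sink.kafka.producer.acks = 1

avro-memory-kafka.channels.memory-channel.type = memory

avro-memory-kafka.sources.avro-source.channels = memory-channel
avro-memory-kafka.sinks.kafka-sink.channel = memory-channel
EOF

echo "Wrote $CONF_FILE"
```

If you use this variant, pass the generated file to --conf-file when starting the agent; the agent name stays avro-memory-kafka.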
1. Start ZooKeeper
zkServer.sh start
2. Start Kafka
kafka-server-start.sh $KAFKA_HOME/config/server.properties
3. Start the Flume agent avro-memory-kafka
flume-ng agent \
--name avro-memory-kafka \
--conf conf \
--conf-file $FLUME_HOME/conf/avro-memory-kafka.conf \
-Dflume.root.logger=INFO,console
4. Start the Flume agent exec-memory-avro
flume-ng agent \
--name exec-memory-avro \
--conf conf \
--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \
-Dflume.root.logger=INFO,console
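The exec-memory-avro.conf file comes from the earlier article and is not reproduced here. For readers who do not have it at hand, the following is a guess at what a minimal version might look like: an exec source that tails data.log and an avro sink that forwards to the avro source on hadoop000:44444. The component names, the log path /home/hadoop/data/data.log and the output path EXEC_CONF_FILE are all assumptions made for illustration, so adjust them to match your own setup.

```shell
# Sketch only: a possible exec-memory-avro agent. Component names, the tailed
# log path and EXEC_CONF_FILE are assumptions, not taken from the original file.
EXEC_CONF_FILE="${EXEC_CONF_FILE:-./exec-memory-avro.conf}"

cat > "$EXEC_CONF_FILE" <<'EOF'
exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel

# Follow the log file and emit each new line as an event
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /home/hadoop/data/data.log

# Forward events to the avro source of the avro-memory-kafka agent
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = hadoop000
exec-memory-avro.sinks.avro-sink.port = 44444

exec-memory-avro.channels.memory-channel.type = memory

exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel
EOF

echo "Wrote $EXEC_CONF_FILE"
```

The avro sink's hostname and port must match the bind and port of the avro source in avro-memory-kafka.conf, which is also why the avro-memory-kafka agent is started first.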
Finally, start the consumer
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic hello_topic # hello_topic is the topic configured in `avro-memory-kafka.conf`
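The --zookeeper form of the console consumer belongs to older Kafka releases; newer releases dropped that option, and the consumer connects to the broker directly with --bootstrap-server. A small sketch of the equivalent command, assuming the broker address hadoop000:9092 from the sink configuration (the BROKER and TOPIC variables are introduced here only for readability):

```shell
# Assumed values, matching the broker address and topic used in this article.
BROKER="hadoop000:9092"
TOPIC="hello_topic"

if command -v kafka-console-consumer.sh >/dev/null 2>&1; then
  # --bootstrap-server points at the Kafka broker instead of ZooKeeper.
  # --from-beginning also replays messages written before the consumer started.
  kafka-console-consumer.sh --bootstrap-server "$BROKER" \
    --topic "$TOPIC" --from-beginning
else
  echo "kafka-console-consumer.sh not found on PATH; check \$KAFKA_HOME/bin" >&2
fi
```

Which form works depends on the Kafka version installed, so use whichever your distribution's script accepts.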
Test: append two lines to the log file to trigger Flume collection
[hadoop@hadoop000 data]$ echo hello hadoop >> data.log
[hadoop@hadoop000 data]$ echo hello spark >> data.log
The Kafka consumer then receives and prints the messages:
[hadoop@hadoop000 ~]$ kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic hello_topic
hello hadoop
hello spark
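Two lines are enough to show that the pipeline is connected. To push a few more events through it in one go, a loop like the one below can help. It is only a sketch: LOG_FILE is a hypothetical variable that should point at the same data.log the exec source is tailing, and the message text is made up for the test.

```shell
# LOG_FILE should be the file tailed by the exec source (assumed: data.log
# in the current directory, as in the test above).
LOG_FILE="${LOG_FILE:-./data.log}"

# Append ten numbered lines; each one becomes a Flume event.
for i in $(seq 1 10); do
  echo "hello flume $i" >> "$LOG_FILE"
done

# Show the last few lines that were written.
tail -n 3 "$LOG_FILE"
```

Each appended line should eventually show up in the consumer window as its own message.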
Reference: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#kafka-sink