API

Message send flow

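Kafka's producer sends messages asynchronously. Two threads are involved, the main thread and the Sender thread, along with one structure shared between them, the RecordAccumulator. The main thread appends messages to the RecordAccumulator, and the Sender thread continuously pulls messages from it and sends them to the Kafka broker.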

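Related parameters:
batch.size: the sender sends a batch once batch.size bytes of data have accumulated.
linger.ms: if a batch has not yet reached batch.size, the sender sends it anyway after waiting linger.ms.
Note that before reaching the RecordAccumulator each message passes through the interceptors, the serializer, and the partitioner, in that order.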
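
The interplay of the two parameters can be sketched in plain Java. This is only a simplified model of the "send when full or when the linger time has elapsed" rule, not Kafka's actual RecordAccumulator; the class and method names are hypothetical.

```java
// Simplified model of the batch.size / linger.ms rule described above.
// NOT Kafka's RecordAccumulator; all names here are hypothetical.
final class BatchGateSketch {

    /** Returns true when a non-empty batch would be considered ready to send. */
    static boolean readyToSend(int batchBytes, long waitedMs, int batchSize, long lingerMs) {
        boolean full = batchBytes >= batchSize;   // batch.size reached
        boolean expired = waitedMs >= lingerMs;   // linger.ms elapsed
        return batchBytes > 0 && (full || expired);
    }

    public static void main(String[] args) {
        int batchSize = 16384;
        long lingerMs = 1;
        System.out.println("full batch, no wait:  " + readyToSend(16384, 0, batchSize, lingerMs));
        System.out.println("small batch, no wait: " + readyToSend(100, 0, batchSize, lingerMs));
        System.out.println("small batch, waited:  " + readyToSend(100, 1, batchSize, lingerMs));
    }
}
```

With linger.ms set to a small value, as in the producers in this article, a partially filled batch goes out almost immediately; a larger value trades latency for fuller batches.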

Producer API

Create a new Maven project and add the dependency:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.0</version>
</dependency>

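Classes you will need:
KafkaProducer: the producer object used to send data
ProducerConfig: constants for the configuration parameters
ProducerRecord: every message must be wrapped in a ProducerRecord object

A simple asynchronous producer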

package com.sowhat.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class MyProducer
{

	public static void main(String[] args) throws InterruptedException
	{

		// 1. Create the configuration for the Kafka producer
		Properties properties = new Properties();

		// 2. Add the configuration parameters
		// Kafka cluster to connect to: host:port of one or more brokers (replace IP with a real address)
		properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "IP:9092");

		// Serializer classes for the key and the value
		properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
		properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

		// Acknowledgement level; "all" is equivalent to -1
		properties.put("acks", "all");

		// Number of retries
		properties.put("retries", 1);

		// Batch size in bytes
		properties.put("batch.size", 16384);

		// Linger time in milliseconds
		properties.put("linger.ms", 1);

		// 3. Create the Kafka producer
		KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

		// 4. Send the data
		for (int i = 0; i < 10; i++)
		{
			producer.send(new ProducerRecord<>("sowhat", "atguigu" + i));
		}

//        Thread.sleep(200);

		// 5. Close the producer. close() flushes the buffered messages to Kafka; if the program
		//    exits without closing (or waiting long enough), the messages may never be sent
		producer.close();
	}
}

A consumer subscribed to the topic then receives these messages.

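API with a callback

The callback is invoked asynchronously when the producer receives the ack. It takes two parameters, RecordMetadata and Exception: if the Exception is null the send succeeded, otherwise it failed.
Note: failed sends are retried automatically (up to the configured retries), so there is no need to retry manually inside the callback.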

package com.sowhat.producer;

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class MyCallBackProducer {

    public static void main(String[] args) throws InterruptedException {

        // 1. Create the configuration for the Kafka producer
        Properties properties = new Properties();

        // 2. Add the configuration parameters
        // Kafka cluster to connect to
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "IP:9092");

        // Serializer classes for the key and the value
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        // Acknowledgement level
        properties.put("acks", "all");

        // Number of retries
        properties.put("retries", 1);

        // Batch size in bytes
        properties.put("batch.size", 16384);

        // Linger time in milliseconds
        properties.put("linger.ms", 1);

        // 3. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // 4. Send the data (in IntelliJ IDEA, Alt + Enter toggles between an anonymous class and a lambda)
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<>("sowhat", "atguigu" + i), (metadata, exception) -> {
                if (exception == null) {
                    // Send succeeded: print the record metadata
                    System.out.println("Topic:" + metadata.topic() +
                            ",Partition:" + metadata.partition() +
                            ",Offset:" + metadata.offset());
                } else {
                    // Send failed after the automatic retries
                    exception.printStackTrace();
                }
            });
        }
        // 5. Close the producer
        producer.close();
    }
}

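Synchronous send API

Synchronous sending means that after a message is sent, the current thread blocks until the ack is returned. Because send returns a Future, calling get on that Future is all it takes to get synchronous behavior.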

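        // 4. Send the data (in IntelliJ IDEA, Alt + Enter toggles between an anonymous class and a lambda)
        for (int i = 0; i < 10; i++) {
            Future<RecordMetadata> sowhat = producer.send(new ProducerRecord<>("sowhat", "atguigu" + i), (metadata, exception) -> {
                if (exception == null)
                {
                    // Print the record metadata
                    System.out.println("Topic:" + metadata.topic() +
                            ",Partition:" + metadata.partition() +
                            ",Offset:" + metadata.offset());
                }
            });
            // The only difference from the asynchronous version: get() blocks until the ack arrives.
            // It throws InterruptedException and ExecutionException, so main must declare or handle both.
            sowhat.get();
        }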
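
Why calling get() turns an asynchronous send into a synchronous one can be shown without Kafka at all. The sketch below uses only java.util.concurrent: a single-thread executor stands in for the Sender thread and an Integer stands in for RecordMetadata. Both are stand-ins chosen for illustration, not Kafka APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java illustration of "Future.get() blocks the caller"; no Kafka classes involved.
final class BlockingSendSketch {

    /** Submits n fake "sends", waiting on each Future before submitting the next one. */
    static List<String> sendSynchronously(int n) {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        ExecutorService sender = Executors.newSingleThreadExecutor(); // stand-in for the Sender thread
        try {
            for (int i = 0; i < n; i++) {
                final int id = i;
                events.add("send-" + id);
                Future<Integer> ack = sender.submit(() -> {
                    events.add("ack-" + id);
                    return id; // stand-in for RecordMetadata
                });
                ack.get(); // the calling thread waits here until the "ack" is available
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            sender.shutdown();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        // Every send is followed by its own ack before the next send starts.
        System.out.println(sendSynchronously(3));
    }
}
```

Because each get() returns only after the previous task has finished, sends and acks strictly alternate; without the get() call the loop would queue all sends without waiting.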

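Consumer API

On the consumer side, the reliability of the data itself is easy to guarantee: Kafka persists the data, so there is no need to worry about it being lost.
A consumer can, however, fail while consuming (power loss, crash, and so on), and after recovering it has to resume from where it left off. It therefore needs to keep track of the offset it has consumed up to, which makes offset maintenance something every consumer must deal with.

Classes you will need:
KafkaConsumer: the consumer object used to consume data
ConsumerConfig: constants for the configuration parameters
ConsumerRecord: every message is delivered wrapped in a ConsumerRecord object

Automatic offset commit

So that we can focus on our own business logic, Kafka can commit offsets automatically.
Related parameters:
enable.auto.commit: whether automatic offset commit is enabled
auto.commit.interval.ms: the interval between automatic offset commits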

package com.sowhat.consumer;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Collections;
import java.util.Properties;

public class MyConsumer {

    public static void main(String[] args) {

        // 1. Create the consumer configuration
        Properties properties = new Properties();

        // 2. Add the configuration parameters
        // Kafka cluster to connect to
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.100.34.111:9092");

        // Deserializer classes for the key and the value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");

        // Where to start when the group has no committed offset:
        // "earliest" (from the beginning) or "latest" (only new messages)
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        // Consumer group name
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "banzhang");

        // Enable automatic offset commit (the key point of this example)
        properties.put("enable.auto.commit", "true");
        properties.put("auto.commit.interval.ms", "1000");

        // 3. Create the consumer client
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

        // 4. Subscribe to the topic
        consumer.subscribe(Collections.singletonList("sowhat"));

        while (true) {

            // 5. Fetch data
            ConsumerRecords<String, String> consumerRecords = consumer.poll(100);

            // 6. Iterate over the records and print them
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println("Topic:" + consumerRecord.topic() +
                        ",Partition:" + consumerRecord.partition() +
                        ",Key:" + consumerRecord.key() +
                        ",Value:" + consumerRecord.value());
            }
        }
    }
}

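The consumer fetches and consumes data from all partitions of the topic.

Manual offset commit

Automatic commit is simple and convenient, but because it is time based, developers have little control over when an offset is actually committed. Kafka therefore also provides an API for committing offsets manually.
There are two ways to commit manually: commitSync (synchronous) and commitAsync (asynchronous). Both commit the highest offset of the batch returned by the current poll. The difference is that commitSync blocks the current thread until the commit succeeds and retries automatically on failure (it can still fail for reasons outside its control), whereas commitAsync has no retry mechanism, so a commit may fail.

Synchronous offset commit
Because synchronous commit retries on failure, it is the more reliable option. An example: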

package com.sowhat.kafka.consumer;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Arrays;
import java.util.Properties;

public class CustomComsumer {

    public static void main(String[] args) {

        Properties props = new Properties();

        // Kafka cluster
        props.put("bootstrap.servers", "IP:9092");

        // Consumer group; consumers with the same group.id belong to the same group
        props.put("group.id", "test");

        // Disable automatic offset commit
        props.put("enable.auto.commit", "false");

        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Subscribe to the topic
        consumer.subscribe(Arrays.asList("first"));

        while (true) {

            // Pull data
            ConsumerRecords<String, String> records = consumer.poll(100);

            for (ConsumerRecord<String, String> record : records) {

                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());

            }

            // Synchronous commit: the current thread blocks until the offset commit succeeds
            consumer.commitSync();
        }
    }
}

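Asynchronous offset commit
Synchronous commit is more reliable, but because it blocks the current thread until the commit succeeds, it significantly reduces throughput. In many cases asynchronous commit is therefore preferred.
An example of asynchronous offset commit: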

package com.sowhat.kafka.consumer;

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;


public class CustomConsumer {

    public static void main(String[] args) {

        Properties props = new Properties();

        // Kafka cluster
        props.put("bootstrap.servers", "IP:9092");

        // Consumer group; consumers with the same group.id belong to the same group
        props.put("group.id", "test");

        // Disable automatic offset commit
        props.put("enable.auto.commit", "false");

        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Subscribe to the topic
        consumer.subscribe(Arrays.asList("first"));

        while (true) {
            // Pull data
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }

            // Asynchronous commit; the callback only reports failures, it does not retry
            consumer.commitAsync(new OffsetCommitCallback() {
                @Override
                public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
                    if (exception != null) {
                        System.err.println("Commit failed for " + offsets);
                    }
                }
            });
        }
    }
}

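Missed and duplicate consumption
Whether offsets are committed synchronously or asynchronously, messages can still be missed or consumed twice. Committing the offset before processing can lose messages: a crash right after the commit skips the ones not yet processed. Processing before committing can duplicate them: if the commit fails, or the consumer crashes before it, the same messages are read again.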
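
The two failure modes can be reproduced with a small simulation. It is a toy model under stated assumptions: messages are the integers 0..total-1, the consumer handles one message at a time, and it crashes exactly once, between the two steps, while handling the message at crashAt. The class name and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of "commit then process" versus "process then commit". No Kafka involved.
final class CommitOrderSketch {

    /**
     * Consumes messages 0..total-1, crashing once between the two steps for the
     * message at crashAt and then restarting from the committed offset.
     * Returns the messages that were actually processed, in order.
     */
    static List<Integer> run(int total, int crashAt, boolean commitFirst) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;      // next offset to read after a restart
        boolean crashed = false;
        int pos = committed;
        while (pos < total) {
            boolean crashNow = !crashed && pos == crashAt;
            if (commitFirst) {
                committed = pos + 1;                 // commit before processing
                if (crashNow) { crashed = true; pos = committed; continue; }
                processed.add(pos);
            } else {
                processed.add(pos);                  // process before committing
                if (crashNow) { crashed = true; pos = committed; continue; }
                committed = pos + 1;
            }
            pos++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("commit first:  " + run(4, 2, true));   // message 2 is never processed
        System.out.println("process first: " + run(4, 2, false));  // message 2 is processed twice
    }
}
```

Neither ordering gives exactly-once behavior on its own; that requires the processing result and the offset to be stored atomically, which is the motivation for custom offset storage.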

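Consumer group test

Keep using the simple asynchronous producer and start two consumers with the same group.id on the same topic. The partitions are divided between the members of the group according to the Range strategy.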
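
How the Range strategy splits the partitions of one topic can be sketched as follows. This is a simplified re-implementation of the arithmetic only (consumers are identified by their index in sorted order); it is not Kafka's RangeAssignor class, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Range-style assignment for a single topic: contiguous blocks of partitions,
// with the first (numPartitions % numConsumers) consumers getting one extra.
final class RangeAssignSketch {

    /** Returns, for each consumer index, the partition numbers it would own. */
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        int perConsumer = numPartitions / numConsumers;
        int withExtra = numPartitions % numConsumers;
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < numConsumers; i++) {
            int start = perConsumer * i + Math.min(i, withExtra);
            int length = perConsumer + (i < withExtra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int p = start; p < start + length; p++) {
                owned.add(p);
            }
            result.add(owned);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three partitions, two consumers: the first consumer gets two partitions.
        System.out.println(assign(3, 2));
    }
}
```

With three partitions and two consumers, the first consumer therefore owns partitions 0 and 1 and the second owns partition 2.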

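Custom offset storage

Before Kafka 0.9, offsets were stored in ZooKeeper; from 0.9 on, they are stored by default in an internal Kafka topic (__consumer_offsets). Kafka also lets you store offsets yourself. Maintaining offsets is fairly tedious, because consumer rebalances have to be taken into account.
A rebalance, that is, a reassignment of partitions, is triggered when a new consumer joins the group, an existing consumer leaves it, or the partitions of a subscribed topic change.
After a rebalance, the partitions each consumer owns change. Each consumer must therefore first find out which partitions it has been assigned, then seek to the most recently committed offset of each one and continue from there.
Custom offset storage relies on ConsumerRebalanceListener. In the example below, the methods that commit and fetch offsets must be implemented for whichever storage system you choose.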

package com.atguigu.kafka.consumer;

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.util.*;

public class CustomConsumer {

    private static Map<TopicPartition, Long> currentOffset = new HashMap<>();

    public static void main(String[] args) {

        // Create the configuration
        Properties props = new Properties();

        // Kafka cluster
        props.put("bootstrap.servers", "hadoop102:9092");

        // Consumer group; consumers with the same group.id belong to the same group
        props.put("group.id", "test");

        // Disable automatic offset commit
        props.put("enable.auto.commit", "false");

        // Deserializer classes for the key and the value
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Create the consumer
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Subscribe to the topic and register a rebalance listener
        consumer.subscribe(Arrays.asList("first"), new ConsumerRebalanceListener() {

            // Called before a rebalance: save the offsets of the partitions about to be revoked
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                commitOffset(currentOffset);
            }

            // Called after a rebalance: resume each newly assigned partition from the stored offset
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                currentOffset.clear();
                for (TopicPartition partition : partitions) {
                    // Seek to the most recently committed position and continue from there
                    consumer.seek(partition, getOffset(partition));
                }
            }
        });

        while (true) {
            // Pull data
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                // Remember the last processed offset of each partition
                currentOffset.put(new TopicPartition(record.topic(), record.partition()), record.offset());
            }
            // Store the offsets in the custom storage system
            commitOffset(currentOffset);
        }
        }
    }

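    // Fetch the latest stored offset of a partition. Custom offset storage is an advanced
    // Kafka topic; the offsets could be kept in MySQL, for example. Placeholder implementation.
    private static long getOffset(TopicPartition partition) {
        return 0;
    }

    // Store the offsets of all partitions owned by this consumer. Placeholder implementation.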
    private static void commitOffset(Map<TopicPartition, Long> currentOffset) {

    }
}
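
One possible shape for the two placeholder methods is an offset store keyed by group, topic and partition. The sketch below keeps everything in a HashMap purely for illustration; a real implementation would write to a durable store such as MySQL, ideally in the same transaction as the processing results. The class and method names are hypothetical, and the choice to store "last processed + 1" is an assumption made so that the stored value can be passed straight to seek without re-reading the last message.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for an external offset store (hypothetical; not durable).
final class InMemoryOffsetStoreSketch {

    private final Map<String, Long> nextOffsets = new HashMap<>();

    private static String key(String group, String topic, int partition) {
        return group + "|" + topic + "|" + partition;
    }

    /** Records that everything up to and including lastProcessed has been handled. */
    void commit(String group, String topic, int partition, long lastProcessed) {
        nextOffsets.put(key(group, topic, partition), lastProcessed + 1);
    }

    /** Position to seek to after a rebalance; 0 when nothing has been stored yet. */
    long nextOffset(String group, String topic, int partition) {
        return nextOffsets.getOrDefault(key(group, topic, partition), 0L);
    }

    public static void main(String[] args) {
        InMemoryOffsetStoreSketch store = new InMemoryOffsetStoreSketch();
        store.commit("test", "first", 0, 41L);
        System.out.println(store.nextOffset("test", "first", 0)); // resume after offset 41
        System.out.println(store.nextOffset("test", "first", 1)); // nothing stored yet
    }
}
```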

Custom partitioner

The producer can use a custom partitioner to decide which partition each message is written to.

package com.sowhat.producer;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

public class MyPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {

        System.out.println(cluster.availablePartitionsForTopic(topic).get(0));
        System.out.println(cluster.availablePartitionsForTopic(topic).get(1));
        System.out.println(cluster.availablePartitionsForTopic(topic).get(2));

        return 0;
    }

    @Override
    public void close() {

    }

    @Override
    public void configure(Map<String, ?> configs) {

    }
}

PartitionProducer

package com.sowhat.producer;

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class PartitionProducer {

    public static void main(String[] args) throws InterruptedException {

        // 1. Create the configuration for the Kafka producer
        Properties properties = new Properties();

        // 2. Add the configuration parameters
        // Kafka cluster to connect to
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");

        // Serializer classes for the key and the value
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        // Acknowledgement level
        properties.put("acks", "all");

        // Number of retries
        properties.put("retries", 1);

        // Batch size in bytes
        properties.put("batch.size", 16384);

        // Linger time in milliseconds
        properties.put("linger.ms", 1);

        // Register the custom partitioner; the fully qualified name must match the package of MyPartitioner
        properties.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, "com.sowhat.producer.MyPartitioner");

        // 3. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // 4. Send the data
        for (int i = 0; i < 1; i++) {
            producer.send(new ProducerRecord<>("test", "atguigu" + i), new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception == null) {

                        // Print the record metadata
                        System.out.println("Topic:" + metadata.topic() +
                                ",Partition:" + metadata.partition() +
                                ",Offset:" + metadata.offset());
                    }
                }
            });
        }

        // 5. Close the producer
        producer.close();
    }
}

The core idea is that the producer routes each message to a partition according to our own logic, while the consumer stays unchanged.
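
A typical piece of "our own logic" is to route by key, so that all messages with the same key land in the same partition. The sketch below shows only that arithmetic in plain Java. It uses Arrays.hashCode as a stand-in hash, which is an assumption made for illustration and not the hash Kafka's default partitioner uses; the class name and the null-key rule are likewise hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Key-hash routing, the kind of rule a custom partition(...) method might apply.
final class KeyHashPartitionSketch {

    /** Maps a key to a partition in [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            return 0; // arbitrary choice for this sketch
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Clear the sign bit so the remainder is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Inside a real Partitioner the partition count would come from the Cluster argument, for example from the size of the list returned by cluster.partitionsForTopic(topic).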

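Custom interceptors

How interceptors work

Producer interceptors were introduced in Kafka 0.10 and are mainly used to implement custom control logic on the client side.
For the producer, an interceptor gives the user a chance to customize a message before it is sent and before the producer's callback runs, for example by modifying the message. The producer also allows several interceptors to be applied in order to the same message, forming an interceptor chain. The interface to implement is org.apache.kafka.clients.producer.ProducerInterceptor, which defines the following methods:

configure(configs):

Called when the configuration is read and data is initialized.

onSend(ProducerRecord):

This method is invoked inside KafkaProducer.send, so it runs on the user's main thread. The producer guarantees that it is called before the message is serialized and its partition is computed. The user may do anything to the message here, but it is best not to change the topic or partition it belongs to, since that affects the computation of the target partition.

onAcknowledgement(RecordMetadata, Exception):

Called after the message has been sent from the RecordAccumulator to the Kafka broker and acknowledged, or when sending fails, and usually before the producer's callback is triggered. onAcknowledgement runs on the producer's I/O thread, so do not put heavy logic in it; doing so slows down message sending.

close():

Closes the interceptor; mainly used for resource cleanup.
As described above, an interceptor may run on several threads, so the implementation must ensure thread safety itself. In addition, if several interceptors are configured, the producer calls them in the configured order, and it only catches any exception an interceptor throws and writes it to the error log rather than propagating it upward. Keep this in mind when using interceptors.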
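
The chain behavior described above (call in order, log and continue on failure) can be modeled in a few lines of plain Java. In this sketch a UnaryOperator<String> stands in for an interceptor's onSend and a String stands in for the record; it is a model of the behavior, not Kafka's implementation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Model of an interceptor chain: each step sees the output of the last step that succeeded.
final class InterceptorChainSketch {

    /** Applies each step in order; a failing step is logged and skipped. */
    static String applyChain(List<UnaryOperator<String>> chain, String value) {
        String current = value;
        for (UnaryOperator<String> step : chain) {
            try {
                current = step.apply(current);
            } catch (RuntimeException e) {
                // Swallow the failure so the remaining steps and the send still happen.
                System.err.println("interceptor failed, continuing: " + e);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(v -> "ts--" + v);                                    // like a timestamp interceptor
        chain.add(v -> { throw new IllegalStateException("boom"); });  // a broken interceptor
        chain.add(v -> v.toUpperCase());                               // still runs
        System.out.println(applyChain(chain, "atguigu0"));
    }
}
```

The practical consequence is that a buggy interceptor fails silently from the caller's point of view, so the error log is the place to look when an interceptor appears to have no effect.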

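Interceptor example

Requirement:
Implement a simple chain of two interceptors. The first prepends a timestamp to the message value before the message is sent; the second counts successful and failed sends after the message is sent.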

TimeInterceptor

package com.sowhat.producer;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Map;

public class TimeInterceptor implements ProducerInterceptor<String, String> {

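    // Core interception method: prepend a timestamp to the value
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Keep the original partition, timestamp and key; only the value changes
        return new ProducerRecord<>(record.topic(), record.partition(), record.timestamp(), record.key(),
                System.currentTimeMillis() + "--" + record.value());
    }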

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {

    }

    @Override
    public void close() {

    }

    @Override
    public void configure(Map<String, ?> configs) {

    }
}

CounterInterceptor

package com.sowhat.producer;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Map;

public class CounterInterceptor implements ProducerInterceptor<String, String> {

    int success = 0;
    int error = 0;

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {

        if (exception == null) {
            success++;
        } else {
            error++;
        }
    }

    @Override
    public void close() {
        System.out.println("Successful sent: " + success);
        System.out.println("Failed sent: " + error);
    }

    @Override
    public void configure(Map<String, ?> configs) {

    }
}

InterceptorProducer

package com.sowhat.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.ArrayList;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class InterceptorProducer {

    public static void main(String[] args) throws InterruptedException, ExecutionException {

        // 1. Create the configuration for the Kafka producer
        Properties properties = new Properties();

        // 2. Add the configuration parameters
        // Kafka cluster to connect to
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");

        // Serializer classes for the key and the value
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        // Acknowledgement level
        properties.put("acks", "all");

        // Number of retries
        properties.put("retries", 1);

        // Batch size in bytes
        properties.put("batch.size", 16384);

        // Linger time in milliseconds
        properties.put("linger.ms", 1);

        // Build the interceptor chain; the order here is the order in which they are called,
        // and the class names must match the package of the interceptor classes
        ArrayList<String> interceptors = new ArrayList<>();

        interceptors.add("com.sowhat.producer.TimeInterceptor");
        interceptors.add("com.sowhat.producer.CounterInterceptor");

        properties.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, interceptors);

        // 3. Create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // 4. Send the data
        for (int i = 0; i < 10; i++) {
            Future<RecordMetadata> future = producer.send(new ProducerRecord<>("first", "atguigu" + i));

            future.get();
        }
        Thread.sleep(200);
        // 5. Close the producer; this also calls close() on every interceptor, which prints the counters
        producer.close();
    }
}

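The consumer console shows timestamped values like the following (this sample was captured with "," as the separator and "message" + i as the payload; the code above produces values of the form <timestamp>--atguigu0):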

1501904047034,message0
1501904047225,message1
1501904047230,message2
1501904047234,message3
1501904047236,message4
1501904047240,message5
1501904047243,message6
1501904047246,message7
1501904047249,message8
1501904047252,message9

The producer console shows:

Successful sent: 10
Failed sent: 0

Kafka Streams

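Kafka Streams is part of the Apache Kafka open source project. It is a powerful, easy-to-use library for building highly distributed, scalable, fault-tolerant applications on top of Kafka.
Kafka Streams features

Powerful

Highly scalable, elastic, fault tolerant

Lightweight

No dedicated cluster required
A library, not a framework

Fully integrated

100% compatible with Kafka 0.10.0
Easy to integrate into existing applications

Real time

Millisecond latency
Not micro-batching
Windows allow out-of-order data
Late-arriving data is allowed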

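Why Kafka Streams

There are already many stream processing systems; the best known and most widely used open source ones are Spark Streaming and Apache Storm. Apache Storm has been developed for many years, is widely used, offers record-level processing, and now also supports SQL on streams. Spark Streaming is built on Apache Spark, integrates easily with graph computation, SQL processing and so on, is powerful, and has a low barrier to entry for users already familiar with other Spark applications. In addition, mainstream Hadoop distributions such as Cloudera and Hortonworks bundle Apache Storm and Apache Spark, which makes deployment easier.

Given all these advantages of Apache Spark and Apache Storm, why is Kafka Streams still needed? The main reasons are as follows.
First, Spark and Storm are stream processing frameworks, whereas Kafka Streams is a stream processing library built on Kafka. A framework requires developers to write their logic in a prescribed way so that the framework can call it. It is hard for developers to see how the framework actually runs, which makes debugging costly and restricts how it can be used. Kafka Streams, as a library, gives developers concrete classes to call; how the application runs is mostly under the developer's control, which makes it easy to use and debug.

Second, although Cloudera and Hortonworks simplify the deployment of Storm and Spark, deploying these frameworks is still relatively complex. Kafka Streams, as a library, can be embedded in an application very easily and places essentially no requirements on how the application is packaged and deployed.

Third, virtually every stream processing system supports Kafka as a data source. Storm has a dedicated kafka-spout, for example, and Spark provides a dedicated spark-streaming-kafka module. Kafka is in effect the standard data source for mainstream stream processing systems. In other words, most streaming setups already have Kafka deployed, so the cost of adopting Kafka Streams is very low.

Fourth, with Storm or Spark Streaming, resources have to be reserved for the framework's own processes, such as Storm's supervisor and the node manager for Spark on YARN. Even within an application instance the framework itself takes up resources; Spark Streaming, for example, needs memory reserved for shuffle and storage. Kafka Streams, being a library, needs no such framework-level reservation.

Fifth, because Kafka itself persists data, Kafka Streams supports rolling deployment, rolling upgrades, and recomputation.
Sixth, thanks to the Kafka consumer rebalance mechanism, Kafka Streams can adjust its parallelism dynamically while running.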

demo

Requirement:
Process in real time any content that carries a prefix ending in ">>>". For example, the input "atguigu>>>ximenqing" is turned into "ximenqing".

package com.sowhat.kafka;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class LogProcessor implements Processor<byte[], byte[]> {

	private ProcessorContext context;

	public void init(ProcessorContext context) {
		this.context = context;
	}

	public void process(byte[] key, byte[] value) {
		String input = new String(value);

		// If the value contains ">>>", keep only what follows the marker
		if (input.contains(">>>")) {
			input = input.split(">>>")[1].trim();
			// Forward to the next node in the topology (the sink topic)
			context.forward("logProcessor".getBytes(), input.getBytes());
		}else{
			context.forward("logProcessor".getBytes(), input.getBytes());
		}
	}

	public void punctuate(long timestamp) {

	}
	public void close() {

	}
}
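
The string handling inside process can be isolated and checked on its own. The variant below is a sketch that uses indexOf instead of split, a deliberate substitution: split(">>>")[1] throws ArrayIndexOutOfBoundsException for an input that ends with the marker (such as "abc>>>"), and it keeps only the middle piece when the marker occurs more than once, whereas this version keeps everything after the first marker. The class and method names are hypothetical.

```java
// Standalone version of the ">>>" stripping rule; no Kafka Streams classes involved.
final class MarkerStripSketch {

    private static final String MARKER = ">>>";

    /** Returns the trimmed text after the first ">>>", or the input unchanged if there is none. */
    static String stripBeforeMarker(String input) {
        int idx = input.indexOf(MARKER);
        if (idx < 0) {
            return input;
        }
        return input.substring(idx + MARKER.length()).trim();
    }

    public static void main(String[] args) {
        for (String line : new String[] {"hello>>>world", "h>>>atguigu", "hahaha", "abc>>>"}) {
            System.out.println("[" + stripBeforeMarker(line) + "]");
        }
    }
}
```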

package com.sowhat.kafka;
import java.util.Properties;

import com.sowhat.kafka.LogProcessor;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorSupplier;
import org.apache.kafka.streams.processor.TopologyBuilder;

public class Application {

	public static void main(String[] args) {

		// Input topic
		String from = "first";
		// Output topic
		String to = "second";

		// Configuration
		Properties settings = new Properties();
		settings.put(StreamsConfig.APPLICATION_ID_CONFIG, "logFilter");
		settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");

		StreamsConfig config = new StreamsConfig(settings);

		// Build the topology: source -> processor -> sink
		TopologyBuilder builder = new TopologyBuilder();

		builder.addSource("SOURCE", from)
				.addProcessor("PROCESS", new ProcessorSupplier<byte[], byte[]>() {
					public Processor<byte[], byte[]> get() {
						// The actual processing logic
						return new LogProcessor();
					}
				}, "SOURCE")
				.addSink("SINK", to, "PROCESS");

		// Create and start the Kafka Streams instance
		KafkaStreams streams = new KafkaStreams(builder, config);
		streams.start();
	}
}

Running the program

Start a producer on hadoop104:

[atguigu@hadoop104 kafka]$ bin/kafka-console-producer.sh \
--broker-list hadoop102:9092 --topic first
>hello>>>world
>h>>>atguigu
>hahaha

Start a consumer on hadoop103:

[atguigu@hadoop103 kafka]$ bin/kafka-console-consumer.sh \
--zookeeper hadoop102:2181 --from-beginning --topic second
world
atguigu
hahaha

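Kafka configuration reference

Broker configuration

Property	Default	Description
broker.id		Required. Unique identifier of the broker.
log.dirs	/tmp/kafka-logs	Directories where Kafka stores its data. Several directories can be given, separated by commas; a newly created partition is placed in the directory that currently holds the fewest partitions.
port	9092	Port on which the broker accepts client connections.
zookeeper.connect	null	ZooKeeper connection string, in the form hostname1:port1,hostname2:port2,hostname3:port3. One or more hosts may be given; listing all of them is recommended for reliability. This setting can also specify a ZooKeeper path under which all data for this Kafka cluster is stored; to keep it separate from other applications, specifying such a path is recommended, in the form hostname1:port1,hostname2:port2,hostname3:port3/chroot/path. Consumers must use the same value.
message.max.bytes	1000000	Largest message the server will accept. Keep this consistent with the maximum message size configured on the consumer (fetch.message.max.bytes); otherwise a producer can publish messages that are too large for consumers to consume.
num.io.threads	8	Number of I/O threads the server uses to execute read and write requests; should be at least the number of disks on the server.
queued.max.requests	500	Size of the queue of requests waiting for the I/O threads; when the number of queued requests exceeds this, the network threads stop accepting new requests.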
socket.send.buffer.bytes	100 * 1024	The SO_SNDBUF buffer the server prefers for socket connections.
socket.receive.buffer.bytes	100 * 1024	The SO_RCVBUF buffer the server prefers for socket connections.
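socket.request.max.bytes	100 * 1024 * 1024	Maximum request size the server allows, to guard against running out of memory; should be smaller than the Java heap size.
num.partitions	1	Default number of partitions, used when a topic is created without specifying one; changing it to 5 is suggested.
log.segment.bytes	1024 * 1024 * 1024	Size of a segment file; when exceeded, a new segment is created automatically. Can be overridden per topic.
log.roll.{ms,hours}	24 * 7 hours	Time after which a new segment file is rolled. Can be overridden per topic.
log.retention.{ms,minutes,hours}	7 days	Retention period for Kafka log segments; segments older than this are deleted. Can be overridden per topic. Consider lowering it when the data volume is large.
log.retention.bytes	-1	Maximum size of each partition; when exceeded, old data in the partition is deleted. Note that this limit applies per partition, not per topic. Can be overridden at the log (topic) level.
log.retention.check.interval.ms	5 minutes	How often the retention policy is checked.
auto.create.topics.enable	true	Whether topics are created automatically. Setting it to false is recommended, to keep topic management strict and prevent producers from writing to a mistyped topic.
default.replication.factor	1	Default number of replicas; changing it to 2 is suggested.
replica.lag.time.max.ms	10000	If the leader receives no fetch request from a follower within this window, it removes the follower from the ISR (in-sync replicas).
replica.lag.max.messages	4000	If a replica falls behind the leader by this many messages, the leader removes it from the ISR.
replica.socket.timeout.ms	30 * 1000	Timeout for requests a replica sends to the leader.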
replica.socket.receive.buffer.bytes	64 * 1024	The socket receive buffer for network requests to the leader for replicating data.
replica.fetch.max.bytes	1024 * 1024	The number of bytes of messages to attempt to fetch for each partition in the fetch requests the replicas send to the leader.
replica.fetch.wait.max.ms	500	The maximum amount of time to wait for data to arrive on the leader in the fetch requests sent by the replicas to the leader.
num.replica.fetchers	1	Number of threads used to replicate messages from leaders. Increasing this value can increase the degree of I/O parallelism in the follower broker.
fetch.purgatory.purge.interval.requests	1000	The purge interval (in number of requests) of the fetch request purgatory.
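zookeeper.session.timeout.ms	6000	ZooKeeper session timeout. If the server sends no heartbeat to ZooKeeper within this time, ZooKeeper considers the node dead. Too low a value makes it easy for nodes to be marked dead by mistake; too high a value delays the detection of a dead node.
zookeeper.connection.timeout.ms	6000	Timeout for the client to connect to ZooKeeper.
zookeeper.sync.time.ms	2000	How far a ZK follower can be behind a ZK leader.
controlled.shutdown.enable	true	Enables controlled shutdown of the broker. When enabled, the broker moves all partition leaders it hosts to other brokers before shutting down. Enabling it is recommended for cluster stability.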
auto.leader.rebalance.enable	true	If this is enabled the controller will automatically try to balance leadership for partitions among the brokers by periodically returning leadership to the “preferred” replica for each partition if it is available.
leader.imbalance.per.broker.percentage	10	The percentage of leader imbalance allowed per broker. The controller will rebalance leadership if this ratio goes above the configured value per broker.
leader.imbalance.check.interval.seconds	300	The frequency with which to check for leader imbalance.
offset.metadata.max.bytes	4096	The maximum amount of metadata to allow clients to save with their offsets.
connections.max.idle.ms	600000	Idle connections timeout: the server socket processor threads close the connections that idle more than this.
num.recovery.threads.per.data.dir	1	The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
unclean.leader.election.enable	true	Indicates whether to enable replicas not in the ISR set to be elected as leader as a last resort, even though doing so may result in data loss.
delete.topic.enable	false	Enables topic deletion; setting it to true is suggested.
offsets.topic.num.partitions	50	The number of partitions for the offset commit topic. Since changing this after deployment is currently unsupported, we recommend using a higher setting for production (e.g., 100-200).
offsets.topic.retention.minutes	1440	Offsets that are older than this age will be marked for deletion. The actual purge will occur when the log cleaner compacts the offsets topic.
offsets.retention.check.interval.ms	600000	The frequency at which the offset manager checks for stale offsets.
offsets.topic.replication.factor	3	The replication factor for the offset commit topic. A higher setting (e.g., three or four) is recommended in order to ensure higher availability. If the offsets topic is created when fewer brokers than the replication factor then the offsets topic will be created with fewer replicas.
offsets.topic.segment.bytes	104857600	Segment size for the offsets topic. Since it uses a compacted topic, this should be kept relatively low in order to facilitate faster log compaction and loads.
offsets.load.buffer.size	5242880	An offset load occurs when a broker becomes the offset manager for a set of consumer groups (i.e., when it becomes a leader for an offsets topic partition). This setting corresponds to the batch size (in bytes) to use when reading from the offsets segments when loading offsets into the offset manager’s cache.
offsets.commit.required.acks	-1	The number of acknowledgements that are required before the offset commit can be accepted. This is similar to the producer’s acknowledgement setting. In general, the default should not be overridden.
offsets.commit.timeout.ms	5000	The offset commit will be delayed until this timeout or the required number of replicas have received the offset commit. This is similar to the producer request timeout.

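Producer configuration (legacy Scala producer)

Property	Default	Description
metadata.broker.list		List of brokers the producer queries at startup; it may be a subset of the brokers in the cluster. It is used only to fetch topic metadata; the producer then picks suitable brokers from that metadata and opens socket connections to them. Format: host1:port1,host2:port2.
request.required.acks	0	See section 3.2.
request.timeout.ms	10000	How long the broker waits for acks; if the wait exceeds this, an error is returned to the client.
producer.type	sync	Sync or async mode: async means asynchronous, sync means synchronous. In async mode the producer can push data in batches, which greatly improves broker performance; async is recommended.
serializer.class	kafka.serializer.DefaultEncoder	Serializer class; by default serializes to byte[].
key.serializer.class		Serializer class for the key; defaults to the same as above.
partitioner.class	kafka.producer.DefaultPartitioner	Partitioner class; by default hashes the key.
compression.codec	none	Compression codec for producer messages; valid values are "none", "gzip" and "snappy". For compression see section 4.1.
compressed.topics	null	Topics for which compression is enabled. If a codec is selected above, compression applies only to the topics listed here; if this is empty, it applies to all topics.
message.send.max.retries	3	Number of retries when a send fails. Network problems can lead to repeated retries.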
retry.backoff.ms	100	Before each retry, the producer refreshes the metadata of relevant topics to see if a new leader has been elected. Since leader election takes a bit of time, this property specifies the amount of time that the producer waits before refreshing the metadata.
topic.metadata.refresh.interval.ms	600 * 1000	The producer generally refreshes the topic metadata from brokers when there is a failure (partition missing, leader not available…). It will also poll regularly (default: every 10min so 600000ms). If you set this to a negative value, metadata will only get refreshed on failure. If you set this to zero, the metadata will get refreshed after each message sent (not recommended). Important note: the refresh happen only AFTER the message is sent, so if the producer never sends a message the metadata is never refreshed
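queue.buffering.max.ms	5000	In async mode, how long the producer buffers messages. With 1000, for example, it buffers one second of data and sends it in one go, which can greatly increase broker throughput at the cost of timeliness.
queue.buffering.max.messages	10000	In async mode, the maximum number of messages held in the producer's buffer queue; beyond this the producer either blocks or drops messages.
queue.enqueue.timeout.ms	-1	How long the producer blocks once the limit above is reached. With 0 the producer does not block when the queue is full and messages are dropped immediately; with -1 the producer blocks and no messages are dropped.
batch.num.messages	200	In async mode, the number of messages buffered per batch; the producer sends once this number is reached.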
send.buffer.bytes	100 * 1024	Socket write buffer size
client.id	“”	The client id is a user-specified string sent in each request to help trace calls. It should logically identify the application making the request.

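Consumer configuration (legacy ZooKeeper-based consumer)

Property	Default	Description
group.id		Group ID of the consumer; consumers with the same group.id belong to the same group.
zookeeper.connect		ZooKeeper connection string of the consumer; must match the broker configuration.
consumer.id	null	Generated automatically if not set.
socket.timeout.ms	30 * 1000	Socket timeout for network requests. The actual timeout is max.fetch.wait + socket.timeout.ms.
socket.receive.buffer.bytes	64 * 1024	The socket receive buffer for network requests.
fetch.message.max.bytes	1024 * 1024	Maximum message size allowed when fetching from a topic-partition. The consumer buffers this many bytes in memory for each partition, so this setting controls the consumer's memory use. It should be at least as large as the largest message the server allows; otherwise the producer can send messages larger than the consumer can accept.
num.consumer.fetchers	1	The number of fetcher threads used to fetch data.
auto.commit.enable	true	If true, the consumer periodically saves the offset it has consumed to ZooKeeper. After a failed consumer restarts, it uses this value as the position to resume from.
auto.commit.interval.ms	60 * 1000	How often the consumer commits offsets to ZooKeeper.
queued.max.message.chunks	2	Number of message chunks buffered for consumption; each chunk can hold up to fetch.message.max.bytes of data.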
fetch.min.bytes	1	The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request.
fetch.wait.max.ms	100	The maximum amount of time the server will block before answering the fetch request if there isn’t sufficient data to immediately satisfy fetch.min.bytes.
rebalance.backoff.ms	2000	Backoff time between retries during rebalance.
refresh.leader.backoff.ms	200	Backoff time to wait before trying to determine the leader of a partition that has just lost its leader.
auto.offset.reset	largest	What to do when there is no initial offset in ZooKeeper or if an offset is out of range. smallest: automatically reset the offset to the smallest offset; largest: automatically reset the offset to the largest offset; anything else: throw an exception to the consumer.
consumer.timeout.ms	-1	If no message is available for consumption within this time, the consumer throws an exception.
exclude.internal.topics	true	Whether messages from internal topics (such as offsets) should be exposed to the consumer.
zookeeper.session.timeout.ms	6000	ZooKeeper session timeout. If the consumer fails to heartbeat to ZooKeeper for this period of time it is considered dead and a rebalance will occur.
zookeeper.connection.timeout.ms	6000	The max time that the client waits while establishing a connection to zookeeper.
zookeeper.sync.time.ms	2000	How far a ZK follower can be behind a ZK leader
