[Source Code Walkthrough] Custom Serializers and Partitioners for the Flink-Kafka Connector

Introduction

When Flink sinks data to Kafka, the producer object FlinkKafkaProducer is usually initialized with the default partitioner and serializer, in which case each parallel sink instance writes all of its data to one fixed partition of the target topic. For a topic with multiple partitions we generally need to define our own partitioner and serializer, so that we control the logic that routes records to the different partitions.
Component versions used in this post:
Flink: 1.10.0
Kafka: 2.3.0


Serializers

When a Kafka producer writes data to a Kafka cluster, the data objects must first be serialized so that they can be transmitted over the network. Beginners usually initialize the producer with the default serializer.
The default serializer does not transform the data in any way, and it does not generate a key. If we want to attach a key to each record, or perform some custom processing before the data is sent, we have to write our own serializer and pass it in when the producer is initialized.
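To make this concrete, here is a minimal sketch using the plain Kafka client (not Flink): the producer is configured with the stock StringSerializer, and whether a key is attached is decided per record when the ProducerRecord is built. The broker address, topic name, and key below are placeholders, not values from this post's setup.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PlainProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; adjust for your cluster.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // The stock serializers only turn Strings into bytes; they do not add a key of their own.
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: partition assignment falls back to the no-key strategy.
            producer.send(new ProducerRecord<>("test", "value-without-key"));
            // Explicit key: records sharing the key "user-1" hash to the same partition.
            producer.send(new ProducerRecord<>("test", "user-1", "value-with-key"));
        }
    }
}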

Partitioners

When a Kafka topic has multiple partitions, how do we know which partition a record will end up in? That is exactly what the partitioner decides.
Kafka mainly uses the following four partitioning strategies:
Strategy 1: a partition number is given explicitly, and the record is sent straight to that partition.
Strategy 2: no partition number is given but the record has a key; the key's hash decides the partition.
Strategy 3: neither a partition number nor a key is given; records are spread round-robin across the partitions.
Strategy 4: custom partitioning, where the user supplies the logic.
A partitioner is simply the code implementation of one of these strategies; a rough sketch of the key-hash strategy (Strategy 2) follows below.
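The sketch below only conveys the "hash the key, take the remainder against the partition count" idea behind Strategy 2. It is a simplification: Kafka's real DefaultPartitioner hashes the serialized key bytes with murmur2, not with Java's hashCode.

public class KeyHashPartitionSketch {

    // Simplified illustration of key-based partitioning.
    static int choosePartition(String key, int numPartitions) {
        // Mask the sign bit so the result stays non-negative, then take the remainder.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // Records with the same key always land in the same partition.
        System.out.println(choosePartition("user-1", 3));
        System.out.println(choosePartition("user-1", 3)); // same value as above
        System.out.println(choosePartition("user-2", 3)); // possibly a different partition
    }
}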





Kafka Serializers in Flink

Source Code Walkthrough

In earlier Flink versions, a custom Kafka serializer was written by implementing the KeyedSerializationSchema interface. Here is its source:

// Marked deprecated: this interface is no longer recommended
@Deprecated
@PublicEvolving
// The type parameter is the type of object the producer will transmit
public interface KeyedSerializationSchema<T> extends Serializable {

	// Derive the key from the incoming element and serialize it to a byte array.
	// If no key is generated, this method may return null.
	byte[] serializeKey(T element);

	// Derive the value from the incoming element and serialize it to a byte array.
	byte[] serializeValue(T element);

	// Pick the target topic for the incoming element.
	// May return null, because a topic is already specified when the producer is created.
	String getTargetTopic(T element);
}

The interface has three methods, yet every one of them takes exactly the same parameter, which makes for poor code reuse.
Current Flink versions therefore recommend implementing the KafkaSerializationSchema interface instead. Here is its source:

// The type parameter is the type of object the producer will transmit
@PublicEvolving
public interface KafkaSerializationSchema<T> extends Serializable {

	/**
	 * Serializes given element and returns it as a {@link ProducerRecord}.
	 *
	 * @param element element to be serialized
	 * @param timestamp timestamp (can be null)
	 * @return Kafka {@link ProducerRecord}
	 */
	ProducerRecord<byte[], byte[]> serialize(T element, @Nullable Long timestamp);
}

As you can see, this interface only requires implementing a single serialize method, which takes the incoming element and builds a ProducerRecord<byte[], byte[]> from it.
Let's look at the source of the ProducerRecord class:

package org.apache.kafka.clients.producer;

import java.util.Objects;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeaders;

public class ProducerRecord<K, V> {

    private final String topic;
    private final Integer partition;
    private final Headers headers;
    private final K key;
    private final V value;
    private final Long timestamp;

    public ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value, Iterable<Header> headers) {
        if (topic == null) {
            throw new IllegalArgumentException("Topic cannot be null.");
        } else if (timestamp != null && timestamp < 0L) {
            throw new IllegalArgumentException(String.format("Invalid timestamp: %d. Timestamp should always be non-negative or null.", timestamp));
        } else if (partition != null && partition < 0) {
            throw new IllegalArgumentException(String.format("Invalid partition: %d. Partition number should always be non-negative or null.", partition));
        } else {
            this.topic = topic;
            this.partition = partition;
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
            this.headers = new RecordHeaders(headers);
        }
    }

    public ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value) {
        this(topic, partition, timestamp, key, value, (Iterable)null);
    }

    public ProducerRecord(String topic, Integer partition, K key, V value, Iterable<Header> headers) {
        this(topic, partition, (Long)null, key, value, headers);
    }

    public ProducerRecord(String topic, Integer partition, K key, V value) {
        this(topic, partition, (Long)null, key, value, (Iterable)null);
    }

    public ProducerRecord(String topic, K key, V value) {
        this(topic, (Integer)null, (Long)null, key, value, (Iterable)null);
    }

    public ProducerRecord(String topic, V value) {
        this(topic, (Integer)null, (Long)null, (Object)null, value, (Iterable)null);
    }

    public String topic() {
        return this.topic;
    }

    public Headers headers() {
        return this.headers;
    }

    public K key() {
        return this.key;
    }

    public V value() {
        return this.value;
    }

    public Long timestamp() {
        return this.timestamp;
    }

    public Integer partition() {
        return this.partition;
    }

    public String toString() {
        String headers = this.headers == null ? "null" : this.headers.toString();
        String key = this.key == null ? "null" : this.key.toString();
        String value = this.value == null ? "null" : this.value.toString();
        String timestamp = this.timestamp == null ? "null" : this.timestamp.toString();
        return "ProducerRecord(topic=" + this.topic + ", partition=" + this.partition + ", headers=" + headers + ", key=" + key + ", value=" + value + ", timestamp=" + timestamp + ")";
    }

    public boolean equals(Object o) {
        if (this == o) {
            return true;
        } else if (!(o instanceof ProducerRecord)) {
            return false;
        } else {
            ProducerRecord<?, ?> that = (ProducerRecord)o;
            return Objects.equals(this.key, that.key) && Objects.equals(this.partition, that.partition) && Objects.equals(this.topic, that.topic) && Objects.equals(this.headers, that.headers) && Objects.equals(this.value, that.value) && Objects.equals(this.timestamp, that.timestamp);
        }
    }

    public int hashCode() {
        int result = this.topic != null ? this.topic.hashCode() : 0;
        result = 31 * result + (this.partition != null ? this.partition.hashCode() : 0);
        result = 31 * result + (this.headers != null ? this.headers.hashCode() : 0);
        result = 31 * result + (this.key != null ? this.key.hashCode() : 0);
        result = 31 * result + (this.value != null ? this.value.hashCode() : 0);
        result = 31 * result + (this.timestamp != null ? this.timestamp.hashCode() : 0);
        return result;
    }
}

As the source shows, a ProducerRecord is built from the following fields:

String topic;        // topic name; must not be null
Integer partition;   // the partition this record should be written to; may be null
Headers headers;     // Kafka record headers; may be null
K key;               // the record key; may be null
V value;             // the actual record payload; must not be null
Long timestamp;      // the timestamp assigned to the record by the producer; may be null




Among ProducerRecord's several overloaded constructors, the smallest one needs only the topic name and the value; every other field may be left null.

public ProducerRecord(String topic, V value) {
    this(topic, (Integer)null, (Long)null, (Object)null, value, (Iterable)null);
}

A Custom Serializer Example

With the above in mind, we can write a custom serializer by implementing KafkaSerializationSchema. Below is about the simplest one possible:

package lenrnflink;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

import javax.annotation.Nullable;
import java.util.Properties;

public class Test {

    // Custom serializer: implement KafkaSerializationSchema, with String as the input type.
    public static class MySerializationSchema implements KafkaSerializationSchema<String> {

        @Override
        public ProducerRecord<byte[], byte[]> serialize(String element, @Nullable Long timestamp) {
            // Return a record targeting topic "test", partition 0, with a null key
            // and the incoming element converted to a byte array as the value.
            return new ProducerRecord<>("test", 0, null, element.getBytes());
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "cdh1:9092,cdh2:9092,cdh3:9092,cdh4:9092,cdh5:9092");
        // Create the Flink-Kafka producer and pass in the custom serializer.
        FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<String>("test", new MySerializationSchema(), properties, FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
        DataStream<String> dataStream = environment.readTextFile("E:/test.txt", "GB2312");
        dataStream.addSink(kafkaSink);
        environment.execute("WordCount");
    }
}

This is only the simplest possible custom serializer. In your own implementation you can apply your business logic to the incoming element and decide its topic, partition, key, value and so on inside the serializer, as in the sketch below.
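As a hedged example of that idea, the sketch below derives a key from each record instead of leaving it null. The "userId|payload" record layout, the topic name, and the charset are assumptions made up for illustration; adapt them to your own data.

import java.nio.charset.StandardCharsets;

import javax.annotation.Nullable;

import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a serializer that also produces a key.
// Assumes records look like "userId|payload"; this layout is hypothetical.
public class KeyedSerializationSchemaSketch implements KafkaSerializationSchema<String> {

    private static final String TOPIC = "test"; // assumed topic name

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, @Nullable Long timestamp) {
        // Everything before the first '|' is treated as the key; records without
        // a '|' are sent with a null key.
        int idx = element.indexOf('|');
        byte[] key = idx > 0 ? element.substring(0, idx).getBytes(StandardCharsets.UTF_8) : null;
        byte[] value = element.getBytes(StandardCharsets.UTF_8);
        // Leaving the partition null lets the configured partitioning logic decide.
        return new ProducerRecord<>(TOPIC, (Integer) null, key, value);
    }
}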

Kafka Partitioners in Flink

Source Code Walkthrough

In Flink, a custom Kafka partitioner is written by extending the FlinkKafkaPartitioner abstract class. Here is its source:


@PublicEvolving
public abstract class FlinkKafkaPartitioner<T> implements Serializable {

	private static final long serialVersionUID = -9086719227828020494L;

	// Receives the id of this parallel sink instance and the total number of parallel instances.
	// Not abstract, so subclasses do not have to override it.
	public void open(int parallelInstanceId, int parallelInstances) {
		// overwrite this method if needed.
	}

	/**
	 * @param record the record value
	 * @param key serialized key of the record
	 * @param value serialized value of the record
	 * @param targetTopic target topic for the record
	 * @param partitions found partitions for the target topic
	 * @return the id of the target partition
	 */
	// Abstract: subclasses must implement it and return the concrete target partition.
	public abstract int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions);
}

The class consists of an open method, which receives the parallel instance id and the total number of parallel instances from the runtime, and an abstract partition method, which performs the actual partition assignment.

Based on this class, Flink ships a default partitioner, FlinkFixedPartitioner. Here is its source:

@PublicEvolving
public class FlinkFixedPartitioner<T> extends FlinkKafkaPartitioner<T> {

	private static final long serialVersionUID = -3785320239953858777L;

	private int parallelInstanceId;

	@Override
	public void open(int parallelInstanceId, int parallelInstances) {
		Preconditions.checkArgument(parallelInstanceId >= 0, "Id of this subtask cannot be negative.");
		Preconditions.checkArgument(parallelInstances > 0, "Number of subtasks must be larger than 0.");
		// Store the parallel instance id handed in by the runtime; it is used later for partitioning.
		this.parallelInstanceId = parallelInstanceId;
	}

	@Override
	public int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
		Preconditions.checkArgument(
			partitions != null && partitions.length > 0,
			"Partitions of the target topic is empty.");
		// The Flink parallel instance id modulo the number of Kafka partitions
		// decides which partition this instance writes to.
		return partitions[parallelInstanceId % partitions.length];
	}

	@Override
	public boolean equals(Object o) {
		return this == o || o instanceof FlinkFixedPartitioner;
	}

	@Override
	public int hashCode() {
		return FlinkFixedPartitioner.class.hashCode();
	}
}

As the source shows, this partitioner takes the Flink parallel instance id modulo the number of Kafka partitions to pick a target, so each parallel sink instance writes to exactly one Kafka partition. This keeps the mapping simple and makes good use of the scalability of both Flink and Kafka, although note that if the sink parallelism is lower than the partition count, some partitions will never receive data from this partitioner.

A Custom Partitioner Example

package lenrnflink;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

import java.util.Optional;
import java.util.Properties;

public class Test {

    // Custom partitioner: extend FlinkKafkaPartitioner.
    public static class KafkaPartitioner extends FlinkKafkaPartitioner<String> {

        // Partitioning logic: records starting with "1" go to partition 1,
        // records starting with "2" go to partition 2, everything else goes to partition 0.
        @Override
        public int partition(String s, byte[] bytes, byte[] bytes1, String s2, int[] ints) {
            if (s.startsWith("1")) {
                return 1;
            }
            if (s.startsWith("2")) {
                return 2;
            }
            return 0;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "cdh1:9092,cdh2:9092,cdh3:9092,cdh4:9092,cdh5:9092");
        // The custom partitioner has to be wrapped in an Optional.
        Optional<FlinkKafkaPartitioner<String>> ps = Optional.of(new KafkaPartitioner());
        // Create the Flink-Kafka producer and pass in the custom partitioner.
        FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>("test", new SimpleStringSchema(), properties, ps);
        DataStream<String> dataStream = environment.readTextFile("E:/test.txt", "GB2312");
        dataStream.addSink(kafkaSink);
        environment.execute("WordCount");
    }
}

The above is a very basic custom partitioner; you can build your own around whatever your business requires. A hedged, slightly safer variant is sketched below.
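For instance, instead of returning fixed partition numbers (which breaks if the topic has fewer partitions than expected), a partitioner can hash into the partitions array that Flink passes in, so it only ever returns partitions that actually exist. This is just a sketch; how the key is produced is up to your serialization schema.

import java.util.Arrays;

import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

// Sketch: spread records across whatever partitions the target topic actually has,
// keeping records with the same serialized key in the same partition.
public class KeyHashKafkaPartitioner extends FlinkKafkaPartitioner<String> {

    @Override
    public int partition(String record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
        // 'partitions' holds the real partition ids of the target topic,
        // so indexing into it never returns a partition that does not exist.
        // When the key is null, fall back to hashing the record itself.
        int hash = (key != null) ? Arrays.hashCode(key) : record.hashCode();
        return partitions[(hash & 0x7fffffff) % partitions.length];
    }
}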

Conclusion

For reference, here is an excerpt of the FlinkKafkaProducer constructors from its source:

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String brokerList, String topicId, SerializationSchema<IN> serializationSchema) {
    this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), getPropertiesFromBrokerList(brokerList), (Optional)Optional.of(new FlinkFixedPartitioner()));
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String topicId, SerializationSchema<IN> serializationSchema, Properties producerConfig) {
    this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), producerConfig, (Optional)Optional.of(new FlinkFixedPartitioner()));
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String topicId, SerializationSchema<IN> serializationSchema, Properties producerConfig, Optional<FlinkKafkaPartitioner<IN>> customPartitioner) {
    this(topicId, (KeyedSerializationSchema)(new KeyedSerializationSchemaWrapper(serializationSchema)), producerConfig, (Optional)customPartitioner);
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String brokerList, String topicId, KeyedSerializationSchema<IN> serializationSchema) {
    this(topicId, serializationSchema, getPropertiesFromBrokerList(brokerList), Optional.of(new FlinkFixedPartitioner()));
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig) {
    this(topicId, serializationSchema, producerConfig, Optional.of(new FlinkFixedPartitioner()));
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String topicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, FlinkKafkaProducer.Semantic semantic) {
    this(topicId, serializationSchema, producerConfig, Optional.of(new FlinkFixedPartitioner()), semantic, 5);
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String defaultTopicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, Optional<FlinkKafkaPartitioner<IN>> customPartitioner) {
    this(defaultTopicId, serializationSchema, producerConfig, customPartitioner, FlinkKafkaProducer.Semantic.AT_LEAST_ONCE, 5);
}

/** @deprecated */
@Deprecated
public FlinkKafkaProducer(String defaultTopicId, KeyedSerializationSchema<IN> serializationSchema, Properties producerConfig, Optional<FlinkKafkaPartitioner<IN>> customPartitioner, FlinkKafkaProducer.Semantic semantic, int kafkaProducersPoolSize) {
    this(defaultTopicId, serializationSchema, (FlinkKafkaPartitioner)customPartitioner.orElse((Object)null), (KafkaSerializationSchema)null, producerConfig, semantic, kafkaProducersPoolSize);
}

public FlinkKafkaProducer(String defaultTopic, KafkaSerializationSchema<IN> serializationSchema, Properties producerConfig, FlinkKafkaProducer.Semantic semantic) {
    this(defaultTopic, serializationSchema, producerConfig, semantic, 5);
}

public FlinkKafkaProducer(String defaultTopic, KafkaSerializationSchema<IN> serializationSchema, Properties producerConfig, FlinkKafkaProducer.Semantic semantic, int kafkaProducersPoolSize) {
    this(defaultTopic, (KeyedSerializationSchema)null, (FlinkKafkaPartitioner)null, serializationSchema, producerConfig, semantic, kafkaProducersPoolSize);
}

Reading the FlinkKafkaProducer source, you will notice that it has many constructors, and every one of them except the two taking a KafkaSerializationSchema is marked as deprecated, including all of the constructors that accept a custom FlinkKafkaPartitioner. In other words, the official recommendation is no longer to do data partitioning through a separate custom partitioner.
Only the two constructors whose parameters include a KafkaSerializationSchema remain non-deprecated, which shows that KafkaSerializationSchema is now the recommended way to handle serialization.
Reading further into the source also reveals that partitioning can be handled inside KafkaSerializationSchema as well: combined with the KafkaContextAware interface, the schema can obtain the Flink parallel instance id and the number of parallel instances and use them to pick a partition.
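The exact shape of KafkaContextAware should be checked against your Flink version; as far as I can tell, in Flink 1.10 it exposes setParallelInstanceId/setNumParallelInstances callbacks plus a getTargetTopic method. Under that assumption, a rough sketch of a serializer that pins each parallel sink instance to one partition (mimicking FlinkFixedPartitioner) could look like this; the topic name and partition count are hypothetical:

import java.nio.charset.StandardCharsets;

import javax.annotation.Nullable;

import org.apache.flink.streaming.connectors.kafka.KafkaContextAware;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: assumes the KafkaContextAware callbacks shown here exist in your Flink
// version, and that the topic "test" has at least ASSUMED_PARTITIONS partitions.
public class ContextAwareSerializationSchema
        implements KafkaSerializationSchema<String>, KafkaContextAware<String> {

    private static final String TOPIC = "test";      // assumed topic name
    private static final int ASSUMED_PARTITIONS = 3; // assumed partition count of the topic

    private int parallelInstanceId;

    @Override
    public void setParallelInstanceId(int parallelInstanceId) {
        // Called by the producer with this subtask's index.
        this.parallelInstanceId = parallelInstanceId;
    }

    @Override
    public String getTargetTopic(String element) {
        // Tells the producer which topic's partition metadata it needs.
        return TOPIC;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, @Nullable Long timestamp) {
        // Pin this subtask to one partition; the modulus is an assumption about the topic.
        int partition = parallelInstanceId % ASSUMED_PARTITIONS;
        return new ProducerRecord<>(TOPIC, partition, null, element.getBytes(StandardCharsets.UTF_8));
    }
}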
