A hands-on example and source-code walkthrough of using Kafka as a data input for Apache Beam:
First, we create a Maven project. After adding the core Beam dependencies, we also need to add the following dependency for Kafka support:
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-kafka -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-kafka</artifactId>
    <version>0.5.0</version>
</dependency>
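For reference, the core Beam dependency mentioned above is the SDK core module. A minimal snippet is shown below; the only assumption made here is that its version is kept in line with the Kafka IO module:
<!-- Assumed core SDK dependency; keep its version in line with beam-sdks-java-io-kafka -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>0.5.0</version>
</dependency>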
Once the dependency has been downloaded, we can use utility classes such as KafkaIO to process the data coming from Kafka in Beam.
Before using the KafkaIO class, let's first translate its source-code Javadoc, so that we can use it more effectively.
/**
 * An unbounded source and a sink for <a href="http://kafka.apache.org/">Kafka</a> topics.
 * Kafka version 0.9 and above are supported.
 *
 * <h3>Reading from Kafka topics</h3>
 *
 * <p>KafkaIO source returns unbounded collection of Kafka records as
 * {@code PCollection<KafkaRecord<K, V>>}. A {@link KafkaRecord} includes basic
 * metadata like topic-partition and offset, along with key and value associated with a Kafka
 * record.
 *
 * <p>Although most applications consume a single topic, the source can be configured to consume
 * multiple topics or even a specific set of {@link TopicPartition}s.
 *
 * <p>To configure a Kafka source, you must specify at the minimum the bootstrapServers and
 * at least one topic to consume.
 */
The example code is shown below:
WordCountOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(WordCountOptions.class);
Pipeline pipeline = Pipeline.create(options);

pipeline.apply(KafkaIO.read()
        // configuring the Kafka servers and topics is required;
        // with only this, the source returns PCollection<KafkaRecord<byte[], byte[]>>
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopics(ImmutableList.of("topic_a", "topic_b"))
        // the remaining settings are optional:
        // key and value coders
        .withKeyCoder(BigEndianLongCoder.of())
        .withValueCoder(StringUtf8Coder.of())
        // receive buffer size, in bytes
        .updateConsumerProperties(ImmutableMap.of("receive.buffer.bytes", 1024 * 1024))
        // custom function to compute the record timestamp (default: processing time)
        .withTimestampFn(new MyTimestampFunction())
        // custom watermark function (default: the record timestamp)
        .withWatermarkFn(new MyWatermarkFunction())
        // finally, drop the Kafka metadata if you do not need it;
        // this yields PCollection<KV<Long, String>>
        .withoutMetadata())
    // keep only the values, giving PCollection<String>
    .apply(Values.<String>create());
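If you do need the Kafka metadata described in the Javadoc above, simply omit withoutMetadata() and work with the KafkaRecord elements directly. The following is only a rough sketch against the same Beam 0.5.0 API quoted above (imports are omitted as in the snippets above, and the DoFn body and output format are my own illustration, not taken from the Javadoc):
// Rough sketch: keep the metadata and read topic, partition and offset from each KafkaRecord.
pipeline.apply(KafkaIO.read()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopics(ImmutableList.of("topic_a"))
        .withKeyCoder(ByteArrayCoder.of())
        .withValueCoder(StringUtf8Coder.of()))
    .apply(ParDo.of(new DoFn<KafkaRecord<byte[], String>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            KafkaRecord<byte[], String> record = c.element();
            // topic-partition and offset are the basic metadata carried by each KafkaRecord
            c.output(record.getTopic() + "/" + record.getPartition()
                + "@" + record.getOffset() + " : " + record.getKV().getValue());
        }
    }));
pipeline.run();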
/**
 * <h3>Partitions and Checkpointing</h3>
 *
 * <p>The Kafka partitions are evenly distributed among splits (workers).
 * Dataflow checkpointing is fully supported and each split can resume from previous checkpoint.
 * See {@link UnboundedKafkaSource#generateInitialSplits(int, PipelineOptions)} for more details
 * on splits and checkpoint support.
 *
 * <p>When the pipeline starts for the first time and there is no checkpoint yet, the source
 * starts consuming from the latest offsets.
 */
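In practice you may not want to start from the latest offsets on the first run. One way to change this, sketched below, is to pass the standard Kafka consumer property auto.offset.reset through updateConsumerProperties; this property comes from the Kafka consumer configuration, not from the excerpt above:
// Sketch: start from the earliest available offsets on the first run
// ("auto.offset.reset" is a standard Kafka consumer property, used here as an assumption).
KafkaIO.read()
    .withBootstrapServers("broker_1:9092,broker_2:9092")
    .withTopics(ImmutableList.of("topic_a"))
    .updateConsumerProperties(ImmutableMap.<String, Object>of("auto.offset.reset", "earliest"));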
/**
 * <h3>Writing to Kafka</h3>
 *
 * <p>KafkaIO sink supports writing key-value pairs to a Kafka topic. Users can also write
 * just the values. To configure a Kafka sink, you must specify at the minimum the Kafka
 * bootstrapServers and the topic to write to.
 */
The example code is shown below:
PCollection<KV<Long, String>> kvColl = ...;
kvColl.apply(KafkaIO.write()
    .withBootstrapServers("broker_1:9092,broker_2:9092")
    .withTopic("results")
    // set the key and value coders
    .withKeyCoder(BigEndianLongCoder.of())
    .withValueCoder(StringUtf8Coder.of())
    // further customize the KafkaProducer if needed,
    // e.g. producer properties such as the compression type
    .updateProducerProperties(ImmutableMap.of("compression.type", "gzip")));
Often you might want to write just the values, without any keys, to Kafka. Use values() to write records with a default empty (null) key:
PCollection<String> strings = ...;
strings.apply(KafkaIO.write()
    .withBootstrapServers("broker_1:9092,broker_2:9092")
    .withTopic("results")
    .withValueCoder(StringUtf8Coder.of())
    // only the values are written; the key defaults to null
    .values());