Stream joins in Beam: joining Kafka input with a file


Overview: join data read from a file with data read from Kafka.

Although the pipeline can read from Kafka, a file sink cannot accept an unbounded source, so the Kafka read has to be bounded, either to the first few records or to a fixed read duration (see the .withMaxNumRecords / .withMaxReadTime calls in the listing below).
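The relevant calls are excerpted here as a minimal sketch; the broker address, topic name, and concrete limits are simply the example values used in the full listing, and either bound on its own is enough to make the source bounded:

PCollection<KV<Long, String>> bounded = p.apply(
        KafkaIO.<Long, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("test")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                //.withMaxNumRecords(5)                   // bound by record count, or
                .withMaxReadTime(Duration.millis(5000))   // bound by read time
                .withoutMetadata());                      // drop Kafka metadata -> KV<Long, String>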

Complete code:

/**
 * Join test using Kafka
 * Beam version: 2.3
 * @author: maqy
 * @date: 2018.09.22
 */

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.*;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class KafkaJoin {
    //Prints each KV element in the PCollection and passes it through
    static class printString extends DoFn<KV<String ,String> ,KV<String ,String>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            System.out.println("c.element:" + c.element());
            c.output(c.element());
        }
    }
    //Splits each line into a key and a value
    static class SetValue extends DoFn<String ,KV<String ,String>>{
        @ProcessElement
        public void processElement(ProcessContext c){
            System.out.println("c.element:"+c.element());
            String[] temps=c.element().split(",");
//            for(String temp:temps){
//                System.out.println(temp);
//            }
            KV<String ,String> kv=KV.of(temps[0],temps[1]);
            c.output(kv);
        }
    }

    //Preprocessing: converts each line of the form "a,b" into KV<a,b>
    public static class Preprocess extends PTransform<PCollection<String>,PCollection<KV<String,String>>> {
        @Override
        public PCollection<KV<String,String>> expand(PCollection<String> lines){
            //String[] temps = lines.toString().split(",");
            PCollection<KV<String,String>> result = lines.apply(ParDo.of(new SetValue()));
            return result;
        }

    }

    //Formats a joined KV pair as a line of text for output
    public static class FormatAsTextFn extends SimpleFunction<KV<String,KV<String,String>>, String> {
        @Override
        public String apply(KV<String,KV<String,String>> input) {
            return "key:"+input.getKey()+"   value:"+input.getValue();
        }
    }
    static final String TOKENIZER_PATTERN = "[^\\p{L}]+"; //not used in this example

    public interface KafkaJoinOptions extends PipelineOptions,StreamingOptions {
        /**
         * Set this option to choose the input file or glob to read from.
         */
        @Description("Path of the file to read from")
        @Default.String("/home/maqy/Documents/beam_samples/output/test.txt")
        //Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")
        String getInputFile();
        void setInputFile(String value);

        /**
         * Set this required option to specify where to write the output.
         */
        @Description("Path of the file to write to")
        @Validation.Required
        @Default.String("/home/maqy/文檔/beam_samples/output/GroupbyTest")
        String getOutput();
        void setOutput(String value);
    }

    public static void main(String[] args) {
        KafkaJoinOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(KafkaJoinOptions.class);

        options.setStreaming(true);
        // Create the Pipeline object with the options we defined above.
        Pipeline p = Pipeline.create(options);

        //The element type is KV<Long,String> because it has to match the key/value deserializers configured below
        PCollection<KV<Long,String>> source=p.apply(KafkaIO.<Long, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("test")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)

                .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object)"earliest"))
                // We're writing to a file, which does not support unbounded data sources,
                // so the Kafka read has to be bounded: either to a fixed number of records
                // or to a fixed read time. In reality, we would likely be writing to a sink
                // that supports unbounded data, such as BigQuery.
                //.withMaxNumRecords(5)
                .withMaxReadTime(Duration.millis(5000))
                .withoutMetadata() // PCollection<KV<Long, String>>
        );

        //Extract just the String values from the KV<Long,String> pairs read from Kafka
        PCollection<String> kafkaLine=source.apply(Values.<String>create());
        //Read the file side of the join (note: the path is hardcoded here rather than taken from --inputFile)
        PCollection<String> fileLine=p.apply(TextIO.read().from("/home/maqy/桌面/output/kafkaJoin"));

        //Preprocessing: convert each line of the form "a,b" into KV<a,b>
        PCollection<KV<String,String>> leftPcollection=kafkaLine.apply(new Preprocess());

        PCollection<KV<String,String>> rightPcollection=fileLine.apply(new Preprocess());

        PCollection<KV<String,KV<String,String>>> joinedPcollection = Join.innerJoin(leftPcollection,rightPcollection);
        //joinedPcollection.apply(ParDo.of(new printString()));  // optional: print the joined elements
        joinedPcollection.apply(MapElements.via(new FormatAsTextFn()))
                .apply("WriteCounts", TextIO.write().to("/home/maqy/桌面/output/kafkaJoinOut1"));

        p.run().waitUntilFinish();
    }
}

Contents of the kafkaJoin file:

ma,qy
li,x
a,b
c,d

Records sent in through Kafka:

ma,hhhhh
a,bbbbb
c,ddsdfsdf

Result of the join (an inner join, so the key li, which only appears in the file, is not in the output):

key:a   value:KV{bbbbb, b}
key:c   value:KV{ddsdfsdf, d}
key:ma   value:KV{hhhhh, qy}
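
For reference, Join.innerJoin from the join library is essentially a convenience over CoGroupByKey. A minimal sketch of an equivalent expansion, assuming the same leftPcollection and rightPcollection as above (an illustration only, not the library's actual source), could look like this:

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;

        //Tags identify which input each value came from after the co-group
        final TupleTag<String> leftTag = new TupleTag<>();
        final TupleTag<String> rightTag = new TupleTag<>();

        PCollection<KV<String, CoGbkResult>> grouped =
                KeyedPCollectionTuple.of(leftTag, leftPcollection)
                        .and(rightTag, rightPcollection)
                        .apply(CoGroupByKey.create());

        //Emit one output per pair of left/right values sharing the same key (inner join semantics)
        PCollection<KV<String, KV<String, String>>> joined = grouped.apply(
                ParDo.of(new DoFn<KV<String, CoGbkResult>, KV<String, KV<String, String>>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        String key = c.element().getKey();
                        for (String leftValue : c.element().getValue().getAll(leftTag)) {
                            for (String rightValue : c.element().getValue().getAll(rightTag)) {
                                c.output(KV.of(key, KV.of(leftValue, rightValue)));
                            }
                        }
                    }
                }));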