Flink WordCount: the Lambda Version

The first introductory program when learning Flink is WordCount. The official example implements it with anonymous classes, which makes the code look rather verbose, so I wanted to rewrite it with lambdas. I hit quite a few pitfalls along the way and am recording them here.

Table of Contents

The Official Version

Lambda Version 1: POJO

Error 1: Missing generic type parameters for Collector

Error 2: .keyBy("word") fails because the type cannot be used as a key

Lambda Version 2: Tuple2


Flink version: 1.9

https://ci.apache.org/projects/flink/flink-docs-release-1.9/getting-started/tutorials/local_setup.html


The Official Version

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {

    public static void main(String[] args) throws Exception {

        // the port to connect to
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
            return;
        }

        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // get input data by connecting to the socket
        DataStream<String> text = env.socketTextStream("localhost", port, "\n");

        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
            .flatMap(new FlatMapFunction<String, WordWithCount>() {
                @Override
                public void flatMap(String value, Collector<WordWithCount> out) {
                    for (String word : value.split("\\s")) {
                        out.collect(new WordWithCount(word, 1L));
                    }
                }
            })
            .keyBy("word")
            .timeWindow(Time.seconds(5), Time.seconds(1))
            .reduce(new ReduceFunction<WordWithCount>() {
                @Override
                public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                    return new WordWithCount(a.word, a.count + b.count);
                }
            });

        // print the results with a single thread, rather than in parallel
        windowCounts.print().setParallelism(1);

        env.execute("Socket Window WordCount");
    }

    // Data type for words with count
    public static class WordWithCount {

        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}

Lambda Version 1: POJO

package com.my.study.flink;

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;


/**
 * Description: 
 *
 * @author adore.chen
 * @date 2019-11-19
 */
public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        ParameterTool tool = ParameterTool.fromArgs(args);
        int port = tool.getInt("port");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.socketTextStream("localhost", port, "\n");

        dataStream.flatMap((String value, Collector<WordCount> out) -> {
            for (String word: value.split("\\s")) {
                if (word.trim().length()>0) {
                    out.collect(new WordCount(word, 1));
                }
            }
        })
            // explicitly declare the lambda's output type (see Error 1 below)
            .returns(WordCount.class)
            // key by the word field via a lambda KeySelector (see Error 2 below)
            .keyBy((WordCount wc) -> wc.word)
            .reduce((WordCount wc1, WordCount wc2) -> new WordCount(wc1.word, wc1.count + wc2.count))
            .print();

        env.execute("socket word count");

    }

    public static class WordCount {

        private String word;
        private int count;

        public WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + ":" +count;
        }
    }

} 

Error 1: Missing generic type parameters for Collector

 InvalidTypesException: The generic type parameters of 'Collector' are missing. In many cases lambda methods don't provide enough information for automatic type extraction when Java generics are involved. An easy workaround is to use an (anonymous) class instead that implements the 'org.apache.flink.api.common.functions.FlatMapFunction' interface. Otherwise the type has to be specified explicitly using type information.
    at org.apache.flink.api.java.typeutils.TypeExtractionUtils.validateLambdaType(TypeExtractionUtils.java:350)
    at org.apache.flink.api.java.typeutils.TypeExtractionUtils.extractTypeFromLambda(TypeExtractionUtils.java:176)
    at org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:571)
    at org.apache.flink.api.java.typeutils.TypeExtractor.getFlatMapReturnTypes(TypeExtractor.java:196)
    at org.apache.flink.streaming.api.datastream.DataStream.flatMap(DataStream.java:611)
    at com.coupang.ecfds.flink.SocketStreamWordCount.main(SocketStreamWordCount.java:24) 

When the lambda expression is compiled, the generic type information is erased, so Flink cannot determine the function's return type on its own; it has to be declared explicitly via a returns(TypeInformation) call.

For details see: Flink TypeInformation https://www.cnblogs.com/qcloud1001/p/9626462.html
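For reference, below is a minimal self-contained sketch of the explicit type declaration (the class name ReturnsExample and the fromElements input are made up for illustration; returns(Class), TypeInformation.of and TypeHint are standard Flink APIs):

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class ReturnsExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
            .fromElements("to be or not to be")
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split("\\s")) {
                    out.collect(new Tuple2<>(word, 1));
                }
            })
            // The lambda's output type is erased at compile time, so it must be
            // restored explicitly. For a plain class, .returns(WordCount.class)
            // is enough; for a generic type like Tuple2<String, Integer> the
            // type parameters have to be captured with a TypeHint:
            .returns(TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));

        counts.print();
        env.execute("returns() example");
    }
}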

 

Error 2: .keyBy("word") fails because the type cannot be used as a key

InvalidProgramException: This type (GenericType<com.coupang.ecfds.flink.SocketStreamWordCount.WordCount>) cannot be used as key.
    at org.apache.flink.api.common.operators.Keys$ExpressionKeys.<init>(Keys.java:330)
    at org.apache.flink.streaming.api.datastream.DataStream.keyBy(DataStream.java:337)
    at com.coupang.ecfds.flink.SocketStreamWordCount.main(SocketStreamWordCount.java:32)

This is not actually a bug in Flink. The WordCount class above has private fields without getters/setters and no no-argument constructor, so Flink cannot recognize it as a POJO and falls back to a GenericType, and field-expression keys such as .keyBy("word") only work on tuple and POJO types. The simplest workaround is to implement the KeySelector functional interface with a lambda.

Solution

.keyBy((WordCount wc) -> wc.word)

Reference: KeySelector https://www.jianshu.com/p/3763854d609b
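For comparison, the lambda is just shorthand for an anonymous class implementing the org.apache.flink.api.java.functions.KeySelector interface; a sketch of the equivalent (alternatively, giving WordCount a no-argument constructor and public fields or getters/setters would make it a valid Flink POJO, so that .keyBy("word") works just like in the official example):

.keyBy(new KeySelector<WordCount, String>() {
    @Override
    public String getKey(WordCount wc) throws Exception {
        // the word itself is the key, exactly what the lambda version returns
        return wc.word;
    }
})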

 

Lambda Version 2: Tuple2

package com.coupang.ecfds.flink;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;


/**
 * Description: 
 *
 * @author adore.chen
 * @date 2019-11-19
 */
public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        ParameterTool tool = ParameterTool.fromArgs(args);
        int port = tool.getInt("port");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.socketTextStream("localhost", port, "\n");

        dataStream.flatMap((String value, Collector<Tuple2<String,Integer>> out) -> {
            for (String word: value.split("\\s")) {
                if (word.trim().length()>0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        })
            // explicitly declare the tuple's element types (see Error 1 above)
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            // key by tuple field 0, i.e. the word
            .keyBy(0)
            .reduce((Tuple2<String,Integer> wc1, Tuple2<String,Integer> wc2) -> new Tuple2<>(wc1.f0, wc1.f1 + wc2.f1))
            .print();

        env.execute("socket word count");

    }

} 

The Tuple2 version feels much more concise.
