Flink WordCount: Lambda Version

The first program everyone writes when learning Flink is WordCount. The official example uses anonymous inner classes, which makes the code fairly verbose, so I rewrote it with lambdas and hit quite a few pitfalls along the way. This post records them.

Table of Contents

Official version

Lambda version 1: POJO

Error 1: missing generic type parameters on Collector

Error 2: .keyBy("word") fails because the type cannot be used as a key

Lambda version 2: Tuple2


Flink version: 1.9

Official tutorial: https://ci.apache.org/projects/flink/flink-docs-release-1.9/getting-started/tutorials/local_setup.html


Official version

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {

    public static void main(String[] args) throws Exception {

        // the port to connect to
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
            return;
        }

        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // get input data by connecting to the socket
        DataStream<String> text = env.socketTextStream("localhost", port, "\n");

        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
            .flatMap(new FlatMapFunction<String, WordWithCount>() {
                @Override
                public void flatMap(String value, Collector<WordWithCount> out) {
                    for (String word : value.split("\\s")) {
                        out.collect(new WordWithCount(word, 1L));
                    }
                }
            })
            .keyBy("word")
            .timeWindow(Time.seconds(5), Time.seconds(1))
            .reduce(new ReduceFunction<WordWithCount>() {
                @Override
                public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                    return new WordWithCount(a.word, a.count + b.count);
                }
            });

        // print the results with a single thread, rather than in parallel
        windowCounts.print().setParallelism(1);

        env.execute("Socket Window WordCount");
    }

    // Data type for words with count
    public static class WordWithCount {

        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}

Lambda version 1: POJO

package com.my.study.flink;

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;


/**
 * Description: Socket WordCount, lambda version using a POJO.
 *
 * @author adore.chen
 * @date 2019-11-19
 */
public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        ParameterTool tool = ParameterTool.fromArgs(args);
        int port = tool.getInt("port");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.socketTextStream("localhost", port, "\n");

        dataStream.flatMap((String value, Collector<WordCount> out) -> {
            for (String word: value.split("\\s")) {
                if (word.trim().length()>0) {
                    out.collect(new WordCount(word, 1));
                }
            }
        })
            .returns(WordCount.class) 
            .keyBy((WordCount wc) -> wc.word)
            .reduce((WordCount wc1, WordCount wc2) -> new WordCount(wc1.word, wc1.count + wc2.count))
            .print();

        env.execute("socket word count");

    }

    public static class WordCount {
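        // NOTE: private fields without getters/setters and no no-arg constructor,
        // so Flink treats this class as a GenericType rather than a POJO (see Error 2 below).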

        private String word;
        private int count;

        public WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + ":" + count;
        }
    }

} 

Error 1: missing generic type parameters on Collector

 InvalidTypesException: The generic type parameters of 'Collector' are missing. In many cases lambda methods don't provide enough information for automatic type extraction when Java generics are involved. An easy workaround is to use an (anonymous) class instead that implements the 'org.apache.flink.api.common.functions.FlatMapFunction' interface. Otherwise the type has to be specified explicitly using type information.
    at org.apache.flink.api.java.typeutils.TypeExtractionUtils.validateLambdaType(TypeExtractionUtils.java:350)
    at org.apache.flink.api.java.typeutils.TypeExtractionUtils.extractTypeFromLambda(TypeExtractionUtils.java:176)
    at org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:571)
    at org.apache.flink.api.java.typeutils.TypeExtractor.getFlatMapReturnTypes(TypeExtractor.java:196)
    at org.apache.flink.streaming.api.datastream.DataStream.flatMap(DataStream.java:611)
    at com.coupang.ecfds.flink.SocketStreamWordCount.main(SocketStreamWordCount.java:24) 

After compilation, Java's type erasure strips the lambda's generic type parameters, so Flink's type extraction cannot determine the flatMap output type on its own. It has to be declared explicitly with returns(TypeInformation), which is what the .returns(WordCount.class) call above does.

For details, see: Flink TypeInformation https://www.cnblogs.com/qcloud1001/p/9626462.html
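As a side note (my addition, not from the original post): when the output type is itself generic, e.g. Tuple2<String, Integer>, a plain Class object cannot carry the type parameters. Below is a minimal sketch of the two usual ways to declare the type explicitly; the class name, port 9000 and the deliberately trivial flatMap body are my own placeholders.

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class ReturnsExamples {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = env.socketTextStream("localhost", 9000, "\n");

        // Variant A: the Types helper, also used in the Tuple2 version below
        lines.flatMap((String value, Collector<Tuple2<String, Integer>> out) -> out.collect(new Tuple2<>(value, 1)))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .print();

        // Variant B: a TypeHint, which captures the full generic signature
        lines.flatMap((String value, Collector<Tuple2<String, Integer>> out) -> out.collect(new Tuple2<>(value, 1)))
                .returns(new TypeHint<Tuple2<String, Integer>>() {})
                .print();

        env.execute("returns examples");
    }
}

Variant A is what the Tuple2 version further down uses; variant B does the same thing while keeping the full generic signature in one place.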

 

Error 2: .keyBy("word") fails because the type cannot be used as a key

InvalidProgramException: This type (GenericType<com.coupang.ecfds.flink.SocketStreamWordCount.WordCount>) cannot be used as key.
    at org.apache.flink.api.common.operators.Keys$ExpressionKeys.<init>(Keys.java:330)
    at org.apache.flink.streaming.api.datastream.DataStream.keyBy(DataStream.java:337)
    at com.coupang.ecfds.flink.SocketStreamWordCount.main(SocketStreamWordCount.java:32)

This is not actually a Flink bug: the WordCount class above has private fields without getters/setters and no no-arg constructor, so Flink does not recognize it as a POJO and falls back to GenericType, which cannot be keyed by a field name. Rather than reworking the class, the easiest fix is to implement the KeySelector functional interface with a lambda.

Solution

.keyBy((WordCount wc) -> wc.word)

Reference: KeySelector https://www.jianshu.com/p/3763854d609b
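For completeness (my addition, not from the original post): if you prefer the field-name form .keyBy("word"), the class only needs to satisfy Flink's POJO rules, roughly a public class with a public no-arg constructor and fields that are public or have getter/setter pairs. A sketch of such a variant of WordCount:

    // With this definition Flink derives PojoTypeInfo instead of GenericType,
    // so both .keyBy("word") and the KeySelector lambda are accepted.
    public static class WordCount {

        public String word;    // public fields (or getter/setter pairs) are required
        public int count;

        public WordCount() {}  // a public no-arg constructor is required

        public WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + ":" + count;
        }
    }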

 

Lambda version 2: Tuple2

package com.coupang.ecfds.flink;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;


/**
 * Description: Socket WordCount, lambda version using Tuple2.
 *
 * @author adore.chen
 * @date 2019-11-19
 */
public class SocketStreamWordCount {

    public static void main(String[] args) throws Exception {
        ParameterTool tool = ParameterTool.fromArgs(args);
        int port = tool.getInt("port");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = env.socketTextStream("localhost", port, "\n");

        dataStream.flatMap((String value, Collector<Tuple2<String,Integer>> out) -> {
            for (String word: value.split("\\s")) {
                if (word.trim().length()>0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(0)
            .reduce((Tuple2<String,Integer> wc1, Tuple2<String,Integer> wc2) -> new Tuple2<>(wc1.f0, wc1.f1 + wc2.f1))
            .print();

        env.execute("socket word count");

    }

} 

This feels a lot more concise.
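One last note (my addition): unlike the official version, both lambda versions above drop the 5-second sliding window. If you want it back, it slots in between keyBy and reduce. A sketch reusing the dataStream from the Tuple2 version, with an extra import of org.apache.flink.streaming.api.windowing.time.Time:

        dataStream.flatMap((String value, Collector<Tuple2<String, Integer>> out) -> {
            for (String word : value.split("\\s")) {
                if (word.trim().length() > 0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(0)
            // sliding window: 5 seconds long, evaluated every second, as in the official example
            .timeWindow(Time.seconds(5), Time.seconds(1))
            .reduce((Tuple2<String, Integer> a, Tuple2<String, Integer> b) -> new Tuple2<>(a.f0, a.f1 + b.f1))
            .print();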
