Flink Basics (3): Common Transformation Operators

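In Flink, an operator transforms one or more DataStreams into a new DataStream, and multiple transformations can be combined into a complex dataflow topology. Flink has several different DataStream types, and operators are what convert one type into another.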

Map 

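map is a one-to-one mapping: it applies a transformation to each element and emits exactly one new element for it.

Typical uses: data cleaning, field extraction, and format conversion.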
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

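// This example listens on socket port 9992, splits the incoming text on "\n",
// prefixes each resulting element with a running index and a label, and prints it.
public class FlinkMapDemo {
    // Demo-only counter: a shared static field is not reliable once the job runs with parallelism > 1.
    private static int index = 1;
    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Define the source: listen for socket messages on port 9992
        DataStream<String> textStream = env.socketTextStream("localhost", 9992, "\n");
        // 3. Apply the map transformation
        DataStream<String> result = textStream.map(s -> (index++) + ". You entered: " + s);
        // 4. Print sink
        result.print();
        // 5. Start execution
        env.execute();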
    }

}

Filter

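filter selects events: each element is either kept for downstream processing or dropped, depending on the boolean returned by the FilterFunction.

true keeps the element; false discards it.

Typical uses: filtering out dirty data, data cleaning, and so on.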

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Arrays;

public class FilterDemo {

    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Input: user actions. A user clicked or browsed a product at some point in time; each record also carries the product price.
        DataStreamSource<UserAction> source = env.fromCollection(Arrays.asList(
                new UserAction("userID1", 1293984000000L, "click", "productID1", 10),
                new UserAction("userID2", 1293984001000L, "browse", "productID2", 8),
                new UserAction("userID1", 1293984002000L, "click", "productID1", 10)
        ));

        // Filter: keep only the actions of the user whose ID is userID1
        SingleOutputStreamOperator<UserAction> result = source.filter(new FilterFunction<UserAction>() {
            @Override
            public boolean filter(UserAction value) throws Exception {
                return value.getUserId().equals("userID1");
            }
        });

        // Output: print to the console (with parallelism > 1, each line is prefixed with a subtask index)
        // UserAction{userId='userID1', timeStamp=1293984000000, eventType='click', productId='productID1', productPrice=10}
        // UserAction{userId='userID1', timeStamp=1293984002000, eventType='click', productId='productID1', productPrice=10}
        result.print();

        env.execute();

    }


    public static class UserAction{
        private String userId;
        private Long timeStamp;
        private String eventType;
        private String productId;
        private Integer productPrice;

        public UserAction(String userId, Long timeStamp, String eventType, String productId, Integer productPrice) {
            this.userId = userId;
            this.timeStamp = timeStamp;
            this.eventType = eventType;
            this.productId = productId;
            this.productPrice = productPrice;
        }

      

        public String getUserId() {
            return userId;
        }

        public void setUserId(String userId) {
            this.userId = userId;
        }

        public Long getTimeStamp() {
            return timeStamp;
        }

        public void setTimeStamp(Long timeStamp) {
            this.timeStamp = timeStamp;
        }

        public String getEventType() {
            return eventType;
        }

        public void setEventType(String eventType) {
            this.eventType = eventType;
        }

        public String getProductId() {
            return productId;
        }

        public void setProductId(String productId) {
            this.productId = productId;
        }

        public Integer getProductPrice() {
            return productPrice;
        }

        public void setProductPrice(Integer productPrice) {
            this.productPrice = productPrice;
        }

        @Override
        public String toString() {
            return "UserAction{" +
                    "userId='" + userId + '\'' +
                    ", timeStamp=" + timeStamp +
                    ", eventType='" + eventType + '\'' +
                    ", productId='" + productId + '\'' +
                    ", productPrice=" + productPrice +
                    '}';
        }
    }
}

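FlatMap: turns one record into zero or more records. The example below splits a sentence (one record) into multiple words (multiple records).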

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlatMapOperatorDemo {
    public static void main(String[] args) throws Exception{
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Input: lines of English movie dialogue
        DataStreamSource<String> source = env
                .fromElements(
                        "hello world",
                        "I like to play football"
                );

        // Transformation: split each sentence that contains "football" into one word per record
        SingleOutputStreamOperator<String> result = source.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                if(value.contains("football")){
                    String[] words = value.split(" ");
                    for (String word : words) {
                        out.collect(word);
                    }
                }
            }
        });

        // Output: print to the console
        // I
        // like
        // to
        // play
        // football
        result.print();

        env.execute();
    }

}
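
The three operators covered so far differ mainly in how many output records each input record produces: map emits exactly one, filter emits zero or one, and flatMap emits zero or more. The sketch below illustrates this with java.util.stream from the JDK purely as an analogy; it is not the Flink API, and the class name OperatorCardinalitySketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Analogy only: uses JDK streams, not Flink, to show output cardinality.
final class OperatorCardinalitySketch {

    // map: exactly one output per input.
    static List<Integer> lengths(List<String> lines) {
        return lines.stream().map(String::length).collect(Collectors.toList());
    }

    // filter: zero or one output per input.
    static List<String> containing(List<String> lines, String word) {
        return lines.stream().filter(l -> l.contains(word)).collect(Collectors.toList());
    }

    // flatMap: zero or more outputs per input.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "I like to play football");
        System.out.println(lengths(lines));                 // two inputs, two outputs
        System.out.println(containing(lines, "football"));  // two inputs, one output
        System.out.println(words(lines));                   // two inputs, seven outputs
    }
}
```

Unlike a JDK stream, a Flink DataStream is unbounded and distributed, so the comparison only holds for the per-record input/output relationship.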

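Reduce: aggregates a keyed stream. Each incoming element is combined with the value returned by the previous reduce call for the same key, and the new value is returned.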

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;

import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Arrays;

public class ReduceOperatorDemo {
    public static void main(String[] args) throws Exception{

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Input: user actions. A user clicked or browsed a product at some point in time; each record also carries the product price.
        DataStreamSource<UserAction> source = env.fromCollection(Arrays.asList(
                new UserAction("userID1", 1293984000L, "click", "productID1", 10),
                new UserAction("userID2", 1293984001L, "browse", "productID2", 8),
                new UserAction("userID2", 1293984002L, "browse", "productID2", 8),
                new UserAction("userID2", 1293984003L, "browse", "productID2", 8),
                new UserAction("userID1", 1293984002L, "click", "productID1", 10),
                new UserAction("userID1", 1293984003L, "click", "productID3", 10),
                new UserAction("userID1", 1293984004L, "click", "productID1", 10)
        ));

        // Transformation: keyBy repartitions the stream by user ID
        KeyedStream<UserAction, String> keyedStream = source.keyBy(new KeySelector<UserAction, String>() {
            @Override
            public String getKey(UserAction value) throws Exception {
                return value.getUserId();
            }
        });

        // Transformation: rolling aggregation with reduce. Here it keeps a running total of product prices per user.
        SingleOutputStreamOperator<UserAction> result = keyedStream.reduce(new ReduceFunction<UserAction>() {
            @Override
            public UserAction reduce(UserAction value1, UserAction value2) throws Exception {
                int newProductPrice = value1.getProductPrice() + value2.getProductPrice();
                return new UserAction(value1.getUserId(), -1L, "", "", newProductPrice);
            }
        });
        
        result.print();

        env.execute();
    }


    public static class UserAction{
        private String userId;
        private Long timeStamp;
        private String eventType;
        private String productId;
        private Integer productPrice;

        public UserAction(String userId, Long timeStamp, String eventType, String productId, Integer productPrice) {
            this.userId = userId;
            this.timeStamp = timeStamp;
            this.eventType = eventType;
            this.productId = productId;
            this.productPrice = productPrice;
        }



        public String getUserId() {
            return userId;
        }

        public void setUserId(String userId) {
            this.userId = userId;
        }

        public Long getTimeStamp() {
            return timeStamp;
        }

        public void setTimeStamp(Long timeStamp) {
            this.timeStamp = timeStamp;
        }

        public String getType() {
            return eventType;
        }

        public void setType(String eventType) {
            this.eventType = eventType;
        }

        public String getProductId() {
            return productId;
        }

        public void setProductId(String productId) {
            this.productId = productId;
        }

        public String getEventType() {
            return eventType;
        }

        public void setEventType(String eventType) {
            this.eventType = eventType;
        }

        public Integer getProductPrice() {
            return productPrice;
        }

        public void setProductPrice(Integer productPrice) {
            this.productPrice = productPrice;
        }

        @Override
        public String toString() {
            return "UserAction{" +
                    "userId='" + userId + '\'' +
                    ", timeStamp=" + timeStamp +
                    ", eventType='" + eventType + '\'' +
                    ", productId='" + productId + '\'' +
                    ", productPrice=" + productPrice +
                    '}';
        }
    }
}
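
The Reduce demo above does not list its output. A rolling reduce on a keyed stream emits one updated aggregate per input record: the first record of a key passes through as-is, and every later record is combined with the stored value for that key. The plain-Java sketch below imitates that behaviour for the same user IDs and prices; it is not Flink code, and RollingReduceSketch and rollingSum are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of keyBy(userId).reduce(sum of prices); not Flink code.
final class RollingReduceSketch {

    // Returns one "userId=runningTotal" entry per input record, in input order.
    static List<String> rollingSum(String[] userIds, int[] prices) {
        Map<String, Integer> totals = new HashMap<>();   // per-key state
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < userIds.length; i++) {
            // The first record of a key is stored unchanged; later ones are added to the stored total.
            int total = totals.merge(userIds[i], prices[i], Integer::sum);
            emitted.add(userIds[i] + "=" + total);
        }
        return emitted;
    }

    public static void main(String[] args) {
        String[] users = {"userID1", "userID2", "userID2", "userID2", "userID1", "userID1", "userID1"};
        int[] prices = {10, 8, 8, 8, 10, 10, 10};
        // prints [userID1=10, userID2=8, userID2=16, userID2=24, userID1=20, userID1=30, userID1=40]
        System.out.println(rollingSum(users, prices));
    }
}
```

In a real job with parallelism greater than 1, records of different keys may interleave differently in the printed output, but the sequence of totals for each individual key stays the same.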

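Distinct: returns the distinct elements of a data set. It removes duplicate entries from the input DataSet, comparing either all fields of the elements or a subset of the fields. Note that this operator and the ones that follow use the batch DataSet API (ExecutionEnvironment), not DataStream.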

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

// Deduplicate data
public class DistinctDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Create the data source
        DataSet<Tuple3<Long, String, Integer>> ds = env.fromElements(Tuple3.of(1L, "zhangsan", 28),
                Tuple3.of(3L, "lisi", 34),
                Tuple3.of(3L, "wangwu", 23),
                Tuple3.of(3L, "zhaoliu", 34),
                Tuple3.of(3L, "lili", 25));

        // Deduplicate on field 0 (the Long id): one record per distinct id remains
        ds.distinct(0).print();
    }
}
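
With distinct(0), only field 0 is compared, so two records that share an id count as duplicates even when their other fields differ. The plain-Java sketch below imitates this on a few sample rows. It is not Flink code, the names DistinctByFieldSketch and distinctById are hypothetical, and keeping the first row per key is an assumption of the sketch: which duplicate survives in a real Flink job should not be relied on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of distinct on a single field; not Flink code.
final class DistinctByFieldSketch {

    // Each row is {id, name, age}; rows whose id was already seen are dropped.
    static List<Object[]> distinctById(List<Object[]> rows) {
        Map<Object, Object[]> firstPerKey = new LinkedHashMap<>();
        for (Object[] row : rows) {
            firstPerKey.putIfAbsent(row[0], row);  // assumption: keep the first row per key
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[]{1L, "zhangsan", 28});
        rows.add(new Object[]{3L, "lisi", 34});
        rows.add(new Object[]{3L, "wangwu", 23});
        for (Object[] row : distinctById(rows)) {
            System.out.println(row[0] + "," + row[1] + "," + row[2]);
        }
    }
}
```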

Union: produces the union of two DataSets, which must be of the same type. Duplicates are not removed.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

// Union of two data sets
public class UnionDemo {

    public static void main(String[] args) throws Exception {
        // Obtain the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ArrayList<Tuple2<Integer,String>> list1 = new ArrayList<>();
        list1.add(new Tuple2<>(101,"lily"));
        list1.add(new Tuple2<>(102,"lucy"));
        list1.add(new Tuple2<>(103,"tom"));

        ArrayList<Tuple2<Integer,String>> list2 = new ArrayList<>();
        list2.add(new Tuple2<>(101,"lili"));
        list2.add(new Tuple2<>(102,"jack"));
        list2.add(new Tuple2<>(103,"jetty"));

        DataSet<Tuple2<Integer, String>> ds1 = env.fromCollection(list1);
        DataSet<Tuple2<Integer, String>> ds2 = env.fromCollection(list2);

        DataSet<Tuple2<Integer, String>> union = ds1.union(ds2);

        union.print();
    }
}

OuterJoin: performs a left, right, or full outer join on two data sets. After the join, a JoinFunction or FlatJoinFunction can be used to process each joined pair.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

import java.util.ArrayList;

/**
 * Outer join
 *
 * @author dajiangtai
 * @create 2019-07-29-11:50
 */
public class OuterJoinDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ArrayList<Tuple2<Integer,String>> list1 = new ArrayList<>();
        list1.add(new Tuple2<>(1,"lily"));
        list1.add(new Tuple2<>(2,"lucy"));
        list1.add(new Tuple2<>(4,"jack"));

        ArrayList<Tuple2<Integer,String>> list2 = new ArrayList<>();
        list2.add(new Tuple2<>(1,"beijing"));
        list2.add(new Tuple2<>(2,"shanghai"));
        list2.add(new Tuple2<>(3,"guangzhou"));

        DataSet<Tuple2<Integer, String>> ds1 = env.fromCollection(list1);
        DataSet<Tuple2<Integer, String>> ds2 = env.fromCollection(list2);

        /**
         * Left outer join
         * Note: the second tuple may be null
         */
//        ds1.leftOuterJoin(ds2)
//                .where(0)
//                .equalTo(0)
//                .with(new JoinFunction<Tuple2<Integer, String>,Tuple2<Integer, String>,Tuple3<Integer,String,String>>(){
//                    @Override
//                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
//                       if(second == null){
//                           return new Tuple3<>(first.f0,first.f1,"null");
//                       }else{
//                           return new Tuple3<>(first.f0,first.f1,second.f1);
//                       }
//                    }
//                }).print();
        /**
         * Right outer join
         * Note: the first tuple may be null
         *
         */
//        ds1.rightOuterJoin(ds2)
//                .where(0)
//                .equalTo(0)
//                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
//                    @Override
//                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
//                        if(first == null){
//                            return new Tuple3<>(second.f0,"null",second.f1);
//                        }else{
//                            return new Tuple3<>(first.f0,first.f1,second.f1);
//                        }
//                    }
//                }).print();
        /**
         * Full outer join
         * Note: either the first or the second tuple may be null
         */
        ds1.fullOuterJoin(ds2)
                .where(0)
                .equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                       if(first == null){
                           return new Tuple3<>(second.f0,"null",second.f1);
                       }else if(second == null){
                           return  new Tuple3<>(first.f0,first.f1,"null");
                       }else{
                           return new Tuple3<>(first.f0,first.f1,second.f1);
                       }
                    }
                }).print();
    }
}
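
To make the expected result of the full outer join concrete, the plain-Java sketch below imitates it for the same two inputs. It is not Flink code: FullOuterJoinSketch and fullOuterJoin are hypothetical names, and the sketch assumes each key appears at most once per side, which holds for the demo data but not in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java imitation of a full outer join on an integer key; not Flink code.
final class FullOuterJoinSketch {

    // Joins two key->value maps. A side with no match contributes the string "null",
    // mirroring the JoinFunction in the demo above.
    static List<String> fullOuterJoin(Map<Integer, String> left, Map<Integer, String> right) {
        TreeSet<Integer> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> rows = new ArrayList<>();
        for (Integer key : keys) {
            String l = left.getOrDefault(key, "null");
            String r = right.getOrDefault(key, "null");
            rows.add("(" + key + "," + l + "," + r + ")");
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, String> names = new TreeMap<>();
        names.put(1, "lily");
        names.put(2, "lucy");
        names.put(4, "jack");
        Map<Integer, String> cities = new TreeMap<>();
        cities.put(1, "beijing");
        cities.put(2, "shanghai");
        cities.put(3, "guangzhou");
        // prints (1,lily,beijing) (2,lucy,shanghai) (3,null,guangzhou) (4,jack,null), one per line
        fullOuterJoin(names, cities).forEach(System.out::println);
    }
}
```

The sketch sorts by key for readability; the order in which a Flink job prints the joined tuples is not guaranteed.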

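Inner join (equi-join)

A default join produces pairs of matching elements; attaching a JoinFunction or FlatJoinFunction through with(...) processes each joined pair.
A JoinFunction returns exactly one record for every joined pair, whereas a FlatJoinFunction can return n records (n may be 0), analogous to map versus flatMap.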

 

import lombok.Data;
import lombok.ToString;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

/**
 * Join two data sets and process the result
 */
public class JoinWithJoinFunctionDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ArrayList<Tuple2<Integer,String>> list1 = new ArrayList<>();
        list1.add(new Tuple2<>(1,"lily"));
        list1.add(new Tuple2<>(2,"lucy"));
        list1.add(new Tuple2<>(3,"tom"));
        list1.add(new Tuple2<>(4,"jack"));

        ArrayList<Tuple2<Integer,String>> list2 = new ArrayList<>();
        list2.add(new Tuple2<>(1,"beijing"));
        list2.add(new Tuple2<>(2,"shanghai"));
        list2.add(new Tuple2<>(3,"guangzhou"));

        DataSet<Tuple2<Integer, String>> ds1 = env.fromCollection(list1);
        DataSet<Tuple2<Integer, String>> ds2 = env.fromCollection(list2);

        DataSet<UserInfo> joinedData =
                ds1.join(ds2)
                        .where(0)
                        .equalTo(0)
                        .with(new UserInfoJoinFun());

        joinedData.print();

    }

    public static class UserInfoJoinFun implements JoinFunction<Tuple2<Integer,String>,Tuple2<Integer,String>,UserInfo>{
        @Override
        public UserInfo join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
            return UserInfo.of(first.f0,first.f1,second.f1);
        }
    }


    @Data
    @ToString
    public static class UserInfo{
        private Integer userId;
        private String userName;
        private String address;

        public UserInfo(Integer userId,String userName,String address){
            this.userId = userId;
            this.userName = userName;
            this.address = address;
        }

        public static UserInfo of(Integer userId,String userName,String address){
            return new UserInfo(userId,userName,address);
        }
    }

}
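
The difference between the two callback styles described in this section can be sketched in plain Java. The code below is not Flink code: it uses the JDK's BiFunction and Consumer plus a hypothetical FlatJoiner interface to stand in for JoinFunction and FlatJoinFunction, and it takes the already-matched pairs as input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Plain-Java imitation of the two callback styles; FlatJoiner is hypothetical, not a Flink type.
final class JoinCallbackSketch {

    interface FlatJoiner {
        void join(String first, String second, Consumer<String> out);
    }

    // JoinFunction style: exactly one result per matched pair.
    static List<String> joinEach(List<String[]> pairs, BiFunction<String, String, String> fn) {
        List<String> results = new ArrayList<>();
        for (String[] p : pairs) {
            results.add(fn.apply(p[0], p[1]));
        }
        return results;
    }

    // FlatJoinFunction style: the callback may emit zero or more results per pair.
    static List<String> joinFlat(List<String[]> pairs, FlatJoiner fn) {
        List<String> results = new ArrayList<>();
        for (String[] p : pairs) {
            fn.join(p[0], p[1], results::add);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[]{"lily", "beijing"});
        pairs.add(new String[]{"lucy", "shanghai"});
        System.out.println(joinEach(pairs, (name, city) -> name + "@" + city));
        // The callback emits nothing for beijing, so only one result comes out.
        System.out.println(joinFlat(pairs, (name, city, out) -> {
            if (!city.equals("beijing")) {
                out.accept(name + "@" + city);
            }
        }));
    }
}
```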

 
