1. Overview of Aggregation Operators
DataStream has no reduce, sum, or similar aggregation methods, because in Flink's design all data must be grouped before it can be aggregated.
First call keyBy to obtain a KeyedStream, then call its reduce, sum, or other aggregation methods (group first, aggregate second).
The common aggregation operators are:
- keyBy
- rolling aggregation operators (Rolling Aggregation)
- reduce
1.1 KeyBy
DataStream -> KeyedStream: logically splits a stream into disjoint partitions, where each partition contains the elements that share the same key. Internally this is implemented with hashing.
1. keyBy repartitions the data;
2. different keys may end up in the same partition, because the routing is hash-based.
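The hash-based routing can be sketched in plain Java. This is an illustration only: Flink actually murmur-hashes key.hashCode() into key groups, but the consequence is the same, namely that distinct keys can land in the same partition.

```java
// Illustration of hash-based key routing (not Flink's exact formula).
public class KeyByHashDemo {
    // A simplified stand-in for "which partition does this key go to?"
    static int partitionFor(String key, int parallelism) {
        return Math.abs(key.hashCode() % parallelism);
    }

    public static void main(String[] args) {
        // With few partitions, different sensor ids can collide.
        for (String id : new String[]{"sensor_1", "sensor_6", "sensor_7", "sensor_10"}) {
            System.out.println(id + " -> partition " + partitionFor(id, 2));
        }
    }
}
```

The routing is deterministic: the same key always goes to the same partition, which is what makes per-key rolling state possible.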
1.2 Rolling Aggregation
These operators aggregate each sub-stream of a KeyedStream:
sum()
min()
max()
minBy()
maxBy()
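The difference between max and maxBy (and likewise min and minBy) can be simulated in plain Java. The Reading class below is a simplified stand-in for the SensorReading POJO used later; the two static methods mirror the rolling-aggregation semantics.

```java
// Plain-Java simulation of max("temperature") vs. maxBy("temperature").
public class MaxVsMaxByDemo {
    static class Reading {
        final String id; final long ts; final double temp;
        Reading(String id, long ts, double temp) { this.id = id; this.ts = ts; this.temp = temp; }
    }

    // max: only the compared field is taken from the new record; the rest stay as-is.
    static Reading max(Reading acc, Reading in) {
        return in.temp > acc.temp ? new Reading(acc.id, acc.ts, in.temp) : acc;
    }

    // maxBy: the whole winning record is kept, including its timestamp.
    static Reading maxBy(Reading acc, Reading in) {
        return in.temp > acc.temp ? in : acc;
    }

    public static void main(String[] args) {
        Reading a = new Reading("sensor_1", 1547718199L, 35.8);
        Reading b = new Reading("sensor_1", 1547718207L, 36.3);
        System.out.println(max(a, b).ts);   // 1547718199 -- timestamp unchanged
        System.out.println(maxBy(a, b).ts); // 1547718207 -- whole record replaced
    }
}
```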
1.3 reduce
reduce covers more general aggregation scenarios. In Java you implement the ReduceFunction functional interface.
Building on the Rolling Aggregation example above, we modify the requirement: for each group, return the sensor reading with the highest temperature seen so far, while always carrying the latest timestamp.
2. Code Implementation
Test data, sensor.txt:
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_1,1547718207,36.3
sensor_1,1547718209,32.8
sensor_1,1547718212,37.1
2.1 maxBy
Code:
package org.example;

/**
 * @author 只是甲
 * @date 2021-08-31
 * @remark Flink basics - Transform - RollingAggregation
 */

import org.flink.beans.SensorReading;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformTest2_RollingAggregation {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1
        env.setParallelism(1);

        DataStream<String> dataStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // Equivalent anonymous-class version:
        // DataStream<SensorReading> sensorStream = dataStream.map(new MapFunction<String, SensorReading>() {
        //     @Override
        //     public SensorReading map(String value) throws Exception {
        //         String[] fields = value.split(",");
        //         return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        //     }
        // });

        DataStream<SensorReading> sensorStream = dataStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // Group first, then aggregate
        KeyedStream<SensorReading, Tuple> keyedStream = sensorStream.keyBy("id");
        // KeyedStream<SensorReading, String> keyedStream = sensorStream.keyBy(SensorReading::getId);

        // Rolling aggregation. The difference between max and maxBy: maxBy also updates
        // all the other fields to those of the winning record, while max only updates
        // the compared field and leaves the rest unchanged.
        DataStream<SensorReading> resultStream = keyedStream.maxBy("temperature");

        resultStream.print("result");

        env.execute();
    }
}
Test notes:
Flink processes the stream record by record, and the parallelism here is 1, so records are read in file order. Each incoming record is compared with the current maximum for its key, and the operator emits the corresponding id together with the timestamp and temperature of whichever record has the higher temperature.
What if we want the output to always carry the latest timestamp?
That is what reduce handles in the next section.
2.2 reduce
Code:
package org.flink.transform;

/**
 * @author 只是甲
 * @date 2021-08-31
 * @remark Flink basics - Transform - Reduce
 */

import org.flink.beans.SensorReading;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformTest3_Reduce {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read the data from the file
        DataStream<String> inputStream = env.readTextFile("C:\\Users\\Administrator\\IdeaProjects\\FlinkStudy\\src\\main\\resources\\sensor.txt");

        // Convert to SensorReading
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // Group by id
        KeyedStream<SensorReading, Tuple> keyedStream = dataStream.keyBy("id");

        // reduce: keep the maximum temperature so far, but always the latest timestamp
        SingleOutputStreamOperator<SensorReading> resultStream = keyedStream.reduce(new ReduceFunction<SensorReading>() {
            @Override
            public SensorReading reduce(SensorReading value1, SensorReading value2) throws Exception {
                return new SensorReading(value1.getId(), value2.getTimestamp(),
                        Math.max(value1.getTemperature(), value2.getTemperature()));
            }
        });

        // Equivalent lambda version:
        // keyedStream.reduce((curState, newData) ->
        //         new SensorReading(curState.getId(), newData.getTimestamp(),
        //                 Math.max(curState.getTemperature(), newData.getTemperature())));

        resultStream.print();

        env.execute();
    }
}
Test notes:
As the output shows, each record carries the matching id, the current (latest) timestamp, and the maximum temperature seen so far.
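That per-record behaviour can be re-derived in plain Java, without a Flink cluster. The loop below mirrors the ReduceFunction above over the sensor_1 records from sensor.txt, emitting one line of rolling state per input record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Traces the rolling reduce state ("max temperature so far, latest timestamp")
// per key, one output line per input line, mirroring the ReduceFunction above.
public class ReduceTrace {
    static List<String> trace(String[] lines) {
        Map<String, Double> maxTemp = new HashMap<>(); // key -> max temperature so far
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",");
            double temp = Double.parseDouble(f[2]);
            // Merge with previous state: keep the larger temperature, the new timestamp.
            double merged = maxTemp.merge(f[0], temp, Math::max);
            out.add(f[0] + "," + f[1] + "," + merged);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] sample = {
            "sensor_1,1547718199,35.8",
            "sensor_1,1547718207,36.3",
            "sensor_1,1547718209,32.8",
            "sensor_1,1547718212,37.1"
        };
        // Prints: 35.8, then 36.3, then 36.3 (temp kept, timestamp advanced), then 37.1
        trace(sample).forEach(System.out::println);
    }
}
```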