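I. Requirements overview
1. Background:
Anyone reading this will probably think that a daily download ranking sounds like a T+1 offline batch job, and I thought so too. So why write it up? The requirement is a legacy one: downloads were already counted in real time, but the ranking was computed later by an API that read the rows back from the database and sorted them. That approach was considered inefficient, so the goal now is to write the sorted result straight into the database. This is also why the code is in Java even though I usually write Spark and Flink jobs in Scala: it lives in an existing project, so it follows that project's conventions. I am writing it up because the job uses Flink's state programming, watermarks, time semantics, state backends, and timers, and I found little material online when I started, so it seemed worth recording.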
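2. Requirement:
At first I assumed the task was to rank every game by downloads in real time, over the period from 00:00 today up to the moment of computation, and I implemented it that way. Only after finishing did I learn that only the single most-downloaded game is needed. The two are nearly the same, which is why part of the code below is commented out.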
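3. Implementation approach:
Because the total number of games is small, one piece of state holds the download count of every game, which makes it easy to pick the game with the most downloads and write it to the database. All records from the same day share that state. The catch is that a day's state is useless once the day is over, and with one state per day they would pile up indefinitely. So I register a timer for each day that fires roughly two days later (the code uses 36 hours after the day's first record) and clears that day's state. This requires a second state to hold the timer's timestamp, and I added a third state for deduplication.
One detail: if a record arrives more than two days late, the state for its day has already been cleared, and processing it would produce wrong results. So each record's timestamp is compared with the watermark: if the timestamp plus two days is still less than or equal to the watermark, the record is dropped.
That is the general idea. The requirement is "most downloads since 00:00 today", not per hour or per fixed interval. Something like "most downloads in each five-minute period" would be much easier with a Window and would need far less state handling. A Window could also be added here so that the result is emitted only every few minutes, but I did not add one; the result is much the same. The output goes to MongoDB; the sink code is not included here.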
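The two time rules above are easy to get wrong by a sign or a unit, so here is a minimal plain-Java sketch of the arithmetic, with no Flink dependency. The class and method names are hypothetical and exist only for illustration; the two constants are the values that appear in KeyTimeProcessFunction below.

```java
/** Hypothetical helper, not part of the project: the timer and lateness arithmetic in isolation. */
final class LateDataRule {
    // 36 hours in milliseconds, the cleanup delay used in the job (129600000)
    static final long CLEANUP_DELAY_MS = 36L * 60 * 60 * 1000;
    // 48 hours in milliseconds, the lateness limit used in the job (172800000)
    static final long MAX_LATENESS_MS = 48L * 60 * 60 * 1000;

    /** Timestamp at which the state created by a record with this event time is cleared. */
    static long cleanupTimer(long eventMs) {
        return eventMs + CLEANUP_DELAY_MS;
    }

    /** True when the record is so far behind the watermark that it must be dropped. */
    static boolean isTooLate(long eventMs, long watermarkMs) {
        return eventMs + MAX_LATENESS_MS <= watermarkMs;
    }

    public static void main(String[] args) {
        long eventMs = 1576684800L * 1000L; // sample timestamp from the data format section
        System.out.println("cleanup timer: " + cleanupTimer(eventMs));
        System.out.println("too late at watermark + 2 days: " + isTooLate(eventMs, eventMs + MAX_LATENESS_MS));
    }
}
```

A record is kept as long as its timestamp plus 48 hours is strictly greater than the watermark, which is the same comparison the job makes with `judge <= watermark`.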
II. Implementation
pom.xml:
<!-- Apache Flink dependencies -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-core</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongo-java-driver</artifactId>
<version>3.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-hadoop-compatibility_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.mongodb.mongo-hadoop</groupId>
<artifactId>mongo-hadoop-core</artifactId>
<version>2.0.0</version>
</dependency>
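Data format:
gameId timestamp deviceId version channel
id1 1576684800 imei1 version1 channel1
Each record is one download. The record above means that device imei1 downloaded game id1 at timestamp 1576684800 (epoch seconds).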
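As a small illustration of the format, the sketch below parses one record in plain Java. It is not the project's code: the class name is hypothetical, and splitting on whitespace is my assumption. The job itself splits on `\\W+`, which also works for the sample data but would break a field that contains punctuation, such as a dotted version number.

```java
/** Hypothetical helper, not part of the project: parses one whitespace-separated download record. */
final class DownloadLineParser {

    /** Plain holder for the five fields of a record. */
    static final class Download {
        final String gameId;
        final long ts; // epoch seconds
        final String imei;
        final String version;
        final String channel;

        Download(String gameId, long ts, String imei, String version, String channel) {
            this.gameId = gameId;
            this.ts = ts;
            this.imei = imei;
            this.version = version;
            this.channel = channel;
        }
    }

    /** Splits the line on runs of whitespace and rejects records with a wrong field count. */
    static Download parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + f.length + ": " + line);
        }
        return new Download(f[0], Long.parseLong(f[1]), f[2], f[3], f[4]);
    }

    public static void main(String[] args) {
        Download d = parse("id1 1576684800 imei1 version1 channel1");
        System.out.println(d.imei + " downloaded " + d.gameId + " at " + d.ts);
    }
}
```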
- Topology.java
import com.mongodb.client.model.WriteModel;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.bson.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
public class Topology {
public static final String ProjectName = "halo-download";
public static void main(String[] args) throws Exception {
Logger logger = LoggerFactory.getLogger(Topology.class);
// Each project loads its configuration from its own paths
ParameterTool params = ParameterTool
.fromPropertiesFile(
Topology.class.getResourceAsStream("/normal.properties")
).mergeWith(ParameterTool.fromPropertiesFile(
Topology.class.getResourceAsStream("/mongodb.properties")
)).mergeWith(ParameterTool.fromPropertiesFile(
Topology.class.getResourceAsStream("/loghub.properties")
)).mergeWith(
ParameterTool.fromArgs(args)
);
String hdfsMaster = params.get("hdfs", "hdfs://");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
.enableCheckpointing(Time.seconds(
Integer.valueOf(params.get("checkpoint.sec", "300"))
).toMilliseconds())
.setStateBackend(new RocksDBStateBackend(params.get("RocksDBStateBackend", hdfsMaster + "/flink/checkpoints")));
env.getConfig().setGlobalJobParameters(params);
// Execution environment name
String execEnv = params.get("exec.env", "dev");
// Enable exactly-once checkpointing
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Enable logging to stdout
env.getConfig().enableSysoutLogging();
// Use event-time semantics
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Latency tracking interval in milliseconds
env.getConfig().setLatencyTrackingInterval(1000);
// Emit watermarks every 100 ms
env.getConfig().setAutoWatermarkInterval(100L);
logger.info("execEnv: " + execEnv);
// Real production data is not used here; the records below are made up for testing.
// Each test record must stay on a single line, otherwise the timestamp is parsed incorrectly.
// Record format: id1 1576684800 imei1 version1 channel1
// The leading 12-22 etc. is only the date of the timestamp, kept for my own reference while testing; it is not part of the record.
//12-22 id2 1576944000 imei2 version2 channel1
//12-23 id3 1577030400 imei2 version2 channel1
//12-24 id4 1577116800 imei2 version2 channel1
//12-25 id5 1577203200 imei2 version2 channel1
//12-26 id6 1577289600 imei2 version2 channel1
//12-27 id7 1577376000 imei2 version2 channel1
//12-28 id8 1577462400 imei2 version2 channel1
SingleOutputStreamOperator<SortDowComplete> localhost = env.socketTextStream("localhost", 8888).map(new MapFunction<String, SortDowComplete>() {
@Override
public SortDowComplete map(String s) throws Exception {
String[] splits = s.split("\\W+");
SortDowComplete sortDowComplete = new SortDowComplete(splits[0], Integer.parseInt(splits[1]), splits[2], splits[3], splits[4]);
return sortDowComplete;
}
});//.setParallelism(1)
SingleOutputStreamOperator<List<WriteModel<? extends Document>>> process =
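// Assign timestamps and watermarks from the time field, treating the stream as ascending with no out-of-orderness allowance.
// The watermark derived from a record only becomes visible once a later record arrives.
//https://ci.apache.org/projects/flink/flink-docs-release-1.10/zh/dev/event_timestamp_extractors.html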
localhost.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<SortDowComplete>() {
@Override
public long extractAscendingTimestamp(SortDowComplete sortDowComplete) {
return sortDowComplete.getTs() * 1000L;
}
}).setParallelism(1)
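// Key by the record time so that records from the same day go to the same keyed state.
// Note: SortKeyByFunction returns the raw timestamp, so this only holds when ts is already truncated to the start of the day, as in the test data above.
.keyBy(new SortKeyByFunction()).process(new KeyTimeProcessFunction());
process.print(); // print the result
// Submit the job; without execute() the pipeline is only defined, never run
env.execute(ProjectName);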
}
}
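A note on why KeyTimeProcessFunction later adds 1 to `currentWatermark()`: as far as I know, AscendingTimestampExtractor reports a watermark of the largest timestamp seen so far minus one millisecond. The sketch below imitates that bookkeeping in plain Java so the off-by-one is visible. It is a simplified stand-in with a hypothetical class name, not Flink's implementation.

```java
/** Hypothetical stand-in, not Flink code: watermark bookkeeping for an ascending-timestamp stream. */
final class AscendingWatermarkSketch {
    // Largest event timestamp seen so far, in milliseconds
    private long maxTs = Long.MIN_VALUE;

    /** Registers an event; timestamps lower than the current maximum do not move the watermark back. */
    void onEvent(long tsMs) {
        if (tsMs > maxTs) {
            maxTs = tsMs;
        }
    }

    /** Watermark is one millisecond behind the largest timestamp, or MIN_VALUE before any event. */
    long currentWatermark() {
        return maxTs == Long.MIN_VALUE ? Long.MIN_VALUE : maxTs - 1;
    }

    public static void main(String[] args) {
        AscendingWatermarkSketch w = new AscendingWatermarkSketch();
        w.onEvent(1576684800L * 1000L);
        System.out.println("watermark + 1 = " + (w.currentWatermark() + 1));
    }
}
```

Adding 1 back therefore recovers the largest event time seen so far, which is the value the process function compares each record against.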
- SortDowComplete.java
/**
* @Author: fseast
* @Date: 2020/3/24 8:32 PM
* @Description:
*/
public class SortDowComplete{
private String gameId;
private long ts;
private String imei;
private String haloVersion;
private String haloChannel;
public SortDowComplete(){}
public SortDowComplete(String gameId, int ts, String imei, String haloVersion, String haloChannel) {
this.gameId = gameId;
this.ts = ts;
this.imei = imei;
this.haloVersion = haloVersion;
this.haloChannel = haloChannel;
}
public String getGameId() {
return gameId;
}
public void setGameId(String gameId) {
this.gameId = gameId;
}
public Long getTs() {
return ts;
}
public void setTs(int ts) {
this.ts = ts;
}
public String getImei() {
return imei;
}
public void setImei(String imei) {
this.imei = imei;
}
public String getHaloVersion() {
return haloVersion;
}
public void setHaloVersion(String haloVersion) {
this.haloVersion = haloVersion;
}
public String getHaloChannel() {
return haloChannel;
}
public void setHaloChannel(String haloChannel) {
this.haloChannel = haloChannel;
}
@Override
public String toString() {
return "SortDowComplete{" +
"gameId='" + gameId + '\'' +
", ts=" + ts +
", haloVersion='" + haloVersion + '\'' +
", haloChannel='" + haloChannel + '\'' +
", imei='" + imei + '\'' +
'}';
}
}
- SortKeyByFunction.java
import org.apache.flink.api.java.functions.KeySelector;
/**
* @Author: fseast
* @Date: 2020/3/24 11:14 PM
* @Description:
*/
public class SortKeyByFunction implements KeySelector<SortDowComplete,Long> {
@Override
public Long getKey(SortDowComplete sortDowComplete) throws Exception {
return sortDowComplete.getTs();
}
}
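SortKeyByFunction returns the raw timestamp, which groups records by day only because every test timestamp already sits exactly on a day boundary. If the real timestamps carry a time of day, the key has to be truncated first. Below is a minimal sketch of that truncation in plain Java. The helper name is hypothetical, and the UTC+8 offset is my assumption, based on the sample timestamp 1576684800 being midnight in that zone.

```java
/** Hypothetical helper, not part of the project: truncates epoch seconds to the start of the local day. */
final class DayKey {
    private static final long SECONDS_PER_DAY = 86_400L;

    /**
     * Returns the epoch second of 00:00 local time on the day containing tsSeconds.
     * utcOffsetHours is the fixed offset of the local zone, e.g. 8 for UTC+8 (an assumption here).
     */
    static long dayStartSeconds(long tsSeconds, int utcOffsetHours) {
        long offset = utcOffsetHours * 3600L;
        // floorDiv keeps the result correct for timestamps before the epoch as well
        return Math.floorDiv(tsSeconds + offset, SECONDS_PER_DAY) * SECONDS_PER_DAY - offset;
    }

    public static void main(String[] args) {
        long midnight = 1576684800L;
        System.out.println(dayStartSeconds(midnight + 3600, 8) == midnight);
    }
}
```

With a helper like this, `getKey` could return `DayKey.dayStartSeconds(sortDowComplete.getTs(), 8)` so that every record from the same local day maps to one key.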
- KeyTimeProcessFunction.java
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.WriteModel;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.bson.Document;
import java.util.*;
/**
* @Author: fseast
* @Date: 2020/3/24 5:23 PM
* @Description:
*/
public class KeyTimeProcessFunction extends KeyedProcessFunction<Long,SortDowComplete,List<WriteModel<? extends Document>>> {
//private ValueState<String> gameState1;
// Records from the same day share the same state.
// Download count per game: the key is gameId, the value is the number of downloads.
private transient MapState<String,Integer> gameState;
// Timestamp of the registered cleanup timer.
private transient ValueState<Long> timerState;
// Deduplication state: within one day, each (gameId, imei) pair is counted only once.
private transient MapState<String,Integer> distinctState;
//List<SortGameDownloads> list ;
//Map<Long,Integer> map;
@Override
public void open(Configuration parameters) throws Exception {
//System.out.println("====");
//super.open(parameters);
gameState = getRuntimeContext().getMapState(new MapStateDescriptor<>("gameState", String.class,Integer.class));
timerState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("timerState",Long.class));
distinctState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Integer>("distinctState", String.class, Integer.class));
//map = new HashMap<>();
//list = new ArrayList<>();
}
@Override
public void processElement(SortDowComplete sortDowComplete, Context context, Collector<List<WriteModel<? extends Document>>> collector) throws Exception {
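// The set must be created per call; it cannot be declared as a field
TreeSet<SortGameDownloads> set = new TreeSet<>();
// Event time of the record; multiply by 1000 to get milliseconds.
long oneTime = sortDowComplete.getTs() * 1000L;
long watermark = context.timerService().currentWatermark() + 1L;
//System.out.println(day);//null or 1
System.out.println("current processing time: " + context.timerService().currentProcessingTime() + ", watermark: "+ (context.timerService().currentWatermark()+1));
System.out.println("event time of incoming sortDowComplete: " + oneTime);
//System.out.println("timer state: "+ timerState.value());
// While the state is still empty, value() returns null.
// The watermark is the furthest event time seen so far; a timer is registered only when it is below this record's time.
// A watermark above the record's time means the record is late. Skipping late records here keeps data that arrives days later from re-creating state that was already cleared.
// If no timer has been registered for this key yet, register one.
if (oneTime > watermark && timerState.value() == null) {
// Clear the state 36 hours after this record's event time
long timerTs = oneTime + 129600000L; // 36 hours in ms: a key whose first record is stamped 00:00 on the 3rd is cleared at 12:00 on the 4th
context.timerService().registerEventTimeTimer(timerTs);
// Remember the timer timestamp in state
timerState.update(timerTs);
System.out.println("timer registered for: " + timerTs);
}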
//System.out.println("timer state: "+ timerState.value());
/*Iterator<Map.Entry<String, Integer>> iterator = gameState.iterator();
while (iterator.hasNext()) {
Map.Entry<String, Integer> next = iterator.next();
System.out.println("gameState key:"+next.getKey()+"==,value:"+next.getValue());
}
gameState.put(sortDowComplete.getGameId(),101);*/
// Ignore records that are 48 hours or more behind the watermark, e.g. a record from the 3rd that only arrives after the 5th
long judge = oneTime + 172800000L; // 48 hours in ms
//System.out.println("judge: "+judge+" ,watermark:"+watermark);
if (judge <= watermark){
System.out.println("skipped");
}else {
System.out.println("processed");
// Look up the current download count for this game
Integer gameDowNum = gameState.get(sortDowComplete.getGameId());
String stateKey = sortDowComplete.getGameId() + "_" + sortDowComplete.getImei();
System.out.println(stateKey+", "+distinctState.get(stateKey));
// The game is already in the state: increment its count if needed
if (gameDowNum != null){
Integer dist = distinctState.get(stateKey);
// Count the download only if this device has not downloaded this game yet
if (dist == null || dist == 0){
gameDowNum +=1;
// Store the updated download count for this game
gameState.put(sortDowComplete.getGameId(),gameDowNum);
// Mark the (gameId, imei) pair as seen
distinctState.put(stateKey,1);
}
} else { // First download of this game today: initialise the count to 1
gameState.put(sortDowComplete.getGameId(),1);
distinctState.put(stateKey,1); // mark the (gameId, imei) pair as seen
}
Iterator<Map.Entry<String, Integer>> iterator = gameState.iterator();
while (iterator.hasNext()){
Map.Entry<String, Integer> next = iterator.next();
//System.out.println("key:"+next.getKey()+",value:"+next.getValue());
SortGameDownloads sortGameDownloads = new SortGameDownloads(sortDowComplete.getTs(),next.getKey(), next.getValue());
set.add(sortGameDownloads);
}
List<WriteModel<? extends Document>> writeModels = new ArrayList<>();
// Keep only the game with the most downloads
SortGameDownloads first = set.first();
first.setTop(1);
writeModels.add(new UpdateOneModel<Document>(first.getQueryDoc(),first.getUpdateDoc(),first.getOption()));
/* The code above keeps only the top game; this block instead walks all games in sorted order
Iterator<SortGameDownloads> iteSort = set.iterator();
Integer i = 1;
Integer temp = -1;
while (iteSort.hasNext()) {
SortGameDownloads sortNext = iteSort.next();
if (i == 1){
temp = sortNext.getDownloads();
// add the record
sortNext.setTop(1);
//writeModels.add(new DeleteManyModel(sortNext.deleteFilter()));
writeModels.add(new UpdateOneModel<Document>(sortNext.getQueryDoc(),sortNext.getUpdateDoc(),sortNext.getOption()));
i = 2;
} else if (i == 2){// several games may tie for first place
if (temp.equals(sortNext.getDownloads())){// this entry ties with the first one
// add the record
sortNext.setTop(1);
writeModels.add(new UpdateOneModel<Document>(sortNext.getQueryDoc(),sortNext.getUpdateDoc(),sortNext.getOption()));
i = 2;
} else {
i ++;
break;
}
}
}*/
System.out.println("set begin");
System.out.println(first);
System.out.println(set);
System.out.println("set end");
/*for (SortGameDownloads sortGameDow : set) {
writeModels.add(new UpdateOneModel(sortGameDow.getQueryDoc(),sortGameDow.getUpdateDoc(),sortGameDow.getOption()));
}*/
System.out.println(writeModels);
System.out.println("===1111111");
collector.collect(writeModels);
}
System.out.println();
}
// A timer fires once the watermark reaches or passes its timestamp.
// Each key gets one timer, registered by its first record and based on that record's timestamp.
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<WriteModel<? extends Document>>> out) throws Exception {
//super.onTimer(timestamp, ctx, out);
// When the timer fires, clear the state of the key that registered it; state under other keys is untouched.
gameState.clear();
timerState.clear();
distinctState.clear();
System.out.println("== timer fired at: "+timestamp);
}
}
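Stripped of Flink state, the core of processElement is "count each (gameId, imei) pair once, then report the game with the highest count". The sketch below shows that logic with ordinary collections, picking the maximum in a single pass instead of rebuilding a TreeSet for every record. The class is hypothetical and only mirrors the behaviour described above, including the tie-break on gameId used by SortGameDownloads.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory version of one day's state: deduplicated counts plus the current leader. */
final class DailyTopTracker {
    private final Map<String, Integer> downloads = new HashMap<>(); // gameId -> distinct downloads
    private final Set<String> seen = new HashSet<>();               // gameId_imei pairs already counted

    /** Records one download and returns the gameId that is ranked first afterwards. */
    String add(String gameId, String imei) {
        // Set.add returns false for a pair that was counted before, so repeats are ignored
        if (seen.add(gameId + "_" + imei)) {
            downloads.merge(gameId, 1, Integer::sum);
        }
        return top();
    }

    /** Distinct download count of a game, 0 if it has none. */
    int count(String gameId) {
        return downloads.getOrDefault(gameId, 0);
    }

    /** Single pass over the counts; ties go to the lexicographically smaller gameId. */
    String top() {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : downloads.entrySet()) {
            int c = e.getValue();
            if (c > bestCount || (c == bestCount && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestCount = c;
            }
        }
        return best;
    }
}
```

In the Flink job the two collections correspond to `gameState` and `distinctState`, and clearing them in `onTimer` plays the role of discarding a tracker once its day is over.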
- SortGameDownloads.java
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;
import java.util.Date;
/**
* @Author: fseast
* @Date: 2020/3/26 5:24 PM
* @Description:
*/
public class SortGameDownloads implements Comparable , IMongoUpdate {
private Long time;
private String gameId;
private Integer downloads;
private Integer top = -1;
public SortGameDownloads(Long time, String gameId, Integer downloads) {
this.time = time;
this.gameId = gameId;
this.downloads = downloads;
}
public Long getTime() {
return time;
}
public void setTime(Long time) {
this.time = time;
}
public String getGameId() {
return gameId;
}
public void setGameId(String gameId) {
this.gameId = gameId;
}
public Integer getDownloads() {
return downloads;
}
public void setDownloads(Integer downloads) {
this.downloads = downloads;
}
public Integer getTop() {
return top;
}
public void setTop(Integer top) {
this.top = top;
}
@Override
public String toString() {
return "SortGameDownloads{" +
"time=" + time +
", gameId='" + gameId + '\'' +
", downloads=" + downloads +
", top=" + top +
'}';
}
@Override
public int compareTo(Object o) {
if (o instanceof SortGameDownloads){
SortGameDownloads sort = (SortGameDownloads) o;
// Sort by downloads in descending order; when downloads are equal, sort by game id
int num = Integer.compare(sort.downloads, this.downloads);
if (num == 0){
return this.gameId.compareTo(sort.gameId);
}
return num;
}
return 0;
}
@Override
public Document getQueryDoc() {
Document query = new Document();
//query.append("game_id",gameId);
query.append("time",new Date(time * 1000L));
return query;
}
@Override
public Document getUpdateDoc() {
Document inc = new Document();
inc.append("downloads",downloads);
inc.append("top",top);
inc.append("game_id",gameId);
return new Document("$set",inc);
}
@Override
public UpdateOptions getOption() {
return new UpdateOptions().upsert(true);
}
// Delete operation
/*public Document deleteFilter(){
Document delete = new Document();
delete.append("time",new Date(time * 1000L));
return delete;
}*/
}
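The ordering that compareTo implements (downloads descending, then gameId ascending) can also be written as a Comparator, which avoids the raw Comparable and the instanceof check. The sketch below is an alternative illustration in plain Java, not the project's code; the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: ranks games by downloads descending, then by gameId ascending. */
final class RankingOrder {

    /** Minimal value holder for one game and its download count. */
    static final class Entry {
        final String gameId;
        final int downloads;

        Entry(String gameId, int downloads) {
            this.gameId = gameId;
            this.downloads = downloads;
        }
    }

    // Same ordering as SortGameDownloads.compareTo, expressed with Comparator combinators
    static final Comparator<Entry> BY_DOWNLOADS_DESC =
            Comparator.comparingInt((Entry e) -> e.downloads).reversed()
                    .thenComparing(e -> e.gameId);

    /** Returns the gameIds in ranking order for the given counts. */
    static List<String> rank(Map<String, Integer> counts) {
        List<Entry> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            entries.add(new Entry(e.getKey(), e.getValue()));
        }
        entries.sort(BY_DOWNLOADS_DESC);
        List<String> ranked = new ArrayList<>();
        for (Entry e : entries) {
            ranked.add(e.gameId);
        }
        return ranked;
    }
}
```

A `TreeSet<Entry>` built with `BY_DOWNLOADS_DESC` would keep entries in this order as well, matching how the job uses `set.first()` to find the leader.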
- IMongoUpdate.java
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;
public interface IMongoUpdate {
Document getQueryDoc();
Document getUpdateDoc();
UpdateOptions getOption();
}