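Scenario: streaming data from Kafka has to be compared against data in an external file. The file is updated every day, so the Flink streaming job must re-read it on a schedule and broadcast it to the downstream operators. This article shows how to run a scheduled task inside a Flink streaming job.
What broadcasting is for:
Broadcast State was introduced to support cases where data from one stream has to be broadcast to all downstream tasks. It is stored locally and is used to process every incoming element of the other stream. A natural fit, for example, is a low-throughput stream carrying a set of rules that we want to evaluate against all elements of another stream. (Quoted from 尼小摩)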
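The idea in the quote can be tried out without a cluster. The following is a toy model in plain Java, not the Flink API: the class and method names (BroadcastToyModel, Task, onBroadcast, onElement) are made up for illustration. It only shows the routing rule: a broadcast element reaches every parallel task and is stored locally, while a data element reaches exactly one task and is evaluated against that local copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of broadcast state in plain Java; this is not the Flink API. */
class BroadcastToyModel {

    /** One parallel task holding its own local copy of the broadcast rules. */
    static class Task {
        final Map<String, Integer> rules = new HashMap<>();
        final List<String> output = new ArrayList<>();

        // Invoked on every task for each broadcast element.
        void onBroadcast(String ruleName, int threshold) {
            rules.put(ruleName, threshold);
        }

        // Invoked on exactly one task for each data element.
        void onElement(int value) {
            for (Map.Entry<String, Integer> rule : rules.entrySet()) {
                if (value > rule.getValue()) {
                    output.add(value + " violates " + rule.getKey());
                }
            }
        }
    }

    static List<Task> newTasks(int parallelism) {
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            tasks.add(new Task());
        }
        return tasks;
    }

    // A broadcast element is delivered to all tasks.
    static void broadcast(List<Task> tasks, String ruleName, int threshold) {
        for (Task task : tasks) {
            task.onBroadcast(ruleName, threshold);
        }
    }

    // A data element is delivered to a single task, chosen here by value modulo parallelism.
    static void send(List<Task> tasks, int value) {
        tasks.get(Math.floorMod(value, tasks.size())).onElement(value);
    }

    public static void main(String[] args) {
        List<Task> tasks = newTasks(2);
        broadcast(tasks, "max", 10);
        send(tasks, 11);
        send(tasks, 4);
        send(tasks, 20);
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("task " + i + ": " + tasks.get(i).output);
        }
    }
}
```

Both tasks flag the values above 10 even though the rule was sent only once, which is the property the rest of this article relies on.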
1. Real-time stream:
This is based on Flink 1.9.2, where FlinkKafkaConsumer must be used.
FlinkKafkaConsumer<String> ssConsumer = new FlinkKafkaConsumer<>(READ_TOPIC, new SimpleStringSchema(), properties);
2. File stream:
DataStreamSource<JSONObject> fileStreamSource = env.addSource(new MyRishSourceFileReader());
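3. Custom Source:
The custom source extends RichSourceFunction and overrides its methods. The file is read in open() and stored in a ConcurrentHashMap, run() sends it out with ctx.collect(), and it is then received in processBroadcastElement() of the BroadcastProcessFunction.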
import com.alibaba.fastjson.JSONObject;
import com.maxmind.geoip2.DatabaseReader;
import com.qianxin.ida.dto.DeviceUserBaseLineDto;
import com.qianxin.ida.dto.GpsBaseLineDto;
import com.qianxin.ida.dto.TimeBaseLineDto;
import com.qianxin.ida.dto.UserDeviceBaseLineDto;
import com.qianxin.ida.enrich.BuildBaseLineDto;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
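public class MyRishSourceFileReader extends RichSourceFunction<JSONObject> {
    private static final Logger logger = LoggerFactory.getLogger(MyRishSourceFileReader.class);
    public static DatabaseReader reader;
    public static final ConcurrentHashMap<String, Object> map = new ConcurrentHashMap<>();
    private List<TimeBaseLineDto> timeBaseLineDtos;
    // transient: the executor is created in open() and must not be serialized with the function.
    private transient ScheduledExecutorService service;
    private volatile boolean running = true;
    // Set after every reload so that run() emits again; starts as true so the first load is emitted.
    private volatile boolean refreshed = true;

    @Override
    public void open(Configuration configuration) {
        try {
            // Load the baseline once at startup.
            query();
            reader = TransUtil.getDatabaseReader();
            // Scheduled reload: first run after 10 hours, then 23 hours after each run finishes.
            service = Executors.newSingleThreadScheduledExecutor();
            service.scheduleWithFixedDelay(() -> {
                try {
                    query();
                    refreshed = true;
                } catch (Exception e) {
                    logger.error("Scheduled reload of the baseline file failed", e);
                }
            }, 10L, 23L, TimeUnit.HOURS);
        } catch (Exception e) {
            logger.error("Failed to read the file", e);
        }
    }

    public void query() {
        logger.info("Reading the baseline file at: {}", LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")));
        timeBaseLineDtos = BuildBaseLineDto.getTimeBaseLine();
        map.put("timeBaseLineDtos", timeBaseLineDtos);
    }

    @Override
    public void run(SourceContext<JSONObject> ctx) throws Exception {
        // Keep the source alive and emit after every reload. Returning after a single collect()
        // would end the source, and later reloads would never reach the broadcast side.
        while (running) {
            if (refreshed) {
                refreshed = false;
                try {
                    JSONObject out = new JSONObject();
                    JSONObject configJsonFile = JSONObject.parseObject(JsonFileReaderUtil.readJsonData(PropertyReaderUtil.getStrValue("config.json.path")));
                    out.put("configJsonFile", configJsonFile);
                    out.put("timeBaseLineDtos", map.get("timeBaseLineDtos"));
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(out);
                    }
                } catch (Exception e) {
                    logger.error("Failed to emit the baseline data", e);
                }
            }
            // Poll once per second so that cancel() takes effect quickly.
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() {
        // Stop the reload thread when the task shuts down.
        if (service != null) {
            service.shutdownNow();
        }
    }
}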
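The hand-off between the scheduler thread (which reloads the data) and run() (which emits it) is the part that is easiest to get wrong, so it can help to exercise it outside Flink first. Below is a minimal sketch under stated assumptions: it is plain Java with no Flink classes, ReloadingEmitter and its methods are hypothetical names, reload() stands in for reading the external file, a Consumer plays the role of ctx.collect(), and the delays are in milliseconds instead of hours so that it finishes quickly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Stand-alone model of a source that reloads on a schedule and re-emits; no Flink involved. */
class ReloadingEmitter {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile boolean running = true;
    private volatile boolean refreshed;

    // Hypothetical loader: stands in for reading the external file.
    private void reload() {
        cache.put("baseline", "version-" + loads.incrementAndGet());
        // Raise the flag only after the cache holds the new data.
        refreshed = true;
    }

    /** Plays the role of open(): load once, then schedule further reloads. */
    void open(long initialDelayMs, long delayMs) {
        reload();
        scheduler.scheduleWithFixedDelay(this::reload, initialDelayMs, delayMs, TimeUnit.MILLISECONDS);
    }

    /** Plays the role of run(): emit after every reload until cancelled. */
    void run(Consumer<Object> collector) throws InterruptedException {
        while (running) {
            if (refreshed) {
                // Clear the flag before reading the cache: a reload that races with this
                // block can cause a duplicate emission, but never a missed one.
                refreshed = false;
                collector.accept(cache.get("baseline"));
            }
            Thread.sleep(5);
        }
    }

    void cancel() {
        running = false;
        scheduler.shutdownNow();
    }

    /** Runs the emitter until it has emitted `count` snapshots, then returns them. */
    static List<Object> collect(int count) {
        ReloadingEmitter emitter = new ReloadingEmitter();
        List<Object> emitted = Collections.synchronizedList(new ArrayList<>());
        emitter.open(20, 20);
        try {
            emitter.run(value -> {
                emitted.add(value);
                if (emitted.size() >= count) {
                    emitter.cancel();
                }
            });
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(collect(3));
    }
}
```

The design choice worth noting is that the scheduler thread never calls the collector itself. It only updates the cache and raises a flag, so all emission happens on the thread that owns run(), which is also the thread Flink expects to call ctx.collect().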
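Broadcast the file stream, connect it to the real-time stream ssConsumer, and supply a custom broadcast process function.
4. Broadcast:
You need to implement two methods yourself: processBroadcastElement() handles incoming elements from the broadcast stream, and processElement() handles incoming elements from the non-broadcast stream. In processElement(), the data emitted by the custom source is read from the broadcast state through the ReadOnlyContext and combined with the real-time record for the business logic. The result is emitted through out and passed on to the sink and any later operators.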
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.qianxin.ida.dto.DeviceUserBaseLineDto;
import com.qianxin.ida.dto.GpsBaseLineDto;
import com.qianxin.ida.dto.TimeBaseLineDto;
import com.qianxin.ida.dto.UserDeviceBaseLineDto;
import com.qianxin.ida.utils.TransUtil;
import org.apache.flink.api.common.state.BroadcastState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.shaded.netty4.io.netty.util.internal.StringUtil;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.math.BigDecimal;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class MyBroadcastProcessFunction extends BroadcastProcessFunction<String, JSONObject, String> {
    private static final Logger logger = LoggerFactory.getLogger(MyBroadcastProcessFunction.class);
    private final MapStateDescriptor<String, JSONObject> ruleStateDescriptor;
    private final String eventType;

    public MyBroadcastProcessFunction(MapStateDescriptor<String, JSONObject> ruleStateDescriptor, String eventType) {
        this.ruleStateDescriptor = ruleStateDescriptor;
        this.eventType = eventType;
    }

    // Handles elements from the broadcast stream.
    @Override
    public void processBroadcastElement(JSONObject jsonObject, Context ctx, Collector<String> collector) throws Exception {
        BroadcastState<String, JSONObject> broadcastState = ctx.getBroadcastState(ruleStateDescriptor);
        // A single fixed key is used, so each new snapshot overwrites the previous one.
        broadcastState.put("broadcast", jsonObject);
    }

    // Handles elements from the data (non-broadcast) stream.
    @Override
    public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) {
        JSONObject currentStreamData = JSON.parseObject(value);
        if (currentStreamData != null) {
            try {
                for (Map.Entry<String, JSONObject> entry : ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                    String outStr = "";
                    JSONObject jsonObject = entry.getValue();
                    JSONObject configJsonFile = jsonObject.getJSONObject("configJsonFile");
                    @SuppressWarnings("unchecked")
                    List<TimeBaseLineDto> timeBaseLineDto = (List<TimeBaseLineDto>) jsonObject.get("timeBaseLineDtos");
                    if ("1".equals(eventType)) {
                        // Business logic; doTimeOutierEvent is defined elsewhere and is not shown here.
                        outStr = doTimeOutierEvent(timeBaseLineDto, currentStreamData, configJsonFile);
                    }
                    if (!StringUtil.isNullOrEmpty(outStr)) {
                        out.collect(outStr);
                    }
                }
            } catch (Exception e) {
                logger.error("Error while processing the broadcast stream and the data stream:", e);
            }
        }
    }
}
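Since processBroadcastElement() always writes under the one key "broadcast", the state never holds more than one entry, and the loop in processElement() therefore sees only the most recent snapshot. A minimal sketch of that behaviour, assuming a plain LinkedHashMap as a stand-in for Flink's BroadcastState (LatestSnapshotState is a hypothetical name used only for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java stand-in for a broadcast state that is always written under one fixed key. */
class LatestSnapshotState {
    private final Map<String, String> state = new LinkedHashMap<>();

    // Mirrors processBroadcastElement(): the same key every time, so the old value is replaced.
    void onBroadcast(String snapshot) {
        state.put("broadcast", snapshot);
    }

    int entries() {
        return state.size();
    }

    String current() {
        return state.get("broadcast");
    }

    public static void main(String[] args) {
        LatestSnapshotState state = new LatestSnapshotState();
        state.onBroadcast("baseline of day 1");
        state.onBroadcast("baseline of day 2");
        System.out.println(state.entries() + " entry, current = " + state.current());
    }
}
```

If several independent rule sets had to be kept side by side, each would need its own key; with a single daily baseline file, overwriting is the behaviour you want.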
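5. Connecting the two streams:
Connect the real-time stream and the broadcast stream by calling connect() on the non-broadcast stream.
// Assumed definition: the descriptor name "ruleState" is arbitrary. The same descriptor must be
// passed to broadcast() and to MyBroadcastProcessFunction.
MapStateDescriptor<String, JSONObject> ruleStateDesc = new MapStateDescriptor<>("ruleState", String.class, JSONObject.class);
BroadcastStream<JSONObject> timeBroadcast = fileStreamSource.setParallelism(1).broadcast(ruleStateDesc);
// MyBroadcastProcessFunction emits String, so the result is a DataStream<String>.
DataStream<String> timeStream = env.addSource(ssConsumer)
        .connect(timeBroadcast)
        .process(new MyBroadcastProcessFunction(ruleStateDesc, "1"));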
6. Sink:
timeStream.addSink(FlinkKafkaProducerCustom.create(WRITE_TOPIC, properties)).name("flink-kafka-timeStream");