Flink uses checkpoints to save task state, so when a task fails it can resume from the most recent checkpoint. But what if the entire application goes down: how do we restore its state from a previous checkpoint?
Step one: by default, Flink deletes the checkpoint data when the application exits, but we can configure the job to retain that data on cancellation:
val config = env.getCheckpointConfig
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
Step two: restart the application, pointing -s at the checkpoint directory:
bin/flink run -s hdfs://aba/aa.... ...
Next, let's simulate the whole process using WordCount as the example. We define a MapFunction called TMap whose only job is to throw an exception whenever it receives the string "error".
The restart strategy: on task failure, restart up to 3 times, with a 2-second delay between attempts.
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala._

object StateWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // A restart strategy only takes effect when checkpointing is enabled
    env.enableCheckpointing(5000)
    // Restart at most 3 times after a failure, 2000 ms apart
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 2000))
    val config = env.getCheckpointConfig
    // Keep checkpoint data when the job is cancelled
    config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.setStateBackend(new FsStateBackend("hdfs://hadoop:8020/tmp/flinkck"))
    val inputStream = env.socketTextStream("hadoop7", 9999) // nc -lk 9999
    inputStream.flatMap(_.split(" "))
      .map(new TMap())
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print()
    env.execute("stream word count job")
  }
}

/** Throws an exception when the incoming record is the string "error". */
class TMap() extends MapFunction[String, String] {
  override def map(t: String): String = {
    if ("error".equals(t)) {
      throw new RuntimeException("error message")
    }
    t
  }
}
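The keyBy(0).sum(1) pair is what holds the state here: Flink keeps one running total per key as keyed state, and that is exactly what a checkpoint snapshots. As a minimal plain-Scala sketch of that bookkeeping (no Flink dependency; object and method names are illustrative):

```scala
object KeyedSumSketch {
  // Emulates keyBy(0).sum(1): one running counter per key,
  // emitting the updated (word, total) pair for every input record.
  def runningCounts(words: Seq[String]): Seq[(String, Int)] = {
    val counts = scala.collection.mutable.Map.empty[String, Int]
    words.map { w =>
      val total = counts.getOrElse(w, 0) + 1
      counts(w) = total
      (w, total)
    }
  }

  def main(args: Array[String]): Unit =
    // Same input as the demo below: spark spark flink flink
    runningCounts(Seq("spark", "spark", "flink", "flink")).foreach(println)
}
```

Running it prints (spark,1), (spark,2), (flink,1), (flink,2), matching the job's output in the demo that follows; the counts map is the part a checkpoint would persist.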
Start an nc server on the hadoop7 machine:
nc -lk 9999
Run the program for the first time:
/data2/flink-1.10.0/bin/flink run -yD yarn.containers.vcores=2 -m yarn-cluster -c StateWordCount wordcount.jar
Send some data to port 9999:
[root@hadoop7]$ nc -lk 9999
spark
spark
flink
flink
The job prints:
(spark,1)
(spark,2)
(flink,1)
(flink,2)
Now send one bad record, "error"; the task fails:
2020-03-26 16:04:00,424 INFO org.apache.flink.runtime.taskmanager.Task - Source: Socket Stream -> Flat Map -> Map -> Map (1/1) (f7348fec435dcd81fca53edd5042d791) switched from RUNNING to FAILED.
java.lang.RuntimeException: error message
    at TMap.map(StateWordCount.scala:42)
    at TMap.map(StateWordCount.scala:39)
    at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator
But roughly two seconds later, the task recovers:
2020-03-26 16:04:02,496 INFO org.apache.flink.runtime.taskmanager.Task - aggregation -> Sink: Print to Std. Out (1/1) (33a473ca0606f0b96ad9542e8fc41bef) switched from DEPLOYING to RUNNING.
Send one more "flink" record (the nc session below shows everything entered so far):
[hdfs@hadoop7 hz_adm]$ nc -lk 9999
spark
spark
flink
flink
error
flink
Check the output:
(spark,1)
(spark,2)
(flink,1)
(flink,2)
(flink,3)
The task recovered, and the result was computed on top of the state saved before the failure: (flink,3) continues from the earlier count of 2.
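Why the count continues from 2 rather than restarting at 0: on failure, Flink rolls the operator state back to the last completed checkpoint and resumes from there. A rough plain-Scala sketch of that snapshot-and-restore cycle (illustrative only; real checkpoints are periodic and also cover the source's read position, which a plain socket source cannot replay exactly):

```scala
object RestoreSketch {
  type State = Map[String, Int]

  // Toy model: a checkpoint completes before every record; a record equal
  // to "error" fails the task, which then restarts from the last snapshot.
  def processWithRecovery(words: Seq[String]): State = {
    var snapshot: State = Map.empty // last completed checkpoint
    var state: State = snapshot
    for (word <- words) {
      snapshot = state // checkpoint completes
      if (word == "error")
        state = snapshot // task fails and restores the snapshot
      else
        state = state.updated(word, state.getOrElse(word, 0) + 1)
    }
    state
  }

  def main(args: Array[String]): Unit =
    println(processWithRecovery(Seq("spark", "spark", "flink", "flink", "error", "flink")))
}
```

For the input above, the final state is spark -> 2, flink -> 3: the failing "error" record does not reset the counters, which matches the (flink,3) we just saw.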
In the code we allowed at most three restarts, so next we send three more "error" records to crash the program for good:
[hdfs@hadoop7]$ nc -lk 9999
spark
spark
flink
flink
error
flink
error
error
error
Because we configured checkpoints to be retained, the state data is still on HDFS:
hdfs dfs -ls /tmp/flinkck/d3fad3e90704e5440fa605a00b9a9b97/chk-987
Found 2 items
-rw-r--r-- 3 hdfs supergroup 2019 2020-03-26 17:17 /tmp/flinkck/d3fad3e90704e5440fa605a00b9a9b97/chk-987/88073e87-e08d-4206-a7e3-59b82da2b7fa
-rw-r--r-- 3 hdfs supergroup 1292 2020-03-26 17:17 /tmp/flinkck/d3fad3e90704e5440fa605a00b9a9b97/chk-987/_metadata
Restart the application, pointing -s at the checkpoint directory:
/data2/flink-1.10.0/bin/flink run -s hdfs://hadoop:8020/tmp/flinkck/d3fad3e90704e5440fa605a00b9a9b97/chk-987 \
-yD yarn.containers.vcores=2 -m yarn-cluster -c StateWordCount wordcount.jar
Send two more records, "flink" and "spark":
[hdfs@hadoop7]$ nc -lk 9999
spark
spark
flink
flink
error
flink
error
error
error
flink
spark
The result is computed on top of the state saved before the program crashed:
(flink,4)
(spark,3)