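[Flink Series, Part 1] Enabling Checkpoints in Flink and Restoring from a Checkpoint

Preface

Flink provides checkpoints and savepoints to persist state, so that after a failure a job can be restored and continue computing from the last saved state.

Questions

1. How do you enable checkpointing?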

Flink-Checkpointing

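import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// get the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

//...

// trigger a checkpoint every 300 seconds (the interval is given in milliseconds)
env.enableCheckpointing(300 * 1000);
// use exactly-once checkpointing semantics
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// wait at least 300 seconds after a checkpoint completes before starting the next one
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(300 * 1000);
// abort any checkpoint that takes longer than 60 seconds
env.getCheckpointConfig().setCheckpointTimeout(60000);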

// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
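
The interval and the minimum pause in the configuration above interact: the interval is measured from the start of a checkpoint, while the minimum pause is measured from its completion. The following is a minimal sketch of that arithmetic in plain Java, assuming a simplified model in which the next checkpoint starts at the later of the two bounds. `CheckpointTiming` and `earliestNextStartMs` are hypothetical names for illustration, not Flink APIs.

```java
/** Simplified model of checkpoint spacing; a hypothetical helper, not part of Flink. */
public class CheckpointTiming {

    /**
     * Returns the earliest time (in ms) at which the next checkpoint may start,
     * given when the previous checkpoint started and completed.
     */
    public static long earliestNextStartMs(
            long lastStartMs, long lastEndMs, long intervalMs, long minPauseMs) {
        long byInterval = lastStartMs + intervalMs; // bound from the checkpoint interval
        long byMinPause = lastEndMs + minPauseMs;   // bound from the minimum pause
        return Math.max(byInterval, byMinPause);
    }

    public static void main(String[] args) {
        // Values from the configuration above: interval 300 s, minimum pause 300 s.
        long intervalMs = 300 * 1000L;
        long minPauseMs = 300 * 1000L;
        // Assumed example: a checkpoint starts at t = 0 and completes after 40 s.
        long next = earliestNextStartMs(0L, 40_000L, intervalMs, minPauseMs);
        System.out.println("Next checkpoint starts no earlier than " + next + " ms");
    }
}
```

Under this model, setting both values to 300 seconds makes the minimum pause the binding constraint, so consecutive checkpoints are spaced by 300 seconds plus the duration of the previous checkpoint.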

2. How do you restore from a checkpoint?

Restoring from a checkpoint

Difference to Savepoints

Checkpoints differ from savepoints in a few ways. They use a state-backend-specific (low-level) data format and may be incremental, and they do not support Flink-specific features like rescaling.

Resuming from a retained checkpoint

A job may be resumed from a checkpoint just as from a savepoint, by using the checkpoint's metadata file instead (see the savepoint restore guide). Note that if the metadata file is not self-contained, the JobManager needs access to the data files it refers to (see Directory Structure in the Flink checkpoint documentation).

$ bin/flink run -s :checkpointMetaDataPath [:runArgs]


Restore a savepoint

./bin/flink run -s <savepointPath> ...

The run command has a savepoint flag (-s) that submits a job and restores its state from a savepoint. The savepoint path is returned by the savepoint trigger command.
By default, Flink tries to match all savepoint state to the job being submitted. To skip savepoint state that cannot be restored with the new job, set the allowNonRestoredState flag (-n). This is needed if you removed an operator that was part of the program when the savepoint was triggered and you still want to use the savepoint.
./bin/flink run -s <savepointPath> -n ...

-n,--allowNonRestoredState            Allow to skip savepoint state that
                                      cannot be restored. You need to allow
                                      this if you removed an operator from
                                      your program that was part of the
                                      program when the savepoint was
                                      triggered.

-s,--fromSavepoint <savepointPath>    Path to a savepoint to restore the job
                                      from (for example
                                      hdfs:///flink/savepoint-1537).

Add the following arguments to the run command:

bin/flink run -s hdfs://your-node/application/flink/slankka/checkpoint/37736d4edffd6150c97ff24d6a48bbf4/chk-225 -n ...other arguments
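
When the restore is scripted, the command line above can be assembled programmatically. The following is a minimal sketch in plain Java; `RestoreCommand` and `build` are hypothetical names, and the jar path in `main` is a placeholder that does not come from the original command.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Builds the argument list for "flink run -s"; a hypothetical helper, not part of Flink. */
public class RestoreCommand {

    /**
     * @param checkpointPath path to the retained checkpoint (or savepoint) to restore from
     * @param allowNonRestoredState whether to add -n to skip state that cannot be restored
     * @param runArgs remaining arguments, such as the job jar and program arguments
     */
    public static List<String> build(
            String checkpointPath, boolean allowNonRestoredState, List<String> runArgs) {
        if (checkpointPath == null || checkpointPath.isEmpty()) {
            throw new IllegalArgumentException("checkpointPath must not be empty");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/flink");
        cmd.add("run");
        cmd.add("-s");
        cmd.add(checkpointPath);
        if (allowNonRestoredState) {
            cmd.add("-n");
        }
        cmd.addAll(runArgs);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "hdfs://your-node/application/flink/slankka/checkpoint/"
                        + "37736d4edffd6150c97ff24d6a48bbf4/chk-225",
                true,
                Arrays.asList("/path/to/your-job.jar")); // placeholder jar path
        System.out.println(String.join(" ", cmd));
    }
}
```

The returned list can be handed to `java.lang.ProcessBuilder` if the command should be launched from Java rather than printed.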

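Besides the Flink web UI, the path of the latest checkpoint can also be fetched from Flink's REST API, for example through the YARN proxy.

// For example, requesting the following URL through YARN: http://yarn-node.slankka.com:8088/proxy/application_1595593091318_0082/jobs/37736d4edffd6150c97ff24d6a48bbf4/metrics?get=lastCheckpointExternalPath
// returns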
[
  {
    "id": "lastCheckpointExternalPath",
    "value": "hdfs://your-node/application/flink/slankka/checkpoint/37736d4edffd6150c97ff24d6a48bbf4/chk-248"
  }
]
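
To use this response in a script, the path has to be pulled out of the JSON. The following is a minimal sketch in plain Java that uses a regular expression to stay dependency-free; it assumes that "id" precedes "value" as in the response above and that the path contains no escaped quotes. `CheckpointPathExtractor` is a hypothetical name, and a real JSON library would be more robust.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts lastCheckpointExternalPath from a metrics response; hypothetical helper. */
public class CheckpointPathExtractor {

    // Matches: "id": "lastCheckpointExternalPath", "value": "<path>"
    // Assumes "id" comes before "value" and the path has no escaped quotes.
    private static final Pattern METRIC = Pattern.compile(
            "\"id\"\\s*:\\s*\"lastCheckpointExternalPath\"\\s*,\\s*\"value\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the checkpoint path, or empty if the metric is not in the response. */
    public static Optional<String> extract(String json) {
        Matcher m = METRIC.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Response body copied from the example above, written on one line.
        String response = "[{\"id\":\"lastCheckpointExternalPath\","
                + "\"value\":\"hdfs://your-node/application/flink/slankka/checkpoint/"
                + "37736d4edffd6150c97ff24d6a48bbf4/chk-248\"}]";
        System.out.println(extract(response).orElse("<no checkpoint path found>"));
    }
}
```

The extracted path is the value that would be passed to the -s option when restoring the job.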

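In practice, however, it is best to collect this metric with a metrics system.

Collecting Flink metrics (especially non-numeric metrics such as lastCheckpointExternalPath)

Does Prometheus work? After reading the source code, I found that it does not: Prometheus sample values are numeric, so the Prometheus reporter cannot export a string-valued metric like this one.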
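
To illustrate why a string-valued gauge is a problem for a numeric time-series backend, here is a simplified sketch in plain Java. It is not Flink's actual reporter code: `NumericGaugeFilter` and `toNumeric` are hypothetical names, and the rule of accepting only numbers and booleans is an assumption made for this illustration.

```java
import java.util.OptionalDouble;

/** Illustrative filter for gauge values; a hypothetical helper, not Flink's reporter code. */
public class NumericGaugeFilter {

    /** Converts a gauge value to a double if it is a number or boolean, else returns empty. */
    public static OptionalDouble toNumeric(Object gaugeValue) {
        if (gaugeValue instanceof Number) {
            return OptionalDouble.of(((Number) gaugeValue).doubleValue());
        }
        if (gaugeValue instanceof Boolean) {
            return OptionalDouble.of(((Boolean) gaugeValue) ? 1.0 : 0.0);
        }
        // Strings such as a checkpoint path have no numeric representation.
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        Object numericGauge = 248L; // assumed example of a numeric gauge value
        Object pathGauge = "hdfs://your-node/application/flink/slankka/checkpoint/"
                + "37736d4edffd6150c97ff24d6a48bbf4/chk-248";
        System.out.println("numeric gauge exportable: " + toNumeric(numericGauge).isPresent());
        System.out.println("path gauge exportable: " + toNumeric(pathGauge).isPresent());
    }
}
```

A backend that can store string fields does not need such a conversion, which is why the choice of reporter matters for this metric.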

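See the following documentation for the metric reporters (time-series databases) that Flink supports:
Flink Metrics

See also the next article: Flink Series, Part 2: Collecting Flink Metrics with InfluxDB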
