CheckpointCoordinator
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
A core class of Flink's fault tolerance mechanism.
Call chain:
JobManager[submitJob] ==> ExecutionGraphBuilder[buildGraph] ==> new ExecutionGraph().enableCheckpointing() ==> new CheckpointCoordinator()
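When the JobManager submits a job it builds the execution graph. If checkpointing is enabled at that point, it creates a checkpoint coordinator (CheckpointCoordinator). This class coordinates the distributed snapshots of operators and their state.
Periodic checkpoint triggering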
ScheduledTrigger
Checkpoints are triggered periodically by a timer. The timer task is an inner class of CheckpointCoordinator: ScheduledTrigger.
The details of the implementation are covered later.
private final class ScheduledTrigger implements Runnable {
@Override
public void run() {
try {
triggerCheckpoint(System.currentTimeMillis(), true);
}
catch (Exception e) {
LOG.error("Exception while triggering checkpoint for job {}.", job, e);
}
}
}
startCheckpointScheduler
The method that starts the periodic checkpoint-triggering task:
public void startCheckpointScheduler() {
synchronized (lock) {
if (shutdown) {
throw new IllegalArgumentException("Checkpoint coordinator is shut down");
}
// make sure all timers from earlier calls are stopped
stopCheckpointScheduler();
periodicScheduling = true;
// start the periodic trigger
currentPeriodicTrigger = timer.scheduleAtFixedRate(
new ScheduledTrigger(),
baseInterval, baseInterval, TimeUnit.MILLISECONDS);
}
}
stopCheckpointScheduler
The method that stops the periodic task. It releases resources and resets a few flag variables.
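The article does not show the body of stopCheckpointScheduler, so the following is only a minimal sketch of the start/stop pattern described above, assuming a plain `ScheduledExecutorService` as the timer. The class `PeriodicTriggerSketch` and its methods are hypothetical names chosen for illustration; they are not Flink's API, and the counter stands in for the real trigger call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for the coordinator's timer handling.
class PeriodicTriggerSketch {
    private final Object lock = new Object();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger triggerCount = new AtomicInteger();
    private final long baseIntervalMillis;

    private ScheduledFuture<?> currentPeriodicTrigger; // guarded by lock
    private boolean periodicScheduling;                // guarded by lock
    private boolean triggerRequestQueued;              // guarded by lock

    PeriodicTriggerSketch(long baseIntervalMillis) {
        this.baseIntervalMillis = baseIntervalMillis;
    }

    void startScheduler() {
        synchronized (lock) {
            // Java monitors are reentrant, so calling stopScheduler() here is safe.
            stopScheduler();
            periodicScheduling = true;
            // The counter increment stands in for triggering a checkpoint.
            currentPeriodicTrigger = timer.scheduleAtFixedRate(
                    () -> triggerCount.incrementAndGet(),
                    baseIntervalMillis, baseIntervalMillis, TimeUnit.MILLISECONDS);
        }
    }

    void stopScheduler() {
        synchronized (lock) {
            // Reset the flags and release the timer task.
            triggerRequestQueued = false;
            periodicScheduling = false;
            if (currentPeriodicTrigger != null) {
                currentPeriodicTrigger.cancel(false);
                currentPeriodicTrigger = null;
            }
        }
    }

    boolean isPeriodicScheduling() {
        synchronized (lock) {
            return periodicScheduling;
        }
    }

    int triggerCount() {
        return triggerCount.get();
    }

    void shutdown() {
        stopScheduler();
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        PeriodicTriggerSketch sketch = new PeriodicTriggerSketch(20);
        sketch.startScheduler();
        Thread.sleep(200);
        sketch.stopScheduler();
        System.out.println("triggers fired: " + sketch.triggerCount());
        sketch.shutdown();
    }
}
```

Calling startScheduler() twice in a row is harmless in this sketch, because the first step always cancels whatever timer task was scheduled before.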
triggerCheckpoint
This method holds the core logic for triggering a new checkpoint.
@VisibleForTesting
public CheckpointTriggerResult triggerCheckpoint(
long timestamp,
CheckpointProperties props,
@Nullable String externalSavepointLocation,
boolean isPeriodic) {
// Run pre-checks before triggering the checkpoint
synchronized (lock) {
// Decline if the coordinator has already been shut down
if (shutdown) {
return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
}
// Do not trigger a periodic checkpoint if periodic scheduling has been disabled
if (isPeriodic && !periodicScheduling) {
return new CheckpointTriggerResult(CheckpointDeclineReason.PERIODIC_SCHEDULER_SHUTDOWN);
}
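// Check whether the checkpoint may be triggered. Two conditions must hold:
// 1. the number of concurrent checkpoints stays below the configured maximum
// 2. the minimum pause between checkpoints has elapsed
// (savepoints are not subject to these limits)
if (!props.forceCheckpoint()) {
// Sanity check: there should never be more than one trigger request queued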
if (triggerRequestQueued) {
LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
}
// If the number of concurrent checkpoints has reached the configured maximum, set 'triggerRequestQueued', cancel the periodic trigger, and decline
if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
triggerRequestQueued = true;
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
}
// Enforce the minimum pause between checkpoints
final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;
if (durationTillNextMillis > 0) {
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
// Reassign the new trigger to the currentPeriodicTrigger
currentPeriodicTrigger = timer.scheduleAtFixedRate(
new ScheduledTrigger(),
durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);
return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
}
}
}
// Check that every task that must trigger the checkpoint is RUNNING.
// If any one of them is not, decline.
Execution[] executions = new Execution[tasksToTrigger.length];
for (int i = 0; i < tasksToTrigger.length; i++) {
Execution ee = tasksToTrigger[i].getCurrentExecutionAttempt();
if (ee != null && ee.getState() == ExecutionState.RUNNING) {
executions[i] = ee;
} else {
LOG.info("Checkpoint triggering task {} of job {} is not being executed at the moment. Aborting checkpoint.",
tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
job);
return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
}
}
// Check that every task that must acknowledge the checkpoint is currently being executed (has a current execution attempt).
// If any one of them is not, decline.
Map<ExecutionAttemptID, ExecutionVertex> ackTasks = new HashMap<>(tasksToWaitFor.length);
for (ExecutionVertex ev : tasksToWaitFor) {
Execution ee = ev.getCurrentExecutionAttempt();
if (ee != null) {
ackTasks.put(ee.getAttemptId(), ev);
} else {
LOG.info("Checkpoint acknowledging task {} of job {} is not being executed at the moment. Aborting checkpoint.",
ev.getTaskNameWithSubtaskIndex(),
job);
return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
}
}
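// The checkpoint is actually triggered from here on.
// A separate lock, triggerLock, ensures that trigger requests do not overtake each other.
// The coordinator-wide lock is not used here, so that the coordinator is not blocked from processing incoming acknowledge/decline messages.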
synchronized (triggerLock) {
final CheckpointStorageLocation checkpointStorageLocation;
final long checkpointID;
try {
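// Obtain the checkpoint ID.
// This is deliberately not done under the coordinator-wide lock: generating the ID communicates with external services (in HA mode) and may block for a while.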
checkpointID = checkpointIdCounter.getAndIncrement();
checkpointStorageLocation = props.isSavepoint() ?
checkpointStorage.initializeLocationForSavepoint(checkpointID, externalSavepointLocation) :
checkpointStorage.initializeLocationForCheckpoint(checkpointID);
}
catch (Throwable t) {
int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
LOG.warn("Failed to trigger checkpoint for job {} ({} consecutive failed attempts so far).",
job,
numUnsuccessful,
t);
return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
}
final PendingCheckpoint checkpoint = new PendingCheckpoint(
job,
checkpointID,
timestamp,
ackTasks,
props,
checkpointStorageLocation,
executor);
if (statsTracker != null) {
PendingCheckpointStats callback = statsTracker.reportPendingCheckpoint(
checkpointID,
timestamp,
props);
checkpoint.setStatsCallback(callback);
}
// Canceller that the timer runs to clean up a checkpoint that has timed out
final Runnable canceller = () -> {
synchronized (lock) {
if (!checkpoint.isDiscarded()) {
LOG.info("Checkpoint {} of job {} expired before completing.", checkpointID, job);
checkpoint.abortExpired();
pendingCheckpoints.remove(checkpointID);
rememberRecentCheckpointId(checkpointID);
triggerQueuedRequests();
}
}
};
try {
// Re-acquire the coordinator-wide lock
synchronized (lock) {
// The lock was released in the meantime, so the state may have changed; re-check all conditions
if (shutdown) {
return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
}
else if (!props.forceCheckpoint()) {
if (triggerRequestQueued) {
LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
}
if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
triggerRequestQueued = true;
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
}
final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;
if (durationTillNextMillis > 0) {
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
currentPeriodicTrigger = timer.scheduleAtFixedRate(
new ScheduledTrigger(),
durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);
return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
}
}
LOG.info("Triggering checkpoint {} @ {} for job {}.", checkpointID, timestamp, job);
// The trigger conditions still hold after the re-check, so register the PendingCheckpoint created above
pendingCheckpoints.put(checkpointID, checkpoint);
// Schedule the timeout canceller for this checkpoint
ScheduledFuture<?> cancellerHandle = timer.schedule(
canceller,
checkpointTimeout, TimeUnit.MILLISECONDS);
if (!checkpoint.setCancellerHandle(cancellerHandle)) {
// The checkpoint has already been disposed, so the canceller is no longer needed
cancellerHandle.cancel(false);
}
// trigger the master hooks for the checkpoint
final List<MasterState> masterStates = MasterHooks.triggerMasterHooks(masterHooks.values(),
checkpointID, timestamp, executor, Time.milliseconds(checkpointTimeout));
for (MasterState s : masterStates) {
checkpoint.addMasterState(s);
}
}
final CheckpointOptions checkpointOptions = new CheckpointOptions(
props.getCheckpointType(),
checkpointStorageLocation.getLocationReference());
// Tell all tasks to trigger their checkpoint (delivered via Akka's actor mechanism underneath)
for (Execution execution: executions) {
execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}
numUnsuccessfulCheckpointsTriggers.set(0);
return new CheckpointTriggerResult(checkpoint);
}
catch (Throwable t) {
// clean up resources (omitted here)
}
} // end trigger lock
}
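The pre-checks at the top of triggerCheckpoint (and their repetition after the lock is re-acquired) can be summarized as one decision function. The following is a minimal sketch under simplifying assumptions: `TriggerPreCheckSketch` and its `Decision` enum are hypothetical names, the fields mirror the variables in the excerpt above, and the side effects (setting triggerRequestQueued, cancelling and rescheduling the timer) are deliberately left out so that the decision logic stays pure.

```java
// Hypothetical, side-effect-free model of the pre-checks; not Flink's actual API.
class TriggerPreCheckSketch {
    enum Decision {
        TRIGGER,
        COORDINATOR_SHUTDOWN,
        PERIODIC_SCHEDULER_SHUTDOWN,
        ALREADY_QUEUED,
        TOO_MANY_CONCURRENT_CHECKPOINTS,
        MINIMUM_TIME_BETWEEN_CHECKPOINTS
    }

    // Coordinator state, named after the variables in the excerpt.
    boolean shutdown;
    boolean periodicScheduling;
    boolean triggerRequestQueued;
    int pendingCheckpoints;
    long lastCheckpointCompletionNanos;

    final int maxConcurrentCheckpointAttempts;
    final long minPauseBetweenCheckpointsNanos;

    TriggerPreCheckSketch(int maxConcurrentCheckpointAttempts, long minPauseBetweenCheckpointsNanos) {
        this.maxConcurrentCheckpointAttempts = maxConcurrentCheckpointAttempts;
        this.minPauseBetweenCheckpointsNanos = minPauseBetweenCheckpointsNanos;
    }

    /**
     * @param isPeriodic whether the request comes from the periodic timer
     * @param force      whether the limits are bypassed (as for savepoints)
     * @param nowNanos   current time; passed in so the check is deterministic
     */
    Decision preCheck(boolean isPeriodic, boolean force, long nowNanos) {
        if (shutdown) {
            return Decision.COORDINATOR_SHUTDOWN;
        }
        if (isPeriodic && !periodicScheduling) {
            return Decision.PERIODIC_SCHEDULER_SHUTDOWN;
        }
        if (!force) {
            if (triggerRequestQueued) {
                return Decision.ALREADY_QUEUED;
            }
            if (pendingCheckpoints >= maxConcurrentCheckpointAttempts) {
                return Decision.TOO_MANY_CONCURRENT_CHECKPOINTS;
            }
            long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
            long durationTillNextMillis = (earliestNext - nowNanos) / 1_000_000;
            if (durationTillNextMillis > 0) {
                return Decision.MINIMUM_TIME_BETWEEN_CHECKPOINTS;
            }
        }
        return Decision.TRIGGER;
    }

    public static void main(String[] args) {
        TriggerPreCheckSketch c = new TriggerPreCheckSketch(1, 500_000_000L);
        c.periodicScheduling = true;
        c.pendingCheckpoints = 1;
        System.out.println(c.preCheck(true, false, 1_000_000_000L));
        System.out.println(c.preCheck(false, true, 1_000_000_000L));
    }
}
```

Because the coordinator releases its lock between the first pass and the actual trigger, the same function would be evaluated a second time once the lock is held again, which is why the excerpt repeats the checks almost verbatim.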
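Actor-based, message-driven coordination
Flink uses Akka for communication between three distributed components: JobClient, JobManager, and TaskManager.
The JobClient takes the job submitted by the user and submits it to the JobManager.
The JobManager then prepares the environment in which the job will run.
First, it allocates the resources the job needs, mainly execution slots on TaskManagers.
Once the resources are allocated, the JobManager deploys the individual tasks to specific TaskManagers.
On receiving a task, a TaskManager spawns a thread to execute it.
Every state change, such as a computation starting or finishing, is reported back to the JobManager.
Based on these state changes, the JobManager steers the execution of the tasks until the job completes.
Once the job has finished, its result is returned to the JobClient, which reports it to the user.
The flow is shown in the figure below: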
CheckpointCoordinatorDeActivator
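CheckpointCoordinatorDeActivator is an actor-style implementation (think of it as a listener; in the code below it implements JobStatusListener). It starts and stops the periodic checkpoint task in response to messages (from the TaskManager). As described above, the start and stop methods are startCheckpointScheduler() and stopCheckpointScheduler().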
@Override
public void jobStatusChanges(JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
if (newJobStatus == JobStatus.RUNNING) {
// start the checkpoint scheduler
coordinator.startCheckpointScheduler();
} else {
// anything else should stop the trigger for now
coordinator.stopCheckpointScheduler();
}
}
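The actor is notified whenever the JobStatus changes. As soon as the status becomes RUNNING, the periodic checkpoint task is started; for any other status it is stopped.
CheckpointCoordinatorDeActivator is created in CheckpointCoordinator's createActivatorDeactivator() method: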
public JobStatusListener createActivatorDeactivator() {
synchronized (lock) {
if (shutdown) {
throw new IllegalArgumentException("Checkpoint coordinator is shut down");
}
if (jobStatusListener == null) {
jobStatusListener = new CheckpointCoordinatorDeActivator(this);
}
return jobStatusListener;
}
}
Code path for checkpoint trigger messages
Actor (listener) registration:
JobManager[submitJob] ==> ExecutionGraphBuilder[buildGraph]
==> new ExecutionGraph().enableCheckpointing()
==> registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
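While building the execution graph, Flink creates the checkpoint coordinator (CheckpointCoordinator) and, at the same time, creates and registers a CheckpointCoordinatorDeActivator. When a message from the TaskManager arrives, the listener starts or stops checkpointing accordingly.
Message-driven code path:
JobMaster ==> ExecutionGraph[cancel/suspend/restart]
==> notifyJobStatusChange
==> start or stop checkpointing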
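The registration and notification paths above follow the ordinary listener pattern. The sketch below illustrates it under stated assumptions: `JobStatusNotificationSketch`, `Graph`, `Scheduler`, and `DeActivator` are hypothetical stand-ins for ExecutionGraph, CheckpointCoordinator, and CheckpointCoordinatorDeActivator, and the `JobStatus` enum lists only a subset of states for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical model of listener registration and status notification.
class JobStatusNotificationSketch {
    // Illustrative subset of job states.
    enum JobStatus { CREATED, RUNNING, CANCELLING, SUSPENDED, RESTARTING, FAILED }

    interface StatusListener {
        void jobStatusChanges(JobStatus newStatus);
    }

    // Stands in for the coordinator's start/stop scheduler methods.
    static class Scheduler {
        private boolean running;

        void start() { running = true; }
        void stop() { running = false; }
        boolean isRunning() { return running; }
    }

    // Starts the scheduler on RUNNING and stops it on every other status.
    static class DeActivator implements StatusListener {
        private final Scheduler scheduler;

        DeActivator(Scheduler scheduler) {
            this.scheduler = scheduler;
        }

        @Override
        public void jobStatusChanges(JobStatus newStatus) {
            if (newStatus == JobStatus.RUNNING) {
                scheduler.start();
            } else {
                scheduler.stop();
            }
        }
    }

    // Stands in for the graph that owns the job status and its listeners.
    static class Graph {
        private final List<StatusListener> listeners = new CopyOnWriteArrayList<>();
        private JobStatus state = JobStatus.CREATED;

        void registerJobStatusListener(StatusListener listener) {
            listeners.add(listener);
        }

        void transitionState(JobStatus newState) {
            state = newState;
            // Corresponds to the notifyJobStatusChange step in the flow above.
            for (StatusListener listener : listeners) {
                listener.jobStatusChanges(newState);
            }
        }

        JobStatus getState() { return state; }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        Graph graph = new Graph();
        graph.registerJobStatusListener(new DeActivator(scheduler));
        graph.transitionState(JobStatus.RUNNING);
        System.out.println("scheduler running: " + scheduler.isRunning());
        graph.transitionState(JobStatus.RESTARTING);
        System.out.println("scheduler running: " + scheduler.isRunning());
    }
}
```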
AbstractCheckpointMessage
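Base class for checkpoint messages. Its fields:
- job: a JobID instance identifying the job this message belongs to
- taskExecutionId: an ExecutionAttemptID instance identifying the task that is the source or target of the checkpoint message
- checkpointId: the ID of the checkpoint this message coordinates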
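As an illustration of the three fields listed above, here is a minimal sketch of such a base message and one subclass that adds a timestamp. It is not Flink's class: `CheckpointMessageSketch`, `BaseMessage`, and `TriggerMessage` are hypothetical names, and plain strings stand in for JobID and ExecutionAttemptID so that the example is self-contained.

```java
import java.util.Objects;

// Hypothetical model of a checkpoint message hierarchy.
class CheckpointMessageSketch {

    abstract static class BaseMessage {
        private final String job;             // stands in for JobID
        private final String taskExecutionId; // stands in for ExecutionAttemptID
        private final long checkpointId;

        BaseMessage(String job, String taskExecutionId, long checkpointId) {
            this.job = Objects.requireNonNull(job, "job");
            this.taskExecutionId = Objects.requireNonNull(taskExecutionId, "taskExecutionId");
            this.checkpointId = checkpointId;
        }

        String getJob() { return job; }
        String getTaskExecutionId() { return taskExecutionId; }
        long getCheckpointId() { return checkpointId; }

        // Equality in this sketch covers only the three base fields.
        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()) {
                return false;
            }
            BaseMessage that = (BaseMessage) o;
            return checkpointId == that.checkpointId
                    && job.equals(that.job)
                    && taskExecutionId.equals(that.taskExecutionId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(job, taskExecutionId, checkpointId);
        }
    }

    // A message that additionally carries the checkpoint timestamp.
    static final class TriggerMessage extends BaseMessage {
        private final long timestamp;

        TriggerMessage(String job, String taskExecutionId, long checkpointId, long timestamp) {
            super(job, taskExecutionId, checkpointId);
            this.timestamp = timestamp;
        }

        long getTimestamp() { return timestamp; }
    }

    public static void main(String[] args) {
        TriggerMessage m = new TriggerMessage("job-1", "attempt-7", 42L, 1000L);
        System.out.println(m.getJob() + " " + m.getTaskExecutionId() + " " + m.getCheckpointId());
    }
}
```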
TriggerCheckpoint
The JobManager sends this message to a TaskManager to tell a task to trigger its checkpoint.
Sending the message: in triggerCheckpoint of CheckpointCoordinator, as mentioned above.
// send the messages to the tasks that trigger their checkpoint
for (Execution execution: executions) {
execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}
Handling the message: handleCheckpointingMessage in TaskManager:
case message: TriggerCheckpoint =>
val taskExecutionId = message.getTaskExecutionId
val checkpointId = message.getCheckpointId
val timestamp = message.getTimestamp
val checkpointOptions = message.getCheckpointOptions
log.debug(s"Receiver TriggerCheckpoint $checkpointId@$timestamp for $taskExecutionId.")
val task = runningTasks.get(taskExecutionId)
if (task != null) {
task.triggerCheckpointBarrier(checkpointId, timestamp, checkpointOptions)
} else {
log.debug(s"TaskManager received a checkpoint request for unknown task $taskExecutionId.")
}
DeclineCheckpoint
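The TaskManager sends this message to the JobManager to tell the checkpoint coordinator that a checkpoint request could not be processed yet.
This typically happens when a task is already in the RUNNING state but is not yet ready internally to perform a checkpoint.
In addition to the three fields of AbstractCheckpointMessage, it carries the timestamp associated with the checkpoint.
Sending the message: in the triggerCheckpointBarrier method of the Task class.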
Handling the message: on the JobManager side, in receiveDeclineMessage of CheckpointCoordinator.
AcknowledgeCheckpoint
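This message is an acknowledgement that an individual task has completed its checkpoint. It is also sent from the TaskManager to the JobManager, and it carries the task's state: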
- state
- stateSize
Sending the message: the acknowledgeCheckpoint method of the RuntimeEnvironment class.
Handling the message: implemented in receiveAcknowledgeMessage of CheckpointCoordinator.
NotifyCheckpointComplete
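The JobManager sends this message to a TaskManager to tell a task that its checkpoint has been confirmed as complete, so the task can commit the checkpoint to external systems.
Sending the message: in receiveAcknowledgeMessage of CheckpointCoordinator, after all tasks have acknowledged the checkpoint and it has been converted into a CompletedCheckpoint.
Handling the message: handleCheckpointingMessage in TaskManager.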
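The interplay of AcknowledgeCheckpoint and NotifyCheckpointComplete can be sketched as simple bookkeeping: a pending checkpoint waits for a set of tasks, each acknowledgement removes one task, and when the set is empty the checkpoint counts as completed and the tasks are notified. The code below is a simplified model under stated assumptions, not the logic of receiveAcknowledgeMessage itself: `AckTrackingSketch` is a hypothetical name, strings stand in for task IDs, state handles are omitted, and notifications are recorded in a list instead of being sent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of acknowledgement tracking for pending checkpoints.
class AckTrackingSketch {

    private static final class Pending {
        final List<String> allTasks;
        final Set<String> notYetAcknowledged;

        Pending(String... tasks) {
            allTasks = new ArrayList<>(Arrays.asList(tasks));
            notYetAcknowledged = new HashSet<>(allTasks);
        }
    }

    private final Map<Long, Pending> pendingCheckpoints = new HashMap<>();
    private final List<Long> completedCheckpoints = new ArrayList<>();
    private final List<String> notifications = new ArrayList<>();

    void trigger(long checkpointId, String... tasksToWaitFor) {
        pendingCheckpoints.put(checkpointId, new Pending(tasksToWaitFor));
    }

    /** Returns false for unknown checkpoints and for duplicate or unexpected acknowledgements. */
    boolean receiveAcknowledge(long checkpointId, String task) {
        Pending pending = pendingCheckpoints.get(checkpointId);
        if (pending == null || !pending.notYetAcknowledged.remove(task)) {
            return false;
        }
        if (pending.notYetAcknowledged.isEmpty()) {
            // All tasks acknowledged: pending becomes completed, then tasks are notified.
            pendingCheckpoints.remove(checkpointId);
            completedCheckpoints.add(checkpointId);
            for (String t : pending.allTasks) {
                notifications.add("NotifyCheckpointComplete(" + checkpointId + ") -> " + t);
            }
        }
        return true;
    }

    List<Long> completedCheckpoints() { return completedCheckpoints; }
    List<String> notifications() { return notifications; }

    public static void main(String[] args) {
        AckTrackingSketch tracker = new AckTrackingSketch();
        tracker.trigger(1L, "source-0", "sink-0");
        tracker.receiveAcknowledge(1L, "source-0");
        tracker.receiveAcknowledge(1L, "sink-0");
        tracker.notifications().forEach(System.out::println);
    }
}
```

Returning false for a late or duplicate acknowledgement is a choice made for this sketch; it keeps a completed checkpoint from being completed twice.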