CheckpointCoordinator
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
A core class of Flink's fault tolerance mechanism.
Call chain:
JobManager[submitJob] ==> ExecutionGraphBuilder[buildGraph] ==> new ExecutionGraph().enableCheckpointing() ==> new CheckpointCoordinator()
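When the JobManager submits a job, it builds the execution graph. If checkpointing is enabled at that point, it creates a checkpoint coordinator (CheckpointCoordinator), whose main role is to coordinate the distributed snapshots of operators and state.
Periodic checkpoint triggering mechanism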
ScheduledTrigger
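Checkpoints are triggered periodically by a timer. The timer task is ScheduledTrigger, an inner class of CheckpointCoordinator.
The triggerCheckpoint implementation it calls is covered in detail later.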
private final class ScheduledTrigger implements Runnable {
@Override
public void run() {
try {
triggerCheckpoint(System.currentTimeMillis(), true);
}
catch (Exception e) {
LOG.error("Exception while triggering checkpoint for job {}.", job, e);
}
}
}
startCheckpointScheduler
The method that starts the periodic checkpoint trigger task:
public void startCheckpointScheduler() {
synchronized (lock) {
if (shutdown) {
throw new IllegalArgumentException("Checkpoint coordinator is shut down");
}
// Make sure every previously scheduled timer is stopped
stopCheckpointScheduler();
periodicScheduling = true;
// Schedule the periodic trigger
currentPeriodicTrigger = timer.scheduleAtFixedRate(
new ScheduledTrigger(),
baseInterval, baseInterval, TimeUnit.MILLISECONDS);
}
}
stopCheckpointScheduler
This method stops the periodic task, releases its resources, and resets several flag variables.
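The start/stop pairing can be illustrated with a minimal sketch built on the JDK's ScheduledExecutorService. The class PeriodicTriggerSketch and its members are hypothetical names chosen for illustration, not Flink code, and the real stopCheckpointScheduler() may do more than this sketch (the article only says it releases resources and resets flags).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the coordinator's scheduling logic; not Flink code. */
final class PeriodicTriggerSketch {
    private final Object lock = new Object();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final long baseIntervalMillis;
    private final Runnable trigger;

    private ScheduledFuture<?> currentPeriodicTrigger; // guarded by lock
    private boolean periodicScheduling;                // guarded by lock

    PeriodicTriggerSketch(long baseIntervalMillis, Runnable trigger) {
        this.baseIntervalMillis = baseIntervalMillis;
        this.trigger = trigger;
    }

    /** Stops any earlier timer first, then schedules the trigger at a fixed rate. */
    void start() {
        synchronized (lock) {
            stop();
            periodicScheduling = true;
            currentPeriodicTrigger = timer.scheduleAtFixedRate(
                    trigger, baseIntervalMillis, baseIntervalMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Cancels the pending timer (without interrupting a running trigger) and resets the flag. */
    void stop() {
        synchronized (lock) {
            periodicScheduling = false;
            if (currentPeriodicTrigger != null) {
                currentPeriodicTrigger.cancel(false);
                currentPeriodicTrigger = null;
            }
        }
    }

    boolean isPeriodicScheduling() {
        synchronized (lock) {
            return periodicScheduling;
        }
    }

    /** Releases the timer thread; the sketch cannot be restarted afterwards. */
    void shutdown() {
        stop();
        timer.shutdownNow();
    }
}
```

Calling start() twice in a row is safe in this sketch because the first timer is cancelled before the second is scheduled, which mirrors the stop-then-schedule order in startCheckpointScheduler().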
triggerCheckpoint
This method holds the core logic for triggering a new checkpoint.
@VisibleForTesting
public CheckpointTriggerResult triggerCheckpoint(
long timestamp,
CheckpointProperties props,
@Nullable String externalSavepointLocation,
boolean isPeriodic) {
// Run the pre-checks before triggering the checkpoint
synchronized (lock) {
// If the coordinator has been shut down, decline
if (shutdown) {
return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
}
// If periodic scheduling has been disabled, do not trigger a periodic checkpoint
if (isPeriodic && !periodicScheduling) {
return new CheckpointTriggerResult(CheckpointDeclineReason.PERIODIC_SCHEDULER_SHUTDOWN);
}
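// Check whether the checkpoint may be triggered; two limits apply:
// 1. the maximum number of concurrent checkpoints
// 2. the minimum pause between checkpoints
// (savepoints are not subject to these limits)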
if (!props.forceCheckpoint()) {
// Sanity check: there should never be more than one trigger request queued
if (triggerRequestQueued) {
LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
}
// If the number of pending checkpoints has reached the configured maximum, set 'triggerRequestQueued' and decline
if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
triggerRequestQueued = true;
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
}
// Enforce the minimum pause between checkpoints
final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;
if (durationTillNextMillis > 0) {
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
// Reassign the new trigger to the currentPeriodicTrigger
currentPeriodicTrigger = timer.scheduleAtFixedRate(
new ScheduledTrigger(),
durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);
return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
}
}
}
// Make sure every task that should trigger the checkpoint is RUNNING.
// If any one of them is not, decline.
Execution[] executions = new Execution[tasksToTrigger.length];
for (int i = 0; i < tasksToTrigger.length; i++) {
Execution ee = tasksToTrigger[i].getCurrentExecutionAttempt();
if (ee != null && ee.getState() == ExecutionState.RUNNING) {
executions[i] = ee;
} else {
LOG.info("Checkpoint triggering task {} of job {} is not being executed at the moment. Aborting checkpoint.",
tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
job);
return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
}
}
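// Make sure every task that must acknowledge the checkpoint has a current execution attempt
// (this loop only checks for null, not for the RUNNING state). If any one has none, decline.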
Map<ExecutionAttemptID, ExecutionVertex> ackTasks = new HashMap<>(tasksToWaitFor.length);
for (ExecutionVertex ev : tasksToWaitFor) {
Execution ee = ev.getCurrentExecutionAttempt();
if (ee != null) {
ackTasks.put(ee.getAttemptId(), ev);
} else {
LOG.info("Checkpoint acknowledging task {} of job {} is not being executed at the moment. Aborting checkpoint.",
ev.getTaskNameWithSubtaskIndex(),
job);
return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
}
}
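// The actual trigger logic follows.
// It uses a separate lock, triggerLock, so that trigger requests cannot overtake one another.
// The coordinator-wide lock is not held here, so the coordinator can keep processing incoming acknowledge/decline messages.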
synchronized (triggerLock) {
final CheckpointStorageLocation checkpointStorageLocation;
final long checkpointID;
try {
// Obtain the checkpoint ID.
// This happens outside the coordinator-wide lock: in HA mode the counter talks to an external service and may block for a while
checkpointID = checkpointIdCounter.getAndIncrement();
checkpointStorageLocation = props.isSavepoint() ?
checkpointStorage.initializeLocationForSavepoint(checkpointID, externalSavepointLocation) :
checkpointStorage.initializeLocationForCheckpoint(checkpointID);
}
catch (Throwable t) {
int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
LOG.warn("Failed to trigger checkpoint for job {} ({} consecutive failed attempts so far).",
job,
numUnsuccessful,
t);
return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
}
final PendingCheckpoint checkpoint = new PendingCheckpoint(
job,
checkpointID,
timestamp,
ackTasks,
props,
checkpointStorageLocation,
executor);
if (statsTracker != null) {
PendingCheckpointStats callback = statsTracker.reportPendingCheckpoint(
checkpointID,
timestamp,
props);
checkpoint.setStatsCallback(callback);
}
// Canceller task that cleans up the checkpoint if it times out
final Runnable canceller = () -> {
synchronized (lock) {
if (!checkpoint.isDiscarded()) {
LOG.info("Checkpoint {} of job {} expired before completing.", checkpointID, job);
checkpoint.abortExpired();
pendingCheckpoints.remove(checkpointID);
rememberRecentCheckpointId(checkpointID);
triggerQueuedRequests();
}
}
};
try {
// Re-acquire the coordinator-wide lock
synchronized (lock) {
// This method released the lock in between, so the state may have changed; re-check the conditions
if (shutdown) {
return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
}
else if (!props.forceCheckpoint()) {
if (triggerRequestQueued) {
LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
}
if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
triggerRequestQueued = true;
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
}
final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;
if (durationTillNextMillis > 0) {
if (currentPeriodicTrigger != null) {
currentPeriodicTrigger.cancel(false);
currentPeriodicTrigger = null;
}
currentPeriodicTrigger = timer.scheduleAtFixedRate(
new ScheduledTrigger(),
durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);
return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
}
}
LOG.info("Triggering checkpoint {} @ {} for job {}.", checkpointID, timestamp, job);
// The conditions still hold after the re-check, so register the PendingCheckpoint created above
pendingCheckpoints.put(checkpointID, checkpoint);
// Schedule the timeout canceller for this checkpoint
ScheduledFuture<?> cancellerHandle = timer.schedule(
canceller,
checkpointTimeout, TimeUnit.MILLISECONDS);
if (!checkpoint.setCancellerHandle(cancellerHandle)) {
// The checkpoint is already disposed, so the canceller is no longer needed
cancellerHandle.cancel(false);
}
// trigger the master hooks for the checkpoint
final List<MasterState> masterStates = MasterHooks.triggerMasterHooks(masterHooks.values(),
checkpointID, timestamp, executor, Time.milliseconds(checkpointTimeout));
for (MasterState s : masterStates) {
checkpoint.addMasterState(s);
}
}
final CheckpointOptions checkpointOptions = new CheckpointOptions(
props.getCheckpointType(),
checkpointStorageLocation.getLocationReference());
// Tell every task to trigger its checkpoint (the message travels over Akka's actor mechanism)
for (Execution execution: executions) {
execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}
numUnsuccessfulCheckpointsTriggers.set(0);
return new CheckpointTriggerResult(checkpoint);
}
catch (Throwable t) {
// Clean up resources (omitted here)
}
} // end trigger lock
}
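The pre-checks in triggerCheckpoint can be condensed into a small pure function, which makes the order of the checks and the minimum-pause arithmetic easier to follow. This is a sketch under assumptions: TriggerPreCheckSketch, Decision, and the parameter names are hypothetical, the timer rescheduling side effects are left out, and only the comparison logic mirrors the code above.

```java
/** Hypothetical condensation of the trigger pre-checks; not Flink code. */
final class TriggerPreCheckSketch {

    enum Decision {
        PROCEED,
        ALREADY_QUEUED,
        TOO_MANY_CONCURRENT_CHECKPOINTS,
        MINIMUM_TIME_BETWEEN_CHECKPOINTS
    }

    /** Milliseconds still to wait before the next checkpoint may start (<= 0 means none). */
    static long durationTillNextMillis(
            long lastCompletionNanos, long minPauseNanos, long nowNanos) {
        long earliestNext = lastCompletionNanos + minPauseNanos;
        return (earliestNext - nowNanos) / 1_000_000;
    }

    /** Applies the checks in the same order as the method above. */
    static Decision preCheck(
            boolean forceCheckpoint,
            boolean triggerRequestQueued,
            int pendingCheckpoints,
            int maxConcurrentCheckpointAttempts,
            long lastCompletionNanos,
            long minPauseNanos,
            long nowNanos) {
        // Forced checkpoints (savepoints) skip all limits.
        if (forceCheckpoint) {
            return Decision.PROCEED;
        }
        // At most one trigger request may be queued.
        if (triggerRequestQueued) {
            return Decision.ALREADY_QUEUED;
        }
        // The number of in-flight checkpoints is capped.
        if (pendingCheckpoints >= maxConcurrentCheckpointAttempts) {
            return Decision.TOO_MANY_CONCURRENT_CHECKPOINTS;
        }
        // The minimum pause since the last completion must have elapsed.
        if (durationTillNextMillis(lastCompletionNanos, minPauseNanos, nowNanos) > 0) {
            return Decision.MINIMUM_TIME_BETWEEN_CHECKPOINTS;
        }
        return Decision.PROCEED;
    }
}
```

For example, with a minimum pause of 500 ms and 200 ms elapsed since the last completion, durationTillNextMillis yields 300, so the request is declined and, in the real method, the periodic trigger is rescheduled with that delay. Because the division truncates, a remaining wait of less than one millisecond counts as zero.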
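Actor-based, message-driven coordination
Flink uses Akka for communication among three distributed components: JobClient, JobManager, and TaskManager.
The JobClient takes the job submitted by the user and hands it to the JobManager.
The JobManager then prepares the environment for executing the job.
First, it allocates the resources the job needs, mainly execution slots on the TaskManagers.
Once the resources are allocated, the JobManager deploys the individual tasks to specific TaskManagers.
On receiving a task, a TaskManager spawns a thread to execute it.
Every state change, such as starting or finishing a computation, is reported back to the JobManager.
Based on these state changes, the JobManager steers the execution of the tasks until the job completes.
When the job finishes, its result is returned to the JobClient, which informs the user.
The flow is shown in the following figure: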
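The state-update part of this flow can be mimicked with a toy model in which a job manager tracks the reported state of each deployed task and declares the job finished once every task has finished. This is only an illustrative sketch: JobLifecycleSketch and its enums are hypothetical, there is no Akka or networking involved, and real Flink tracks many more states.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a job manager reacting to task state updates; not Flink code. */
final class JobLifecycleSketch {

    enum TaskState { DEPLOYING, RUNNING, FINISHED }

    enum JobState { RUNNING, FINISHED }

    private final Map<String, TaskState> tasks = new HashMap<>();

    /** Records that a task has been deployed to some task manager. */
    void deploy(String taskName) {
        tasks.put(taskName, TaskState.DEPLOYING);
    }

    /** Applies a state update reported by a task manager and returns the resulting job state. */
    JobState onTaskStateUpdate(String taskName, TaskState newState) {
        if (!tasks.containsKey(taskName)) {
            throw new IllegalArgumentException("Unknown task: " + taskName);
        }
        tasks.put(taskName, newState);
        // The job is finished only when every deployed task has finished.
        for (TaskState state : tasks.values()) {
            if (state != TaskState.FINISHED) {
                return JobState.RUNNING;
            }
        }
        return JobState.FINISHED;
    }
}
```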
CheckpointCoordinatorDeActivator
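CheckpointCoordinatorDeActivator plays the role of an actor; in the code below it is a JobStatusListener. It reacts to job status change notifications by starting or stopping the periodic checkpoint task (as described above, via startCheckpointScheduler() and stopCheckpointScheduler()).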
@Override
public void jobStatusChanges(JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
if (newJobStatus == JobStatus.RUNNING) {
// start the checkpoint scheduler
coordinator.startCheckpointScheduler();
} else {
// anything else should stop the trigger for now
coordinator.stopCheckpointScheduler();
}
}
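The listener is notified of every JobStatus change. As soon as the status becomes RUNNING, the periodic checkpoint task is started; on any other status it is stopped.
CheckpointCoordinatorDeActivator is created in the createActivatorDeactivator() method of CheckpointCoordinator: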
public JobStatusListener createActivatorDeactivator() {
synchronized (lock) {
if (shutdown) {
throw new IllegalArgumentException("Checkpoint coordinator is shut down");
}
if (jobStatusListener == null) {
jobStatusListener = new CheckpointCoordinatorDeActivator(this);
}
return jobStatusListener;
}
}
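The interplay between the lazily created listener and the scheduler can be sketched in isolation. Everything here is a simplified assumption: DeActivatorSketch, its reduced JobStatus enum, and the single-argument jobStatusChanges are hypothetical, and the scheduler is reduced to a boolean flag.

```java
/** Simplified model of the activator/deactivator listener; not Flink code. */
final class DeActivatorSketch {

    /** Reduced, hypothetical subset of job statuses. */
    enum JobStatus { CREATED, RUNNING, SUSPENDED, FINISHED }

    interface JobStatusListener {
        void jobStatusChanges(JobStatus newJobStatus);
    }

    static final class Coordinator {
        private final Object lock = new Object();
        private boolean schedulerRunning;            // guarded by lock
        private JobStatusListener jobStatusListener; // guarded by lock

        void startCheckpointScheduler() {
            synchronized (lock) {
                schedulerRunning = true;
            }
        }

        void stopCheckpointScheduler() {
            synchronized (lock) {
                schedulerRunning = false;
            }
        }

        boolean isSchedulerRunning() {
            synchronized (lock) {
                return schedulerRunning;
            }
        }

        /** Creates the listener on first use and returns the same instance afterwards. */
        JobStatusListener createActivatorDeactivator() {
            synchronized (lock) {
                if (jobStatusListener == null) {
                    jobStatusListener = status -> {
                        if (status == JobStatus.RUNNING) {
                            startCheckpointScheduler();
                        } else {
                            // Any other status stops the trigger for now.
                            stopCheckpointScheduler();
                        }
                    };
                }
                return jobStatusListener;
            }
        }
    }
}
```

Returning the cached instance means that registering the listener more than once cannot produce two independent activators for one coordinator.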
Code flow of checkpoint trigger messages
Actor (listener) registration flow:
JobManager[submitJob] ==> ExecutionGraphBuilder[buildGraph]
==> new ExecutionGraph().enableCheckpointing()
==> registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
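While building the execution graph, creating the CheckpointCoordinator also creates a CheckpointCoordinatorDeActivator and registers it as a job status listener. When a job status change arrives, it starts or stops checkpointing accordingly.
Message-driven code flow:
JobMaster ==> ExecutionGraph[cancel/suspend/restart]
==> notifyJobStatusChange
==> start/stop checkpointing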
AbstractCheckpointMessage
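Base class for checkpoint messages.
- job: a JobID instance identifying the job this message belongs to
- taskExecutionId: an ExecutionAttemptID instance identifying the source or target task of the checkpoint
- checkpointId: the ID of the checkpoint this message coordinates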
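The three shared fields suggest a base class along the following lines. This is a sketch, not the Flink source: CheckpointMessageSketch and AcknowledgeSketch are hypothetical names, and plain String values stand in for JobID and ExecutionAttemptID.

```java
import java.util.Objects;

/** Hypothetical base class carrying the three fields shared by all checkpoint messages. */
abstract class CheckpointMessageSketch {
    private final String job;             // stands in for JobID
    private final String taskExecutionId; // stands in for ExecutionAttemptID
    private final long checkpointId;

    CheckpointMessageSketch(String job, String taskExecutionId, long checkpointId) {
        this.job = Objects.requireNonNull(job, "job");
        this.taskExecutionId = Objects.requireNonNull(taskExecutionId, "taskExecutionId");
        this.checkpointId = checkpointId;
    }

    String getJob() { return job; }

    String getTaskExecutionId() { return taskExecutionId; }

    long getCheckpointId() { return checkpointId; }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        // Messages of different concrete types are never equal.
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        CheckpointMessageSketch that = (CheckpointMessageSketch) o;
        return checkpointId == that.checkpointId
                && job.equals(that.job)
                && taskExecutionId.equals(that.taskExecutionId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(job, taskExecutionId, checkpointId);
    }
}

/** Hypothetical concrete message that adds nothing beyond the shared fields. */
final class AcknowledgeSketch extends CheckpointMessageSketch {
    AcknowledgeSketch(String job, String taskExecutionId, long checkpointId) {
        super(job, taskExecutionId, checkpointId);
    }
}
```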
TriggerCheckpoint
The JobManager sends this message to a TaskManager to tell a task to trigger its checkpoint.
Sent from: triggerCheckpoint in CheckpointCoordinator, as shown above.
// send the messages to the tasks that trigger their checkpoint
for (Execution execution: executions) {
execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}
Handled in: handleCheckpointingMessage in TaskManager:
case message: TriggerCheckpoint =>
val taskExecutionId = message.getTaskExecutionId
val checkpointId = message.getCheckpointId
val timestamp = message.getTimestamp
val checkpointOptions = message.getCheckpointOptions
log.debug(s"Receiver TriggerCheckpoint $checkpointId@$timestamp for $taskExecutionId.")
val task = runningTasks.get(taskExecutionId)
if (task != null) {
task.triggerCheckpointBarrier(checkpointId, timestamp, checkpointOptions)
} else {
log.debug(s"TaskManager received a checkpoint request for unknown task $taskExecutionId.")
}
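The same lookup-and-dispatch step can be written in Java as a minimal sketch. TriggerDispatchSketch and its nested Task interface are hypothetical, String ids stand in for ExecutionAttemptID, and the checkpoint options parameter is left out for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical Java rendering of the trigger dispatch on the task manager side. */
final class TriggerDispatchSketch {

    /** Minimal stand-in for a running task that can start a checkpoint barrier. */
    interface Task {
        void triggerCheckpointBarrier(long checkpointId, long timestamp);
    }

    private final Map<String, Task> runningTasks = new ConcurrentHashMap<>();

    void register(String taskExecutionId, Task task) {
        runningTasks.put(taskExecutionId, task);
    }

    /**
     * Forwards the trigger to the addressed task.
     * Returns false when the task is unknown, in which case the request is dropped.
     */
    boolean handleTriggerCheckpoint(String taskExecutionId, long checkpointId, long timestamp) {
        Task task = runningTasks.get(taskExecutionId);
        if (task == null) {
            return false;
        }
        task.triggerCheckpointBarrier(checkpointId, timestamp);
        return true;
    }
}
```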
DeclineCheckpoint
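The TaskManager sends this message to the JobManager to tell the checkpoint coordinator that a checkpoint request could not be processed.
This typically happens when a task is already in the RUNNING state but is not yet ready internally to perform a checkpoint.
Besides the three fields of AbstractCheckpointMessage, it carries the timestamp of the checkpoint it refers to.
Sent from: triggerCheckpointBarrier in Task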
Handled in: the JobManager side, which hands the message to receiveDeclineMessage in CheckpointCoordinator
AcknowledgeCheckpoint
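This message is an acknowledgement that an individual task has completed its checkpoint. It is also sent from the TaskManager to the JobManager and carries the task's state: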
- state
- stateSize
Sent from: acknowledgeCheckpoint in RuntimeEnvironment
Handled in: receiveAcknowledgeMessage in CheckpointCoordinator
NotifyCheckpointComplete
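The JobManager sends this message to a TaskManager to tell a task that its checkpoint has been confirmed as complete, so the task can commit the checkpoint to third-party systems.
Sent from: receiveAcknowledgeMessage in CheckpointCoordinator, once all tasks have acknowledged the checkpoint and it has been converted into a CompletedCheckpoint
Handled in: handleCheckpointingMessage in TaskManager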
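How acknowledgements accumulate until a checkpoint counts as complete can be sketched with a small tracker. PendingCheckpointSketch is a hypothetical simplification: it ignores state handles, timeouts, and storage, and only models which tasks have not yet acknowledged and whether the checkpoint was declined.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker for acknowledgements of one checkpoint; not Flink code. */
final class PendingCheckpointSketch {
    private final long checkpointId;
    private final Set<String> notYetAcknowledged;
    private boolean discarded;

    PendingCheckpointSketch(long checkpointId, String... tasksToWaitFor) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(Arrays.asList(tasksToWaitFor));
    }

    long getCheckpointId() {
        return checkpointId;
    }

    /**
     * Records an acknowledgement from one task.
     * Returns true only when this call made the checkpoint complete, which is the
     * point at which a coordinator would notify the tasks of completion.
     */
    boolean acknowledge(String taskExecutionId) {
        if (discarded) {
            return false;
        }
        // Duplicate or unexpected acknowledgements do not change anything.
        boolean removed = notYetAcknowledged.remove(taskExecutionId);
        return removed && notYetAcknowledged.isEmpty();
    }

    /** Models a decline message: the checkpoint can no longer complete. */
    void decline() {
        discarded = true;
    }

    boolean isFullyAcknowledged() {
        return !discarded && notYetAcknowledged.isEmpty();
    }
}
```

In this sketch the return value of acknowledge() marks the single transition from pending to complete, so a completion notification would be sent exactly once even if a task acknowledges twice.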