CheckpointCoordinator

org.apache.flink.runtime.checkpoint.CheckpointCoordinator
This is one of the core classes of Flink's fault tolerance.
Call chain:

JobManager[submitJob]  ==>  ExecutionGraphBuilder[buildGraph]  ==>  new ExecutionGraph().enableCheckpointing()  ==>  new CheckpointCoordinator()

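When the JobManager submits a job, it builds the execution graph. If checkpointing is enabled at that point, a new checkpoint coordinator (CheckpointCoordinator) is created; this class coordinates the distributed snapshots of operators and state.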

Periodic checkpoint triggering

ScheduledTrigger

Checkpoints are triggered periodically by a timer. The timer task is an inner class of CheckpointCoordinator: ScheduledTrigger

Its use is described in detail below.

private final class ScheduledTrigger implements Runnable {

	@Override
	public void run() {
		try {
			triggerCheckpoint(System.currentTimeMillis(), true);
		}
		catch (Exception e) {
			LOG.error("Exception while triggering checkpoint for job {}.", job, e);
		}
	}
}
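The try/catch in run() matters if the timer behaves like java.util.concurrent.ScheduledExecutorService (an assumption here): with scheduleAtFixedRate, an exception that escapes one run suppresses all later runs. A minimal sketch of the same pattern follows; TriggerAction and SafeTrigger are hypothetical names, not Flink classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SafeTriggerSketch {

    /** Hypothetical stand-in for the call to triggerCheckpoint(...). */
    interface TriggerAction {
        void trigger(long timestamp) throws Exception;
    }

    /** Wraps a trigger action so that no exception escapes run(). */
    static final class SafeTrigger implements Runnable {
        private final TriggerAction action;
        private final AtomicInteger failures = new AtomicInteger();

        SafeTrigger(TriggerAction action) {
            this.action = action;
        }

        @Override
        public void run() {
            try {
                action.trigger(System.currentTimeMillis());
            } catch (Exception e) {
                // Count the failure instead of rethrowing, so a periodic
                // schedule driving this Runnable keeps firing.
                failures.incrementAndGet();
            }
        }

        int failureCount() {
            return failures.get();
        }
    }

    public static void main(String[] args) {
        SafeTrigger trigger = new SafeTrigger(ts -> {
            throw new IllegalStateException("simulated trigger failure");
        });
        trigger.run();
        trigger.run();
        System.out.println("failures = " + trigger.failureCount()); // prints "failures = 2"
    }
}
```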

startCheckpointScheduler

The method that starts the periodic checkpoint trigger:

public void startCheckpointScheduler() {
	synchronized (lock) {
		if (shutdown) {
			throw new IllegalArgumentException("Checkpoint coordinator is shut down");
		}

		// make sure all previously scheduled triggers are stopped
		stopCheckpointScheduler();

		periodicScheduling = true;
		// start the periodic trigger
		currentPeriodicTrigger = timer.scheduleAtFixedRate(
				new ScheduledTrigger(),
				baseInterval, baseInterval, TimeUnit.MILLISECONDS);
	}
}

stopCheckpointScheduler

The method that stops the periodic trigger; it releases resources and resets a few flag variables.
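The start/stop pair can be sketched with a plain ScheduledExecutorService. This is a simplified illustration, not Flink's implementation: the class name and the shutdown() helper are assumptions, and whatever other cleanup the real stopCheckpointScheduler() performs is not modelled.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class PeriodicSchedulerSketch {
    private final Object lock = new Object();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long baseIntervalMillis;
    private final Runnable trigger;

    private ScheduledFuture<?> currentPeriodicTrigger; // guarded by lock
    private boolean periodicScheduling;                // guarded by lock

    PeriodicSchedulerSketch(long baseIntervalMillis, Runnable trigger) {
        this.baseIntervalMillis = baseIntervalMillis;
        this.trigger = trigger;
    }

    void start() {
        synchronized (lock) {
            stop(); // never keep two periodic triggers alive at once
            periodicScheduling = true;
            currentPeriodicTrigger = timer.scheduleAtFixedRate(
                    trigger, baseIntervalMillis, baseIntervalMillis, TimeUnit.MILLISECONDS);
        }
    }

    void stop() {
        synchronized (lock) {
            periodicScheduling = false;
            if (currentPeriodicTrigger != null) {
                currentPeriodicTrigger.cancel(false); // let a run in progress finish
                currentPeriodicTrigger = null;
            }
        }
    }

    boolean isPeriodicScheduling() {
        synchronized (lock) {
            return periodicScheduling;
        }
    }

    /** Hypothetical helper: stops scheduling and releases the timer thread. */
    void shutdown() {
        stop();
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        PeriodicSchedulerSketch scheduler =
                new PeriodicSchedulerSketch(100, () -> System.out.println("trigger"));
        scheduler.start();
        Thread.sleep(350);
        scheduler.shutdown();
    }
}
```

Calling start() twice is safe because it stops the previous trigger first, which mirrors the stopCheckpointScheduler() call at the top of startCheckpointScheduler().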

triggerCheckpoint

This method holds the core logic for triggering a new checkpoint.

@VisibleForTesting
public CheckpointTriggerResult triggerCheckpoint(
		long timestamp,
		CheckpointProperties props,
		@Nullable String externalSavepointLocation,
		boolean isPeriodic) {

	// run the pre-checks before triggering the checkpoint
	synchronized (lock) {
		// abort if the coordinator has already been shut down
		if (shutdown) {
			return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
		}

		// do not trigger a periodic checkpoint if periodic scheduling has been disabled
		if (isPeriodic && !periodicScheduling) {
			return new CheckpointTriggerResult(CheckpointDeclineReason.PERIODIC_SCHEDULER_SHUTDOWN);
		}

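		// validate whether the checkpoint may be triggered, subject to:
		// 1. the maximum number of concurrent checkpoints
		// 2. the minimum pause between checkpoints
		// (these limits do not apply to savepoints)
		if (!props.forceCheckpoint()) {
			// sanity check: there should never be more than one trigger request queued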
			if (triggerRequestQueued) {
				LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
				return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
			}

			// too many checkpoints in progress: set 'triggerRequestQueued', cancel the periodic trigger, and decline
			if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
				triggerRequestQueued = true;
				if (currentPeriodicTrigger != null) {
					currentPeriodicTrigger.cancel(false);
					currentPeriodicTrigger = null;
				}
				return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
			}

			// enforce the minimum pause between checkpoints
			final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
			final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;

			if (durationTillNextMillis > 0) {
				if (currentPeriodicTrigger != null) {
					currentPeriodicTrigger.cancel(false);
					currentPeriodicTrigger = null;
				}
				// Reassign the new trigger to the currentPeriodicTrigger
				currentPeriodicTrigger = timer.scheduleAtFixedRate(
						new ScheduledTrigger(),
						durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);

				return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
			}
		}
	}

	// make sure every task that must trigger the checkpoint is RUNNING;
	// if any one of them is not, abort
	Execution[] executions = new Execution[tasksToTrigger.length];
	for (int i = 0; i < tasksToTrigger.length; i++) {
		Execution ee = tasksToTrigger[i].getCurrentExecutionAttempt();
		if (ee != null && ee.getState() == ExecutionState.RUNNING) {
			executions[i] = ee;
		} else {
			LOG.info("Checkpoint triggering task {} of job {} is not being executed at the moment. Aborting checkpoint.",
					tasksToTrigger[i].getTaskNameWithSubtaskIndex(),
					job);
			return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
		}
	}

	// make sure every task that must acknowledge the checkpoint has a current execution attempt;
	// if any one of them does not, abort
	Map<ExecutionAttemptID, ExecutionVertex> ackTasks = new HashMap<>(tasksToWaitFor.length);

	for (ExecutionVertex ev : tasksToWaitFor) {
		Execution ee = ev.getCurrentExecutionAttempt();
		if (ee != null) {
			ackTasks.put(ee.getAttemptId(), ev);
		} else {
			LOG.info("Checkpoint acknowledging task {} of job {} is not being executed at the moment. Aborting checkpoint.",
					ev.getTaskNameWithSubtaskIndex(),
					job);
			return new CheckpointTriggerResult(CheckpointDeclineReason.NOT_ALL_REQUIRED_TASKS_RUNNING);
		}
	}

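	// the actual triggering of the checkpoint starts here

	// a separate lock, triggerLock, ensures that trigger requests do not overtake each other;
	// the coordinator-wide lock is not used here, so that the coordinator can keep processing
	// incoming acknowledge/decline messages in the meantime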
	synchronized (triggerLock) {

		final CheckpointStorageLocation checkpointStorageLocation;
		final long checkpointID;

		try {
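			// obtain the checkpoint ID;
			// this happens outside the coordinator-wide lock because, in HA mode, generating the ID
			// talks to an external service and may block for a while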
			checkpointID = checkpointIdCounter.getAndIncrement();

			checkpointStorageLocation = props.isSavepoint() ?
					checkpointStorage.initializeLocationForSavepoint(checkpointID, externalSavepointLocation) :
					checkpointStorage.initializeLocationForCheckpoint(checkpointID);
		}
		catch (Throwable t) {
			int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
			LOG.warn("Failed to trigger checkpoint for job {} ({} consecutive failed attempts so far).",
					job,
					numUnsuccessful,
					t);
			return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
		}

		final PendingCheckpoint checkpoint = new PendingCheckpoint(
			job,
			checkpointID,
			timestamp,
			ackTasks,
			props,
			checkpointStorageLocation,
			executor);

		if (statsTracker != null) {
			PendingCheckpointStats callback = statsTracker.reportPendingCheckpoint(
				checkpointID,
				timestamp,
				props);

			checkpoint.setStatsCallback(callback);
		}

		// canceller that cleans up this checkpoint if it times out
		final Runnable canceller = () -> {
			synchronized (lock) {
				if (!checkpoint.isDiscarded()) {
					LOG.info("Checkpoint {} of job {} expired before completing.", checkpointID, job);

					checkpoint.abortExpired();
					pendingCheckpoints.remove(checkpointID);
					rememberRecentCheckpointId(checkpointID);

					triggerQueuedRequests();
				}
			}
		};

		try {
			// re-acquire the coordinator-wide lock
			synchronized (lock) {
				// the lock was released in between, so the state may have changed: re-check all conditions
				if (shutdown) {
					return new CheckpointTriggerResult(CheckpointDeclineReason.COORDINATOR_SHUTDOWN);
				}
				else if (!props.forceCheckpoint()) {
					if (triggerRequestQueued) {
						LOG.warn("Trying to trigger another checkpoint for job {} while one was queued already.", job);
						return new CheckpointTriggerResult(CheckpointDeclineReason.ALREADY_QUEUED);
					}

					if (pendingCheckpoints.size() >= maxConcurrentCheckpointAttempts) {
						triggerRequestQueued = true;
						if (currentPeriodicTrigger != null) {
							currentPeriodicTrigger.cancel(false);
							currentPeriodicTrigger = null;
						}
						return new CheckpointTriggerResult(CheckpointDeclineReason.TOO_MANY_CONCURRENT_CHECKPOINTS);
					}

					final long earliestNext = lastCheckpointCompletionNanos + minPauseBetweenCheckpointsNanos;
					final long durationTillNextMillis = (earliestNext - System.nanoTime()) / 1_000_000;

					if (durationTillNextMillis > 0) {
						if (currentPeriodicTrigger != null) {
							currentPeriodicTrigger.cancel(false);
							currentPeriodicTrigger = null;
						}

						currentPeriodicTrigger = timer.scheduleAtFixedRate(
								new ScheduledTrigger(),
								durationTillNextMillis, baseInterval, TimeUnit.MILLISECONDS);

						return new CheckpointTriggerResult(CheckpointDeclineReason.MINIMUM_TIME_BETWEEN_CHECKPOINTS);
					}
				}

				LOG.info("Triggering checkpoint {} @ {} for job {}.", checkpointID, timestamp, job);

				// the conditions still hold after the re-check, so register the PendingCheckpoint created above
				pendingCheckpoints.put(checkpointID, checkpoint);

				// schedule the timeout canceller for this checkpoint
				ScheduledFuture<?> cancellerHandle = timer.schedule(
						canceller,
						checkpointTimeout, TimeUnit.MILLISECONDS);

				if (!checkpoint.setCancellerHandle(cancellerHandle)) {
					// the checkpoint has already been disposed, so cancel the timeout handle
					cancellerHandle.cancel(false);
				}

				// trigger the master hooks for the checkpoint
				final List<MasterState> masterStates = MasterHooks.triggerMasterHooks(masterHooks.values(),
						checkpointID, timestamp, executor, Time.milliseconds(checkpointTimeout));
				for (MasterState s : masterStates) {
					checkpoint.addMasterState(s);
				}
			}

			final CheckpointOptions checkpointOptions = new CheckpointOptions(
					props.getCheckpointType(),
					checkpointStorageLocation.getLocationReference());

			// tell every task to trigger its checkpoint (built on Akka's actor messaging)
			for (Execution execution: executions) {
				execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
			}

			numUnsuccessfulCheckpointsTriggers.set(0);
			return new CheckpointTriggerResult(checkpoint);
		}
		catch (Throwable t) {
			// clean up resources
		}

	} // end trigger lock
}
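The pre-checks that triggerCheckpoint runs twice (once before and once after taking triggerLock) can be read as one decision function. The sketch below extracts only that decision; the side effects in the real method (cancelling and rescheduling currentPeriodicTrigger, setting triggerRequestQueued) are left out, and the class, enum, and parameter names are hypothetical.

```java
final class TriggerPrecheckSketch {

    enum Decision {
        TRIGGER,
        ALREADY_QUEUED,
        TOO_MANY_CONCURRENT_CHECKPOINTS,
        MINIMUM_TIME_BETWEEN_CHECKPOINTS
    }

    /**
     * Mirrors the order of the checks applied to non-forced checkpoints:
     * queued request first, then the concurrency limit, then the minimum pause.
     */
    static Decision precheck(
            boolean triggerRequestQueued,
            int pendingCheckpoints,
            int maxConcurrentCheckpoints,
            long lastCompletionNanos,
            long minPauseNanos,
            long nowNanos) {

        if (triggerRequestQueued) {
            return Decision.ALREADY_QUEUED;
        }
        if (pendingCheckpoints >= maxConcurrentCheckpoints) {
            return Decision.TOO_MANY_CONCURRENT_CHECKPOINTS;
        }
        long earliestNext = lastCompletionNanos + minPauseNanos;
        // Integer division: a remaining pause below one millisecond rounds to 0 and passes.
        long durationTillNextMillis = (earliestNext - nowNanos) / 1_000_000;
        if (durationTillNextMillis > 0) {
            return Decision.MINIMUM_TIME_BETWEEN_CHECKPOINTS;
        }
        return Decision.TRIGGER;
    }

    public static void main(String[] args) {
        // Last checkpoint completed at t=0, minimum pause 5 s, now t=1 s: 4000 ms still to wait.
        System.out.println(precheck(false, 0, 1, 0L, 5_000_000_000L, 1_000_000_000L));
        // prints MINIMUM_TIME_BETWEEN_CHECKPOINTS
    }
}
```

Because the coordinator-wide lock is released between the two rounds, the same function has to be evaluated again on fresh state before the PendingCheckpoint is registered.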

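Actor-based, message-driven coordination

Flink uses Akka for communication between three distributed components: JobClient, JobManager, and TaskManager:

The JobClient takes the job submitted by the user and submits it to the JobManager.

The JobManager then prepares the environment for executing the submitted job.

First, it allocates the resources the job needs, which are mainly execution slots on the TaskManagers.

Once the resources are allocated, the JobManager deploys the individual tasks to specific TaskManagers.

On receiving a task, the TaskManager creates a thread to execute it.

All state changes, such as a computation starting or finishing, are sent back to the JobManager.

Based on these state changes, the JobManager steers the execution of the tasks until the job completes.

Once the job has finished, its result is returned to the JobClient, which in turn informs the user.

The flow is shown in the figure below: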

CheckpointCoordinatorDeActivator

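The class CheckpointCoordinatorDeActivator is an actor implementation (think of it as a listener). Driven by messages (from the TaskManager), it starts and stops the periodic checkpoint task; as described above, the start and stop methods are startCheckpointScheduler() and stopCheckpointScheduler().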

@Override
public void jobStatusChanges(JobID jobId, JobStatus newJobStatus, long timestamp, Throwable error) {
	if (newJobStatus == JobStatus.RUNNING) {
		// start the checkpoint scheduler
		coordinator.startCheckpointScheduler();
	} else {
		// anything else should stop the trigger for now
		coordinator.stopCheckpointScheduler();
	}
}

This actor is notified of every JobStatus change. As soon as the status becomes RUNNING, the periodic checkpoint task is started; for any other status it is stopped.

CheckpointCoordinatorDeActivator is created in the createActivatorDeactivator() method of CheckpointCoordinator:
public JobStatusListener createActivatorDeactivator() {
	synchronized (lock) {
		if (shutdown) {
			throw new IllegalArgumentException("Checkpoint coordinator is shut down");
		}

		if (jobStatusListener == null) {
			jobStatusListener = new CheckpointCoordinatorDeActivator(this);
		}

		return jobStatusListener;
	}
}

Code flow of checkpoint trigger message passing

Actor (listener) registration flow:

JobManager[submitJob]  ==>  ExecutionGraphBuilder[buildGraph]  
==>  new ExecutionGraph().enableCheckpointing()
==>  registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());

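While the execution graph is being built, a CheckpointCoordinatorDeActivator is created and registered at the same time as the checkpoint coordinator (CheckpointCoordinator). When a message arrives from the TaskManager, it starts or stops checkpointing accordingly.

Message-driven code flow:

JobMaster  ==>  ExecutionGraph[cancel/suspend/restart]  
==>  notifyJobStatusChange  
==>  start/stop checkpointing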
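The registration and notification flow can be sketched end to end with a few stand-in types. Everything below is a simplified assumption for illustration: Graph and Coordinator are not Flink's ExecutionGraph and CheckpointCoordinator, the JobStatus enum lists only a few states, and the listener takes just the new status.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class JobStatusNotificationSketch {

    enum JobStatus { CREATED, RUNNING, CANCELLING, SUSPENDED, RESTARTING }

    interface JobStatusListener {
        void jobStatusChanges(JobStatus newJobStatus);
    }

    /** Stand-in for the coordinator: only records whether the periodic scheduler is on. */
    static final class Coordinator {
        private volatile boolean schedulerRunning;

        void startCheckpointScheduler() { schedulerRunning = true; }
        void stopCheckpointScheduler() { schedulerRunning = false; }
        boolean isSchedulerRunning() { return schedulerRunning; }
    }

    /** Stand-in for the execution graph: keeps listeners and notifies them on every transition. */
    static final class Graph {
        private final List<JobStatusListener> listeners = new CopyOnWriteArrayList<>();
        private volatile JobStatus state = JobStatus.CREATED;

        void registerJobStatusListener(JobStatusListener listener) {
            listeners.add(listener);
        }

        void transitionTo(JobStatus newState) {
            state = newState;
            for (JobStatusListener listener : listeners) {
                listener.jobStatusChanges(newState);
            }
        }

        JobStatus getState() { return state; }
    }

    /** Same rule as the deactivator above: RUNNING starts the scheduler, anything else stops it. */
    static JobStatusListener activatorFor(Coordinator coordinator) {
        return status -> {
            if (status == JobStatus.RUNNING) {
                coordinator.startCheckpointScheduler();
            } else {
                coordinator.stopCheckpointScheduler();
            }
        };
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        Graph graph = new Graph();
        graph.registerJobStatusListener(activatorFor(coordinator));

        graph.transitionTo(JobStatus.RUNNING);
        System.out.println(coordinator.isSchedulerRunning()); // prints true
        graph.transitionTo(JobStatus.RESTARTING);
        System.out.println(coordinator.isSchedulerRunning()); // prints false
    }
}
```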

AbstractCheckpointMessage

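The base class of all checkpoint messages.

job: a JobID instance identifying the job this message belongs to;
taskExecutionId: an ExecutionAttemptID instance identifying the source/target task of the checkpoint;
checkpointId: the ID of the checkpoint this message coordinates.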
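A minimal sketch of such a message hierarchy is shown below. It is not the Flink class: plain String values stand in for JobID and ExecutionAttemptID, and the Trigger subclass carries only a timestamp.

```java
import java.io.Serializable;
import java.util.Objects;

final class CheckpointMessageSketch {

    /** Base message holding the three shared fields. */
    abstract static class AbstractMessage implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String job;             // stand-in for JobID
        private final String taskExecutionId; // stand-in for ExecutionAttemptID
        private final long checkpointId;

        protected AbstractMessage(String job, String taskExecutionId, long checkpointId) {
            this.job = Objects.requireNonNull(job);
            this.taskExecutionId = Objects.requireNonNull(taskExecutionId);
            this.checkpointId = checkpointId;
        }

        String getJob() { return job; }
        String getTaskExecutionId() { return taskExecutionId; }
        long getCheckpointId() { return checkpointId; }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()) {
                return false;
            }
            AbstractMessage that = (AbstractMessage) o;
            return checkpointId == that.checkpointId
                    && job.equals(that.job)
                    && taskExecutionId.equals(that.taskExecutionId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(job, taskExecutionId, checkpointId);
        }
    }

    /** Example subclass: a trigger message that adds the checkpoint timestamp. */
    static final class Trigger extends AbstractMessage {
        private static final long serialVersionUID = 1L;
        private final long timestamp;

        Trigger(String job, String taskExecutionId, long checkpointId, long timestamp) {
            super(job, taskExecutionId, checkpointId);
            this.timestamp = timestamp;
        }

        long getTimestamp() { return timestamp; }

        @Override
        public boolean equals(Object o) {
            // super.equals already guarantees that o is a Trigger
            return super.equals(o) && timestamp == ((Trigger) o).timestamp;
        }

        @Override
        public int hashCode() {
            return 31 * super.hashCode() + Long.hashCode(timestamp);
        }
    }

    public static void main(String[] args) {
        Trigger a = new Trigger("job-1", "attempt-7", 42L, 1000L);
        Trigger b = new Trigger("job-1", "attempt-7", 42L, 1000L);
        System.out.println(a.equals(b)); // prints true
    }
}
```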

TriggerCheckpoint

This message is sent by the JobManager to the TaskManager to tell a task to trigger its checkpoint.

Sending: in triggerCheckpoint of the CheckpointCoordinator class, as shown above.

// send the messages to the tasks that trigger their checkpoint
for (Execution execution: executions) {
	execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}

Handling: the handleCheckpointingMessage implementation of TaskManager:

case message: TriggerCheckpoint =>
	val taskExecutionId = message.getTaskExecutionId
	val checkpointId = message.getCheckpointId
	val timestamp = message.getTimestamp
	val checkpointOptions = message.getCheckpointOptions

	log.debug(s"Receiver TriggerCheckpoint $checkpointId@$timestamp for $taskExecutionId.")

	val task = runningTasks.get(taskExecutionId)
	if (task != null) {
	  task.triggerCheckpointBarrier(checkpointId, timestamp, checkpointOptions)
	} else {
	  log.debug(s"TaskManager received a checkpoint request for unknown task $taskExecutionId.")
	}
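The same dispatch, sketched in Java for readers less familiar with Scala pattern matching: look the task up by its execution ID, forward the trigger if it is known, and otherwise drop the request. The interface and class names are hypothetical, and String IDs stand in for ExecutionAttemptID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TriggerDispatchSketch {

    /** Hypothetical view of a task that can start a checkpoint. */
    interface CheckpointableTask {
        void triggerCheckpointBarrier(long checkpointId, long timestamp);
    }

    private final Map<String, CheckpointableTask> runningTasks = new ConcurrentHashMap<>();

    void register(String taskExecutionId, CheckpointableTask task) {
        runningTasks.put(taskExecutionId, task);
    }

    /** Returns true if the trigger reached a running task, false if the task is unknown. */
    boolean handleTriggerCheckpoint(String taskExecutionId, long checkpointId, long timestamp) {
        CheckpointableTask task = runningTasks.get(taskExecutionId);
        if (task == null) {
            // Unknown task: nothing to trigger, so the request is ignored.
            return false;
        }
        task.triggerCheckpointBarrier(checkpointId, timestamp);
        return true;
    }

    public static void main(String[] args) {
        TriggerDispatchSketch taskManager = new TriggerDispatchSketch();
        taskManager.register("attempt-1",
                (id, ts) -> System.out.println("barrier for checkpoint " + id));

        System.out.println(taskManager.handleTriggerCheckpoint("attempt-1", 7L, 0L)); // prints true
        System.out.println(taskManager.handleTriggerCheckpoint("attempt-2", 7L, 0L)); // prints false
    }
}
```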

DeclineCheckpoint

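This message is sent by the TaskManager to the JobManager to tell the checkpoint coordinator that a checkpoint request could not be processed.

This typically happens when a task is already in the RUNNING state but is not yet ready internally to perform a checkpoint.

Besides the three properties required by AbstractCheckpointMessage, it also carries the timestamp used to associate it with the checkpoint.

Sending: in the triggerCheckpointBarrier method of the Task class

Handling: the handleCheckpointingMessage implementation of TaskManager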
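The decline case described above can be sketched on the task side as follows. This is only an illustration of the idea: the Responder type, the two boolean flags, and the reason string are assumptions, not the fields of Flink's Task class.

```java
import java.util.ArrayList;
import java.util.List;

final class DeclineCheckpointSketch {

    /** Hypothetical channel back to the coordinator; it only records what was declined. */
    static final class Responder {
        private final List<String> declined = new ArrayList<>();

        void declineCheckpoint(long checkpointId, String reason) {
            declined.add(checkpointId + ": " + reason);
        }

        List<String> declined() { return declined; }
    }

    static final class Task {
        private final Responder responder;
        private boolean running;        // the execution state is RUNNING
        private boolean invokableReady; // the task's internals are ready to checkpoint

        Task(Responder responder) {
            this.responder = responder;
        }

        void setRunning(boolean running) { this.running = running; }
        void setInvokableReady(boolean ready) { this.invokableReady = ready; }

        /** Returns true if the barrier was triggered, false if the request was declined. */
        boolean triggerCheckpointBarrier(long checkpointId) {
            if (running && invokableReady) {
                return true; // the real task would start its snapshot here
            }
            responder.declineCheckpoint(checkpointId, "task not ready");
            return false;
        }
    }

    public static void main(String[] args) {
        Responder responder = new Responder();
        Task task = new Task(responder);
        task.setRunning(true); // RUNNING, but not yet ready internally

        System.out.println(task.triggerCheckpointBarrier(5L)); // prints false
        System.out.println(responder.declined());              // prints [5: task not ready]
    }
}
```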

AcknowledgeCheckpoint

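This message is an acknowledgement signalling that an individual task has completed its checkpoint. It is also sent by the TaskManager to the JobManager. The message carries the task's state:

state
stateSize

Sending: the acknowledgeCheckpoint method of the RuntimeEnvironment class
Handling: implemented in receiveAcknowledgeMessage of CheckpointCoordinator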
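On the coordinator side, the bookkeeping behind these acknowledgements can be pictured as a set of tasks that have not yet answered. The sketch below is an assumption-laden simplification of a pending checkpoint: it tracks String task IDs and state sizes only, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class PendingAckSketch {
    private final long checkpointId;
    private final Set<String> notYetAcknowledged;
    private final Map<String, Long> stateSizes = new HashMap<>();

    PendingAckSketch(long checkpointId, Set<String> tasksToWaitFor) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(tasksToWaitFor);
    }

    /** Returns true if the acknowledgement was expected, false for unknown or duplicate senders. */
    synchronized boolean acknowledge(String taskExecutionId, long stateSize) {
        if (!notYetAcknowledged.remove(taskExecutionId)) {
            return false;
        }
        stateSizes.put(taskExecutionId, stateSize);
        return true;
    }

    /** The checkpoint can be completed once every expected task has acknowledged. */
    synchronized boolean isFullyAcknowledged() {
        return notYetAcknowledged.isEmpty();
    }

    synchronized long totalStateSize() {
        long total = 0;
        for (long size : stateSizes.values()) {
            total += size;
        }
        return total;
    }

    long getCheckpointId() { return checkpointId; }

    public static void main(String[] args) {
        PendingAckSketch pending =
                new PendingAckSketch(1L, new HashSet<>(Arrays.asList("task-a", "task-b")));

        pending.acknowledge("task-a", 10L);
        System.out.println(pending.isFullyAcknowledged()); // prints false
        pending.acknowledge("task-b", 5L);
        System.out.println(pending.isFullyAcknowledged()); // prints true
        System.out.println(pending.totalStateSize());      // prints 15
    }
}
```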

NotifyCheckpointComplete

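This message is sent by the JobManager to the TaskManager to tell a task that its checkpoint has been confirmed as complete, so the task can commit the checkpoint to third-party systems.

Sending: in the receiveAcknowledgeMessage method of the CheckpointCoordinator class, once all tasks have acknowledged the checkpoint and it has been converted into a CompletedCheckpoint

Handling: handleCheckpointingMessage of TaskManager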
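One way a task might use this notification is to hold back output until the checkpoint that covers it is confirmed, and only then publish it to the external system. The sketch below illustrates that idea under stated assumptions: it is not Flink code, the class and method names are hypothetical, and an in-memory list stands in for the third-party system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CommitOnNotifySketch {
    // Records captured by a snapshot but not yet confirmed, keyed by checkpoint ID.
    private final TreeMap<Long, List<String>> pendingByCheckpoint = new TreeMap<>();
    // Records written since the last snapshot.
    private List<String> current = new ArrayList<>();
    // Stand-in for the external system that receives committed records.
    private final List<String> committed = new ArrayList<>();

    void write(String record) {
        current.add(record);
    }

    /** Called when the task takes its snapshot for the given checkpoint. */
    void snapshot(long checkpointId) {
        pendingByCheckpoint.put(checkpointId, current);
        current = new ArrayList<>();
    }

    /** Called on NotifyCheckpointComplete: commit everything up to and including checkpointId. */
    void notifyCheckpointComplete(long checkpointId) {
        Map<Long, List<String>> ready = pendingByCheckpoint.headMap(checkpointId, true);
        for (List<String> batch : ready.values()) {
            committed.addAll(batch);
        }
        ready.clear(); // the view is backed by the map, so this removes the committed batches
    }

    List<String> committed() {
        return committed;
    }

    public static void main(String[] args) {
        CommitOnNotifySketch sink = new CommitOnNotifySketch();
        sink.write("a");
        sink.snapshot(1L);
        sink.write("b");
        sink.snapshot(2L);
        sink.write("c");

        sink.notifyCheckpointComplete(1L);
        System.out.println(sink.committed()); // prints [a]
        sink.notifyCheckpointComplete(2L);
        System.out.println(sink.committed()); // prints [a, b]
    }
}
```

The record "c" stays uncommitted because no confirmed checkpoint covers it yet.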

Flink 1.4 Fault Tolerance Source Code Analysis, Part 2
