Spark Checkpoint: Source Code and Mechanism

1 Overview

A checkpoint creates a known good point from which an engine can start applying changes recorded in its log during recovery after an unexpected shutdown or crash. (That wording is borrowed from the SQL Server documentation, but the idea applies to Spark just as well.)

Streaming computation needs strong fault tolerance to keep a program stable and robust. Let's look at the source code to see what Checkpoint actually does in Spark. Searching the source, you can find the Checkpoint class in the Streaming package.

As the entry point of a Spark program, SparkContext is the first place to look at how Checkpoint is handled. As we know, SparkContext holds references to many of Spark's internal objects, and in it we can find how the checkpoint directory path is defined.

Let's start with a simple piece of code (Spark version 2.1.1):

  val sparkConf = new SparkConf().setAppName("streaming").setMaster("local[*]")
  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
  val sc = sparkSession.sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))
  // Set the checkpoint directory
  ssc.checkpoint("./streaming_checkpoint")

Take a look at ssc.checkpoint:

  def checkpoint(directory: String) {
    if (directory != null) {
      val path = new Path(directory)
      val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
      fs.mkdirs(path)
      val fullPath = fs.getFileStatus(path).getPath().toString
      sc.setCheckpointDir(fullPath)
      checkpointDir = fullPath
    } else {
      checkpointDir = null
    }
  }

Now take a look at sc.setCheckpointDir:

// Definition of checkpointDir
private[spark] var checkpointDir: Option[String] = None

/**
 * Set the directory under which RDDs are going to be checkpointed. The directory must
 * be a HDFS path if running on a cluster.
 */
def setCheckpointDir(directory: String) {
  // If we are running on a cluster, log a warning if the directory is local.
  // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
  // its own local file system, which is incorrect because the checkpoint files
  // are actually on the executor machines.
  // In cluster mode, setting a local directory triggers this warning: the directory that
  // gets created is local to each executor, not shared. It can work, but it is hardly
  // sensible; checkpoint data is best kept on a distributed file system.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }

  checkpointDir = Option(directory).map { dir =>
    // The directory name is simply generated by UUID.randomUUID()
    val path = new Path(dir, UUID.randomUUID().toString)
    val fs = path.getFileSystem(hadoopConfiguration)
    fs.mkdirs(path)
    fs.getFileStatus(path).getPath.toString
  }
}

As for which classes call setCheckpointDir, a usage search shows that besides the obvious StreamingContext (fault tolerance is a basic guarantee for streaming computation), the other callers are scenarios that iterate over the same RDD repeatedly, including various machine-learning algorithms such as ALS and Decision Tree. These algorithms reuse an RDD many times; on large datasets Cache alone is not enough, so Checkpoint is generally used.
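To make the iterative use case concrete, here is a minimal sketch (the data, the update step, and the checkpoint interval are hypothetical, not taken from any MLlib algorithm) of periodically checkpointing an RDD whose lineage grows on every iteration:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iter-ckpt").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/iter-ckpt")   // hypothetical local path, fine for local mode

    var ranks = sc.parallelize(1 to 1000).map(i => (i, 1.0))
    for (iter <- 1 to 30) {
      // each iteration appends one more stage to the lineage
      ranks = ranks.mapValues(v => v * 0.85 + 0.15)
      if (iter % 10 == 0) {
        ranks.cache()       // cache first so the checkpoint job does not replay the whole lineage
        ranks.checkpoint()  // mark for checkpointing; the lineage is truncated after the next action
        ranks.count()       // an action materializes both the cache and the checkpoint
      }
    }
    println(ranks.take(3).mkString(", "))
    sc.stop()
  }
}

Note that checkpoint() only marks the RDD; the data is actually written (and the lineage cut) when the next action runs, which is why the count() is there.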

Here I only plan to dig into the code in spark core. A handy IDEA feature: from the Find window you can export the search results for a keyword to an external file, which makes it easy to see how the Checkpoint-related code in spark core is organized.

Continuing to look for Checkpoint-related code, you can see that the last line of runJob is a call to rdd.doCheckpoint(). We know runJob is the method that triggers an action, so let's step into doCheckpoint().

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}

 

And with that we have basically found the core method of Checkpoint. doCheckpoint() is a private[spark] method of RDD, which essentially answers the question raised at the beginning: when we talk about checkpointing, what exactly is being checkpointed? The answer is the RDD.

private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    // Has doCheckpoint already been called on this rdd? If not, start processing
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      // Check whether checkpointData (the RDDCheckpointData) has been defined; if it has:
      if (checkpointData.isDefined) {
        // Check whether all of this rdd's dependencies, i.e. its whole lineage, should be checkpointed too
        if (checkpointAllMarkedAncestors) {
          // every rdd on the lineage recursively calls this method
          dependencies.foreach(_.rdd.doCheckpoint())
        }
        // call RDDCheckpointData's checkpoint method
        checkpointData.get.checkpoint()
      } else {
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}

 

The code above first checks whether the variable checkpointData is defined. It is declared like this:

private[spark] var checkpointData: Option[RDDCheckpointData[T]] = None

Next, let's see what kind of data structure RDDCheckpointData is.

/**
 * This class contains all the information related to RDD checkpointing. Each instance of this
 * class is associated with an RDD. It manages process of checkpointing of the associated RDD,
 * as well as, manages the post-checkpoint state by providing the updated partitions,
 * iterator and preferred locations of the checkpointed RDD.
 */
private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends Serializable {
  import CheckpointState._
  // The checkpoint state of the associated RDD.
  protected var cpState = Initialized
  // The RDD that contains our checkpointed data
  // Obviously, this is the RDD holding the checkpointed data
  private var cpRDD: Option[CheckpointRDD[T]] = None
  // TODO: are we sure we need to use a global lock in the following methods?
  /**
   * Return whether the checkpoint data for this RDD is already persisted.
   */
  def isCheckpointed: Boolean = RDDCheckpointData.synchronized {
    cpState == Checkpointed
  }
  /**
   * Materialize this RDD and persist its content.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  final def checkpoint(): Unit = {
    // Guard against multiple threads checkpointing the same RDD by
    // atomically flipping the state of this RDDCheckpointData
    RDDCheckpointData.synchronized {
      if (cpState == Initialized) {
        cpState = CheckpointingInProgress
      } else {
        return
      }
    }
    val newRDD = doCheckpoint()
    // Update our state and truncate the RDD lineage
    // Note that cpRDD is assigned here from newRDD, which is produced by doCheckpoint()
    RDDCheckpointData.synchronized {
      cpRDD = Some(newRDD)
      cpState = Checkpointed
      rdd.markCheckpointed()
    }
  }
  /**
   * Materialize this RDD and persist its content.
   *
   * Subclasses should override this method to define custom checkpointing behavior.
   * @return the checkpoint RDD created in the process.
   */
   // This is the abstract method that produces the checkpoint RDD
  protected def doCheckpoint(): CheckpointRDD[T]
  /**
   * Return the RDD that contains our checkpointed data.
   * This is only defined if the checkpoint state is `Checkpointed`.
   */
  def checkpointRDD: Option[CheckpointRDD[T]] = RDDCheckpointData.synchronized { cpRDD }
  /**
   * Return the partitions of the resulting checkpoint RDD.
   * For tests only.
   */
  def getPartitions: Array[Partition] = RDDCheckpointData.synchronized {
    cpRDD.map(_.partitions).getOrElse { Array.empty }
  }
}

According to the comment, this class contains all the information related to checkpointing an RDD. Besides driving the checkpoint process itself, it also manages the post-checkpoint state. Speaking of state changes, let's see how the checkpoint states are defined.

/**
 * Enumeration to manage state transitions of an RDD through checkpointing
 *
 * [ Initialized --{@literal >} checkpointing in progress --{@literal >} checkpointed ]
 */
private[spark] object CheckpointState extends Enumeration {
  type CheckpointState = Value
  val Initialized, CheckpointingInProgress, Checkpointed = Value
}

Clearly a checkpoint moves through three states: Initialized -> CheckpointingInProgress -> Checkpointed.


RDDCheckpointData has two implementations; let's look at each in turn (a short usage sketch follows the list).

  1. LocalRDDCheckpointData: the RDD is saved to the executor's local file system, avoiding the heavy cost of writing to a distributed, fault-tolerant file system. Local checkpointing is therefore built on persistence and never writes to an external distributed file system.
  2. ReliableRDDCheckpointData: Reliable is self-explanatory: the RDD is checkpointed to a reliable file system, meaning that even after a driver restart the job can recover from the checkpointed state without redoing the RDD transformations.
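As a quick usage sketch (paths and numbers here are hypothetical), the two flavors are reached from the user-facing API as follows: rdd.localCheckpoint() goes through LocalRDDCheckpointData, while sc.setCheckpointDir(...) followed by rdd.checkpoint() goes through ReliableRDDCheckpointData:

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointFlavors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ckpt-flavors").setMaster("local[*]"))

    // Reliable checkpoint: written to the configured checkpoint directory (HDFS on a real cluster)
    sc.setCheckpointDir("/tmp/reliable-ckpt")   // hypothetical path
    val reliable = sc.parallelize(1 to 100).map(_ * 2)
    reliable.checkpoint()
    reliable.count()                            // the action materializes the checkpoint

    // Local checkpoint: backed by executor-local, disk-enabled persistence;
    // localCheckpoint() itself marks the RDD with a disk-backed storage level
    val local = sc.parallelize(1 to 100).map(_ * 2)
    local.localCheckpoint()
    local.count()

    sc.stop()
  }
}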

Now look more closely at RDDCheckpointData's checkpoint method:

final def checkpoint(): Unit = {
  // Flip the checkpoint state from Initialized to CheckpointingInProgress
  RDDCheckpointData.synchronized {
    if (cpState == Initialized) {
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }
  // Call the subclass's doCheckpoint (take ReliableRDDCheckpointData as the example), which creates a new CheckpointRDD
  val newRDD = doCheckpoint()

  // Set the state to Checkpointed and rewrite the rdd's old dependencies, making the newly created CheckpointRDD its parent
  RDDCheckpointData.synchronized {
    cpRDD = Some(newRDD)
    cpState = Checkpointed
    rdd.markCheckpointed()
  }
}

1.1 LocalRDDCheckpointData

The core method of LocalRDDCheckpointData is doCheckpoint(). It requires the RDD to be persisted with a storage level that uses disk, may run a Spark job to cache any partitions that have not been computed yet, and finally news up a LocalCheckpointRDD instance.

/**
 * Ensure the RDD is fully cached so the partitions can be recovered later.
 */
protected override def doCheckpoint(): CheckpointRDD[T] = {
  val level = rdd.getStorageLevel

  // Assume storage level uses disk; otherwise memory eviction may cause data loss
  assume(level.useDisk, s"Storage level $level is not appropriate for local checkpointing")

  // Not all actions compute all partitions of the RDD (e.g. take). For correctness, we
  // must cache any missing partitions. TODO: avoid running another job here (SPARK-8582).
  val action = (tc: TaskContext, iterator: Iterator[T]) => Utils.getIteratorSize(iterator)
  val missingPartitionIndices = rdd.partitions.map(_.index).filter { i =>
    !SparkEnv.get.blockManager.master.contains(RDDBlockId(rdd.id, i))
  }
  if (missingPartitionIndices.nonEmpty) {
    rdd.sparkContext.runJob(rdd, action, missingPartitionIndices)
  }

  new LocalCheckpointRDD[T](rdd)
}

1.2 ReliableRDDCheckpointData

This is the checkpoint class that writes to an external file system.

/**
 * Materialize this RDD and write its content to a reliable DFS.
 * This is called immediately after the first action invoked on this RDD has completed.
 */
protected override def doCheckpoint(): CheckpointRDD[T] = {
  val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

  // Optionally clean our checkpoint files if the reference is out of scope
  if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
    rdd.context.cleaner.foreach { cleaner =>
      cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
    }
  }

  logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
  newRDD
}

The core of it is writing newRDD via ReliableCheckpointRDD.writeRDDToCheckpointDirectory(). The logic of that method is very clear: it launches another Spark job and, once the RDD is computed, writes it to the file system partition by partition.

def writeRDDToCheckpointDirectory[T: ClassTag](
    originalRDD: RDD[T],
    checkpointDir: String,
    blockSize: Int = -1): ReliableCheckpointRDD[T] = {
  val checkpointStartTimeNs = System.nanoTime()

  val sc = originalRDD.sparkContext

  // Create the output path for the checkpoint
  val checkpointDirPath = new Path(checkpointDir)
  // Get the Hadoop FileSystem handle for the checkpoint path
  val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
  // Create the directory
  if (!fs.mkdirs(checkpointDirPath)) {
    throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
  }

  // Save to file, and reload it as an RDD
  // Broadcast the Hadoop configuration to all nodes
  val broadcastedConf = sc.broadcast(
    new SerializableConfiguration(sc.hadoopConfiguration))
  // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
  // Launch another job to write the RDD's partition data to HDFS
  sc.runJob(originalRDD,
    writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)
  // If the RDD has a partitioner, also write it to the checkpoint directory
  if (originalRDD.partitioner.nonEmpty) {
    writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
  }

  val checkpointDurationMs =
    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - checkpointStartTimeNs)
  logInfo(s"Checkpointing took $checkpointDurationMs ms.")
  // Create a CheckpointRDD; its partition count should match the original RDD's
  val newRDD = new ReliableCheckpointRDD[T](
    sc, checkpointDirPath.toString, originalRDD.partitioner)
  if (newRDD.partitions.length != originalRDD.partitions.length) {
    throw new SparkException(
      s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
        s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
  }
  newRDD
}

2 Trying Checkpoint Out

Having walked through Spark's Checkpoint mechanism in the source code, we can also try it out in local mode; spark-shell is enough for a quick experiment.

scala> val data = sc.parallelize(List(1, 2, 3))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> sc.setCheckpointDir("/tmp")
scala> data.checkpoint
scala> data.count
res2: Long = 3

 

As the snippet shows, we first build an RDD and set the checkpoint directory; since this is local mode, a local directory is fine for the experiment.

# List /tmp and find the checkpoint folder
/tmp $ ls
73d8442e-a375-401c-b1fc-84284e25b89c

# tree the checkpoint folder: all four partitions of the RDD (default parallelism) were checkpointed
/tmp $ tree 73d8442e-a375-401c-b1fc-84284e25b89c
73d8442e-a375-401c-b1fc-84284e25b89c
└── rdd-0
    ├── part-00000
    ├── part-00001
    ├── part-00002
    └── part-00003

1 directory, 4 files

How Reading from a Checkpoint Works

Detailed analysis
A Spark RDD is mainly made up of Dependency, Partition, and Partitioner, and Partition is one of these. A batch of raw input data is split into n pieces according to some logic (for example the split logic of JDBC or HDFS), and each piece corresponds to one Partition of the RDD. The number of Partitions determines the number of tasks and therefore affects the program's parallelism; a quick illustration follows right after this paragraph.
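For instance, a spark-shell style sketch (sc is predefined in spark-shell; the numbers are arbitrary) shows how the partition count drives the number of tasks per stage:

// 4 partitions -> the count action runs 4 tasks
val rdd = sc.parallelize(1 to 1000, 4)
println(rdd.getNumPartitions)   // 4
rdd.count()

// repartitioning changes the parallelism of downstream stages
val wider = rdd.repartition(8)
println(wider.getNumPartitions) // 8
wider.count()                   // 8 tasks in the final stage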

**1.** We start from Partition. The Partition source is as follows:

/**
 * An identifier for a partition in an RDD.
 */
trait Partition extends Serializable {
  /** Get the partition's index within its parent RDD*/
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index

  override def equals(other: Any): Boolean = super.equals(other)
}

The definition of Partition is simple. Partition and RDD go hand in hand, so every RDD type has its own Partition implementation; analyzing Partition therefore mostly means analyzing its subclasses.

  2. RDD.scala defines a number of methods, for example:
// Compute the data represented by the given partition
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
// The logic by which the data gets split
protected def getPartitions: Array[Partition]
// This RDD's dependencies, i.e. its parent RDDs
protected def getDependencies: Seq[Dependency[_]] = deps
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
@transient val partitioner: Option[Partitioner] = None

The second of these, getPartitions(), is the logic by which the data source is split, and its return type is exactly Partition; the first, compute(), is the method that consumes the resulting Partitions. We start from getPartitions and compute; a minimal custom RDD implementing both is sketched below.
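To make the two hooks concrete, here is a minimal, hypothetical custom RDD (not from the Spark code base) whose getPartitions splits an integer range into slices and whose compute iterates over one slice:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A toy partition type carrying its slice boundaries
class RangePartition(val idx: Int, val start: Int, val end: Int) extends Partition {
  override def index: Int = idx
}

// A toy RDD that splits the range [0, total) into numSlices partitions
class RangeRDD(sc: SparkContext, total: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {   // Nil: no parent dependencies

  // how the "data source" is split: one RangePartition per slice
  override protected def getPartitions: Array[Partition] = {
    val step = math.ceil(total.toDouble / numSlices).toInt
    (0 until numSlices).map { i =>
      new RangePartition(i, i * step, math.min((i + 1) * step, total)): Partition
    }.toArray
  }

  // how one partition is consumed: produce the numbers of that slice
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

// usage: new RangeRDD(sc, 10, 3).collect() returns 0 to 9, computed across 3 partitions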

  3. An RDD is evaluated through iterator: whenever a Task runs, it calls the RDD's iterator method, which in turn dispatches to compute. The source of iterator() in RDD.scala is as follows:
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      // A non-NONE StorageLevel means this RDD has been persisted, maybe in memory, maybe on disk;
      // if the block is read back from disk, it is also cached in memory
      getOrCompute(split, context)
    } else {
      // compute the rdd partition, or read the data from the checkpoint
      computeOrReadCheckpoint(split, context)
    }
  }
  4. This method is final (it cannot be overridden, but subclasses can use it). Looking at the persistence branch first, getOrCompute fetches the block from memory or disk; if it comes from disk, the block also needs to be cached in memory:
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
      
    val blockId = RDDBlockId(id, partition.index)  // build the RDDBlockId from the rdd id and partition index
      
    var readCachedBlock = true                     // whether the block was read from the cache
      
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
        
      // If execution reaches this point, the block was not found; set readCachedBlock to false,
      // meaning the result did not come from the cache.
      readCachedBlock = false
      // Recompute the partition, or read it from the checkpoint
      computeOrReadCheckpoint(partition, context)
        
    }) match {
      // A result was obtained; return it directly
      case Left(blockResult) =>   // the block was read from the cache
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }
  5. What does getOrElseUpdate do? If the specified block exists, it is fetched directly; otherwise makeIterator is called to compute the block, the result is persisted, and the value is returned:
 def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // get tries to fetch the data locally first and, failing that, from a remote node
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
        // Need to compute the block.
    }
    // If neither the local nor the remote lookup found the data, call makeIterator to compute it, then write the result as a block
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      case None =>    // the write succeeded
        val blockResult = getLocalValues(blockId).getOrElse {      // read the block back from the local store
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        releaseLock(blockId)
        Left(blockResult)
      case Some(iter) =>                 // the write failed
        // If the put failed, the data was probably too large to fit in memory and could not be
        // dropped to disk, so we return the iterator to the caller and let them decide what to do
        Right(iter)
    }
  }
  6. The get method fetches the data, trying locally first and then, if nothing is found, from a remote node:
  def get[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
    // try to get the block locally
    val local = getLocalValues(blockId)
    // if it was found locally, return it
    if (local.isDefined) {
      logInfo(s"Found block $blockId locally")
      return local
    }
    // if it was not found locally, fetch it from a remote node
    val remote = getRemoteValues[T](blockId)
    // if it was found remotely, return it; otherwise return None
    if (remote.isDefined) {
      logInfo(s"Found block $blockId remotely")
      return remote
    }
    None
  }
  7. The logic for reading a block locally lives in getLocalValues. It returns a BlockResult if the block exists locally and None otherwise; if the storage level uses disk, the block read from disk is also cached in the memory store to speed up the next read:
def getLocalValues(blockId: BlockId): Option[BlockResult] = {
  logDebug(s"Getting local block $blockId")
  // Ask the block info manager to lock the block for reading and return its metadata (BlockInfo)
  blockInfoManager.lockForReading(blockId) match {
    // return None if nothing was found
    case None =>
      logDebug(s"Block $blockId was not found")
      None
    // block metadata was found
    case Some(info) =>
      val level = info.level    // the block's storage level
      logDebug(s"Level for block $blockId is $level")
      val taskAttemptId = Option(TaskContext.get()).map(_.taskAttemptId())
      if (level.useMemory && memoryStore.contains(blockId)) {   // memory is used and the memory store holds this block
        // If the storage level is deserialized, call MemoryStore's getValues;
        // otherwise call MemoryStore's getBytes and deserialize the stream into an iterator
        val iter: Iterator[Any] = if (level.deserialized) {
          memoryStore.getValues(blockId).get
        } else {
          serializerManager.dataDeserializeStream(
            blockId, memoryStore.getBytes(blockId).get.toInputStream())(info.classTag)
        }
        val ci = CompletionIterator[Any, Iterator[Any]](iter, {
          releaseLock(blockId, taskAttemptId)
        })
        // Build and return a BlockResult holding the data, how it was read, and its size in bytes
        Some(new BlockResult(ci, DataReadMethod.Memory, info.size))
      } else if (level.useDisk && diskStore.contains(blockId)) {  // disk is used and the disk store holds the block: read it from disk and cache the result in memory
        val diskData = diskStore.getBytes(blockId)  // read the bytes via DiskStore's getBytes
        val iterToReturn: Iterator[Any] = {
          if (level.deserialized) {
            val diskValues = serializerManager.dataDeserializeStream(
              blockId,
              diskData.toInputStream())(info.classTag)
            // try to cache the values read back from disk in memory, to speed up later reads
            maybeCacheDiskValuesInMemory(info, blockId, level, diskValues)
          } else {
            // if no deserialization is needed, first try to cache the raw bytes in memory for faster later reads
            val stream = maybeCacheDiskBytesInMemory(info, blockId, level, diskData)
              .map { _.toInputStream(dispose = false) }
              .getOrElse { diskData.toInputStream() }
            // then return the deserialized data
            serializerManager.dataDeserializeStream(blockId, stream)(info.classTag)
          }
        }
        // build and return the BlockResult
        val ci = CompletionIterator[Any, Iterator[Any]](iterToReturn, {
          releaseLockAndDispose(blockId, diskData, taskAttemptId)
        })
        Some(new BlockResult(ci, DataReadMethod.Disk, info.size))
      } else {
        // the local read failed: report to the driver that this block is invalid, and it will be removed
        handleLocalReadFailure(blockId)
      }
  }
}
  8. Remote read: fetch the block from another block manager (another node) that holds it. The logic is as follows:
private def getRemoteValues[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
  val ct = implicitly[ClassTag[T]]
  getRemoteBytes(blockId).map { data =>
    // deserialize the remotely fetched bytes, then build and return a BlockResult
    val values =
      serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))(ct)
    new BlockResult(values, DataReadMethod.Network, data.size)
  }
}

The method that actually fetches the bytes, getRemoteBytes, works as follows:

def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
  logDebug(s"Getting remote block $blockId")
  require(blockId != null, "BlockId is null")
  var runningFailureCount = 0
  var totalFailureCount = 0
  // First find out which block managers currently hold this block
  val locations = getLocations(blockId)
  // The maximum number of allowed fetch failures equals the number of block managers holding the block
  val maxFetchFailures = locations.size
  var locationIterator = locations.iterator
  while (locationIterator.hasNext) { // iterate over the candidate block managers
    val loc = locationIterator.next()
    logDebug(s"Getting remote block $blockId from $loc")
    val data = try {
      // fetch the block from the remote node via BlockTransferService's fetchBlockSync
      blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
    } catch {
      case NonFatal(e) =>
        runningFailureCount += 1
        totalFailureCount += 1
        // if the total failure count reaches the threshold, return None
        if (totalFailureCount >= maxFetchFailures) {
          logWarning(s"Failed to fetch block after $totalFailureCount fetch failures. " +
            s"Most recent failure cause:", e)
          return None
        }

        logWarning(s"Failed to fetch remote block $blockId " +
          s"from $loc (failed attempt $runningFailureCount)", e)

        if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
          locationIterator = getLocations(blockId).iterator
          logDebug(s"Refreshed locations from the driver " +
            s"after ${runningFailureCount} fetch failures.")
          runningFailureCount = 0
        }
        null
    }

    // on success, return a ChunkedByteBuffer
    if (data != null) {
      return Some(new ChunkedByteBuffer(data))
    }
    logDebug(s"The value of block $blockId is null")
  }
  logDebug(s"Block $blockId not found")
  None
}
  9. The other branch: checkpoint

Back to the other branch of iterator(): if the block has not been persisted (the storage level is NONE), we either compute the partition or read it from a checkpoint. computeOrReadCheckpoint checks whether the RDD has already been checkpointed and materialized; if so, it calls the iterator of the first parent (the CheckpointRDD) to read the block data, otherwise it calls this RDD's own compute method.

private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
    // Has this rdd already been checkpointed and materialized? If so, read the data via the first parent rdd's iterator
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    // otherwise call the rdd's compute method to do the computation, returning an Iterator
    compute(split, context)
  }
}

Let's look at how ReliableCheckpointRDD implements compute:

ReliableCheckpointRDD
// Read the content of the checkpoint file associated with the given partition.
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
    ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
  }
  
 /**
   * Read the content of the specified checkpoint file.
   */
  def readCheckpointFile[T](
      path: Path,
      broadcastedConf: Broadcast[SerializableConfiguration],
      context: TaskContext): Iterator[T] = {
    val env = SparkEnv.get
    val fs = path.getFileSystem(broadcastedConf.value.value)
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
    val fileInputStream = fs.open(path, bufferSize)
    val serializer = env.serializer.newInstance()
    val deserializeStream = serializer.deserializeStream(fileInputStream)

    // Register an on-task-completion callback to close the input stream.
    context.addTaskCompletionListener(context => deserializeStream.close())

    deserializeStream.asIterator.asInstanceOf[Iterator[T]]
  }

Now look at LocalCheckpointRDD's compute:

// Throws an exception: the checkpoint block could not be found. This only happens when the original RDD was not explicitly persisted or an executor was lost. Under normal circumstances the original RDD is fully cached, so all partitions should already have been computed and be available in the block store.
  override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    throw new SparkException(
      s"Checkpoint block ${RDDBlockId(rddId, partition.index)} not found! Either the executor " +
      s"that originally checkpointed this partition is no longer alive, or the original RDD is " +
      s"unpersisted. If this problem persists, you may consider using `rdd.checkpoint()` " +
      s"instead, which is slower than local checkpointing but more fault-tolerant.")
  }

3 Summary

A checkpoint (which essentially writes the RDD to disk) is an aid to lineage-based fault tolerance. When the lineage becomes too long, the cost of recovering through it becomes too high, so it pays to checkpoint at an intermediate stage: if a node later fails and partitions are lost, recomputation restarts from the checkpointed RDD rather than replaying the whole lineage, which lowers the cost.

Recommendation: the RDD being checkpointed should ideally already be cached in memory; otherwise the checkpointing job has to recompute it, adding extra I/O. A minimal sketch of this pattern follows.
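A spark-shell style sketch of the recommended cache-then-checkpoint pattern (sc is predefined in spark-shell; the paths and the word-count job are hypothetical):

sc.setCheckpointDir("hdfs:///tmp/ckpt-demo")        // hypothetical HDFS path

val counts = sc.textFile("hdfs:///data/input")      // hypothetical input
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

counts.cache()        // persist first, so the extra checkpoint job reads cached blocks
counts.checkpoint()   // mark for checkpointing
counts.count()        // the action materializes both the cache and the checkpoint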
Checkpoint basics and the problems you may run into with checkpoint are covered in the references:
Reference: https://www.jianshu.com/p/a75d0439c2f9
Reference: https://blog.csdn.net/changshuchao/article/details/88634555
Reference: https://www.cnblogs.com/small-k/p/8909942.html

This article is for study purposes only; much of it is excerpted from the official documentation and personal blogs. If you spot a problem, please leave a comment and I will fix it.
