A Brief Look at Apache Spark Caching and Checkpointing

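In Apache Spark application development, memory management is one of the most important tasks, yet the difference between caching and checkpointing is a common source of confusion. Both operations exist to keep an RDD (resilient distributed dataset) from being recomputed every time it is referenced, which wastes time and space. So how do the two differ?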


Caching

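The cache mechanism lets applications that access the same data repeatedly (iterative algorithms, interactive applications) run faster. Several persistence levels are available, so developers can trade space against computation cost and decide what happens to an RDD when memory runs short: partitions can be kept in memory, on disk, or both, and when memory is insufficient Spark evicts blocks in least-recently-used (LRU) order, writing them to disk if the storage level allows it, to free up space.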

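Spark can therefore recompute an RDD's partitions when they are no longer in the cache, and it can also recompute partitions that are lost when a node fails. Finally, a cached RDD lives only within the lifetime of a running application: once the application terminates, its cached RDDs are deleted as well.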
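As a minimal sketch of choosing a persistence level, the snippet below persists a small RDD with MEMORY_AND_DISK and runs two actions against it. The object name, the local[2] master, and the sample data are assumptions made for illustration; only persist, getStorageLevel, and unpersist come from the RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Builds a small RDD, persists it, and runs two actions against it.
  // Returns both counts and the storage level Spark recorded for the RDD.
  def run(sc: SparkContext): (Long, Long, StorageLevel) = {
    val squares = sc.parallelize(1 to 1000, numSlices = 4).map(x => x.toLong * x)

    // MEMORY_AND_DISK keeps partitions in memory and spills those that do
    // not fit to local disk; cache() is shorthand for persist(MEMORY_ONLY).
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    val first = squares.count()  // computes the partitions and stores them
    val second = squares.count() // can be served from the block manager
    val level = squares.getStorageLevel

    squares.unpersist()          // release the blocks explicitly
    (first, second, level)
  }

  def main(args: Array[String]): Unit = {
    // The master URL and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("caching-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Swapping in StorageLevel.MEMORY_ONLY or StorageLevel.DISK_ONLY changes where the blocks live without changing the rest of the program.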

Checkpointing

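Checkpointing stores an RDD in a reliable storage system (for example HDFS or S3). Checkpointing an RDD is somewhat like Hadoop writing intermediate results to disk: it gives up some execution performance in exchange for a better ability to recover from failures that occur while the job is running. Because the RDD is checkpointed to an external storage system (disk, HDFS, S3, and so on), the checkpointed data outlives the application and can be reused by other applications.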
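A minimal sketch of checkpointing an RDD follows. The object name, the sample data, and the temporary local directory are assumptions for illustration; on a real cluster the checkpoint directory would normally point at HDFS or another reliable store.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  // Marks an RDD for checkpointing and triggers it with an action.
  // Returns whether the RDD reports itself as checkpointed afterwards,
  // plus the location of the files Spark wrote (if any).
  def run(sc: SparkContext, checkpointDir: String): (Boolean, Option[String]) = {
    sc.setCheckpointDir(checkpointDir)

    val counts = sc.parallelize(Seq("a", "b", "a", "c"), 2)
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    counts.checkpoint() // only marks the RDD; nothing is written yet
    counts.count()      // the action runs the job, then the checkpoint is written

    (counts.isCheckpointed, counts.getCheckpointFile)
  }

  def main(args: Array[String]): Unit = {
    // A local temp directory stands in for HDFS/S3 in this sketch.
    val dir = Files.createTempDirectory("spark-checkpoint").toString
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc, dir)) finally sc.stop()
  }
}
```

Note that checkpoint() has to be called before the first action on the RDD; calling it afterwards has no effect on jobs that have already run.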

How caching and checkpointing relate

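To understand how caching and checkpointing interact, let us first follow the path along which an RDD is computed.
The core of the Spark engine is the DAGScheduler. It breaks a Spark job down into a DAG made up of several stages. Each shuffle or result stage is in turn split into tasks that run independently, one per partition of the RDD. An RDD's iterator method is the entry point through which a task accesses the underlying data partition: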

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

As the code shows, if a storage level has been set, the RDD may be cached, so iterator first calls getOrCompute to try to fetch the partition from the block manager.

  /**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().registerInputMetrics(blockResult.readMethod)
          existingMetrics.incBytesReadInternal(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsReadInternal(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }

If the block manager does not hold this partition of the RDD, it falls back to computeOrReadCheckpoint:

  /**
   * Compute an RDD partition or read it from a checkpoint if the RDD is checkpointing.
   */
  private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }

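As you might guess, computeOrReadCheckpoint looks for the data in the checkpoint: if the RDD has been checkpointed and materialized, the partition is read through its parent, which at that point is the checkpoint RDD. If the RDD has not been checkpointed, the partition is computed by calling compute.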
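The lookup order described above (cache first, then checkpoint, then compute) can be sketched as a toy model in plain Scala. Nothing here is Spark code: ToyRdd, its three flags, and the Source type are hypothetical names invented for this illustration, and the model ignores details such as storing a freshly computed block back into the cache.

```scala
object LookupOrderSketch {
  // Where a partition's data ended up coming from.
  sealed trait Source
  case object FromCache extends Source
  case object FromCheckpoint extends Source
  case object Computed extends Source

  // A toy stand-in for one RDD partition: only the flags the real code inspects.
  final case class ToyRdd(hasStorageLevel: Boolean, blockIsCached: Boolean, isCheckpointed: Boolean)

  // Mirrors computeOrReadCheckpoint: prefer the checkpoint, otherwise compute.
  def computeOrReadCheckpoint(rdd: ToyRdd): Source =
    if (rdd.isCheckpointed) FromCheckpoint else Computed

  // Mirrors iterator plus getOrCompute: consult the cache only when a
  // storage level is set, and fall through when the block is missing.
  def iterator(rdd: ToyRdd): Source =
    if (rdd.hasStorageLevel && rdd.blockIsCached) FromCache
    else computeOrReadCheckpoint(rdd)

  def main(args: Array[String]): Unit = {
    println(iterator(ToyRdd(hasStorageLevel = true, blockIsCached = true, isCheckpointed = true)))
    println(iterator(ToyRdd(hasStorageLevel = true, blockIsCached = false, isCheckpointed = true)))
    println(iterator(ToyRdd(hasStorageLevel = false, blockIsCached = false, isCheckpointed = false)))
  }
}
```

The three calls in main cover the three branches: a cached block wins even when a checkpoint exists, a cache miss falls back to the checkpoint, and an RDD with neither is computed.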

How they differ

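With caching, every partition that is marked for caching is stored as soon as it has been computed. Checkpointing does not store data at first computation in this way. Instead, it waits until the job has finished and then launches a separate job dedicated to writing the checkpoint. In other words, an RDD that is to be checkpointed is computed twice.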

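For that reason, it is advisable to call rdd.cache() together with rdd.checkpoint(), so that the second job does not have to recompute the RDD and can simply read the cached partitions and write them to disk. Spark also offers rdd.persist(StorageLevel.DISK_ONLY), which amounts to caching on disk, so the RDD is stored on disk the first time it is computed. This kind of persist still differs from checkpoint in many ways, though. It does persist the RDD's partitions to disk, but those partitions are managed by the blockManager.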
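To make the double computation visible, the sketch below counts how often the map function runs while an RDD is checkpointed, once without cache() and once with it. The object name, the accumulator name, and the temporary checkpoint directory are assumptions for illustration, and accumulator values updated inside transformations are only a rough gauge, since retried tasks can add to them again. In a local run without failures, one would expect roughly twice as many calls without cache() as with it.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CacheBeforeCheckpointSketch {
  // Counts how many times the map function runs while an RDD of `n`
  // elements is checkpointed, with or without cache() in front.
  def evaluations(sc: SparkContext, checkpointDir: String, n: Int, useCache: Boolean): Long = {
    sc.setCheckpointDir(checkpointDir)
    val calls = sc.longAccumulator("map calls")
    val rdd = sc.parallelize(1 to n, 2).map { x => calls.add(1); x * 2 }

    if (useCache) rdd.cache() // lets the checkpoint job reuse computed blocks
    rdd.checkpoint()
    rdd.count()               // runs the job, then the follow-up checkpoint job

    if (useCache) rdd.unpersist()
    calls.value
  }

  def main(args: Array[String]): Unit = {
    // A local temp directory stands in for HDFS/S3 in this sketch.
    val dir = Files.createTempDirectory("spark-checkpoint").toString
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-before-checkpoint")
    val sc = new SparkContext(conf)
    try {
      println(s"without cache: ${evaluations(sc, dir, 100, useCache = false)}")
      println(s"with cache:    ${evaluations(sc, dir, 100, useCache = true)}")
    } finally sc.stop()
  }
}
```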

Once the driver program finishes, that is, once the executor process CoarseGrainedExecutorBackend stops, the blockManager stops too, and any RDD cached on disk is wiped (the entire local directory used by the blockManager is deleted).

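Checkpointing, by contrast, persists the RDD to HDFS or a local directory, where it stays until it is removed manually. It can therefore be used by the next driver program, whereas a cached RDD cannot be used by any other driver program.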

Summary

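Checkpointing spends more time reading and writing the RDD (because it goes through an external storage system such as HDFS, S3, or disk), but in return some Spark worker failures do not necessarily lead to recomputation.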

Cached RDDs, on the other hand, do not occupy storage permanently, but recomputation is required after some Spark worker failures.

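In short, which of the two to use depends on how the developer weighs the business scenario; in most cases the choice comes down to the overall performance of the computation. (Tip: caching is enough most of the time; if you suspect a job may fail, manually checkpoint a few critical RDDs.)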
