Spark Persistence & Checkpointing

1. Persistence
Spark's persistence mechanism involves three operations: persist, cache, and unpersist.

      /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
      def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
     
      /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
      def cache(): this.type = persist()
     
      /**
       * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
       *
       * @param blocking Whether to block until all blocks are deleted.
       * @return This RDD.
       */
      def unpersist(blocking: Boolean = true): this.type = {
        logInfo("Removing RDD " + id + " from persistence list")
        sc.unpersistRDD(id, blocking)
        storageLevel = StorageLevel.NONE
        this
      }

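The cache method is equivalent to calling persist with StorageLevel.MEMORY_ONLY, and persist itself does little more than set the storage level of the current RDD; no data is stored until the RDD is actually computed. SparkContext maintains a hash table, persistentRdds, that registers every persisted RDD: when persist is called, the RDD is recorded in the table with its id as the key. unpersist calls the unpersistRDD method of SparkContext, which removes the RDD from persistentRdds and also deletes every block belonging to the RDD's partitions from the storage media.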
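The bookkeeping described above can be sketched in plain Scala. This is a simplified model, not Spark's code: `PersistModel`, `MiniContext`, `MiniRdd`, the `Level` objects, and the `blocks` map are hypothetical stand-ins, and no real caching takes place.

```scala
import scala.collection.mutable

object PersistModel {
  // Hypothetical stand-ins for storage levels; NOT Spark's StorageLevel.
  sealed trait Level
  case object NoneLevel extends Level
  case object MemoryOnly extends Level

  final class MiniContext {
    // Plays the role of SparkContext's table of persisted RDDs, keyed by RDD id.
    val persistentRdds = mutable.HashMap.empty[Int, MiniRdd]
    // Plays the role of the block store: RDD id -> indices of cached partitions.
    val blocks = mutable.HashMap.empty[Int, mutable.Set[Int]]

    def unpersistRdd(id: Int): Unit = {
      blocks.remove(id)         // drop every stored block of this RDD
      persistentRdds.remove(id) // and remove it from the registry
    }
  }

  final class MiniRdd(val id: Int, ctx: MiniContext) {
    var storageLevel: Level = NoneLevel

    // persist only records the level and registers the RDD; nothing is computed.
    def persist(level: Level = MemoryOnly): this.type = {
      storageLevel = level
      ctx.persistentRdds(id) = this
      this
    }

    // cache is just persist with the default level.
    def cache(): this.type = persist()

    def unpersist(): this.type = {
      ctx.unpersistRdd(id)
      storageLevel = NoneLevel
      this
    }
  }
}
```

In this model, as in the description above, calling cache touches only metadata; blocks would appear in the store later, when a job first computes the partitions.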

The available storage levels are listed below:

      object StorageLevel {
        // Constructor arguments: useDisk, useMemory, useOffHeap, deserialized, replication (default 1)
        val NONE = new StorageLevel(false, false, false, false)
        val DISK_ONLY = new StorageLevel(true, false, false, false)
        val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
        val MEMORY_ONLY = new StorageLevel(false, true, false, true)
        val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
        val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
        val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
        val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
        val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
        val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
        val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
        val OFF_HEAP = new StorageLevel(false, false, true, false)
      }

      class StorageLevel private(
          private var _useDisk: Boolean,
          private var _useMemory: Boolean,
          private var _useOffHeap: Boolean,
          private var _deserialized: Boolean,
          private var _replication: Int = 1)
        extends Externalizable
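To make the five constructor arguments easier to read, here is a small sketch that decodes them into words. `StorageLevelModel`, `Level`, and `describe` are hypothetical names used only for illustration; they mirror the field order of the class above but are not part of Spark.

```scala
object StorageLevelModel {
  // Hypothetical mirror of the five constructor fields; NOT Spark's StorageLevel.
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int = 1) {

    // Render the flags as "<media>, <data form>, <replication>x".
    def describe: String = {
      val media = Seq(
        if (useMemory) Some("memory") else None,
        if (useDisk) Some("disk") else None,
        if (useOffHeap) Some("off-heap") else None
      ).flatten
      val where = if (media.isEmpty) "nowhere" else media.mkString(" + ")
      val form = if (deserialized) "deserialized objects" else "serialized bytes"
      s"$where, $form, ${replication}x"
    }
  }

  // Same argument values as MEMORY_ONLY and MEMORY_AND_DISK_SER_2 in the listing.
  val MemoryOnly = Level(false, true, false, true)
  val MemoryAndDiskSer2 = Level(true, true, false, false, 2)
}
```

Read this way, the `_SER` suffix corresponds to `deserialized = false`, and the `_2` suffix to a replication factor of 2.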


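2. Checkpointing
The checkpoint mechanism is implemented quite differently from persistence. A checkpoint does not store the result as part of the first computation; instead, once a job has finished, a separate job is launched to write the data.
The checkpoint operation is implemented in the RDD class. The checkpoint method instantiates a ReliableRDDCheckpointData to mark the current RDD: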

    

      /**
       * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
       * directory set with `SparkContext#setCheckpointDir` and all references to its parent
       * RDDs will be removed. This function must be called before any job has been
       * executed on this RDD. It is strongly recommended that this RDD is persisted in
       * memory, otherwise saving it on a file will require recomputation.
       */
      def checkpoint(): Unit = RDDCheckpointData.synchronized {
        if (context.checkpointDir.isEmpty) {
          throw new SparkException("Checkpoint directory has not been set in the SparkContext")
        } else if (checkpointData.isEmpty) {
          checkpointData = Some(new ReliableRDDCheckpointData(this))
        }
      }

RDDCheckpointData works with an enumeration type, CheckpointState, defined alongside it:

  

    /**
     * Enumeration to manage state transitions of an RDD through checkpointing
     * [ Initialized --> checkpointing in progress --> checkpointed ].
     */
    private[spark] object CheckpointState extends Enumeration {
      type CheckpointState = Value
      val Initialized, CheckpointingInProgress, Checkpointed = Value
    }

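CheckpointState represents the current checkpoint state of an RDD. Its values are Initialized, CheckpointingInProgress, and Checkpointed, and the transitions are as follows.
(1) Initialized
This is the default state after ReliableRDDCheckpointData is instantiated. It marks the RDD as requested for checkpointing, although nothing has been written yet (compared with v1.4.x, the MarkedForCheckpoint state is gone).

(2) CheckpointingInProgress
After every job finishes, doCheckpoint is called on the job's final RDD. The method walks back along the RDD dependency chain until it meets an RDD whose RDDCheckpointData is in the Initialized state; it then marks that RDDCheckpointData as CheckpointingInProgress and launches a job to write the data.

(3) Checkpointed
Once the new job has finished writing the data, all dependencies of the checkpointed RDD are cleared, its RDDCheckpointData is marked as Checkpointed, and its parent is reset to a CheckpointRDD, whose compute method reads the data directly from the file system.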
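The three transitions can be sketched as a toy model in plain Scala. Everything here is an assumption made for illustration: `CheckpointModel`, `Node`, and the `write` callback are hypothetical, the callback stands in for the extra write job, and Spark's real implementation adds locking and actual file output.

```scala
object CheckpointModel {
  object State extends Enumeration {
    val Initialized, CheckpointingInProgress, Checkpointed = Value
  }

  // Hypothetical, heavily simplified node in an RDD lineage; NOT Spark's RDD.
  final class Node(val name: String, var parents: List[Node]) {
    // None means checkpoint() was never requested for this node.
    var state: Option[State.Value] = None

    // Only marks the node; nothing is written at this point.
    def checkpoint(): Unit =
      if (state.isEmpty) state = Some(State.Initialized)

    // Called on the last node of a finished job.
    def doCheckpoint(write: Node => Unit): Unit = state match {
      case Some(State.Initialized) =>
        state = Some(State.CheckpointingInProgress)
        write(this) // stands in for the dedicated write job
        // Lineage is cut: the only parent is now a checkpoint-backed node.
        parents = List(new Node(s"checkpoint($name)", Nil))
        state = Some(State.Checkpointed)
      case Some(_) =>
        () // already in progress or done: nothing to do
      case None =>
        parents.foreach(_.doCheckpoint(write)) // keep walking up the lineage
    }
  }
}
```

For a lineage a -> b -> c where only b was marked, calling `c.doCheckpoint` in this model writes b alone, leaves c unmarked, and replaces b's parent a with a single checkpoint-backed node.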
 
