Spark Caching: Collect, Cache, Persist

Original

2020-07-02 12:43

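collect, cache, and persist all gather or store an RDD's data in some way: collect pulls it back to the driver, while cache and persist keep it around for reuse. This post notes what each one does.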

Collect:

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

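collect converts all elements of the RDD into an Array. It is mostly used for test output in local mode and is not recommended in cluster mode. As the source comment says, collect should only be used when the resulting array is expected to be small, because all the data is loaded into the driver's memory. In local testing this matters little, but in cluster mode a driver that was given too little memory can easily run out of memory (OOM).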
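To make the driver-memory point concrete, here is a minimal sketch, assuming a local master and a small generated dataset (the object name `CollectSketch` and its helper names are made up for illustration). It contrasts collect, which pulls every element to the driver, with take, which pulls only the first few.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectSketch {
  // Pulls every element to the driver; only reasonable for small results.
  def everything(sc: SparkContext): Array[Int] =
    sc.parallelize(1 to 10, numSlices = 4).collect()

  // Pulls at most `n` elements to the driver instead of the whole RDD.
  def preview(sc: SparkContext, n: Int): Array[Int] =
    sc.parallelize(1 to 1000, numSlices = 4).take(n)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for this sketch; a real job gets its master from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("collect-sketch").setMaster("local[2]"))
    try {
      println(everything(sc).mkString(","))
      println(preview(sc, 3).mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

When only a sample of the output is needed for a check, take keeps the amount of data sent to the driver bounded regardless of the RDD's size.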

Cache:

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()

cache is simply the most basic form of persist. It can be seen as an alias for the no-argument overload of persist, which the source defines like this:

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

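As shown, calling cache is the same as calling persist with no arguments. A typical use is caching RDDs that are reused several times and take little space, somewhat like the small table that gets broadcast in a map join. MEMORY_ONLY means the data is kept only in memory, so the size of the RDD being cached has to be considered.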
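A small sketch of the equivalence, assuming a local master and a toy RDD (`CacheSketch` and `cachedSquares` are illustrative names, not part of Spark). It checks that an RDD marked with cache reports MEMORY_ONLY, then runs two actions so the second can reuse the stored partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Builds a small RDD and marks it for caching; nothing is computed yet.
  def cachedSquares(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 100).map(x => x * x).cache()

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for this sketch.
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-sketch").setMaster("local[2]"))
    try {
      val squares = cachedSquares(sc)

      // cache() only marks the RDD; the level is already MEMORY_ONLY.
      println(squares.getStorageLevel == StorageLevel.MEMORY_ONLY)

      // The first action computes and stores the partitions,
      // the second one can read them back from memory.
      println(squares.count())
      println(squares.reduce(_ + _))
    } finally {
      sc.stop()
    }
  }
}
```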

Persist:

  /**
   * Mark this RDD for persisting using the specified level.
   *
   * @param newLevel the target storage level
   * @param allowOverride whether to override any existing level with the new one
   */
  private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    // If this is the first time this RDD is marked for persisting, register it
    // with the SparkContext for cleanups and accounting. Do this only once.
    if (storageLevel == StorageLevel.NONE) {
      sc.cleaner.foreach(_.registerRDDForCleanup(this))
      sc.persistRDD(this)
    }
    storageLevel = newLevel
    this
  }

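Compared with cache, persist offers more flexible choices through StorageLevel. The second parameter, allowOverride, controls whether an RDD's storage level may be changed within a Spark job; this overload is private, and the need rarely comes up in practice. The available storage levels are outlined next.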
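The check at the top of the private method can be observed from user code. The following is a sketch under the assumption of a local master (`ChangeLevelSketch` and `demo` are hypothetical names): assigning a different level to an already-persisted RDD through the public persist(newLevel) is rejected, while calling unpersist first resets the level so a new one can be assigned.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object ChangeLevelSketch {
  // Returns (was the direct change rejected?, level after unpersist + persist).
  def demo(sc: SparkContext): (Boolean, StorageLevel) = {
    val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)

    // Assigning a different level to an already-persisted RDD throws
    // UnsupportedOperationException, captured here as a failed Try.
    val rejected = Try(rdd.persist(StorageLevel.DISK_ONLY)).isFailure

    // Dropping the old level first makes the new assignment legal.
    rdd.unpersist()
    rdd.persist(StorageLevel.DISK_ONLY)
    (rejected, rdd.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for this sketch.
    val sc = new SparkContext(
      new SparkConf().setAppName("change-level-sketch").setMaster("local[2]"))
    try {
      println(demo(sc))
    } finally {
      sc.stop()
    }
  }
}
```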

StorageLevel class

class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)
  extends Externalizable {

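The class shows that StorageLevel has five constructor parameters:

_useDisk: use disk. When the RDD is too large to fit in memory, its partitions can be written to local disk on the executors.

_useMemory: use memory. cache and persist() work in this mode.

_useOffHeap: use off-heap memory. I am not yet familiar enough with the JVM here and will dig into it later.

_deserialized: whether the data is stored as deserialized objects. Setting it to false stores the data in serialized form, which reduces memory use and suits cases where space is tight.

_replication: number of replicas, 1 by default. If the cached data is large and re-running a failed job is expensive, it can be raised to 2 for better fault tolerance. A common case is writing out the logs of a large job, where a replica guards against an OOM or I/O error forcing the whole job to be restarted.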

StorageLevel object (companion object)

object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

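From these five parameters, the companion object predefines a set of storage levels. The most common is MEMORY_ONLY, which suits small data that fits in memory. Medium and large datasets are better served by MEMORY_AND_DISK_SER, and if restarting a failed job is too expensive, MEMORY_AND_DISK_SER_2 is worth considering. Serialization saves space but adds CPU time for serializing and deserializing, so the choice between MEMORY_AND_DISK_SER and MEMORY_AND_DISK depends on the workload.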
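To see how the predefined levels map back to the five constructor parameters, the flags can be read through StorageLevel's public accessors. A minimal sketch (`LevelFlagsSketch` and `describe` are illustrative names); it needs only spark-core on the classpath, not a running SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

object LevelFlagsSketch {
  // Renders the public flags of a storage level as one line of text.
  def describe(level: StorageLevel): String =
    s"disk=${level.useDisk} memory=${level.useMemory} offHeap=${level.useOffHeap} " +
      s"deserialized=${level.deserialized} replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    println(describe(StorageLevel.MEMORY_ONLY))
    println(describe(StorageLevel.MEMORY_AND_DISK_SER_2))
  }
}
```

The printed flags line up with the constructor arguments in the listing above, which makes it easy to confirm that the _SER variants differ only in the deserialized flag and the _2 variants only in replication.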

Usage:

      import org.apache.spark.storage.StorageLevel

      rdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

Summary of use cases:

1. Local testing: collect

2. Small RDD: cache

3. Large RDD, limited CPU: persist(MEMORY_AND_DISK); if a restart is costly, use MEMORY_AND_DISK_2 instead

4. Large RDD, ample CPU: persist(MEMORY_AND_DISK_SER); if a restart is costly, use MEMORY_AND_DISK_SER_2 instead
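Items 2 to 4 of the list above can be written down as a small lookup. This is only a sketch of the rule of thumb: `chooseLevel` and its three Boolean parameters are hypothetical, and deciding what counts as "large", "ample CPU", or "costly" is left to the caller.

```scala
import org.apache.spark.storage.StorageLevel

object ChooseLevelSketch {
  // Hypothetical helper encoding the list above; not part of Spark.
  def chooseLevel(largeData: Boolean, ampleCpu: Boolean, costlyRestart: Boolean): StorageLevel =
    (largeData, ampleCpu, costlyRestart) match {
      case (false, _, _)        => StorageLevel.MEMORY_ONLY           // same level as cache()
      case (true, false, false) => StorageLevel.MEMORY_AND_DISK
      case (true, false, true)  => StorageLevel.MEMORY_AND_DISK_2
      case (true, true, false)  => StorageLevel.MEMORY_AND_DISK_SER
      case (true, true, true)   => StorageLevel.MEMORY_AND_DISK_SER_2
    }
}
```

A call such as `rdd.persist(ChooseLevelSketch.chooseLevel(largeData = true, ampleCpu = true, costlyRestart = false))` would then pick MEMORY_AND_DISK_SER.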

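These are the common cases. Consider the operations above whenever an RDD is reused several times, but keep in mind that the memory-based levels are prone to OOM and the disk-based levels lengthen the run time through disk I/O. Finally, remember to call unpersist once the RDD is no longer needed, to free the space.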
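The full lifecycle (persist, reuse across several actions, unpersist) might look like the sketch below. It assumes a local master and a toy dataset; `ReuseSketch` and `stats` are illustrative names. The try/finally is one way to make sure unpersist runs even if an action fails.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReuseSketch {
  // Runs two actions over the same persisted RDD and returns (count, max).
  def stats(sc: SparkContext): (Long, Int) = {
    val doubled = sc.parallelize(1 to 1000)
      .map(_ * 2)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    try {
      val n = doubled.count() // first action: computes and stores the partitions
      val max = doubled.max() // second action: reuses the stored partitions
      (n, max)
    } finally {
      doubled.unpersist() // release the cached blocks once the RDD is no longer needed
    }
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for this sketch.
    val sc = new SparkContext(
      new SparkConf().setAppName("reuse-sketch").setMaster("local[2]"))
    try {
      println(stats(sc))
    } finally {
      sc.stop()
    }
  }
}
```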
