Spark Persistence Strategies: Cache Optimization
RDD persistence
When an RDD needs to be reused frequently, Spark provides a persistence mechanism for it: you can persist an RDD by calling either persist() or cache(), as shown below:
//scala
myRDD.persist()
myRDD.cache()
Why use persistence?
Suppose RDD1 is transformed into RDD2 and an action triggers the job. Once the job finishes, RDD1's computed partitions are dropped from memory. If a later operation needs RDD1 again, Spark traces the lineage all the way back up, re-reads the source data, and recomputes RDD1 before continuing. This adds disk I/O and compute cost. Persisting keeps the data around so the next action can use it directly.
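As a minimal sketch of this (the SparkContext `sc` and the path `input.txt` are hypothetical placeholders, and this needs a Spark runtime to execute), caching an RDD before reusing it across two actions avoids recomputing the lineage:

```scala
// Hypothetical example: sc is an existing SparkContext, "input.txt" a placeholder path.
val rdd1 = sc.textFile("input.txt")
  .map(_.split(","))        // some transformation chain that is costly to redo
  .filter(_.length > 1)

rdd1.cache()                // equivalent to persist(StorageLevel.MEMORY_ONLY)

val total  = rdd1.count()   // first action: computes rdd1 and populates the cache
val sample = rdd1.take(5)   // second action: served from the cache, no recomputation
```

Without the cache() call, the second action would replay the whole textFile/map/filter chain from the source file.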
The source of cache and persist
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
From the source we can see that cache simply calls the no-argument persist, so we only need to study persist. The no-argument persist defaults to StorageLevel.MEMORY_ONLY, so let's look at the source of the StorageLevel class.
/**
* Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
* new storage levels.
*/
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

  /**
   * :: DeveloperApi ::
   * Return the StorageLevel object with the specified name.
   */
  @DeveloperApi
  def fromString(s: String): StorageLevel = s match {
    case "NONE" => NONE
    case "DISK_ONLY" => DISK_ONLY
    case "DISK_ONLY_2" => DISK_ONLY_2
    case "MEMORY_ONLY" => MEMORY_ONLY
    case "MEMORY_ONLY_2" => MEMORY_ONLY_2
    case "MEMORY_ONLY_SER" => MEMORY_ONLY_SER
    case "MEMORY_ONLY_SER_2" => MEMORY_ONLY_SER_2
    case "MEMORY_AND_DISK" => MEMORY_AND_DISK
    case "MEMORY_AND_DISK_2" => MEMORY_AND_DISK_2
    case "MEMORY_AND_DISK_SER" => MEMORY_AND_DISK_SER
    case "MEMORY_AND_DISK_SER_2" => MEMORY_AND_DISK_SER_2
    case "OFF_HEAP" => OFF_HEAP
    case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: $s")
  }

  /**
   * :: DeveloperApi ::
   * Create a new StorageLevel object.
   */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int): StorageLevel = {
    getCachedStorageLevel(
      new StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication))
  }

  /**
   * :: DeveloperApi ::
   * Create a new StorageLevel object without setting useOffHeap.
   */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      deserialized: Boolean,
      replication: Int = 1): StorageLevel = {
    getCachedStorageLevel(new StorageLevel(useDisk, useMemory, false, deserialized, replication))
  }
}
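The constructor flags above fully determine each named level: the `_SER` suffix just means deserialized = false, and the `_2` suffix means replication = 2. A small pure-Scala sketch (a hypothetical case class for illustration, not Spark's actual implementation) makes the mapping explicit:

```scala
// Hypothetical mirror of StorageLevel's constructor flags, for illustration only.
case class Level(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean,
                 deserialized: Boolean, replication: Int = 1)

val MEMORY_ONLY         = Level(useDisk = false, useMemory = true,  useOffHeap = false, deserialized = true)
val MEMORY_AND_DISK_SER = Level(useDisk = true,  useMemory = true,  useOffHeap = false, deserialized = false)
val DISK_ONLY_2         = Level(useDisk = true,  useMemory = false, useOffHeap = false, deserialized = false, replication = 2)

// "_SER" flips deserialized to false; "_2" raises replication to 2.
assert(!MEMORY_AND_DISK_SER.deserialized)
assert(DISK_ONLY_2.replication == 2)
```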
We can see that StorageLevel's parameters are:
Parameter (default) | Meaning |
useDisk: Boolean | whether to persist to disk |
useMemory: Boolean | whether to persist in memory |
useOffHeap: Boolean | whether to use off-heap memory (outside the JVM heap) |
deserialized: Boolean | whether to store the data in deserialized form (false means serialized) |
replication: 1 | number of replicas (for fault tolerance) |
From these we get:
- NONE: the default, no persistence
- DISK_ONLY: cached on disk only
- DISK_ONLY_2: cached on disk only, with 2 replicas
- MEMORY_ONLY: cached in memory only
- MEMORY_ONLY_2: cached in memory only, with 2 replicas
- MEMORY_ONLY_SER: cached in memory only, serialized
- MEMORY_ONLY_SER_2: cached in memory only, serialized, with 2 replicas
- MEMORY_AND_DISK: cached in memory; once memory is full, spilled to disk
- MEMORY_AND_DISK_2: cached in memory, spilled to disk when full, with 2 replicas
- MEMORY_AND_DISK_SER: cached in memory, spilled to disk when full, serialized
- MEMORY_AND_DISK_SER_2: cached in memory, spilled to disk when full, serialized, with 2 replicas
- OFF_HEAP: cached off-heap, outside the JVM heap
Serialization works somewhat like compression: it saves storage space but adds compute cost, because the data must be serialized and deserialized on every use. Replication defaults to 1; extra replicas guard against data loss and strengthen fault tolerance. OFF_HEAP stores the RDD in Tachyon, giving lower garbage-collection overhead; it is enough just to know it exists. DISK_ONLY needs no further comment. Below we mainly compare MEMORY_ONLY and MEMORY_AND_DISK.
MEMORY_ONLY vs. MEMORY_AND_DISK
MEMORY_ONLY: the RDD is cached only in memory; partitions that do not fit are recomputed from the source data the next time they are needed.
MEMORY_AND_DISK: stores as much as possible in memory; partitions that do not fit are spilled to disk, avoiding recomputation.
Intuitively MEMORY_ONLY looks less efficient because of the recomputation, but in practice the recomputation happens in memory and usually costs far less than the disk I/O, so MEMORY_ONLY is the usual default. Only when the intermediate computation is especially expensive does MEMORY_AND_DISK become the better choice.
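As a sketch of that trade-off (again assuming an existing SparkContext `sc`; the pipeline and the path `logs.txt` are hypothetical, and this needs a Spark runtime to execute), an RDD whose lineage includes an expensive shuffle can be persisted with MEMORY_AND_DISK instead of the default:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical expensive pipeline: re-running the shuffle would cost more than disk I/O.
val wordCounts = sc.textFile("logs.txt")   // "logs.txt" is a placeholder path
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)                      // shuffle: expensive to recompute

// Spill partitions that don't fit in memory to disk rather than recomputing them.
wordCounts.persist(StorageLevel.MEMORY_AND_DISK)

val distinctWords = wordCounts.count()     // first action: computes and persists
val sample        = wordCounts.take(10)    // later actions reuse memory or disk copies
```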
Summary