1. Spark performance tuning: using checkpoint
https://blog.csdn.net/leen0304/article/details/78718346
Overview
A checkpoint is, as the name suggests, a saved point in the computation, similar to a snapshot. In a Spark job the DAG of the computation can be very long, and the cluster must execute the whole DAG to produce a result. If intermediate data is lost partway through such a long computation, Spark recomputes it from the beginning by following the RDD lineage, which is very expensive. We can of course keep intermediate results in memory or on disk with cache or persist, but that still does not guarantee the data will never be lost: if the memory holding it fails or the disk breaks, Spark again has to recompute from the start of the lineage. This is why checkpoint exists. Its role is to take the important intermediate data in the DAG, mark it as a checkpoint, and store the result in a highly available location (usually HDFS).
Using checkpoint
To use checkpoint, you must first set the checkpoint directory, for example:
val sparkConf = new SparkConf()
sparkConf
  .setAppName("JOINSkewedData")
  .set("spark.sql.autoBroadcastJoinThreshold", "1048576")    // 1 MB broadcast-join threshold
  //.set("spark.sql.autoBroadcastJoinThreshold", "104857600") // 100 MB broadcast-join threshold
  .set("spark.sql.shuffle.partitions", "3")
if (args.length > 0 && args(0).equals("ide")) {
  sparkConf.setMaster("local[3]")
}
val spark = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("WARN")
sparkContext.setCheckpointDir("file:///D:/checkpoint/")
Directory settings for different environments
Sometimes you need to debug locally, in which case the checkpoint directory should point to a local Windows or Linux path:
Windows
sparkContext.setCheckpointDir("file:///D:/checkpoint/")
Linux
sparkContext.setCheckpointDir("file:///tmp/checkpoint")
HDFS
sparkContext.setCheckpointDir("hdfs://leen:8020/checkPointDir")
Calling checkpoint
To checkpoint an RDD, simply call the checkpoint method on it:
rdd.checkpoint
Note:
When using checkpoint, it is recommended to call rdd.cache() first. checkpoint() itself is lazy: nothing is written until the first action on the RDD completes, at which point Spark launches a separate job that recomputes the RDD in order to write it out. Without caching, the lineage is therefore computed twice: once for the action and once for the checkpoint. If you cache first, the checkpoint job reads the partitions that were just cached in memory and writes them to HDFS instead of recomputing them, so the lineage is only computed once:
rdd.cache()
rdd.checkpoint()
rdd.collect
Checkpoint in Spark Streaming
Using checkpoint in Streaming mainly involves two points: setting the checkpoint directory, and creating the StreamingContext via getOrCreate. If the checkpoint directory contains no data, a new StreamingContext instance is created and the checkpoint directory is set on it; otherwise, the configuration and data are read back from the checkpoint directory to reconstruct the StreamingContext.
// Function to create and set up a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)    // new context
  val lines = ssc.socketTextStream(...)  // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)    // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
Difference between checkpoint and cache
checkpoint and cache are not the same thing. checkpoint truncates the lineage: the RDD's dependencies on its parent RDDs are removed and the data is persisted to reliable storage. cache merely keeps the data at a specific location (memory, or disk with persist) while the lineage is preserved, so a lost cached partition is recomputed from its ancestors.
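The lineage truncation can be observed with toDebugString. A minimal sketch, assuming the sparkContext and checkpoint directory from the setup code above (the data and lineage output are illustrative):

```scala
// Sketch: observing lineage truncation after checkpoint (illustrative only;
// assumes setCheckpointDir has already been called on this SparkContext)
val rdd = sparkContext.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 4 == 0)

println(rdd.toDebugString)  // full lineage back to the ParallelCollectionRDD

rdd.cache()
rdd.checkpoint()
rdd.collect()  // first action: fills the cache, then writes the checkpoint

println(rdd.toDebugString)  // lineage now shows a ReliableCheckpointRDD as the root
```

After the action, the second debug string no longer lists the map and filter ancestors: recovery now reads the checkpoint files instead of recomputing the chain.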
The RDD checkpoint implementation
/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
The Dataset checkpoint implementation
/**
* Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate
* the logical plan of this Dataset, which is especially useful in iterative algorithms where the
* plan may grow exponentially. It will be saved to files inside the checkpoint
* directory set with `SparkContext#setCheckpointDir`.
*
* @group basic
* @since 2.1.0
*/
@Experimental
@InterfaceStability.Evolving
def checkpoint(): Dataset[T] = checkpoint(eager = true)
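Note that, unlike RDD.checkpoint, Dataset.checkpoint is eager by default: it materializes the Dataset immediately and returns a new Dataset whose logical plan is truncated. A minimal usage sketch, assuming the spark session from the setup code above (the column names and loop are illustrative):

```scala
// Sketch: using Dataset.checkpoint to truncate a growing logical plan
// (assumes setCheckpointDir has already been called on the SparkContext)
var df = spark.range(0, 1000).toDF("id")
for (i <- 1 to 10) {
  df = df.withColumn(s"col$i", df("id") * i)  // the logical plan grows each iteration
}

// Eager by default: computes df now and returns a Dataset with a truncated plan
val checkpointed = df.checkpoint()
checkpointed.explain()  // the plan no longer contains the chain of projections
```

This is especially useful in iterative algorithms (for example, repeated self-joins or loops that keep extending the plan), where the optimizer would otherwise have to process an ever-growing plan on every action.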