How to Use Checkpointing for Fault Tolerance in Spark Streaming

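I have recently been working on a real-time stream processing project built on Spark Streaming, chosen mainly because it integrates easily with Spark. A streaming application usually has to run 24/7 without interruption, so it must be able to survive unexpected failures such as a machine or system going down or a JVM crash. To make that possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that the application can recover from failures. Spark Streaming checkpoints two types of data.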

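1. Metadata checkpointing - saves the information that defines the streaming computation to fault-tolerant storage such as HDFS. It is used to recover the driver. The metadata includes:

Configuration - all the configuration used to create the streaming application

DStream operations - the series of DStream operations that define the application

Incomplete batches - batches whose jobs have been submitted but have not yet run or finished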

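2. Data checkpointing - saves the generated RDDs to reliable storage. This is required for some stateful transformations, where generating an RDD depends on RDDs from earlier batches, so the dependency chain keeps growing over time. To keep the chain from growing without bound, the intermediate RDDs are periodically saved to reliable storage, which cuts the dependency chain.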

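3. To summarize:
metadata checkpointing is used to recover when the driver program fails,
while checkpointing of the data itself (the RDDs) is mainly used to make stateful processing fault tolerant.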

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by csw on 2017/7/13.
  */
object CheckPointTest {
  Logger.getLogger("org").setLevel(Level.WARN)
  val conf = new SparkConf().setAppName("Spark shell")
  val sc = new SparkContext(conf)
  // Batch interval, in seconds
  val batchDuration = 2
  // Checkpoint directory on HDFS
  val dir = "hdfs://master:9000/csw/tmp/test3"
  // Factory function: builds a new StreamingContext. getOrCreate calls it only when no checkpoint can be loaded from dir
  def functionToCreatContext(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(batchDuration))
    ssc.checkpoint(dir)
    val fileStream: DStream[String] = ssc.textFileStream("hdfs://master:9000/csw/tmp/testStreaming")
    // Periodically persist this DStream's RDDs to HDFS, once every 5 batch intervals
    fileStream.checkpoint(Seconds(batchDuration*5))
    fileStream.foreachRDD(x => {
      val collect: Array[String] = x.collect()
      collect.foreach(x => println(x))
    })
    ssc
  }

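  def main(args: Array[String]): Unit = {
    // Rebuild the context from the checkpoint in dir if one exists; otherwise create it with the factory function
    val context: StreamingContext = StreamingContext.getOrCreate(dir, functionToCreatContext _)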
    context.start()
    context.awaitTermination()
  }
}
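
The file stream in CheckPointTest is stateless, so it does not really show the growing dependency chain that data checkpointing is meant to cut. The following is a minimal sketch of a stateful variant, assuming a running word count built with updateStateByKey. The object name StatefulWordCount, the helpers updateCount and createContext, and the checkpoint path are hypothetical names chosen for illustration; the input directory is the one used in the article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  // Hypothetical checkpoint directory, kept separate from the article's test3 directory
  val checkpointDir = "hdfs://master:9000/csw/tmp/stateful-demo"

  // Add the counts seen in the current batch to the running total for one word
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(running.getOrElse(0) + newValues.sum)

  // Factory function: all DStream logic is defined here so it can be restored from a checkpoint
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Metadata checkpointing; stateful transformations also require this directory to be set
    ssc.checkpoint(checkpointDir)

    val lines = ssc.textFileStream("hdfs://master:9000/csw/tmp/testStreaming")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .updateStateByKey[Int](updateCount _) // each batch's state depends on the previous batch

    // Data checkpointing: persist the state RDDs every 10 seconds (5 batch intervals)
    counts.checkpoint(Seconds(10))
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because every batch's state RDD is derived from the previous one, the lineage would keep growing without the periodic counts.checkpoint call, which is the situation described in item 2 above.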

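(1) In CheckPointTest, the processing logic must be written inside the functionToCreatContext function. If you write it directly in the main method, the first launch works, but after you kill the application and start it again it fails with this error: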

17/07/13 10:57:10 INFO WriteAheadLogManager  for Thread: Reading from the logs:
hdfs://master:9000/csw/tmp/test3/receivedBlockMetadata/log-1499914584482-1499914644482
17/07/13 10:57:10 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped
org.apache.spark.SparkException: org.apache.spark.streaming.dstream.MappedDStream@4735d6e5 has not been initialized
	at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:323)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)

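This error occurs because the processing logic was placed in main instead of in the factory function. The application runs normally the first time and checkpoint data is written, but the next start fails with the error above. On restart, getOrCreate rebuilds the StreamingContext and its DStreams from the checkpoint without calling the factory function, so a DStream defined in main is a new object that is not part of the recovered graph, which is why it is reported as not initialized.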

Solution: put the processing logic in the factory function, not in the main method.

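(2) After recompiling, repackaging, uploading the jar to the server, and running it again, you will find that it still fails, this time with a different error: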

17/07/13 11:26:45 ERROR util.Utils: Exception encountered
java.lang.ClassNotFoundException: streaming.CheckPointTest$$anonfun$functionToCreatContext$1
....
17/07/13 11:26:45 WARN streaming.CheckpointReader: Error reading checkpoint from file hdfs://master:9000/csw/tmp/test3/checkpoint-1499916310000
java.io.IOException: java.lang.ClassNotFoundException: streaming.CheckPointTest$$anonfun$functionToCreatContext$1
......

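The problem lies in the checkpoint. The checkpoint metadata holds serialized objects that are tied to the compiled classes in the jar. Because the code was changed and recompiled, the old checkpoint no longer matches the new jar, and deserializing it fails with the error above. How to fix it: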

The fix is simple: delete the files whose names start with checkpoint. This does not affect the checkpointed data itself.

hadoop fs -rm /csw/tmp/test3/checkpoint*

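Then start the application again and everything works, including recovering data from the checkpoint. Kill it and start it once more, and it keeps working normally.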

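Note, however, that although checkpointing secures reliability, the checkpoint interval must be chosen with care because it can affect throughput: checkpoint data is written to HDFS at every interval. The Spark Streaming documentation suggests a checkpoint interval of 5 to 10 times the DStream's sliding interval, which for a non-windowed stream is the batch interval, as a good setting to try.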
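
The arithmetic behind that rule of thumb can be sketched as a small helper. This is only an illustration: the object CheckpointInterval and the function checkpointSeconds are hypothetical names, not part of Spark, and the 5 to 10 range check simply encodes the guideline quoted above.

```scala
object CheckpointInterval {
  /** Returns a checkpoint interval in seconds that is `multiplier` times the batch interval.
    * The multiplier is limited to the 5 to 10 range from the guideline above.
    */
  def checkpointSeconds(batchSeconds: Int, multiplier: Int = 5): Int = {
    require(batchSeconds > 0, "batch interval must be positive")
    require(multiplier >= 5 && multiplier <= 10, "multiplier should be between 5 and 10")
    batchSeconds * multiplier
  }

  def main(args: Array[String]): Unit = {
    println(checkpointSeconds(2))     // prints 10
    println(checkpointSeconds(2, 10)) // prints 20
  }
}
```

In CheckPointTest, with a batch interval of 2 seconds, this gives the same 10 seconds as Seconds(batchDuration*5), and the result could be passed to checkpoint as Seconds(CheckpointInterval.checkpointSeconds(batchDuration)).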

守貓de人
