Ways to Periodically Clear Spark Streaming State

Original

2019-09-06 00:43

Original link: https://mp.weixin.qq.com/s?__biz=MzU3MzgwNTU2Mg==&mid=2247485138&idx=1&sn=8f71070470c8963e7c973b5f10bf3c03&chksm=fd3d4047ca4ac951981f7f0fa08f9f6a1821270441b5008da28115b67020b01ef2936b5f1e88&scene=21#wechat_redirect

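In Spark Streaming programs, we often need stateful streams to compute cumulative metrics, such as the page views (PV) of each product. A simple sketch of the code, using the mapWithState() operator: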

import org.apache.spark.streaming.{State, StateSpec}

// Assumption: stream yields Kafka-style records whose key() is the product ID
val productPvStream = stream
  .map(record => (record.key(), 1))  // one hit per record
  .reduceByKey(_ + _)                // PV per product within this batch
  .mapWithState(
    StateSpec.function((productId: String, pv: Option[Int], state: State[Int]) => {
      // Add this batch's count to the running total kept in the state
      val sum = pv.getOrElse(0) + state.getOption().getOrElse(0)
      state.update(sum)
      (productId, sum)
    }))
  .stateSnapshots()                  // emit every (productId, total) pair each batch

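The catch is that PV should not accumulate forever: it resets to zero every day and counting starts over. There are two ways to clear the state at midnight.

Restart the Streaming application with a script

Use crontab, Azkaban, or a similar scheduler to run the following shell script at midnight: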

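stream_app_name='com.xyz.streaming.MallForwardStreaming'
# Count the running SparkSubmit processes for this application
cnt=$(ps aux | grep SparkSubmit | grep "${stream_app_name}" | wc -l)

if [ ${cnt} -eq 1 ]; then
  pid=$(ps aux | grep SparkSubmit | grep "${stream_app_name}" | awk '{print $2}')
  # SIGKILL skips any graceful shutdown, so the batch in flight is abandoned
  kill -9 "${pid}"
  sleep 20
  cnt=$(ps aux | grep SparkSubmit | grep "${stream_app_name}" | wc -l)
  # Relaunch only once the old process is confirmed gone
  if [ ${cnt} -eq 0 ]; then
    # Trailing & backgrounds the launcher so the scheduled job can return
    nohup sh /path/to/streaming/bin/mall_forward.sh > /path/to/streaming/logs/mall_forward.log 2>&1 &
  fi
fi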

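This approach is the simplest and requires no changes to the application itself. Note that the state is only cleared if the restarted application builds a fresh StreamingContext rather than recovering one from its checkpoint directory. As the number of concurrently running Streaming jobs grows, maintaining a script and a schedule for each one becomes increasingly cumbersome.

Set a timeout on the StreamingContext

Before the application starts, compute the number of milliseconds from the current time until midnight of the next day: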

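import java.time.{Duration, ZoneId, ZonedDateTime}

// Milliseconds from now until the next midnight in the JVM's default time zone
def msTillTomorrow: Long = {
  val now = ZonedDateTime.now(ZoneId.systemDefault())
  val nextMidnight = now.toLocalDate.plusDays(1).atStartOfDay(now.getZone)
  Duration.between(now, nextMidnight).toMillis
}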
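
To make the midnight computation easier to check, one possible variant takes the current time as a parameter instead of reading the clock. This is only a minimal sketch: the object name MidnightTimeout, the function name msUntilNextMidnight, and the Asia/Shanghai zone used in the example are assumptions for illustration and do not come from the original program.

```scala
import java.time.{Duration, ZoneId, ZonedDateTime}

object MidnightTimeout {
  // Hypothetical helper: milliseconds from `now` until the next midnight
  // in the time zone carried by `now`. Passing `now` explicitly keeps the
  // function deterministic, so it can be unit-tested.
  def msUntilNextMidnight(now: ZonedDateTime): Long = {
    val nextMidnight = now.toLocalDate.plusDays(1).atStartOfDay(now.getZone)
    Duration.between(now, nextMidnight).toMillis
  }

  def main(args: Array[String]): Unit = {
    val zone = ZoneId.of("Asia/Shanghai") // assumed zone, chosen only for the example
    val now = ZonedDateTime.of(2019, 9, 5, 23, 0, 0, 0, zone)
    println(msUntilNextMidnight(now)) // one hour before midnight: prints 3600000
  }
}
```

In the application itself, the helper would be called with ZonedDateTime.now() so that it behaves like msTillTomorrow.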

Then put the main logic of the Streaming application inside a while(true) loop and, instead of calling StreamingContext.awaitTermination() as usual, call awaitTerminationOrTimeout():

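while (true) {
  // Each iteration builds a new StreamingContext, so mapWithState starts empty
  val ssc = new StreamingContext(sc, Seconds(BATCH_INTERVAL))
  ssc.checkpoint(CHECKPOINT_DIR)

  // ... processing logic ...

  ssc.start()
  // Blocks until midnight, or returns earlier if the context is stopped
  ssc.awaitTerminationOrTimeout(msTillTomorrow)
  // Keep the shared SparkContext alive; let received data finish processing
  ssc.stop(stopSparkContext = false, stopGracefully = true)
  Thread.sleep(BATCH_INTERVAL * 1000)
}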

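After msTillTomorrow milliseconds the wait times out and awaitTerminationOrTimeout() returns. Calling stop() then shuts down the StreamingContext, and the next loop iteration creates a new one with empty state. Note the two parameters of stop(): stopSparkContext controls whether the associated SparkContext is stopped as well, and stopGracefully controls whether the context stops gracefully, waiting for received data to be processed first.

Both methods above still rely on Spark Streaming's own mechanism for stateful computation. If other conditions allow, we can also drop mapWithState() and maintain the state ourselves in external storage. For example, design the Redis key as product_pv:[product_id]:[date] and issue the INCRBY command in every Spark Streaming batch. PV is then easy to count, with no need to think about timing.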
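
The key design described above can be sketched as follows. This is a minimal sketch under stated assumptions: DailyPvKeys, pvKey, InMemoryCounter, the product ID p1001, and the yyyyMMdd date format are all hypothetical. The in-memory map only stands in for Redis so that the example runs without a server; in a real job the incrBy call would go to a Redis client inside each batch.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.collection.mutable

object DailyPvKeys {
  // Assumed format for the [date] part of the key
  private val dayFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Builds product_pv:[product_id]:[date]
  def pvKey(productId: String, date: LocalDate): String =
    s"product_pv:$productId:${date.format(dayFormat)}"

  // Hypothetical stand-in that mimics Redis INCRBY semantics:
  // a missing key counts as 0, and the new total is returned.
  class InMemoryCounter {
    private val store = mutable.Map.empty[String, Long]
    def incrBy(key: String, delta: Long): Long = {
      val updated = store.getOrElse(key, 0L) + delta
      store(key) = updated
      updated
    }
  }

  def main(args: Array[String]): Unit = {
    val counter = new InMemoryCounter
    val day1 = LocalDate.of(2019, 9, 5)
    val day2 = day1.plusDays(1)
    counter.incrBy(pvKey("p1001", day1), 3)
    println(counter.incrBy(pvKey("p1001", day1), 2)) // same day accumulates: prints 5
    println(counter.incrBy(pvKey("p1001", day2), 4)) // new day, new key: prints 4
  }
}
```

Because the date is part of the key, the count starts from zero on the first increment after midnight without any reset step. Keys from earlier days stay in Redis until they are deleted or expire, so setting a TTL on them may be worth considering.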
