Spark Streaming: Accepting an Arbitrary Data Source as a Stream


Motivation

When a project runs into a stream-processing problem, the usual tools are Spark Streaming or Storm. Storm ingests streams through Spouts, while Spark Streaming ingests them as Streams (DStreams). For the sake of easy local testing I chose Spark Streaming, but out of the box it only supports a handful of sources: basic ones such as file systems and sockets, plus advanced ones such as Kafka, Flume, and Kinesis. When some other high-throughput source needs to be consumed as a stream, the protagonist of this post, the Receiver, takes the stage.

The Key Class

Receiver

Receiver is a mechanism implemented inside Spark: subclass Receiver and you have a custom data source; pass the instance to the StreamingContext's receiverStream method and the received data is turned into RDDs, after which Spark Streaming can be used exactly as it is with Kafka, Flume, and the other built-in sources. Under the hood, what receiverStream returns is a ReceiverInputDStream.
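Concretely, the return type can be pinned down with one line; a minimal sketch, assuming ssc is an existing StreamingContext and MyReceiver is the skeleton class defined just below:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream

// receiverStream is typed by the receiver's element type.
val stream: ReceiverInputDStream[String] =
  ssc.receiverStream(new MyReceiver(StorageLevel.MEMORY_AND_DISK_2))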

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
    def onStart() {
        // Setup stuff (start threads, open sockets, etc.) to start receiving data.
        // Must start new thread to receive data, as onStart() must be non-blocking.

        // Call store(...) in those threads to store received data into Spark's memory.

        // Call stop(...), restart(...) or reportError(...) on any thread based on how
        // different errors need to be handled.

        // See corresponding method documentation for more details
    }

    def onStop() {
        // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
    }
}

Two methods have to be implemented here: onStart and onStop. onStart holds the concrete logic for your data source. In the words of the official documentation, inside onStart you start threads and open sockets to begin receiving data; a new thread is mandatory, because onStart() itself must be non-blocking. Those threads call store() to put the received data into Spark's memory as part of the stream; store() comes with Receiver, so there is nothing to implement yourself. One caveat: the client you connect with must be non-blocking. Connecting to several ports at once, or letting several threads consume a key that only supports a single consumer, will raise exceptions.
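One more point worth knowing: besides storing one record at a time, Receiver also offers batched store() variants (taking an ArrayBuffer or an Iterator), which cut per-record overhead for high-throughput sources. A minimal, runnable sketch; the fetchLine helper is hypothetical and only fabricates records:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class BufferedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Buffered Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    var buffer = new ArrayBuffer[String]()
    while (!isStopped()) {
      buffer += fetchLine()
      if (buffer.size >= 100) {
        store(buffer)                      // batched store: one block per 100 records
        buffer = new ArrayBuffer[String]() // fresh buffer; Spark may still hold the old one
      }
    }
  }

  // Hypothetical stand-in for reading one record from a real source.
  private def fetchLine(): String = {
    Thread.sleep(10)
    s"record at ${System.currentTimeMillis()}"
  }
}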

 

Implementation

The Spark Streaming main class:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val appName = "my-receiver-app" // placeholder application name
    val interval = "5"              // batch interval in seconds
    val host = "localhost"          // placeholder: where MyClient connects
    val port = "9999"
    val key = "my-key"              // placeholder: the key MyClient consumes

    val sparkConf = new SparkConf().setAppName(appName)
    val ssc = new StreamingContext(sparkConf, Seconds(interval.toInt))
    // host, port and key must match what MyReceiver (defined below) expects
    val stream = ssc.receiverStream(new MyReceiver(host, port, key))
    stream.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        partition.foreach(line => {
          println(line)
        })
      })
    })

    try {
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
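A side note on shutdown: if the job ever needs to stop cleanly instead of running until killed, StreamingContext provides a graceful stop that lets already-received data finish processing first. A one-line sketch using the ssc defined above:

    // Stops the streaming context and the underlying SparkContext,
    // letting received-but-unprocessed data finish first.
    ssc.stop(stopSparkContext = true, stopGracefully = true)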

 

The MyReceiver class:

Roughly: onStart spawns a thread that runs the receive function; receive initializes the connection to your data server, fetches data, and hands whatever it gets to store(), which puts it into Spark's memory. In the normal case the receive loop simply runs until the receiver is stopped, i.e. effectively while (true); the exception is time-limited stream processing, which is rare.

1) onStop can be left empty; the essential method to implement is onStart.

2) Adjust the StorageLevel to suit your cluster environment.

3) If the client is non-blocking, onStart can also spawn multiple threads to increase throughput (see the sketch after the class below).

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver


// MyClient stands in for an arbitrary connection type;
// key is a constructor parameter so that client.get(key) compiles.
class MyReceiver(host: String, port: String, key: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receiving must happen on a new thread so that onStart() returns immediately.
    new Thread("Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up: the receive loop exits on its own once isStopped() is true.
  }

  private def receive(): Unit = {
    var client: MyClient = null
    try {
      client = new MyClient(host, port)
    } catch {
      case e: Exception =>
        // Could not connect: let Spark restart the receiver instead of looping on a null client.
        restart(s"Failed to connect MyClient to $host:$port", e)
        return
    }

    while (!isStopped()) {
      try {
        val message = client.get(key)
        if (message != null) store(message)
      } catch {
        case e: Exception => e.printStackTrace()
      }
    }
  }

}
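To make note 3 concrete: when the source allows concurrent consumers, onStart can start one fetch thread per key, each calling store() on its own (the official skeleton explicitly allows store() to be called from multiple threads). A minimal sketch under the same assumptions as above; MyClient and the key list are hypothetical:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MultiThreadedReceiver(host: String, port: String, keys: Seq[String])
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // One fetch thread per key; each thread calls store() independently.
    keys.foreach { key =>
      new Thread(s"Receiver-$key") {
        override def run(): Unit = receive(key)
      }.start()
    }
  }

  def onStop(): Unit = {}

  private def receive(key: String): Unit = {
    val client = new MyClient(host, port) // hypothetical: one connection per thread
    while (!isStopped()) {
      try {
        val message = client.get(key)
        if (message != null) store(message)
      } catch {
        case e: Exception => reportError(s"Error fetching key $key", e)
      }
    }
  }
}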

 

Tips:

For concrete Receiver implementations there are also Spark's own RawNetworkReceiver and SocketReceiver; if you are interested, you can write your own by consulting the documentation and following the pattern above. The core is always how onStart defines the connection to the data source.
