Spark Streaming: Plugging an Arbitrary Data Source in as a Stream
Motivation
When a project calls for stream processing, Spark Streaming or Storm is the usual choice; Storm ingests data through Spouts, while Spark Streaming ingests it as Streams. To make local testing easier I went with Spark Streaming, but it only ships with a handful of officially supported sources (sockets, files, Kafka, Flume, and so on). When some other high-throughput source has to be consumed as a stream, it is time for our protagonist, the Receiver, to take the stage.
The key class to implement
Receiver is a mechanism built into Spark: define a custom data source by subclassing Receiver, then pass an instance to the StreamingContext's receiverStream method to turn the incoming data into RDDs, after which you can work with Spark Streaming exactly as you would with Kafka, Flume, and the other built-in sources. Under the hood, receiverStream returns a ReceiverInputDStream.
class MyReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
def onStart() {
// Setup stuff (start threads, open sockets, etc.) to start receiving data.
// Must start new thread to receive data, as onStart() must be non-blocking.
// Call store(...) in those threads to store received data into Spark's memory.
// Call stop(...), restart(...) or reportError(...) on any thread based on how
// different errors need to be handled.
// See corresponding method documentation for more details
}
def onStop() {
// Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
}
}
Two methods have to be implemented here: onStart and onStop. onStart carries the actual logic of your data source. Per the official docs, onStart is where you start threads, open sockets, and so on to begin receiving data; it must be non-blocking, so the actual receiving has to run in a newly started thread. Inside those threads, call store() to put the received data into Spark's memory as part of the stream; store is provided by Receiver itself, so there is nothing for you to implement. Note that the client you connect with must be non-blocking as well: connecting to several ports at once, or consuming a key that only allows a single consumer thread, will otherwise raise exceptions.
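To make that contract concrete, here is a minimal toy receiver (the class name and the fabricated records are illustrative, not from the original article): onStart returns immediately while a background thread loops until the receiver is stopped, handing one record per second to Spark via store():

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Toy receiver that fabricates one record per second, just to show the
// onStart()/store() contract.
class TickReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    new Thread("Tick Receiver") {
      override def run(): Unit = {
        var n = 0L
        while (!isStopped) {     // isStopped is provided by Receiver
          store(s"tick-$n")      // hand one record to Spark's memory
          n += 1
          Thread.sleep(1000)
        }
      }
    }.start()                    // onStart() itself does not block
  }

  def onStop(): Unit = ()        // nothing to clean up; the thread exits via isStopped
}
```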
Implementation
The Spark Streaming driver:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// appName, interval, host and port are assumed to be defined elsewhere
// (e.g. parsed from the command line)
val sparkConf = new SparkConf().setAppName(appName)
val ssc = new StreamingContext(sparkConf, Seconds(interval.toInt))
// MyReceiver(host, port) matches the constructor defined below
val stream = ssc.receiverStream(new MyReceiver(host, port))
stream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach(line => {
      println(line)
    })
  })
})
try {
  ssc.start()
  ssc.awaitTermination()
} catch {
  case e: Exception =>
    e.printStackTrace()
}
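One caveat when testing locally (a known Spark Streaming constraint, not from the original article): a receiver permanently occupies one core, so the local master must provide at least two, roughly like:

```scala
import org.apache.spark.SparkConf

// "local[1]" would leave no core for processing after the receiver
// claims its own -- always use at least local[2] for local tests.
// The app name here is illustrative.
val sparkConf = new SparkConf()
  .setAppName("my-receiver-test")
  .setMaster("local[2]")
```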
The MyReceiver class:
Roughly speaking: onStart starts a thread that runs the receive function; receive sets up the connection to your data server, fetches data, and calls store on whatever it gets, which lands it in Spark's memory. Normally receive simply loops with while (true), unless the stream is time-limited (which is rare).
1) onStop can be left empty; the method that matters is onStart.
2) Adjust the StorageLevel to match your cluster environment.
3) If the client is non-blocking, you can also start multiple threads in onStart to increase throughput.
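Point 3 might look like the following sketch; MyClient is a placeholder for whatever client library you use and is assumed here to be safe to call from several threads:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Sketch of tip 3: several receiving threads inside one receiver.
class MultiThreadedReceiver(host: String, port: String, numThreads: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Each thread opens its own connection and stores records independently.
    for (i <- 1 to numThreads) {
      new Thread(s"Receiver thread $i") {
        override def run(): Unit = receive()
      }.start()
    }
  }

  def onStop(): Unit = ()

  private def receive(): Unit = {
    val client = new MyClient(host, port)   // hypothetical client
    while (!isStopped) {
      val message = client.get()            // hypothetical fetch call
      if (message != null) store(message)
    }
  }
}
```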
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// MyClient stands in for whatever client your data source needs
class MyReceiver(host: String, port: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up; the receiving thread watches isStopped and exits on its own.
  }

  private def receive(): Unit = {
    var client: MyClient = null
    try {
      client = new MyClient(host, port)
    } catch {
      case e: Exception =>
        e.printStackTrace()
        // Let Spark restart the receiver instead of looping on a dead client
        restart("MyClient failed to connect!", e)
        return
    }
    while (!isStopped) {
      try {
        val message = client.get(key) // key: whatever lookup your client requires
        if (message != null) store(message)
      } catch {
        case e: Exception =>
          e.printStackTrace()
      }
    }
  }
}
Tips:
Spark itself also ships receiver implementations such as RawNetworkReceiver and SocketReceiver; if you are interested, you can implement your own along those lines, taking the docs and the code above as reference. The core is always the same: onStart defines how the data source is hooked in.
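For reference, here is a sketch in the spirit of SocketReceiver, closely following the custom-receiver example in the official Spark Streaming programming guide:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Reads newline-delimited text from a TCP socket, one record per line.
class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = ()   // the receiving thread stops once isStopped turns true

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                       // hand each line to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to reconnect")      // ask Spark to restart the receiver
    } catch {
      case e: java.net.ConnectException =>
        restart(s"Error connecting to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
```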