Apache Spark Streaming - Lecture Notes

Spark Streaming

Definition of Stream Computing

Stream computing is usually contrasted with batch computing. In the streaming model, the input is continuous and can be regarded as unbounded in time, which means the complete data set is never available before computing. The results are also produced continuously, i.e. they are unbounded in time as well. Stream computing usually has strict latency requirements: the target computation is defined first, and the computation logic is then applied to the data as it arrives. To improve efficiency, incremental computation is used instead of recomputation over the full data set whenever possible. In the batch model, by contrast, the complete data set exists first, the computation logic is then defined and applied to that full data set; the computation runs over all the data, and the results are output all at once.

Spark Streaming is a stream processing framework built on top of Spark's batch engine. Unlike batch processing, the data it computes over is an unbounded stream and the output is continuous. Under the hood, Spark Streaming splits the input into a series of small RDD batches (micro batches) to approximate stream processing, so at the micro level Spark Streaming is still a batch framework.
Batch Processing vs. Stream Processing

The comparison covers four dimensions: input form, data volume, latency, and computation model.

  • Batch processing - the input is generally a static data set; the data volume is large (GB and up); latency is high (minutes to hours); the computation runs in stages and then terminates.
  • Stream processing - the input is always dynamic; data arrives at the level of individual records (bytes); latency is low (milliseconds); the computation runs continuously, 7x24.

Mainstream stream-processing frameworks today: Kafka Streams (lightweight); Storm (JStorm) and Storm 2.0+ (first generation); Spark Streaming (second generation); Flink (Blink) (third generation).

快速入門

  • pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
  • Write the driver
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[6]")
    val ssc = new StreamingContext(conf, Seconds(1))

    //2. Create the lines stream (socket receiver, for testing)
    val linesStream: ReceiverInputDStream[String] = ssc.socketTextStream("CentOS", 9999)

    //3. Apply transformation operators to linesStream
    linesStream.flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
        .print() // print the results to the console

    //4. Start the computation
    ssc.start()
    //5. Wait for termination
    ssc.awaitTermination()
  }
}

You need to install netcat first: yum install -y nc
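With netcat installed, start a listener on the host and port the driver connects to (the example above uses CentOS and port 9999); every line typed into this terminal is then delivered to the socket receiver:

[root@CentOS ~]# nc -lk 9999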

Discretized Streams (DStreams)

A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data: either the input stream received from a source, or the processed stream produced by transforming an input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain time interval.

Any operation applied on a DStream translates into operations on the underlying RDDs. For example, in the earlier Quick Start example, the flatMap operation is applied to every RDD of the lines DStream to generate the RDDs of the words DStream.


Note: given how DStreams execute underneath, the Seconds() interval chosen for the StreamingContext should be slightly larger than the time it takes to compute one micro batch; only then do batches not pile up and back pressure is effectively avoided. See https://blog.csdn.net/lingbo229/article/details/82380555
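If batches do occasionally take longer than the interval, Spark Streaming also ships a rate-limiting mechanism that can be switched on through configuration. A minimal sketch (spark.streaming.backpressure.enabled and spark.streaming.receiver.maxRate are standard Spark Streaming configuration properties; the rate value here is only illustrative):

val conf = new SparkConf()
  .setAppName("wordcount")
  .setMaster("local[6]")
  // let Spark adjust the receiving rate automatically based on batch scheduling delay
  .set("spark.streaming.backpressure.enabled", "true")
  // optional hard cap on records per second per receiver (illustrative value)
  .set("spark.streaming.receiver.maxRate", "1000")
val ssc = new StreamingContext(conf, Seconds(1))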

Structure of a Spark Streaming Program

  • Build a StreamingContext(sparkConf, Seconds(1)).
  • Set up the data Receiver (Basic | Advanced).
  • Apply transformation operators to the DStream (micro-batch RDDs).
  • Start the streaming computation with ssc.start().
  • Wait for the computation to terminate with ssc.awaitTermination().

Input DStreams and Receivers

In Spark Streaming, every input DStream (except file streams) is backed by a Receiver implementation. Each Receiver is responsible for receiving data from an external system and storing it in Spark's memory for later processing. Spark provides two kinds of built-in sources for reading data from external systems.

  • Basic sources: created directly through the StreamingContext API, e.g. file systems and socket connections.
  • Advanced sources: e.g. Kafka and Flume; these are created with additional utility classes and generally require third-party dependencies.

For example, Kafka requires the spark-streaming-kafka-0-10_2.11 dependency.

Note: each Receiver occupies one core. When running a Spark Streaming program, make sure the application is allocated n cores, where n is greater than the number of receivers; otherwise Spark can only receive data but has no cores left to process it.

Basic Sources

File Streams

File streams can read any file format compatible with the HDFS API. The DStream can be created with StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass]. File streams do not need an extra Receiver, which means no additional cores have to be allocated for them.

  • textFileStream
val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines: DStream[String] = ssc.textFileStream("hdfs://CentOS:9000/files")

lines.flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()

ssc.start()
ssc.awaitTermination()
  • fileStream
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("fatal")

val lines: DStream[(LongWritable, Text)] = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs://CentOS:9000/files")

lines.map(t => t._2.toString)
    .flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()

ssc.start()
ssc.awaitTermination()

Note: before using file streams, synchronize the clock of the node running the job with the clocks of the HDFS nodes.

[root@CentOS ~]# yum install -y ntp
[root@CentOS ~]# ntpdate time.ntp.org
16 Aug 15:10:55 ntpdate[1807]: step time server 139.199.215.251 offset -28784.287137 sec
[root@CentOS ~]# clock -w # write the synchronized time to the hardware clock

Once a file is considered processed (its processing window has passed), Spark ignores subsequent updates to that file. So when processing text data with file streams, upload the file to an HDFS directory that is not being monitored first, and move it into the monitored directory only after the upload has finished.

[root@CentOS ~]# hdfs dfs -put install.log.syslog  /
[root@CentOS ~]# hdfs dfs -mv /install.log.syslog  /files

Custom Receivers

To build a custom receiver, extend org.apache.spark.streaming.receiver.Receiver and specify a storage level; the Receiver's job is to receive data from the external system and write it into Spark. The Receiver class provides a store method for writing received data into Spark's memory.

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MySocketReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = { // receive data from the external system in onStart
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  override def onStop(): Unit = {}

  // read data from the remote end over a network socket
  private def receive() {
    var socket: Socket = null
    var userInput: String = null
    try {
      // connect to netcat
      socket = new Socket(host, port)
      // read remote data
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      userInput = reader.readLine()
      while(!isStopped && userInput != null) {
        store(userInput) // store into Spark
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      // restart the receiver
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        // restart if could not connect to server
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        // restart if there is any other error
        restart("Error receiving data", t)
    }
  }
}
object MySocketReceiver{
  def apply(host: String, port: Int): MySocketReceiver = new MySocketReceiver(host, port)
}
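A minimal sketch of wiring the custom receiver into a driver via ssc.receiverStream (the application name is arbitrary; host and port are the same ones used in the earlier examples):

val conf = new SparkConf().setAppName("customreceiver").setMaster("local[6]")
val ssc = new StreamingContext(conf, Seconds(1))

// create a DStream backed by the custom receiver defined above
val lines = ssc.receiverStream(MySocketReceiver("CentOS", 9999))

lines.flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()

ssc.start()
ssc.awaitTermination()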

Queue RDDs as a Stream (for testing)

To test a Spark Streaming application with test data, you can also create a DStream from a queue of RDDs using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as one batch of data in the DStream and processed like a stream.

import scala.collection.mutable

import org.apache.spark.rdd.RDD

val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("fatal")
val queue = new mutable.Queue[RDD[String]]()

val lines: DStream[String] = ssc.queueStream(queue)
lines.flatMap(_.split("\\s+"))
    .map((_, 1))
    .reduceByKey(_ + _)
    .print()

ssc.start()

// feed test RDDs into the queue from a separate thread
new Thread(){
    override def run(): Unit = {
        for(i <- 0 to 100){
            queue += ssc.sparkContext.makeRDD(List("hello spark"))
            Thread.sleep(100)
        }
    }
}.start()

ssc.awaitTermination()

Kafka Source

  • pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.5</version>
</dependency>

The 0-10 integration works with Kafka brokers 0.10 and above and is not compatible with earlier versions. (Interview point: the differences between Kafka 0.8, 0.10 and 0.11.)

  • Write the driver
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[6]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.sparkContext.setLogLevel("FATAL")
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "g1",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])

    //read data from Kafka directly; the stream tracks the topic offsets itself, which can be persisted via checkpoint
    val messages = KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,//location strategy: PreferConsistent is suitable when the Spark executors and Kafka brokers do not run on the same hosts
      ConsumerStrategies.Subscribe[String, String](List("topic01"), kafkaParams))

    messages.map(record=>record.value)
    .flatMap(line=>line.split(" "))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .print()
    
    ssc.start()
    ssc.awaitTermination()
  }
}
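As an alternative to checkpointing the offsets, the 0-10 integration can also commit offsets back to Kafka once a batch has been processed. A sketch in the spirit of the spark-streaming-kafka-0-10 integration guide, showing extra code that could sit inside main above, before ssc.start(), operating on the `messages` stream returned by createDirectStream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

messages.foreachRDD { rdd =>
  // offset ranges are only available on the RDDs produced directly by createDirectStream
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, then commit this batch's offsets back to Kafka ...
  messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}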

An older interview question on this topic: https://www.cnblogs.com/runnerjack/p/8597981.html

Transformations on DStreams

Similar to RDD transformations, DStream transformation operators turn one DStream into another. Most of the common DStream operators are used in exactly the same way as their Spark RDD counterparts.

Transformation Meaning
map(func) Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions) Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream) Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count() Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func) Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
countByValue() When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]) When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]) When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]) When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func) Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func) Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
map
//1,zhangsan,true
lines.map(line=> line.split(","))
    .map(words=>(words(0).toInt,words(1),words(2).toBoolean))
    .print()
flatMap
//hello spark
lines.flatMap(line=> line.split("\\s+"))
        .map((_,1)) //(hello,1)(spark,1)
        .print()
filter
//keep only the lines that contain hello
lines.filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()
repartition (change the number of partitions)
lines.repartition(10) //change the program's parallelism (number of partitions)
    .filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()
union (merge two streams)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
    .filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()
count
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
    .flatMap(line=> line.split("\\s+"))
    .count() //count the number of elements in each micro-batch RDD
    .print()

reduce(func)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)  // aa bb
    .flatMap(line=> line.split("\\s+"))
    .reduce(_+"|"+_)
    .print() //aa|bb
countByValue (count occurrences of each value)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10) // a a b c
    .flatMap(line=> line.split("\\s+"))
    .countByValue()  // (a,2) (b,1) (c,1)
    .print()
reduceByKey(func, [numTasks])
var lines:DStream[String]=ssc.socketTextStream("CentOS",9999) //this is spark this
    lines.repartition(10)
    .flatMap(line=> line.split("\\s+").map((_,1)))
    .reduceByKey(_+_)// (this,2)(is,1)(spark ,1)
    .print()

join(otherStream, [numTasks])

//1 zhangsan
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
//1 apple 1 4.5
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)

val userPair:DStream[(String,String)]=stream1.map(line=>{
    var tokens= line.split(" ")
    (tokens(0),tokens(1))
})
val orderItemPair:DStream[(String,(String,Double))]=stream2.map(line=>{
    val tokens = line.split(" ")
    (tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
userPair.join(orderItemPair)
.map(t=>(t._1,t._2._1,t._2._2._1,t._2._2._2))//1 zhangsan apple 4.5
.print()

Both streams must produce data in the same batch interval, otherwise the join yields nothing, so this operator is of limited practical use.

transform

transform lets you compute with both the stream and an RDD: it exposes the underlying micro-batch RDD, which makes stream-batch joins possible.

//1 apple 2 4.5
val orderLog: DStream[String] = ssc.socketTextStream("CentOS",8888)
var userRDD=ssc.sparkContext.makeRDD(List(("1","zhangs"),("2","wangw")))
val orderItemPair:DStream[(String,(String,Double))]=orderLog.map(line=>{
    val tokens = line.split(" ")
    (tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
orderItemPair.transform(rdd=> rdd.join(userRDD))
.print()
updateStateByKey (stateful computation, full output)
ssc.checkpoint("file:///D:/checkpoints")//stores the program's state information as well as the checkpointed DStream graph

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .updateStateByKey((newValues:Seq[Int],state:Option[Int])=>{//complete (full) output
        val newValue = newValues.sum
        val historyValue=state.getOrElse(0)
        Some(newValue+historyValue)
    })
.print()

A checkpoint directory must be set to store the program's state information.

mapWithState (stateful computation, incremental output)
ssc.checkpoint("file:///D:/checkpoints")//stores the program's state information as well as the checkpointed DStream graph

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.mapWithState(StateSpec.function((k:String,v:Option[Int],stage:State[Int])=>{
    var total:Int=0
    if(stage.exists()){
        total=stage.getOption().getOrElse(0)
    }
    total += v.getOrElse(0)
    stage.update(total)//update the historical state
    (k,total)
}))
.print() //update-style output: only the updated key-value pairs are emitted

A checkpoint directory must be set to store the program's state information.

Window Operations

Spark Streaming supports window computations, which group data along the time dimension: all records falling into a given time range become the input of that window's computation.

The window length and the slide interval must both be integer multiples of the batch interval. For example, with a 2-second batch interval, the window length can only be 2, 4, 6, 8, ... seconds, and the slide interval 2, 4, 6, ... seconds up to the window length.

In stream processing, there are generally three notions of time that a window can be based on: ingestion time, processing time and event time.

By its definition, a DStream currently only performs processing-time based computation. If you want event-time based computation, Spark supports that as well: event-time semantics are implemented in the Structured Streaming module.
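For reference, a minimal event-time sketch in Structured Streaming, assuming a streaming DataFrame `words` with an event-time column `timestamp` and a `word` column (both names are illustrative):

import org.apache.spark.sql.functions.{col, window}

val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")  // tolerate events that arrive up to 10 minutes late
  .groupBy(
    window(col("timestamp"), "10 minutes", "5 minutes"),  // 10-minute windows sliding every 5 minutes
    col("word"))
  .count()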

Some common window operations are listed below. All of them take two parameters - windowLength and slideInterval.

Transformation Meaning
window(windowLength, slideInterval) Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval) Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval) Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
countByValueAndWindow(windowLength,slideInterval, [numTasks]) When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
window(windowLength, slideInterval)

Every 5 seconds, trigger a window covering the last 10 seconds.

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .window(Seconds(10),Seconds(5))
    .reduceByKey(_+_)
    .print() 

From observation we can draw a conclusion: for example, countByWindow is equivalent to window + count, and reduceByKeyAndWindow is equivalent to window + reduceByKey.

Since count, reduce, reduceByKey and countByValue have all been covered in the sections above, only reduceByKeyAndWindow is demonstrated here:

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    //.window(Seconds(10),Seconds(5))
    //.reduceByKey(_+_)
    .reduceByKeyAndWindow((v1:Int,v2:Int)=>v1+v2,Seconds(10),Seconds(5))
    .print()

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

When more than half of the elements of consecutive windows overlap, the version with invFunc is more efficient: only the data entering the window is reduced and the data leaving it is "inverse reduced", instead of recomputing over the whole window.

val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
val ssc = new StreamingContext(conf,Seconds(1))
ssc.sparkContext.setLogLevel("fatal")
ssc.checkpoint("file:///D:/checkpoints")

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .reduceByKeyAndWindow(
        (v1:Int,v2:Int)=>v1+v2,
        (v1:Int,v2:Int)=>v1-v2,
        Seconds(6),Seconds(3),
        filterFunc = (t:(String,Int))=> t._2>0)
    .print()

ssc.start()
ssc.awaitTermination()

Output Operations

Output Operation Meaning
print() Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
foreachRDD(func) The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
foreachRDD(func)

Implement a custom KafkaSink to write the results of the streaming computation to Kafka.

import java.util.{Properties, UUID}

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

class KafkaSink(topic:String,severs:String) extends Serializable {

  def createKafkaConnection(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,severs)
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true")//enable idempotence
    props.put(ProducerConfig.RETRIES_CONFIG,"2")//number of retries
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"100")//producer batch size (bytes)
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000")//linger for at most 1000 ms
    new KafkaProducer[String,String](props)
  }

  lazy val kafkaProducer:KafkaProducer[String,String]= createKafkaConnection()
  Runtime.getRuntime.addShutdownHook(new Thread(){
    override def run(): Unit = {
      kafkaProducer.close()
    }
  })
  def save(vs: Iterator[(String, Int)]): Unit = {

    try{
      vs.foreach(tuple=>{
        val record = new ProducerRecord[String,String](topic,tuple._1,tuple._2.toString)
        kafkaProducer.send(record)
      })

    }catch {
      case e:Exception=> println("send an alert e-mail: writing to Kafka failed")
    }

  }
}
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoints")
val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
    ConsumerConfig.GROUP_ID_CONFIG -> "g1",
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])

val kafkaSinkBroadcast=ssc.sparkContext.broadcast(new KafkaSink("topic02","CentOS:9092"))

//read data from Kafka directly; the stream tracks the topic offsets itself, which can be persisted via checkpoint
val messages = KafkaUtils.createDirectStream[String, String](ssc,
                                                             LocationStrategies.PreferConsistent,//location strategy: PreferConsistent is suitable when the Spark executors and Kafka brokers do not run on the same hosts
                                                             ConsumerStrategies.Subscribe[String, String](List("topic01"), kafkaParams))

messages.map(record=>record.value)
.flatMap(line=>line.split(" "))
.map(word => (word, 1))
.mapWithState(StateSpec.function((k:String,v:Option[Int],stage:State[Int])=>{
    var total:Int=0
    if(stage.exists()){
        total=stage.getOption().getOrElse(0)
    }
    total += v.getOrElse(0)
    stage.update(total)//update the historical state
    (k,total)
}))
.foreachRDD(rdd=>{
    rdd.foreachPartition(vs=>{
        val kafkaSink = kafkaSinkBroadcast.value
        kafkaSink.save(vs)
    })
})

ssc.start()
ssc.awaitTermination()

DStream Fault Recovery

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink extends Serializable {

  def createKafkaConnection(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true")//enable idempotence
    props.put(ProducerConfig.RETRIES_CONFIG,"2")//number of retries
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"100")//producer batch size (bytes)
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000")//linger for at most 1000 ms
    new KafkaProducer[String,String](props)
  }

  lazy val kafkaProducer:KafkaProducer[String,String]= createKafkaConnection()
  Runtime.getRuntime.addShutdownHook(new Thread(){
    override def run(): Unit = {
      kafkaProducer.close()
    }
  })
  def save(vs: Iterator[(String, Int)]): Unit = {

    try{
      vs.foreach(tuple=>{
        val record = new ProducerRecord[String,String]("topic02",tuple._1,tuple._2.toString)
        kafkaProducer.send(record)
      })

    }catch {
      case e:Exception=> println("send an alert e-mail: writing to Kafka failed")
    }

  }
}
val checkpointDir="file:///D:/checkpointdir"
val ssc=StreamingContext.getOrCreate(checkpointDir,()=>{
    println("==========init ssc==========")
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[6]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint(checkpointDir)
    val kafkaParams = Map[String, Object](
        ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
        ConsumerConfig.GROUP_ID_CONFIG -> "g1",
        ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
        ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])


    //read data from Kafka directly; the stream tracks the topic offsets itself, which can be persisted via checkpoint
    val messages = KafkaUtils.createDirectStream[String, String](ssc,
                                                                 LocationStrategies.PreferConsistent,//location strategy: PreferConsistent is suitable when the Spark executors and Kafka brokers do not run on the same hosts
                                                                 ConsumerStrategies.Subscribe[String, String](List("topic01"), kafkaParams))

    messages.map(record=>record.value)
    .flatMap(line=>line.split(" "))
    .map(word => (word, 1))
    .mapWithState(StateSpec.function((k:String,v:Option[Int],stage:State[Int])=>{
        var total:Int=0
        if(stage.exists()){
            total=stage.getOption().getOrElse(0)
        }
        total += v.getOrElse(0)
        stage.update(total)//update the historical state
        (k,total)
    }))
    .foreachRDD(rdd=>{
        rdd.foreachPartition(vs=>{
            KafkaSink.save(vs)
        })
    })
    ssc
})

ssc.sparkContext.setLogLevel("FATAL")

ssc.start()
ssc.awaitTermination()

If you modify the code of the streaming job, you must clear the checkpoint directory; otherwise the changes will not take effect, because the old DStream graph is restored from the checkpoint.
