Window Computation in Spark DStream

Basic Concepts

Spark Streaming supports computing over the data that falls within a given time window.
[Figure: a window whose length is 3 micro-batches and whose slide interval is 2 micro-batches]
The figure above depicts a window whose length is 3 micro-batch intervals and whose slide interval is 2 micro-batch intervals. The micro-batches that fall into the same window are merged into one relatively large batch, the window batch.

Spark requires that every window length and every slide interval be an integer multiple of the micro-batch interval.

  1. Sliding window: window length > slide interval; adjacent windows overlap, so elements are shared between windows.
  2. Tumbling window: window length = slide interval; adjacent windows do not overlap.
    Note: a window with window length < slide interval is not supported, because it would cause data loss. A minimal sketch contrasting the two supported window types follows.
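The sketch below contrasts a sliding and a tumbling window; the 1-second batch interval, the local[5] master, and the CentOS:9999 socket source are illustrative assumptions borrowed from the examples later in this article.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowTypes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[5]").setAppName("windowTypes")
    val ssc = new StreamingContext(conf, Seconds(1)) // micro-batch interval: 1 s

    val lines = ssc.socketTextStream("CentOS", 9999)

    // sliding window: length (4 s) > slide (2 s), adjacent windows share 2 s of data
    lines.window(Seconds(4), Seconds(2)).count().print()

    // tumbling window: length (4 s) = slide (4 s), no overlap between windows
    lines.window(Seconds(4), Seconds(4)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}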

Time Attributes in Window Computation

Event Time (when the event actually occurred) < Ingestion Time (when the record enters the system) < Processing Time (when the record is actually processed)
Spark DStream currently supports only Processing Time, whereas Spark Structured Streaming also supports Event Time.
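For contrast, here is a minimal Structured Streaming sketch of an event-time window. The socket source, the use of current_timestamp() as a stand-in event-time column, and the 10-second watermark are illustrative assumptions, not part of the original article.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, explode, split, window}

object EventTimeWindow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[5]").appName("eventTimeWindow").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket").option("host", "CentOS").option("port", "9999").load()

    val words = lines
      .withColumn("timestamp", current_timestamp()) // stand-in for a real event-time column
      .select(explode(split($"value", "\\s+")).as("word"), $"timestamp")

    val counts = words
      .withWatermark("timestamp", "10 seconds") // tolerate 10 s of event-time lateness
      .groupBy(window($"timestamp", "4 seconds", "2 seconds"), $"word")
      .count()

    counts.writeStream.outputMode("update").format("console").start().awaitTermination()
  }
}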

Window Operators

| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
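Of these operators, reduceByWindow and countByValueAndWindow are not demonstrated in the examples below, so here is a minimal sketch of both. The object name, the 1-second batch interval, and the checkpoint path (reused from a later snippet) are assumptions; countByValueAndWindow is built on an incremental (inverse) reduce internally, so checkpointing must be enabled.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OtherWindowOperators {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[5]").setAppName("otherWindowOperators")
    val ssc = new StreamingContext(conf, Seconds(1))
    // required by countByValueAndWindow (incremental reduce under the hood)
    ssc.checkpoint("file:///D:/checkpoints")

    val words = ssc.socketTextStream("CentOS", 9999).flatMap(_.split("\\s+"))

    // reduceByWindow: one aggregated value per window, here the longest word seen in the window
    words.reduceByWindow((a, b) => if (a.length >= b.length) a else b, Seconds(4), Seconds(2)).print()

    // countByValueAndWindow: frequency of each distinct word within the window
    words.countByValueAndWindow(Seconds(4), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}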

window(windowLength, slideInterval)

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  *       Window computation: window length > slide interval (sliding window)
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf, Milliseconds(100)) // micro-batch interval: 100 ms

    scc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .window(Seconds(4), Seconds(2)) // window length > slide interval: sliding window
      .reduceByKey(_ + _)
      .print()

    scc.start()
    scc.awaitTermination()
  }
}

The operators that can follow window include count, reduce, reduceByKey, and countByValue. For convenience, Spark provides combined operators, for example:

window + count is equivalent to countByWindow(windowLength, slideInterval), and window + reduceByKey is equivalent to reduceByKeyAndWindow(func, windowLength, slideInterval).
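As a small illustration of these equivalences (a fragment only: lines is assumed to be a DStream[String] obtained as in the examples above, and countByWindow additionally requires ssc.checkpoint(...) because it uses an incremental inverse reduce internally):

// window + count versus the combined operator
lines.window(Seconds(4), Seconds(2)).count().print()
lines.countByWindow(Seconds(4), Seconds(2)).print()

// window + reduceByKey versus the combined operator
lines.map((_, 1)).window(Seconds(4), Seconds(2)).reduceByKey(_ + _).print()
lines.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(4), Seconds(2)).print()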

reduceByKeyAndWindow

package com.hw.demo06

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

/**
  * @author fql
  * @date 2019/10/4 11:48
  *       Window computation with the combined operator reduceByKeyAndWindow
  */
object WindowCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[5]")
    conf.setAppName("windowCount")

    val scc = new StreamingContext(conf, Milliseconds(100)) // micro-batch interval: 100 ms

    scc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(4), Seconds(3)) // 4 s window, 3 s slide
      .print()

    scc.start()
    scc.awaitTermination()
  }
}

If successive windows overlap by more than half, the window result can be computed more efficiently with the incremental form below (checkpointing must be enabled):

val sparkConf = new SparkConf().setAppName("WindowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf, Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoints") // required by the incremental form
ssc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow( // when windows overlap by more than 50%, the incremental form is more efficient
      (v1: Int, v2: Int) => v1 + v2, // add elements entering the window to the previous window's result
      (v1: Int, v2: Int) => v1 - v2, // subtract elements leaving the window
      Seconds(4),
      Seconds(3),
      filterFunc = (t) => t._2 > 0 // drop keys whose count has fallen to zero
  )
  .print()

ssc.start()
ssc.awaitTermination()

DStream Output

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
val sparkConf = new SparkConf().setAppName("WindowWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf, Milliseconds(100))
ssc.sparkContext.setLogLevel("FATAL")
ssc.socketTextStream("CentOS", 9999, StorageLevel.MEMORY_AND_DISK)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2, // aggregate the values of each key within the window
      Seconds(60),
      Seconds(1)
  )
  .filter(t => t._2 > 10) // keep only words appearing more than 10 times in the window
  .foreachRDD(rdd => {
      rdd.foreachPartition(vs => {
          vs.foreach(v => KafkaSink.send2Kafka(v._1, v._2.toString))
      })
  })

ssc.start()
ssc.awaitTermination()
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink {

  private def createKafkaProducer(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, "10")  // send a batch once it reaches 10 bytes
    props.put(ProducerConfig.LINGER_MS_CONFIG, "1000") // or after waiting at most 1000 ms
    new KafkaProducer[String, String](props)
  }

  val kafkaProducer: KafkaProducer[String, String] = createKafkaProducer()

  def send2Kafka(k: String, v: String): Unit = {
    val message = new ProducerRecord[String, String]("topic01", k, v)
    kafkaProducer.send(message)
  }

  // flush and close the producer when the JVM shuts down
  Runtime.getRuntime.addShutdownHook(new Thread() {
    override def run(): Unit = {
      kafkaProducer.flush()
      kafkaProducer.close()
    }
  })
}

Note: By default, Spark emits a window's final result only after the window's end time has passed; this output style is usually referred to as clamped (end-of-window) output.
