Spark Streaming Window Functions

Window Operations

Spark Streaming also provides windowed computations, which let you apply transformations over a sliding window of data. The figure below illustrates this sliding window.


As shown in the figure, the window slides over the source DStream; the RDDs that fall inside the window are combined and operated on to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data, sliding by 2 time units. This means that any window operation needs to specify two parameters:

  1. window length: the duration of the window (3 time units in the figure)
  2. sliding interval: the interval at which the window operation is performed (2 time units in the figure)

Both of these parameters must be multiples of the batch interval of the source DStream (for example, with a 10-second batch interval, 30 seconds and 10 seconds are valid values, while 25 seconds is not).

Let's illustrate the window operations with an example. Say you want to extend the earlier WordCount example to count the words in the last 30 seconds of data, with a DStream generated every 10 seconds. To do this, we need to apply the reduceByKey operation on the (word, 1) pairs over the last 30 seconds of data, which is exactly what the reduceByKeyAndWindow operation does (a sketch follows below). The most common window operations are covered in the rest of this article; all of them take the two parameters described above, the window length and the sliding interval.
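
A rough sketch of that example (only an outline: `lines` and the 10-second StreamingContext are assumed to be set up as in the complete programs later in this article):

// Sketch only: `lines` is assumed to come from a source such as
// ssc.socketTextStream("master", 9999) with ssc = new StreamingContext(conf, Seconds(10)).
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Count the words seen in the last 30 seconds of data, every 10 seconds; both durations
// are multiples of the 10-second batch interval, as required.
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()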


-------------------------Experimental Data----------------------------------------------------------------------

spark
Streaming
better
than
storm
you
need
it
yes
do
it

(Every second, one word is picked at random from the list above and sent as the input on the socket side.) The socket-side data simulator and the other programs used in these experiments are available in the Baidu Cloud link in the appendix; a rough sketch of such a simulator is shown below.
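
This is only a minimal sketch of what the socket-side simulator might look like (the real program is in the appendix link; the object name SocketDataSimulator is made up here):

import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

// Hypothetical sketch of a socket-side data simulator: listen on port 9999 and, once the
// Spark Streaming receiver connects, send one randomly chosen word per second.
object SocketDataSimulator {
  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "Streaming", "better", "than", "storm", "you", "need", "it", "yes", "do")
    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept()                          // wait for a receiver to connect
      val out = new PrintWriter(socket.getOutputStream, true)
      var connected = true
      while (connected) {
        out.println(words(Random.nextInt(words.length)))    // one random word ...
        connected = !out.checkError()                       // stop when the receiver disconnects
        Thread.sleep(1000)                                  // ... every second
      }
      socket.close()
    }
  }
}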

-----------------------------------------------window operation-------------------------------------------------------------------------

//Input: the window length (implicitly, the slide duration defaults to this DStream's own slide duration, i.e. the batch interval)
//Output: a new DStream containing all the elements that fall within the sliding window
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)

//Input: the window length and the slide duration
//Output: a new DStream containing all the elements that fall within the sliding window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object windowOnStreaming {
  def main(args: Array[String]) {
    /**
      * this is a test of the Streaming operations ----- window
      */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the Window operation of SparK Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc,Seconds(2))


    //set the Checkpoint directory
    ssc.checkpoint("/Res")

    //get the socket Streaming data
    val socketStreaming = ssc.socketTextStream("master",9999)

    val data = socketStreaming.map(x =>(x,1))
    //def window(windowDuration: Duration): DStream[T]
    val getedData1 = data.window(Seconds(6))
    println("windowDuration only : ")
    getedData1.print()
    //same as
    // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
    //val getedData2 = data.window(Seconds(9),Seconds(3))
    //println("Duration and SlideDuration : ")
    //getedData2.print()

    ssc.start()
    ssc.awaitTermination()
  }

}



--------------------reduceByKeyAndWindow operation--------------------------------

/** Return a new DStream by applying `reduceByKey` over a sliding window. This is similar to
 * `DStream.reduceByKey()`, but the reduce is applied only to the data inside the window.
 * Hash partitioning is done with the Spark cluster's default partitioner.
 * @param reduceFunc associative and commutative reduce function
 * @param windowDuration width of the window
 * The slide duration defaults to one batch interval, and the number of partitions to the
 * RDD default (depends on the number of cores in the Spark cluster).
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}

/** Return a new DStream by applying `reduceByKey` over a sliding window. This is similar to
 * `DStream.reduceByKey()`, but the reduce is applied only to the data inside the window.
 * Hash partitioning is done with the Spark cluster's default partitioner.
 * @param reduceFunc associative and commutative reduce function
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}


/** Return a new DStream by applying `reduceByKey` over a sliding window. This is similar to
 * `DStream.reduceByKey()`, but the reduce is applied only to the data inside the window.
 * Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
 * @param reduceFunc associative and commutative reduce function
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 * @param numPartitions  number of partitions of each RDD in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
    defaultPartitioner(numPartitions))
}

/** Return a new DStream by applying `reduceByKey` over a sliding window. This is similar to
 * `DStream.reduceByKey()`, but the reduce is applied only to the data inside the window.
 * @param reduceFunc associative and commutative reduce function
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 * @param partitioner    partitioner for controlling the partitioning of each RDD
 *                       in the new DStream
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}

/**
 * Return a new DStream by applying an incremental `reduceByKey` over a sliding window: the
 * value of each window is computed from the previous window by reducing the new values that
 * entered the window and "inverse reducing" the old values that left it.
 * Hash partitioning is done with the Spark cluster's default partitioner.
 * @param reduceFunc     associative and commutative reduce function
 * @param invReduceFunc  inverse reduce function; such that for all y, invertible x:
 *                       `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 * @param filterFunc     optional function to filter expired key-value pairs;
 *                       only pairs that satisfy the function are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}

/**
 * Return a new DStream by applying an incremental `reduceByKey` over a sliding window: the
 * value of each window is computed from the previous window by reducing the new values that
 * entered the window and "inverse reducing" the old values that left it.
 * @param reduceFunc     associative and commutative reduce function
 * @param invReduceFunc  inverse reduce function; such that for all y, invertible x:
 *                       `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 * @param partitioner    partitioner for controlling the partitioning of each RDD
 *                       in the new DStream
 * @param filterFunc     function to filter expired key-value pairs;
 *                       only pairs that satisfy the function are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {

  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}


import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}


object reduceByWindowOnStreaming {

  def main(args: Array[String]) {
    /**
      * this is a test of the Streaming operations ----- reduceByKeyAndWindow
      */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of SparK Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc,Seconds(2))

    //set the Checkpoint directory
    ssc.checkpoint("/Res")

    //get the socket Streaming data
    val socketStreaming = ssc.socketTextStream("master",9999)

    val data = socketStreaming.map(x =>(x,1))
    //def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration  ): DStream[(K, V)]
    //val getedData1 = data.reduceByKeyAndWindow(_+_,Seconds(6))

    // Note: (a, b) => a + b * 0 just returns the previous window value, so it is NOT a true
    // inverse of _ + _ and the data that slides out of the window is never subtracted; it is
    // kept only to show the signature. The correct inverse of _ + _ is _ - _, as used below.
    val getedData2 = data.reduceByKeyAndWindow(_+_,
      (a,b) => a+b*0
      ,Seconds(6),Seconds(2))

    val getedData1 = data.reduceByKeyAndWindow(_+_,_-_,Seconds(9),Seconds(6))

    println("reduceByKeyAndWindow : ")
    getedData1.print()

    ssc.start()
    ssc.awaitTermination()


  }
}


The invReduceFunc that appears here is a bit special and easy to get wrong. Looking at the internals of the ReducedWindowedDStream class in the source code makes it clear what it is for: instead of recomputing every window from scratch, the new window value is derived incrementally from the previous one by reducing the data that just entered the window with reduceFunc and removing the data that just left it with invReduceFunc. This only works if invReduceFunc really is the inverse of reduceFunc (for example `_ - _` for `_ + _`), as the sketch below illustrates:
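
A minimal illustration of the incremental idea with plain integers (this is not the actual ReducedWindowedDStream code, only the invariant it relies on):

// Illustration only: why the window can be maintained incrementally when the reduce
// function is invertible.
val reduceFunc:    (Int, Int) => Int = _ + _
val invReduceFunc: (Int, Int) => Int = _ - _

val previousWindow = 10   // reduced value of the previous window
val leavingData    = 3    // reduced value of the batches that slid out of the window
val enteringData   = 5    // reduced value of the batches that slid into the window

// new window = (previous window - data that left) + data that entered
val newWindow = reduceFunc(invReduceFunc(previousWindow, leavingData), enteringData)
// newWindow == 12, the same result as reducing the whole new window from scratch.
// If invReduceFunc is not a true inverse of reduceFunc (e.g. _ + _ again, or a function that
// just returns its first argument), the window value drifts away from the correct result.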


------------------reduceByWindow operation---------------------------

//Input: reduceFunc, the window length and the slide duration
//Output: a DStream whose value is obtained by folding all the elements in the window
//        from left to right, two at a time, through reduceFunc
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}
/**
 * Input: reduceFunc, invReduceFunc, the window length and the slide duration
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
    this.map((1, _))
        .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
        .map(_._2)
}


import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by root on 6/23/16.
  */
object reduceByWindow {
  def main(args: Array[String]) {
    /**
      * this is a test of the Streaming operations ----- reduceByWindow
      */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of SparK Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc,Seconds(2))
    //set the Checkpoint directory
    ssc.checkpoint("/Res")

    //get the socket Streaming data
    val socketStreaming = ssc.socketTextStream("master",9999)

    //val data = socketStreaming.reduceByWindow(_+_,Seconds(6),Seconds(2))
    // Note: string concatenation has no true inverse, so passing _+_ as the invReduceFunc here
    // does not remove the data that leaves the window; the four-argument form is shown only to
    // illustrate the signature. For a non-invertible reduce, prefer the commented two-duration
    // overload above.
    val data = socketStreaming.reduceByWindow(_+_,_+_,Seconds(6),Seconds(2))


    println("reduceByWindow: concatenate the elements in the window")
    data.print()


    ssc.start()
    ssc.awaitTermination()

  }
}




-----------------------------------------------countByWindow operation---------------------------------

/**
 * Input: the window length and the slide duration; returns a DStream of the number of
 * elements in each window.
 * @param windowDuration width of the window
 * @param slideDuration  sliding interval of the window
 */
def countByWindow(
    windowDuration: Duration,
    slideDuration: Duration): DStream[Long] = ssc.withScope {
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
//map each element of the windowed DStream to 1L, then add (and subtract) the 1s with reduceByWindow
 }


import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by root on 6/23/16.
  */
object countByWindow {
  def main(args: Array[String]) {

    /**
      * this is a test of the Streaming operations ----- countByWindow
      */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of SparK Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc,Seconds(2))
    //set the Checkpoint directory
    ssc.checkpoint("/Res")

    //get the socket Streaming data
    val socketStreaming = ssc.socketTextStream("master",9999)

    val data = socketStreaming.countByWindow(Seconds(6),Seconds(2))


    println("countByWindow: count the number of elements")
    data.print()


    ssc.start()
    ssc.awaitTermination()


  }
}

-------------------------------- countByValueAndWindow-------------

/**
 * Input: the window length, the slide duration and the number of RDD partitions (the
 * default number of partitions equals the default parallelism).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param numPartitions  number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
    : DStream[(T, Long)] = ssc.withScope {
  this.map((_, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L
  )
}

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by root on 6/23/16.
  */
object countByValueAndWindow {
  def main(args: Array[String]) {
    /**
      * this is a test of the Streaming operations ----- countByValueAndWindow
      */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of SparK Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc,Seconds(2))
    //set the Checkpoint directory
    ssc.checkpoint("/Res")

    //get the socket Streaming data
    val socketStreaming = ssc.socketTextStream("master",9999)

    val data = socketStreaming.countByValueAndWindow(Seconds(6),Seconds(2))


    println("countByWindow: count the number of elements")
    data.print()


    ssc.start()
    ssc.awaitTermination()
  }

}





Appendix

Link: http://pan.baidu.com/s/1slkqwBb  Password: d92r















