Window Operations
Spark also provides windowed computations, which let you apply transformations over a sliding window of data. The figure below illustrates this sliding window.
As the figure shows, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In the figure, the operation is applied over the last 3 time units of data and slides by 2 time units. This means that every window operation needs to specify two parameters:
- window length: the duration of the window (3 time units in the figure).
- sliding interval: the interval at which the window operation is performed (2 time units in the figure).
Both of these parameters must be multiples of the batch interval of the source DStream.
Let us illustrate the window operations with an example. Say you want to extend the earlier WordCount example to count words over the last 30 seconds of data, every 10 seconds. To do this, we apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 30 seconds of data, which is done with the reduceByKeyAndWindow operation, as sketched below. Some common window operations follow; all of them take the two parameters described above, window length and sliding interval.
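Assuming pairs is the DStream[(String, Int)] of (word, 1) tuples from the WordCount example (an assumption for illustration), a minimal sketch of this 30-second window sliding every 10 seconds is:
// Sketch only: `pairs` is assumed to be the (word, 1) DStream from WordCount.
// Both durations must be multiples of the batch interval.
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()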
-------------------------Test data----------------------------------------------------------------------
spark
Streaming
better
than
storm
you
need
it
yes
do
it
(One word is picked at random from the list above every second and fed in as the socket input.) The socket data simulator and the test helper programs can be found via the Baidu Cloud link in the appendix.
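The simulator program itself is provided via the appendix link; as a stand-in you can type words manually with `nc -lk 9999` on the host named master, or use a minimal sketch like the following (a hypothetical substitute, not the program from the appendix):
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

// Hypothetical stand-in for the data simulator in the appendix: waits for a
// client (the Spark socket receiver) and writes one random word per second.
object SocketDataSimulator {
  def main(args: Array[String]): Unit = {
    val words = Array("spark", "Streaming", "better", "than", "storm",
      "you", "need", "it", "yes", "do", "it")
    val server = new ServerSocket(9999)
    val socket = server.accept()
    val out = new PrintWriter(socket.getOutputStream, true)
    while (true) {
      out.println(words(Random.nextInt(words.length)))
      Thread.sleep(1000)
    }
  }
}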
-----------------------------------------------window operation-------------------------------------------------------------------------
// Input: window length (implicitly, the slide duration defaults to the batch interval that forms the DStream)
// Output: returns a DStream containing all the elements in the sliding window
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)
// Input: window length and slide duration
// Output: returns a DStream containing all the elements in the sliding window
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}
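The driver below applies window() to a (word, 1) pair DStream built from socket text; the two-argument overload is shown commented out: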
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object windowOnStreaming {
  def main(args: Array[String]) {
    /**
     * This is a test of the Streaming operation: window
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the window operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.map(x => (x, 1))

    // def window(windowDuration: Duration): DStream[T]
    val getedData1 = data.window(Seconds(6))
    println("windowDuration only : ")
    getedData1.print()

    // same as
    // def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
    // (both durations must be multiples of the 2-second batch interval)
    // val getedData2 = data.window(Seconds(8), Seconds(4))
    // println("Duration and SlideDuration : ")
    // getedData2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
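With a 2-second batch interval and a 6-second window, each print shows the (word, 1) pairs received during the last three batches.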
--------------------reduceByKeyAndWindow operation--------------------------------
/** Return a new DStream by applying `reduceByKey` over a sliding window. This is
 * similar to `DStream.reduceByKey()`, except that the function is applied over the
 * sliding window; hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc reduce function, applied left to right
 * @param windowDuration window duration
 * The slide duration defaults to one batch interval,
 * and the number of partitions is the RDD default (depends on the cluster cores).
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, self.slideDuration, defaultPartitioner())
}
/** Return a new DStream by applying `reduceByKey` over a sliding window. This is
 * similar to `DStream.reduceByKey()`, except that the function is applied over the
 * sliding window; hash partitioning with the Spark cluster's default partitioner is used.
 * @param reduceFunc reduce function, applied left to right
 * @param windowDuration window duration
 * @param slideDuration slide duration
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}
/** Return a new DStream by applying `reduceByKey` over a sliding window. This is
 * similar to `DStream.reduceByKey()`, except that the function is applied over the
 * sliding window; hash partitioning is used.
 * @param reduceFunc reduce function, applied left to right
 * @param windowDuration window duration
 * @param slideDuration slide duration
 * @param numPartitions number of partitions of each RDD in the new DStream.
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration,
    defaultPartitioner(numPartitions))
}
/** Return a new DStream by applying `reduceByKey` over a sliding window. This is
 * similar to `DStream.reduceByKey()`, except that the function is applied over the
 * sliding window.
 * @param reduceFunc reduce function, applied left to right
 * @param windowDuration window duration
 * @param slideDuration slide duration
 * @param partitioner partitioner used to control the partitioning of each RDD
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}
/**
 * Return a new DStream by applying `reduceByKey` over a sliding window, additionally
 * applying invReduceFunc to the old RDDs that leave the window. Hash partitioning with
 * the Spark cluster's default partitioner is used.
 * @param reduceFunc reduce function, applied left to right
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration window duration
 * @param slideDuration slide duration
 * @param filterFunc optional function to filter key-value pairs; only pairs satisfying
 *                   the predicate are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration = self.slideDuration,
    numPartitions: Int = ssc.sc.defaultParallelism,
    filterFunc: ((K, V)) => Boolean = null
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(
    reduceFunc, invReduceFunc, windowDuration,
    slideDuration, defaultPartitioner(numPartitions), filterFunc
  )
}
/**
 * Return a new DStream by applying `reduceByKey` over a sliding window, additionally
 * applying invReduceFunc to the old RDDs that leave the window.
 * @param reduceFunc reduce function, applied left to right
 * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
 *                      `invReduceFunc(reduceFunc(x, y), x) = y`
 * @param windowDuration window duration
 * @param slideDuration slide duration
 * @param partitioner partitioner used to control the partitioning of each RDD
 * @param filterFunc optional function to filter key-value pairs; only pairs satisfying
 *                   the predicate are retained
 */
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    invReduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner,
    filterFunc: ((K, V)) => Boolean
  ): DStream[(K, V)] = ssc.withScope {
  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)
  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)
  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None
  new ReducedWindowedDStream[K, V](
    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,
    windowDuration, slideDuration, partitioner
  )
}
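The driver below exercises reduceByKeyAndWindow. Both durations passed to it must be multiples of the 2-second batch interval, or Spark will reject the window at validation time.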
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object reduceByWindowOnStreaming {
  def main(args: Array[String]) {
    /**
     * This is a test of the Streaming operation: reduceByKeyAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByKeyAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (required by the inverse-function overloads)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.map(x => (x, 1))

    // def reduceByKeyAndWindow(reduceFunc: (V, V) => V, windowDuration: Duration): DStream[(K, V)]
    // val getedData1 = data.reduceByKeyAndWindow(_ + _, Seconds(6))

    // Incorrect inverse function: (a, b) => a + b * 0 simply returns a and does not
    // undo the addition, so the windowed counts would be wrong.
    // val getedData2 = data.reduceByKeyAndWindow(_ + _, (a, b) => a + b * 0, Seconds(6), Seconds(2))

    // Both durations must be multiples of the 2-second batch interval.
    val getedData1 = data.reduceByKeyAndWindow(_ + _, _ - _, Seconds(8), Seconds(6))
    println("reduceByKeyAndWindow : ")
    getedData1.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
The invReduceFunc parameter that appears here is a little special, and a careless choice produces wrong results (the commented-out getedData2 above is one such mistake). The contract from the doc comment is that for all y and invertible x, invReduceFunc(reduceFunc(x, y), x) = y: the inverse function removes the contribution of the batches that have slid out of the window, so the new window can be computed incrementally from the previous one. The ReducedWindowedDStream class in the source implements this incremental computation.
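As a plain Scala check of that contract (a sketch, not Spark code), addition paired with subtraction satisfies it:
// invReduceFunc(reduceFunc(x, y), x) == y, checked for (+, -) on Int
val reduceFunc: (Int, Int) => Int = _ + _
val invReduceFunc: (Int, Int) => Int = _ - _
val (x, y) = (10, 3)
assert(invReduceFunc(reduceFunc(x, y), x) == y) // (10 + 3) - 10 == 3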
------------------reduceByWindow operation---------------------------
// Input: reduceFunc, window length, slide length
// Output: the elements in the window are reduced pairwise from left to right,
// each pair (a, b) being fed into reduceFunc
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}
/**
 * Input: reduceFunc, invReduceFunc, window length, slide length
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = ssc.withScope {
  this.map((1, _))
      .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
      .map(_._2)
}
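The driver below folds all lines in the window into a single string with reduceByWindow; see the comments for why the inverse-function overload is left commented out here.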
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object reduceByWindow {
  def main(args: Array[String]) {
    /**
     * This is a test of the Streaming operation: reduceByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the reduceByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)

    // Concatenate all lines seen in the window. String concatenation has no valid
    // inverse, so the overload without invReduceFunc is the correct one here;
    // passing _+_ as the inverse (commented line) would give wrong results.
    val data = socketStreaming.reduceByWindow(_ + _, Seconds(6), Seconds(2))
    // val data = socketStreaming.reduceByWindow(_ + _, _ + _, Seconds(6), Seconds(2))
    println("reduceByWindow: concatenate the lines in the window")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
-----------------------------------------------countByWindow operation---------------------------------
/**
 * Takes a window length and a slide length; returns a DStream with the number of
 * elements in each window.
 * @param windowDuration window length
 * @param slideDuration slide length
 */
def countByWindow(
    windowDuration: Duration,
    slideDuration: Duration): DStream[Long] = ssc.withScope {
  // map every element of the windowed DStream to 1, then sum with reduceByWindow,
  // using subtraction as the inverse function
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
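The driver below counts the elements received in a 6-second window, every 2 seconds; the checkpoint directory is required because countByWindow relies on the inverse-function form of reduceByWindow.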
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByWindow {
  def main(args: Array[String]) {
    /**
     * This is a test of the Streaming operation: countByWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (required: countByWindow uses an inverse function)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.countByWindow(Seconds(6), Seconds(2))
    println("countByWindow: count the number of elements")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
-------------------------------- countByValueAndWindow operation-------------
/**
 * Input: window length, slide duration, and number of RDD partitions
 * (the default equals the default parallelism).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param numPartitions  number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
  : DStream[(T, Long)] = ssc.withScope {
  this.map((_, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L // keep only pairs whose count is non-zero
  )
}
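Note the last argument: the filter (x: (T, Long)) => x._2 != 0L drops keys whose running count has fallen back to zero, so values that have left the window do not linger in the state. The driver below counts how often each distinct input line occurred in a 6-second window, every 2 seconds: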
import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by root on 6/23/16.
 */
object countByValueAndWindow {
  def main(args: Array[String]) {
    /**
     * This is a test of the Streaming operation: countByValueAndWindow
     */
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.Server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("the countByValueAndWindow operation of Spark Streaming").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2))
    // set the checkpoint directory (required: countByValueAndWindow uses an inverse function)
    ssc.checkpoint("/Res")

    // get the socket streaming data
    val socketStreaming = ssc.socketTextStream("master", 9999)
    val data = socketStreaming.countByValueAndWindow(Seconds(6), Seconds(2))
    println("countByValueAndWindow: count occurrences of each distinct element")
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
Appendix
Link: http://pan.baidu.com/s/1slkqwBb  Password: d92r