- Part 3 Transformations

As mentioned earlier, a DStream is internally a sequence of RDDs, one per batch interval, so most RDD transformations are also supported on DStreams. Some of these operations are explained in more detail below.

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
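
As a quick end-to-end illustration of a few of these transformations, the following sketch builds a per-batch word count with flatMap, filter, map, and reduceByKey. The setup is hypothetical (local master, 1-second batches, a socket source on localhost:9999) and only meant to show how the operations chain together.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Hypothetical setup: local execution with 1-second batches.
sc = SparkContext("local[2]", "DStreamTransformations")
ssc = StreamingContext(sc, 1)

# Assumed source: lines of text arriving on a local socket.
lines = ssc.socketTextStream("localhost", 9999)

words = lines.flatMap(lambda line: line.split(" "))   # split each line into words
words = words.filter(lambda w: len(w) > 0)            # drop empty strings
pairs = words.map(lambda w: (w, 1))                   # (word, 1) pairs
wordCounts = pairs.reduceByKey(lambda a, b: a + b)    # per-batch counts

wordCounts.pprint()
ssc.start()
ssc.awaitTermination()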

UpdateStateByKey Operation

updateStateByKey is used for stateful processing: on each batch, it updates the previously saved state of a key using the values produced for that key in the current batch. For example, to keep a running count of every word since the program started, we need to maintain per-word state across batches, i.e., add each batch's word counts to the running totals. Implementing this kind of stateful processing takes two steps:
1. Define a state, which can be of any data type.
2. Define a function that specifies how to update the state from the previous state and the values of the current batch.

On every batch, Spark Streaming updates the state of all existing keys, regardless of whether the current batch produced new values for them; if the new state is None, the key is removed from the state table.

For the running word count just described, the update function can be defined as follows, where the new values are added to the running total.
def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  # add the new values with the previous running count to get the new count
This function is applied as follows, where pairs is the key-value pair DStream of each word and its count in the current batch.
runningCounts = pairs.updateStateByKey(updateFunction)
Note that the updateStateByKey operation requires a checkpoint directory to be configured for storing the state; checkpointing is covered later.
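
Putting the pieces together, a minimal stateful word count might look like the sketch below, reusing the hypothetical ssc and pairs from the sketch after the transformation table; the checkpoint directory name is an assumption for illustration.
ssc.checkpoint("checkpoint")                        # required by updateStateByKey; directory name is an assumption

def updateFunction(newValues, runningCount):
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)             # add the new values to the previous running count

runningCounts = pairs.updateStateByKey(updateFunction)
runningCounts.pprint()                              # print the running totals on every batch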


Transform Operation

transform is a highly general method that applies an arbitrary RDD-to-RDD function to a DStream, so it is the method to use when the built-in transformations do not meet your needs and you have to customize. For example, the built-in transformations do not include joining a DStream with a dataset; one use case is to join the DStream of incoming data with a dataset of spam information and then filter on the result. Example code is shown below.
spamInfoRDD = sc.pickleFile(...) # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
Note that the function passed to transform above is invoked on every batch, which gives us the opportunity to vary the function's behavior, the number of RDD partitions, and so on from batch to batch.
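
As a concrete, purely hypothetical variant of the idea above, the sketch below joins the (word, count) DStream with a small in-memory blacklist RDD and drops the blacklisted words; the blacklist contents and the filtering rule are illustrative assumptions, not the original spam-detection logic.
# Hypothetical blacklist: (word, True) pairs marking words to drop.
blacklistRDD = sc.parallelize([("spamword1", True), ("spamword2", True)])

def dropBlacklisted(rdd):
    joined = rdd.leftOuterJoin(blacklistRDD)            # (word, (count, True or None))
    kept = joined.filter(lambda kv: kv[1][1] is None)   # keep words that are not blacklisted
    return kept.map(lambda kv: (kv[0], kv[1][0]))       # back to (word, count)

cleanedCounts = wordCounts.transform(dropBlacklisted)
cleanedCounts.pprint()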


Window Operations

Window operations compute over a sliding window placed on the source DStream: the RDDs that fall inside the window (the number of RDDs is the window length) are aggregated into a single RDD of the resulting DStream. On the next step the window keeps its size and slides to the right, and the step size is the slide interval. For example, the figure below shows a window operation with a window length of 3 and a slide interval of 2.
(Figure: Spark Streaming windowed DStream, window length 3, slide interval 2)
As an application of window operations, take word count again: suppose we want to emit a result every 10 seconds that covers the word counts of the past 30 seconds. In this example, if the original DStream holds per-second word counts, then the window length is 30 seconds and the slide interval is 10 seconds.

In code, assuming pairs is a DStream of (word, 1) pairs:
# Reduce last 30 seconds of data, every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, 30, 10)
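
Both durations are given in seconds and must be multiples of the batch interval of the source DStream. The same counts can also be obtained by windowing first and reducing afterwards; a small sketch, reusing the hypothetical pairs DStream:
windowedPairs = pairs.window(30, 10)                                # 30-second window, sliding every 10 seconds
windowedWordCounts = windowedPairs.reduceByKey(lambda x, y: x + y)  # reduce within each window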

A more efficient alternative is incremental computation: instead of recomputing the whole window every time, we start from the previous window's result, add the data that has just slid into the window, and remove the data that has slid out. That is why two functions are passed in, one applied to the data entering the window and one to the data leaving it. Although efficient, whether this approach can be used depends on whether the problem itself can be computed incrementally; for the standard deviation of the data inside a window, for example, I at least cannot think of a way to derive the current window's dispersion incrementally from the previous window's value.

# Reduce last 30 seconds of data, every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)
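
Note that this inverse-reduce form requires checkpointing to be enabled on the StreamingContext beforehand, as the API documentation quoted below points out; a minimal call, with an assumed directory name:
ssc.checkpoint("checkpoint")   # must be set before ssc.start(); the directory name here is an assumption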


The signature of this function, together with its description from the API documentation:
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.

More window functions are listed in the table below; the commonly used DStream operations all have window-based versions, and every one of them takes the window length (windowLength) and the slide interval (slideInterval) as input parameters.
| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |



Join Operations

Coming soon...




