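Rate Limiting and Backpressure in Stream Processing

This article covers Spark with Kafka only; Flink is not discussed.

Summary

1. Spark Streaming supports both rate limiting (max rate) and backpressure.
2. Structured Streaming has no backpressure, only rate limiting.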

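1. Why rate limiting and backpressure are needed

A Spark cluster always has finite resources. If a batch interval receives more data than can be processed within that interval, problems such as executor OOM follow. Conversely, if a batch interval receives too little data, cluster resources sit idle while messages pile up in Kafka.

Sensible rate limiting and backpressure are therefore very important.

If the cluster is not large enough for the streaming application to process data as fast as it arrives, the receiving side can be throttled by setting a maximum rate in records per second. Spark 1.5 introduced a feature called backpressure, which removes the need to set that limit by hand: Spark Streaming works out the rate limit automatically and adjusts it dynamically as processing conditions change. It is enabled by setting the configuration parameter spark.streaming.backpressure.enabled to true.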

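2. Spark Streaming

2.1 Rate limiting

Spark Streaming can connect to Kafka in two ways.

2.1.1. Receiver-based approach

With this approach, the rate is limited through the spark.streaming.receiver.maxRate parameter, which the documentation describes as follows: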

Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate.
This parameter controls how many records each receiver accepts per second. A positive value sets the limit; 0 or a negative value means no limit.
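To make the effect of this setting concrete, the sketch below estimates the largest batch a receiver-based stream can produce. The function name and the batch interval and receiver count in the example are illustrative assumptions, not Spark APIs; only the "0 or negative means no limit" rule comes from the documentation quoted above.

```python
def max_records_per_batch(max_rate, batch_interval_s, num_receivers=1):
    """Upper bound on records in one batch for a receiver-based stream.

    max_rate mirrors spark.streaming.receiver.maxRate (records per second,
    per receiver). Returns None when the rate is unlimited.
    """
    if max_rate <= 0:
        # Per the documentation, 0 or a negative value disables the limit.
        return None
    return max_rate * batch_interval_s * num_receivers


# Hypothetical job: 2 receivers, 5-second batches, 1000 records/s per receiver.
print(max_records_per_batch(1000, 5, num_receivers=2))
```

If the cluster cannot comfortably process that many records within one batch interval, the limit is set too high.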

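2.1.2. Direct approach

With this approach, the rate is controlled through the spark.streaming.kafka.maxRatePerPartition parameter, which the documentation describes as follows: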

Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API.
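1. With the direct approach, the rate should be controlled through spark.streaming.kafka.maxRatePerPartition.
2. The value is the maximum number of records read per second from each Kafka partition.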
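Because the limit applies per partition, the size of a batch depends on how the backlog is spread across partitions. The following is a minimal sketch of that capping logic, not Spark's actual implementation. The function name and the sample numbers are assumptions made for illustration.

```python
def cap_batch_per_partition(backlog_by_partition, max_rate_per_partition,
                            batch_interval_s):
    """Return how many records one batch would read from each partition.

    backlog_by_partition maps a partition id to the number of unread records.
    max_rate_per_partition mirrors spark.streaming.kafka.maxRatePerPartition;
    None models the case where the parameter is not set.
    """
    if max_rate_per_partition is None:
        # No limit configured: the batch takes the whole backlog.
        return dict(backlog_by_partition)
    cap = int(max_rate_per_partition * batch_interval_s)
    return {p: min(lag, cap) for p, lag in backlog_by_partition.items()}


# Hypothetical topic with a heavily lagging partition 0 and a quiet partition 1.
backlog = {0: 50_000, 1: 300}
print(cap_batch_per_partition(backlog, 1000, 5))
print(cap_batch_per_partition(backlog, None, 5))
```

The second call shows why an unset limit is risky when a large backlog exists: nothing stops the whole backlog from landing in a single batch.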

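2.2 Backpressure

Backpressure was introduced in Spark 1.5. It is switched on or off by setting spark.streaming.backpressure.enabled to true or false. The documentation describes it as follows: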

Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
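1. Backpressure dynamically decides how much data each batch receives, based on the scheduling delay and processing time of recent batches.
2. Backpressure should be combined with maxRate / maxRatePerPartition, which set the upper bound on the rate that backpressure can choose.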
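The relationship between the dynamic rate and the configured maximum can be sketched as follows. This is a deliberately simplified model: it uses the observed throughput of the last batch as the estimate, whereas Spark's own estimator is more elaborate (it is PID-based and also accounts for scheduling delay). The function name and the numbers are assumptions for illustration only.

```python
def next_rate_limit(records_processed, processing_time_s, configured_max=None):
    """Pick the receiving rate (records/s) for the next batch.

    The estimate is simply the throughput of the last completed batch.
    configured_max models maxRate / maxRatePerPartition and acts as an
    upper bound when it is set to a positive value.
    """
    estimate = records_processed / processing_time_s
    if configured_max is not None and configured_max > 0:
        return min(estimate, configured_max)
    return estimate


# Last batch: 10,000 records processed in 4 seconds -> 2500 records/s observed.
print(next_rate_limit(10_000, 4.0))                        # no cap configured
print(next_rate_limit(10_000, 4.0, configured_max=2000))   # cap wins
print(next_rate_limit(1_000, 4.0, configured_max=2000))    # estimate wins
```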

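2.3 The startup problem with backpressure

As noted above, spark.streaming.backpressure.enabled should be used together with spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition, although this is not mandatory. If neither of the latter two is set and Kafka holds a large backlog, backpressure has no effect on the first batch after startup: the entire backlog is pulled into the cluster at once, and if resources are insufficient the job is overwhelmed again.

For this case there is one more parameter:

spark.streaming.backpressure.initialRate
The initial rate used when backpressure starts.
This parameter exists because of how backpressure works: it decides how much data to pull from upstream based on the processing time and delay of the previous batch. On first startup there is no previous batch to refer to, so backpressure cannot take effect.

So when spark.streaming.backpressure.enabled is turned on, it should always be paired with spark.streaming.backpressure.initialRate or with spark.streaming.receiver.maxRate / spark.streaming.kafka.maxRatePerPartition.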
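One way to enforce this pairing is to build the configuration through a small helper that refuses to enable backpressure on its own. The helper names and the rate values below are hypothetical; the configuration keys are the ones discussed in this section, and `--conf key=value` is the standard spark-submit syntax for passing them.

```python
def backpressure_conf(initial_rate=None, max_rate_per_partition=None):
    """Build Spark Streaming settings that enable backpressure safely.

    Raises ValueError if neither a starting rate nor an upper bound is given,
    because the first batch would then be unbounded.
    """
    if initial_rate is None and max_rate_per_partition is None:
        raise ValueError(
            "set initial_rate or max_rate_per_partition when enabling backpressure"
        )
    conf = {"spark.streaming.backpressure.enabled": "true"}
    if initial_rate is not None:
        conf["spark.streaming.backpressure.initialRate"] = str(initial_rate)
    if max_rate_per_partition is not None:
        conf["spark.streaming.kafka.maxRatePerPartition"] = str(max_rate_per_partition)
    return conf


def to_submit_args(conf):
    """Turn a settings dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Example values are placeholders; tune them for the actual cluster.
conf = backpressure_conf(initial_rate=500, max_rate_per_partition=1000)
print(" ".join(to_submit_args(conf)))
```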

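3. Structured Streaming

3.1 As of this writing, Structured Streaming has no backpressure mechanism. The only way to control how much data a batch reads from Kafka is the maxOffsetsPerTrigger option, which the documentation describes as follows: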

Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
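This is the maximum number of offsets processed per trigger. The total is split across the Kafka topic partitions in proportion to their volume, not evenly.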
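A proportional split of this kind can be sketched as below. This is a simplification written for illustration, not the code Spark runs; the function name, the partition labels, and the lag figures are all assumptions.

```python
def split_offsets(lag_by_partition, max_offsets_per_trigger):
    """Share a per-trigger offset budget across partitions by their lag.

    lag_by_partition maps a topic-partition label to its unread offset count.
    If the total lag fits within the budget, everything is read.
    """
    total_lag = sum(lag_by_partition.values())
    if total_lag <= max_offsets_per_trigger:
        return dict(lag_by_partition)
    # Integer arithmetic keeps the shares exact and never exceeds the budget.
    return {
        tp: max_offsets_per_trigger * lag // total_lag
        for tp, lag in lag_by_partition.items()
    }


# Hypothetical topic: partition 0 lags nine times as much as partition 1.
lag = {"events-0": 9000, "events-1": 1000}
print(split_offsets(lag, 1000))
```

Here the busier partition receives nine tenths of the budget, whereas an even split would give each partition half.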

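4. Comparison

Which is better, rate limiting or backpressure?

1. Rate limiting requires estimating resources and data volume in advance. The estimate may be inaccurate, and if it is far off the job has to be restarted.
2. A fixed rate wastes resources during off-peak hours such as the early morning, which is exactly when offline batch jobs are at their peak. During peak hours it does smooth out the load, but at the cost of delayed data, which makes it a poor choice where low latency matters.
3. Backpressure is usually the better choice, especially when combined with dynamic resource allocation, which can release a good deal of resources in the early morning.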

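5. Why doesn't (or can't) Structured Streaming use backpressure?

If backpressure is clearly better than a fixed rate limit, and Structured Streaming is in turn better than Spark Streaming, why does Structured Streaming not provide backpressure?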

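My own guess was that Structured Streaming consumes Kafka in a receiver-like way rather than a direct one, so it can cap only the total volume, not the per-partition volume, let alone apply backpressure. The Stack Overflow answer quoted below disagrees with the receiver part of that guess: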

There are not receiver-based sources in Structured Streaming, so that's totally not necessary. From another point of view, Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka.

References
https://spark.apache.org/docs/1.6.3/configuration.html#spark-streaming
https://spark.apache.org/docs/2.3.0/structured-streaming-kafka-integration.html
https://stackoverflow.com/questions/44871621/how-spark-structured-streaming-handles-backpressure
