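[Spark] What levels of fault tolerance and failure retry does it have?

Well, here I am writing another post!

I have been reading the Spark source code lately (following the book 《Spark內核設計的藝術架構設計與實現》) and want to organize a few things, mostly points I was asked about in past interviews that never formed a coherent picture in my head.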

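I. The main retry mechanisms while a job is running

1. Application-level fault tolerance

spark.yarn.maxAppAttempts

If this parameter is not set explicitly, the cluster-wide value yarn.resourcemanager.am.max-attempts is used. That value defaults to 2 and is configured in Hadoop's yarn-site.xml. spark.yarn.maxAppAttempts only takes effect when it is less than or equal to yarn.resourcemanager.am.max-attempts; a larger value is capped at the YARN limit.

The relevant logic lives in the YarnRMClient class.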
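The rule described above can be sketched in a few lines. This is a plain Python paraphrase of the behaviour, not Spark's Scala source; the function name `effective_max_app_attempts` is made up for illustration, and the default of 2 is the YARN default mentioned above.

```python
from typing import Optional


def effective_max_app_attempts(spark_max: Optional[int], yarn_max: int = 2) -> int:
    """Return the attempt limit that ends up applying to the application.

    spark_max models spark.yarn.maxAppAttempts (None when it is not set),
    yarn_max models yarn.resourcemanager.am.max-attempts.
    """
    if spark_max is None:
        # Nothing configured on the Spark side: fall back to the cluster value.
        return yarn_max
    # A Spark value above the cluster limit is capped at the cluster limit.
    return min(spark_max, yarn_max)


print(effective_max_app_attempts(None))  # 2
print(effective_max_app_attempts(1))     # 1
print(effective_max_app_attempts(5))     # 2
```

In other words, raising spark.yarn.maxAppAttempts alone is not enough; the YARN-side limit has to be raised as well.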

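2. Executor-level fault tolerance

spark.yarn.max.executor.failures

Once this many executors have failed, the whole application fails. The documented default is numExecutors * 2, with a minimum of 3.

Honestly, I searched for a long time but could not find whether replacement executors are started after an executor dies. I only found the code that cleans up the dead executor's state and then reschedules all of its tasks onto other executors.

That said, I don't think I have ever run into a job that failed because too many executors died.

The relevant logic lives in the ApplicationMaster class.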
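As a rough illustration of how the threshold works, here is a small Python sketch. It only models the documented default (numExecutors * 2, minimum 3); the function names are hypothetical and this is not the ApplicationMaster code itself.

```python
from typing import Optional


def default_max_executor_failures(num_executors: int) -> int:
    """Documented default: twice the executor count, but never below 3."""
    return max(3, 2 * num_executors)


def should_fail_application(failed_executors: int,
                            num_executors: int,
                            configured_max: Optional[int] = None) -> bool:
    """True once the number of failed executors reaches the limit.

    configured_max models an explicit spark.yarn.max.executor.failures value.
    """
    limit = configured_max if configured_max is not None \
        else default_max_executor_failures(num_executors)
    return failed_executors >= limit


print(default_max_executor_failures(1))    # 3
print(default_max_executor_failures(10))   # 20
print(should_fail_application(19, 10))     # False
print(should_fail_application(20, 10))     # True
```

With 10 executors the limit is 20 failures, which may explain why this threshold is rarely hit in practice.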

3. Stage-level fault tolerance

spark.stage.maxConsecutiveAttempts	4	Number of consecutive stage attempts allowed before a stage is aborted.	2.2.0

When a stage fails it is retried, up to the limit set by the parameter above.

4. Task-level fault tolerance

spark.task.maxFailures	4	Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1.	0.8.0

Retries are counted per task: the job is only aborted once the same task has failed 4 times (the default), and failures of different tasks do not add up. For large jobs (those processing a lot of data), consider raising the value to 8.
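The "allowed retries = value - 1" rule is easy to get wrong by one, so here is a minimal Python model of it. Both `run_with_task_retries` and `flaky` are hypothetical helpers written for this post; Spark's scheduler is far more involved.

```python
def run_with_task_retries(attempt, max_failures=4):
    """Call attempt() until it succeeds or has failed max_failures times.

    max_failures models spark.task.maxFailures: with the default of 4,
    a task gets 1 initial attempt plus 3 retries.
    """
    failures = 0
    while True:
        try:
            return attempt()
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                raise RuntimeError(
                    f"task failed {failures} times, aborting the job") from exc


def flaky(fail_times):
    """Hypothetical task body that fails fail_times times, then succeeds."""
    state = {"calls": 0}

    def attempt():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ValueError("simulated task failure")
        return state["calls"]

    return attempt


print(run_with_task_retries(flaky(3)))  # succeeds on the 4th attempt, prints 4
```

A task that fails three times still succeeds on its fourth attempt; a fourth failure of the same task aborts the job.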

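II. Additional mechanisms that make a job more robust

5. Shuffle IO-level fault tolerance

spark.shuffle.io.maxRetries	3	(Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues.	1.2.0

When shuffle data is being fetched, the remote machine may be in the middle of a GC pause and unable to respond, so shuffle fetches are retried. The wait between retries can also be configured (spark.shuffle.io.retryWait); the related options are not all listed here.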
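The retry-with-wait idea looks roughly like this in Python. It is a sketch of the general pattern only: `fetch_with_retries` and `scripted` are invented names, the 5 second wait is assumed to match the documented default of spark.shuffle.io.retryWait, and the `sleep` parameter exists just so the demo does not really wait.

```python
import time


def fetch_with_retries(fetch, max_retries=3, retry_wait_s=5.0, sleep=time.sleep):
    """Call fetch(); on an IO error wait and retry, at most max_retries times."""
    retries = 0
    while True:
        try:
            return fetch()
        except IOError:
            # Only IO-related errors are retried; anything else propagates.
            if retries >= max_retries:
                raise
            retries += 1
            sleep(retry_wait_s)


def scripted(outcomes):
    """Hypothetical fetch function that replays a list of outcomes."""
    it = iter(outcomes)

    def fetch():
        outcome = next(it)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return fetch


waits = []
fetch = scripted([ConnectionError("remote in GC"), ConnectionError("timeout"), b"block"])
print(fetch_with_retries(fetch, sleep=waits.append))  # b'block'
print(waits)                                          # [5.0, 5.0]
```

With 3 retries and a 5 second wait, retrying adds at most 15 seconds per fetch before the failure is reported upward.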

6. RPC-level retries

spark.rpc.numRetries	3	Number of times to retry before an RPC task gives up. An RPC task will run at most times of this number.	1.4.0
spark.rpc.retry.wait	3s	Duration for an RPC ask operation to wait before retrying.	1.4.0
spark.rpc.askTimeout	spark.network.timeout	Duration for an RPC ask operation to wait before timing out.	1.4.0
spark.rpc.lookupTimeout	120s	Duration for an RPC remote endpoint lookup operation to wait before timing out.	1.4.0

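Almost all communication between Spark components goes over RPC. For example, when a task finishes, it uses Spark's built-in RPC framework to send a "finished" message to an endpoint, and the RPC server side picks up that message and does the follow-up processing. This communication needs retries and timeouts too. If the job processes a lot of data, the timeout and wait values above should be increased accordingly.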
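As a small how-to, the settings can be passed as `--conf key=value` pairs on the spark-submit command line. The helper below only builds the argument list; the function name and the concrete values (300s, 5s) are placeholders chosen for illustration, not recommendations.

```python
def conf_args(conf):
    """Turn a dict of Spark settings into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values for a job with large data volumes.
rpc_conf = {
    "spark.rpc.askTimeout": "300s",
    "spark.rpc.retry.wait": "5s",
}

print(" ".join(["spark-submit"] + conf_args(rpc_conf)))
```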

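7. Speculative execution

I have never been fond of speculative execution. When a task runs very slowly and the machine is not at fault, the cause is almost always data skew, and with data skew a second copy of the task will be just as slow, so nothing is gained. The scenarios where speculative execution does help are:

1. Some machines are underpowered, so tasks with the same amount of data run much slower on them than on other machines

2. Because of the data locality policy, most tasks land on the same few machines while the other machines sit idle

3. Some tasks hang for odd reasons

In cases like these you can try enabling speculative execution, but keep in mind that the duplicate tasks consume extra resources, which inevitably affects how fast the Spark job runs.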

spark.speculation	false	If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.	0.6.0
spark.speculation.interval	100ms	How often Spark will check for tasks to speculate.	0.6.0
spark.speculation.multiplier	1.5	How many times slower a task is than the median to be considered for speculation.	0.6.0
spark.speculation.quantile	0.75	Fraction of tasks which must be complete before speculation is enabled for a particular stage.	0.6.0
spark.speculation.task.duration.threshold	None	Task duration after which scheduler would try to speculative run the task. If provided, tasks would be speculatively run if current stage contains less tasks than or equal to the number of slots on a single executor and the task is taking longer time than the threshold. This config helps speculate stage with very few tasks. Regular speculation configs may also apply if the executor slots are large enough. E.g. tasks might be re-launched if there are enough successful runs even though the threshold hasn't been reached. The number of slots is computed based on the conf values of spark.executor.cores and spark.task.cpus minimum 1. Default unit is bytes, unless otherwise specified.	3.0.0
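To make the interplay of spark.speculation.quantile and spark.speculation.multiplier concrete, here is a simplified Python model. It follows the descriptions in the table above; the function names are invented, and details such as exact rounding and minimum runtimes in Spark's scheduler are deliberately not modelled.

```python
from statistics import median


def speculation_threshold(finished_durations, num_tasks,
                          quantile=0.75, multiplier=1.5):
    """Return the runtime above which a running task counts as slow.

    Returns None while fewer than `quantile` of the stage's tasks have finished.
    """
    if len(finished_durations) / num_tasks < quantile:
        return None
    return multiplier * median(finished_durations)


def should_speculate(running_for, threshold):
    """True if a task that has been running for running_for is a candidate."""
    return threshold is not None and running_for > threshold


# 8 of 10 tasks are done, durations in seconds.
done = [10, 10, 10, 10, 12, 12, 12, 12]
threshold = speculation_threshold(done, num_tasks=10)
print(threshold)                        # 16.5
print(should_speculate(30, threshold))  # True
print(should_speculate(15, threshold))  # False
```

Note that with heavy data skew the slow task exceeds the threshold too, and a second copy gets launched even though it will not finish any faster, which is the concern raised above.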

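8. Blacklisting

The number of blacklist options keeps growing, which shows that Spark has put real effort into this. The idea is simple: some machines in a big data cluster may be unable to complete the tasks sent to them for some reason (a bad disk, saturated CPU or IO). Because of the data locality policy, a failed task may be sent to the very same machine when it is retried, which makes no sense. So you can configure how many times a task may fail on a machine before that machine is added to the blacklist.

spark.blacklist.enabled	false	If set to "true", prevent Spark from scheduling tasks on executors that have been blacklisted due to too many task failures. The blacklisting algorithm can be further controlled by the other "spark.blacklist" configuration options.	2.1.0

For the other related options, see the official Spark documentation, because they differ between versions (from Spark 3.1.0 onward these options are named spark.excludeOnFailure.*)!

http://spark.apache.org/docs/latest/configuration.html

I have not listed them all here; there are quite a few.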
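The core bookkeeping behind this can be modelled in a few lines of Python. This is a toy model, not Spark's implementation: the class name, the method names, and the threshold of 2 failures per task and host are all assumptions made for the example.

```python
from collections import defaultdict


class NodeExclusionTracker:
    """Toy model: stop sending a task to a host where it keeps failing."""

    def __init__(self, max_failures_per_host=2):
        self.max_failures_per_host = max_failures_per_host
        self.failures = defaultdict(int)  # (task_id, host) -> failure count

    def record_failure(self, task_id, host):
        self.failures[(task_id, host)] += 1

    def is_excluded(self, task_id, host):
        return self.failures[(task_id, host)] >= self.max_failures_per_host

    def pick_host(self, task_id, preferred_hosts):
        """Return the first preferred host that is not excluded, else None."""
        for host in preferred_hosts:
            if not self.is_excluded(task_id, host):
                return host
        return None


tracker = NodeExclusionTracker()
hosts = ["node-1", "node-2"]  # locality order: node-1 holds the data
print(tracker.pick_host("task-7", hosts))  # node-1
tracker.record_failure("task-7", "node-1")
tracker.record_failure("task-7", "node-1")
print(tracker.pick_host("task-7", hosts))  # node-2
```

Without the tracker, locality would keep sending task-7 back to node-1 until the task's failure limit aborted the job.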

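9. Caching

Caching is a fault tolerance mechanism at the code and data level.

A cached RDD keeps its data in memory or on disk, and it still keeps its upstream and downstream lineage.

Roughly, it works like this: as data flows into Spark, Spark builds a DAG from the user's code. This produces a series of RDDs with upstream and downstream dependencies between them, as in the figure:

Suppose that in Stage1 a task in step 4 (the steps are marked with red numbers) hits a data loss error. Step 5 in Stage2 then cannot fetch its input, so the failed task of step 4 in Stage1 has to be rerun. Without caching, the data must be read again from step 1 in Stage0 (that is, the raw data on disk) and pass through the maps in steps 2 and 3 before it reaches Stage1. With a long pipeline this is very inefficient. That is why Spark offers cache (keeping the data of a given step around): if the data of step 3 is cached, then even when step 4 loses data, it only needs to be recomputed from step 3 and not from the source!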
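The effect can be shown with a tiny lineage model in plain Python. It is only an analogy for the steps described above: `Dataset` and `build_pipeline` are invented for this example and have nothing to do with Spark's RDD classes.

```python
class Dataset:
    """Toy lineage node: knows its parent and how to compute itself from it."""

    def __init__(self, name, compute, parent=None):
        self.name = name
        self.compute = compute
        self.parent = parent
        self.evaluations = 0
        self._use_cache = False
        self._has_value = False
        self._value = None

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._has_value:
            return self._value  # served from the cache, lineage not replayed
        self.evaluations += 1
        parent_data = self.parent.collect() if self.parent else None
        result = self.compute(parent_data)
        if self._use_cache:
            self._value, self._has_value = result, True
        return result


def build_pipeline(cache_step3):
    step1 = Dataset("1: read source", lambda _: [1, 2, 3])
    step2 = Dataset("2: map", lambda xs: [x + 1 for x in xs], step1)
    step3 = Dataset("3: map", lambda xs: [x * 2 for x in xs], step2)
    if cache_step3:
        step3.cache()
    step4 = Dataset("4: aggregate", lambda xs: sum(xs), step3)
    return step1, step4


for cached in (False, True):
    source, step4 = build_pipeline(cached)
    step4.collect()  # first run
    step4.collect()  # rerun of step 4 after its output was lost
    print(cached, source.evaluations)  # False 2, then True 1
```

Without the cache the source is read twice; with step 3 cached, the rerun of step 4 never goes back to the source.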

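10. Checkpointing

A checkpoint is a bit like a snapshot. Frameworks that offer stream processing, such as Spark and Flink, almost always provide this mechanism.

Checkpointing can be seen as the next step up from caching. Cached data lives in memory or on the local disks of some machines; if those machines die, the cache is gone and the data has to be pulled from the original source all over again!

Checkpointing writes the results to a distributed storage system such as HDFS and relies on HDFS to keep the data from being lost. Even if the Spark job itself dies, the next run can read the data from the last checkpoint. The mechanism is typically used in stream processing, for example to record offsets.

Flink's checkpointing in particular is done very cleverly!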
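The offset-recording idea can be illustrated without Spark at all. In the Python sketch below a local JSON file stands in for HDFS, and all function names are made up; it only demonstrates "persist progress durably, resume from it after a crash".

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a reader never sees a half-written checkpoint


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"offset": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def process(records, path, crash_at=None):
    """Sum the records, checkpointing the offset and running total each step."""
    state = load_checkpoint(path)
    for i in range(state["offset"], len(records)):
        if i == crash_at:
            raise RuntimeError("simulated crash")
        state = {"offset": i + 1, "total": state["total"] + records[i]}
        save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "checkpoint.json")
    data = [1, 2, 3, 4, 5]
    try:
        process(data, ckpt, crash_at=3)
    except RuntimeError:
        pass
    print(load_checkpoint(ckpt))  # {'offset': 3, 'total': 6}
    print(process(data, ckpt))    # 15, resumed from offset 3
```

The restarted run picks up at offset 3 and does not reprocess the first three records, because the progress was stored outside the process that crashed.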

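Well, I actually started writing this post before the Dragon Boat Festival.

I am not sure whether my list is complete, or whether the split between parts I and II is accurate (I grouped them by gut feeling). If I missed something or you see it differently, corrections and criticism are welcome!

Just a rookie here, wishing myself a happy Dragon Boat Festival and a happy birthday (this year the festival and my solar-calendar birthday fall on the same day!)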
