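[Spark] What levels of fault tolerance and failure retry does it have?

Well, here I am writing another post!

I've recently been reading the Spark source code (following the book 《Spark内核设计的艺术架构设计与实现》, roughly "The Art of Spark Kernel Design: Architecture Design and Implementation") and wanted to organize a few things: mostly topics I was asked about in interviews but never had a systematic picture of.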

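I. The main retry mechanisms while a job is running

1. Application-level fault tolerance

spark.yarn.maxAppAttempts

If this parameter is not set explicitly, the cluster-wide value yarn.resourcemanager.am.max-attempts is used. That value is configured in Hadoop's yarn-site.xml and defaults to 2. Note that spark.yarn.maxAppAttempts only takes effect when it is no larger than yarn.resourcemanager.am.max-attempts; a larger value is capped at the YARN limit.

In the YarnRMClient class: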
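The rule described above can be sketched in a few lines. This is a Python paraphrase for illustration, not the Scala source of YarnRMClient; the function name `effective_max_app_attempts` is hypothetical, and `yarn_max` stands in for yarn.resourcemanager.am.max-attempts.

```python
from typing import Optional


def effective_max_app_attempts(spark_max: Optional[int], yarn_max: int = 2) -> int:
    """Return the number of application attempts that would actually apply.

    spark_max: value of spark.yarn.maxAppAttempts, or None if it is not set.
    yarn_max:  value of yarn.resourcemanager.am.max-attempts (2 by default).
    """
    if spark_max is None:
        # Nothing configured on the Spark side: fall back to the YARN-wide limit.
        return yarn_max
    # A Spark value above the YARN limit cannot take effect, so it is capped.
    return min(spark_max, yarn_max)
```

With the default YARN limit of 2, setting spark.yarn.maxAppAttempts to 1 disables the second attempt, while setting it to 5 still leaves only 2 attempts.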

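2. Executor-level fault tolerance

spark.yarn.max.executor.failures

Once a certain number of executors have failed, the whole application fails.

Honestly, I searched for a long time and still couldn't find whether new executors are started to replace the dead ones. All I found was the code that cleans up the bookkeeping after an executor dies and then reschedules all of its lost tasks onto other executors.

That said, I don't think I've ever run into an application that failed because too many executors died.

In the ApplicationMaster class: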
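As a rough sketch of where the threshold comes from: the Spark on YARN documentation gives the default for spark.yarn.max.executor.failures as "numExecutors * 2, with minimum of 3". The Python below only mirrors that documented rule; the function name is hypothetical, and details such as dynamic allocation are left out.

```python
from typing import Optional


def max_executor_failures(num_executors: int, configured: Optional[int] = None) -> int:
    """Number of executor failures tolerated before the application is failed.

    configured: explicit spark.yarn.max.executor.failures, or None if unset.
    """
    if configured is not None:
        # An explicit setting always wins over the computed default.
        return configured
    # Documented default: twice the executor count, but never less than 3.
    return max(3, 2 * num_executors)
```

So a job with 10 executors would tolerate 20 executor failures by default, and a single-executor job would still tolerate 3.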

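3. Stage-level fault tolerance

spark.stage.maxConsecutiveAttempts	4	Number of consecutive stage attempts allowed before a stage is aborted.	2.2.0

When a stage fails it is retried; the parameter above sets how many consecutive attempts are allowed before the stage is aborted.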

4. Task-level fault tolerance

spark.task.maxFailures	4	Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1.	0.8.0

Retries happen at the task level: the job only fails once the same task has failed 4 times (the default), and failures of different tasks do not add up. For large jobs (those processing a lot of data), raising this to 8 is a reasonable choice.
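To make the "same task, not total failures" rule concrete, here is a small simulation. It is a sketch of the counting rule only, not Spark's scheduler code; `job_aborts` and its arguments are hypothetical names.

```python
from collections import Counter
from typing import Iterable


def job_aborts(failed_task_indexes: Iterable[int], max_failures: int = 4) -> bool:
    """Return True if any single task reaches max_failures failed attempts.

    failed_task_indexes: one entry per failed attempt, holding the task index.
    """
    failures_per_task = Counter()
    for task_index in failed_task_indexes:
        failures_per_task[task_index] += 1
        # Only the count for this particular task matters, not the grand total.
        if failures_per_task[task_index] >= max_failures:
            return True
    return False
```

Six failures spread over six different tasks leave the job alive, while four failures of the same task abort it.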

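II. Additional mechanisms that make a job more robust

5. Shuffle I/O-level fault tolerance

spark.shuffle.io.maxRetries	3	(Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues.	1.2.0

When shuffle data is being fetched, the machine on the other end may be in the middle of a GC pause and unable to respond, so shuffle fetches are retried. The wait between retries (spark.shuffle.io.retryWait) and related settings can be configured as well; I won't list them all here.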
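The retry-with-wait idea can be sketched as follows. This is a generic Python illustration, not Spark's Netty-based fetcher; `fetch_with_retries` is a hypothetical helper, and the defaults of 3 retries and a 5 second wait are meant to mirror spark.shuffle.io.maxRetries and spark.shuffle.io.retryWait.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    retry_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on IO-related errors up to max_retries times."""
    retries_used = 0
    while True:
        try:
            return fetch()
        except OSError:
            # Only IO-related failures are retried; other exceptions propagate.
            if retries_used >= max_retries:
                raise
            retries_used += 1
            # Give the remote side time to recover (e.g. finish a GC pause).
            sleep(retry_wait_s)
```

The `sleep` parameter exists only so that the wait can be stubbed out when trying the sketch; with `max_retries=0` the first IO error is raised immediately, which matches the "non-zero value" wording in the description above.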

6. RPC-level retries

spark.rpc.numRetries	3	Number of times to retry before an RPC task gives up. An RPC task will run at most times of this number.	1.4.0
spark.rpc.retry.wait	3s	Duration for an RPC ask operation to wait before retrying.	1.4.0
spark.rpc.askTimeout	spark.network.timeout	Duration for an RPC ask operation to wait before timing out.	1.4.0
spark.rpc.lookupTimeout	120s	Duration for an RPC remote endpoint lookup operation to wait before timing out.	1.4.0

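Almost all communication inside Spark goes through RPC. For example, when a task finishes, a completion message is sent to an endpoint through Spark's built-in RPC framework, and the RPC server side picks up that message and does the follow-up processing. This communication needs retries and timeouts too. If the job processes a lot of data, it is worth increasing the timeout and wait values above accordingly.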
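As an illustration of raising these values for a large job, the sketch below turns a dictionary of settings into `--conf key=value` arguments for spark-submit. The concrete numbers and the script name `my_job.py` are made up for the example and are not recommendations; `to_conf_args` is a hypothetical helper.

```python
from typing import Dict, List


def to_conf_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as a flat list of spark-submit '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; tune them for the actual workload.
rpc_settings = {
    "spark.rpc.numRetries": "5",
    "spark.rpc.retry.wait": "5s",
    "spark.rpc.askTimeout": "300s",
    "spark.rpc.lookupTimeout": "300s",
}

command = ["spark-submit", *to_conf_args(rpc_settings), "my_job.py"]
print(" ".join(command))
```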

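7. Speculative execution

I've never been very fond of speculative execution. The reason is this: when a task runs slowly and the machine itself is not at fault, the cause is almost always data skew. If the data is skewed, launching a second copy of the task gives a copy that is just as slow, so nothing is gained. The situations where speculative execution does fit are more like these:

1. Some machines are underpowered, so tasks with the same amount of data run much more slowly on them than on the others

2. Because of the data locality policy, most tasks are sent to the same few machines while the rest of the cluster sits idle

3. Some tasks hang for odd reasons

In cases like these, turning on speculative execution may solve the problem. Keep in mind, though, that the duplicate attempts take up extra resources, so speculation is bound to affect the running speed of the Spark job.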

spark.speculation	false	If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.	0.6.0
spark.speculation.interval	100ms	How often Spark will check for tasks to speculate.	0.6.0
spark.speculation.multiplier	1.5	How many times slower a task is than the median to be considered for speculation.	0.6.0
spark.speculation.quantile	0.75	Fraction of tasks which must be complete before speculation is enabled for a particular stage.	0.6.0
spark.speculation.task.duration.threshold	None	Task duration after which scheduler would try to speculative run the task. If provided, tasks would be speculatively run if current stage contains less tasks than or equal to the number of slots on a single executor and the task is taking longer time than the threshold. This config helps speculate stage with very few tasks. Regular speculation configs may also apply if the executor slots are large enough. E.g. tasks might be re-launched if there are enough successful runs even though the threshold hasn't been reached. The number of slots is computed based on the conf values of spark.executor.cores and spark.task.cpus minimum 1. Default unit is bytes, unless otherwise specified.	3.0.0
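How the quantile and multiplier settings interact can be sketched like this. It is a simplified Python model of the rule described in the table (wait until a fraction of tasks has finished, then flag running tasks that are slower than multiplier times the median), not Spark's scheduler code; all names are hypothetical.

```python
import math
from statistics import median
from typing import Dict, List


def speculation_candidates(
    finished_durations: List[float],
    running_elapsed: Dict[int, float],
    num_tasks: int,
    multiplier: float = 1.5,
    quantile: float = 0.75,
) -> List[int]:
    """Return the ids of running tasks that would be considered for speculation.

    finished_durations: durations of tasks that already succeeded in the stage.
    running_elapsed:    task id -> time the task has been running so far.
    """
    # Speculation only starts once enough tasks of the stage have finished.
    min_finished = max(math.floor(quantile * num_tasks), 1)
    if len(finished_durations) < min_finished:
        return []
    # A running task is "slow" if it exceeds multiplier x the median duration.
    threshold = multiplier * median(finished_durations)
    return [task for task, elapsed in running_elapsed.items() if elapsed > threshold]
```

For a 10-task stage where 8 tasks finished in about 10 seconds, a task that has been running for 40 seconds is flagged, while one at 12 seconds is not. With skewed data the speculative copy processes the same oversized partition, which is why it does not help in that case.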

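8. Blacklisting

The number of blacklist-related configuration parameters keeps growing, which shows that Spark is investing real effort here. In essence, some machines in a big-data cluster may be unable to complete the tasks sent to them for one reason or another (a bad disk, or CPU/IO being saturated, for example). Because of the data locality policy, however, a failed task may be sent to the very same machine again when it is retried, which makes no sense. So you can configure that once a task has failed a given number of times on a machine, that machine is added to the blacklist.

spark.blacklist.enabled	false	If set to "true", prevent Spark from scheduling tasks on executors that have been blacklisted due to too many task failures. The blacklisting algorithm can be further controlled by the other "spark.blacklist" configuration options.	2.1.0

For the other related parameters, please see the official Spark documentation, because the configuration differs between versions! (From Spark 3.1 on, for example, the spark.blacklist.* options are deprecated in favor of spark.excludeOnFailure.*.)

http://spark.apache.org/docs/latest/configuration.html

I haven't listed them all here; there are quite a few.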
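The core idea, not sending a retried task back to a machine where it keeps failing, can be sketched as follows. This is a toy model in Python, not Spark's implementation; the class name and the `max_attempts_per_host` parameter are hypothetical stand-ins for the per-task, per-node thresholds in the Spark configuration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class TaskHostExclusion:
    """Tracks failures per (task, host) and filters out hosts that failed too often."""

    def __init__(self, max_attempts_per_host: int = 2) -> None:
        self.max_attempts_per_host = max_attempts_per_host
        self._failures: Dict[Tuple[int, str], int] = defaultdict(int)

    def record_failure(self, task_id: int, host: str) -> None:
        self._failures[(task_id, host)] += 1

    def allowed_hosts(self, task_id: int, hosts: Iterable[str]) -> List[str]:
        # A host stays eligible until the task has failed on it too many times.
        return [
            host
            for host in hosts
            if self._failures.get((task_id, host), 0) < self.max_attempts_per_host
        ]
```

After task 7 has failed twice on `node-a`, a retry of task 7 is only offered `node-b` and `node-c`, even if data locality would prefer `node-a`; other tasks can still run on `node-a`.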

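9. Caching

Caching can be seen as a fault-tolerance mechanism at the code and data level.

Caching keeps the data in memory or on disk, and it also keeps the RDD's upstream and downstream dependencies.

Roughly, it works like this: as data flows into Spark, a DAG is built from the user's code, producing a series of RDDs with upstream and downstream dependencies between them, as shown in the figure:

Suppose that in Stage1 a task of step 4 (the steps are the numbers I marked in red) loses its data. Step 5 in Stage2 then cannot fetch what it needs, so the failed task of step 4 in Stage1 has to be rerun. Without caching, the data must be read again from step 1 of Stage0 (that is, the raw data on disk) and pass through the maps in steps 2 and 3 before it can flow into Stage1. When the lineage is long, this is very inefficient. That is why Spark offers cache (caching the data of a given step): the data of step 3 can be cached, so that even if the data of step 4 is lost, it only has to be recomputed from step 3 rather than from the source!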
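The recomputation argument above can be reproduced with a toy lineage chain. This is plain Python, not the RDD API; the `Step` class and the step names are hypothetical, chosen to match the numbering in the paragraph above.

```python
from typing import Callable, List, Optional


class Step:
    """One step of a lineage chain that can optionally cache its output."""

    def __init__(self, name: str, fn: Callable[[list], list],
                 parent: Optional["Step"] = None, cache: bool = False) -> None:
        self.name = name
        self.fn = fn
        self.parent = parent
        self.cache = cache
        self._cached: Optional[list] = None

    def compute(self, trace: List[str]) -> list:
        if self.cache and self._cached is not None:
            # Cached data short-circuits everything upstream of this step.
            trace.append(f"{self.name} (from cache)")
            return self._cached
        upstream = self.parent.compute(trace) if self.parent else []
        result = self.fn(upstream)
        trace.append(self.name)
        if self.cache:
            self._cached = result
        return result


def build_pipeline(cache_step3: bool) -> Step:
    step1 = Step("1:read", lambda _: [1, 2, 3, 4])
    step2 = Step("2:map", lambda xs: [x * 2 for x in xs], step1)
    step3 = Step("3:map", lambda xs: [x + 1 for x in xs], step2, cache=cache_step3)
    return Step("4:shuffle-write", lambda xs: sorted(xs, reverse=True), step3)
```

Calling `compute` twice on the pipeline built with `cache_step3=True` shows the effect: the first run executes steps 1 to 4, while the rerun of step 4 only touches the cached output of step 3. With `cache_step3=False` the rerun walks all the way back to step 1.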

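10. Checkpointing

A checkpoint is a bit like a snapshot. Frameworks that offer stream processing, such as Spark and Flink, almost always provide this mechanism.

Checkpointing is like the next step up from caching. When we cache data, it may live in memory or on the disks of a few machines. If those machines go down, the cache is gone, and the data still has to be pulled from the original source!

Checkpointing lets you write the results to a distributed storage system such as HDFS and rely on HDFS to keep the data from being lost. Even if the Spark job dies, the next run can read the data of the last checkpoint again. This mechanism is typically used in stream processing, for things like recording offsets.

Flink's checkpointing in particular is very cleverly done!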
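The "record the offset in durable storage and resume from it" idea can be sketched like this. It is a minimal Python illustration under loud assumptions: a local JSON file stands in for HDFS, and `save_checkpoint`, `load_checkpoint` and `run` are hypothetical helpers, not the Spark or Flink checkpoint API.

```python
import json
import os
import tempfile
from typing import List


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file first, then rename, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return json.load(handle)


def run(records: List[int], path: str) -> dict:
    """Sum the records, checkpointing the offset and running total after each one."""
    state = load_checkpoint(path, {"offset": 0, "total": 0})
    # Records before the saved offset were already processed by an earlier run.
    for record in records[state["offset"]:]:
        state["total"] += record
        state["offset"] += 1
        save_checkpoint(path, state)
    return state
```

If the process dies after three records and is started again with a longer input, it picks up at offset 3 instead of reprocessing from the beginning. Checkpointing after every record keeps the sketch simple; a real job would checkpoint at intervals.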

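Well, I actually started writing this post before the Dragon Boat Festival.

I don't know whether my list is complete, or whether the split into parts I and II is accurate (I grouped things by feel). If I've missed something, or if you see it differently, corrections and criticism are welcome!

Just a rookie here. Happy Dragon Boat Festival to me, and happy birthday to me (this year the Dragon Boat Festival and my solar-calendar birthday fall on the very same day!)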
