Environment:
Spark version: 2.4.3
Kubernetes version: v1.16.2
Problem:
Submitting example.jar via spark-submit in cluster mode to the Kubernetes cluster, the driver pod fails with the following error:
19/11/06 07:06:54 INFO ExecutorPodsAllocator: Going to request 5 executors from Kubernetes.
19/11/06 07:06:54 WARN WatchConnectionManager: Exec Failure: HTTP 403, Status: 403 -
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/11/06 07:06:54 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
19/11/06 07:06:54 ERROR SparkContext: Error initializing SparkContext.
io.fabric8.kubernetes.client.KubernetesClientException:
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201)
at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:185)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Cause:
Some digging turned up the known issue "EKS security patches cause Apache Spark jobs to fail with permissions error": security patches to the Kubernetes API server break the watch connections opened by the old fabric8 kubernetes-client that Spark ships with. The same failure appears on any similarly patched Kubernetes cluster (as here, with v1.16.2), not just EKS.
Spark community patches:
https://github.com/apache/spark/pull/25641
https://github.com/apache/spark/pull/25640
Fix:
Option 1: This is fixed in spark-2.4.4 and later releases, so in a test environment you can simply switch to a fixed Spark version, or cherry-pick the relevant commits.
Option 2: The root cause lies in the jars Spark depends on, so you can replace the following three jars under spark/jars with version 4.4.2 (the version the community patches upgrade to) or later:
kubernetes-client-4.4.2.jar
kubernetes-model-4.4.2.jar
kubernetes-model-common-4.4.2.jar
The jars can be fetched from the Maven repository, e.g.:
wget https://repo1.maven.org/maven2/io/fabric8/kubernetes-model/4.4.2/kubernetes-model-4.4.2.jar
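All three downloads follow the same Maven Central URL pattern, so they can be scripted. A small sketch that builds the URLs (pipe the output to `wget -i -` to actually fetch them):

```shell
#!/bin/sh
# Build the Maven Central download URLs for the three fabric8 jars.
VER=4.4.2
BASE="https://repo1.maven.org/maven2/io/fabric8"
URLS=""
for art in kubernetes-client kubernetes-model kubernetes-model-common; do
  URLS="$URLS $BASE/$art/$VER/$art-$VER.jar"
done
# Print one URL per line (feed this to `wget -i -` to download).
for u in $URLS; do echo "$u"; done
```

Usage: `sh fetch-urls.sh | wget -qi -` downloads all three jars into the current directory.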
Addendum:
1. After replacing the jars, rebuilding and pushing the image, and resubmitting the job with spark-submit, the same error still appeared;
2. The likely cause was that the node's locally cached image had not been refreshed, so the old image was still being used;
3. Adding --conf spark.kubernetes.container.image.pullPolicy=Always to the spark-submit command forces the updated image to be pulled, which solved the problem.
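The jar replacement plus rebuild above can also be done as a thin Dockerfile layered on the existing image, so the old jars never linger in the build context. A sketch, assuming the image name from this post and the standard /opt/spark layout:

```dockerfile
# Hypothetical Dockerfile: patch the fabric8 jars in an existing Spark 2.4.3 image.
FROM merrily01/repo:spark-2.4.3-image-merrily01
# Remove the old fabric8 jars shipped with Spark (glob avoids hard-coding their version).
RUN rm /opt/spark/jars/kubernetes-client-*.jar /opt/spark/jars/kubernetes-model-*.jar
# Pull the fixed 4.4.2 jars straight from Maven Central
# (use COPY with locally downloaded jars instead if the build host is offline).
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model/4.4.2/kubernetes-model-4.4.2.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-model-common/4.4.2/kubernetes-model-common-4.4.2.jar /opt/spark/jars/
# ADD from a URL creates files with 600 permissions; open them up for the Spark user.
RUN chmod 644 /opt/spark/jars/kubernetes-*.jar
```

Build and push with the same tag (`docker build -t merrily01/repo:spark-2.4.3-image-merrily01 . && docker push merrily01/repo:spark-2.4.3-image-merrily01`), then resubmit with pullPolicy=Always as in step 3. Alternatively, replace the jars in the unpacked Spark distribution and rebuild the image with Spark's bundled bin/docker-image-tool.sh.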
At this point, the complete submit command for the official Spark-on-Kubernetes demo is:
spark-submit \
--master k8s://https://172.16.192.128:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=merrily01/repo:spark-2.4.3-image-merrily01 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar