Rancher troubleshooting: Cluster health check failed: Failed to communicate with API server

Original

2021-12-25 21:30

Problem:

Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.91.231.62:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp 172.17.0.2:443->172.17.0.1:48978: i/o timeout

The error appears intermittently and clears on its own after a few minutes.

Troubleshooting process:

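According to the Rancher architecture diagram, the request path is: user -> Rancher container (UI management interface <-> cattle-cluster-agent and cattle-node-agent) -> K8s API

Clusters managed by Rancher have two different agents deployed:
cattle-cluster-agent (primary): connects to the Kubernetes API of Kubernetes clusters launched by Rancher.
cattle-node-agent (fallback): interacts with the nodes of Kubernetes clusters launched by Rancher.

When cattle-cluster-agent is unavailable, cattle-node-agent serves as the fallback for connecting to the Kubernetes API of clusters launched by Rancher.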

1. The cluster is intermittently unhealthy, i.e. connections time out. First suspicion: a problem between kube-api and etcd:

Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.91.231.62:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp 172.17.0.2:443->172.17.0.1:59912: i/o timeout

#curl -Ik https://172.17.0.2:443/api/v1/namespaces/kube-system [accessing kube-api directly works]
HTTP/1.1 401 Unauthorized
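
The 401 above is a good sign for reachability: the server answered at the HTTP level and only rejected the missing credentials. A minimal sketch of turning that check into a repeatable probe follows. The function names `classify_api_status` and `probe_api`, the grouping of status codes, and the 5-second timeout are assumptions made for illustration, not part of the original investigation.

```shell
# classify_api_status CODE
# Interpret the HTTP status code of an unauthenticated probe (hypothetical helper).
classify_api_status() {
  case "$1" in
    200|401|403) echo "reachable" ;;    # the server answered; auth is a separate matter
    000|"")      echo "unreachable" ;;  # curl prints 000 when no HTTP response arrived
    *)           echo "unexpected" ;;
  esac
}

# probe_api URL
# Print the classification for URL. The 5 s limit is an assumed value.
probe_api() {
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true)
  classify_api_status "$code"
}
```

For example, `probe_api https://10.91.231.62:6443/api/v1/namespaces/kube-system` can be run in a loop while the error is showing, to see whether the API endpoint itself stops answering.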

#etcd [nothing abnormal; rules out a problem between kube-api and etcd]
watch chan error: etcdserver: mvcc: required revision has been compacted

#kube-api, docker logs -t --since=2021-11-17T18:10:00 --until=2021-11-17T18:58:00 kube-apiserver &> /tmp/bbb [nothing abnormal]

parsed scheme: "passthrough"
ccResolverWrapper: sending update to cc: {[{https://10.91.231.62:2379 <nil> 0 <nil>}] <nil> <nil>}

watch chan error: etcdserver: mvcc: required revision has been compacted

ccResolverWrapper: sending update to cc: {[{https://10.91.231.62:2379 <nil> 0 <nil>}] <nil> <nil>}

Error on LIST cronjobs

#kube-controller-manager [nothing abnormal]
utils.go:424] couldn't find ipfamilies for headless service: test-manager/bizcenter-report likely because controller manager is likely connected to an old apiserver that does not support ip families yet. The service endpoint slice will use dual stack families until api-server default it correctly

#rancher [all of the timeouts are Rancher's own]
error in remotedialer server [400]: read tcp 172.17.0.2:443->172.17.0.1:38644: i/o timeout

Error on LIST replicationcontrollers: Get "https://10.91.231.62:6443/api/v1/namespaces/security-scan/replicationcontrollers?timeout=45s": net/http: request canceled while waiting for connection

mvcc: finished scheduled compaction at
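
To back up the observation that the timeouts all sit on Rancher's own connections, the endpoints can be pulled out of the timeout lines and counted. This is a rough sketch: the helper name `timeout_endpoints` is hypothetical, and it assumes the log lines use the `tcp SRC:PORT->DST:PORT: i/o timeout` shape seen above. The remote port is dropped because it changes on every connection.

```shell
# timeout_endpoints
# Read log lines on stdin; for every "i/o timeout" line print
# "LOCAL_IP:PORT -> REMOTE_IP COUNT", one line per distinct pair.
timeout_endpoints() {
  grep 'i/o timeout' \
    | sed -n 's|.*tcp \([0-9.]*:[0-9]*\)->\([0-9.]*\):[0-9]*: i/o timeout.*|\1 -> \2|p' \
    | sort | uniq -c | awk '{print $2, $3, $4, $1}'
}
```

Usage could look like `docker logs CONTAINER 2>&1 | timeout_endpoints`, where CONTAINER is a placeholder for the Rancher container name. In the sample lines above, every pair is 172.17.0.2:443 and 172.17.0.1, which look like addresses on Docker's default bridge network, although that reading is an assumption.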

2. Check Rancher's own components:

#cattle-cluster-agent under system
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"

#kubectl get pods -o wide --all-namespaces [the pod at 10.42.0.7 turns out to be the one periodically checking the Rancher port]

cattle-system cattle-cluster-agent-75995f7c5f-8pwtx 1/1 Running 0 144d 10.42.0.7 a-tsy-app02-test <none>
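
Mapping an IP from a log line back to its pod can be done by filtering the wide pod listing. A small sketch, with the helper name `pod_by_ip` being hypothetical; it compares the IP against every column rather than a fixed one, on the assumption that column positions may differ between kubectl versions.

```shell
# pod_by_ip IP
# Read `kubectl get pods -o wide --all-namespaces` output on stdin and
# print "namespace/pod" for each row that contains IP as a whole field.
pod_by_ip() {
  awk -v ip="$1" '{
    for (i = 1; i <= NF; i++)
      if ($i == ip) { print $1 "/" $2; next }
  }'
}
```

Usage: `kubectl get pods -o wide --all-namespaces | pod_by_ip 10.42.0.7`.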

##cattle-cluster-agent under system: errors at around 15:30 on the 23rd, which left Rancher unable to reach kube-api

time="2021-11-23T07:32:28Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:28Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:28Z" level=error msg="Remotedialer proxy error" error="read tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:38Z" level=info msg="Connecting to wss://10.91.231.62:9090/v3/connect with token wjdm2rgmttdmqx9ltjldhqhzldlrc"
time="2021-11-23T07:32:38Z" level=info msg="Connecting to proxy" url="wss://10.91.231.62:9090/v3/connect"

##cattle-cluster-agent under system: errors at around 16:30 on the 23rd, which left Rancher unable to reach kube-api
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:08Z" level=error msg="Remotedialer proxy error" error="read tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:18Z" level=info msg="Connecting to wss://10.91.231.62:9090/v3/connect with token wjdm2rgmttdmqx9ltjldhqhzldlrc725rb9rzh"
time="2021-11-23T08:33:18Z" level=info msg="Connecting to proxy" url="wss://10.91.231.62:9090/v3/connect"
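
The two excerpts above are roughly an hour apart (07:32 and 08:33 UTC). To see whether the ping failures recur on a schedule, the error lines can be bucketed by minute. This is only a sketch: the helper name `summarize_ping_errors` is hypothetical, and it assumes the agent keeps the `time="..." level=error msg="Error writing ping"` line format shown above.

```shell
# summarize_ping_errors
# Read cattle-cluster-agent log lines on stdin and print
# "YYYY-MM-DDTHH:MM COUNT" for each minute with "Error writing ping" errors.
summarize_ping_errors() {
  grep 'msg="Error writing ping"' \
    | sed -n 's/^time="\([0-9-]*T[0-9]*:[0-9]*\):[0-9]*Z".*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}
```

Usage: `kubectl -n cattle-system logs deploy/cattle-cluster-agent | summarize_ping_errors`.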

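[This confirms the problem lies in Rancher's own components. cattle-cluster-agent appears to ping the Rancher port periodically to make sure the connection between them is healthy; when that ping fails, the error above is shown and users working through the Rancher UI and API are affected.]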

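Solution:

1. Restart cattle-cluster-agent

2. Restart the Rancher container

3. According to reports online, this may be a Rancher version issue, which can only be tested when conditions allow: https://forums.cnrancher.com/q_1248.html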
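
The first two steps could be scripted along these lines. This is a sketch under assumptions rather than the exact commands used here: the deployment name cattle-cluster-agent in namespace cattle-system is inferred from the pod listing above, the Rancher container name is a placeholder to be supplied by the caller, and the `run` wrapper with its `DRY_RUN` switch is a hypothetical convenience for previewing commands before executing them.

```shell
# run CMD...
# Execute the command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: restart cattle-cluster-agent by rolling its deployment.
restart_cluster_agent() {
  run kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
}

# Step 2: restart the Rancher server container.
# $1 is the container name or ID (placeholder; check `docker ps` for the real one).
restart_rancher_container() {
  run docker restart "$1"
}
```

Running `DRY_RUN=1 restart_cluster_agent` prints the kubectl command without touching the cluster, which is a cheap way to double-check the namespace and deployment name first.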
