Rancher troubleshooting: Cluster health check failed: Failed to communicate with API server

Problem:

Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.91.231.62:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp 172.17.0.2:443->172.17.0.1:48978: i/o timeout

The error appears intermittently and clears by itself after a few minutes.

 

Troubleshooting process:

 

According to the Rancher architecture diagram, the request path is: user -> Rancher container (management UI <-> cattle-cluster-agent and cattle-node-agent) -> K8s API

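Clusters managed by Rancher run two different agents:
cattle-cluster-agent (primary): connects to the Kubernetes API of the Kubernetes cluster deployed by Rancher.
cattle-node-agent (fallback): interacts with the nodes of the Kubernetes cluster deployed by Rancher.

When cattle-cluster-agent is unavailable, cattle-node-agent acts as the fallback for connecting to the Kubernetes API of the Rancher-deployed cluster.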

1. The cluster is intermittently unhealthy, i.e. connections time out, so the first suspect was a problem between kube-api and etcd:

Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.91.231.62:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp 172.17.0.2:443->172.17.0.1:59912: i/o timeout

 

# curl -Ik https://172.17.0.2:443/api/v1/namespaces/kube-system  [accessing kube-api directly works normally]
HTTP/1.1 401 Unauthorized
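
A 401 here means the HTTPS endpoint answered and only rejected the unauthenticated request, so the check counts as "reachable". A minimal sketch of that interpretation follows; the function name and the category labels are illustrative assumptions, not Rancher or Kubernetes terminology.

```shell
#!/usr/bin/env bash
# Classify the first status line printed by `curl -Ik <url>`.
# The labels echoed below are made up for this sketch.
classify_status_line() {
  local code
  # The status code is the second whitespace-separated field,
  # e.g. "HTTP/1.1 401 Unauthorized". Strip the trailing CR that curl prints.
  code=$(printf '%s\n' "$1" | tr -d '\r' | awk '{print $2}')
  case "$code" in
    2??)     echo "reachable" ;;
    401|403) echo "reachable-auth-required" ;;
    5??)     echo "server-error" ;;
    "")      echo "no-response" ;;
    *)       echo "other" ;;
  esac
}

# Example against the endpoint used above (uncomment to run):
# classify_status_line "$(curl -Iks --max-time 10 https://172.17.0.2:443/api/v1/namespaces/kube-system | head -n 1)"
```

An empty status line (curl timed out or could not connect) maps to "no-response", which is the case that would match the i/o timeout seen in the health check.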

 

# etcd [nothing abnormal; checked for a problem between kube-api and etcd]
watch chan error: etcdserver: mvcc: required revision has been compacted

 

# kube-api: docker logs -t --since=2021-11-17T18:10:00 --until=2021-11-17T18:58:00 kube-apiserver &> /tmp/bbb  [nothing abnormal]

parsed scheme: "passthrough"
ccResolverWrapper: sending update to cc: {[{https://10.91.231.62:2379  <nil> 0 <nil>}] <nil> <nil>}

watch chan error: etcdserver: mvcc: required revision has been compacted

ccResolverWrapper: sending update to cc: {[{https://10.91.231.62:2379  <nil> 0 <nil>}] <nil> <nil>}

Error on LIST cronjobs


# kube-controller-manager [nothing abnormal]
utils.go:424] couldn't find ipfamilies for headless service: test-manager/bizcenter-report likely because controller manager is likely connected to an old apiserver that does not support ip families yet. The service endpoint slice will use dual stack families until api-server default it correctly

 

# rancher [all of the timeouts are Rancher's own]
error in remotedialer server [400]: read tcp 172.17.0.2:443->172.17.0.1:38644: i/o timeout

Error on LIST replicationcontrollers: Get "https://10.91.231.62:6443/api/v1/namespaces/security-scan/replicationcontrollers?timeout=45s": net/http: request canceled while waiting for connection 

mvcc: finished scheduled compaction at

 

2. Check Rancher's own components:

# cattle-cluster-agent under System
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"

# kubectl get pods -o wide --all-namespaces  [shows that the pod at 10.42.0.7 is the one periodically checking the Rancher port]

cattle-system  cattle-cluster-agent-75995f7c5f-8pwtx  1/1  Running     0          144d    10.42.0.7      a-tsy-app02-test   <none> 
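
With many pods, scanning the full listing for one IP by eye is tedious. As a rough sketch, the filter below prints the namespace and pod name of any row that contains the given IP as a whole column; `pod_for_ip` is a hypothetical helper name, and it assumes the `--all-namespaces` layout where the first two columns are namespace and pod name.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods -o wide --all-namespaces` output on stdin and print
# "namespace/pod" for every row that has the given IP as an exact column value.
pod_for_ip() {
  awk -v ip="$1" '{
    for (i = 1; i <= NF; i++) {
      if ($i == ip) { print $1 "/" $2; break }
    }
  }'
}

# Example (uncomment to run against a live cluster):
# kubectl get pods -o wide --all-namespaces | pod_for_ip 10.42.0.7
```

Matching on a whole column rather than a substring avoids false hits such as 10.42.0.70 when searching for 10.42.0.7.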


## cattle-cluster-agent under System: on the 23rd, errors at about 15:30 left Rancher unable to reach kube-api

time="2021-11-23T07:32:28Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:28Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:28Z" level=error msg="Remotedialer proxy error" error="read tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:38Z" level=info msg="Connecting to wss://10.91.231.62:9090/v3/connect with token wjdm2rgmttdmqx9ltjldhqhzldlrc"
time="2021-11-23T07:32:38Z" level=info msg="Connecting to proxy" url="wss://10.91.231.62:9090/v3/connect"


## cattle-cluster-agent under System: on the 23rd, errors at about 16:30 left Rancher unable to reach kube-api
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:08Z" level=error msg="Remotedialer proxy error" error="read tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:18Z" level=info msg="Connecting to wss://10.91.231.62:9090/v3/connect with token wjdm2rgmttdmqx9ltjldhqhzldlrc725rb9rzh"
time="2021-11-23T08:33:18Z" level=info msg="Connecting to proxy" url="wss://10.91.231.62:9090/v3/connect"
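
The two excerpts above fail at 07:32:28Z and 08:33:08Z, roughly an hour apart. To see whether the failures recur at a regular interval over a longer log, one option is to list every distinct timestamp of the "Error writing ping" message. This is only a sketch: `ping_error_times` is a hypothetical helper, and it assumes the log lines keep the `time="..." level=... msg="..."` layout shown above.

```shell
#!/usr/bin/env bash
# Read cattle-cluster-agent log lines on stdin and print each distinct
# timestamp at which "Error writing ping" was logged, in sorted order.
ping_error_times() {
  grep 'msg="Error writing ping"' \
    | sed -n 's/^time="\([^"]*\)".*/\1/p' \
    | sort -u
}

# Example (uncomment to run; the pod name is the one from the listing above):
# kubectl -n cattle-system logs cattle-cluster-agent-75995f7c5f-8pwtx | ping_error_times
```

Because the timestamps are ISO 8601 in UTC, a plain lexical sort also orders them chronologically.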

 

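[Conclusion: the problem lies in Rancher's own components. cattle-cluster-agent appears to connect to the Rancher port periodically to confirm that communication between the two is healthy. When that connection has trouble, the error above is shown, and users working through the Rancher UI and API are affected.]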

 

Solutions:

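1. Restart cattle-cluster-agent

2. Restart the Rancher container

3. According to reports online, this may be a problem with the Rancher version, which can only be tested when conditions allow: https://forums.cnrancher.com/q_1248.html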
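
The first two steps could be scripted roughly as follows. This is a sketch under assumptions, not a verified procedure: it assumes the agent runs as a Deployment named cattle-cluster-agent in the cattle-system namespace (which the pod name and namespace in the listing above suggest), that kubectl is new enough to support `rollout restart`, and that the Rancher container is called `rancher`, which is a placeholder to replace with the real name from `docker ps`. By default the helper only prints the commands.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical container name; look up the real one with `docker ps`.
RANCHER_CONTAINER="${RANCHER_CONTAINER:-rancher}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: restart the agent by rolling its Deployment (assumed name and namespace).
restart_cluster_agent() {
  run kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
}

# Step 2: restart the Rancher server container on the Docker host.
restart_rancher() {
  run docker restart "$RANCHER_CONTAINER"
}
```

Call `restart_cluster_agent` first and check whether the health check recovers before calling `restart_rancher`, since restarting the Rancher container interrupts the UI for all users. Set DRY_RUN=0 to actually execute the commands.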
