Problem:
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.91.231.62:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp 172.17.0.2:443->172.17.0.1:48978: i/o timeout
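The error appears intermittently and clears on its own after a few minutes.
Troubleshooting:
According to Rancher's architecture diagram, the request path is: user -> Rancher container (UI/management interface <-> cattle-cluster-agent and cattle-node-agent) -> K8s API
Clusters managed by Rancher run two kinds of agent:
cattle-cluster-agent (primary): connects to the Kubernetes API of the Rancher-deployed Kubernetes cluster.
cattle-node-agent (fallback): interacts with the nodes of the Rancher-deployed Kubernetes cluster.
When cattle-cluster-agent is unavailable, cattle-node-agent serves as the fallback for connecting to the Kubernetes API of the Rancher-deployed cluster.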
1. The cluster is intermittently unhealthy because connections time out, so the first suspect was the link between kube-api and etcd:
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.91.231.62:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp 172.17.0.2:443->172.17.0.1:59912: i/o timeout
#curl -Ik https://172.17.0.2:443/api/v1/namespaces/kube-system [direct access responds normally; the 401 only reflects the missing credentials]
HTTP/1.1 401 Unauthorized
#etcd [nothing abnormal when checking for problems between kube-api and etcd]
watch chan error: etcdserver: mvcc: required revision has been compacted
#kube-api, docker logs -t --since=2021-11-17T18:10:00 --until=2021-11-17T18:58:00 kube-apiserver &> /tmp/bbb [nothing abnormal]
parsed scheme: "passthrough"
ccResolverWrapper: sending update to cc: {[{https://10.91.231.62:2379 <nil> 0 <nil>}] <nil> <nil>}
watch chan error: etcdserver: mvcc: required revision has been compacted
ccResolverWrapper: sending update to cc: {[{https://10.91.231.62:2379 <nil> 0 <nil>}] <nil> <nil>}
Error on LIST cronjobs
#kube-controller-manager [nothing abnormal]
utils.go:424] couldn't find ipfamilies for headless service: test-manager/bizcenter-report likely because controller manager is likely connected to an old apiserver that does not support ip families yet. The service endpoint slice will use dual stack families until api-server default it correctly
#rancher [only Rancher's own timeouts]
error in remotedialer server [400]: read tcp 172.17.0.2:443->172.17.0.1:38644: i/o timeout
Error on LIST replicationcontrollers: Get "https://10.91.231.62:6443/api/v1/namespaces/security-scan/replicationcontrollers?timeout=45s": net/http: request canceled while waiting for connection
mvcc: finished scheduled compaction at
2. Check Rancher's own components:
#cattle-cluster-agent under system
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
#kubectl get pods -o wide --all-namespaces [the pod at 10.42.0.7 turns out to be the one periodically checking the Rancher port]
cattle-system cattle-cluster-agent-75995f7c5f-8pwtx 1/1 Running 0 144d 10.42.0.7 a-tsy-app02-test <none>
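To map an address from the agent log back to a pod, the wide pod listing can be filtered by IP. This is a minimal sketch that assumes the listing is piped in on stdin; `pod_for_ip` is a hypothetical helper name, not a kubectl feature.

```shell
# pod_for_ip IP: read `kubectl get pods -o wide --all-namespaces` output on
# stdin and print NAMESPACE/NAME for each row that has IP as a whole field.
# (pod_for_ip is an illustrative helper, not part of kubectl.)
pod_for_ip() {
    awk -v ip="$1" '{
        # Scan every column after NAMESPACE and NAME for an exact match,
        # so 10.42.0.7 does not also match 10.42.0.70.
        for (i = 3; i <= NF; i++)
            if ($i == ip) { print $1 "/" $2; break }
    }'
}

# Usage (needs cluster access):
#   kubectl get pods -o wide --all-namespaces | pod_for_ip 10.42.0.7
```

Matching on whole fields rather than a plain substring search avoids false hits on longer addresses that share the same prefix.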
##cattle-cluster-agent under system: on the 23rd it logged errors at around 15:30 (07:32Z in the log timestamps below), leaving Rancher unable to reach kube-api
time="2021-11-23T07:32:28Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:28Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:28Z" level=error msg="Remotedialer proxy error" error="read tcp 10.42.0.7:59388->10.91.231.62:9090: i/o timeout"
time="2021-11-23T07:32:38Z" level=info msg="Connecting to wss://10.91.231.62:9090/v3/connect with token wjdm2rgmttdmqx9ltjldhqhzldlrc"
time="2021-11-23T07:32:38Z" level=info msg="Connecting to proxy" url="wss://10.91.231.62:9090/v3/connect"
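##cattle-cluster-agent under system: on the 23rd it logged errors again at around 16:30 (08:33Z in the log timestamps below), leaving Rancher unable to reach kube-api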
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:08Z" level=error msg="Error writing ping" error="write tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:08Z" level=error msg="Remotedialer proxy error" error="read tcp 10.42.0.7:41622->10.91.231.62:9090: i/o timeout"
time="2021-11-23T08:33:18Z" level=info msg="Connecting to wss://10.91.231.62:9090/v3/connect with token wjdm2rgmttdmqx9ltjldhqhzldlrc725rb9rzh"
time="2021-11-23T08:33:18Z" level=info msg="Connecting to proxy" url="wss://10.91.231.62:9090/v3/connect"
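[This confirms that the problem lies in Rancher's own components. cattle-cluster-agent appears to ping the Rancher port periodically to verify that the connection between them is healthy. When the ping fails, the error above is shown and users are affected when they use the cluster through the Rancher UI and API.]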
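To see how often the agent loses its connection, the timestamps of the ping failures can be pulled out of the agent log. This is a minimal sketch that assumes the log format shown in the excerpts above; `ping_error_times` is a hypothetical helper name.

```shell
# ping_error_times: read cattle-cluster-agent log lines on stdin and print
# the distinct timestamps at which "Error writing ping" was logged.
# (ping_error_times is an illustrative helper name.)
ping_error_times() {
    sed -n 's/^time="\([^"]*\)" level=error msg="Error writing ping".*/\1/p' | sort -u
}

# Usage (needs cluster access; pod name taken from the pod listing):
#   kubectl -n cattle-system logs cattle-cluster-agent-75995f7c5f-8pwtx | ping_error_times
```

Applied to the two excerpts above, this prints 2021-11-23T07:32:28Z and 2021-11-23T08:33:08Z, which are about an hour apart.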
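Solution:
1. Restart cattle-cluster-agent
2. Restart the Rancher container
3. According to reports online, this may be a Rancher version issue, which can only be tested when conditions allow: https://forums.cnrancher.com/q_1248.html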
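The first two steps could be carried out roughly as follows. This is only a sketch under stated assumptions: the deployment name cattle-cluster-agent is inferred from the pod name in the listing, the Rancher container name is not given in these notes and must be supplied by the caller, and `run` and `restart_rancher_side` are hypothetical helper names. By default the commands are only printed.

```shell
# run: print the command when DRY_RUN is 1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restart_rancher_side CONTAINER
#   CONTAINER: name of the Rancher server container (assumption: not stated
#   in the notes, so it has to be passed in).
restart_rancher_side() {
    # Step 1: recreate the agent pod; run where kubectl can reach the cluster.
    run kubectl -n cattle-system rollout restart deployment/cattle-cluster-agent
    # Step 2: restart Rancher itself; run on the host where the container lives.
    run docker restart "$1"
}

# Usage:
#   restart_rancher_side rancher            # prints the two commands
#   DRY_RUN=0 restart_rancher_side rancher  # actually runs them
```

Printing the commands first makes it easy to review them before touching a shared cluster.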