1、Rethinking the problem
I spent some time earlier trying to repair the failure through data recovery, without success. On reflection, I had fallen into a mental rut: assuming that any data corruption had to be fixed by recovering the data itself. But etcd is a distributed key-value store that relies on the Raft protocol to keep the cluster consistent, so when one member's data is corrupted, that member can simply rebuild its data by syncing from the other members.
Once I saw that, the fix became straightforward.
2、Repairing the data
2.1、Stop kubelet on the node with the corrupted data
[root@k8s-m2 wal]# systemctl stop kubelet
[root@k8s-m2 wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Wed 2020-04-15 21:31:31 EDT; 2s ago
Docs: http://kubernetes.io/docs/
Process: 17209 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
Process: 17195 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
Main PID: 17209 (code=exited, status=0/SUCCESS)
2.2、Stop the etcd container on the corrupted node
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eec644681fe4 0cae8d5cc64c "kube-apiserver --ad…" 6 minutes ago Up 6 minutes k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082 303ce5db0e90 "etcd --advertise-cl…" 7 minutes ago Up 7 minutes k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
# docker stop 5f75788ca082 88b1dcb2e14f
5f75788ca082
88b1dcb2e14f
2.3、Delete the etcd data on the corrupted node
Simply delete the corrupted data files here; once the member rejoins the cluster, the data is synced back automatically from the other members.
# rm -f /var/lib/etcd/member/wal/*
# rm -f /var/lib/etcd/member/snap/*
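As a precaution, it may be worth copying the member directory aside before deleting, in case the rejoin fails and the old WAL files are still needed. A minimal sketch of that back-up-then-clear step, exercised here on a scratch directory (`/tmp/etcd-demo` is a stand-in; on the real node it would be `/var/lib/etcd`, the kubeadm default):

```shell
# ETCD_DATA_DIR is a stand-in for /var/lib/etcd; the demo files are fabricated
# so the steps can be exercised safely on a scratch directory.
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/tmp/etcd-demo}"
mkdir -p "$ETCD_DATA_DIR/member/wal" "$ETCD_DATA_DIR/member/snap"
touch "$ETCD_DATA_DIR/member/wal/0000.wal" "$ETCD_DATA_DIR/member/snap/0000.snap"

# Keep a copy of the corrupted files for later inspection
cp -a "$ETCD_DATA_DIR/member" "$ETCD_DATA_DIR/member.bak"

# Clear the WAL and snapshot files so the member rejoins with an empty store
rm -f "$ETCD_DATA_DIR/member/wal/"* "$ETCD_DATA_DIR/member/snap/"*

ls "$ETCD_DATA_DIR/member/wal"   # empty after the cleanup
```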
2.4、Remove and re-add the faulty member from the first node
List the members:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
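The member ID needed for the remove step can be pulled out of the listing instead of copied by hand. A small sketch (the `member_id_for` helper is hypothetical), run here against the listing above:

```shell
# Hypothetical helper: print the member ID whose name matches the given node.
# Fields in the simple `etcdctl member list` output are comma-separated:
#   ID, status, name, peer URLs, client URLs, isLearner
member_id_for() {
  printf '%s\n' "$1" | awk -F', ' -v name="$2" '$3 == name {print $1}'
}

MEMBERS='1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false'

member_id_for "$MEMBERS" k8s-m2   # prints 1e2fb9983e528532
```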
Remove the member k8s-m2:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member remove 1e2fb9983e528532
Member 1e2fb9983e528532 removed from cluster 450f66a1edd8aab3
Re-add the faulty node k8s-m2:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member add k8s-m2 --peer-urls="https://172.0.2.146:2380"
Member 630ebadbb6f56ec1 added to cluster 450f66a1edd8aab3
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
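The ETCD_NAME/ETCD_INITIAL_* values printed by `member add` only matter if etcd is started by hand; with a kubeadm static pod, the same settings come from the flags in the pod manifest. For reference, the equivalent flag form would look like this (a sketch, not needed in this kubeadm setup):

```shell
etcd --name k8s-m2 \
  --initial-advertise-peer-urls https://172.0.2.146:2380 \
  --initial-cluster "k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380" \
  --initial-cluster-state existing
```

The key setting is `--initial-cluster-state existing`, which tells the member it is joining a running cluster rather than bootstrapping a new one.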
The newly added member is in the unstarted state:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
2.5、Start the kubelet process on the faulty node
[root@k8s-m2 home]# systemctl start kubelet
2.6、Check the status of the etcd cluster and the Kubernetes cluster
The etcd member k8s-m2 is now in the started state:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
The Kubernetes node k8s-m2 is likewise Ready:
# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-m1 Ready master 56d v1.17.0 172.0.2.139 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
k8s-m2 Ready master 56d v1.17.0 172.0.2.146 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
k8s-m3 Ready master 56d v1.17.0 172.0.2.234 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
2.7、Data synced successfully
[root@k8s-m2 ~]# ls /var/lib/etcd/member/wal/ -l
total 312512
-rw-------. 1 root root 64000272 Apr 16 14:23 0000000000000000-0000000000000000.wal
-rw-------. 1 root root 64000432 Apr 17 02:00 0000000000000001-0000000000b0c506.wal
-rw-------. 1 root root 64000440 Apr 17 13:36 0000000000000002-0000000000b3001d.wal
-rw-------. 1 root root 64000000 Apr 17 22:08 0000000000000003-0000000000b53b3a.wal
-rw-------. 1 root root 64000000 Apr 17 13:36 1.tmp
3、Summary
The procedure is essentially the same as adding a member when deploying an etcd cluster in the first place.
Taking it one step further: if a member can be repaired by syncing, why doesn't etcd perform this repair automatically?