etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired (3): repairing by re-syncing from the cluster

1. Rethinking the problem

I spent quite a bit of time earlier trying to repair this failure through data recovery, without success. On reflection, I had fallen into a mental rut: assuming that corrupted data must be fixed by restoring the data itself. In fact, etcd is an excellent distributed key-value store that uses the Raft protocol to keep replicas consistent. When one member's data is corrupted, that member can simply be rebuilt by syncing from the other members.
Once that clicked, the fix became straightforward.

2. Repairing the data

2.1 Stop kubelet on the node with corrupted data

[root@k8s-m2 wal]# systemctl stop kubelet
[root@k8s-m2 wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: inactive (dead) since Wed 2020-04-15 21:31:31 EDT; 2s ago
     Docs: http://kubernetes.io/docs/
  Process: 17209 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
  Process: 17195 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
 Main PID: 17209 (code=exited, status=0/SUCCESS)

2.2 Stop the etcd container on the corrupted node

# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
eec644681fe4        0cae8d5cc64c           "kube-apiserver --ad…"   6 minutes ago       Up 6 minutes                            k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082        303ce5db0e90           "etcd --advertise-cl…"   7 minutes ago       Up 7 minutes                            k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
# docker stop 5f75788ca082 88b1dcb2e14f
5f75788ca082
88b1dcb2e14f

2.3 Delete the corrupted etcd data

In this step, simply delete the damaged data; once the node rejoins the cluster, the data will be synced over automatically from the other members.

# rm -f /var/lib/etcd/member/wal/*
# rm -f /var/lib/etcd/member/snap/*
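If you want to keep the damaged files around for later inspection, a slightly safer variant of the wipe above is to move them aside instead of deleting them. A sketch, where the stash location is an assumption (the member paths are the kubeadm defaults used above):

```shell
# Move the damaged WAL/snapshot dirs aside instead of deleting them outright.
# STASH is a hypothetical location; adjust it to taste.
STASH="${STASH:-/tmp/etcd-damaged-$(date +%Y%m%d)}"
mkdir -p "$STASH"
# `|| true` keeps the sketch going even if a directory is missing or unreadable
mv /var/lib/etcd/member/wal  "$STASH"/ 2>/dev/null || true
mv /var/lib/etcd/member/snap "$STASH"/ 2>/dev/null || true
echo "damaged files stashed in $STASH"
```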

2.4 Remove and re-add the faulty member from the first node

Get the member list

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
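The member ID needed for the removal step can be pulled out of this output mechanically rather than copied by eye. A small sketch, with the list above inlined as sample input (in practice, pipe the `etcdctl ... member list` output straight into the `awk`):

```shell
# `member list` fields: ID, status, name, peer URLs, client URLs, is-learner.
# Sample output from above, inlined for illustration.
members='1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false'
# Print the ID (field 1) of the member whose name (field 3) is k8s-m2
id=$(printf '%s\n' "$members" | awk -F', ' '$3=="k8s-m2" {print $1}')
echo "$id"   # → 1e2fb9983e528532
```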

Remove member k8s-m2

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member remove 1e2fb9983e528532
Member 1e2fb9983e528532 removed from cluster 450f66a1edd8aab3

Re-add the faulty member k8s-m2

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member add k8s-m2 --peer-urls="https://172.0.2.146:2380"
Member 630ebadbb6f56ec1 added to cluster 450f66a1edd8aab3
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
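The `ETCD_*` variables printed by `member add` map one-to-one onto etcd startup flags. With kubeadm they are baked into the static pod manifest, but for a manually-run etcd the launch line would look roughly like this (a sketch that only prints the command, not the exact manifest):

```shell
# Values echoed by `member add` above. --initial-cluster-state=existing is the
# key one: it tells the new member to join the running cluster rather than
# bootstrap a fresh one.
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
# Print (rather than exec) the corresponding etcd command line
echo etcd --name "$ETCD_NAME" \
  --initial-cluster "$ETCD_INITIAL_CLUSTER" \
  --initial-advertise-peer-urls "$ETCD_INITIAL_ADVERTISE_PEER_URLS" \
  --initial-cluster-state "$ETCD_INITIAL_CLUSTER_STATE"
```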

The newly added member is in the unstarted state

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false

2.5 Start kubelet on the faulty node

[root@k8s-m2 home]# systemctl start kubelet

2.6 Check the etcd and Kubernetes cluster status

The k8s-m2 member of the etcd cluster is now in the started state

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
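A quick scripted sanity check on that list: every member should report `started`. Sample output inlined as before; in practice pipe the real `etcdctl member list` output in:

```shell
# Count lines whose status field (2nd) is "started"; expect 3 in a healthy
# 3-member cluster.
list='947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false'
started=$(printf '%s\n' "$list" | awk -F', ' '$2=="started"{n++} END{print n}')
echo "$started of 3 members started"   # → 3 of 3 members started
```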

The k8s-m2 node in the Kubernetes cluster is also back in the Ready state

# kubectl get nodes -o wide
NAME               STATUS     ROLES    AGE   VERSION                                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
k8s-m1             Ready      master   56d   v1.17.0                                172.0.2.139   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://19.3.6
k8s-m2             Ready      master   56d   v1.17.0                                172.0.2.146   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://19.3.6
k8s-m3             Ready      master   56d   v1.17.0                                172.0.2.234   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://19.3.6

2.7 Data synced successfully

[root@k8s-m2 ~]# ls /var/lib/etcd/member/wal/ -l
total 312512
-rw-------. 1 root root 64000272 Apr 16 14:23 0000000000000000-0000000000000000.wal
-rw-------. 1 root root 64000432 Apr 17 02:00 0000000000000001-0000000000b0c506.wal
-rw-------. 1 root root 64000440 Apr 17 13:36 0000000000000002-0000000000b3001d.wal
-rw-------. 1 root root 64000000 Apr 17 22:08 0000000000000003-0000000000b53b3a.wal
-rw-------. 1 root root 64000000 Apr 17 13:36 1.tmp

3. Summary

This procedure is essentially the same as adding a member when deploying an etcd cluster in the first place.
Taking the thought one step further: if a member can be repaired by re-syncing, why doesn't the etcd cluster do this repair automatically?
