etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired (2): attempting a repair from the snap

1. Stop the etcd pod

All operations in this step are performed on the failed node, k8s-m2.

1.1. Stop kubelet

[root@k8s-m2 wal]# systemctl stop kubelet
[root@k8s-m2 wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: inactive (dead) since Wed 2020-04-15 21:31:31 EDT; 2s ago
     Docs: http://kubernetes.io/docs/
  Process: 17209 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
  Process: 17195 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
 Main PID: 17209 (code=exited, status=0/SUCCESS)

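1.2. Stop the etcd container

Because etcd has been restarting continuously, stopping kubelet is enough to keep the etcd container from being started again.
If an etcd container is still present, stop it manually.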

# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
eec644681fe4        0cae8d5cc64c           "kube-apiserver --ad…"   6 minutes ago       Up 6 minutes                            k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082        303ce5db0e90           "etcd --advertise-cl…"   7 minutes ago       Up 7 minutes                            k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
# docker stop 5f75788ca082 88b1dcb2e14f
5f75788ca082
88b1dcb2e14f
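
Picking the container ID out of the listing by eye is easy to get wrong. A minimal sketch, assuming the kubelet-managed container names keep the `k8s_etcd_` prefix seen in the listing above; the function name `etcd_container_ids` is hypothetical:

```shell
# Print the ID of every container whose name starts with k8s_etcd_.
# Reads `docker ps` output on stdin; NAMES is the last column and the
# first line is the column header.
etcd_container_ids() {
    awk 'NR > 1 && $NF ~ /^k8s_etcd_/ { print $1 }'
}

# Usage on the failed node (not run here):
#   docker ps | etcd_container_ids
```

The printed IDs can then be passed to `docker stop` as in the transcript above.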

2. Repair the data from the snap

All operations in this step are performed on the failed node, k8s-m2.

2.1. Back up the data

# mkdir backup
# mv /var/lib/etcd/member backup/
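
Moving the member directory leaves only a single copy of the data. A sketch of a copy-based alternative, assuming there is enough free disk space; the function name `backup_etcd_member` and the timestamped directory layout are illustrative and not part of the original procedure:

```shell
# Copy <data_dir>/member into <backup_root>/member-<timestamp>, preserving
# ownership and modes, and print the path of the copy.
backup_etcd_member() {
    data_dir=$1
    backup_root=$2
    # Refuse to continue if the database file is missing.
    [ -f "$data_dir/member/snap/db" ] || {
        echo "no member/snap/db under $data_dir" >&2
        return 1
    }
    dest="$backup_root/member-$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_root" && cp -a "$data_dir/member" "$dest" && echo "$dest"
}

# Usage (not run here):
#   backup_etcd_member /var/lib/etcd backup
```

With this variant the later restore command would have to point at the `snap/db` file under the printed path instead of `backup/member/snap/db`.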

2.2. Restore the data from the snap

# rm -rf /var/lib/etcd
# etcdctl snapshot restore backup/member/snap/db --data-dir=/var/lib/etcd --skip-hash-check=true
{"level":"info","ts":1587002300.8804266,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"backup/member/snap/db","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
{"level":"info","ts":1587002301.0180507,"caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":8003675}
{"level":"info","ts":1587002301.028811,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1587002301.046414,"caller":"snapshot/v3_snapshot.go:300","msg":"restored snapshot","path":"backup/member/snap/db","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
# ls /var/lib/etcd/member/snap/
0000000000000001-0000000000000001.snap  db
# ls /var/lib/etcd/member/wal/
0000000000000000-0000000000000000.wal

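3. Remove the node from the etcd cluster and add it back

All operations in this step are performed on the healthy node, k8s-m1.

3.1. Remove the k8s-m2 node

Get the member list: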

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false

Remove the k8s-m2 member:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member remove 1e2fb9983e528532
Member 1e2fb9983e528532 removed from cluster 450f66a1edd8aab3
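
The member ID passed to `member remove` was copied by hand from the list. A small sketch that looks it up by name instead, assuming the default comma-separated output format shown above; `member_id_by_name` is a hypothetical helper:

```shell
# Print the ID of the member called <name>.
# Reads `etcdctl member list` output on stdin; fields are separated by ", "
# and the third field is the member name.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (not run here), with the same etcdctl flags as above:
#   etcdctl <flags> member list | member_id_by_name k8s-m2
```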

3.2. Add the k8s-m2 node back

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member add k8s-m2 --peer-urls="https://172.0.2.146:2380"
Member 630ebadbb6f56ec1 added to cluster 450f66a1edd8aab3
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

The newly added member is in the unstarted state:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
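
To check this without reading the whole list, the output can be filtered for members that have not started. A minimal sketch, again assuming the default `member list` output format; `unstarted_members` is a hypothetical helper:

```shell
# Print the ID and peer URL of each member that has not started yet.
# Reads `etcdctl member list` output on stdin; an unstarted member has an
# empty name, so the peer URL is still the fourth ", "-separated field.
unstarted_members() {
    awk -F', ' '$2 == "unstarted" { print $1, $4 }'
}

# Usage (not run here), with the same etcdctl flags as above:
#   etcdctl <flags> member list | unstarted_members
```

An empty result would mean every member has started.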

4. Start the etcd pod on the failed node

4.1. Start the kubelet process

[root@k8s-m2 home]# systemctl start kubelet

4.2. Check the etcd cluster status

The k8s-m2 member is still in the unstarted state:

[root@k8s-m1 wal]# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.234:2379 --insecure-skip-tls-verify  member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false

Querying the k8s-m2 endpoint shows that etcd on k8s-m2 has created a separate cluster of its own, with itself as the only member:

[root@k8s-m1 wal]# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.146:2379 --insecure-skip-tls-verify  member list
8e9e05c52164694d, started, k8s-m2, http://localhost:2380, https://172.0.2.146:2379, false

4.3. Check the etcd container log

Container list on the k8s-m2 node:

# docker ps -a
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                            PORTS               NAMES
5a3c10335a0d        303ce5db0e90           "etcd --advertise-cl…"   9 seconds ago       Up 9 seconds                                          k8s_etcd_etcd-k8s-m2_kube-system_c45c8fe716669e896c01df9357b80855_1
2bd2e6148d5c        78c190f736b1           "kube-scheduler --au…"   35 seconds ago      Up 35 seconds                                         k8s_kube-scheduler_kube-scheduler-k8s-m2_kube-system_ff67867321338ffd885039e188f6b424_57

etcd container log on the k8s-m2 node:

[root@k8s-m2 home]# docker logs 5a3c10335a0d
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-04-16 02:21:21.879836 I | etcdmain: etcd Version: 3.4.3
2020-04-16 02:21:21.879920 I | etcdmain: Git SHA: 3cf2f69b5
2020-04-16 02:21:21.879926 I | etcdmain: Go Version: go1.12.12
2020-04-16 02:21:21.879931 I | etcdmain: Go OS/Arch: linux/amd64
2020-04-16 02:21:21.879937 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2020-04-16 02:21:21.880006 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-04-16 02:21:21.880065 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-04-16 02:21:21.881130 I | embed: name = k8s-m2
2020-04-16 02:21:21.881141 I | embed: data dir = /var/lib/etcd
2020-04-16 02:21:21.881147 I | embed: member dir = /var/lib/etcd/member
2020-04-16 02:21:21.881152 I | embed: heartbeat = 100ms
2020-04-16 02:21:21.881157 I | embed: election = 1000ms
2020-04-16 02:21:21.881162 I | embed: snapshot count = 10000
2020-04-16 02:21:21.881174 I | embed: advertise client URLs = https://172.0.2.146:2379
2020-04-16 02:21:21.881180 I | embed: initial advertise peer URLs = https://172.0.2.146:2380
2020-04-16 02:21:21.881192 I | embed: initial cluster =
2020-04-16 02:21:21.886462 I | etcdserver: recovered store from snapshot at index 1
2020-04-16 02:21:21.897999 I | mvcc: restore compact to 8005911
2020-04-16 02:21:21.915289 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 2028
raft2020/04/16 02:21:21 INFO: 8e9e05c52164694d switched to configuration voters=(10276657743932975437)
raft2020/04/16 02:21:21 INFO: 8e9e05c52164694d became follower at term 3
raft2020/04/16 02:21:21 INFO: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 3, commit: 2028, applied: 1, lastindex: 2028, lastterm: 3]
2020-04-16 02:21:21.915719 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2020-04-16 02:21:21.924745 I | mvcc: restore compact to 8005911
2020-04-16 02:21:21.928380 W | auth: simple token is not cryptographically signed
2020-04-16 02:21:21.939722 I | etcdserver: starting server... [version: 3.4.3, cluster version: to_be_decided]
2020-04-16 02:21:21.939852 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2020-04-16 02:21:21.940415 N | etcdserver/membership: set the initial cluster version to 3.4
2020-04-16 02:21:21.940536 I | etcdserver/api: enabled capabilities for version 3.4
2020-04-16 02:21:21.943059 I | embed: ClientTLS: cert = /etc/kubernetes/pki/etcd/server.crt, key = /etc/kubernetes/pki/etcd/server.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-04-16 02:21:21.943177 I | embed: listening for peers on 172.0.2.146:2380
2020-04-16 02:21:21.943272 I | embed: listening for metrics on http://127.0.0.1:2381
2020-04-16 02:21:21.956914 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.981283 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.982367 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.991827 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.992782 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:22.004156 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:22.045179 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:22.082271 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)

The log contains cluster ID mismatch errors:

rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)

4.4. Log analysis

Further up in the log, the wanted cluster ID cdf818194e3a8c32 is the one read from the store:

etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store

The cluster ID received on port 2380, however, is 450f66a1edd8aab3. The two IDs do not match, so joining the cluster fails.
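
The two IDs can be pulled out of the log mechanically. The sketch below assumes the capnslog message format shown in the log above; `cluster_id_mismatches` is a hypothetical helper. The local ID it reports, cdf818194e3a8c32, is the same `cluster-id` that the `etcdctl snapshot restore` output printed in step 2.2, which is consistent with the documented behavior that a restore overwrites the member ID and cluster ID metadata and starts a new logical cluster.

```shell
# Read etcd log lines on stdin and print each distinct pair of cluster IDs
# reported by rafthttp mismatch errors. "got" is the ID carried by the
# incoming peer request, "want" is the ID of the local member.
cluster_id_mismatches() {
    sed -n 's/.*request cluster ID mismatch (got \([0-9a-f]*\) want \([0-9a-f]*\)).*/remote=\1 local=\2/p' | sort -u
}

# Usage on the failed node (not run here); etcd logs to stderr:
#   docker logs <etcd container id> 2>&1 | cluster_id_mismatches
```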

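5. Conclusion

The analysis above shows that the data restored from the snap is the problem: it carries cluster ID cdf818194e3a8c32 instead of the live cluster's 450f66a1edd8aab3, so the node cannot join the healthy cluster. The attempt failed.

Note: I also tried copying the etcd snap data from k8s-m1 to k8s-m2 and restoring from that; the result was the same.

The problem was eventually solved through primary-standby synchronization, covered in the next part:
etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired (3): repair through primary-standby synchronization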
