Troubleshooting: how to rejoin an etcd node after it drops out of the cluster (code=exited, status=1/FAILURE)

1: Environment where the problem occurred

One etcd node went down because of a disk problem.

Check the cluster health from node01:

[root@node01 ~]# /k8s/etcd/bin/etcdctl --ca-file=/k8s/etcd/ssl/ca.pem --cert-file=/k8s/etcd/ssl/server.pem --key-file=/k8s/etcd/ssl/server-key.pem --endpoints="https://192.168.247.149:2379,https://192.168.247.143:2379,https://192.168.247.144:2379" cluster-health
member 8f4e6ce663f0d49a is healthy: got healthy result from https://192.168.247.143:2379
member b6230d9c6f20feeb is healthy: got healthy result from https://192.168.247.144:2379
failed to check the health of member d618618928dffeba on https://192.168.247.149:2379: Get https://192.168.247.149:2379/health: dial tcp 192.168.247.149:2379: i/o timeout
member d618618928dffeba is unreachable: [https://192.168.247.149:2379] are all unreachable
cluster is degraded
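
With more members the output above gets long, so it can help to pull out only the member IDs that are not healthy. The following is a minimal sketch, assuming the etcdctl v2 `cluster-health` output format shown above; the function name `unhealthy_members` is made up for illustration.

```shell
#!/bin/bash
# Print the IDs of members that `etcdctl cluster-health` (v2 API) does not
# report as healthy. Reads the command's output on stdin.
# Lines look like: "member <id> is healthy: ..." or "member <id> is unreachable: ..."
unhealthy_members() {
    awk '$1 == "member" && $3 == "is" && $4 != "healthy:" { print $2 }'
}

# Usage (same etcdctl flags as in the command above):
#   /k8s/etcd/bin/etcdctl ... cluster-health | unhealthy_members
```

For the output above this prints d618618928dffeba, the ID that shows up again in the startup error later on.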

Turn to the 192.168.247.149 node.
Copy etcd's configuration files, binaries, certificates, and startup script (the systemd unit) over to it:

[root@node01 ~]# scp -r /k8s root@192.168.247.149:/k8s
The authenticity of host '192.168.247.149 (192.168.247.149)' can't be established.
ECDSA key fingerprint is SHA256:QeJNZeAOre44X0uR34SeAzOr80+OZ173556h07FrT0k.
ECDSA key fingerprint is MD5:e2:4c:4c:bc:ed:a2:e0:03:2c:71:c7:4f:2c:da:32:a8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.247.149' (ECDSA) to the list of known hosts.
root@192.168.247.149's password: 
etcd                                                                                  100%  523   270.3KB/s   00:00    
etcd                                                                                  100%   18MB  77.8MB/s   00:00    
etcdctl                                                                               100%   15MB 118.9MB/s   00:00    
ca-key.pem                                                                            100% 1679     1.5MB/s   00:00    
ca.pem                                                                                100% 1265   361.4KB/s   00:00    
server-key.pem                                                                        100% 1675   936.5KB/s   00:00    
server.pem                                                                            100% 1338     1.2MB/s   00:00    
[root@node01 ~]# scp /usr/lib/systemd/system/etcd.service root@192.168.247.149:/usr/lib/systemd/system/
root@192.168.247.149's password: 
etcd.service                                                                          100%  923   3

Then adjust the parameters to this node's local values.
Starting the service fails:

[root@master1 k8s]# systemctl status kube-apiserver.service 
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2020-04-30 08:22:35 CST; 21s ago
[root@master1 etcd]# journalctl -xe
Apr 30 09:31:32 master1 etcd[51631]: member d618618928dffeba has already been bootstrapped
Apr 30 09:31:32 master1 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILUR
Apr 30 09:31:32 master1 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed

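2: The problem: member d618618928dffeba has already been bootstrapped

Roughly, this means:
One of the members was bootstrapped through the discovery service. The previous data directory must be deleted to clean up the member information; otherwise the member ignores the new configuration and keeps using the old one. That is why you see a mismatch.
With that, the cause is clear: startup fails because the information recorded in data-dir (/var/lib/etcd/default.etcd) does not match the information given by etcd's startup options.
The fix used here is to change --initial-cluster-state to existing, both in the generated configuration file and in the ExecStart line of the unit file, as done in the script below.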
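
Before changing the script, the mismatch described above can be checked directly on the failed node. The sketch below assumes the configuration file layout written by the script in this post and the usual `member/wal` subdirectory inside an etcd data-dir; the function name `check_etcd_state` is hypothetical.

```shell
#!/bin/bash
# Report the configured initial cluster state and whether the data-dir
# still holds member data (a WAL directory).
# Usage: check_etcd_state <config-file> <data-dir>
check_etcd_state() {
    local cfg=$1 data_dir=$2 state
    # Extract the value of ETCD_INITIAL_CLUSTER_STATE, with or without quotes.
    state=$(sed -n 's/^ETCD_INITIAL_CLUSTER_STATE="\{0,1\}\([a-z]*\).*/\1/p' "$cfg")
    if [ -d "$data_dir/member/wal" ]; then
        echo "state=$state data=present"
    else
        echo "state=$state data=missing"
    fi
}

# Usage:
#   check_etcd_state /k8s/etcd/cfg/etcd /var/lib/etcd/default.etcd
```

A result of `state=new data=missing` on a node whose member ID the cluster already knows matches the situation in this post: the node tries to bootstrap a new cluster while the other members consider it already bootstrapped.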

#!/bin/bash
# example: ./etcd.sh etcd01 192.168.247.149 etcd02=https://192.168.247.143:2380,etcd03=https://192.168.247.144:2380

ETCD_NAME=$1
ETCD_IP=$2
ETCD_CLUSTER=$3

WORK_DIR=/k8s/etcd

cat <<EOF >$WORK_DIR/cfg/etcd
#[Member]
ETCD_NAME="${ETCD_NAME}"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://${ETCD_IP}:2380"
ETCD_LISTEN_CLIENT_URLS="https://${ETCD_IP}:2379"

#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://${ETCD_IP}:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://${ETCD_IP}:2379"
ETCD_INITIAL_CLUSTER="${ETCD_NAME}=https://${ETCD_IP}:2380,${ETCD_CLUSTER}"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
# Changed here: join the existing cluster instead of bootstrapping a new one.
ETCD_INITIAL_CLUSTER_STATE="existing"
EOF

cat <<EOF >/usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=${WORK_DIR}/cfg/etcd
ExecStart=${WORK_DIR}/bin/etcd \
--name=\${ETCD_NAME} \
--data-dir=\${ETCD_DATA_DIR} \
--listen-peer-urls=\${ETCD_LISTEN_PEER_URLS} \
--listen-client-urls=\${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
--advertise-client-urls=\${ETCD_ADVERTISE_CLIENT_URLS} \
--initial-advertise-peer-urls=\${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster=\${ETCD_INITIAL_CLUSTER} \
--initial-cluster-token=\${ETCD_INITIAL_CLUSTER_TOKEN} \
--initial-cluster-state=existing \
--cert-file=${WORK_DIR}/ssl/server.pem \
--key-file=${WORK_DIR}/ssl/server-key.pem \
--peer-cert-file=${WORK_DIR}/ssl/server.pem \
--peer-key-file=${WORK_DIR}/ssl/server-key.pem \
--trusted-ca-file=${WORK_DIR}/ssl/ca.pem \
--peer-trusted-ca-file=${WORK_DIR}/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable etcd
systemctl restart etcd

Then run the script again:

[root@master1 etcd]# bash etcd.sh etcd01 192.168.247.149 etcd02=https://192.168.247.143:2380,etcd03=https://192.168.247.144:2380

It succeeds.
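
To confirm the result rather than rely on the service having started, the health check from the first section can be repeated until the cluster reports healthy. This is a minimal sketch, assuming the etcdctl v2 `cluster-health` command prints the line "cluster is healthy" once all members respond; `wait_healthy` is a hypothetical helper name.

```shell
#!/bin/bash
# Retry a health-check command until its output contains "cluster is healthy".
# Usage: wait_healthy <attempts> <delay_seconds> <command...>
# Returns 0 on success, 1 if the cluster never became healthy.
wait_healthy() {
    local attempts=$1 delay=$2 i out
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        # Capture stdout and stderr; ignore the command's own exit status.
        out=$("$@" 2>&1) || true
        case "$out" in
            *"cluster is healthy"*) return 0 ;;
        esac
        sleep "$delay"
    done
    return 1
}

# Usage (flags as in the first section):
#   wait_healthy 10 3 /k8s/etcd/bin/etcdctl --ca-file=... --endpoints="..." cluster-health
```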

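Two other methods turned up online.
The second method: delete the data-dir on every etcd node (the source says it also works without deleting), then restart the etcd service on each node. Each node's data-dir is then rebuilt and the failure above no longer occurs. Deleting the data-dir on every node discards all data stored in the cluster, so this only suits clusters whose contents can be recreated.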
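
Because deletion cannot be undone, a cautious variant of the second method is to rename the data-dir instead of removing it. The sketch below is one way to do that; the helper name `set_aside_data_dir` and the `.bak.<timestamp>` suffix are assumptions, not part of the original procedure.

```shell
#!/bin/bash
# Rename a data-dir instead of deleting it, so it can be restored later.
# Prints the backup path; does nothing if the directory is absent.
set_aside_data_dir() {
    local data_dir=$1 backup
    [ -d "$data_dir" ] || return 0
    backup="${data_dir}.bak.$(date +%Y%m%d%H%M%S)"
    mv "$data_dir" "$backup" && echo "$backup"
}

# Usage (assumption: etcd is stopped before the directory is moved):
#   systemctl stop etcd
#   set_aside_data_dir /var/lib/etcd/default.etcd
#   systemctl start etcd
```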

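The third method: copy the contents of another node's data-dir, use it as the basis to force-start a single-member cluster with --force-new-cluster, and then restore the cluster by adding new members.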
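
The "adding new members" step can be sketched with the etcdctl v2 `member remove` and `member add` subcommands, which is also how a member with lost data is usually replaced. This is a sketch under assumptions, not the exact commands used in this post: it reuses the paths and addresses from above, points only at the two healthy endpoints, and defaults to a dry run that prints the commands. `run`, `ctl`, and `rejoin_member` are hypothetical helper names.

```shell
#!/bin/bash
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

ETCDCTL=/k8s/etcd/bin/etcdctl
SSL=/k8s/etcd/ssl
# Assumption: run from a surviving node, talking to healthy endpoints only.
ENDPOINTS="https://192.168.247.143:2379,https://192.168.247.144:2379"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# etcdctl (v2 API) with the TLS flags used throughout this post.
ctl() {
    run "$ETCDCTL" --ca-file="$SSL/ca.pem" --cert-file="$SSL/server.pem" \
        --key-file="$SSL/server-key.pem" --endpoints="$ENDPOINTS" "$@"
}

# Usage: rejoin_member <old-member-id> <name> <peer-url>
rejoin_member() {
    local id=$1 name=$2 peer_url=$3
    ctl member remove "$id"             # drop the stale member entry
    ctl member add "$name" "$peer_url"  # register the node again under a new ID
}

# Example:
#   rejoin_member d618618928dffeba etcd01 https://192.168.247.149:2380
```

After `member add`, the re-added node would be started with an empty data-dir and --initial-cluster-state=existing, the same setting the script above applies.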
