一:問題出現環境
有一個etcd節點因爲磁盤問題當掉
在node1節點查看健康狀態
[root@node01 ~]# /k8s/etcd/bin/etcdctl --ca-file=/k8s/etcd/ssl/ca.pem --cert-file=/k8s/etcd/ssl/server.pem --key-file=/k8s/etcd/ssl/server-key.pem --endpoints="https://192.168.247.149:2379,https://192.168.247.143:2379,https://192.168.247.144:2379" cluster-health
member 8f4e6ce663f0d49a is healthy: got healthy result from https://192.168.247.143:2379
member b6230d9c6f20feeb is healthy: got healthy result from https://192.168.247.144:2379
failed to check the health of member d618618928dffeba on https://192.168.247.149:2379: Get https://192.168.247.149:2379/health: dial tcp 192.168.247.149:2379: i/o timeout
member d618618928dffeba is unreachable: [https://192.168.247.149:2379] are all unreachable
cluster is degraded
切換到192.168.247.149節點
將etcd的相關配置文件、命令腳本、證書、啓動腳本複製過去
[root@node01 ~]# scp -r /k8s [email protected]:/k8s
The authenticity of host '192.168.247.149 (192.168.247.149)' can't be established.
ECDSA key fingerprint is SHA256:QeJNZeAOre44X0uR34SeAzOr80+OZ173556h07FrT0k.
ECDSA key fingerprint is MD5:e2:4c:4c:bc:ed:a2:e0:03:2c:71:c7:4f:2c:da:32:a8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.247.149' (ECDSA) to the list of known hosts.
[email protected]'s password:
etcd 100% 523 270.3KB/s 00:00
etcd 100% 18MB 77.8MB/s 00:00
etcdctl 100% 15MB 118.9MB/s 00:00
ca-key.pem 100% 1679 1.5MB/s 00:00
ca.pem 100% 1265 361.4KB/s 00:00
server-key.pem 100% 1675 936.5KB/s 00:00
server.pem 100% 1338 1.2MB/s 00:00
[root@node01 ~]# scp /usr/lib/systemd/system/etcd.service [email protected]:/usr/lib/systemd/system/
[email protected]'s password:
etcd.service 100% 923 3
然後修改成本地參數
開啓服務發現失敗
[root@master1 k8s]# systemctl status kube-apiserver.service
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2020-04-30 08:22:35 CST; 21s ago
[root@master1 etcd]# journalctl -xe
Apr 30 09:31:32 master1 etcd[51631]: member d618618928dffeba has already been bootstrapped
Apr 30 09:31:32 master1 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILUR
Apr 30 09:31:32 master1 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
二:問題:member d618618928dffeba has already been bootstrapped
大概意思:
其中一個成員是通過discovery service引導的。必須刪除以前的數據目錄來清理成員信息。否則成員將忽略新配置,使用舊配置。這就是爲什麼你看到了不匹配。
看到了這裏,問題所在也就很明確了,啓動失敗的原因在於data-dir (/var/lib/etcd/default.etcd)中記錄的信息與 etcd啓動的選項所標識的信息不太匹配造成的。
這裏用的解決辦法時把配置參數中–initial-cluster-state改爲existing
#!/bin/bash
# example: ./etcd.sh etcd01 192.168.247.149 etcd02=https://192.168.247.143:2380,etcd03=https://192.168.247.144:2380
ETCD_NAME=$1
ETCD_IP=$2
ETCD_CLUSTER=$3
WORK_DIR=/k8s/etcd
cat <<EOF >$WORK_DIR/cfg/etcd
#[Member]
ETCD_NAME="${ETCD_NAME}"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://${ETCD_IP}:2380"
ETCD_LISTEN_CLIENT_URLS="https://${ETCD_IP}:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://${ETCD_IP}:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://${ETCD_IP}:2379"
ETCD_INITIAL_CLUSTER="etcd01=https://${ETCD_IP}:2380,${ETCD_CLUSTER}"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing" #此處修改
EOF
cat <<EOF >/usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=${WORK_DIR}/cfg/etcd
ExecStart=${WORK_DIR}/bin/etcd \
--name=\${ETCD_NAME} \
--data-dir=\${ETCD_DATA_DIR} \
--listen-peer-urls=\${ETCD_LISTEN_PEER_URLS} \
--listen-client-urls=\${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
--advertise-client-urls=\${ETCD_ADVERTISE_CLIENT_URLS} \
--initial-advertise-peer-urls=\${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster=\${ETCD_INITIAL_CLUSTER} \
--initial-cluster-token=\${ETCD_INITIAL_CLUSTER_TOKEN} \
--initial-cluster-state=existing \ #此處修改
--cert-file=${WORK_DIR}/ssl/server.pem \
--key-file=${WORK_DIR}/ssl/server-key.pem \
--peer-cert-file=${WORK_DIR}/ssl/server.pem \
--peer-key-file=${WORK_DIR}/ssl/server-key.pem \
--trusted-ca-file=${WORK_DIR}/ssl/ca.pem \
--peer-trusted-ca-file=${WORK_DIR}/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd
systemctl restart etcd
然後重新執行腳本
[root@master1 etcd]# bash etcd.sh etcd01 192.168.247.149 etcd02=https://192.168.247.143:2380,etcd03=https://192.168.247.144:2380
成功
從網絡上還找到兩個方法
第二種方式刪除所有etcd節點的 data-dir 文件(不刪也行),重啓各個節點的etcd服務,這個時候,每個節點的data-dir的數據都會被更新,就不會有以上故障了。
第三種方式是複製其他節點的data-dir中的內容,以此爲基礎上以 --force-new-cluster 的形式強行拉起一個,然後以添加新成員的方式恢復這個集羣。