Disaster recovery for etcd and the Kubernetes master
Background
Problem: suppose a k8s master node that also hosts etcd fails completely and cannot be recovered.
Plan: bring up a new host configured with the same IP and hostname as the failed master, restore it in place, and have it take over for the failed node.
Etcd recovery
See the official documentation:
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster
Note: per the official docs, restoring from a snapshot changes the cluster ID (this does not affect clients, which connect by IP), so every member of the etcd cluster must take part in the backup and restore.
Pre-checks
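A hedged way to observe this: `endpoint status -w json` includes `cluster_id` in the response header, so running it before and after the restore confirms that a new cluster was formed. The endpoint and cert paths below are the ones used elsewhere in this doc, and the check is skipped when etcdctl is absent:

```shell
# Print the current cluster ID; comparing the value before and after the
# restore shows the metadata change described above.
EP="https://10.90.1.238:2379"
if command -v etcdctl >/dev/null 2>&1; then
  MSG=$(ETCDCTL_API=3 etcdctl --endpoints="$EP" \
    --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem \
    --cacert=/etc/etcd/ssl/ca.pem \
    endpoint status -w json | grep -o '"cluster_id":[0-9]*')
else
  MSG="etcdctl not installed; skipping"
fi
echo "$MSG"
```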
~# export ENDPOINTS="https://10.90.1.238:2379,https://10.90.1.239:2379,https://10.90.1.240:2379"
~# etcdctl --endpoints=$ENDPOINTS --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/etcd/ssl/ca.pem member list
316dbd21b17d1b4f, started, NODE238, https://10.90.1.238:2380, https://10.90.1.238:2379, false
8a206cf1ed53b6f4, started, NODE240, https://10.90.1.240:2380, https://10.90.1.240:2379, false
e444265665d3bd32, started, NODE239, https://10.90.1.239:2380, https://10.90.1.239:2379, false
~# etcdctl endpoint health --endpoints=$ENDPOINTS --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem
https://10.90.1.239:2379 is healthy: successfully committed proposal: took = 16.882627ms
https://10.90.1.240:2379 is healthy: successfully committed proposal: took = 17.046735ms
https://10.90.1.238:2379 is healthy: successfully committed proposal: took = 19.395533ms
~ # kubectl get nodes
NAME STATUS ROLES AGE VERSION
ubuntu238 Ready master 123d v1.15.3
ubuntu239 Ready master 123d v1.15.3
ubuntu240 Ready master 123d v1.15.3
All confirmed healthy.
Taking the backup
Stop the clients
# Run on each of the 3 nodes; for binary installs, stop the corresponding component services instead
service kubelet stop
service docker stop
Snapshot backup
Run the following on each etcd node:
# change to this node's IP
export EP="https://10.90.1.238:2379"
export ETCDCTL_API=3
mkdir -p /var/lib/etcd_bak
etcdctl --endpoints=$EP --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/etcd/ssl/ca.pem snapshot save /var/lib/etcd_bak/`hostname`-etcd_`date +%Y%m%d%H%M`.db
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd_old
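Before tearing anything else down, it is worth sanity-checking the snapshot just written. A sketch assuming the backup directory above; `etcdctl snapshot status` reports the hash, revision, and size of a snapshot file, and the check is skipped if etcdctl or the snapshot is missing:

```shell
# Pick the newest snapshot in the backup directory and inspect it.
SNAP=$(ls -t /var/lib/etcd_bak/*.db 2>/dev/null | head -n 1)
if command -v etcdctl >/dev/null 2>&1 && [ -n "$SNAP" ]; then
  ETCDCTL_API=3 etcdctl snapshot status "$SNAP" -w table
  MSG="checked $SNAP"
else
  MSG="etcdctl or snapshot not available; skipping"
fi
echo "$MSG"
```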
Copy the files:
scp -r /etc/etcd/ssl/* NEW_HOST:/etc/etcd/ssl/
scp /etc/systemd/system/etcd.service NEW_HOST:/etc/systemd/system/
scp /var/lib/etcd_bak/ubuntu238-etcd_202001011000.db NEW_HOST:/var/lib/etcd_bak/
Copy the snapshot, the SSL certificates, and the systemd unit file to the new host. Also install docker, kubelet, kubectl, and etcd on the new host (installation steps omitted here). Since these components are all static Go binaries, you can simply scp the executables from an existing host to the same paths on the new one, instead of hunting down rpm packages of the matching versions.
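If you do copy the binaries over with scp, a quick way to make sure nothing was truncated in transit is to compare checksums on both hosts. The binary paths below are examples, not ones stated in this doc; adjust them to your install:

```shell
# Print checksums of whichever of these binaries exist; run the same loop
# on the source and destination hosts and diff the output.
for bin in /usr/local/bin/etcd /usr/local/bin/etcdctl \
           /usr/bin/kubelet /usr/bin/kubectl /usr/bin/docker; do
  [ -f "$bin" ] && sha256sum "$bin"
done
MSG="checksum listing done"
echo "$MSG"
```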
Restore the backup
Run the following on the replacement host and on the two unchanged hosts:
Every member must run the restore, because cluster metadata such as the negotiated cluster ID changes during it.
# adjust these three environment variables per node
export NODE_NAME=NODE238
export NODE_IP=10.90.1.238
export SNAPSHOT=ubuntu238-etcd_202001011000.db
export ETCDCTL_API=3
export ETCD_NODES="NODE238"=https://10.90.1.238:2380,"NODE239"=https://10.90.1.239:2380,"NODE240"=https://10.90.1.240:2380
export ENDPOINTS="https://10.90.1.238:2379,https://10.90.1.239:2379,https://10.90.1.240:2379"
cd /var/lib/etcd_bak/
etcdctl snapshot restore $SNAPSHOT \
--name=$NODE_NAME \
--cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem \
--cacert=/etc/etcd/ssl/ca.pem \
--initial-advertise-peer-urls=https://$NODE_IP:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=$ETCD_NODES \
--data-dir=/var/lib/etcd/
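One extra sanity check, using the same path as `--data-dir` above: a successful restore writes a fresh `member` directory under the data dir, so verify it exists before starting etcd:

```shell
# snapshot restore creates /var/lib/etcd/member on success.
if [ -d /var/lib/etcd/member ]; then
  MSG="restore wrote /var/lib/etcd/member"
else
  MSG="member directory missing; the restore did not run on this host"
fi
echo "$MSG"
```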
After the steps above, start etcd on all three hosts at the same time and check health:
~ # service etcd start
~ # etcdctl endpoint health --endpoints=$ENDPOINTS --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem
https://10.90.1.240:2379 is healthy: successfully committed proposal: took = 16.112222ms
https://10.90.1.238:2379 is healthy: successfully committed proposal: took = 16.126138ms
https://10.90.1.239:2379 is healthy: successfully committed proposal: took = 17.345037ms
etcd is restored!
K8s master recovery
If the original host is unchanged, nothing more is needed. If it was replaced with a new host, copy the following files to it (this applies to kubeadm-installed clusters; adjust accordingly for binary installs):
- everything under /etc/kubernetes/ (certificates and manifest files)
- /root/.kube/config in root's home directory (kubectl client credentials)
- everything under /var/lib/kubelet/ (plugins and container credentials)
- /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (kubelet systemd drop-in)
Once the files are in place, start the services:
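The file list above can be bundled into a single archive for transfer. A sketch, where NEW_HOST stays a placeholder and the paths assume a kubeadm install; paths that do not exist on this host are skipped:

```shell
# Collect the kubeadm state files that exist here into one tarball.
set -u
BUNDLE=/tmp/k8s-master-state.tar.gz
EXISTING=""
for p in /etc/kubernetes /root/.kube/config /var/lib/kubelet \
         /etc/systemd/system/kubelet.service.d/10-kubeadm.conf; do
  [ -e "$p" ] && EXISTING="$EXISTING $p"
done
if [ -n "$EXISTING" ]; then
  tar czf "$BUNDLE" $EXISTING
  # scp "$BUNDLE" NEW_HOST:/tmp/   # unpack on the new host: tar xzf ... -C /
  MSG="bundled:$EXISTING"
else
  MSG="none of the paths exist on this host"
fi
echo "$MSG"
```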
service docker start
service kubelet start
Check:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ubuntu238 Ready master 123d v1.15.3
ubuntu239 Ready master 123d v1.15.3
ubuntu240 Ready master 123d v1.15.3
Barring surprises, the master is back within a minute or so.
Summary
Recovering the master itself is simple and needs little comment. Recovering etcd is the fiddly part: every member must go through the initial bootstrap again, so follow the steps in order and work carefully.
Note: this procedure interrupts client writes, i.e. the k8s services must be stopped. It is high-risk; use with caution in production.