Disaster Recovery for etcd and the Kubernetes Master

Background

Problem: a k8s master node that also runs etcd fails completely and cannot be recovered.

Plan: bring up a new host configured with the same IP and hostname as the failed master, restore in place, and have it take over from the failed master node.

Etcd Recovery

Refer to the official documentation:

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster

Note: as the official documentation explains, restoring from a snapshot rewrites cluster metadata such as the member ID and cluster ID (this does not affect clients, which connect by IP). Because of this, every member of the etcd cluster must take part in the backup and restore.
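If you want to inspect this metadata yourself, the cluster ID can be read from the JSON output of endpoint status. A minimal sketch, assuming jq is installed and the field names of etcdctl v3.4's JSON output:

# Print the cluster ID reported by one member (pick any endpoint)
etcdctl --endpoints=https://10.90.1.238:2379 \
  --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/etcd/ssl/ca.pem \
  endpoint status -w json | jq '.[].Status.header.cluster_id'

Comparing this value before and after the restore makes it clear why all members must be rebuilt from the same snapshot.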

Check

~# export ENDPOINTS="https://10.90.1.238:2379,https://10.90.1.239:2379,https://10.90.1.240:2379"

~# etcdctl --endpoints=$ENDPOINTS --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/etcd/ssl/ca.pem  member list
316dbd21b17d1b4f, started, NODE238, https://10.90.1.238:2380, https://10.90.1.238:2379, false
8a206cf1ed53b6f4, started, NODE240, https://10.90.1.240:2380, https://10.90.1.240:2379, false
e444265665d3bd32, started, NODE239, https://10.90.1.239:2380, https://10.90.1.239:2379, false

~# etcdctl endpoint health --endpoints=$ENDPOINTS --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem
https://10.90.1.239:2379 is healthy: successfully committed proposal: took = 16.882627ms
https://10.90.1.240:2379 is healthy: successfully committed proposal: took = 17.046735ms
https://10.90.1.238:2379 is healthy: successfully committed proposal: took = 19.395533ms

~ # kubectl get nodes
NAME        STATUS   ROLES    AGE    VERSION
ubuntu238   Ready    master   123d   v1.15.3
ubuntu239   Ready    master   123d   v1.15.3
ubuntu240   Ready    master   123d   v1.15.3

Confirm everything is healthy and correct.

Start the Backup

Stop the Clients

# Run on each of the 3 nodes; for binary installations, stop the corresponding component services
service kubelet stop
service docker stop
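Before taking snapshots, it is worth confirming that nothing is still writing to etcd. A quick sanity check, as a sketch, assuming systemd-managed services and that ss is available on the hosts:

# Both should report inactive
systemctl is-active kubelet docker

# No kube-apiserver process should remain on the node
pgrep -a kube-apiserver

# List any remaining established connections to the etcd client port (there should be none from kube-apiserver)
ss -tnp | grep 2379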

Back Up Snapshots

Run the following on each etcd node:

# Change to the IP of the corresponding node
export EP="https://10.90.1.238:2379"

export ETCDCTL_API=3

mkdir -p /var/lib/etcd_bak

etcdctl --endpoints=$EP --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/etcd/ssl/ca.pem   snapshot save  /var/lib/etcd_bak/`hostname`-etcd_`date +%Y%m%d%H%M`.db

systemctl stop etcd

mv /var/lib/etcd /var/lib/etcd_old
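Before moving on, each node can sanity-check the snapshot it just took. A minimal sketch, assuming etcdctl v3.4, where snapshot status is still provided (newer releases move it to etcdutl):

# Show the hash, revision, total keys and size of the snapshot file
etcdctl snapshot status /var/lib/etcd_bak/`hostname`-etcd_*.db -w table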

Copy the files:

# NEW_HOST is the replacement host; make sure the target directories exist there first
scp -r /etc/etcd/ssl/* NEW_HOST:/etc/etcd/ssl/
scp /etc/systemd/system/etcd.service NEW_HOST:/etc/systemd/system/
scp /var/lib/etcd_bak/ubuntu238-etcd_202001011000.db NEW_HOST:/var/lib/etcd_bak/

Copy the backed-up snapshot file, the SSL certificate files, and the systemd unit file to the new host. At the same time, install docker, kubelet, kubectl, and etcd on the new host (installation steps are omitted here). Since these components are Go binaries, you can simply scp the executables from an existing host to the same paths on the new host instead of spending time hunting for rpm packages of the matching versions.
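For example, copying the binaries over could look like the sketch below; the /usr/local/bin paths are an assumption, so adjust them to wherever your installation actually keeps the executables (NEW_HOST is a placeholder for the replacement host):

# Run from a surviving node; assumes the binaries live in /usr/local/bin
scp /usr/local/bin/etcd /usr/local/bin/etcdctl NEW_HOST:/usr/local/bin/
scp /usr/local/bin/kubelet /usr/local/bin/kubectl NEW_HOST:/usr/local/bin/
# docker is usually easier to install from the distribution's packages instead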

Restore the Backup

Run the following on the newly added replacement host as well as on the other 2 unchanged hosts:

This must be executed on every member host, because metadata negotiated by etcd, such as the cluster ID, changes during the restore.

# Adjust these 3 environment variables for each node
export NODE_NAME=NODE238
export NODE_IP=10.90.1.238
export SNAPSHOT=ubuntu238-etcd_202001011000.db

export ETCDCTL_API=3
export ETCD_NODES="NODE238=https://10.90.1.238:2380,NODE239=https://10.90.1.239:2380,NODE240=https://10.90.1.240:2380"
export ENDPOINTS="https://10.90.1.238:2379,https://10.90.1.239:2379,https://10.90.1.240:2379"

cd /var/lib/etcd_bak/

etcdctl snapshot restore  $SNAPSHOT \
--name=$NODE_NAME \
--cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem \
--cacert=/etc/etcd/ssl/ca.pem \
--initial-advertise-peer-urls=https://$NODE_IP:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=$ETCD_NODES \
--data-dir=/var/lib/etcd/
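After the restore finishes on a node, the data directory should contain a freshly generated member directory. If your etcd.service runs etcd as a dedicated non-root user (an assumption, check the unit file), also fix the ownership:

# The restored directory should contain member/snap and member/wal
ls -l /var/lib/etcd/member/

# Only needed if etcd.service runs as a non-root "etcd" user
# chown -R etcd:etcd /var/lib/etcd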

After completing the steps above, start etcd on all 3 hosts at the same time and check its status:

~ # service etcd start
~ # etcdctl endpoint health --endpoints=$ENDPOINTS --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem
https://10.90.1.240:2379 is healthy: successfully committed proposal: took = 16.112222ms
https://10.90.1.238:2379 is healthy: successfully committed proposal: took = 16.126138ms
https://10.90.1.239:2379 is healthy: successfully committed proposal: took = 17.345037ms
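Optionally, before bringing the control plane back up, you can also confirm that the restored data really contains the Kubernetes registry keys (/registry is the default prefix; adjust it if your apiserver was started with a different --etcd-prefix):

etcdctl --endpoints=$ENDPOINTS --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/etcd/ssl/ca.pem \
  get /registry --prefix --keys-only --limit=5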

etcd has been restored successfully!

K8s Master Recovery

If the original host is kept, nothing needs to be done. If it is replaced by a new host, copy the following files to the new host (this applies to kubeadm-installed k8s; adjust accordingly for binary installations; see the sketch after this list):

  • everything under /etc/kubernetes/ (certificates and manifest files)
  • /root/.kube/config in the root user's home directory (kubectl connection credentials)
  • everything under /var/lib/kubelet/ (plugins and container connection credentials)
  • /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (the kubelet systemd drop-in)
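A minimal copy sketch, assuming these files were preserved in a backup of the failed master (NEW_HOST is a placeholder for the replacement host):

# Run from wherever the backup of the failed master's files lives
scp -r /etc/kubernetes NEW_HOST:/etc/
scp -r /root/.kube NEW_HOST:/root/
scp -r /var/lib/kubelet NEW_HOST:/var/lib/
ssh NEW_HOST "mkdir -p /etc/systemd/system/kubelet.service.d/"
scp /etc/systemd/system/kubelet.service.d/10-kubeadm.conf NEW_HOST:/etc/systemd/system/kubelet.service.d/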

After copying, start the services:

service docker start
service kubelet start

Check:

kubectl get nodes
NAME        STATUS   ROLES    AGE    VERSION
ubuntu238   Ready    master   123d   v1.15.3
ubuntu239   Ready    master   123d   v1.15.3
ubuntu240   Ready    master   123d   v1.15.3

Barring surprises, the master should be back within about a minute.
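As a final sanity check, you can also confirm that the control-plane pods on the restored master are running again (ubuntu238 is the replaced node in this example):

kubectl get pods -n kube-system -o wide | grep ubuntu238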

Summary

Restoring the master is quite straightforward and needs little explanation. Restoring etcd is more involved, since every member has to be re-initialized; follow the steps in order and be careful not to make mistakes.

Note: this procedure requires interrupting client writes, i.e. stopping the k8s services. It is high risk, so use it with caution in production!
