etcd Backup and Restore for a kubeadm-Installed Kubernetes Cluster
[TOC]
1. Background
After a typhoon on September 16, 2018, etcd in one of my Kubernetes test clusters failed to start. After half a day of rescue attempts it was still dead (all 3 masters showed the error below), so I spent another half day rebuilding the environment. Even with an etcd cluster, backups are a must: once the data is gone, everything is gone. Fortunately the problem surfaced early; if this had happened in production, I would probably have had to pack up and leave. Hence this look at Kubernetes backup.
2018-09-17 00:11:55.781279 I | etcdmain: etcd Version: 3.2.18
2018-09-17 00:11:55.781457 I | etcdmain: Git SHA: eddf599c6
2018-09-17 00:11:55.781477 I | etcdmain: Go Version: go1.8.7
2018-09-17 00:11:55.781503 I | etcdmain: Go OS/Arch: linux/amd64
2018-09-17 00:11:55.781519 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2018-09-17 00:11:55.781634 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-09-17 00:11:55.781702 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = , trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true
2018-09-17 00:11:55.783073 I | embed: listening for peers on https://192.168.105.92:2380
2018-09-17 00:11:55.783182 I | embed: listening for client requests on 127.0.0.1:2379
2018-09-17 00:11:55.783281 I | embed: listening for client requests on 192.168.105.92:2379
2018-09-17 00:11:55.791474 I | etcdserver: recovered store from snapshot at index 16471696
2018-09-17 00:11:55.792633 I | mvcc: restore compact to 13683366
2018-09-17 00:11:55.849153 C | mvcc: store.keyindex: put with unexpected smaller revision [{13685569 0} / {13685569 0}]
panic: store.keyindex: put with unexpected smaller revision [{13685569 0} / {13685569 0}]
goroutine 89 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42018c160, 0xfa564e, 0x3e, 0xc420062cb0, 0x2, 0x2)
/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*keyIndex).put(0xc4207fd7c0, 0xd0d341, 0x0)
/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/key_index.go:80 +0x3ec
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.restoreIntoIndex.func1(0xc42029e460, 0xc4202a0600, 0x14bef40, 0xc420285640)
/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:367 +0x3e3
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.restoreIntoIndex
/tmp/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:374 +0xa5
2. Environment
Kubernetes 1.11, installed with kubeadm.
3. Inspecting the etcd cluster
# List the members
etcdctl --endpoints=https://192.168.105.92:2379,https://192.168.105.93:2379,https://192.168.105.94:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
# List the Kubernetes keys
export ETCDCTL_API=3
etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
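Retyping the certificate flags quickly gets tedious. A small wrapper can export the v3 API and the kubeadm certificate paths once (the `etcd_v3` helper name is my own, not part of kubeadm; the paths are the standard kubeadm locations used throughout this article):

```shell
# Hypothetical convenience wrapper around etcdctl for this cluster.
export ETCDCTL_API=3
etcd_v3() {
  etcdctl \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    "$@"
}

# Usage:
# etcd_v3 member list
# etcd_v3 get / --prefix --keys-only
```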
4. Backing up etcd data
What to back up:
- All files under /etc/kubernetes/ (certificates and manifest files)
- All files under /var/lib/kubelet/ (plugin container connection credentials)
- The etcd v3 API data

Add the following script to a scheduled task to back up daily.
#!/usr/bin/env bash
##############################################################
# File Name: ut_backup_k8s.sh
# Version: V1.0
# Author: Chinge_Yang
# Blog: http://blog.csdn.net/ygqygq2
# Created Time : 2018-09-18 09:13:55
# Description:
##############################################################
# Change to the directory containing this script
cd `dirname $0`
bash_path=`pwd`
# Script name
me=$(basename $0)
# delete dir and keep days
delete_dirs=("/data/backup/kubernetes:7")
backup_dir=/data/backup/kubernetes
files_dir=("/etc/kubernetes" "/var/lib/kubelet")
log_dir=$backup_dir/log
shell_log=$log_dir/${USER}_${me}.log
ssh_port="22"
ssh_parameters="-o StrictHostKeyChecking=no -o ConnectTimeout=60"
ssh_command="ssh ${ssh_parameters} -p ${ssh_port}"
scp_command="scp ${ssh_parameters} -P ${ssh_port}"
DATE=$(date +%F)
BACK_SERVER="127.0.0.1" # remote backup server IP
BACK_SERVER_BASE_DIR="/data/backup"
BACK_SERVER_DIR="$BACK_SERVER_BASE_DIR/kubernetes/${HOSTNAME}" # remote backup server directory
BACK_SERVER_LOG_DIR="$BACK_SERVER_BASE_DIR/kubernetes/logs"
# Logging helper
function save_log () {
echo -e "`date +%F\ %T` $*" >> $shell_log
}
[ ! -d $log_dir ] && mkdir -p $log_dir
save_log "start backup kubernetes"
# Colored output helpers
function red_echo () {
# Usage: red_echo "text"
local what=$*
echo -e "\e[1;31m ${what} \e[0m"
}
function green_echo () {
# Usage: green_echo "text"
local what=$*
echo -e "\e[1;32m ${what} \e[0m"
}
function yellow_echo () {
# Usage: yellow_echo "text"
local what=$*
echo -e "\e[1;33m ${what} \e[0m"
}
function twinkle_echo () {
# Usage: twinkle_echo "$(red_echo "text")" -- blinking red output
local twinkle='\e[05m'
local what="${twinkle} $*"
echo -e "${what}"
}
function return_echo () {
[ $? -eq 0 ] && green_echo "$* succeeded" || red_echo "$* failed"
}
function return_error_exit () {
[ $? -eq 0 ] && REVAL="0"
local what=$*
if [ "$REVAL" = "0" ];then
[ ! -z "$what" ] && green_echo "$what succeeded"
else
red_echo "$* failed, exiting"
exit 1
fi
}
# Confirmation prompt; aborts the script on "no"
function user_verify_function () {
while true;do
echo ""
read -p "Confirm? [Y/N]:" Y
case $Y in
[yY]|[yY][eE][sS])
echo -e "answer: \\033[20G [ \e[1;32mYes\e[0m ] \033[0m"
break
;;
[nN]|[nN][oO])
echo -e "answer: \\033[20G [ \e[1;32mNo\e[0m ] \033[0m"
exit 1
;;
*)
continue
;;
esac
done
}
# Confirmation prompt; returns 1 on "no" instead of exiting
function user_pass_function () {
while true;do
echo ""
read -p "Confirm? [Y/N]:" Y
case $Y in
[yY]|[yY][eE][sS])
echo -e "answer: \\033[20G [ \e[1;32mYes\e[0m ] \033[0m"
break
;;
[nN]|[nN][oO])
echo -e "answer: \\033[20G [ \e[1;32mNo\e[0m ] \033[0m"
return 1
;;
*)
continue
;;
esac
done
}
function backup () {
for f_d in ${files_dir[@]}; do
f_name=$(basename ${f_d})
d_name=$(dirname $f_d)
cd $d_name
tar -cjf ${f_name}.tar.bz $f_name
if [ $? -eq 0 ]; then
file_size=$(du ${f_name}.tar.bz|awk '{print $1}')
save_log "$file_size ${f_name}.tar.bz"
save_log "finish tar ${f_name}.tar.bz"
else
file_size=0
save_log "failed tar ${f_name}.tar.bz"
fi
rsync -avzP ${f_name}.tar.bz $backup_dir/$(date +%F)-${f_name}.tar.bz
rm -f ${f_name}.tar.bz
done
export ETCDCTL_API=3
etcdctl --cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
snapshot save $backup_dir/$(date +%F)-k8s-snapshot.db
cd $backup_dir
tar -cjf $(date +%F)-k8s-snapshot.tar.bz $(date +%F)-k8s-snapshot.db
if [ $? -eq 0 ]; then
file_size=$(du $(date +%F)-k8s-snapshot.tar.bz|awk '{print $1}')
save_log "$file_size $(date +%F)-k8s-snapshot.tar.bz"
save_log "finish tar $(date +%F)-k8s-snapshot.tar.bz"
else
file_size=0
save_log "failed tar $(date +%F)-k8s-snapshot.tar.bz"
fi
rm -f $(date +%F)-k8s-snapshot.db
}
function rsync_backup_files () {
# Transfer the backup files to the remote backup server; passwordless ssh must be configured
$ssh_command root@${BACK_SERVER} "mkdir -p ${BACK_SERVER_DIR}/${DATE}/"
rsync -avz --bwlimit=5000 -e "${ssh_command}" $backup_dir/*.bz \
root@${BACK_SERVER}:${BACK_SERVER_DIR}/${DATE}/
[ $? -eq 0 ] && save_log "success rsync" || \
save_log "failed rsync"
}
function delete_old_files () {
for delete_dir_keep_days in ${delete_dirs[@]}; do
delete_dir=$(echo $delete_dir_keep_days|awk -F':' '{print $1}')
keep_days=$(echo $delete_dir_keep_days|awk -F':' '{print $2}')
[ -n "$delete_dir" ] && cd ${delete_dir}
[ $? -eq 0 ] && find -L ${delete_dir} -mindepth 1 -mtime +$keep_days -exec rm -rf {} \;
done
}
backup
delete_old_files
#rsync_backup_files
save_log "finish $0\n"
exit 0
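To run the script daily as suggested above, install a cron entry on each master, for example (the 02:30 schedule and the script path are assumptions; adjust to where you put the script):

```shell
# /etc/cron.d/k8s-backup -- run the etcd/Kubernetes backup every day at 02:30
30 2 * * * root /bin/bash /data/scripts/ut_backup_k8s.sh
```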
5. Restoring etcd data
Note
The restore procedure stops all applications and all access!!!
First stop kube-apiserver on all three master machines, and make sure it has fully stopped.
mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
docker ps|grep k8s_ # check whether etcd/kube-apiserver are still up; wait until all have stopped
mv /var/lib/etcd /var/lib/etcd.bak
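The "wait until everything has stopped" step can be made explicit with a small polling loop (a sketch; `wait_for_stop` is a hypothetical helper, and `k8s_` is the prefix kubelet gives its Docker container names):

```shell
# Poll until no k8s_ containers (etcd, kube-apiserver, ...) are left running.
wait_for_stop() {
  while docker ps --format '{{.Names}}' | grep -q '^k8s_'; do
    echo "waiting for control-plane containers to stop..."
    sleep 5
  done
}

# Usage: wait_for_stop
```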
The etcd cluster is restored from the same snapshot on every member.
# Prepare the restore files
cd /tmp
tar -jxvf /data/backup/kubernetes/2018-09-18-k8s-snapshot.tar.bz
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.93:/tmp/
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.94:/tmp/
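Before restoring, it is worth verifying the snapshot file itself; `etcdctl snapshot status` is the standard v3 subcommand for this (the `check_snapshot` wrapper name is my own):

```shell
# Print the hash, revision, total keys and size of a snapshot file
# before trusting it for a restore.
check_snapshot() {
  ETCDCTL_API=3 etcdctl snapshot status "$1" --write-out=table
}

# Usage: check_snapshot /tmp/2018-09-18-k8s-snapshot.db
```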
Run on lab1:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
--endpoints=192.168.105.92:2379 \
--name=lab1 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.105.92:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
--data-dir=/var/lib/etcd
Run on lab2:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
--endpoints=192.168.105.93:2379 \
--name=lab2 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.105.93:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
--data-dir=/var/lib/etcd
Run on lab3:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
--endpoints=192.168.105.94:2379 \
--name=lab3 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.105.94:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
--data-dir=/var/lib/etcd
After all three restores complete, restore the manifests on the three master machines.
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
Finally, verify:
# Check the keys again
[root@lab1 kubernetes]# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
/registry/apiextensions.k8s.io/customresourcedefinitions/apprepositories.kubeapps.com
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io
........output truncated..........
[root@lab1 kubernetes]# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-777d78ff6f-m5chm 1/1 Running 1 18h
coredns-777d78ff6f-xm7q8 1/1 Running 1 18h
dashboard-kubernetes-dashboard-7cfc6c7bf5-hr96q 1/1 Running 0 13h
dashboard-kubernetes-dashboard-7cfc6c7bf5-x9p7j 1/1 Running 0 13h
etcd-lab1 1/1 Running 0 18h
etcd-lab2 1/1 Running 0 1m
etcd-lab3 1/1 Running 0 18h
kube-apiserver-lab1 1/1 Running 0 18h
kube-apiserver-lab2 1/1 Running 0 1m
kube-apiserver-lab3 1/1 Running 0 18h
kube-controller-manager-lab1 1/1 Running 0 18h
kube-controller-manager-lab2 1/1 Running 0 1m
kube-controller-manager-lab3 1/1 Running 0 18h
kube-flannel-ds-7w6rl 1/1 Running 2 18h
kube-flannel-ds-b9pkf 1/1 Running 2 18h
kube-flannel-ds-fck8t 1/1 Running 1 18h
kube-flannel-ds-kklxs 1/1 Running 1 18h
kube-flannel-ds-lxxx9 1/1 Running 2 18h
kube-flannel-ds-q7lpg 1/1 Running 1 18h
kube-flannel-ds-tlqqn 1/1 Running 1 18h
kube-proxy-85j7g 1/1 Running 1 18h
kube-proxy-gdvkk 1/1 Running 1 18h
kube-proxy-jw5gh 1/1 Running 1 18h
kube-proxy-pgfxf 1/1 Running 1 18h
kube-proxy-qx62g 1/1 Running 1 18h
kube-proxy-rlbdb 1/1 Running 1 18h
kube-proxy-whhcv 1/1 Running 1 18h
kube-scheduler-lab1 1/1 Running 0 18h
kube-scheduler-lab2 1/1 Running 0 1m
kube-scheduler-lab3 1/1 Running 0 18h
kubernetes-dashboard-754f4d5f69-7npk5 1/1 Running 0 13h
kubernetes-dashboard-754f4d5f69-whtg9 1/1 Running 0 13h
tiller-deploy-98f7f7564-59hcs 1/1 Running 0 13h
Checked the corresponding applications as well; all data is intact.
6. Summary
Whether Kubernetes was installed from binaries or with kubeadm, backup essentially comes down to backing up etcd. For restore, the critical part is the order of operations: stop kube-apiserver, stop etcd, restore the data, start etcd, then start kube-apiserver.