Troubleshooting an etcd cluster in Rancher

Environment: Rancher 1.6, Kubernetes 1.10, 3-node etcd. The etcd-3 node kept restarting and its logs showed the health check failing. After updating resources (pods, deployments) with kubectl apply, they sat in Pending, which did not match the actual running state reported by describe. This post records the troubleshooting and summarizes how Rancher deploys its etcd cluster.

Rancher runs the etcd cluster with Docker: by default the etcd service runs containerized on newly joined hosts, three nodes in total.
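The containers are easy to spot on each host. Container names are Rancher-generated and vary per environment, so the filter below is only a starting point:

docker ps --filter name=etcd --format '{{.Names}}\t{{.Image}}\t{{.Status}}'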

etcd image: rancher/etcd:v2.3.7-17

etcd image build history: https://hub.docker.com/layers/rancher/etcd/v2.3.7-17/images/sha256-0a893573bac68f79f5d4acd37fd20d319326c02623dd9142c4659f792035d778

The image layers show that the etcd startup script is /opt/rancher/run.sh (argument: node). The script defines the functions probe_https, switch_node_to_https, etcdctl_quorum, etcdctl_one, healthcheck_proxy, create_backup, rolling_backup, cleanup, standalone_node, restart_node, runtime_node, recover_node, disaster_node, and node.
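The same information can be confirmed locally instead of on the Docker Hub page; both commands below are standard Docker CLI:

# pull the image and list its build layers (run.sh shows up in one of them)
docker pull rancher/etcd:v2.3.7-17
docker history --no-trunc rancher/etcd:v2.3.7-17

# show what the container executes on start
docker inspect --format 'Entrypoint={{ .Config.Entrypoint }} Cmd={{ .Config.Cmd }}' rancher/etcd:v2.3.7-17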

The script first fetches the etcd certificate archive; because the argument is node, it then runs the node main function, and otherwise falls back to standalone_node.
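A minimal sketch of that dispatch, assuming a helper that fetches the certificates (download_certs is a hypothetical name; this is not the verbatim script):

#!/bin/bash
download_certs   # hypothetical: fetch and unpack the etcd TLS certificate archive

case "$1" in
    node) node ;;            # main entry point, used by Rancher
    *)    standalone_node ;; # fallback when no node argument is passed
esac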

Rancher's etcd deployment model is shown in the figure below (figure: etcd deployment in Rancher):

Rancher uses a busybox image to start a companion container, etcd-data, that manages the storage; the corresponding volume can be listed with docker volume ls.
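For example:

# list candidate volumes (names vary per environment)
docker volume ls --filter name=etcd

# resolve where a volume lives on the host; the volume name is illustrative
docker volume inspect --format '{{ .Mountpoint }}' etcd-data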

Inside the running etcd container, /pdata is mounted as the data directory and /data-backup as the backup directory. Per the startup script, when the leader node starts, a backup is triggered if /pdata already contains data or if the action triggered by the previous backup has not finished.

The backup copies the contents of /pdata/data.current/member, which follows the standard etcd v2 data directory layout (snap holds the snapshot data files, wal holds the write-ahead log files):

/pdata/data.current/member
├── snap    # snapshot data files
└── wal     # write-ahead log files
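Putting the trigger and the layout together, a hedged sketch of the leader-side backup step (the lock-file check standing in for "previous backup unfinished" is a guess, not the verbatim script):

# on the leader, inside the backup path of run.sh (sketch)
if [ -d /pdata/data.current/member ] || [ -e /data-backup/backup.lock ]; then
    # copy the member directory into a timestamped backup folder
    cp -a /pdata/data.current/member "/data-backup/member.$(date +%Y%m%dT%H%M%S)"
fi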

The etcd cluster reaches consensus with Raft: the members elect a leader, and writes are committed through log replication in a two-phase fashion. The details are covered in a separate post.

Going through the logs showed that the etcd-3 container always entered the runtime_node() function, where it waits for the etcd containers with IDs lower than its own to become healthy. etcd-2 was stuck at healthy: false for an unknown reason, and force-starting etcd-3 before the cause was identified left etcd-3 with a cluster ID different from that of the etcd-1 cluster, so the cluster information on etcd-2 and etcd-3 no longer matched.
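The wait loop can be reconstructed roughly from the shell trace quoted in the error logs at the end of this post (variable names and the timeout handling are assumptions):

seconds=0
while true; do
    if curl -s -k --cacert /etc/etcd/ssl/ca.pem --cert /etc/etcd/ssl/cert.pem \
           --key /etc/etcd/ssl/key.pem https://kubernetes-etcd-2:2379/health \
         | grep -q '{"health": "true"}'; then
        break                     # the lower-ID member is healthy; proceed
    fi
    echo seconds= $seconds
    if [ "$seconds" -gt 60 ]; then
        break                     # what happens past 60 seconds is a guess
    fi
    sleep 7.4
    seconds=$((seconds + 7))      # increment loosely inferred from the trace
done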

Renaming etcd-2's backup folder and then restarting etcd-2 and etcd-3 together removed the interference of etcd-3 and of the backup data with etcd-2, and etcd-2's health returned to normal. etcd-3, having split from the original cluster for an unknown reason, stood up as a one-member cluster and elected itself leader.
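In commands, the fix looked roughly like this (container names are placeholders; check docker ps on each host):

# on the etcd-2 host: move the backup data aside from inside the container
docker exec <etcd-2-container> mv /data-backup /data-backup.old
docker restart <etcd-2-container>

# on the etcd-3 host
docker restart <etcd-3-container>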

Check cluster status on etcd-2 and etcd-3 respectively (etcd-3 has already split off):

etcdctl --ca-file=/etc/etcd/ssl/ca.pem --cert-file=/etc/etcd/ssl/cert.pem --key-file=/etc/etcd/ssl/key.pem cluster-health
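On the split-off etcd-3, the command reports a one-member cluster consisting only of itself, along the lines of (member ID taken from the logs below; output is illustrative):

member 199d8ab79584f11 is healthy: got healthy result from https://kubernetes-etcd-3:2379
cluster is healthy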

After completely wiping the etcd-data volume on etcd-3's host together with the backups directory mounted on the host path, the host was re-added through Rancher, and all three etcd nodes returned to normal.
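A hedged sketch of that cleanup (volume and directory names vary per environment; verify with docker volume ls and docker inspect before deleting anything):

# on the etcd-3 host: remove leftover etcd containers and their data volume
docker rm -f $(docker ps -aq --filter name=etcd)
docker volume rm etcd-data            # volume name is illustrative

# remove the host-side backups directory (this path is an assumption)
rm -rf /var/lib/rancher/etcd/backups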

A working cluster can then be verified by setting a key on one node:

etcdctl --ca-file=/etc/etcd/ssl/ca.pem --cert-file=/etc/etcd/ssl/cert.pem --key-file=/etc/etcd/ssl/key.pem set key value

and reading it back on another node:

etcdctl --ca-file=/etc/etcd/ssl/ca.pem --cert-file=/etc/etcd/ssl/cert.pem --key-file=/etc/etcd/ssl/key.pem get key
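On success, set echoes the value back and the get on the other node prints the same value. For example, eliding the TLS flags shown above (key and value are arbitrary):

# on etcd-1
etcdctl ... set /probe ok
# prints: ok

# on etcd-2
etcdctl ... get /probe
# prints: ok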

 

Error messages:

curl -s -k --cacert /etc/etcd/ssl/ca.pem --cert /etc/etcd/ssl/cert.pem --key /etc/etcd/ssl/key.pem https://kubernetes-etcd-2:2379/health
2019/10/17 15:19:47 ++ grep -q '{"health": "true"}'
2019/10/17 15:19:47 ++ '[' 1 -eq 0 ']'
2019/10/17 15:19:47 ++ '[' true == true ']'
2019/10/17 15:19:47 ++ echo seconds= 35
2019/10/17 15:19:47 ++ '[' 35 -gt 60 ']'
2019/10/17 15:19:47 ++ sleep 7.4

======================

raft: newRaft 199d8ab79584f11 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2019-10-17 08:29:57.878738 I | raft: 199d8ab79584f11 became follower at term 1
2019-10-17 08:29:57.902166 I | etcdserver: starting server... [version: 2.3.7, cluster version: to_be_decided]
2019-10-17 08:29:57.904500 N | etcdserver: added local member 199d8ab79584f11 [https://kubernetes-etcd-3:2380] to cluster aee0b3f3e4b91e10
2019-10-17 08:29:57.932362 E | rafthttp: request cluster ID mismatch (got 5e14199e77db2090 want aee0b3f3e4b91e10)
2019-10-17 08:29:57.932463 E | rafthttp: request cluster ID mismatch (got 5e14199e77db2090 want aee0b3f3e4b91e10)
2019-10-17 08:29:58.042553 E | rafthttp: request cluster ID mismatch (got 5e14199e77db2090 want aee0b3f3e4b91e10)

=============

2019/10/17 16:34:32 2019-10-17 08:34:32.944134 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[7cc03574e2a32160]=aee0b3f3e4b91e10, local=5e14199e77db2090)
2019/10/17 16:34:33 ++ '[' '' == '' ']'
2019/10/17 16:34:33 +++ etcdctl --endpoints=https://127.0.0.1:2379 member list
2019/10/17 16:34:33 +++ grep 10-42-31-6
2019/10/17 16:34:33 +++ cut -d : -f 1
2019/10/17 16:34:33 2019-10-17 08:34:33.054718 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[7cc03574e2a32160]=aee0b3f3e4b91e10, local=5e14199e77db2090)
2019/10/17 16:34:33 2019-10-17 08:34:33.057136 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[7cc03574e2a32160]=aee0b3f3e4b91e10, local=5e14199e77db2090)
2019/10/17 16:34:33 Failed to get leader:  client: etcd cluster is unavailable or misconfigured
2019/10/17 16:34:33 ++ member_id=
2019/10/17 16:34:33 ++ sleep 1

 
