This article collects problems hit while deploying a Kubernetes (k8s) cluster. Environments differ and my experience is limited, so treat it as a reference only.
Note: this article is updated from time to time.
Repository and key problems
Using the USTC mirror (mainland China):
cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial main
EOF
Then update:
apt-get update
But it fails:
Ign:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Get:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages [31.3 kB]
Err:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Hash Sum mismatch
Fetched 38.9 kB in 1s (20.2 kB/s)
Reading package lists... Done
E: Failed to fetch http://mirrors.ustc.edu.cn/kubernetes/apt/dists/kubernetes-xenial/main/binary-amd64/Packages.gz Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.
Cause and fix:
Add the key:
gpg --keyserver keyserver.ubuntu.com --recv-keys 6A030B21BA07F4FB
gpg --export --armor 6A030B21BA07F4FB | sudo apt-key add -
Result: still fails.
Using the official upstream source (outside China):
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
Running apt-get update then hangs and fails.
Using the Aliyun mirror:
cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF
Add the key (note: cat cannot read a URL, so fetch it with curl):
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
If that fails, first download https://packages.cloud.google.com/apt/doc/apt-key.gpg by some other means, save it to the current directory, then run:
cat apt-key.gpg | sudo apt-key add -
Run apt-get update again; this time it succeeds.
Without the key, updating against the Aliyun mirror fails:
W: GPG error: https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 6A030B21BA07F4FB
W: The repository 'https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
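As an aside: on newer Ubuntu releases apt-key is deprecated. A hedged sketch of the keyring-based equivalent (the keyring path follows the convention used in the Kubernetes docs; run as root, network access to the key URL assumed):

```shell
# Store the key in a dedicated keyring instead of apt-key add.
curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
  | gpg --dearmor -o /usr/share/keyrings/kubernetes-archive-keyring.gpg
# Reference that keyring explicitly via signed-by in the source entry.
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main" \
  > /etc/apt/sources.list.d/kubernetes.list
apt-get update
```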
Querying the matching k8s configuration packages:
W1214 08:46:14.303158 8461 version.go:101] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get https://dl.k8s.io/release/stable-1.txt: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
W1214 08:46:14.303772 8461 version.go:102] falling back to the local client version: v1.17.0
W1214 08:46:14.304223 8461 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1214 08:46:14.304609 8461 validation.go:28] Cannot validate kubelet config - no validator is available
Cause and fix:
The site is unreachable from inside China. This can be ignored: kubeadm falls back to the local client version (v1.17.0 here).
Script execution problems
pullk8s.sh: 3: pullk8s.sh: Syntax error: "(" unexpected
Cause and fix:
The script must start with #!/bin/bash. If it does not, run it with:
bash pullk8s.sh
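A minimal illustration of why the shebang matters: on Ubuntu, /bin/sh is dash, which rejects bash-only syntax such as array literals; that is exactly the `Syntax error: "(" unexpected` above. The demo script below is hypothetical:

```shell
# demo.sh uses a bash array; dash chokes on the "(" in the array literal.
cat > /tmp/demo.sh <<'EOF'
#!/bin/bash
images=(kube-apiserver kube-scheduler)
echo "${images[0]}"
EOF
chmod +x /tmp/demo.sh
bash /tmp/demo.sh   # prints: kube-apiserver
```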
Initializing the environment: kubeadm init
Error:
[ERROR Swap]: running with swap on is not supported. Please disable swap
Cause and fix:
Swap is not supported and must be disabled.
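Disabling swap persistently means turning it off now and commenting out the swap entry in /etc/fstab. A sketch that previews the fstab edit on a sample copy (the real commands, which need root, are shown in the comments; the sample fstab content is made up):

```shell
# On a real node (as root):
#   swapoff -a
#   sed -i '/\sswap\s/ s/^/#/' /etc/fstab
# Preview the same edit on a sample fstab:
cat > /tmp/fstab.sample <<'EOF'
UUID=abcd / ext4 errors=remount-ro 0 1
/swapfile none swap sw 0 0
EOF
sed '/\sswap\s/ s/^/#/' /tmp/fstab.sample
```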
Error:
[ERROR Port-10250]: Port 10250 is in use
Fix: stop the running kubelet first:
systemctl stop kubelet
The WARNING IsDockerSystemdCheck message:
[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
Cause and fix:
Docker is using cgroupfs, which does not match what k8s expects. Check first:
# docker info | grep -i cgroup
Cgroup Driver: cgroupfs // !!! note: cgroupfs here
WARNING: No swap limit support
It needs changing. Stop docker first:
systemctl stop docker
Edit /etc/docker/daemon.json and add:
"exec-opts": ["native.cgroupdriver=systemd"]
Restart docker:
systemctl start docker
Check the cgroup driver again:
# docker info | grep -i cgroup
Cgroup Driver: systemd
It is now systemd.
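The daemon.json edit can be scripted. A sketch that writes a demonstration copy to /tmp (on a real node the file is /etc/docker/daemon.json, and if it already exists you must merge the key into the existing JSON rather than overwrite):

```shell
# Demonstration only: writes /tmp/daemon.json instead of /etc/docker/daemon.json.
cat > /tmp/daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
cat /tmp/daemon.json
```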
(!!! Note: an alternative approach that did NOT work for me:
Edit the kubeadm drop-in file:
vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
and add a new line after the existing Environment entries:
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
or, also pinning the pod infra image to a domestic registry:
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1"
Then restart:
systemctl daemon-reload
systemctl restart kubelet
In practice this method did not work. !!!)
The ERROR NumCPU message:
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...
Cause and fix: kubeadm requires at least 2 CPUs; give the VM two or more cores.
Runtime problems
Check status:
kubectl get pods -n kube-system
Error:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Cause and fix:
The following was not run after kubeadm init:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
After running these, it works.
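Instead of copying admin.conf, the root user can point KUBECONFIG at it directly; a minimal alternative (requires a running cluster, so this is just a command fragment):

```shell
# Alternative for root: use admin.conf in place rather than copying it.
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get pods -n kube-system
```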
The connection to the server 192.168.0.102:6443 was refused - did you specify the right host or port?
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-j7lvd 0/1 CrashLoopBackOff 14 51m
coredns-6955765f44-kmhfc 0/1 CrashLoopBackOff 14 51m
etcd-ubuntu 1/1 Running 0 52m
kube-apiserver-ubuntu 1/1 Running 0 52m
kube-controller-manager-ubuntu 1/1 Running 0 52m
kube-proxy-qlhfs 1/1 Running 0 51m
kube-scheduler-ubuntu 1/1 Running 0 52m
You can also use kubectl get pod --all-namespaces to list pods across all namespaces.
Without a pod network installed, coredns stays in Pending.
Deploy flannel:
kubectl apply -f kube-flannel.yml
Error:
error: unable to recognize "kube-flannel-aliyun-0.11.0.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
Switching to calico gives the same error, so it looks like an apiVersion naming problem.
Fix: use the manifest from the master branch:
https://github.com/coreos/flannel/blob/master/Documentation/kube-flannel.yml
kube-flannel-aliyun.yml uses "extensions/v1beta1" on master and the other tags; the tagged kube-flannel.yml uses it too, but master has since moved off it.
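If you must keep an old manifest, updating the apiVersion by hand can also work: since Kubernetes 1.16, DaemonSet lives in apps/v1 (which additionally requires spec.selector). A sketch of the substitution on a made-up manifest fragment:

```shell
# Sample fragment using the removed API group.
cat > /tmp/ds.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: DaemonSet
EOF
# Swap in the current group. Note: apps/v1 also requires
# spec.selector.matchLabels, so a bare apiVersion swap is not always enough.
sed -i 's|extensions/v1beta1|apps/v1|' /tmp/ds.yaml
cat /tmp/ds.yaml
```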
Before flannel is deployed, coredns logs:
[FATAL] plugin/loop: Loop (127.0.0.1:60825 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 7805087528265218508.4857814245207702505."
After deploying flannel:
1. coredns goes into CrashLoopBackOff and kube-flannel into Init:ImagePullBackOff:
# kubectl logs kube-flannel-ds-amd64-n55rf -n kube-system
Error from server (BadRequest): container "kube-flannel" in pod "kube-flannel-ds-amd64-n55rf" is waiting to start: PodInitializing
Inspect with kubectl describe pod:
# kubectl describe pod kube-flannel-ds-amd64-n55rf -n kube-system
...
Normal Scheduled 13m default-scheduler Successfully assigned kube-system/kube-flannel-ds-amd64-n55rf to ubuntu
Normal Pulling 4m21s (x4 over 13m) kubelet, ubuntu Pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
Warning Failed 3m6s (x4 over 10m) kubelet, ubuntu Failed to pull image "quay.io/coreos/flannel:v0.11.0-amd64": rpc error: code = Unknown desc = context canceled
Warning Failed 3m6s (x4 over 10m) kubelet, ubuntu Error: ErrImagePull
Normal BackOff 2m38s (x7 over 10m) kubelet, ubuntu Back-off pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
Warning Failed 2m27s (x8 over 10m) kubelet, ubuntu Error: ImagePullBackOff
Cause: flannel:v0.11.0-amd64 cannot be pulled; fetch it by other means. Note that the name quay.io/coreos/flannel:v0.11.0-amd64 must match exactly.
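A common workaround for the failed pull above is to pull the image from a reachable registry and retag it to the exact name the manifest expects. The mirror name below is a placeholder, not a verified registry (needs docker, so it is a command fragment only):

```shell
# MIRROR is a placeholder: substitute a registry you can actually reach.
MIRROR=registry.example.com/mirror
docker pull ${MIRROR}/flannel:v0.11.0-amd64
# Retag so kubelet finds it under the name used in kube-flannel.yml.
docker tag ${MIRROR}/flannel:v0.11.0-amd64 quay.io/coreos/flannel:v0.11.0-amd64
```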
Once the flannel image is present, two things can happen:
2. coredns moves to ContainerCreating:
# kubectl logs coredns-6955765f44-4csvn -n kube-system
Error from server (BadRequest): container "coredns" in pod "coredns-6955765f44-r96qk" is waiting to start: ContainerCreating
3. coredns moves to CrashLoopBackOff:
# kubectl logs coredns-6955765f44-4csvn -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:41252 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 1746539958269975925.3391392736060997773."
Look at the details:
# kubectl describe pod coredns-6955765f44-4csvn -n kube-system
Name: coredns-6955765f44-r96qk
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: ubuntu/192.168.0.102
Start Time: Sun, 15 Dec 2019 22:45:15 +0800
Labels: k8s-app=kube-dns
pod-template-hash=6955765f44
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/coredns-6955765f44
Containers:
coredns:
Container ID:
Image: k8s.gcr.io/coredns:1.6.5
Image ID:
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-qq7qf (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-qq7qf:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-qq7qf
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m21s (x3 over 8m32s) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 6m55s default-scheduler Successfully assigned kube-system/coredns-6955765f44-r96qk to ubuntu
Warning FailedCreatePodSandBox 6m52s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9a2d45536097d22cc6b10f338b47f1789869f45f4b12f8a202aa898295dc80a4" network for pod "coredns-6955765f44-r96qk": networkPlugin cni failed to set up pod "coredns-6955765f44-r96qk_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.0.1/24
After installing flannel, delete the broken pod:
kubectl delete pod coredns-6955765f44-4csvn -n kube-system
A replacement pod starts automatically, but the problem remains. ifconfig shows a cni0 interface.
A fix circulating online:
# perform these steps on nodes other than the master
kubeadm reset
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/
ifconfig cni0 down
ifconfig flannel.1 down
ifconfig docker0 down
ip link delete cni0
ip link delete flannel.1
## restart kubelet
systemctl restart kubelet
## restart docker
systemctl restart docker
Tried it: it failed!
Messages from a later deployment attempt:
Warning FailedScheduling 77s (x5 over 5m53s) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 76s default-scheduler Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
Warning FailedCreatePodSandBox 73s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 71s kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Normal SandboxChanged 70s (x2 over 72s) kubelet, ubuntu Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29s (x4 over 69s) kubelet, ubuntu Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
Normal Created 29s (x4 over 69s) kubelet, ubuntu Created container coredns
Normal Started 29s (x4 over 69s) kubelet, ubuntu Started container coredns
Warning BackOff 10s (x9 over 67s) kubelet, ubuntu Back-off restarting failed container
Cause and fix:
Online advice is to pass --pod-network-cidr=10.244.0.0/16 at init time, but that flag was already given. Note: after waiting a short while, /run/flannel/subnet.env gets created on its own.
Warning FailedScheduling 56m (x5 over 60m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 56m default-scheduler Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
Warning FailedCreatePodSandBox 56m kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Warning FailedCreatePodSandBox 55m kubelet, ubuntu Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
Normal SandboxChanged 55m (x2 over 55m) kubelet, ubuntu Pod sandbox changed, it will be killed and re-created.
Normal Pulled 55m (x4 over 55m) kubelet, ubuntu Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
Normal Created 55m (x4 over 55m) kubelet, ubuntu Created container coredns
Normal Started 55m (x4 over 55m) kubelet, ubuntu Started container coredns
Warning BackOff 59s (x270 over 55m) kubelet, ubuntu Back-off restarting failed container
Log output:
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:48100 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 639535139534040434.6569166625322327450."
Cause and fix:
The coredns ConfigMap forwards queries to /etc/resolv.conf, where the nameserver is 127.0.1.1; that creates the loop.
Run:
kubectl edit cm coredns -n kube-system
Delete the loop line, save, and quit (vim).
Then delete every failing coredns pod:
kubectl delete pod coredns-9d85f5447-4jwf2 -n kube-system
The coredns ConfigMap contents:
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
kind: ConfigMap
metadata:
creationTimestamp: "2019-12-21T09:50:31Z"
name: coredns
namespace: kube-system
resourceVersion: "171"
selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
uid: 62485b55-3de6-4dee-b24a-8440052bdb66
Note: in theory, changing the /etc/resolv.conf nameserver to 8.8.8.8 should also work, but hand edits to that file revert to the 127.x address after a reboot, so it does not stick; removing the loop plugin does fix it.
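The interactive edit can also be scripted. A sketch that strips the loop plugin, shown here on a local Corefile fragment (the one-liner against a live cluster, untested, is in the comment):

```shell
# On a live cluster (same idea, untested):
#   kubectl get cm coredns -n kube-system -o yaml | sed '/^[[:space:]]*loop[[:space:]]*$/d' | kubectl apply -f -
# Demonstration on a local Corefile fragment:
cat > /tmp/Corefile <<'EOF'
.:53 {
    errors
    forward . /etc/resolv.conf
    loop
    reload
}
EOF
sed -i '/^[[:space:]]*loop[[:space:]]*$/d' /tmp/Corefile
cat /tmp/Corefile
```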
Failure to join the cluster
[preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
Cause and fix:
My guess is that the worker's hostname was identical to the master's, but I have not verified this.
Collected from the web
WARNING FileExisting-socat
socat is a networking tool that k8s uses for pod data exchange; fix this by installing it:
apt-get install socat
Worker node fails to join
Running kubeadm join on the worker returns a timeout:
root@worker2:~# kubeadm join 192.168.56.11:6443 --token wbryr0.am1n476fgjsno6wa --discovery-token-ca-cert-hash sha256:7640582747efefe7c2d537655e428faa6275dbaff631de37822eb8fd4c054807
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s
Fix: on the master, run kubeadm token create --print-join-command to regenerate the join command, then run its output on the worker node.
The master's tokens expire after 24 hours; list the current tokens with:
kubeadm token list
Create a token that never expires:
kubeadm token create --ttl 0
On the master, the discovery-token-ca-cert-hash value can be computed with:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
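That pipeline extracts the CA public key, DER-encodes it, and takes its SHA-256. It can be exercised against any certificate; a self-contained sketch using a throwaway self-signed cert as a stand-in for /etc/kubernetes/pki/ca.crt:

```shell
# Generate a throwaway cert (stand-in for /etc/kubernetes/pki/ca.crt).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/ca.key -out /tmp/ca.crt \
  -days 1 -subj "/CN=test-ca" 2>/dev/null
# Same pipeline as for the real ca.crt: pubkey -> DER -> sha256 hex.
hash=$(openssl x509 -pubkey -in /tmp/ca.crt \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex | sed 's/^.* //')
echo "sha256:${hash}"
```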
Rejoin the node:
kubeadm join 192.168.124.195:6443 --token 8xwg8u.lkj382k9ox58qkw9 \
--discovery-token-ca-cert-hash sha256:86291bed442dd1dcd6c26f2213208e10cab0f87763f44e0edf01fa670cd9e8b