My k8s Notes: Kubernetes Deployment - Troubleshooting

This post collects problems I ran into while deploying a k8s cluster. Environments differ and my experience is limited, so treat it as a reference only.
Note: this post is updated from time to time.

Repository and key problems

Using the USTC mirror (inside China):

cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial main
EOF

Update:

apt-get update

But it errors out:

Ign:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
Get:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages [31.3 kB]
Err:7 http://mirrors.ustc.edu.cn/kubernetes/apt kubernetes-xenial/main amd64 Packages
  Hash Sum mismatch
Fetched 38.9 kB in 1s (20.2 kB/s)                            
Reading package lists... Done
E: Failed to fetch http://mirrors.ustc.edu.cn/kubernetes/apt/dists/kubernetes-xenial/main/binary-amd64/Packages.gz  Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.

Cause and fix:
Add the key:

gpg --keyserver keyserver.ubuntu.com --recv-keys 6A030B21BA07F4FB
gpg --export --armor 6A030B21BA07F4FB | sudo apt-key add -

Result: failure.

Using the official k8s repository (hosted outside China):

cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF

Running apt-get update just hangs, then fails.

Using the Aliyun mirror:

cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF

Add the key:

curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

If that fails, first download https://packages.cloud.google.com/apt/doc/apt-key.gpg by some other means and save it to the current directory.
Then run:

cat apt-key.gpg | sudo apt-key add -

After that, apt-get update succeeds.
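
If the key URL cannot be reached directly, one option (a sketch assuming a local SOCKS proxy is available; the address is illustrative) is:

# fetch the key through a proxy, then feed it to apt-key as above
curl -x socks5://127.0.0.1:1080 -fsSLo apt-key.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg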

Without adding the key, updating against the Aliyun mirror errors out:

W: GPG error: https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 6A030B21BA07F4FB
W: The repository 'https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Querying the matching k8s configuration packages

W1214 08:46:14.303158    8461 version.go:101] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get https://dl.k8s.io/release/stable-1.txt: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
W1214 08:46:14.303772    8461 version.go:102] falling back to the local client version: v1.17.0
W1214 08:46:14.304223    8461 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1214 08:46:14.304609    8461 validation.go:28] Cannot validate kubelet config - no validator is available

Cause and fix:
The outside network is unreachable. This can be ignored: kubeadm falls back to the local client's default version (v1.17.0 in the log above).
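
To skip the online lookup entirely, the version can be pinned at init time; a minimal sketch (v1.17.0 matches the local client version in the log above):

# pin the release so kubeadm does not query dl.k8s.io
kubeadm init --kubernetes-version=v1.17.0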

Script execution

pullk8s.sh: 3: pullk8s.sh: Syntax error: "(" unexpected

Cause and fix:
The script must start with #!/bin/bash. Without it, /bin/sh (dash on Ubuntu) interprets the script and rejects bash-only syntax; either add the shebang or run it with bash pullk8s.sh.
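
A minimal reproduction of the symptom (the image list is illustrative): dash, which /bin/sh points to on Ubuntu, rejects bash array syntax at the "(":

# demo.sh - runs under bash, fails under sh/dash with: Syntax error: "(" unexpected
images=(kube-apiserver kube-controller-manager kube-scheduler)
echo "${images[@]}"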

Initializing the environment: kubeadm init

Message:

[ERROR Swap]: running with swap on is not supported. Please disable swap

Cause and fix:
Running with swap on is not supported; swap must be disabled.
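
To turn swap off immediately and keep it off across reboots (the sed pattern assumes the fstab entry mentions "swap"):

swapoff -a
sed -i '/swap/ s/^/#/' /etc/fstab   # comment out the swap mount line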

Message:

 [ERROR Port-10250]: Port 10250 is in use

kubelet is already running and holding the port; stop it: systemctl stop kubelet
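
To confirm what is actually holding the port before stopping anything:

ss -lnpt | grep 10250   # should show kubelet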

Message: WARNING IsDockerSystemdCheck

[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
        [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause and fix:
Docker uses the cgroupfs driver, which does not match the systemd driver k8s recommends. Check first:

# docker info | grep -i cgroup
Cgroup Driver: cgroupfs   <-- note: cgroupfs
WARNING: No swap limit support

This needs changing. Stop docker first:

systemctl stop docker

Edit /etc/docker/daemon.json and add:

"exec-opts": ["native.cgroupdriver=systemd"]

Restart docker:

systemctl start docker

Check the cgroup driver again:

# docker info | grep -i cgroup
Cgroup Driver: systemd

Changed.
(Note: an alternative found online is to modify the kubeadm drop-in configuration for the kubelet instead:

vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

and add a new line after the existing Environment entries:

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"

Another variant that also pins the pod-infra (pause) image source:

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1"

Reload and restart:

systemctl daemon-reload
systemctl restart kubelet

In my testing, this approach did not work.)

Message: ERROR NumCPU

error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR NumCPU]: the number of available CPUs 1 is less than the required 2
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...

Cause and fix: at least two CPU cores are required; give the VM two or more cores.
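
If resizing the VM is not an option, the check can be downgraded to a warning as the message itself suggests (not recommended beyond test setups):

kubeadm init --ignore-preflight-errors=NumCPU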

Runtime

Check status:

kubectl get pods -n kube-system

Error:

The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cause and fix:
The following had not been run:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

After running these, it works.
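
For the root user there is also a copy-free alternative:

export KUBECONFIG=/etc/kubernetes/admin.conf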

A similar refusal can also appear against the API server address itself:

The connection to the server 192.168.0.102:6443 was refused - did you specify the right host or port?

When the connection works, the listing looks like this (note coredns crash-looping; more on that below):

# kubectl get pods -n kube-system
NAME                             READY   STATUS             RESTARTS   AGE
coredns-6955765f44-j7lvd         0/1     CrashLoopBackOff   14         51m
coredns-6955765f44-kmhfc         0/1     CrashLoopBackOff   14         51m
etcd-ubuntu                      1/1     Running            0          52m
kube-apiserver-ubuntu            1/1     Running            0          52m
kube-controller-manager-ubuntu   1/1     Running            0          52m
kube-proxy-qlhfs                 1/1     Running            0          51m
kube-scheduler-ubuntu            1/1     Running            0          52m

You can also use kubectl get pod --all-namespaces to view all namespaces.

Without a pod network configured, coredns stays in Pending.
Deploy flannel:

kubectl apply -f kube-flannel.yml 

Message:

error: unable to recognize "kube-flannel-aliyun-0.11.0.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"

Switching to calico gives the same error, so it is an API-version problem: DaemonSet was removed from extensions/v1beta1 in Kubernetes 1.16 and must now be declared as apps/v1.
Fix: use the manifest from the master branch:
https://github.com/coreos/flannel/blob/master/Documentation/kube-flannel.yml

In kube-flannel-aliyun.yml, both master and the tagged versions still use "extensions/v1beta1"; in kube-flannel.yml the tagged versions use it, while master has been updated.
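
To check which API group a downloaded manifest uses, and as a stopgap rewrite it in place (a sketch; apps/v1 also requires a spec.selector, so preferring the updated upstream manifest is safer):

grep -n apiVersion kube-flannel.yml
sed -i 's#extensions/v1beta1#apps/v1#g' kube-flannel.yml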

Before deploying flannel, coredns logged:

[FATAL] plugin/loop: Loop (127.0.0.1:60825 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 7805087528265218508.4857814245207702505."

After deploying flannel:

1. coredns went into CrashLoopBackOff and kube-flannel into Init:ImagePullBackOff:

# kubectl logs kube-flannel-ds-amd64-n55rf -n kube-system
Error from server (BadRequest): container "kube-flannel" in pod "kube-flannel-ds-amd64-n55rf" is waiting to start: PodInitializing

Inspect with kubectl describe pod:

# kubectl describe pod kube-flannel-ds-amd64-n55rf -n kube-system
...
  Normal   Scheduled  13m                  default-scheduler  Successfully assigned kube-system/kube-flannel-ds-amd64-n55rf to ubuntu
  Normal   Pulling    4m21s (x4 over 13m)  kubelet, ubuntu    Pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
  Warning  Failed     3m6s (x4 over 10m)   kubelet, ubuntu    Failed to pull image "quay.io/coreos/flannel:v0.11.0-amd64": rpc error: code = Unknown desc = context canceled
  Warning  Failed     3m6s (x4 over 10m)   kubelet, ubuntu    Error: ErrImagePull
  Normal   BackOff    2m38s (x7 over 10m)  kubelet, ubuntu    Back-off pulling image "quay.io/coreos/flannel:v0.11.0-amd64"
  Warning  Failed     2m27s (x8 over 10m)  kubelet, ubuntu    Error: ImagePullBackOff

Cause: flannel:v0.11.0-amd64 cannot be pulled from here; download it some other way. Note that the final image name must be exactly quay.io/coreos/flannel:v0.11.0-amd64.
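
One workaround is to pull the image from a reachable registry and retag it to the exact name the manifest expects; a sketch, with the mirror address as an assumption to be substituted:

docker pull quay-mirror.qiniu.com/coreos/flannel:v0.11.0-amd64
docker tag quay-mirror.qiniu.com/coreos/flannel:v0.11.0-amd64 quay.io/coreos/flannel:v0.11.0-amd64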

After downloading flannel, there are two possible outcomes:

2. coredns goes into ContainerCreating:

# kubectl logs coredns-6955765f44-4csvn -n kube-system
Error from server (BadRequest): container "coredns" in pod "coredns-6955765f44-r96qk" is waiting to start: ContainerCreating

3. coredns goes into CrashLoopBackOff:

# kubectl logs coredns-6955765f44-4csvn -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:41252 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 1746539958269975925.3391392736060997773."

Inspect the details:

# kubectl describe pod coredns-6955765f44-4csvn -n kube-system 
Name:                 coredns-6955765f44-r96qk
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ubuntu/192.168.0.102
Start Time:           Sun, 15 Dec 2019 22:45:15 +0800
Labels:               k8s-app=kube-dns
                      pod-template-hash=6955765f44
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/coredns-6955765f44
Containers:
  coredns:
    Container ID:  
    Image:         k8s.gcr.io/coredns:1.6.5
    Image ID:      
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-qq7qf (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-qq7qf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-qq7qf
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                     From               Message
  ----     ------                  ----                    ----               -------
  Warning  FailedScheduling        7m21s (x3 over 8m32s)   default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled               6m55s                   default-scheduler  Successfully assigned kube-system/coredns-6955765f44-r96qk to ubuntu
  Warning  FailedCreatePodSandBox  6m52s                   kubelet, ubuntu    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9a2d45536097d22cc6b10f338b47f1789869f45f4b12f8a202aa898295dc80a4" network for pod "coredns-6955765f44-r96qk": networkPlugin cni failed to set up pod "coredns-6955765f44-r96qk_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.0.1/24

After installing flannel, delete the problematic pod:

kubectl delete pod coredns-6955765f44-4csvn -n kube-system

A new pod starts automatically, but the problem remains. Checking ifconfig shows a cni0 interface.
A fix circulating online:

# run these on nodes other than the master
kubeadm reset
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/
ifconfig cni0 down
ifconfig flannel.1 down
ifconfig docker0 down
ip link delete cni0
ip link delete flannel.1
## restart kubelet
systemctl restart kubelet
## restart docker
systemctl restart docker

Tried it: failed!

Messages from another deployment attempt:

  Warning  FailedScheduling        77s (x5 over 5m53s)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled               76s                  default-scheduler  Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
  Warning  FailedCreatePodSandBox  73s                  kubelet, ubuntu    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  71s                  kubelet, ubuntu    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          70s (x2 over 72s)    kubelet, ubuntu    Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  29s (x4 over 69s)    kubelet, ubuntu    Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
  Normal   Created                 29s (x4 over 69s)    kubelet, ubuntu    Created container coredns
  Normal   Started                 29s (x4 over 69s)    kubelet, ubuntu    Started container coredns
  Warning  BackOff                 10s (x9 over 67s)    kubelet, ubuntu    Back-off restarting failed container

Cause and fix:
Advice online says to pass --pod-network-cidr=10.244.0.0/16 at init time, but that had already been done. Note: /run/flannel/subnet.env is generated a short while after the flannel pod starts, so waiting a moment resolves the missing-file errors.
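
Once flannel is running, the file can be checked directly; the contents below are what it typically holds for the 10.244.0.0/16 pod network:

cat /run/flannel/subnet.env
# FLANNEL_NETWORK=10.244.0.0/16
# FLANNEL_SUBNET=10.244.0.1/24
# FLANNEL_MTU=1450
# FLANNEL_IPMASQ=true

After the file appeared the sandbox errors stopped, but the pod kept crash-looping; its events about an hour later: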

  Warning  FailedScheduling        56m (x5 over 60m)    default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled               56m                  default-scheduler  Successfully assigned kube-system/coredns-9d85f5447-4jwf2 to ubuntu
  Warning  FailedCreatePodSandBox  56m                  kubelet, ubuntu    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5c109baa51b8d97e75c6b35edf108ca4f2f56680b629140c8b477b9a8a03d97c" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  55m                  kubelet, ubuntu    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3f8c5b704fb1dc4584a2903b2ecff329e717e5c2558c9f761501fab909d32133" network for pod "coredns-9d85f5447-4jwf2": networkPlugin cni failed to set up pod "coredns-9d85f5447-4jwf2_kube-system" network: open /run/flannel/subnet.env: no such file or directory
  Normal   SandboxChanged          55m (x2 over 55m)    kubelet, ubuntu    Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  55m (x4 over 55m)    kubelet, ubuntu    Container image "registry.aliyuncs.com/google_containers/coredns:1.6.5" already present on machine
  Normal   Created                 55m (x4 over 55m)    kubelet, ubuntu    Created container coredns
  Normal   Started                 55m (x4 over 55m)    kubelet, ubuntu    Started container coredns
  Warning  BackOff                 59s (x270 over 55m)  kubelet, ubuntu    Back-off restarting failed container

Log output:

.:53
[INFO] plugin/reload: Running configuration MD5 = 4e235fcc3696966e76816bcd9034ebc7
CoreDNS-1.6.5
linux/amd64, go1.13.4, c2fd1b2
[FATAL] plugin/loop: Loop (127.0.0.1:48100 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 639535139534040434.6569166625322327450."

Cause and fix:
The Corefile in the ConfigMap forwards to /etc/resolv.conf, where the DNS server is 127.0.1.1; that is what creates the loop.
Run:

kubectl edit cm coredns -n kube-system

Delete the loop line, save, and quit (the editor is vim).
Then delete all the affected coredns pods:

kubectl delete pod coredns-9d85f5447-4jwf2 -n kube-system
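
Since the coredns pods carry the k8s-app=kube-dns label (visible in the describe output above), they can all be deleted in one command; the ReplicaSet recreates them:

kubectl delete pod -n kube-system -l k8s-app=kube-dns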

The coredns ConfigMap content is as follows:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-12-21T09:50:31Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "171"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 62485b55-3de6-4dee-b24a-8440052bdb66

Note: in theory, changing the DNS in /etc/resolv.conf to 8.8.8.8 should also work, but the file is auto-generated by the local resolver service and reverts to the 127.x address after a reboot, so manual edits do not stick. Deleting the loop plugin does solve the problem.
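
The troubleshooting page linked in the error also suggests pointing the kubelet at the real upstream resolv.conf instead of the 127.x stub; a sketch for deb-installed kubelets on hosts running systemd-resolved, which I have not verified on this cluster:

echo 'KUBELET_EXTRA_ARGS="--resolv-conf=/run/systemd/resolve/resolv.conf"' >> /etc/default/kubelet
systemctl restart kubelet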

Failing to join the cluster

[preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.

Cause and fix:
My guess is that the worker's hostname was identical to the master's, but this is unverified.
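
If the duplicate-hostname guess is right, giving the worker a unique name before joining should avoid it (the name below is illustrative):

hostnamectl set-hostname worker1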

Collected from around the web

WARNING FileExisting-socat
socat is a networking utility that k8s uses to relay pod traffic (e.g. for port forwarding); installing it clears the warning:

apt-get install socat

Worker node fails to join
Running kubeadm join on a worker node returns a timeout:

root@worker2:~# kubeadm join 192.168.56.11:6443 --token wbryr0.am1n476fgjsno6wa --discovery-token-ca-cert-hash sha256:7640582747efefe7c2d537655e428faa6275dbaff631de37822eb8fd4c054807
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s

On the master, run kubeadm token create --print-join-command to regenerate the join command, then run the printed command on the worker node.

Tokens on the master expire after 24 hours. List the current tokens:

kubeadm token list

Generate a new one with kubeadm token create, or create a token that never expires:

kubeadm token create --ttl 0

On the master, this command computes the discovery-token-ca-cert-hash value:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

Rejoin the node:

kubeadm join 192.168.124.195:6443 --token 8xwg8u.lkj382k9ox58qkw9 \
--discovery-token-ca-cert-hash sha256:86291bed442dd1dcd6c26f2213208e10cab0f87763f44e0edf01fa670cd9e8b