Root-Cause Diagnosis of an Ingress Failure in a Kubernetes Cluster

Author: scwang18, responsible for technical architecture, with considerable experience in the container cloud space.

Foreword

KubeSphere is QingCloud's open-source, Kubernetes-based cloud-native distributed operating system. It provides a rather slick management UI for Kubernetes clusters, and our team uses KubeSphere as its development platform.

This article records how a network failure in a KubeSphere environment was tracked down and resolved.

Symptoms

A developer reported that the Harbor registry he had set up kept misbehaving: it occasionally returned net/http: TLS handshake timeout, and accessing harbor.xxxx.cn via curl would also hang randomly and frequently. ping, however, reported everything as normal.
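
For reference, the symptom could be reproduced from outside the cluster with checks along these lines (harbor.xxxx.cn is the placeholder hostname from the report):
# the HTTPS request hangs or times out intermittently
curl -v --max-time 10 https://harbor.xxxx.cn/v2/
# while ICMP to the same host stays consistently healthy
ping -c 4 harbor.xxxx.cn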

Root Cause Analysis

After the issue was reported, it took several rounds of analysis to finally pin down the cause: when installing KubeSphere, we had used the then-latest Kubernetes 1.23.1.

Although ./kk version --show-supported-k8s shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, that support is in fact only experimental, and it has pitfalls.

The analysis went as follows:

  1. When the Harbor registry access problem appeared, my first instinct was that the Harbor deployment itself was broken. But when I checked the Harbor core logs, there were no error messages matching the failures; there was not even info-level output at those times.

  2. I then turned to the Harbor portal and checked its access logs; again, nothing abnormal.

  3. Following the access chain, I moved on to kubesphere-router-kubesphere-system, i.e. KubeSphere's build of the nginx ingress controller; once more, no abnormal logs.

  4. I tried accessing Harbor's in-cluster Service address from other Pods inside the cluster and found no timeouts at all. Preliminary conclusion: the problem lay with KubeSphere's bundled Ingress.

  5. I shut down KubeSphere's bundled Ingress Controller and installed the ingress-nginx-controller version recommended upstream by Kubernetes. The failure persisted, and the Ingress logs still showed nothing abnormal.

  6. Putting all of the above together, the problem had to sit between the client and the Ingress Controller. My Ingress Controller is exposed outside the cluster via NodePort, so I tested other Services exposed via NodePort and hit exactly the same failure. That ruled out the Harbor deployment completely; the issue was almost certainly between the client and the Ingress Controller.

  7. When an external client reaches the Ingress Controller through a NodePort, the traffic passes through the kube-proxy component. Analyzing the kube-proxy logs turned up the following warning:

can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1

This warning appears because the kernel of my CentOS 7.6 hosts is too old (currently 3.10.0-1160.21.1.el7.x86_64) and has a compatibility problem with the IPVS mode of recent Kubernetes releases.
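
A quick way to confirm this on an affected cluster, assuming kube-proxy runs as the usual DaemonSet labeled k8s-app=kube-proxy:
# kernel version on the node; IPVS-mode kube-proxy warns on anything below 4.1
uname -r
# scan kube-proxy logs for the conn_reuse_mode warning
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep conn_reuse_mode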

This can be resolved by upgrading the operating system kernel.

  8. After upgrading the kernel, Calico failed to start, reporting the following error:
ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

The cause: the Calico version that KubeSphere installs by default is v3.20.0, which does not support the newest Linux kernels. The upgraded kernel is 5.18.1-1.el7.elrepo.x86_64, which requires Calico v3.23.0 or later.
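
A quick way to see which Calico images the cluster is currently running:
kubectl -n kube-system get ds calico-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}'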

  9. After upgrading Calico, it continued to error:
user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

There was one other error message of the same kind; both stem from insufficient resource permissions in the clusterrole, and editing the clusterrole fixes them, as the check below illustrates.
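
The missing permission can be confirmed with kubectl auth can-i, impersonating Calico's service account; for the error above:
kubectl auth can-i list caliconodestatuses.crd.projectcalico.org \
  --as=system:serviceaccount:kube-system:calico-node
# prints "no" until the clusterrole is fixed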

  10. With that, the baffling network problem was solved.

Resolution

Based on the analysis above, the fix consists of the following steps:

Upgrade the operating system kernel

  1. Switch to Alibaba Cloud's yum mirror
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update
  2. Enable the elrepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
  3. Install the latest mainline kernel
yum --enablerepo=elrepo-kernel install kernel-ml
  4. List all kernels available on the system
awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
  5. Set the new kernel as the grub2 default

Check the list of available kernels returned in step 4; barring surprises, the first entry (index 0) should be the newly installed kernel.

grub2-set-default 0
  6. Regenerate the grub configuration and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
  7. Verify
uname -r
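
Beyond uname -r, it is worth confirming that the IPVS sysctl now exists and that the earlier kube-proxy warning is gone; a small check, assuming the ip_vs kernel module is loaded:
# present only on kernels >= 4.1; kube-proxy in IPVS mode manages this value
sysctl net.ipv4.vs.conn_reuse_mode
# the warning from the analysis above should no longer appear
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep conn_reuse_mode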

Upgrade Calico

On Kubernetes, Calico is usually deployed as a DaemonSet; in my cluster the Calico DaemonSet is named calico-node.

Dump it to a yaml file, change every image version number in the file to the latest v3.23.1, then recreate the DaemonSet (the commands for this step are sketched after the manifest below).

  1. Dump the yaml
kubectl -n kube-system get ds  calico-node -o yaml>calico-node.yaml
  2. calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
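
A minimal sketch of the bump-and-recreate step described above, assuming the freshly exported file still references the old v3.20.0 images (the manifest above already shows the result, with server-populated fields such as status and resourceVersion stripped):
# rewrite every image tag from v3.20.0 to v3.23.1 in place
sed -i 's#:v3\.20\.0#:v3.23.1#g' calico-node.yaml
# recreate the DaemonSet from the edited manifest
kubectl -n kube-system delete ds calico-node
kubectl apply -f calico-node.yaml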

ClusterRole

The ClusterRole must also be modified, or Calico will keep reporting permission errors (the apply step is sketched after the role manifest below).

  1. Dump the yaml
kubectl get clusterrole calico-node -o yaml >calico-node-clusterrole.yaml
  2. calico-node-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
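
To push the updated role and have the running Calico pods pick up the new permissions, something like:
kubectl apply -f calico-node-clusterrole.yaml
kubectl -n kube-system rollout restart ds/calico-node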

Summary

In the end, this strange network failure came down to a mismatch between the KubeSphere and Kubernetes versions. In a working environment, stability should come first: don't rush to adopt the latest releases, or you will spend a great deal of time untangling inexplicable problems.

