Analysis of kube-proxy operating modes

kube-proxy supports two modes, iptables and ipvs. The ipvs mode was introduced in Kubernetes v1.8, reached beta in v1.9, and became generally available in v1.11. Support for the iptables mode was added as early as v1.1, and iptables has been kube-proxy's default mode since v1.2. Both ipvs and iptables are built on netfilter, so what are the differences between the two modes?

  • ipvs offers better scalability and performance for large clusters
  • ipvs supports more sophisticated load-balancing algorithms than iptables (least load, least connections, weighted, and so on)
  • ipvs supports server health checks, connection retries, and similar features
  • ipset sets can be modified dynamically, even while iptables rules are referencing them

ipvs depends on iptables

ipvs by itself cannot do packet filtering, SNAT, or masquerading, so in some scenarios (such as implementing NodePort) it still has to work together with iptables. ipvs uses ipset to store the source or destination addresses of traffic that needs to be dropped or masqueraded, which keeps the number of iptables rules constant. Suppose we want to block tens of thousands of IPs from reaching our servers: with plain iptables we would have to add a rule for each of them, generating an enormous rule set; with ipset we only need to add the relevant IP addresses (or networks) to an ipset set, and a handful of iptables rules is enough to achieve the same goal.
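
A minimal sketch of the same idea outside of Kubernetes (the set name and addresses below are made up for illustration): a single iptables rule references the set, and the set can keep growing without adding more rules.

[root@k8s-m1 ~]# ipset create blacklist hash:net
[root@k8s-m1 ~]# ipset add blacklist 10.170.0.0/16
[root@k8s-m1 ~]# ipset add blacklist 192.0.2.15
[root@k8s-m1 ~]# iptables -A INPUT -m set --match-set blacklist src -j DROP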

Using ipvs mode in kube-proxy

Install the dependency packages on every machine:

[root@k8s-m1 ~]# yum install ipvsadm ipset sysstat conntrack libseccomp -y

On every machine, pick the kernel modules that should be loaded at boot. The modules below are the ones ipvs mode needs; set them to load automatically on startup:

[root@k8s-m1 ~]# :> /etc/modules-load.d/ipvs.conf
module=(
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
br_netfilter
  )
for kernel_module in ${module[@]};do
    /sbin/modinfo -F filename $kernel_module |& grep -qv ERROR && echo $kernel_module >> /etc/modules-load.d/ipvs.conf || :
done
systemctl enable --now systemd-modules-load.service

If the systemctl enable command above fails, run systemctl status -l systemd-modules-load.service to see which kernel module cannot be loaded, comment it out in /etc/modules-load.d/ipvs.conf, and enable the service again.
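
Once the service has run, you can confirm the modules are actually loaded with something like:

[root@k8s-m1 ~]# lsmod | grep -e ip_vs -e nf_conntrack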

All machines also need the kernel parameters in /etc/sysctl.d/k8s.conf set as follows.

[root@k8s-m1 ~]# cat <<EOF > /etc/sysctl.d/k8s.conf
# https://github.com/moby/moby/issues/31208 
# ipvsadm -l --timeout
# works around timeouts on long-lived connections in ipvs mode; any value below 900 is fine
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
net.ipv4.ip_forward = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
# pass bridged traffic to the iptables/ip6tables/arptables chains (required by Kubernetes)
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1
net.netfilter.nf_conntrack_max = 2310720
fs.inotify.max_user_watches=89100
fs.may_detach_mounts = 1
fs.file-max = 52706963
fs.nr_open = 52706963
vm.swappiness = 0
vm.overcommit_memory=1
vm.panic_on_oom=0
EOF

[root@k8s-m1 ~]# sysctl --system

Edit the kube-proxy configuration file and set mode to ipvs:

hostnameOverride: k8s-m1
iptables:
    masqueradeAll: true
    masqueradeBit: 14
    minSyncPeriod: 0s
    syncPeriod: 30s
ipvs:
    excludeCIDRs: null
    minSyncPeriod: 0s
    scheduler: ""
    syncPeriod: 30s
kind: KubeProxyConfiguration
metricsBindAddress: 192.168.0.200:10249
mode: "ipvs"
nodePortAddresses: null
oomScoreAdj: -999
portRange: ""
resourceContainer: /kube-proxy
udpIdleTimeout: 250ms
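
One way to apply the change, assuming kube-proxy runs as a kubeadm-style DaemonSet whose ConfigMap is named kube-proxy and whose pods carry the k8s-app=kube-proxy label (adjust to your cluster), is to edit the ConfigMap and recreate the pods; a sketch:

[root@k8s-m1 ~]# kubectl -n kube-system edit configmap kube-proxy        # set mode: "ipvs"
[root@k8s-m1 ~]# kubectl -n kube-system delete pod -l k8s-app=kube-proxy # the DaemonSet recreates the pods with the new mode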

When a ClusterIP type Service is created, the IPVS proxier performs the following three actions:

  • make sure a dummy interface exists on the node, kube-ipvs0 by default
  • bind the Service IP addresses to that dummy interface
  • create an IPVS virtual server for each Service IP address

Here is an example:

[root@k8s-m1 ~]# kubectl describe svc tomcat-service
Name:              tomcat-service
Namespace:         default
Labels:            <none>
Annotations:       kubectl.kubernetes.io/last-applied-configuration:
                     {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"tomcat-service","namespace":"default"},"spec":{"ports":[{"port":8...
Selector:          app=tomcat
Type:              ClusterIP
IP:                10.106.88.77
Port:              <unset>  8080/TCP
TargetPort:        8080/TCP
Endpoints:         10.244.0.48:8080
Session Affinity:  None
Events:            <none>
[root@k8s-m1 ~]# ip -4 a
8: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    inet 10.96.0.10/32 brd 10.96.0.10 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.101.68.42/32 brd 10.101.68.42 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.0.1/32 brd 10.96.0.1 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.107.7.203/32 brd 10.107.7.203 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.106.88.77/32 brd 10.106.88.77 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.98.230.124/32 brd 10.98.230.124 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.103.49.63/32 brd 10.103.49.63 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
[root@k8s-m1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:30024 rr
  -> 10.244.4.21:3000             Masq    1      0          0
TCP  192.168.0.200:30024 rr
  -> 10.244.4.21:3000             Masq    1      0          0
TCP  192.168.0.200:30040 rr
  -> 10.244.4.28:9090             Masq    1      0          0
TCP  10.96.0.1:443 rr
  -> 192.168.0.200:6443           Masq    1      0          0
  -> 192.168.0.201:6443           Masq    1      1          0
  -> 192.168.0.202:6443           Masq    1      0          0

Deleting a Kubernetes Service triggers deletion of the corresponding IPVS virtual server, its IPVS real servers, and the IP address bound to the dummy interface.

Port mapping:

IPVS has three forwarding modes: NAT (masq), IPIP, and DR. Only NAT mode supports port mapping, so kube-proxy relies on NAT mode for it. The following example shows an IPVS mapping from Service port 8080 to Pod port 80.

TCP  10.107.7.203:8080 rr
  -> 10.244.4.14:80               Masq    1      0          0
  -> 10.244.4.15:80               Masq    1      0          0
  -> 10.244.4.16:80               Masq    1      0          0
  -> 10.244.4.20:80               Masq    1      0          0
  -> 10.244.4.22:80               Masq    1      0          0
  -> 10.244.4.23:80               Masq    1      0          0
  -> 10.244.4.24:80               Masq    1      0          0

Session affinity:

IPVS supports client-IP session affinity (persistent connections). When a Service specifies session affinity, the IPVS proxier sets a timeout value on the IPVS virtual server (180 minutes = 10800 seconds by default). For example:

[root@k8s-m1 ~]# kubectl describe svc nginx-service
Name:           nginx-service
...
IP:             10.102.128.4
Port:           http    3080/TCP
Session Affinity:   ClientIP

[root@k8s-m1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.102.128.4:3080 rr persistent 10800
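
A Service producing the persistent entry above could look roughly like the sketch below (selector, target port and other details are hypothetical); the timeout can be tuned through sessionAffinityConfig:

[root@k8s-m1 ~]# kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - name: http
    port: 3080
    targetPort: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
EOF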

The IPVS proxier falls back on iptables in the following five situations:

  • kube-proxy is started with --masquerade-all=true
  • kube-proxy is started with --cluster-cidr=<cidr>
  • Services of type LoadBalancer
  • Services of type NodePort
  • Services with externalIPs specified

kube-proxy flag --masquerade-all=true

If kube-proxy is configured with --masquerade-all=true, ipvs masquerades all traffic destined for a Service Cluster IP, which matches the behaviour of iptables mode. The iptables rules added by ipvs look like this:


[root@k8s-m1 ~]# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Specifying the cluster CIDR when starting kube-proxy

If kube-proxy is configured with --cluster-cidr=<cidr>, ipvs masquerades off-cluster traffic destined for a Service Cluster IP, again matching the iptables behaviour. Assuming kube-proxy is given the cluster CIDR 10.244.16.0/24, the iptables rules added by ipvs would look like this:

[root@k8s-m1 ~]# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (3 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  -- !10.244.16.0/24       0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Services of type LoadBalancer

For a LoadBalancer type Service, ipvs installs iptables rules that match the KUBE-LOAD-BALANCER ipset. In particular, when the Service specifies LoadBalancerSourceRanges or sets externalTrafficPolicy=Local, ipvs creates the ipset sets KUBE-LOAD-BALANCER-LOCAL / KUBE-LOAD-BALANCER-FW / KUBE-LOAD-BALANCER-SOURCE-CIDR and adds the corresponding iptables rules, as shown below:


# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-FIREWALL (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER-SOURCE-CIDR dst,dst,src
KUBE-MARK-DROP  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-LOAD-BALANCER (1 references)
target     prot opt source               destination
KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER-FW dst,dst
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER-LOCAL dst,dst
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-MARK-DROP (1 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x8000

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-LOAD-BALANCER  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER dst,dst

Services of type NodePort

For a NodePort type Service, ipvs adds iptables rules that match the KUBE-NODE-PORT-TCP / KUBE-NODE-PORT-UDP ipsets. When externalTrafficPolicy=Local is specified, ipvs creates the ipset sets KUBE-NODE-PORT-LOCAL-TCP / KUBE-NODE-PORT-LOCAL-UDP and installs the corresponding iptables rules, as shown below (assuming the Service uses a TCP nodePort):

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-NODE-PORT (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-NODE-PORT-LOCAL-TCP dst
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-NODE-PORT-TCP dst
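
You can inspect the corresponding ipset directly; assuming at least one TCP NodePort Service exists, its node ports show up as members of the set:

# ipset list KUBE-NODE-PORT-TCP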

Services with externalIPs specified

For a Service with externalIPs specified, ipvs installs iptables rules that match the KUBE-EXTERNAL-IP ipset. Assuming we have such a Service, the iptables rules would look like this:

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst PHYSDEV match ! --physdev-is-in ADDRTYPE match src-type !LOCAL
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst ADDRTYPE match dst-type LOCAL

Packet flow in IPVS mode

Inbound traffic

Inbound traffic is traffic that reaches a Service from outside the cluster.
Its iptables chain path is PREROUTING@nat -> INPUT@nat.

ClusterIP

At the PREROUTING stage, traffic jumps to the KUBE-SERVICES chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

The KUBE-SERVICES chain looks like this:

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  -- !10.244.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Traffic to a ClusterIP Service is handed to KUBE-MARK-MASQ, whose match rule references the in-kernel ipset named KUBE-CLUSTER-IP (packets whose source address is not in 10.244.0.0/16 are passed to KUBE-MARK-MASQ).

The next step is to mark those packets:

Chain KUBE-MARK-MASQ (3 references)
target     prot opt source               destination         
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

At this point the packet has finished its journey through iptables; there is no subsequent DNAT to a backend endpoint here, because that work is handed over to IPVS. (This step merely puts the 0x4000 mark on packets matching the KUBE-CLUSTER-IP ipset; packets carrying this mark are later MASQUERADEd uniformly in the KUBE-POSTROUTING chain.)
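
Running iptables with -v shows the packet counters on these rules, which is a quick way to confirm that ClusterIP traffic really is only being marked here rather than DNAT'ed by iptables:

[root@k8s-n-1920168091021 overlord]# iptables -t nat -L KUBE-MARK-MASQ -n -v
[root@k8s-n-1920168091021 overlord]# iptables -t nat -L KUBE-POSTROUTING -n -v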

Checking the ipvs proxy rules

You can use the ipvsadm tool to check whether kube-proxy is maintaining the correct ipvs rules. For example, suppose the cluster has the following Services:

# kubectl get svc --all-namespaces
NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
default       kubernetes   ClusterIP   10.0.0.1     <none>        443/TCP         1d
kube-system   kube-dns     ClusterIP   10.0.0.10    <none>        53/UDP,53/TCP   1d

We should then see ipvs proxy rules like these:

 # ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.1:443 rr persistent 10800
  -> 192.168.0.1:6443             Masq    1      1          0
TCP  10.0.0.10:53 rr
  -> 172.17.0.2:53                Masq    1      0          0
UDP  10.0.0.10:53 rr
  -> 172.17.0.2:53                Masq    1      0          0

Outbound traffic

Outbound traffic is traffic from Pods inside the cluster to a Service.
Its iptables chain path is OUTPUT@nat -> POSTROUTING@nat.
The OUTPUT chain is shown below; just as with inbound traffic, everything jumps to the KUBE-SERVICES chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
DOCKER     all  --  0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

What follows is the same as for inbound traffic: whether it is a ClusterIP or a NodePort Service, the packets get the 0x4000 mark. The difference is that the inbound iptables path ends there, whereas outbound traffic still has to pass through the POSTROUTING chain of the nat table, defined as follows:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  172.17.0.0/16        0.0.0.0/0           
MASQUERADE  all  --  233.233.5.0/24       0.0.0.0/0           
RETURN     all  --  10.244.0.0/16        10.244.0.0/16       
MASQUERADE  all  --  10.244.0.0/16       !224.0.0.0/4         
RETURN     all  -- !10.244.0.0/16        10.244.16.0/24      
MASQUERADE  all  -- !10.244.0.0/16        10.244.0.0/16       

From there the traffic jumps to the KUBE-POSTROUTING chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination         
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose */ match-set KUBE-LOOP-BACK dst,dst,src

Here, outbound packets that were previously marked with 0x4000 hit the MASQUERADE target, an SNAT-like operation that rewrites their source IP to the ClusterIP or the node IP.
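
The resulting NAT entries can be observed in the conntrack table (the conntrack tool was installed earlier); for example, filtering by a Service IP, here an illustrative ClusterIP:

[root@k8s-n-1920168091021 overlord]# conntrack -L -d 10.96.0.1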

How marked traffic is handled

[root@k8s-n-1920168091021 overlord]# iptables -L -n

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination         
DROP       all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT     all  --  10.244.0.0/16        0.0.0.0/0            /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT     all  --  0.0.0.0/0            10.244.0.0/16        /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

Using the ipset command (the addresses matched by match-set in the iptables rules come from these sets):

[root@k8s-n-1920168091021 overlord]# ipset list
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 2
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 17584
References: 2
Members:
10.105.97.136,tcp:80
10.105.235.100,tcp:8452
10.98.53.160,tcp:80
10.97.204.141,tcp:80
10.108.115.91,tcp:80
10.98.118.117,tcp:80
10.96.0.1,tcp:443
10.101.26.124,tcp:443
10.98.88.140,tcp:8080
10.108.210.26,tcp:3306
10.96.0.10,tcp:9153
10.96.164.37,tcp:443
10.109.162.103,tcp:80
10.110.237.2,tcp:80
10.101.206.6,tcp:7030
10.111.154.57,tcp:8451
10.110.94.131,tcp:1111
10.98.146.210,tcp:7020
10.103.144.159,tcp:44134
10.96.0.10,tcp:53
10.98.88.140,tcp:8081
10.100.77.215,tcp:80
10.111.2.26,tcp:80
10.104.58.177,tcp:2181
10.97.58.7,tcp:80
10.111.11.67,tcp:8080
10.109.196.230,tcp:9090
10.98.39.12,tcp:5672
10.98.254.44,tcp:6379
10.96.0.10,udp:53
10.100.189.66,tcp:80
10.96.160.63,tcp:7010
10.97.217.217,tcp:3306
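
Besides listing a set, ipset can test whether a given address/port is a member, which is handy when debugging why a packet did or did not match a rule; for example:

[root@k8s-n-1920168091021 overlord]# ipset test KUBE-CLUSTER-IP 10.96.0.1,tcp:443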

Official documentation: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/README.md

Pitfalls others have run into:

A performance test was run with ab. After only a handful of requests, the Kubernetes node started logging errors:
kernel: nf_conntrack: table full, dropping packet

-

This is a classic error. nf_conntrack_max turned out to be only 131072, which is clearly too small; on CentOS 7.3 the default should be 65536*4 = 262144. Something must have changed the value. After searching everywhere without luck, a look at the kube-proxy logs showed that it really was the one that changed it!

[root@k8s-m-1 ~]# kubectl logs kube-proxy-q2s4h -n kube-system
W0110 09:32:36.679540      1 server_others.go:263] Flag proxy-mode="" unknown, assuming iptables proxy
I0110 09:32:36.681946      1 server_others.go:117] Using iptables Proxier.
I0110 09:32:36.699112      1 server_others.go:152] Tearing down inactive rules.
I0110 09:32:36.860064      1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0110 09:32:36.860138      1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0110 09:32:36.860192      1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0110 09:32:36.860230      1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0110 09:32:36.860480      1 config.go:102] Starting endpoints config controller

Finding the culprit

Digging through the source code shows that this is a preset value, and it can be found among kube-proxy's flags.

  • --conntrack-max-per-core int32 Maximum number of NAT connections to track per CPU core (0 to leave the limit as-is and ignore conntrack-min). (default 32768) That is 32768 per core, so the total is 32768 * number of CPU cores.
  • --conntrack-min int32 Minimum number of conntrack entries to allocate, regardless of conntrack-max-per-core (set conntrack-max-per-core=0 to leave the limit as-is). (default 131072) The floor is 131072, so with 4 or fewer CPU cores the effective value is 131072.

Solution

  • Now that the cause is known, change the kube-proxy flag (a quick check of the effective value is sketched after this list):
    --conntrack-min=1048576
  • Also add the following to sysctl.conf, since kube-proxy would otherwise adjust the value to 131072 (kernel parameters):
    net.netfilter.nf_conntrack_max=1048576
    net.nf_conntrack_max=1048576
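
A quick way to sanity-check the effective limit and current usage on a node (the per-core default of 32768 and the 131072 floor come from the flags above):

[root@k8s-m1 ~]# nproc
[root@k8s-m1 ~]# sysctl net.netfilter.nf_conntrack_max   # expected: max(32768 * cores, conntrack-min)
[root@k8s-m1 ~]# conntrack -C                            # number of entries currently tracked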

Tracking down a TCP timeout problem caused by IPVS

[root@k8s-n-192016801100151 overlord]# ipvsadm -Lnc       
IPVS connection entries
pro expire state       source             virtual            destination
TCP 14:46  ESTABLISHED 192.168.20.163:38150 192.168.110.151:30401 10.244.17.24:80
TCP 01:46  TIME_WAIT   192.168.20.163:37798 192.168.110.151:30401 10.244.17.24:80
TCP 00:01  TIME_WAIT   192.168.20.163:37150 192.168.110.151:30401 10.244.18.30:80
TCP 13:57  ESTABLISHED 192.168.20.163:37890 192.168.110.151:30401 10.244.18.30:80
TCP 14:59  ESTABLISHED 192.168.20.163:38218 192.168.110.151:30401 10.244.18.30:80
TCP 00:51  TIME_WAIT   192.168.20.163:37442 192.168.110.151:30401 10.244.18.30:80
TCP 00:46  TIME_WAIT   192.168.20.163:37424 192.168.110.151:30401 10.244.17.24:80
[root@k8s-n-192016801100151 overlord]# ipvsadm -l --timeout
Timeout (tcp tcpfin udp): 900 120 300

That pretty much pins down the problem: ipvs keeps a roughly 15-minute idle timeout on the connections it maintains for the VIP. Does this interact with the kernel's default TCP keepalive settings, and what is that default?
ipvs expires idle connections after 900 s (15 minutes) by default, while the kernel's net.ipv4.tcp_keepalive_time defaults to 7200 s. When an idle TCP connection reaches 900 s, ipvs drops it first, but the operating system does not yet consider the connection to have hit its keepalive timeout, so the client keeps using the old connection to send requests. Since ipvs no longer tracks that connection, the result is "Lost Connection". The fix is simply to lower the system keepalive time, say to 600 s, so that a keepalive probe is sent before ipvs expires the entry; the probe keeps the TCP connection alive, and the ipvs idle timer is also reset back to 15 minutes.

Add the following parameters

  • How long a connection must be idle before TCP starts sending keepalive probes. The default is 2 hours; here it is lowered to 10 minutes.
    net.ipv4.tcp_keepalive_time = 600
  • The total number of keepalive probes to send
    net.ipv4.tcp_keepalive_probes = 10
  • The interval between keepalive probes, in seconds
    net.ipv4.tcp_keepalive_intvl = 30

-

When keepalive is enabled through kernel parameters, daemon-side configuration, or client-side configuration, the TCP session is torn down according to those options. Taking the kernel parameters above as an example: the first keepalive probe is sent after 600 seconds of idleness, and then another probe is sent every 30 seconds, up to 10 times. If the client or server does not answer at all during that period, the TCP session is considered broken and is terminated. Why 600 s? Any value smaller than the ipvs default of 900 will do.
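
As an alternative (or in addition), the IPVS idle timeout itself can be raised above the application's keepalive interval. The tcp/tcpfin/udp timeouts shown earlier by ipvsadm -l --timeout can be changed per node; a sketch with illustrative values (note this is node-local, is not managed by kube-proxy, and does not persist across reboots):

[root@k8s-n-192016801100151 overlord]# ipvsadm --set 3600 120 300   # tcp tcpfin udp, in seconds
[root@k8s-n-192016801100151 overlord]# ipvsadm -l --timeout         # verify the new values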
