Analysis of kube-proxy operating modes

kube-proxy supports two modes, iptables and ipvs. The ipvs mode was introduced in Kubernetes v1.8, reached beta in v1.9, and became generally available in v1.11. Support for the iptables mode was added as early as v1.1, and iptables has been kube-proxy's default mode since v1.2. Both ipvs and iptables are built on netfilter, so what are the differences between the two modes?

  • ipvs offers better scalability and performance for large clusters
  • ipvs supports more sophisticated load-balancing algorithms than iptables (least load, least connections, weighted, and so on)
  • ipvs supports server health checks, connection retries, and similar features
  • ipset sets can be modified dynamically, even while iptables rules are referencing them

ipvs depends on iptables

ipvs by itself cannot do packet filtering, SNAT, or masquerading, so in some scenarios (such as implementing NodePort) it still has to work together with iptables. ipvs uses ipset to store the source or destination addresses of traffic that needs to be dropped or masqueraded, which keeps the number of iptables rules constant. Suppose we want to block tens of thousands of IPs from reaching our servers: with plain iptables we would have to add a rule for each of them, generating an enormous rule set; with ipset we only need to add the relevant IP addresses (or networks) to an ipset set, and a handful of iptables rules is enough to achieve the same goal.
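
A minimal sketch of the same idea outside of Kubernetes (the set name and addresses below are made up for illustration): a single iptables rule references the set, and the set can keep growing without adding more rules.

[root@k8s-m1 ~]# ipset create blacklist hash:net
[root@k8s-m1 ~]# ipset add blacklist 10.170.0.0/16
[root@k8s-m1 ~]# ipset add blacklist 192.0.2.15
[root@k8s-m1 ~]# iptables -A INPUT -m set --match-set blacklist src -j DROP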

Using ipvs mode in kube-proxy

Install the dependency packages on every machine:

[root@k8s-m1 ~]# yum install ipvsadm ipset sysstat conntrack libseccomp -y

On every machine, pick the kernel modules that should be loaded at boot. The modules below are the ones ipvs mode needs; set them to load automatically on startup:

[root@k8s-m1 ~]# :> /etc/modules-load.d/ipvs.conf
module=(
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
br_netfilter
  )
for kernel_module in ${module[@]};do
    /sbin/modinfo -F filename $kernel_module |& grep -qv ERROR && echo $kernel_module >> /etc/modules-load.d/ipvs.conf || :
done
systemctl enable --now systemd-modules-load.service

If the systemctl enable command above fails, run systemctl status -l systemd-modules-load.service to see which kernel module cannot be loaded, comment it out in /etc/modules-load.d/ipvs.conf, and enable the service again.
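
Once the service has run, you can confirm the modules are actually loaded with something like:

[root@k8s-m1 ~]# lsmod | grep -e ip_vs -e nf_conntrack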

All machines also need the kernel parameters in /etc/sysctl.d/k8s.conf set as follows.

[root@k8s-m1 ~]# cat <<EOF > /etc/sysctl.d/k8s.conf
# https://github.com/moby/moby/issues/31208 
# ipvsadm -l --timeout
# works around timeouts on long-lived connections in ipvs mode; any value below 900 is fine
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 10
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
net.ipv4.ip_forward = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
# pass bridged traffic to the iptables/ip6tables/arptables chains (required by Kubernetes)
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1
net.netfilter.nf_conntrack_max = 2310720
fs.inotify.max_user_watches=89100
fs.may_detach_mounts = 1
fs.file-max = 52706963
fs.nr_open = 52706963
vm.swappiness = 0
vm.overcommit_memory=1
vm.panic_on_oom=0
EOF

[root@k8s-m1 ~]# sysctl --system

Edit the kube-proxy configuration file and set mode to ipvs:

hostnameOverride: k8s-m1
iptables:
    masqueradeAll: true
    masqueradeBit: 14
    minSyncPeriod: 0s
    syncPeriod: 30s
ipvs:
    excludeCIDRs: null
    minSyncPeriod: 0s
    scheduler: ""
    syncPeriod: 30s
kind: KubeProxyConfiguration
metricsBindAddress: 192.168.0.200:10249
mode: "ipvs"
nodePortAddresses: null
oomScoreAdj: -999
portRange: ""
resourceContainer: /kube-proxy
udpIdleTimeout: 250ms
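
One way to apply the change, assuming kube-proxy runs as a kubeadm-style DaemonSet whose ConfigMap is named kube-proxy and whose pods carry the k8s-app=kube-proxy label (adjust to your cluster), is to edit the ConfigMap and recreate the pods; a sketch:

[root@k8s-m1 ~]# kubectl -n kube-system edit configmap kube-proxy        # set mode: "ipvs"
[root@k8s-m1 ~]# kubectl -n kube-system delete pod -l k8s-app=kube-proxy # the DaemonSet recreates the pods with the new mode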

When a ClusterIP type Service is created, the IPVS proxier performs the following three actions:

  • make sure a dummy interface exists on the node, kube-ipvs0 by default
  • bind the Service IP addresses to that dummy interface
  • create an IPVS virtual server for each Service IP address

Here is an example:

[root@k8s-m1 ~]# kubectl describe svc tomcat-service
Name:              tomcat-service
Namespace:         default
Labels:            <none>
Annotations:       kubectl.kubernetes.io/last-applied-configuration:
                     {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"tomcat-service","namespace":"default"},"spec":{"ports":[{"port":8...
Selector:          app=tomcat
Type:              ClusterIP
IP:                10.106.88.77
Port:              <unset>  8080/TCP
TargetPort:        8080/TCP
Endpoints:         10.244.0.48:8080
Session Affinity:  None
Events:            <none>
[root@k8s-m1 ~]# ip -4 a
8: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    inet 10.96.0.10/32 brd 10.96.0.10 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.101.68.42/32 brd 10.101.68.42 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.96.0.1/32 brd 10.96.0.1 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.107.7.203/32 brd 10.107.7.203 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.106.88.77/32 brd 10.106.88.77 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.98.230.124/32 brd 10.98.230.124 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.103.49.63/32 brd 10.103.49.63 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
[root@k8s-m1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:30024 rr
  -> 10.244.4.21:3000             Masq    1      0          0
TCP  192.168.0.200:30024 rr
  -> 10.244.4.21:3000             Masq    1      0          0
TCP  192.168.0.200:30040 rr
  -> 10.244.4.28:9090             Masq    1      0          0
TCP  10.96.0.1:443 rr
  -> 192.168.0.200:6443           Masq    1      0          0
  -> 192.168.0.201:6443           Masq    1      1          0
  -> 192.168.0.202:6443           Masq    1      0          0

Deleting a Kubernetes Service triggers deletion of the corresponding IPVS virtual server, its IPVS real servers, and the IP address bound to the dummy interface.

Port mapping:

IPVS has three forwarding modes: NAT (masq), IPIP, and DR. Only NAT mode supports port mapping, so kube-proxy relies on NAT mode for it. The following example shows an IPVS mapping from Service port 8080 to Pod port 80.

TCP  10.107.7.203:8080 rr
  -> 10.244.4.14:80               Masq    1      0          0
  -> 10.244.4.15:80               Masq    1      0          0
  -> 10.244.4.16:80               Masq    1      0          0
  -> 10.244.4.20:80               Masq    1      0          0
  -> 10.244.4.22:80               Masq    1      0          0
  -> 10.244.4.23:80               Masq    1      0          0
  -> 10.244.4.24:80               Masq    1      0          0

Session affinity:

IPVS supports client-IP session affinity (persistent connections). When a Service specifies session affinity, the IPVS proxier sets a timeout value on the IPVS virtual server (180 minutes = 10800 seconds by default). For example:

[root@k8s-m1 ~]# kubectl describe svc nginx-service
Name:           nginx-service
...
IP:             10.102.128.4
Port:           http    3080/TCP
Session Affinity:   ClientIP

[root@k8s-m1 ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.102.128.4:3080 rr persistent 10800
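
A Service producing the persistent entry above could look roughly like the sketch below (selector, target port and other details are hypothetical); the timeout can be tuned through sessionAffinityConfig:

[root@k8s-m1 ~]# kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - name: http
    port: 3080
    targetPort: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
EOF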

The IPVS proxier falls back on iptables in the following five situations:

  • kube-proxy is started with --masquerade-all=true
  • kube-proxy is started with --cluster-cidr=<cidr>
  • Services of type LoadBalancer
  • Services of type NodePort
  • Services with externalIPs specified

kube-proxy flag --masquerade-all=true

If kube-proxy is configured with --masquerade-all=true, ipvs masquerades all traffic destined for a Service Cluster IP, which matches the behaviour of iptables mode. The iptables rules added by ipvs look like this:


[root@k8s-m1 ~]# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Specifying the cluster CIDR when starting kube-proxy

If kube-proxy is configured with --cluster-cidr=<cidr>, ipvs masquerades off-cluster traffic destined for a Service Cluster IP, again matching the iptables behaviour. Assuming kube-proxy is given the cluster CIDR 10.244.16.0/24, the iptables rules added by ipvs would look like this:

[root@k8s-m1 ~]# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (3 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  -- !10.244.16.0/24       0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Services of type LoadBalancer

For a LoadBalancer type Service, ipvs installs iptables rules that match the KUBE-LOAD-BALANCER ipset. In particular, when the Service specifies LoadBalancerSourceRanges or sets externalTrafficPolicy=Local, ipvs creates the ipset sets KUBE-LOAD-BALANCER-LOCAL / KUBE-LOAD-BALANCER-FW / KUBE-LOAD-BALANCER-SOURCE-CIDR and adds the corresponding iptables rules, as shown below:


# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-FIREWALL (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER-SOURCE-CIDR dst,dst,src
KUBE-MARK-DROP  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-LOAD-BALANCER (1 references)
target     prot opt source               destination
KUBE-FIREWALL  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER-FW dst,dst
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER-LOCAL dst,dst
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-MARK-DROP (1 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x8000

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-LOAD-BALANCER  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOAD-BALANCER dst,dst

Services of type NodePort

For a NodePort type Service, ipvs adds iptables rules that match the KUBE-NODE-PORT-TCP / KUBE-NODE-PORT-UDP ipsets. When externalTrafficPolicy=Local is specified, ipvs creates the ipset sets KUBE-NODE-PORT-LOCAL-TCP / KUBE-NODE-PORT-LOCAL-UDP and installs the corresponding iptables rules, as shown below (assuming the Service uses a TCP nodePort):

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-NODE-PORT (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-NODE-PORT-LOCAL-TCP dst
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-NODE-PORT-TCP dst
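
You can inspect the corresponding ipset directly; assuming at least one TCP NodePort Service exists, its node ports show up as members of the set:

# ipset list KUBE-NODE-PORT-TCP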

Services with externalIPs specified

For a Service with externalIPs specified, ipvs installs iptables rules that match the KUBE-EXTERNAL-IP ipset. Assuming we have such a Service, the iptables rules would look like this:

# iptables -t nat -nL

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst PHYSDEV match ! --physdev-is-in ADDRTYPE match src-type !LOCAL
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-EXTERNAL-IP dst,dst ADDRTYPE match dst-type LOCAL

Packet flow in IPVS mode

Inbound traffic

Inbound traffic is traffic that reaches a Service from outside the cluster.
Its iptables chain path is PREROUTING@nat -> INPUT@nat.

ClusterIP

At the PREROUTING stage, traffic jumps to the KUBE-SERVICES chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

The KUBE-SERVICES chain looks like this:

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  -- !10.244.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Traffic to a ClusterIP Service is handed to KUBE-MARK-MASQ, whose match rule references the in-kernel ipset named KUBE-CLUSTER-IP (packets whose source address is not in 10.244.0.0/16 are passed to KUBE-MARK-MASQ).

The next step is to mark those packets:

Chain KUBE-MARK-MASQ (3 references)
target     prot opt source               destination         
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

At this point the packet has finished its journey through iptables; there is no subsequent DNAT to a backend endpoint here, because that work is handed over to IPVS. (This step merely puts the 0x4000 mark on packets matching the KUBE-CLUSTER-IP ipset; packets carrying this mark are later MASQUERADEd uniformly in the KUBE-POSTROUTING chain.)
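
Running iptables with -v shows the packet counters on these rules, which is a quick way to confirm that ClusterIP traffic really is only being marked here rather than DNAT'ed by iptables:

[root@k8s-n-1920168091021 overlord]# iptables -t nat -L KUBE-MARK-MASQ -n -v
[root@k8s-n-1920168091021 overlord]# iptables -t nat -L KUBE-POSTROUTING -n -v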

Checking the ipvs proxy rules

You can use the ipvsadm tool to check whether kube-proxy is maintaining the correct ipvs rules. For example, suppose the cluster has the following Services:

# kubectl get svc --all-namespaces
NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
default       kubernetes   ClusterIP   10.0.0.1     <none>        443/TCP         1d
kube-system   kube-dns     ClusterIP   10.0.0.10    <none>        53/UDP,53/TCP   1d

We should then see ipvs proxy rules like these:

 # ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.1:443 rr persistent 10800
  -> 192.168.0.1:6443             Masq    1      1          0
TCP  10.0.0.10:53 rr
  -> 172.17.0.2:53                Masq    1      0          0
UDP  10.0.0.10:53 rr
  -> 172.17.0.2:53                Masq    1      0          0

Outbound traffic

Outbound traffic is traffic from Pods inside the cluster to a Service.
Its iptables chain path is OUTPUT@nat -> POSTROUTING@nat.
The OUTPUT chain is shown below; just as with inbound traffic, everything jumps to the KUBE-SERVICES chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
DOCKER     all  --  0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

What follows is the same as for inbound traffic: whether it is a ClusterIP or a NodePort Service, the packets get the 0x4000 mark. The difference is that the inbound iptables path ends there, whereas outbound traffic still has to pass through the POSTROUTING chain of the nat table, defined as follows:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
MASQUERADE  all  --  172.17.0.0/16        0.0.0.0/0           
MASQUERADE  all  --  233.233.5.0/24       0.0.0.0/0           
RETURN     all  --  10.244.0.0/16        10.244.0.0/16       
MASQUERADE  all  --  10.244.0.0/16       !224.0.0.0/4         
RETURN     all  -- !10.244.0.0/16        10.244.16.0/24      
MASQUERADE  all  -- !10.244.0.0/16        10.244.0.0/16       

From there the traffic jumps to the KUBE-POSTROUTING chain:

[root@k8s-n-1920168091021 overlord]# iptables -L -n -t nat
Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination         
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose */ match-set KUBE-LOOP-BACK dst,dst,src

Here, outbound packets that were previously marked with 0x4000 hit the MASQUERADE target, an SNAT-like operation that rewrites their source IP to the ClusterIP or the node IP.
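
The resulting NAT entries can be observed in the conntrack table (the conntrack tool was installed earlier); for example, filtering by a Service IP, here an illustrative ClusterIP:

[root@k8s-n-1920168091021 overlord]# conntrack -L -d 10.96.0.1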

How marked traffic is handled

[root@k8s-n-1920168091021 overlord]# iptables -L -n

Chain KUBE-FIREWALL (2 references)
target     prot opt source               destination         
DROP       all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

Chain KUBE-FORWARD (1 references)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT     all  --  10.244.0.0/16        0.0.0.0/0            /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT     all  --  0.0.0.0/0            10.244.0.0/16        /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

Using the ipset command (the addresses matched by match-set in the iptables rules come from these sets):

[root@k8s-n-1920168091021 overlord]# ipset list
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 2
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 17584
References: 2
Members:
10.105.97.136,tcp:80
10.105.235.100,tcp:8452
10.98.53.160,tcp:80
10.97.204.141,tcp:80
10.108.115.91,tcp:80
10.98.118.117,tcp:80
10.96.0.1,tcp:443
10.101.26.124,tcp:443
10.98.88.140,tcp:8080
10.108.210.26,tcp:3306
10.96.0.10,tcp:9153
10.96.164.37,tcp:443
10.109.162.103,tcp:80
10.110.237.2,tcp:80
10.101.206.6,tcp:7030
10.111.154.57,tcp:8451
10.110.94.131,tcp:1111
10.98.146.210,tcp:7020
10.103.144.159,tcp:44134
10.96.0.10,tcp:53
10.98.88.140,tcp:8081
10.100.77.215,tcp:80
10.111.2.26,tcp:80
10.104.58.177,tcp:2181
10.97.58.7,tcp:80
10.111.11.67,tcp:8080
10.109.196.230,tcp:9090
10.98.39.12,tcp:5672
10.98.254.44,tcp:6379
10.96.0.10,udp:53
10.100.189.66,tcp:80
10.96.160.63,tcp:7010
10.97.217.217,tcp:3306
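
Besides listing a set, ipset can test whether a given address/port is a member, which is handy when debugging why a packet did or did not match a rule; for example:

[root@k8s-n-1920168091021 overlord]# ipset test KUBE-CLUSTER-IP 10.96.0.1,tcp:443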

Official documentation: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/README.md

Pitfalls others have run into:

A performance test was run with ab. After only a handful of requests, the Kubernetes node started logging errors:
kernel: nf_conntrack: table full, dropping packet

-

This is a classic error. nf_conntrack_max turned out to be only 131072, which is clearly too small; on CentOS 7.3 the default should be 65536*4 = 262144. Something must have changed the value. After searching everywhere without luck, a look at the kube-proxy logs showed that it really was the one that changed it!

[root@k8s-m-1 ~]# kubectl logs kube-proxy-q2s4h -n kube-system
W0110 09:32:36.679540      1 server_others.go:263] Flag proxy-mode="" unknown, assuming iptables proxy
I0110 09:32:36.681946      1 server_others.go:117] Using iptables Proxier.
I0110 09:32:36.699112      1 server_others.go:152] Tearing down inactive rules.
I0110 09:32:36.860064      1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0110 09:32:36.860138      1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0110 09:32:36.860192      1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0110 09:32:36.860230      1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0110 09:32:36.860480      1 config.go:102] Starting endpoints config controller

Finding the culprit

Digging through the source code shows that this is a preset value, and it can be found among kube-proxy's flags.

  • --conntrack-max-per-core int32 Maximum number of NAT connections to track per CPU core (0 to leave the limit as-is and ignore conntrack-min). (default 32768) That is 32768 per core, so the total is 32768 * number of CPU cores.
  • --conntrack-min int32 Minimum number of conntrack entries to allocate, regardless of conntrack-max-per-core (set conntrack-max-per-core=0 to leave the limit as-is). (default 131072) The floor is 131072, so with 4 or fewer CPU cores the effective value is 131072.

Solution

  • Now that the cause is known, change the kube-proxy flag (a quick check of the effective value is sketched after this list):
    --conntrack-min=1048576
  • Also add the following to sysctl.conf, since kube-proxy would otherwise adjust the value to 131072 (kernel parameters):
    net.netfilter.nf_conntrack_max=1048576
    net.nf_conntrack_max=1048576
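
A quick way to sanity-check the effective limit and current usage on a node (the per-core default of 32768 and the 131072 floor come from the flags above):

[root@k8s-m1 ~]# nproc
[root@k8s-m1 ~]# sysctl net.netfilter.nf_conntrack_max   # expected: max(32768 * cores, conntrack-min)
[root@k8s-m1 ~]# conntrack -C                            # number of entries currently tracked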

Tracking down a TCP timeout problem caused by IPVS

[root@k8s-n-192016801100151 overlord]# ipvsadm -Lnc       
IPVS connection entries
pro expire state       source             virtual            destination
TCP 14:46  ESTABLISHED 192.168.20.163:38150 192.168.110.151:30401 10.244.17.24:80
TCP 01:46  TIME_WAIT   192.168.20.163:37798 192.168.110.151:30401 10.244.17.24:80
TCP 00:01  TIME_WAIT   192.168.20.163:37150 192.168.110.151:30401 10.244.18.30:80
TCP 13:57  ESTABLISHED 192.168.20.163:37890 192.168.110.151:30401 10.244.18.30:80
TCP 14:59  ESTABLISHED 192.168.20.163:38218 192.168.110.151:30401 10.244.18.30:80
TCP 00:51  TIME_WAIT   192.168.20.163:37442 192.168.110.151:30401 10.244.18.30:80
TCP 00:46  TIME_WAIT   192.168.20.163:37424 192.168.110.151:30401 10.244.17.24:80
[root@k8s-n-192016801100151 overlord]# ipvsadm -l --timeout
Timeout (tcp tcpfin udp): 900 120 300

That pretty much pins down the problem: ipvs keeps a roughly 15-minute idle timeout on the connections it maintains for the VIP. Does this interact with the kernel's default TCP keepalive settings, and what is that default?
ipvs expires idle connections after 900 s (15 minutes) by default, while the kernel's net.ipv4.tcp_keepalive_time defaults to 7200 s. When an idle TCP connection reaches 900 s, ipvs drops it first, but the operating system does not yet consider the connection to have hit its keepalive timeout, so the client keeps using the old connection to send requests. Since ipvs no longer tracks that connection, the result is "Lost Connection". The fix is simply to lower the system keepalive time, say to 600 s, so that a keepalive probe is sent before ipvs expires the entry; the probe keeps the TCP connection alive, and the ipvs idle timer is also reset back to 15 minutes.

Add the following parameters

  • How long a connection must be idle before TCP starts sending keepalive probes. The default is 2 hours; here it is lowered to 10 minutes.
    net.ipv4.tcp_keepalive_time = 600
  • The total number of keepalive probes to send
    net.ipv4.tcp_keepalive_probes = 10
  • The interval between keepalive probes, in seconds
    net.ipv4.tcp_keepalive_intvl = 30

-

When keepalive is enabled through kernel parameters, daemon-side configuration, or client-side configuration, the TCP session is torn down according to those options. Taking the kernel parameters above as an example: the first keepalive probe is sent after 600 seconds of idleness, and then another probe is sent every 30 seconds, up to 10 times. If the client or server does not answer at all during that period, the TCP session is considered broken and is terminated. Why 600 s? Any value smaller than the ipvs default of 900 will do.
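
As an alternative (or in addition), the IPVS idle timeout itself can be raised above the application's keepalive interval. The tcp/tcpfin/udp timeouts shown earlier by ipvsadm -l --timeout can be changed per node; a sketch with illustrative values (note this is node-local, is not managed by kube-proxy, and does not persist across reboots):

[root@k8s-n-192016801100151 overlord]# ipvsadm --set 3600 120 300   # tcp tcpfin udp, in seconds
[root@k8s-n-192016801100151 overlord]# ipvsadm -l --timeout         # verify the new values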
