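Problem
Symptom: service names in Rancher cannot be pinged.
Troubleshooting process:
When a service name cannot be pinged, the first suspect is DNS.
1. Checked the logs of the two coredns instances; both showed the same output.
2. Resolved names from containers against each of the two DNS addresses. Containers on 238 could not resolve through the IP of the DNS on 156, which suggested a k8s network problem.
3. Compared the networking on the two hosts:
① 10.97.248.156: has flannel.1, and the routing table contains an entry for the peer (10.42.1.0 10.42.1.0 255.255.255.0 UG 0 0 0 flannel.1)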
[root@APP01-RC ~]# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.97.248.156 netmask 255.255.255.0 broadcast 10.97.248.255
② 10.97.248.238: has flannel.1, but the routing table contains no entry for the peer
Tried adding the route by hand, which failed:
[root@TC-TS-WL-APP02-RC ~]# route add -net 10.42.0.0 netmask 255.255.255.0 gw 10.42.0.0 dev flannel.1
SIOCADDRT: Network is unreachable
The contents of /etc/cni/net.d/ are identical on both hosts
/opt/cni/bin is identical as well
③ Neither host can ping the address of its own flannel.1 interface:
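The route comparison in ① and ② can be scripted so that it is easy to repeat on every node. The following is a minimal sketch, assuming the input is `route -n` style output with the destination in the first column and the interface in the last; the function name `has_flannel_route` is hypothetical.

```shell
#!/bin/sh
# has_flannel_route DEST
# Reads `route -n` style output on stdin and succeeds (exit 0) if some row
# has Destination == DEST and its last column (Iface) is flannel.1.
# The function name is an illustrative assumption, not part of any tool.
has_flannel_route() {
  awk -v dest="$1" '
    $1 == dest && $NF == "flannel.1" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

Possible usage on the node that lacks the entry: `route -n | has_flannel_route 10.42.0.0 || echo "no flannel.1 route to 10.42.0.0"`.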
[root@TC-TS-WL-APP02-RC net.d]# ping 10.42.0.0
Do you want to ping broadcast? Then -b. If not, check your local firewall rules.
[root@TC-TS-WL-APP02-RC net.d]# ping 10.42.0.0 -b
WARNING: pinging broadcast address
PING 10.42.0.0 (10.42.0.0) 56(84) bytes of data.
^C
--- 10.42.0.0 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
④ On a healthy production Rancher, pinging the interface works: both the local and the peer flannel.1 addresses respond
[root@-APP02-PROD ~]# ping 10.42.1.0
PING 10.42.1.0 (10.42.1.0) 56(84) bytes of data.
64 bytes from 10.42.1.0: icmp_seq=1 ttl=64 time=0.292 ms
^C
--- 10.42.1.0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
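⑤ The healthy node cannot connect to 8472/UDP either (note that nc without -u probes TCP, so the "Connection refused" below says nothing about the UDP listener)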
[root@APP01-RC net.d]# netstat -anpltu|grep 8472
udp 0 0 0.0.0.0:8472 0.0.0.0:* -
[root@-APP01-RC net.d]# nc -w 2 -v 10.97.248.238 8472
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection refused.
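⑥ Redeployed the System project, which has two containers (cattle-node-agent, flannel_canal). Afterwards one host could ping its own network, but its routing table had no flannel.1 entry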
[root@APP02-RC ~]# ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.42.1.0 netmask 255.255.255.255 broadcast 10.42.1.0
[root@APP02-RC ~]# ping 10.42.1.0
PING 10.42.1.0 (10.42.1.0) 56(84) bytes of data.
64 bytes from 10.42.1.0: icmp_seq=1 ttl=64 time=0.048 ms
⑦ Checked the logs of the flannel_canal container:
Only the flannel_canal log on 10.97.248.238 shows an error:
failed to add vxlanRoute (10.42.0.0/24 -> 10.42.0.0): invalid argument
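The last line of the log is an error: adding the route failed. Inspect all network devices and subnets, paying particular attention to 10.42.0.0/24
⑧ Resolution: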
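To spot this error quickly across nodes, the failing subnet can be pulled out of the log text. This is a minimal sketch, assuming the log line has exactly the shape quoted above; the function name `failed_vxlan_routes` is hypothetical.

```shell
#!/bin/sh
# failed_vxlan_routes
# Reads flannel log text on stdin and prints the subnet of every
# "failed to add vxlanRoute (SUBNET -> GATEWAY)" line, one per line.
# Assumes the message format shown in the log excerpt above.
failed_vxlan_routes() {
  sed -n 's/.*failed to add vxlanRoute (\([^ ]*\) -> [^)]*).*/\1/p'
}
```

Possible usage, where the container name is a placeholder: `docker logs <flannel_canal container> 2>&1 | failed_vxlan_routes`.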
ip a
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,PROMISC,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:e9:ec:ea:9f brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet 10.42.0.1/16 brd 10.42.255.255 scope global docker0
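The third entry, the docker0 device, holds 10.42.0.1/16. This conflicts with the network declared earlier by FLANNEL_NETWORK, so the route cannot be added and the overlay network cannot forward traffic
On both hosts, delete the 10.42.0.1/16 address from the docker0 device, and the problem is resolved: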
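This kind of conflict can be checked mechanically instead of by reading `ip a` by eye. The following is a minimal sketch, assuming FLANNEL_NETWORK is 10.42.0.0/16 (the notes imply this but do not print the value) and that the input is `ip -o -4 addr show` output. The function name `conflicting_addrs` and the decision to skip `flannel.*` and `cni*` devices are assumptions made for illustration.

```shell
#!/bin/sh
# conflicting_addrs CIDR
# Reads `ip -o -4 addr show` output on stdin and prints "DEV ADDR/PREFIX"
# for every IPv4 address that lies inside CIDR, ignoring flannel.* and cni*
# devices (assumed to be the legitimate owners of that range).
conflicting_addrs() {
  awk -v cidr="$1" '
    # Convert dotted-quad text to a single integer.
    function ip2int(ip,   o) {
      split(ip, o, ".")
      return ((o[1] * 256 + o[2]) * 256 + o[3]) * 256 + o[4]
    }
    # Network number of an address for a prefix of the given bit length.
    function netof(ip, bits) {
      return int(ip2int(ip) / 2 ^ (32 - bits))
    }
    BEGIN { split(cidr, c, "/") }
    $3 == "inet" && $2 !~ /^(flannel\.|cni)/ {
      split($4, a, "/")
      if (netof(a[1], c[2]) == netof(c[1], c[2]))
        print $2, $4
    }
  '
}
```

Possible usage: `ip -o -4 addr show | conflicting_addrs 10.42.0.0/16`. On the hosts described here this would be expected to list the stray docker0 address.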
ip addr del 10.42.0.1 dev docker0
[root@APP02-RC ~]# ip addr del 10.42.0.1 dev docker0
Warning: Executing wildcard deletion to stay compatible with old scripts.
Explicitly specify the prefix length (10.42.0.1/32) to avoid this warning.
This special behaviour is likely to disappear in further releases,
fix your scripts!
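The warning above asks for an explicit prefix length. Since docker0 carries the address as 10.42.0.1/16, one way to avoid the wildcard deletion is to take the address together with its prefix directly from `ip -o -4 addr show`. The sketch below only prints the commands (a dry run) so they can be reviewed before being executed as root; the function name `print_del_commands` and the plain text-prefix match are assumptions made for illustration.

```shell
#!/bin/sh
# print_del_commands TEXT_PREFIX
# Reads `ip -o -4 addr show` output on stdin and prints one
# `ip addr del ADDR/PREFIXLEN dev DEV` command for every address whose
# text starts with TEXT_PREFIX (for example "10.42.").
# Nothing is deleted; the commands are only printed for review.
print_del_commands() {
  awk -v prefix="$1" '
    $3 == "inet" && index($4, prefix) == 1 {
      printf "ip addr del %s dev %s\n", $4, $2
    }
  '
}
```

Possible usage: `ip -o -4 addr show dev docker0 | print_del_commands 10.42.`, then run the printed line once it looks right.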
⑨ Redeployed the System project that has the two containers (flannel_canal)
⑩ Network test: both physical hosts can ping the container IP
[root@APP02-RC ~]# ping 10.42.1.6
PING 10.42.1.6 (10.42.1.6) 56(84) bytes of data.
64 bytes from 10.42.1.6: icmp_seq=1 ttl=64 time=0.060 ms
64 bytes from 10.42.1.6: icmp_seq=2 ttl=64 time=0.043 ms
[root@APP01-RC net.d]# ping 10.42.1.6
PING 10.42.1.6 (10.42.1.6) 56(84) bytes of data.
64 bytes from 10.42.1.6: icmp_seq=1 ttl=63 time=0.377 ms
64 bytes from 10.42.1.6: icmp_seq=2 ttl=63 time=0.290 ms
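Solution: this is a low-level problem, so removing and re-adding the node does not fix it. Deleting the subnet hidden under docker0 (ip addr del 10.42.0.1 dev docker0) is enough. Rancher 1.X adds a 10.42 subnet to the docker0 device for ipsec forwarding. For unknown reasons it was not cleaned up completely, and it conflicted with the default subnet of the flannel network.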