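Symptom:
Containers created in K8S can sometimes ping a domain name and sometimes cannot.
Environment:
K8S was set up automatically with kubeasz. Name resolution is handled by coredns, which runs two instances placed on two different Worker nodes.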
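To see which Worker nodes the two coredns instances landed on, something like the following may help. This is only a sketch: it assumes coredns runs in the kube-system namespace and carries the label k8s-app=kube-dns, which is common but not confirmed by these notes. The function name coredns_pods is made up for illustration.

```shell
# List the coredns pods together with their pod IPs and the nodes they run on.
# Assumption: namespace kube-system and label k8s-app=kube-dns.
coredns_pods() {
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
}

# Usage: coredns_pods
```

The IP and NODE columns of the output are the useful part when a DNS reply later arrives from a pod address instead of the service address.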
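Investigation steps:
1. Manually start a centos7 container.
2. ping www.baidu.com: fails.
3. Add a DNS server outside the K8S cluster to /etc/resolv.conf in the container:
nameserver 10.128.142.149 # DNS server outside the cluster
nameserver 172.20.0.2 # address of the coredns service inside the cluster
4. ping www.baidu.com: succeeds.
5. Install the dig command inside the container:
# yum install bind-utils
# dig www.baidu.com shows that the answer came from the DNS server outside the cluster: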
[root@centos-bdf5cff5b-7jzt9 /]# dig www.baidu.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.baidu.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2694
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.baidu.com. IN A
;; ANSWER SECTION:
www.baidu.com. 1102 IN CNAME www.a.shifen.com.
www.a.shifen.com. 207 IN A 112.80.248.76
www.a.shifen.com. 207 IN A 112.80.248.75
;; Query time: 0 msec
;; SERVER: 10.128.142.149#53(10.128.142.149)
;; WHEN: Thu Dec 26 02:16:04 UTC 2019
;; MSG SIZE rcvd: 104
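6. Remove the external DNS server from /etc/resolv.conf and run dig again. To avoid cached results I used a different domain name. The DNS inside the cluster returned a correct address, and I could also ping www.qq.com: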
[root@centos-bdf5cff5b-7jzt9 /]# dig www.qq.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.qq.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59772
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.qq.com. IN A
;; ANSWER SECTION:
www.qq.com. 30 IN CNAME public.sparta.mig.tencent-cloud.net.
public.sparta.mig.tencent-cloud.net. 30 IN A 157.255.192.44
public.sparta.mig.tencent-cloud.net. 30 IN A 61.241.44.148
;; Query time: 1 msec
;; SERVER: 172.20.0.2#53(172.20.0.2)
;; WHEN: Thu Dec 26 02:16:42 UTC 2019
;; MSG SIZE rcvd: 200
[root@centos-bdf5cff5b-7jzt9 /]# ping www.qq.com
PING public.sparta.mig.tencent-cloud.net (157.255.192.44) 56(84) bytes of data.
64 bytes from 157.255.192.44 (157.255.192.44): icmp_seq=1 ttl=49 time=33.0 ms
7. dig yet another domain, www.163.com. This time it fails:
[root@centos-bdf5cff5b-7jzt9 /]# dig www.163.com
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
;; reply from unexpected source: 172.200.3.93#53, expected 172.20.0.2#53
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.163.com
;; global options: +cmd
;; connection timed out; no servers could be reached
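The warning lines above name both the address that actually answered (172.200.3.93) and the address dig expected (172.20.0.2, the coredns service). A likely reading, not verified here, is that the answering address is the pod IP of one coredns instance whose reply was not rewritten back to the service IP. A small helper can pull these pairs out of dig output; the function name unexpected_sources is hypothetical.

```shell
# Read dig output on stdin and print each distinct
# "<actual source> <expected source>" address pair, one per line.
unexpected_sources() {
  sed -n 's/^;; reply from unexpected source: \([^#]*\)#[0-9]*, expected \([^#]*\)#[0-9]*$/\1 \2/p' | sort -u
}

# Usage: dig www.163.com 2>&1 | unexpected_sources
```

Comparing the first column with the pod IPs of the coredns instances shows which instance, and therefore which node, is involved.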
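8. Searching for this error turned up the same question on stackoverflow, with the following fix:
Ubuntu: on the worker nodes, check whether the br_netfilter module is loaded; if it is not, run modprobe br_netfilter
CentOS: check whether /proc/sys/net/bridge/bridge-nf-call-iptables is 1; if it is not: echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
9. In my case, one of the nodes running coredns had /proc/sys/net/bridge/bridge-nf-call-iptables set to a value other than 1. After changing it to 1, dig succeeded again and the problem was solved: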
[root@centos-bdf5cff5b-7jzt9 /]# dig www.163.com
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-9.P2.el7 <<>> www.163.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28746
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.163.com. IN A
;; ANSWER SECTION:
www.163.com. 30 IN CNAME www.163.com.163jiasu.com.
www.163.com.163jiasu.com. 30 IN CNAME www.163.com.bsgslb.cn.
www.163.com.bsgslb.cn. 30 IN CNAME z163ipv6.v.bsgslb.cn.
z163ipv6.v.bsgslb.cn. 30 IN A 58.16.59.134
z163ipv6.v.bsgslb.cn. 30 IN A 58.16.59.131
z163ipv6.v.bsgslb.cn. 30 IN A 58.16.59.137
;; Query time: 1 msec
;; SERVER: 172.20.0.2#53(172.20.0.2)
;; WHEN: Thu Dec 26 02:22:34 UTC 2019
;; MSG SIZE rcvd: 311
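The check from steps 8 and 9 can be wrapped in a small shell function and run on every Worker node. This is a minimal sketch, not the exact commands used above: the function name bridge_nf_status and the file names in the comments are illustrative. Note that a value written with echo into /proc does not survive a reboot, so the comments also show one possible way to persist it.

```shell
# Path of the kernel setting checked in step 8; can be overridden for testing.
BRIDGE_NF_FILE="${BRIDGE_NF_FILE:-/proc/sys/net/bridge/bridge-nf-call-iptables}"

# Report whether bridged traffic is passed to iptables on this node.
# Prints enabled/disabled/missing and returns 0/1/2 respectively.
# "missing" usually means the br_netfilter module is not loaded.
bridge_nf_status() {
  file="${1:-$BRIDGE_NF_FILE}"
  if [ ! -r "$file" ]; then
    echo "missing"
    return 2
  fi
  if [ "$(cat "$file")" = "1" ]; then
    echo "enabled"
    return 0
  fi
  echo "disabled"
  return 1
}

# Possible fix, run as root on the affected node (not executed here):
#   modprobe br_netfilter
#   sysctl -w net.bridge.bridge-nf-call-iptables=1
# To persist across reboots (file names are arbitrary examples):
#   echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
#   echo 'net.bridge.bridge-nf-call-iptables = 1' > /etc/sysctl.d/99-bridge-nf.conf
```

Running bridge_nf_status on each node that hosts a coredns instance should print enabled everywhere once the fix is in place.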
Reference:
https://github.com/kubernetes/kubernetes/issues/21613#issuecomment-343190401