Qunar Container Platform Networking: Calico

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author: Zhao Ning (Qunar OPS)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Joined Qunar in 2011, working on private cloud infrastructure and Ceph storage operations; has extensive operations experience and now leads the container and storage group.","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"I. Introduction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calico is an open-source CNI project that provides a networking solution for containerized applications. This article briefly describes how we use Calico to provide networking for our container platform.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"II. 
Calico Architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calico is a Layer 3 data center networking solution. As a CNI plugin it gives containers running in Kubernetes Layer 3 TCP/IP connectivity, and it can also integrate with IaaS platforms such as OpenStack. It uses protocols such as BGP and IPIP to connect workloads, providing efficient and controllable communication among VMs, containers, and bare-metal hosts.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9a/9a0f4dc3662583e5c031e32c69545f52.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 1: Core components","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calico's core components include:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ Felix, the Calico agent, runs on every container host node and programs routes, ACLs, and related state to keep containers reachable;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ etcd, a distributed key/value store, keeps network metadata consistent so that Calico's network state stays accurate;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ BGP Client (BIRD) distributes the routes that Felix writes into the kernel across the Calico network, so that containers can reach each other;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ BGP Route Reflector (RR). By default Calico runs in node-mesh mode, where every node peers with every other node. Node-mesh works well for small deployments, but at large scale the number of sessions grows very large and consumes too many resources. BGP RRs avoid this: one or more RRs distribute routes centrally, which reduces network resource consumption and improves Calico's 
efficiency and stability.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On every container host node, Calico uses the Linux kernel to implement an efficient vRouter that forwards traffic, and each vRouter uses BGP to propagate the routes of its local workloads to the whole Calico network.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e23cb3ff2228279984aed7d709cf6b41.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 2: Data path","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"III. Calico at Qunar","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Qunar's network environment, host domain names are a strong dependency. For example, Nginx/OR needs upstream members that are directly reachable by IP, which neither containers nor Kubernetes provide out of the box. So when we began evaluating Kubernetes as our container orchestrator in 2017, direct reachability of Pods was a mandatory test criterion. After a period of testing Flannel, Cilium, Calico, and other options, we chose Calico.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calico already provides networking for more than 4,000 Pods in our dedicated ESAAS cluster, and our production business clusters have adopted it as well. The main reasons for choosing Calico are:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ It is a pure Layer 3 design. With no overlay network, it is simple and controllable, and with no encapsulation or decapsulation it saves CPU while improving overall network performance. Changes in the number of containers also do not trigger ARP 
broadcast storms, and frequent container starts and stops do not disturb the network, which keeps it stable;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ Pod IPs and Service IPs are directly routable with no NAT-like intermediate step; all traffic is exchanged as plain IP packets;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"✧ It is suitable for large-scale deployment.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, Calico networking is simple, efficient, and stable, and well suited to large-scale production environments.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Qunar container cloud platform orchestrates containers with Kubernetes, and Calico is deployed as a DaemonSet on every host node in the cluster. To support large-scale use, we run Calico in RR mode: the rack switches act as route reflectors and establish BGP neighbor relationships with the host nodes. All host nodes are configured with the same AS number. Each host node advertises routes for its local container IPs to the rack switches, which in turn advertise them to the other nodes in the same AS.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e4/e4468bc758bf4235d03c2b851a1d1818.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 3: Calico BGP peer list","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Figure 3 shows, the BGP peers of host Node1 are the IP addresses of two rack switches. Two peers are configured for redundancy: if the connection to one peer fails, routing information can still go through the other peer 
to be advertised, without affecting production services.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/23/237a79853265d5f4ccd11584649744a0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 4: Overall architecture","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Figure 4 shows, every Kubernetes Node bonds two 10G NICs and connects to two rack switches at once, establishing IBGP sessions with them. The rack switches connect to the core switches over EBGP, and the Kubernetes cluster uses a single AS number. The Calico BgpPeer configuration is shown below:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ec/ec26d2d9c90cb0e1b9e7346686a49ed7.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 5: BGP peer configuration","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1. 
Calico IPAM","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calico allocates IP addresses by splitting the whole IPPool into address blocks (Blocks). Each Node gets a Block, and when a Pod is scheduled onto a Node, the Node prefers addresses from its own Block. Each host learns the routes of other hosts over BGP and advertises its own local routes, so the hosts in the cluster learn one another's routes, which Felix adds to the local routing table. As a result, every host knows which Blocks the other hosts in the cluster own.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"2. Pod-to-Pod communication within the same Kubernetes cluster","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8e/8e99bb11d2eb12a1383aa0a280894b9d.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 6: Pod traffic within the same cluster","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do Pods in the same cluster talk to each other? Follow the green line in the figure above. When Container1 sends a request to Container4, the request first enters the host network stack through the veth pair interface calixxxx. Node1's routing table then yields the route for the subnet that contains Container4's IP: 10.10.1.0/26 via 10.10.5.10 dev bond0 proto bird. Node2 advertised this route after Calico assigned it the IP Block; it means that the next hop for destinations in 10.10.1.0/26 is 10.10.5.10, which is Node2's IP address. When the packet reaches host Node2, it matches the route for the destination IP: 10.10.1.3 dev cali11239f98883 scope link. Calico added this route when Container4 was scheduled onto Node2. The packet finally enters Container4 through the veth pair cali11239f98883. The return path works the same way, so the two Pods can establish a connection and communicate.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"3. 
Communication between Pods in the Kubernetes cluster and external networks","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once Pod-to-Pod communication within a cluster is understood, communication between containers and addresses outside the cluster is easy to follow. Over the EBGP sessions between the rack switches and the core switches, the cluster's routes are advertised to the core switches, so the core switches forward every packet destined for the container network to the corresponding rack switch. When a Pod accesses an external address (for example, a GitLab instance on a network outside the cluster), the traceroute output shows the following:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/fe305b40dfc221ab525902ccc3282898.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 7: Path to an address outside the cluster","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hop 1: the IP of the host running the container","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hop 2: the gateway of the VLAN that contains the host's bond0 IP","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hops 3 and 4: the interconnect addresses between the rack switch and the core switch","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hop 5: the core switch address","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hops 6 and 7: further network device addresses, already inside the Qunar 
base network","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hop 8: the final hop, reaching the destination address","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/89/8959a469c8a3674a2dc6f7cc1e9204bb.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 8: Traffic to destinations outside the cluster","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"IV. Problems","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We also ran into some problems while using Calico. One example concerns IP allocation. Calico IPAM allocates an IP according to the following logic:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. If the node already has a bound IP Block, allocate an IP from that Block; 2. If step 1 finds no available IP, or the host has no bound IP Block, find an unbound IP Block in the IP Pool, bind it to the host, and then allocate from it; 3. If step 2 fails, look for an unused IP across all IP Blocks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This creates a problem. Once every IP Block has been assigned to a host, a container started on a newly joined host receives a free IP from a Block that already belongs to another host (IP borrowing). However, when Calico advertises BGP routes through BIRD, it installs a blackhole route for each bound IP Block, and because of that blackhole route the borrowed IP cannot be reached. On Calico versions before 3.14, our workaround was to size the IP space together with the cluster during capacity planning and to monitor IP Block usage, so that a new IP Pool could be added before the Blocks ran out. From 3.14 onward, strict IP affinity can be enabled to turn off IP borrowing
.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"V. Summary","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We have now seen how containers communicate with each other and with addresses outside the cluster. Calico offers many more advanced features and usage patterns, such as Typha mode for very large scale, which provides robust and efficient networking for larger container platforms. Going forward, we will keep adjusting the configuration based on real usage scenarios to match business needs and growth.","attrs":{}}]}]}
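The three-step IPAM allocation order described in the Problems section, and the effect of turning off IP borrowing, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Calico's implementation: the class name `IpamSketch`, the `strict_affinity` flag, the tiny /29 pool, and the /30 block size are all hypothetical and chosen so that blocks run out quickly.

```python
import ipaddress


class IpamSketch:
    """Toy model of the three-step allocation order (hypothetical, not Calico code)."""

    def __init__(self, pool_cidr, block_prefix=26, strict_affinity=False):
        pool = ipaddress.ip_network(pool_cidr)
        # Split the pool into fixed-size blocks; each block tracks its free addresses.
        self.free = {b: list(b) for b in pool.subnets(new_prefix=block_prefix)}
        self.owner = {}  # block -> name of the node the block is bound to
        self.strict_affinity = strict_affinity

    def allocate(self, node):
        # Step 1: allocate from a block already bound to this node.
        for block, owner in self.owner.items():
            if owner == node and self.free[block]:
                return self.free[block].pop(0), "own-block"
        # Step 2: bind an unclaimed block to this node, then allocate from it.
        for block in self.free:
            if block not in self.owner:
                self.owner[block] = node
                return self.free[block].pop(0), "new-block"
        # Step 3: borrow a free address from another node's block,
        # unless borrowing has been disabled.
        if not self.strict_affinity:
            for addrs in self.free.values():
                if addrs:
                    return addrs.pop(0), "borrowed"
        return None, "exhausted"


if __name__ == "__main__":
    ipam = IpamSketch("10.10.0.0/29", block_prefix=30)
    for node in ("node1", "node2", "node3"):
        print(node, ipam.allocate(node))
```

With two blocks and three nodes, the third node ends up with a borrowed address from a block owned by another node, which is the case that the blackhole route makes unreachable. Constructing the sketch with `strict_affinity=True` makes the same allocation fail instead of borrowing.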