愛奇藝容器實踐

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近日,愛奇藝技術產品團隊舉辦了“i技術會”線下技術沙龍,本次技術會的主題是“雲原生落地探索與實踐”,邀請快手、百度和字節跳動的技術專家,與愛奇藝技術產品團隊共同分享與探討雲原生落地的實踐經驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中,來自愛奇藝的技術專家趙慰爲大家帶來了愛奇藝容器實踐的分享,講述了愛奇藝的容器應用場景和在容器網絡、容器運行時方面的實踐經驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本場分享的主要內容包括愛奇藝近年來在容器方面的實踐經驗和遇到過的問題,以及我們在選型和探索過程中的一些心路歷程。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"愛奇藝容器應用場景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/59\/59911a2f49bb0eb4ed44ea30b6ddf03e.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"愛奇藝內部超過一半的應用實例正在以容器形式運行,這些容器基本都運行在物理機集羣上面。最初愛奇藝採用Mesos框架、基於Mesos的開源服務調度框架Marathon和自研批處理框架Sisyphus,並基於Marathon研發了QAE(iQIYI App Engine)。當時,大量公司以Mesos\/Marathon爲基礎做着類似的事情。前不久,有消息表明Mesos即將從Apache光榮退休,而Twitter貢獻的著名Mesos調度框架Aurora項目也早在去年二月份就已進入Apache退役名單。對於Mesos的告別,我們感到非常遺憾,同時也加快了向Kubernetes轉型的腳步。愛奇藝在Kubernetes方面起步較晚,之後除提供原生K8S服務之外,我們還會進一步在K8S基礎上提供不同抽象級別的應用引擎,包括Serverless、FaaS、workflow等,希望以後能夠有機會和大家分享我們後續的工作。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"容器網絡實踐"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"愛奇藝容器網絡應用的發展如圖所示:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/92\/92cb0afc01c4f78ede6f0e8286e1336d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在Mesos框架用的最多的是Docker原生的local Bridge + NAT,於2014年投入使用,現在還在大量運行;中途嘗試過Calico,然而當時Docker和K8S等組織有關容器網絡標準的爭論導致我們無法全力投入某種技術,加之管理難度的限制,最終導致Calico方案並沒有在愛奇藝大規模應用;後來發展K8S時,由於之前NAT策略在應用上不太友好,所以我們定了一個基礎方向,將容器網絡跟內網打通,並選了一些具體實現方法,比如VXLAN、Calico\/CNI。後來Cilium出現,並在一些公司得到應用,大家比較感興趣;既然我們起步已經晚了,倒不如直接採用一些激進的新方案奮起直追。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可能有些同學對CNM(Container Network Model,容器網絡模型)和CNI(Container Network Interface,容器網絡接口)的容器標準之爭不太瞭解,下面做一個簡單的解釋:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Docker將自己的網絡方面剝離出一個獨立項目libnetwork並提出了CNM,定義了網絡、接入點等概念以及創建網絡、加入網絡等操作,允許第三方插件按照標準與Docker對接;CoreOS則提出了更爲簡單的CNI,只定義了往網絡裏面加減容器的接口標準。當時對於兩款接口的使用,兩家公司各執一詞。對於K8S方來說,他們覺得CNM的接口和Docker的結合過於緊密,所定的用戶操作標準也過於複雜,CNI相比CNM來說更加安全、簡單、松耦合;Docker方對此的官方回覆是,K8S等社區的意見和建議不符合Docker的整體設計。最終K8S選擇了CNI作爲網絡方案。現在CNI在行業應用中佔據主導地位,但對於插件來說,無論對哪個接口,所做的工作內容都是差不多的,例如CNM要求網絡配置要通過Docker的libKV存儲,而CNI沒有這方面的要求,但對於網絡插件來說,它的存儲總是要有的,只不過由插件自己管理或由K8S統一管理對於K8S用戶來說更加方便。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(1)Bridge + NAT "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"回到Mesos環境中。由於當時Mesos還沒有很好地支持CNI,而愛奇藝在使用CNM中遇到了很多管理和運維上的困難,最簡單可靠的Bridge + NAT方案成爲了唯一選擇。當時我們想的也比較簡單、理想化,認爲不管走NAT還是不走NAT,在服務註冊發現把這些信息掩飾掉就好。但在長期的實踐中,這層NAT還是給我們的運維工作帶來了無比多的煩惱。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們遇到的常見問題包括RPC暴露服務地址、排障時IP PORT查應用不能正常使用、Nginx Keepalive失效等。另外還有一些偶發問題比如網卡無法釋放、IP衝突等等,但整體來說還是比較可靠。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Nginx Keepalive失效是這幾個問題裏面比較棘手的一個。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"問題"},{"type":"text","text":":多個 RS,通過 Nginx 代理請求,QPS 極低,偶發 502,有一定規律復現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"解決"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)抓包:——Nginx:502 時,直接收到 RST;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"——RS 容器內:中間發送過 FIN,502 時沒有包。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)解決:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"推測幾個可能性:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"——iptables 有 bug——然而相關文章表明,這種情況只會小概率出現並不會穩定出現,不太符合這個 bug 的現象,而且該服務使用短鏈接訪問 FIN 正常,故排除此原因;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"——bridge 網絡問題:小概率出現,同樣不符合;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"——iptables NAT 規則問題:繼續抓包,從RS容器發出的FIN包入手,主機上被沒有處理,進而發現時主機NAT表裏已經沒有了Nginx到RS的連接轉換規則,聯繫到低頻請求的情況,最終判斷爲是RS空閒超時後主動斷開keepalive連接,而Nginx並不知情仍嘗試使用舊連接導致訪問失敗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個問題沒有完美的解決方案,緩解思路有幾個:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)將內核參數net.netfilter.nf_conntrack_tcp_timeout_established調大,使NAT規則容忍的空閒時間超過RS容忍的空閒連接時間;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)Nginx或其他客戶端使用Keepalive時使用TCP心跳等機制維持連接;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)全部使用短連接請求。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(2)Bridge\/CNI + VXLAN"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"應用K8S之後,一開始愛奇藝嘗試了Bridge\/CNI + VXLAN。對於二層網絡,K8S官方至今沒有給出最佳實踐方案。我們只瞭解到類似GCE等公有云通過這種方式使用,但整個工作應該和Docker早期一個叫pipework的工具差不多,還是比較簡單好用的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在應用過程中也遇到過一些問題:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"問題"},{"type":"text","text":":該模式下,Pod內訪問Service IP的請求如果被轉發到同節點的實例,則收不到響應。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"解決"},{"type":"text","text":":這個問題是也和iptables有關,如果內核參數net.bridge.bridge-nf-call-iptables 值爲1,則數據包會經過主機iptables處理並遺憾丟失。將參數開機置爲0後,仍發現各節點的參數值爲1;經過一系列排查,最終發現是Docker啓動時會加載除bridge外還加載br_netfilter模塊,同時修改參數爲1。於是將br_netfilter設置爲開機加載,加之內核參數開機置0,問題即被解決。另一個解決思路是,放棄Docker,更換爲Containerd。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(3)Cilium\/CNI + BGP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cilium這部分是比較冒險的一個嘗試,它的工作原理簡單而言,就是把一些工作通過eBPF機制實現到內核裏。這種程序本身不會比普通程序執行更快,但它大大縮短了執行路徑。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8a\/8a5b965d757ab08576b9cb4e71d693ea.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(圖片來自網絡)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與Bridge\/CNI + VXLAN相比,Cilium\/CNI + BGP涉及到整個基礎網絡環境的改動。其中IPAM和BGP與網絡規劃密切相關。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"IPAM的思路一般有完全分佈式的CIDR per host或集中式的CIDR per IDC、global CIDR幾種。各種選擇都有利弊,比如CIDR per host的路由表雖然簡單,但IP漂移侷限性較大,並且會因爲碎片化浪費大量IP;而CIDR per IDC、global CIDR方案IP漂移侷限性小,也能節省IP,但會導致路由表變得即爲龐大難以維護。經協調,我們最終決定按照TOR做規劃,儘可能在路由複雜度、IP資源與靈活性之間做出權衡。具體的網絡規劃涉及保密信息,此處不便展開。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BGP配置涉及到交換機和主機較大的改動。宿主機延續了之前的雙物理網卡HA設計,通過分別於兩個交換機建立BGP連接,並建立等價路由,既能提升網絡帶寬又能在一定程度保障網絡高可用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/29\/29349979852233b29503a593cec5ca5a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/22\/228648a7fc25cdbe671d169bbe871eb8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(4)Bridge\/CNI + Cilium\/CNI混合部署"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"問題"},{"type":"text","text":":——Cilium + BGP 改造週期長,不能及時滿足交付時間;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        ——Cilium 技術本身潛在風險,需要準備快速恢復的方案;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"        ——存量 Bridge 集羣平滑遷移到 Cilium。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"解決"},{"type":"text","text":":——Bridge、Cilium 在內網網段統一規劃,跨節點走交換機;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"      ——爲不同網絡節點打標籤,通過 Daemon Set 部署 CNI Agent。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9b\/9b014d262963aacd4f7072dd1ef6a8db.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖,綠色線是二層網絡,紅色三層數據通路。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"容器運行時"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在容器運行時這方面,愛奇藝也做過很多的嘗試。使用最早也是最多的當然是Docker;有一段時間曾嘗試過Mesos Unified Container;最後在K8S環境裏選擇了Containerd + RunC\/Kata。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Docker Daemon早期的工作模式和穩定性爲很多使用者所詬病;後來我們爲更好地保持集羣和容器狀態一致性,嘗試使用Mesos推出的Unified Container,然而遇到的問題並不比Docker少太多。由於當時鏡像、容器的存儲可靠性和使用效率不盡人意,以及缺少排障工具等等原因,這段嘗試最終告一段落,當然現在還有一些特殊的應用場景仍在沿用這個環境。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這篇文章要分享的方案主要還是"},{"type":"text","marks":[{"type":"strong"}],"text":"Containerd + RunC\/Kata"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"使用容器時經常要被以下幾個問題困擾"},{"type":"text","text":":"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,容器的隔離性並不充分,容器之間會有一定程度的相互影響,很多情況下,一個容器出問題會導致整個主機的運行癱瘓。比如我們之前遇到過一些問題,一個容器中的進程數飆升導致整個宿主機load升高,並伴隨着高頻的線程切換,這種情況cgroup的簡單限制無法幫助其他容器正常運行;另外還遇到過一些低級錯誤,例如請求完忘記關掉連接,導致很快耗盡FD進而整機故障;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次,容器內檢測到的資源通常爲宿主機的資源,比如有些JAVA程序通過檢測CPU、內存來自動適配自己的線程數等運行狀態。當然Java等也都在逐漸往適配容器的方向發展,但效果還不是太理想;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後是安全需求,比如各種資源的可見性、可訪問性控制。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決以上問題的一個常見思路是將容器放在虛擬機內運行。實際應用中要突破兩個點,一是虛擬機要儘可能輕量、啓動快,二是與方便使用、與Kubernetes等方便集成。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"容器運行時的實踐"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(1)Kata"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/115e34dd2e85d801ef82ab39c07539f3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(圖片來自網絡)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"愛奇藝接觸英特爾Clear Containers比較早,但一直沒有正式應用過,只是做了一些簡單的試驗,類似還包括HYPER;兩個項目最後合併成了Kata Containers,將原來的虛機進行最大程度的精簡但不影響使用。Kata同樣遵守OCI標準,能夠與外部控制器交互,比如Containerd或者自身CLI。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4b\/4bc7fe96be0d953c62de779d57c32ff9.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖注:測試Kata與RunC的性能"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們對Kata進行了一系列基準測試。根據測試結果,可以看出兩者的啓動時間相差不多,Kata一秒多一點,RunC是半秒,在實際應用中對毫秒級啓動的需求也沒有那麼多,是完全可以接受的差距。而CPU測試是使用了經典的素數計算,計算2,000,000個素數,8線程,運行60秒,看能運轉多少輪;內存測試的大小爲100G。整體來看Kata比RunC稍微慢了一點點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值得注意的是,實際應用場景中表現出來的性能通常與基準測試有一定差距。比如我們深度學習的圖片推理場景進行測試,Kata相比RunC有5%到10%的損耗,勉強在可接受範圍之內,但是如果要規模應用,還需要進一步優化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不過,Kata本身因爲套了一層虛機,實際應用還有一些限制,列舉如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·需要 CPU 開啓 vmx 虛擬化支持"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·不支持 host 網絡模式"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·不支持加入其它容器 network namespace"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·不支持 docker checkpoint、restore 功能"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·不完全支持 events,例如不支持 OOM notification"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·Update command 不支持 block IO weight"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·不支持 docker run 參數 --shm-size 設定共享內存"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"·不支持 docker run 參數 –sysctl等"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在實際應用時,除了不支持host網絡模式偶爾對我們有一些影響之外,其他限制基本沒有影響。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(2)gVisor"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/db\/db7a2310d1de4a09a3a9f4bc12f6d583.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(圖片來自網絡)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"gVisor是輕量虛擬化的另一個選項,由谷歌推出,使用Golang開發了一個“高仿”內核。在極致性能的同時,也帶來了一些兼容性問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這裏放上它的官方文檔和兼容性描述連接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"官方文檔:https:\/\/gvisor.dev\/docs\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兼容性:https:\/\/gvisor.dev\/docs\/user_guide\/compatibility\/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"愛奇藝沒有選擇gVisor的原因是很多目前生產環境中的必需的工具它暫時無法完美支持,比如IP、SSHD、netstat等命令。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"(3)Containerd + RunC\/Kata"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Containerd與RunC的關係:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0f\/0f4584fbb8eb43a67efcd90b33f1a7c3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(圖片來自網絡)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"簡單解釋一下,OCI是Docker領銜推出的開放容器標準,定義了鏡像、容器運行時、鏡像倉庫規範。只要Kata和RunC都按照OCI去實現接口,就基本可以直接進行替換。替換過程如果用Docker會繁瑣一些,但是用Containerd會非常簡單。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f7\/f77028597d1a98362092fa19dd095953.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"(圖片來自網絡)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前一段時間,Kubernetes自v1.20之後將Docker標記爲Deprecated狀態的聲明一度引起恐慌;但從目前的情況來看,使用Containerd替換Docker + Shim已經是一個很簡便的操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了以上基礎,在Kubernetes中使用Kata就變得很簡單。首先要有一個K8S的Runtime Class,在Containerd上加上Runtime Class的相應配置:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3e\/3e4a571d242272686c50e0335504df45.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後在Pod Spec中指定Kata:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f1\/f1f4d092ab5323fdbb1a30e4121e4c99.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"應用場景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"互聯網行業普遍面臨的一個問題是服務器資源利用率比較低。即使在午高峯、晚高峯等時間段,整體利用率也不盡如人意。截圖是某個集羣24h內的CPU利用率監控統計。如圖所示,紅色框中的突起部分,是我們稍微做了一些工作去解決這個問題,比如晚上去運行一些任務;由於是隨機選取一天,看起來比日常平均優化效果略差一些。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/66\/66d2943801ca3949a16f2fa2f694b969.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖所示,目前方案以Mesos爲主,同時管理KVM和Docker宿主機。虛機的部分比較簡單粗暴,在每臺宿主機上使用空閒資源(按照實際利用率)創建合適規格的虛擬機,在夜間啓動、凌晨關閉;Docker部分則利用Mesos靈活的資源超賣能力,爲超賣資源分配單獨的Role。兩種資源統一由Mesos管理,並交由任務調度系統使用,爲避免影響到常規負載,調度時間同樣控制在半夜1點到6點。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/42\/42554875eea08d18b9118bc82050d1e2.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在K8S + Kata\/RunC的環境中,事情會變得簡單一些。這得益於K8S提供的比較可靠的縱向、橫向伸縮能力,和Kata提供的較強的性能隔離特性。虛機由於創建時對資源調度的控制並不精確,白天使用可能會對常規虛機性能造成較大影響,因此仍然保持僅在夜間運行。容器方面RunC的部分就是正常的調度,而Kata用來運行一些轉碼等離線任務;整體資源由K8S統一控制,避免多個資源和任務管理框架在服務器控制上產生衝突。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d9\/d9a6f88393ea60e80cd262afcf873b30.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"愛奇藝在K8S在線離線混部方面的工作纔剛剛起步,希望在過去Mesos混部的基礎上做出精細化的運作,並進一步提升服務器利用率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:愛奇藝技術產品團隊(ID:iQIYI-TP)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/d0V80Rg3buuowhRgAy-YJQ","title":"xxx","type":null},"content":[{"type":"text","text":"愛奇藝容器實踐"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章