Baidu's Large-Scale Service Mesh Adoption in Practice

{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Overview","attrs":{}},{"type":"text","text":": Baidu's earlier service governance, built on RPC frameworks, suffered from uneven capabilities across frameworks, inefficient governance within each business line, and insufficient global observability. This article describes how Baidu rolled out a service mesh internally. Taking basic stability governance and traffic scheduling governance as the business entry points, it details the overall technical design of the internal service mesh and a set of key techniques, including deep performance optimization, extended advanced policies, and surrounding service governance systems.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"1. Background","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most Baidu product lines have completed their migration to microservices, and tens of thousands of microservices place higher demands on service governance at the architecture level. Traditionally, service governance was handled inside the RPC framework. Over the years Baidu developed RPC frameworks in several languages, such as C++, Go, and PHP. Basic governance capabilities were coupled to these frameworks, and the frameworks varied in capability, which created a number of pain points and challenges for improving company-wide governance capability and efficiency:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1. Advanced architecture capabilities cannot be reused across languages and frameworks","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, one product line suffered several cascading-failure (avalanche) incidents over the past two years. The PHP and Golang frameworks it depends on had to re-implement advanced capabilities such as dynamic circuit breaking and dynamic timeouts, even though other RPC frameworks already supported them.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Likewise, each product line built its own common degradation and loss-mitigation capabilities, with widely differing interfaces. From an operations standpoint, operators want basic loss-mitigation capabilities to be generic across product lines and exposed through standardized interfaces, to lower operating costs.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2. Fault-tolerance governance has a long cycle, and coverage of basic capabilities is low","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The full rollout of chaos engineering raised the bar for architecture capabilities. Most modules lack basic tolerance for anomalies such as single-point failures and slow nodes, and pushing every module to fix this independently is costly and slow to ship.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, one product line took two quarters to complete its governance retrofit. A class of recall services in the recommendation business frequently had unreasonable timeout and retry configurations, and managing and adjusting them centrally was expensive.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3. Observability is insufficient: is there a generic mechanism to improve it across product lines?","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, one recommendation business lacked an overall view of module call chains and traffic. Production faults were located by manual experience, and bringing up a new data center took a long time, so efficiency was low.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2. What Problems Does the Service Mesh Solve?","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address these governance pain points at the root, we introduced a service mesh. The basic idea is to decouple governance capabilities from the frameworks and push them down into a sidecar. Several departments jointly built a common service mesh architecture that provides generic basic stability capabilities and a unified traffic control interface.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What do we expect the service mesh to solve for internal businesses? Two things:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1. A key component for basic stability","attrs":{}},{"type":"text","text":": it gives microservices generic basic fault tolerance, basic fault detection, and unified intervention and control interfaces;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2. The core system for traffic governance","attrs":{}},{"type":"text","text":": it provides managed connections for every product line, global traffic observability, and fine-grained scheduling.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/76/76a9254d0a0c3e5deea75920b4491764.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"Appendix, definition of service mesh: the term was publicly put forward on September 29, 2016 by William Morgan, CEO of Buoyant (the company behind Linkerd). A service mesh is an infrastructure layer for handling service-to-service communication; it is responsible for reliably delivering requests through the complex service topology of a cloud-native application. In practice, a service mesh is usually a set of lightweight network proxies deployed alongside the application but transparent to it.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"3. Technical Challenges","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"In rolling out the service mesh, we faced the following major challenges","attrs":{}},{"type":"text","text":":","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Low intrusion","attrs":{}},{"type":"text","text":": Baidu has over a hundred product lines of all sizes, with modules numbering in the tens of thousands and instances in the millions. Letting businesses migrate seamlessly without changing code, that is, low-intrusion onboarding, was our first design consideration;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· High performance","attrs":{}},{"type":"text","text":": online services in Baidu's core product lines, such as recommendation and search, are extremely latency-sensitive. A few extra milliseconds directly affect user experience and company revenue, so the business cannot accept a performance regression after joining the mesh. We therefore invested heavily in reducing the mesh's latency and the overhead of onboarding;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Heterogeneous system integration","attrs":{}},{"type":"text","text":": first, we had to make the internal multi-language frameworks interoperate; second, we had to unify interfaces and protocols and connect multiple internal governance systems, such as service discovery, traffic scheduling, and fault loss-mitigation;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Mesh reliability","attrs":{}},{"type":"text","text":": online businesses demand very high reliability, so throughout the rollout we had to account for the mesh's own stability and avoid major incidents.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"In summary, we need a service mesh architecture that is low-intrusion, high-performance, and complete in its governance capabilities","attrs":{}},{"type":"text","text":", and that solves real business problems.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"4. Overall Architecture","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Technology selection","attrs":{}},{"type":"text","text":": the foundation is the open-source Istio + Envoy stack, adapted to internal components based on our actual business scenarios. We chose to customize open source mainly to stay compatible with the community, keep to the standard open-source protocols, absorb advanced community features, and contribute back.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"The overall architecture of our internal mesh is shown below. It includes the following core components:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/83/83a7dc9fe7bb0d92bddf03bf1d62b7f0.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Mesh control center:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Onboarding center","attrs":{}},{"type":"text","text":": injects sidecars, manages sidecar versions, and serves as the single entry point for releases;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Configuration center","attrs":{}},{"type":"text","text":": the entry point for stability governance and traffic governance; it manages connection, routing, and communication policies;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Operations center","attrs":{}},{"type":"text","text":": day-to-day mesh operations, such as manually taking traffic out of hijacking;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Control plane","attrs":{}},{"type":"text","text":": the istio-pilot component, responsible for route management, communication policies, and similar functions;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Data plane","attrs":{}},{"type":"text","text":": the Envoy component, responsible for traffic forwarding, load balancing, and similar functions;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Dependencies","attrs":{}},{"type":"text","text":": integration with the internal service discovery component (naming service), adaptation of the various internal RPC frameworks, the monitoring system, and underlying PaaS support;","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Surrounding governance ecosystem","attrs":{}},{"type":"text","text":": governance systems built on the mesh's unified governance interface, such as intelligent parameter tuning, automatic fault localization and loss-mitigation, fault self-healing, and chaos engineering (fine-grained fault injection based on the mesh).","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next we walk through the key techniques: ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"onboarding, performance optimization, stability governance, traffic governance, coordination with surrounding systems, and safeguarding the mesh's own stability","attrs":{}},{"type":"text","text":".","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.1 Onboarding","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The community uses iptables to hijack traffic. Too many iptables rules cause performance problems. With tens of thousands of instances to forward to internally, iptables' linear rule matching makes forwarding latency very high, which does not suit low-latency online services.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Our approach","attrs":{}},{"type":"text","text":": use the local loopback IP address. Envoy is integrated with the internal service discovery component, service discovery requests are intercepted, and a loopback address is returned so that business traffic is transparently redirected to Envoy. Meanwhile, the local naming agent periodically health-checks Envoy. If Envoy becomes unhealthy, traffic automatically falls back to direct-connection mode, so an Envoy failure does not cause traffic loss.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f2/f20da48b9b5d2c55c6d81ce3f5c0a1fb.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For businesses that do not go through traffic hijacking, we designed a proxyless mode: the RPC framework is adapted to Istio's standard xDS and connects to Pilot's governance channel, so governance policies and parameters are managed, distributed, and applied through it. Whether or not business traffic is hijacked, governance is controlled uniformly through the mesh's standardized intervention entry point. The proxyless mode has been adapted in internal RPC frameworks such as the C++ one and is in production in business lines such as search and recommendation.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4a/4a6efafacc0eaefa6444d01840b81ade.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Summary","attrs":{}},{"type":"text","text":": with two transparent migration paths, service-discovery-based traffic hijacking and proxyless, business modules can join the mesh without code changes, which lowers the cost of onboarding.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.2 Deep Performance Optimization","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During the rollout we found that community Envoy had high latency and resource consumption. In some complex, high-fan-out scenarios, traffic hijacking added close to 5 ms of latency and consumed more than 20% CPU, which could not meet the high-throughput, low-latency needs of our online businesses. We analyzed Envoy's underlying model. ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"The root cause is that Envoy uses a single-process, multi-threaded model built on libevent: each event loop can use only one core, and one blocked callback stalls the whole thread. This easily produces high latency and makes tail latency under load hard to control.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/53/5344c34326758c4b303e2773c132b086.jpeg","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using Envoy's extension interfaces, we extended its network model and threading model and introduced bthread, the high-performance coroutine model underlying brpc. Internally we call this the high-performance brpc-envoy version. We also integrated it with Pilot so that the original libevent model and the bthread model can be switched online, and users can enable the high-performance model themselves. ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"Note: brpc is Baidu's high-performance C++ RPC framework. It is used by dozens of product lines internally, runs on millions of instances, and has been open-sourced.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In our tests, compared with the open-source community version and other industry proxies such as MOSN (developed and open-sourced by Ant Group), CPU usage dropped by more than 60%, average latency by more than 70%, and tail latency by more than 75% on average. This resolved the community Envoy's inability to meet our large-scale, high-performance production requirements and cleared the way for large-scale mesh adoption.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/52/52830f5ca4e573e7a6e15a57bea352b6.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We are also evaluating newer technologies such as eBPF and DPDK to further reduce latency and resource consumption. In current tests, eBPF improves forwarding performance by 20% over the local loopback IP path, and DPDK leaves about 30% headroom over the kernel network stack (with CPU pinning).","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.3 Stability Governance","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Online and offline services are co-located at large scale internally. The co-located production environment is complex and places high demands on each module's architectural stability. On top of the mesh we provide generic ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"fault tolerance, fault detection, and unified intervention and degradation capabilities","attrs":{}},{"type":"text","text":" to raise the stability baseline of all product lines:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3.1 Local Fault Tolerance","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To improve tolerance of everyday machine failures, we extended Envoy with advanced fault-tolerance policies. One is a dynamic retry circuit breaker: it computes latency percentiles over a sliding window and dynamically controls the retry ratio, so retries recover failed requests without the risk of a retry flood triggering an avalanche. We also introduced a feedback-based advanced load-balancing policy: based on custom error codes returned by the downstream, it lowers the weight of or blocks faulty instances, and a circuit-breaking safeguard bounds the weight changes so that healthy instances are not overloaded. After launch on our core product lines, modules tolerate local failures far better and the architecture is considerably more resilient.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f5/f560cf51eded7f7d942e9447d04ee4eb.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(See the figure below: after one core online module joined the mesh, its availability rose from two nines to four nines.)","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8abeb8b90c729656bd04d88c68baa75.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For avalanche governance (in our review of historical avalanche incidents in core product lines, more than 90% were caused by missing avalanche-governance capabilities, such as retry storms, timeout inversion, and missing degradation), we built advanced retries with circuit breaking on the mesh to suppress retry storms, and a dynamic timeout mechanism to prevent timeout inversion. After a broad rollout across core product lines, this covers more than 90% of the avalanche failure scenarios seen in the past two years, and losses from avalanche-type incidents in 2020 fell 44% compared with 2019.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3.2 Local Fault Detection","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fault detection used to rely on machine-level basic metrics, which are coarse. There were no fine-grained metrics for faulty container instances, so faulty instances could not be found promptly and typically took hours to detect. We connected the mesh to the upper-layer fault self-healing system and extended Envoy with fault-detection policies, providing generic, fast, and direct fault discovery. The external self-healing system collects faulty instances through a Prometheus interface, aggregates and analyzes them, and triggers the PaaS to migrate them. Business lines already on the mesh gain fast discovery and localization of local anomalies at almost no cost, and detection time for faulty instances dropped from hours to minutes.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ef/ef50c545b892ee60319e1fefbde9213a.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4.3.3 Unified Intervention and Degradation","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Some large-scale failures cannot be handled by the architecture's own fault tolerance and require stability contingency plans to stop the loss, a typical one being removal of weak downstream dependencies. In the past, each product line and module built its own degradation capability. Interfaces varied widely between modules, and as systems evolved the degradation capability could regress, so operating cost and difficulty were high. With the mesh we implemented generic degradation and intervention: for example, traffic dropping across multiple protocols gives a unified traffic degradation policy, and unified timeout and retry intervention takes effect within seconds.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By providing unified intervention and control interfaces across product lines, and a consistent operating interface for stability contingency plans, the mesh greatly improved governance efficiency. The governance iteration cycle of a product line shortened from quarters to months.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, in 2020 one business line joined the mesh and completed the governance retrofit of 20+ modules across four areas in two weeks, work that previously often took a full quarter.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1c/1cd02e0ce290b64f071763412c66491e.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.4 Traffic Governance","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Traffic observability:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There was long no generic solution for building upstream/downstream call chains and basic golden metrics for a product line's modules. Most were customized in the RPC framework or the business framework, and coverage of call chains and golden metrics was low. For example, one important product line involves more than 2,000 modules end to end. Its call relationships are very complex and the source of specific traffic is not transparent, which seriously hurts operational efficiency. When bringing up a data center, upstream and downstream connections were unknown and sorting them out by hand was error-prone; one build-out for a product line took nearly two months. Fault localization and capacity management likewise had to rely on experience for lack of global observability, and were very inefficient.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Our overall approach is mesh-centric: combined with the surrounding RPC frameworks, we build a global servicegraph call chain.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· On one hand","attrs":{}},{"type":"text","text":", we use Istio CRDs to express module link relationships and link attributes in an abstract form, and built our own mesh configuration center on top of Istio to hide the underlying CRD details. The configuration center is the sole entry point for connection management and manages the full call relationships of each module. When building a new data center, its topology is quickly generated from the servicegraph, which greatly improves efficiency and shortens the cycle.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· On the other hand","attrs":{}},{"type":"text","text":", combining brpc and the mesh, we defined a standard golden-metric format and built a unified golden-metric data warehouse to support higher-level governance work such as capacity analysis, fault localization, performance analysis, and fault injection. For example, the fault self-detection and loss-mitigation system we are rolling out uses the servicegraph to detect and mitigate production faults automatically, quickly, and accurately.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/72/72ca09e62a672ee244b44890a401f58d.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Fine-grained traffic scheduling:","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most internal product lines shift traffic as a whole at the entry point and have long lacked fine-grained control over traffic inside the module chain. Using the mesh's traffic scheduling capability, we connected the internal service discovery component, consolidated a series of traffic-shifting platforms, and unified the traffic scheduling entry point in the mesh control center. Combined with the global call chain from the servicegraph, this enables traffic scheduling at the level of individual module connections. We also implemented instance-level fine-grained scheduling and traffic replication (mirroring) on the mesh, typically used for fine-grained traffic evaluation of a module, offline load testing, and traffic diversion.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6c/6c99b6265fa8e33d65999b159ea3d575.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.5 Coordination with the Surrounding Ecosystem","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The mesh's unified control interface gave rise to surrounding governance systems. Typical examples are automatic tuning of governance parameters, automatic fault loss-mitigation, and fault self-healing.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Automatic parameter tuning system","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Governance parameters (timeout ratios, weight ratios, and so on) were configured by hand and depended entirely on personal experience. Unreasonable settings frequently undermined governance, and because production environments differ widely, static configuration cannot adapt to changing conditions. We designed a dynamic tuning system. Its core idea is to adjust governance parameters in real time, using the mesh's unified governance interface together with real-time feedback from production metrics. For example, it tunes the retry percentile ratio toward a downstream based on the downstream's CPU utilization, and tunes the weights toward downstream instances based on differences in machine load.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After rollout on core internal product lines, automatic tuning fully replaced manual tuning, and governance parameters now adapt on their own.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/86/86de2ce089582b948e45a44d712888b1.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Automatic fault detection and loss-mitigation system","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditionally, production faults are located by manual experience and each product line deeply customizes its contingency plans. This depends heavily on experienced engineers and is costly for newcomers to learn. The loss-mitigation steps are scattered across documents and hard to maintain; as the business iterates they may stop working or gradually regress, which is not sustainable.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Using the mesh's generic intervention capability and unified control interface, we developed an automatic contingency-plan system. Combined with the global call chain and golden metrics from the service graph described earlier, it detects common faults and executes contingency plans automatically, reducing the MTTR of fault mitigation. It is also connected to chaos engineering: faults are injected end to end on a regular basis to exercise the plans and keep them from regressing. The system is currently applied mainly to plans such as strong/weak dependency degradation and fine-grained traffic scheduling. By the end of the year, we expect most production faults in product lines on the mesh to be handled automatically.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ca/ca2bc4c29124599d9d19f3226c784d2d.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"· Unified protocols, coordinated surrounding systems","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The mesh configuration center provides standard traffic control and governance interfaces (such as the traffic degradation interface) that coordinate the surrounding ecosystem: automatic tuning, fault detection and loss-mitigation, fault self-healing, and traffic scheduling.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The open-source xDS protocol serves as the unified data-plane protocol. Surrounding RPC frameworks connect to it, so different RPC frameworks can be controlled uniformly.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/94/946eb308fb2876d854f8120acc2f9dd1.png","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4.6 Safeguarding the Mesh's Own Stability","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Key internal businesses such as search and recommendation demand extremely high stability. Migrating them to the mesh online is like “changing a tire on the highway”: the business must not be harmed. Stability was therefore one of our main concerns throughout the rollout.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"First, a multi-level fallback mechanism keeps traffic forwarding reliable","attrs":{}},{"type":"text","text":". For local failures, such as configuration or process anomalies in individual Envoy instances, Envoy has its own fallback mechanism and automatically reverts to direct-connection mode without manual intervention. For large-scale failures, such as a widespread Envoy outage, Envoy's own mechanism is no longer sufficient (traffic may flap between hijacked and non-hijacked modes). In that case an external intervention platform pushes a forwarding blacklist with one click and forces Envoy into direct-connection mode, keeping loss-mitigation time across all product lines within 5 minutes. In extreme cases, such as Envoy hanging at scale, the external intervention interface itself may fail. For this we prepared a last-resort plan that works with the PaaS to force-kill Envoy processes in bulk and fall back to direct-connection mode.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Second, for publishing governance configuration, our core idea is to control the fault isolation domain","attrs":{}},{"type":"text","text":". For example, through the mesh configuration center we control the percentage of a canary configuration release. We also built a one-stop mesh onboarding platform that releases gradually in stages, limiting the impact of Envoy upgrades on the business. A monitor module runs periodic end-to-end inspections, checking configuration consistency, abnormal Envoy nodes, version consistency, and so on.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Finally, we regularly inject faults through chaos engineering","attrs":{}},{"type":"text","text":", simulating Envoy, Pilot, and configuration center failures and running drills for extreme failure cases, so that the mesh's own stability capabilities do not regress.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"5. Summary","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The project started at the end of 2019. In less than two years it has been rolled out in dozens of internal product lines; in some core product lines more than 80% of backbone modules are covered, and managed traffic exceeds 100 billion per day. Newly onboarded modules gain basic stability governance and traffic scheduling at almost no cost. Together with the surrounding ecosystem, we built a one-stop mesh onboarding platform that gives every business line a low-intrusion, low-cost, standardized governance solution, systematically addresses basic availability problems across product lines, substantially lowers the cost and cycle of governance iteration, and improves the overall stability of the system.","attrs":{}}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}
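
Section 4.3.1 describes a dynamic retry circuit breaker: latency percentiles are computed over a sliding window, and the share of retries is capped so that retries cannot snowball into an avalanche. The sketch below is one minimal way to express that idea in Python. It is not Baidu's Envoy extension; the class name `RetryBudget`, its parameters (`window`, `percentile`, `max_retry_ratio`), and the nearest-rank percentile rule are all assumptions made for illustration.

```python
import math
from collections import deque


class RetryBudget:
    """Hypothetical sketch: retry slow requests only while a retry cap allows it."""

    def __init__(self, window=1000, percentile=95, max_retry_ratio=0.1):
        # Sliding windows over the most recent `window` completed attempts.
        self.latencies = deque(maxlen=window)  # latency of each attempt, in ms
        self.events = deque(maxlen=window)     # 1 = attempt was a retry, 0 = first try
        self.percentile = percentile           # integer percent, e.g. 95 for p95
        self.max_retry_ratio = max_retry_ratio

    def record(self, latency_ms, is_retry=False):
        """Record one completed attempt."""
        self.latencies.append(latency_ms)
        self.events.append(1 if is_retry else 0)

    def threshold_ms(self):
        """Nearest-rank percentile of the window, or None if the window is empty."""
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(self.percentile * len(ordered) / 100))
        return ordered[rank - 1]

    def retry_ratio(self):
        """Fraction of recent attempts that were retries."""
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

    def should_retry(self, elapsed_ms):
        """Allow a retry only for a slow request and only under the retry cap."""
        threshold = self.threshold_ms()
        if threshold is None or elapsed_ms <= threshold:
            return False
        return self.retry_ratio() < self.max_retry_ratio
```

A caller would invoke `record()` after every attempt and `should_retry()` when a request is still outstanding. Once retries make up `max_retry_ratio` of recent attempts, further retries are refused, which is the mechanism that keeps a slow downstream from being hit by a retry storm. A production version would need thread safety and a time-based rather than count-based window.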