如何優雅地重試

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在微服務架構中,一個大系統被拆分成多個小服務,小服務之間大量 RPC 調用,經常可能因爲網絡抖動等原因導致 RPC 調用失敗,這時候使用重試機制可以提高請求的最終成功率,減少故障影響,讓系統運行更穩定。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d5\/25\/d52fd4e7bf0ddc57ac3e5f30fafed125.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"重試的風險"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"重試能夠提高服務穩定性,但是一般情況下大家都不會輕易去重試,或者說不敢重試,主要是因爲重試有放大故障的風險。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,重試會加大直接下游的負載。如下圖,假設 A 服務調用 B 服務,重試次數設置爲 r(包括首次請求),當 B 高負載時很可能調用不成功,這時 A 調用失敗重試 B ,B 服務的被調用量快速增大,最壞情況下可能放大到 r 倍,不僅不能請求成功,還可能導致 B 的負載繼續升高,甚至直接打掛。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/dc\/35\/dcf3bb31f510d2759752199172b86735.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更可怕的是,重試還會存在鏈路放大的效應,結合下圖說明一下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c8\/d4\/c80b1e5abd470aca1375041683e151d4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設現在場景是 Backend A 調用 Backend B,Backend B 調用 DB Frontend,均設置重試次數爲 3 。如果 Backend B 調用 DB Frontend,請求 3 次都失敗了,這時 Backend B 會給 Backend A 返回失敗。但是 Backend A 也有重試的邏輯,Backend A 重試 Backend B 三次,每一次 Backend B 都會請求 DB Frontend 3 次,這樣算起來,DB Frontend 就會被請求了 9 次,實際是指數級擴大。假設正常訪問量是 n,鏈路一共有 m 層,每層重試次數爲 r,則最後一層受到的訪問量最大,爲 n * r ^ (m - 1) 。這種指數放大的效應很可怕,可能導致鏈路上多層都被打掛,整個系統雪崩。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"重試的使用成本"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外使用重試的成本也比較高。之前在字節跳動的內部框架和服務治理平臺中都沒有支持重試,在一些很需要重試的業務場景下(比如調用一些第三方業務經常失敗),業務方可能用簡單 for 循環來實現,基本不會考慮重試的放大效應,這樣很不安全,公司內部出現過多次因爲重試而導致的事故,且出事故的時候還需要修改代碼上線才能關閉重試,導致事故恢復也不迅速。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外也有一些業務使用開源的重試組件,這些組件通常會考慮對直接下游的保護,但不會考慮鏈路級別的重試放大,另外需要業務方修改 RPC 調用代碼才能使用,對業務代碼入侵較多,而且也是靜態配置,需要修改配置時都必須重新上線。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於以上的背景,爲了讓業務方能夠靈活安全的使用重試,我們字節跳動直播中臺團隊設計和實現了一個重試治理組件,具有以下優點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"能夠在鏈路級別防重試風暴。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"保證易用性,業務接入成本小。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"具有靈活性,能夠動態調整配置。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面介紹具體的實現方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"重試治理"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"動態配置"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何讓業務方簡單接入是首先要解決的問題。如果還是普通組件庫的方式,依舊免不了要大量入侵用戶代碼,且很難動態調整。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"字節跳動的 Golang 開發框架支持中間件 (Milddleware) 模式,可以註冊多個自定義 Middleware 並依次遞歸調用,通常是用於完成打印日誌、上報監控等非業務邏輯,能夠有效將業務和非業務代碼功能進行解耦。因此我們決定使用 Middleware 的方式來實現重試功能,定義一個 Middleware 並在內部實現對 RPC 的重複調用,把重試的配置信息用字節跳動的分佈式配置存儲中心存儲,這樣 Middleware 中能夠讀取配置中心的配置並進行重試,對用戶來說不需要修改調用 RPC 的代碼,而只需要在服務中引入一個全局的 Middleware 即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下面的整體架構圖所示,我們提供配置的網頁和後臺,用戶能夠在專門進行服務治理的頁面上很方便的對 RPC 進行配置修改並自動生效,內部的實現邏輯對用戶透明,對業務代碼無入侵。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/48\/32\/4827f9c6bb2d6fecb0e07d60843d9a32.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"配置的維度按照字節跳動的 RPC 調用特點,選定 [調用方服務,調用方集羣,被調用服務, 被調用方法] 爲一個元組,按照元組來進行配置。Middleware 中封裝了讀取配置的方法,在 RPC 調用的時候會自動讀取並生效。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種 Middleware 的方式能夠讓業務方很容易接入,相對於之前普通組件庫的方式要方便很多,並且一次接入以後就具有動態配置的能力,可能很方便地調整或者關閉重試配置。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"退避策略"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"確定了接入方式以後就可以開始實現重試組件的具體功能,一個重試組件所包含的基本功能中,除了重試次數和總延時這樣的基礎配置外,還需要有退避策略。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於一些暫時性的錯誤,如網絡抖動等,可能立即重試還是會失敗,通常等待一小會兒再重試的話成功率會較高,並且也可能打散上游重試的時間,較少因爲同時都重試而導致的下游瞬間流量高峯。決定等待多久之後再重試的方法叫做退避策略,我們實現了常見的退避策略,如:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"線性退避:每次等待固定時間後重試。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨機退避:在一定範圍內隨機等待一個時間後重試。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"指數退避:連續重試時,每次等待時間都是前一次的倍數。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"防止 retry storm"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何安全重試,防止 retry storm 是我們面臨的最大的難題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"限制單點重試"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先要在單點進行限制,一個服務不能不受限制的重試下游,很容易造成下游被打掛。除了限制用戶設定的重試次數上限外,更重要的是限制重試請求的成功率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實現的方案很簡單,基於斷路器的思想,限制 請求失敗\/請求成功 的比率,給重試增加熔斷功能。我們採用了常見的滑動窗口的方法來實現,如下圖,內存中爲每一類 RPC 調用維護一個滑動窗口,比如窗口分 10 個 bucket ,每個 bucket 裏面記錄了 1s 內 RPC 的請求結果數據(成功、失敗)。新的一秒到來時,生成新的 bucket ,並淘汰最早的一個 bucket ,只維持 10s 的數據。在新請求這個 RPC 失敗時,根據前 10s 內的 失敗\/成功 是否超過閾值來判斷是否可以重試。默認閾值是 0.1 ,即下游最多承受 1.1 倍的 QPS ,用戶可以根據需要自行調整熔斷開關和閾值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/46\/84\/467084d7754a7ce5a03d25acb0aa9084.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"限制鏈路重試"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面說過在多級鏈路中如果每層都配置重試可能導致調用量指數級擴大,雖然有了重試熔斷之後,重試不再是指數增長(每一單節點重試擴大限制了 1.1 倍),但還是會隨着鏈路的級數增長而擴大調用次數,因此還是需要從鏈路層面來考慮重試的安全性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"鏈路層面的防重試風暴的核心是限制每層都發生重試,理想情況下只有最下一層發生重試。Google SRE 中指出了 Google 內部使用特殊錯誤碼的方式來實現:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"統一約定一個特殊的 status code ,它表示:調用失敗,但別重試。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"任何一級重試失敗後,生成該 status code 並返回給上層。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上層收到該 status code 後停止對這個下游的重試,並將錯誤碼再傳給自己的上層。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種方式理想情況下只有最下一層發生重試,它的上游收到錯誤碼後都不會重試,鏈路整體放大倍數也就是 r 倍(單層的重試次數)。但是這種策略依賴於業務方傳遞錯誤碼,對業務代碼有一定入侵,而且通常業務方的代碼差異很大,調用 RPC 的方式和場景也各不相同,需要業務方配合進行大量改造,很可能因爲漏改等原因導致沒有把從下游拿到的錯誤碼傳遞給上游。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"好在字節跳動內部用的 RPC 協議中有擴展字段,我們在 Middleware 中做了很多嘗試,封裝了錯誤碼處理和傳遞的邏輯,在 RPC 的 Response 擴展字段中傳遞錯誤碼標識 nomore_retry ,它告訴上游不要再重試了。Middleware 完成錯誤碼的生成、識別、傳遞等整個生命週期的管理,不需要業務方修改本身的 RPC 邏輯,錯誤碼的方案對業務來說是透明的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a3\/10\/a348d0b1cdbb949ab9e2b332e167ed10.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在鏈路中,推進每層都接入重試組件,這樣每一層都可以通過識別這個標誌位來停止重試,並逐層往上傳遞,上層也都停止重試,做到鏈路層面的防護,達到“只有最靠近錯誤發生的那一層才重試”的效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"超時處理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在測試錯誤碼上傳的方案時,我們發現超時的情況可能導致傳遞錯誤碼的方案失效。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於 A -> B -> C 的場景,假設 B -> C 超時,B 重試請求 C ,這時候很可能 A -> B 也超時了,所以 A 沒有拿到 B 返回的錯誤碼,而是也會重試 B , 這個時候雖然 B 重試 C 且生成了重試失敗的錯誤碼,但是卻不能再傳遞給 A 。這種情況下,A 還是會重試 B ,如果鏈路中每一層都超時,那麼還是會出現鏈路指數擴大的效應。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此爲了處理這種情況,除了下游傳遞重試錯誤標誌以外,我們還實現了“對重試請求不重試”的方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於重試的請求,我們在 Request 中打上一個特殊的 retry flag ,在上面 A -> B -> C 的鏈路,當 B 收到 A 的請求時會先讀取這個 flag 判斷這個請求是不是重試請求,如果是,那它調用 C 即使失敗也不會重試;否則調用 C 失敗後會重試 C 。同時 B 也會把這個 retry flag 下傳,它發出的請求也會有這個標誌,它的下游也不會再對這個請求重試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a6\/ed\/a6289160f637943122f62b1ae68d59ed.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣即使 A 因爲超時而拿不到 B 的返回,對 B 發出重試請求後,B 能感知到並且不會對 C 重試,這樣 A 最多請求 r 次,B 最多請求 r + r - 1,如果後面還有更下層次的話,C 最多請求 r + r + r - 2 次, 第 i 層最多請求 i * r - (i-1) 次,最壞情況下是倍數增長,不是指數增長了。加上實際還有重試熔斷的限制,增長的幅度要小很多。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過重試熔斷來限制單點的放大倍數,通過重試錯誤標誌鏈路回傳的方式來保證只有最下層發生重試,又通過重試請求 flag 鏈路下傳的方式來保證對重試請求不重試,多種控制策略結合,可以有效地較少重試放大效應。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"超時場景優化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分佈式系統中,RPC 請求的結果有三種狀態:成功、失敗、超時,其中最難處理的就是超時的情況。但是超時往往又是最經常發生的那一個,我們統計了字節跳動直播業務線上一些重要服務的 RPC 錯誤分佈,發現佔比最高的就是超時錯誤,怕什麼偏來什麼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在超時重試的場景中,雖然給重試請求添加 retry flag 能防止指數擴大,但是卻不能提高請求成功率。如下圖,假如 A 和 B 的超時時間都是 1000ms ,當 C 負載很高導致 B 訪問 C 超時,這時 B 會重試 C ,但是時間已經超過了 1000ms ,時間 A 這裏也超時了並且斷開了和 B 的連接,所以 B 這次重試 C 不管是否成功都是無用功,從 A 的視角看,本次請求已經失敗了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/9b\/c6\/9bcedfb5f6a4e946f84964d58bb819c6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種情況的本質原因是因爲鏈路上的超時時間設置得不合理,上游和下游的超時時間設置的一樣,甚至上游的超時時間比下游還要短。在實際情況中業務一般都沒有專門配置過 RPC 的超時時間,所以可能上下游都是默認的超時,時長是一樣的。爲了應對這種情況,我們需要有一個機制來優化超時情況下的穩定性,並減少無用的重試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖,正常重試的場景是等拿到 Resp1 (或者拿到超時結果) 後再發起第二次請求,整體耗時是 t1 + t2 。我們分析下,service A 在發出去 Req1 之後可能等待很長的時間,比如 1s ,但是這個請求的 pct99 或者 pct999 可能通常只有 100ms 以內,如果超過了 100ms ,有很大概率是這次訪問最終會超時,能不能不要傻等,而是提前重試呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/4f\/0b\/4f4cdf4a1454f34c34932650c2a42e0b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於這種思想,我們引入並實現了 Backup Requests 的方案。如下圖,我們預先設定一個閾值 t3(比超時時間小,通常建議是 RPC 請求延時的 pct99 ),當 Req1 發出去後超過 t3 時間都沒有返回,那我們直接發起重試請求 Req2 ,這樣相當於同時有兩個請求運行。然後等待請求返回,只要 Resp1 或者 Resp2 任意一個返回成功的結果,就可以立即結束這次請求,這樣整體的耗時就是 t4 ,它表示從第一個請求發出到第一個成功結果返回之間的時間,相比於等待超時後再發出請求,這種機制能大大減少整體延時。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/81\/a8\/81decb36bdd880cb9dyy429c1c36aca8.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實際上 Backup Requests 是一種用訪問量來換成功率 (或者說低延時) 的思想,當然我們會控制它的訪問量增大比率,在發起重試之前,會爲第一次的請求記錄一次失敗,並檢查當前失敗率是否超過了熔斷閾值,這樣整體的訪問比率還是會在控制之內。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"結合 DDL"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Backup Requests 的思路能在縮短整體請求延時的同時減少一部分的無效請求,但不是所有業務場景下都適合配置 Backup Requests ,因此我們又結合了 DDL 來控制無效重試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DDL 是“ Deadline Request 調用鏈超時”的簡稱,我們知道 TCP\/IP 協議中的 TTL 用於判斷數據包在網絡中的時間是否太長而應被丟棄,DDL 與之類似,它是一種全鏈路式的調用超時,可以用來判斷當前的 RPC 請求是否還需要繼續下去。如下圖,字節跳動的基礎團隊已經實現了 DDL 功能,在 RPC 請求調用鏈中會帶上超時時間,並且每經過一層就減去該層處理的時間,如果剩下的時間已經小於等於 0 ,則可以不需要再請求下游,直接返回失敗即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/24\/94\/24c01edf1bc25b601d18e65b5516aa94.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DDL 的方式能有效減少對下游的無效調用,我們在重試治理中也結合了 DDL 的數據,在每一次發起重試前都會判斷 DDL 的剩餘值是否還大於 0 ,如果已經不滿足條件了,那也就沒必要對下游重試,這樣能做到最大限度的減少無用的重試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"實際的鏈路放大效應"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之前說的鏈路指數放大是理想情況下的分析,實際的情況要複雜很多,因爲有很多影響因素:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
策略說明
重試熔斷請求失敗 \/ 成功 > 0.1 時停止重試
鏈路上傳錯誤標誌下層重試失敗後上傳錯誤標誌,上層不再重試
鏈路下傳重試標誌重試請求特殊標記,下層對重試請求不會重試
DDL當剩餘時間不夠時不再發起重試請求
框架熔斷微服務框架本身熔斷、過載保護等機制也會影響重試效果"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"各種因素綜合下來,最終實際方法情況不是一個簡單的計算公式能說明,我們構造了多層調用鏈路,在線上實際測試和記錄了在不同錯誤類型、不同錯誤率的情況下使用重試治理組件的效果,發現接入重試治理組件後能夠在鏈路層面有效的控制重試放大倍數,大幅減少重試導致系統雪崩的概率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上所述,基於服務治理的思想我們開發了重試治理的功能,支持動態配置,接入方式基本無需入侵業務代碼,並使用多種策略結合的方式在鏈路層面控制重試放大效應,兼顧易用性、靈活性、安全性,在字節跳動內部已經有包括直播在內的很多服務接入使用並上線驗證,對提高服務本身穩定性有良好的效果。目前方案已經被驗證並在字節跳動直播等業務推廣,後續將爲更多的字節跳動業務服務。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文章轉載自:字節跳動技術團隊(ID:toutiaotechblog)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/6IkTnUbBlHjM3GM_bT35tA","title":"xxx","type":null},"content":[{"type":"text","text":"如何優雅地重試"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章