What Should You Watch Out for When Creating a JedisPool Under High Concurrency?

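{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"1. Background","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On an otherwise quiet afternoon, alerts and messages in the Moji group reported one after another that the default parallel MOA cluster, /service/parallel, was misbehaving. The service maintainers checked the logs and found that the parallel task thread pool had been exhausted. A full thread pool rejects new requests outright, so a large number of business requests failed, and several features (Nearby People in the Lite edition, Gene, chat rooms, and others) entered a degraded state. The root cause of all this turned out to be one of the most familiar lines of code there is: new JedisPool().","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jedis is the most commonly used client library for accessing Redis from Java","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"2. Problem Analysis","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The slow-request log showed that a single request could block its thread for more than 10 minutes. Why would a simple new JedisPool() block a thread for that long? After building a test service, reproducing the problem, and analyzing jstack dumps, we found that the logic in JedisPool that registers objects with JMX suffers severe lock contention and blocking in certain scenarios.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Package dependencies","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Parallel MOA project -> MOA (MOARedisClient) -> MCF (RedisDao) -> Jedis (JedisPool) -> commons-pool (BaseGenericObjectPool) -> JDK (JMX)","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The problem occurs when the parallel MOA uses MOARedisClient to reach newly started instances of a downstream service, because it must then call new JedisPool() to create a connection pool for each such instance.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"new 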
JedisPool() and its interaction with JMX","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"JedisPool uses commons-pool to manage connection objects. When commons-pool creates an object pool, it registers the pool with JMX so that monitoring data about the pool can be read through the JMX interface at runtime. The registration process works as follows:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"commons-pool registers the BaseGenericObjectPool object with JMX, and JMX requires every object to have a distinct name","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"commons-pool generates distinct names by taking a shared default name and appending an auto-incrementing ID","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Every new JedisPool() starts probing from ID 1; whenever the name is already taken, it increments the ID by 1 and retries, until it finds an unused ID and the registration succeeds","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Each check of whether a name is taken goes through one shared global lock, so at any moment only one JedisPool can be checking a name ID for duplication","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The commons-pool code that iterates over IDs while trying to register objName:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/23/23d0bf187c6f24dd8477b53ca425bd8c.jpeg","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text"
,"text":"The JMX code that registers an object acquires a global lock:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c7/c7f542f33552b652f0d0df0e7c6c0f7b.jpeg","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Conditions that trigger the problem","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"The process has already created a large number of JedisPools, so many auto-increment IDs are taken (for example, 1 through 10,000)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Creating the next JedisPool then has to probe all 10,000 existing IDs before it can register under ID 10,001, and each of those 10,000 attempts takes the lock","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"When several threads create JedisPools at the same time, every thread has to probe all the IDs, and each lock acquisition along the way forces the other threads to wait instead of retrying","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Suppose one thread needs 1 second for 10,000 probes. With 100 threads each probing 10,000 times, the 1,000,000 attempts in total run serially, and because the threads take turns acquiring the lock and retrying, all 100 threads need about 100 seconds to finish","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"3. Troubleshooting Process","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Onset of the problem","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"16:14 
/service/phpmoa/v1_microvideo_index started a routine release; 16:16 the parallel task thread pool of /service/parallel filled up, and we began mitigating by scaling out and isolating instances; 16:26 the service gradually recovered","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The parallel MOA uses the MSC thread pool component. The active-thread-count monitoring shows, for each parallel MOA instance, when its thread pool filled up and when it recovered","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ca/ca77040a87a3fab956c1d6b1e80c60e6.jpeg","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The blocked threads recovered on their own, but not all at the same time. In the logs we first located the slow requests that had been blocking threads:","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"execution finished after parallel timeout: /service/phpmoa/v1_microvideo_index,isRiskFeeds, startTime = 2020-11-30 16:26:02,428, server = 10.116.88.15:20000, routeTime = 2020-11-30 16:26:02,428, blacklistTime = 2020-11-30 16:26:02,428, executeTime = 2020-11-30 16:37:21,657, timeCost = 679229","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These were precisely calls to the /service/phpmoa/v1_microvideo_index service, yet the recorded execution time reached more than 10 minutes. The slow log records the timestamp of each stage, so the time-consuming logic can be narrowed down to the interval between blacklistTime and executeTime 
, which corresponds to a single line of code: the call to the downstream service through the MOA framework's MOARedisClient","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Initial analysis","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inside MOARedisClient.exeuteByServer(), only two steps can take a long time. One is RedisFactory.getRedisDao(), which creates the connection to the downstream instance. The other is doInvoke(), which actually sends the request. The latency of doInvoke() is reported to Hubble, and no minute-level latency showed up there, so the cause was more likely in the logic that creates the RedisDao.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Investigation bottleneck","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RedisFactory.getRedisDao() had no per-stage latency monitoring, and no jstack dump was captured while the service was failing, so at this point the investigation could hardly move forward by analysis alone.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Reproducing the problem","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We went through the release history of /service/phpmoa/v1_microvideo_index and found that every release of this service caused a brief fluctuation in the errorCount of /service/parallel. We therefore inferred that restarting /service/phpmoa/v1_microvideo_index 
would reproduce the problem.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Building a test service","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Restarting the production service could trigger the failure again and affect live traffic. So we first copied the upstream and downstream projects within the production environment, published them under different ServiceUris, added a test endpoint, and generated traffic with the load-testing platform, which gave us a test environment whose call chain was essentially the same as production.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Adding monitoring","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides adding per-stage latency logs to the MOA and MCF code, we added a mechanism that automatically prints a jstack dump whenever the parallel MOA rejects requests because its thread pool is full, or whenever a request takes longer than 10 seconds.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Collecting evidence","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After tuning the simulated traffic load, we restarted the test /service/phpmoa/v1_microvideo_index 
service, and the problem reproduced. This time we obtained detailed latency data as well as the jstack dump taken after the thread pool filled up, which let us trace the problem to its root cause.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"4. Verification","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Verification on the test service","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"In the jstack dump taken after the problem reproduced, 611 threads were stuck waiting for the lock","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"After JMX was disabled on an instance, the problem no longer reproduced there, unlike on instances where JMX stayed enabled","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e9/e908a80e90ed386c334eba1b8490563d.jpeg","alt":"","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Matching the observed symptoms","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Characteristics of the parallel MOA","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"It calls a very large number of downstream services with a very large number of instances, so it has to create many JedisPools","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"While a downstream service restarts, the parallel MOA has to create a large number of new JedisPools, and many threads (up to 800) create them concurrently","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"How the incident unfolded","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"The problem appears after a downstream service is released (microvideo_index); it is worse when the released service has many instances (230), when the release is fast (2 minutes), and when several services are released at once (microvideo_index and user_location were released in the same time window)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Each parallel MOA instance recovers on its own, but the recovery times differ considerably (the duration depends on the number of IDs already taken and the number of threads creating JedisPools concurrently, both of which can vary between instances)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"During the incident, CPU usage of the parallel MOA service rose sharply (from the frequent lock acquisition)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Other parallel MOA clusters had no problem at the same time (their request volume is lower, so fewer threads create JedisPools concurrently)","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"5. Solution","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Scope of impact","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business code that uses JedisPool mostly does so through MCF's RedisDao wrapper. RedisDao is used mainly in two scenarios:","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"MomoStore","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Access to Redis data sources and to OneStore through MomoStore uses RedisDao under the hood. Because MomoStore establishes connections to new instances on a single thread after receiving an event notification, it is less affected by concurrent JedisPool creation.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"MOARedisClient","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because connections to new downstream instances are created inside business requests, callers of MOARedisClient are more likely to be affected by concurrent JedisPool creation. A service that shares the parallel MOA's characteristics (many downstream services, many instances, many requests executing in parallel) is prone to the same problem when a downstream service is released. The cases where MOARedisClient calls run longer than the configured timeout are also related to this problem.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Fix","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The simplest effective fix is to turn off JMX in the JedisPool configuration. This can be changed in one place in the MCF code and rolled out by upgrading the MCF version. Services already on Mesh are unaffected, because MOARedisClient actually talks to the downstream through 127.0.0.1 and therefore needs very few connection pools. Going forward, we will scan for all services that use MOARedisClient but are not yet on Mesh and push them to upgrade MCF to remove this risk.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Other improvements","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Add a mechanism to the MSC thread pool that automatically prints a jstack dump when the pool is full.","attrs":{}}]}]}
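The name-probing registration described in the problem analysis can be imitated with the JDK's own JMX API. The sketch below is not the commons-pool source: `JmxProbeSketch`, `DummyPool`, `registerWithProbing`, and the `sketch.pool` object-name domain are hypothetical names made up for illustration. It only shows that the number of probes needed for one registration grows with the number of names already registered.

```java
import java.lang.management.ManagementFactory;
import javax.management.InstanceAlreadyExistsException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxProbeSketch {
    /** Standard MBean interface; by JMX convention its name is the class name plus "MBean". */
    public interface DummyPoolMBean {
        int getId();
    }

    /** Hypothetical stand-in for a pool object; it holds no connections. */
    public static class DummyPool implements DummyPoolMBean {
        private final int id;
        public DummyPool(int id) { this.id = id; }
        @Override public int getId() { return id; }
    }

    /**
     * Probes baseName + 1, baseName + 2, ... on the platform MBeanServer until one
     * registration succeeds, and returns how many attempts that took.
     */
    public static int registerWithProbing(String baseName) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        int attempts = 0;
        for (int id = 1; ; id++) {
            attempts++;
            try {
                server.registerMBean(new DummyPool(id), new ObjectName(baseName + id));
                return attempts;
            } catch (InstanceAlreadyExistsException e) {
                // Name already taken: fall through and try the next id.
            } catch (JMException e) {
                // Malformed name or non-compliant MBean: not expected in this sketch.
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        String base = "sketch.pool:type=DummyPool,name=pool";
        // Occupy the first 1000 ids, then measure the cost of one more registration.
        for (int i = 0; i < 1000; i++) {
            registerWithProbing(base);
        }
        System.out.println("attempts for next pool: " + registerWithProbing(base));
    }
}
```

Each call needs one more attempt than the previous one, so creating the N-th pool costs N registration attempts, all of which contend on the MBean server when many threads do this at once. For the fix itself, commons-pool2's pool configuration exposes `setJmxEnabled(false)` (inherited by `JedisPoolConfig`), which skips this registration; whether MCF applies the fix exactly that way is an assumption, since the article does not show the change.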