解Bug之路-串包Bug

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"解Bug之路-串包Bug"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"筆者很熱衷於解決Bug,同時比較擅長(網絡/協議)部分,所以經常被喚去解決一些網絡IO方面的Bug。現在就挑一個案例出來,寫出分析思路,以饗讀者,希望讀者在以後的工作中能夠少踩點坑。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"串包Bug現場"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"前置故障Redis超時"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於某個系統大量的hget、hset操作將Redis拖垮,通過監控發現Redis的CPU和IO有大量的尖刺,CPU示意圖下圖所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4f/4fd22e8304c18ea1c8df87cfe8ec260d.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"CPU達到了100%,導致很多Redis請求處理不及時,其它業務系統都頻繁爆出readTimeOut。此時,緊急將這個做大量hget、hset的系統kill,過了一段時間,Redis的CPU恢復平穩。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一波未平,一波又起"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"就在我們以爲事件平息的時候,線上爆出登錄後的用戶名稱不正確。同時錯誤日誌裏面也有大量的Redis返回不正確的報錯。尤爲奇葩的是,系統獲取一個已經存在的key,例如get User123456Name,返回的竟然是redis的成功返回OK。示意圖如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"Jedis.sendCommand:get User123456Name\nJedis.return:OK\n or\nJedis.sendCommand:get User123456Name\nJedis.return:user789\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們發現此情況時,聯繫op將Redis集羣的所有Key緊急delete,當時監控示意圖:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d2/d28fc449fccb43ebc3ba5ceb7a589f7b.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當重啓後,我們再去線上觀察的時候,發現錯誤依然存在,神奇的是,這種錯誤發生的頻率會隨着時間的增加而遞減。到最後刷個10分鐘頁面纔會出現這種錯,示意圖如下所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f7/f7cc61cf84e43a4447966c9a77470d16.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"既然如此,那隻能祭出重啓大法,把出錯的業務系統全部重啓了一遍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"重啓之後,線上恢復正常,一切Okay。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Bug覆盤"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此次Bug是由Redis本身Server負載太高超時引起的。Bug的現象是通過Jedis去取對應的Key值,得不到預期的結果,簡而言之包亂了,串包了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"縮小Bug範圍"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先:Redis是全球久經考驗的系統,這樣的串包不應該是Redis的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二:Redis刷新了key後Bug依然存在,而業務系統重啓了之後Okay。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三:筆者在錯誤日誌中發現一個現象,A系統只可能打印出屬於A系統的json串結構(redis存的是json)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"很明顯,是業務系統的問題,如果是Redis本身的問題,那麼在很大概率上A系統會接收到B系統的json串結構。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"業務系統問題定位"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業務系統用的是Jedis,這同樣也是一個久經考驗的庫,出現此問題的可能性不大。那麼問題肯定是出在運用Jedis的姿勢上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"於是筆者找到了下面一段代碼:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"public Object invoke(Object proxy,Method method,Object[] args) throws Throwable{\n JedisClient jedisClient = jedisPool.getResource(); \n try{\n return method.invoke(jedisClient,args); \n } catch(Exception e){\n logger.error(\"invoke redis error\",e); \n throw e; \n }finally {\n if(jedisClient != null){\n // 問題處在下面這句\n jedisPool.returnResource(jedisClient);\n }\n }\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當時我就覺得很奇怪,筆者自己寫的,閱讀過的連接池的代碼都沒有將拋異常的連接放回池裏。就以Druid爲例,如果是網絡IO等fatal級別的異常,直接拋棄連接。這裏把jedisClient連接返回去感覺就是出問題的關鍵。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Bug推理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"筆者意識到,之所以串包可能是由於jedisClient裏面可能有殘餘的數據,導致讀取的時候讀取到此數據,從而造成串包的現象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"串包原因"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"正常情況下的redis交互"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先上Jedis源碼"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"public String get(final String key) {\n checkIsInMulti();\n client.sendCommand(Protocol.Command.GET, key);\n return client.getBulkReply();\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Jedis本身用的是Bio,上述源碼的過程示意圖如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a51c500eb4f4a4b0b7da5c41ee7b9de5.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"出錯的業務系統的redis交互"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/46/46874b44664ee621316ece381217a950.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於Redis本身在高負載狀態,導致沒能及時相應command請求,從而導致readTimeOut異常。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"複用這個出錯鏈接導致出錯"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Redis響應了上一個command後,把數據傳到了對應command的socket,進而被inputream給buffer起來。而這個command由於超時失敗了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/32/32438b067c5d1bf511404c902963cd9c.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣,inputStream裏面就有個上個命令留下來的數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下一次業務操作在此拿到這個連接的時候,就會出現下面的情況。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3e/3e5f174f75e8ec6c3902a3f1b0813075.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"再下面的命令get user789Key會拿到get user456Key的結果,依次類推,則出現串包的現象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"串包過程圖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0f/0f8e2caa8cea8cabf66a0fabd4561fd7.png","alt":"輸入圖片說明","title":"在這裏輸入圖片標題","style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖中相同顏色的矩形對應的數據是一致的。但是由於Redis超時,導致數據串了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"爲什麼get操作返回OK"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖很明顯的解釋了爲什麼一個get操作會返回OK的現象。因爲其上一個操作是set操作,它返回的OK被get操作讀取到,於是就有了這種現象。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"爲什麼會隨着時間的收斂而頻率降低"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲在調用Redis出錯後,業務系統有一層攔截器會攔截到業務層的出錯,同時給這個JedisClient的錯誤個數+1,當錯誤個數>3的時候,會將其從池中踢掉。這樣這種串了的連接會越來越少,導致Bug原來越難以出現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"在每次調用之前清理下inputstream可行否"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不行,因爲Redis可能在你清理inputstream後,你下次讀取前把數據給傳回來。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"怎麼避免這種現象?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拋出這種IO異常的連接直接給扔掉,不要放到池子裏面。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"怎麼從協議層面避免這種現象"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對每次發送的命令都加一個隨機的packetId,然後結果返回回來的時候將這個packetId帶回來。在客戶端每次接收到數據的時候,獲取包中的packetId和之前發出的packetId相比較,如下代碼所示:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"if(oldPacketId != packetIdFromData){\n throw new RuntimeException(\"串包\");\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"至少在筆者遇到的場景中,出現IO異常的連接都必須被拋掉廢棄,因爲你永遠不知道在你複用的那一刻,socket或者inputstream的buffer中到底有沒有上一次命令遺留的數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然如果刻意的去構造協議,能夠通過packetId之類的手段把收發狀態重新調整爲一致也是可以的,這無疑增加了很高的複雜度。所以廢棄連接重建是最簡單有效的方法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"公衆號"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關注筆者公衆號,獲取更多幹貨文章:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/04/046ecbf6ada994d1dade8690048697a9.jpeg","alt":"","title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章