I. Background
The business team reported a large number of errors like the following in their logs:
[2019-09-25T18:29:19,853][WARN ][RocketmqClient ] get Topic [search-core_model_v2-topic] RouteInfoFromNameServer is not exist value
[2019-09-25T18:29:19,853][WARN ][RocketmqClient ] updateTopicRouteInfoFromNameServer Exception
org.apache.rocketmq.client.exception.MQClientException: CODE: 17 DESC: No topic route info in name server for the topic: search-core_model_v2-topic
See http://rocketmq.apache.org/docs/faq/ for further details.
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1227) ~[rocketmq-client-4.2.0.jar:4.2.0]
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1197) ~[rocketmq-client-4.2.0.jar:4.2.0]
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:605) [rocketmq-client-4.2.0.jar:4.2.0]
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:492) [rocketmq-client-4.2.0.jar:4.2.0]
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:361) [rocketmq-client-4.2.0.jar:4.2.0]
at org.apache.rocketmq.client.impl.factory.MQClientInstance$3.run(MQClientInstance.java:278) [rocketmq-client-4.2.0.jar:4.2.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_111]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_111]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_111]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Yet the application had not been restarted or updated recently.
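II. Investigation
1 The meaning of the error is clear: the topic does not exist on the NameServer. The NameServer-side code is very simple; simplified, it looks like this: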
TopicRouteData topicRouteData = map.get(requestHeader.getTopic());
if (topicRouteData == null) {
    response.setCode(ResponseCode.TOPIC_NOT_EXIST);
    response.setRemark("No topic route info in name server for the topic: " + requestHeader.getTopic()
        + FAQUrl.suggestTodo(FAQUrl.APPLY_TOPIC_URL));
}
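In other words, the topic really could not be found on the NameServer.
2 A quick local test, however, could fetch the topic route from the NameServer without any problem.
Topic routes on the NameServer are reported by the brokers, and a check of the brokers showed that the topic configuration was normal.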
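3 Checking the date on which the exception started showed that two changes had been made that day:
- The slaves of the two clusters were swapped to meet the cross-datacenter high-availability requirement.
- Two NameServer machines of cluster A were taken offline and assigned to cluster B, and two new NameServer instances were deployed.
So it was very likely the NameServer change that caused the problem.
4 Examining the logs
4.1 Tracing the application logs turned up the following clue: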
[2019-09-17T16:27:19,850][INFO ][RocketmqRemoting ] new name server is chosen. OLD: A:9876 , NEW: B:9876. namesrvIndex = 790
[2019-09-17T16:27:19,851][INFO ][RocketmqRemoting ] createChannel: begin to connect remote host[B:9876] asynchronously
[2019-09-17T16:27:19,851][INFO ][RocketmqRemoting ] NETTY CLIENT PIPELINE: CONNECT UNKNOWN => B:9876
[2019-09-17T16:27:19,852][INFO ][RocketmqRemoting ] createChannel: connect remote host[B:9876] success, DefaultChannelPromise@291c6e50(success)
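The log shows that at 16:27:19 the NameServer the client was connected to switched from A to B (and these two happen to be exactly the two machines that were moved that day).
While talking to the NameServer, the client switches to another NameServer in the following situations:
- Network error
- Timeout
4.2 Right after that, the following log appears: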
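The failover behaviour described above can be modelled with a small sketch. This is not the RocketMQ source: `NameServerSelector`, `reachable`, and `choose` are hypothetical names, and reachability is simulated with a set instead of a real network connection. It assumes the client walks its cached list round-robin, which is consistent with the increasing `namesrvIndex` in the log.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of client-side NameServer failover (not RocketMQ code).
class NameServerSelector {
    private final List<String> addrList;   // cached NameServer list
    private final Set<String> reachable;   // simulated network state
    private String chosen;                 // address currently in use
    private int index;                     // round-robin cursor

    NameServerSelector(List<String> addrList, Set<String> reachable) {
        this.addrList = addrList;
        this.reachable = reachable;
    }

    // Returns the address to talk to, switching only when the current one fails.
    String choose() {
        if (chosen != null && reachable.contains(chosen)) {
            return chosen;                 // healthy connection: keep using it
        }
        for (int i = 0; i < addrList.size(); i++) {
            String candidate = addrList.get(Math.floorMod(index++, addrList.size()));
            System.out.println("new name server is chosen. OLD: " + chosen + " , NEW: " + candidate);
            chosen = candidate;
            if (reachable.contains(candidate)) {
                return candidate;
            }
        }
        return null;                       // nothing in the list is reachable
    }

    public static void main(String[] args) {
        Set<String> up = new HashSet<>(Arrays.asList("A", "B"));
        NameServerSelector s = new NameServerSelector(Arrays.asList("A", "B"), up);
        s.choose();        // picks A
        up.remove("A");    // A goes offline (network error or timeout)
        s.choose();        // falls over to B
    }
}
```

Note that the sketch never asks which cluster a node belongs to: any node that accepts the connection is treated as a valid NameServer.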
[2019-09-17T16:27:29,846][INFO ][RocketmqClient ] name server address changed, old=A,B,C new=C,D,E
[2019-09-17T16:27:29,846][INFO ][RocketmqRemoting ] name server address updated. NEW : [C,D,E] , OLD: [A,B,C]
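This log is produced by the client's poll, which runs every 2 minutes to check whether the NameServer addresses have changed; if they have, the client updates its locally cached list.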
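A minimal sketch of that refresh step, assuming the client compares the freshly fetched addresses with its cached list and replaces the cache when they differ. `NameServerListCache` and `updateIfChanged` are hypothetical names, and the fetch itself is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical model of the cached NameServer list (not RocketMQ code).
class NameServerListCache {
    private List<String> cached = new ArrayList<>();

    // Replaces the cached list when the fetched addresses differ; returns true on change.
    boolean updateIfChanged(List<String> fetched) {
        if (fetched == null || fetched.isEmpty()) {
            return false;                       // ignore an empty lookup result
        }
        // Compare as sets so that a mere reordering does not count as a change.
        boolean changed = !new HashSet<>(cached).equals(new HashSet<>(fetched));
        if (changed) {
            System.out.println("name server address updated. NEW : " + fetched + " , OLD: " + cached);
            cached = new ArrayList<>(fetched);
        }
        return changed;
    }

    List<String> current() {
        return cached;
    }

    public static void main(String[] args) {
        NameServerListCache cache = new NameServerListCache();
        cache.updateIfChanged(Arrays.asList("A", "B", "C"));
        cache.updateIfChanged(Arrays.asList("C", "D", "E")); // list is replaced
    }
}
```

In a real client this check would be driven by a scheduled task, for example `ScheduledExecutorService.scheduleAtFixedRate` with a 2-minute period.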
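5 Probable cause:
- The application holds the NameServer list A, B, … and is currently using A.
- B is taken offline from the cluster, moved to another cluster, and started.
- A is taken offline from the cluster, moved to another cluster, and started.
- Because A went offline, the application's requests to the NameServer (such as its periodic topic route queries) fail, so it automatically selects the next node, namely B. This is the "new name server is chosen" log in 4.1 (note that the selected node already belongs to another cluster at this point).
- The client's periodic poll finds that the NameServer addresses have changed and updates its locally cached list. This is the "name server address updated" log in 4.2.
Here is the problem: the client now holds a NameServer node that belongs to another cluster. That node started normally and communicates normally, but it has no information about the topic.
The key point is that when the rocketmq client updates its locally cached list, it does not check whether the NameServer currently in use is still in that list. The code is similar to the following: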
private Channel getAndCreateNameserverChannel() throws InterruptedException {
    String addr = this.namesrvAddrChoosed.get();
    if (addr != null) {
        ChannelWrapper cw = this.channelTables.get(addr);
        if (cw != null && cw.isOK()) {
            return cw.getChannel();
        }
    }
    final List<String> addrList = this.namesrvAddrList.get();
    ... ...
}
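To make the sequence concrete, here is a small simulation of the steps listed above. `BuggyClient` and its members are hypothetical stand-ins, not RocketMQ classes, and reachability is faked with a set. The node names follow the logs (old list A, B, C; new list C, D, E).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the faulty behaviour (not RocketMQ code).
class BuggyClient {
    List<String> addrList;            // cached NameServer list
    String chosen;                    // address currently in use
    int index;                        // round-robin cursor
    final Set<String> reachable;      // simulated: nodes that accept connections

    BuggyClient(List<String> addrList, Set<String> reachable) {
        this.addrList = new ArrayList<>(addrList);
        this.reachable = reachable;
    }

    String getNameServer() {
        // The in-use address is returned as long as its connection is healthy,
        // without checking that it still belongs to addrList.
        if (chosen != null && reachable.contains(chosen)) {
            return chosen;
        }
        for (int i = 0; i < addrList.size(); i++) {
            chosen = addrList.get(Math.floorMod(index++, addrList.size()));
            if (reachable.contains(chosen)) {
                return chosen;
            }
        }
        return null;
    }

    void updateAddrList(List<String> fresh) {
        addrList = new ArrayList<>(fresh);   // the bug: 'chosen' is left untouched
    }

    public static void main(String[] args) {
        Set<String> up = new HashSet<>(Arrays.asList("A", "B", "C", "D", "E"));
        BuggyClient c = new BuggyClient(Arrays.asList("A", "B", "C"), up);
        c.getNameServer();                              // the client is using A
        up.remove("A");                                 // A goes offline; B already runs in the other cluster
        c.getNameServer();                              // fails over to B, which is now a foreign node
        c.updateAddrList(Arrays.asList("C", "D", "E")); // periodic poll refreshes the list
        String now = c.getNameServer();                 // still B
        System.out.println(now + " in list? " + c.addrList.contains(now));
    }
}
```

After the list refresh the client keeps returning B even though B is no longer in its list, which is the state the production clients were stuck in.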
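6 When can this happen?
The order of the client's list happens to match the order in which the nodes were changed, the client's NameServer reselection happens to fall inside the window of the NameServer change, and the whole NameServer change takes less than 2 minutes. (By default the client shuffles its list randomly.)
7 What is the impact?
Since the topic route cannot be found, both producing and consuming become unavailable.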
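III. Solution
Strictly speaking this is a bug. It can be fixed in either of two ways:
- When obtaining a connection, simply check whether the address in use is still in the NameServer list.
- Or, when the periodic refresh of the NameServer list detects a change, check whether the address currently in use is still in the new list.
1 Upgrade the rocketmq client to a version that contains the fix.
2 Until every application has upgraded to the latest version, leave more than 3 minutes between successive NameServer changes, so that the client's 2-minute poll can refresh its list in between.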
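The first of the two code-level options can be sketched as follows. This is a hypothetical model rather than the actual patch: `FixedClient` and its members are illustrative names, and reachability is again simulated with a set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the fix (not the actual RocketMQ patch).
class FixedClient {
    List<String> addrList;            // cached NameServer list
    String chosen;                    // address currently in use
    int index;                        // round-robin cursor
    final Set<String> reachable;      // simulated: nodes that accept connections

    FixedClient(List<String> addrList, Set<String> reachable) {
        this.addrList = new ArrayList<>(addrList);
        this.reachable = reachable;
    }

    String getNameServer() {
        // Option 1: discard the in-use address if it is no longer in the list.
        if (chosen != null && !addrList.contains(chosen)) {
            System.out.println("name server address is invalid: " + chosen);
            chosen = null;
        }
        if (chosen != null && reachable.contains(chosen)) {
            return chosen;
        }
        for (int i = 0; i < addrList.size(); i++) {
            chosen = addrList.get(Math.floorMod(index++, addrList.size()));
            if (reachable.contains(chosen)) {
                return chosen;
            }
        }
        return null;
    }

    void updateAddrList(List<String> fresh) {
        addrList = new ArrayList<>(fresh);
        // Option 2 would go here instead: if 'fresh' does not contain 'chosen', reset it to null.
    }

    public static void main(String[] args) {
        Set<String> up = new HashSet<>(Arrays.asList("A", "B", "C", "D", "E"));
        FixedClient c = new FixedClient(Arrays.asList("A", "B", "C"), up);
        c.getNameServer();                              // using A
        up.remove("A");                                 // A goes offline
        c.getNameServer();                              // fails over to B
        c.updateAddrList(Arrays.asList("C", "D", "E")); // B is no longer listed
        String now = c.getNameServer();                 // B is rejected, a listed node is chosen
        System.out.println(now + " in list? " + c.addrList.contains(now));
    }
}
```

Either option alone closes the gap. Option 1 reacts on the next lookup, while option 2 reacts as soon as the list is refreshed.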
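IV. Test Steps
1 First, reproduce the situation:
- Deploy a rocketmq cluster whose NameServer cluster has two nodes, called A and B here.
- Start a client consumer process and print the order of its cached NameServer list. Assume the cached order is A, B and the client is currently connected to A.
- Watch for the moment the NameServer list is refreshed (every 2 minutes by default), and carry out the following steps within one interval:
  - Remove B from the NameServer address list published under the domain name.
  - Shut down B, move it to another cluster, and start it.
  - Shut down A to trigger NameServer reselection.
At this point the problem is reproduced, and the client keeps throwing the following exception (B is the node that was migrated away):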
11:12:00.999 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - new name server is chosen. OLD: A , NEW: B. namesrvIndex = 627
11:12:01.000 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - createChannel: begin to connect remote host[B] asynchronously
11:12:01.001 [NettyClientWorkerThread_4] INFO RocketmqRemoting - NETTY CLIENT PIPELINE: CONNECT UNKNOWN => B
11:12:01.009 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - createChannel: connect remote host[B] success, DefaultChannelPromise@5a9281bb(success)
11:12:01.022 [MQClientFactoryScheduledThread] INFO RocketmqClient - updateTopicRouteInfoFromNameServer:basic-apitest-topic
11:12:01.041 [MQClientFactoryScheduledThread] WARN RocketmqClient - get Topic [basic-apitest-topic] RouteInfoFromNameServer is not exist value
11:12:01.053 [MQClientFactoryScheduledThread] WARN RocketmqClient - updateTopicRouteInfoFromNameServer Exception
org.apache.rocketmq.client.exception.MQClientException: CODE: 17 DESC: No topic route info in name server for the topic: basic-apitest-topic
See http://rocketmq.apache.org/docs/faq/ for further details.
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1233) ~[classes/:na]
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1203) ~[classes/:na]
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:606) [classes/:na]
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:493) [classes/:na]
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:362) [classes/:na]
at org.apache.rocketmq.client.impl.factory.MQClientInstance$3.run(MQClientInstance.java:278) [classes/:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_80]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_80]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_80]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_80]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_80]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_80]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
2 Repeat the steps above with the fixed client:
11:31:13.676 [MQClientFactoryScheduledThread] INFO RocketmqClient - fetchNameServerAddr:A
11:31:13.676 [MQClientFactoryScheduledThread] INFO RocketmqClient - name server address changed, old=B;A, new=A
11:31:13.676 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - name server address updated. NEW : [A] , OLD: [A, B]
11:31:33.668 [MQClientFactoryScheduledThread] INFO RocketmqClient - updateTopicRouteInfoFromNameServer:TBW102
11:31:33.668 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - name server address is invalid: B
11:31:33.668 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - new name server is chosen. OLD: null , NEW: A. namesrvIndex = 614
11:31:33.669 [MQClientFactoryScheduledThread] INFO RocketmqRemoting - createChannel: begin to connect remote host[A] asynchronously
The log line "name server address is invalid: B" shows that an address which is still connected but no longer in the NameServer list is now removed promptly.