NameServer update bug

I. Background

The business team reported a large number of errors like the following in their logs:

[2019-09-25T18:29:19,853][WARN ][RocketmqClient           ] get Topic [search-core_model_v2-topic] RouteInfoFromNameServer is not exist value
[2019-09-25T18:29:19,853][WARN ][RocketmqClient           ] updateTopicRouteInfoFromNameServer Exception
org.apache.rocketmq.client.exception.MQClientException: CODE: 17  DESC: No topic route info in name server for the topic: search-core_model_v2-topic
See http://rocketmq.apache.org/docs/faq/ for further details.
    at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1227) ~[rocketmq-client-4.2.0.jar:4.2.0]
    at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1197) ~[rocketmq-client-4.2.0.jar:4.2.0]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:605) [rocketmq-client-4.2.0.jar:4.2.0]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:492) [rocketmq-client-4.2.0.jar:4.2.0]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:361) [rocketmq-client-4.2.0.jar:4.2.0]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance$3.run(MQClientInstance.java:278) [rocketmq-client-4.2.0.jar:4.2.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_111]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

Yet the application had not been restarted or updated recently.

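II. Investigation

  1. The error itself is unambiguous: the topic does not exist on the NameServer. The NameServer-side code is simple; simplified, it looks like this: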

    TopicRouteData topicRouteData = map.get(requestHeader.getTopic());
    if (topicRouteData == null) {
        response.setCode(ResponseCode.TOPIC_NOT_EXIST);
        response.setRemark("No topic route info in name server for the topic: " + requestHeader.getTopic()
            + FAQUrl.suggestTodo(FAQUrl.APPLY_TOPIC_URL));
    }
    

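    In other words, the NameServer really could not find the topic.

  2. A quick local test, however, could fetch the topic route from the NameServer without any problem.

    Topic routes on the NameServer are reported by the brokers, and a check of the brokers showed that the topic configuration was normal.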

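  3. Checking the date on which the exception first occurred showed that two changes had been made that day:

    1. The slaves of two clusters were swapped to meet cross-datacenter high-availability requirements.
    2. Two NameServer machines of cluster A were taken offline and assigned to cluster B, and two new NameServer instances were deployed in their place.

    So the NameServer change was very likely the cause of the problem.

  4. Examining the logs

    1. Following the application log gave the first clue: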

      [2019-09-17T16:27:19,850][INFO ][RocketmqRemoting         ] new name server is chosen. OLD: A:9876 , NEW: B:9876. namesrvIndex = 790
      [2019-09-17T16:27:19,851][INFO ][RocketmqRemoting         ] createChannel: begin to connect remote host[B:9876] asynchronously
      [2019-09-17T16:27:19,851][INFO ][RocketmqRemoting         ] NETTY CLIENT PIPELINE: CONNECT  UNKNOWN => B:9876
      [2019-09-17T16:27:19,852][INFO ][RocketmqRemoting         ] createChannel: connect remote host[B:9876] success, DefaultChannelPromise@291c6e50(success)
      

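      The log shows that at 16:27:19 the client switched its NameServer connection from A to B (and these happen to be the two machines that were moved that day).

      The client switches to another NameServer when either of the following happens during a remote call to the NameServer:

      1. A network error
      2. A timeout

    2. Shortly afterwards, the following log entries appear: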

      [2019-09-17T16:27:29,846][INFO ][RocketmqClient           ] name server address changed, old=A,B,C new=C,D,E
      [2019-09-17T16:27:29,846][INFO ][RocketmqRemoting         ] name server address updated. NEW : [C,D,E] , OLD: [A,B,C]
      

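      These entries come from the client's periodic poll: every 2 minutes it checks whether the NameServer address list has changed and, if so, updates its locally cached list.

  5. The likely sequence of events:

    1. The client holds the NameServer list A, B, … and is currently using A.
    2. B is taken offline, moved to another cluster, and started there.
    3. A is taken offline, moved to another cluster, and started there.
    4. Because A is offline, the client's next interaction with the NameServer (for example the periodic topic route update) fails, so it automatically selects the next entry, B. This is the log in 4.1 (note that the selected node now belongs to another cluster).
    5. The client's periodic poll detects that the NameServer addresses have changed and updates its locally cached list. This is the log in 4.2.

    Here is the problem: the client is now holding a NameServer node that belongs to another cluster. That node started normally and communication with it works, but it has no route information for the topic.

    The key point is that when the RocketMQ client updates its locally cached list, it does not check whether the NameServer currently in use is still in that list. The relevant code looks like this: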

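    private Channel getAndCreateNameserverChannel() throws InterruptedException {
        // Reuse the previously chosen NameServer as long as its channel is healthy.
        String addr = this.namesrvAddrChoosed.get();
        if (addr != null) {
            ChannelWrapper cw = this.channelTables.get(addr);
            if (cw != null && cw.isOK()) {
                // Returned without checking that addr is still in namesrvAddrList.
                return cw.getChannel();
            }
        }
        // The address list is consulted only when the chosen channel is unusable.
        final List<String> addrList = this.namesrvAddrList.get();
        ... ...
    }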
    
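  6. When can this happen?

    Three things must coincide: the order of the client's list happens to match the order in which the NameServers are updated, the client's NameServer re-selection happens to fall within the NameServer update window, and the NameServer update takes less than the 2-minute polling interval. (By default the client shuffles its list randomly.)

  7. What is the impact?

    Because the topic route cannot be found, both producing and consuming become unavailable.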
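
The sequence in step 5 can be modeled in a few lines. The following is only a minimal sketch under stated assumptions: StaleNameServerModel and all of its methods are hypothetical names invented for illustration, not RocketMQ classes, and channel health is reduced to "the chosen address is reused until a failover is triggered".

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the client-side state; not RocketMQ source code.
class StaleNameServerModel {
    private final AtomicReference<List<String>> addrList = new AtomicReference<>();
    private final AtomicReference<String> chosen = new AtomicReference<>();

    StaleNameServerModel(List<String> initialList, String initiallyChosen) {
        addrList.set(initialList);
        chosen.set(initiallyChosen);
    }

    // Failover after a network error or timeout: move to the next cached entry.
    void failover() {
        List<String> list = addrList.get();
        int next = (list.indexOf(chosen.get()) + 1) % list.size();
        chosen.set(list.get(next));
    }

    // Periodic refresh: the list is replaced, the current choice is left untouched.
    void refresh(List<String> freshList) {
        addrList.set(freshList);
    }

    // A healthy chosen address is reused without consulting the list.
    String addressInUse() {
        return chosen.get();
    }

    boolean addressInUseIsListed() {
        return addrList.get().contains(chosen.get());
    }

    public static void main(String[] args) {
        StaleNameServerModel client =
            new StaleNameServerModel(Arrays.asList("A", "B", "C"), "A");
        client.failover();                            // A is offline, B is picked (log 4.1)
        client.refresh(Arrays.asList("C", "D", "E")); // list refreshed (log 4.2)
        // prints "B listed=false"
        System.out.println(client.addressInUse() + " listed=" + client.addressInUseIsListed());
    }
}
```

After the refresh the model still reports B as the address in use even though B is no longer in the list, which matches the state the client was stuck in.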

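III. Solution

Strictly speaking this is a bug. It can be fixed in either of two ways:

  1. When obtaining a connection, simply check whether the chosen address is still in the NameServer list.
  2. Or, when the periodic refresh of the NameServer list detects a change, check whether the address currently in use is still in the new list.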
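
Option 1 could look roughly like the sketch below. It is an illustration only, not the actual patch: ValidatingNameServerChooser, refresh and choose are hypothetical names, and the channel health check of the real client is left out so that only the list-membership check is shown.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of option 1; class and method names are illustrative, not RocketMQ's.
class ValidatingNameServerChooser {
    private List<String> addrList;
    private String chosen;
    private int index;

    ValidatingNameServerChooser(List<String> initialList, String initiallyChosen) {
        this.addrList = initialList;
        this.chosen = initiallyChosen;
    }

    // Periodic refresh: replace the cached list.
    synchronized void refresh(List<String> freshList) {
        this.addrList = freshList;
    }

    // Returns the address to use, or null when the list is empty.
    synchronized String choose() {
        if (chosen != null && !addrList.contains(chosen)) {
            // The address in use has been removed from the list: drop it.
            chosen = null;
        }
        if (chosen == null && !addrList.isEmpty()) {
            // Round-robin over the current list.
            chosen = addrList.get(index % addrList.size());
            index = (index + 1) % addrList.size();
        }
        return chosen;
    }

    public static void main(String[] args) {
        ValidatingNameServerChooser chooser =
            new ValidatingNameServerChooser(Arrays.asList("A", "B"), "B");
        chooser.refresh(Arrays.asList("A"));
        System.out.println(chooser.choose()); // prints "A": B is no longer listed
    }
}
```

Checking at selection time keeps the change local to one method; option 2 would instead clear the chosen address inside the refresh itself.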

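1 Upgrade the RocketMQ client to a version that fixes this problem.

2 Until every application has upgraded to the latest version, keep successive NameServer changes more than 3 minutes apart.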

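IV. Test procedure

1 First, reproduce the problem:

  1. Deploy a RocketMQ cluster whose NameServer cluster has two nodes, called A and B here.
  2. Start a client consumer process and print the order of its cached NameServer list; assume the order is A, B and the client is currently connected to A.
  3. Watch for the NameServer list refresh, which runs every 2 minutes by default, and perform the following steps between two refreshes:
    1. Remove B from the address list served under the NameServer domain name.
    2. Shut down B, move it to another cluster, and start it there.
  4. Shut down A to trigger NameServer re-selection.

The problem is now reproduced, and the client keeps throwing the following exception (B is the node that was moved away):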

11:12:00.999 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - new name server is chosen. OLD: A , NEW: B. namesrvIndex = 627
11:12:01.000 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - createChannel: begin to connect remote host[B] asynchronously
11:12:01.001 [NettyClientWorkerThread_4] INFO  RocketmqRemoting - NETTY CLIENT PIPELINE: CONNECT  UNKNOWN => B
11:12:01.009 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - createChannel: connect remote host[B] success, DefaultChannelPromise@5a9281bb(success)
11:12:01.022 [MQClientFactoryScheduledThread] INFO  RocketmqClient - updateTopicRouteInfoFromNameServer:basic-apitest-topic
11:12:01.041 [MQClientFactoryScheduledThread] WARN  RocketmqClient - get Topic [basic-apitest-topic] RouteInfoFromNameServer is not exist value
11:12:01.053 [MQClientFactoryScheduledThread] WARN  RocketmqClient - updateTopicRouteInfoFromNameServer Exception
org.apache.rocketmq.client.exception.MQClientException: CODE: 17  DESC: No topic route info in name server for the topic: basic-apitest-topic
See http://rocketmq.apache.org/docs/faq/ for further details.
    at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1233) ~[classes/:na]
    at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1203) ~[classes/:na]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:606) [classes/:na]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:493) [classes/:na]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:362) [classes/:na]
    at org.apache.rocketmq.client.impl.factory.MQClientInstance$3.run(MQClientInstance.java:278) [classes/:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_80]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_80]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_80]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_80]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_80]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_80]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]

2 Repeat the steps above with the fixed client:

11:31:13.676 [MQClientFactoryScheduledThread] INFO  RocketmqClient - fetchNameServerAddr:A
11:31:13.676 [MQClientFactoryScheduledThread] INFO  RocketmqClient - name server address changed, old=B;A, new=A
11:31:13.676 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - name server address updated. NEW : [A] , OLD: [A, B]
11:31:33.668 [MQClientFactoryScheduledThread] INFO  RocketmqClient - updateTopicRouteInfoFromNameServer:TBW102
11:31:33.668 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - name server address is invalid: B
11:31:33.668 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - new name server is chosen. OLD: null , NEW: A. namesrvIndex = 614
11:31:33.669 [MQClientFactoryScheduledThread] INFO  RocketmqRemoting - createChannel: begin to connect remote host[A] asynchronously

The log line "name server address is invalid: B" shows that an address that is still connected but no longer in the NameServer list is now dropped promptly.
