rd_kafka_destroy hangs on exit when using the librdkafka high-level consumer API

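  • I recently hit a problem with the librdkafka consumer API: the program often hung after it finished consuming messages or when the user quit. Attaching gdb showed that it was blocked in the call to rd_kafka_destroy:
    • Three threads were left. Two were in pthread_join, and the top frame of the third was poll.
    • I read the code and printed the arguments to poll: the call has a timeout (tmout=1000 in the backtrace below), so it cannot block forever. The problem had to be somewhere else.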
(gdb) set pagination off
(gdb) info thr
  Id   Target Id         Frame 
  3    Thread 0x7f1f506ba700 (LWP 24296) "rdk:main" 0x0000003074208705 in pthread_join (threadid=139772448966400, thread_return=0x7f1f506b5d58) at pthread_join.c:92
  2    Thread 0x7f1f4f2b8700 (LWP 24299) "rdk:broker0" 0x0000003073ee1bfd in poll () at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7f1f514ef780 (LWP 23787) "mpp_test" 0x0000003074208705 in pthread_join (threadid=139772469946112, thread_return=0x7fff0d9410a8) at pthread_join.c:92

(gdb) thr apply all bt

Thread 3 (Thread 0x7f1f506ba700 (LWP 24296)):
Python Exception <type 'exceptions.AttributeError'> 'module' object has no attribute 'Command': 
#0  0x0000003074208705 in pthread_join (threadid=139772448966400, thread_return=0x7f1f506b5d58) at pthread_join.c:92
#1  0x0000000000475f28 in thrd_join (thr=139772448966400, res=0x0) at tinycthread.c:749
#2  0x000000000040c199 in rd_kafka_destroy_internal (rk=0x1b6c280) at rdkafka.c:850
#3  0x000000000040e7ac in rd_kafka_thread_main (arg=0x1b6c280) at rdkafka.c:1284
#4  0x0000000000475db1 in _thrd_wrapper_function (aArg=0x1b6d350) at tinycthread.c:624
#5  0x0000003074207213 in start_thread (arg=0x7f1f506ba700) at pthread_create.c:309
#6  0x0000003073eeb65d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f1f4f2b8700 (LWP 24299)):
Python Exception <type 'exceptions.AttributeError'> 'module' object has no attribute 'Command': 
#0  0x0000003073ee1bfd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000043d969 in rd_kafka_transport_poll (rktrans=0x7f1f40000ab0, tmout=1000) at rdkafka_transport.c:1538
#2  0x000000000043d24c in rd_kafka_transport_io_serve (rktrans=0x7f1f40000ab0, timeout_ms=1000) at rdkafka_transport.c:1397
#3  0x000000000042081d in rd_kafka_broker_serve (rkb=0x1b6de40, abs_timeout=435806751453) at rdkafka_broker.c:2293
#4  0x0000000000425e63 in rd_kafka_broker_consumer_serve (rkb=0x1b6de40) at rdkafka_broker.c:3199
#5  0x0000000000426334 in rd_kafka_broker_thread_main (arg=0x1b6de40) at rdkafka_broker.c:3311
#6  0x0000000000475db1 in _thrd_wrapper_function (aArg=0x1b4c370) at tinycthread.c:624
#7  0x0000003074207213 in start_thread (arg=0x7f1f4f2b8700) at pthread_create.c:309
#8  0x0000003073eeb65d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f1f514ef780 (LWP 23787)):
Python Exception <type 'exceptions.AttributeError'> 'module' object has no attribute 'Command': 
#0  0x0000003074208705 in pthread_join (threadid=139772469946112, thread_return=0x7fff0d9410a8) at pthread_join.c:92
#1  0x0000000000475f28 in thrd_join (thr=139772469946112, res=0x0) at tinycthread.c:749
#2  0x000000000040bc97 in rd_kafka_destroy_app (rk=0x1b6c280, blocking=1) at rdkafka.c:736
#3  0x000000000040bd18 in rd_kafka_destroy (rk=0x1b6c280) at rdkafka.c:749
#4  0x0000000000407718 in closeKafkaConsumerConnection (kafka_cxt=0x1b4c010) at test.c:618
#5  0x0000000000408e02 in main (argc=12, argv=0x7fff0d9412e8) at test.c:1259
  • Following thread 2's call stack into the code, I found a loop in rd_kafka_broker_consumer_serve. I set a breakpoint on the loop condition and ran continue, and the breakpoint was indeed hit:
static void rd_kafka_broker_consumer_serve (rd_kafka_broker_t *rkb) {
    rd_kafka_assert(rkb->rkb_rk, thrd_is_current(rkb->rkb_thread));
    rd_kafka_broker_lock(rkb);

    while (!rd_kafka_broker_terminating(rkb) &&
           rkb->rkb_state == RD_KAFKA_BROKER_STATE_UP) {
        rd_ts_t now;
        rd_ts_t min_backoff;

        rd_kafka_broker_unlock(rkb);
        ……

        rd_kafka_broker_serve(rkb,
                now + (rkb->rkb_blocking_max_ms * 1000));

        rd_kafka_broker_lock(rkb);
    }
    ……
}
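  • Printing the loop-control variables: rd_kafka_broker_terminating(rkb) is a macro that expands to (rd_refcnt_get(&(rkb)->rkb_refcnt) <= 1), i.e. it checks whether rkb's reference count is <= 1. My understanding is that <= 1 means only the current holder still references the object, nothing else points to that memory, and it can be freed safely.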
Breakpoint 4, rd_kafka_broker_consumer_serve (rkb=0x1b6de40) at rdkafka_broker.c:3159
3159               rkb->rkb_state == RD_KAFKA_BROKER_STATE_UP) {
(gdb) p rkb->rkb_state
$3 = RD_KAFKA_BROKER_STATE_UP
(gdb) p rd_refcnt_get(&(rkb)->rkb_refcnt)
$4 = 16
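The exit condition can be illustrated with a toy model. This is only a sketch: toy_broker_t, toy_msg_t and every toy_* function are hypothetical stand-ins, not librdkafka code, and it assumes (as the gdb session suggests) that each message the application has not destroyed keeps, directly or indirectly, one reference on the broker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for rd_kafka_broker_t: only the reference count is modelled. */
typedef struct {
    int refcnt;
} toy_broker_t;

/* Toy stand-in for a fetched message that pins its broker (an assumption). */
typedef struct {
    toy_broker_t *owner;
} toy_msg_t;

void toy_broker_init(toy_broker_t *b) {
    b->refcnt = 1; /* the broker thread's own reference */
}

toy_msg_t toy_msg_acquire(toy_broker_t *b) {
    toy_msg_t m = { b };
    b->refcnt++; /* the message now holds a reference */
    return m;
}

void toy_msg_release(toy_msg_t *m) {
    assert(m->owner != NULL && m->owner->refcnt > 1);
    m->owner->refcnt--;
    m->owner = NULL;
}

/* Same shape as the macro quoted above: refcnt <= 1 means "may terminate". */
bool toy_broker_terminating(const toy_broker_t *b) {
    return b->refcnt <= 1;
}

/* Fetch `fetched` messages, release only the first `released` of them,
 * and report whether the serve loop's exit check would pass afterwards. */
bool toy_loop_can_exit(int fetched, int released) {
    toy_broker_t b;
    toy_broker_init(&b);
    for (int i = 0; i < fetched; i++) {
        toy_msg_t m = toy_msg_acquire(&b);
        if (i < released)
            toy_msg_release(&m); /* normal path */
        /* else: an error path that skips the release */
    }
    return toy_broker_terminating(&b);
}
```

In this model toy_loop_can_exit(15, 15) is true, while a single skipped release (toy_loop_can_exit(15, 14)) already leaves the count above 1, so the loop condition would stay true forever.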
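  • Clearly the loop keeps running because the reference count (rkb)->rkb_refcnt does not satisfy the <= 1 condition. Common sense says some object has not been released or destroyed. Below is the main code of the shutdown path; I went over it several times and could not see anything wrong.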
    err = rd_kafka_consumer_close(rk);
    if (err)
        log(WARNING, "Failed to close consumer: %s\n",  rd_kafka_err2str(err));
    else
        log(WARNING, "Consumer closed\n");

    rd_kafka_destroy(rk);
    rd_kafka_topic_partition_list_destroy(tpl);

    /* Let background threads clean up and terminate cleanly. */
    int run = 5;
    while (run-- > 0 && rd_kafka_wait_destroyed(1000) == -1)
        log(WARNING, "Waiting for librdkafka to decommission\n");
    if (run <= 0)
        rd_kafka_dump(stdout, rk);
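  • For comparison, I ran the demo shipped with librdkafka several times and it did not have this problem. The main difference is that the demo only bumps a statistics counter after receiving a message and then destroys it, whereas my program parses each message for business purposes. On closer inspection there was an error branch for parse failures where, to save effort during early testing, I had simply written continue without handling the error properly and without calling rd_kafka_message_destroy to destroy the message object. After adding that call, everything worked.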
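  • To sum up: while parsing Kafka messages, a few messages violated the business rules and took the error branch, which never called rd_kafka_message_destroy. That leaked memory and kept the reference count of the rd_kafka_broker_t object from dropping, so on exit the termination condition was never met and the loop kept spinning.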
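One way to make this kind of leak harder to reintroduce is to give the consume loop a single destroy point that every branch falls through to, instead of a continue in the error branch. The sketch below shows only the control flow and is not the original program: fake_msg_t, fake_poll and fake_msg_destroy are hypothetical stand-ins for rd_kafka_message_t, rd_kafka_consumer_poll and rd_kafka_message_destroy so that it compiles without a broker, and the "ok:" prefix check is an invented business rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for rd_kafka_message_t. */
typedef struct {
    const char *payload;
    size_t len;
} fake_msg_t;

/* Number of messages handed out and not yet destroyed. */
int fake_live_msgs = 0;

/* Hypothetical stand-in for rd_kafka_consumer_poll(): returns the next
 * input as a heap-allocated message, or NULL when nothing is left. */
fake_msg_t *fake_poll(const char *const *inputs, size_t n, size_t *next) {
    if (*next >= n)
        return NULL;
    fake_msg_t *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->payload = inputs[*next];
    m->len = strlen(m->payload);
    (*next)++;
    fake_live_msgs++;
    return m;
}

/* Hypothetical stand-in for rd_kafka_message_destroy(). */
void fake_msg_destroy(fake_msg_t *m) {
    assert(m != NULL);
    free(m);
    fake_live_msgs--;
}

/* Invented business rule: a valid payload starts with "ok:". */
int parse_payload(const fake_msg_t *m) {
    return (m->len >= 3 && memcmp(m->payload, "ok:", 3) == 0) ? 0 : -1;
}

/* Consume every input; returns the number of messages handled and
 * stores the number of rejected ones in *rejected. */
size_t consume_all(const char *const *inputs, size_t n, size_t *rejected) {
    size_t next = 0, handled = 0;
    fake_msg_t *m;

    *rejected = 0;
    while ((m = fake_poll(inputs, n, &next)) != NULL) {
        if (parse_payload(m) != 0)
            (*rejected)++; /* error branch: record it, but no `continue` */
        else
            handled++;
        fake_msg_destroy(m); /* single destroy point for every branch */
    }
    return handled;
}
```

With the inputs "ok:1", "bad", "ok:2" and an empty string, consume_all handles two messages, rejects two, and fake_live_msgs is back at 0 afterwards. In the real program the same shape applies, with the message returned by the consumer poll call being passed to rd_kafka_message_destroy at the bottom of the loop body.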