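- Recently, while using the librdkafka consumer API, I ran into a problem: the program often hung after it finished consuming messages or when the user quit. Attaching gdb showed that it was blocked in the call to `rd_kafka_destroy`:
- Three threads remained: two were in `pthread_join`, and the third had `poll` at the top of its stack.
- I read the code and printed the arguments passed to `poll`: the call has a timeout (see `tmout=1000` in the backtrace below), so it cannot block forever. The problem had to be somewhere else.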
(gdb) set pagination off
(gdb) info thr
Id Target Id Frame
3 Thread 0x7f1f506ba700 (LWP 24296) "rdk:main" 0x0000003074208705 in pthread_join (threadid=139772448966400, thread_return=0x7f1f506b5d58) at pthread_join.c:92
2 Thread 0x7f1f4f2b8700 (LWP 24299) "rdk:broker0" 0x0000003073ee1bfd in poll () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7f1f514ef780 (LWP 23787) "mpp_test" 0x0000003074208705 in pthread_join (threadid=139772469946112, thread_return=0x7fff0d9410a8) at pthread_join.c:92
(gdb) thr apply all bt
Thread 3 (Thread 0x7f1f506ba700 (LWP 24296)):
Python Exception <type 'exceptions.AttributeError'> 'module' object has no attribute 'Command':
#0 0x0000003074208705 in pthread_join (threadid=139772448966400, thread_return=0x7f1f506b5d58) at pthread_join.c:92
#1 0x0000000000475f28 in thrd_join (thr=139772448966400, res=0x0) at tinycthread.c:749
#2 0x000000000040c199 in rd_kafka_destroy_internal (rk=0x1b6c280) at rdkafka.c:850
#3 0x000000000040e7ac in rd_kafka_thread_main (arg=0x1b6c280) at rdkafka.c:1284
#4 0x0000000000475db1 in _thrd_wrapper_function (aArg=0x1b6d350) at tinycthread.c:624
#5 0x0000003074207213 in start_thread (arg=0x7f1f506ba700) at pthread_create.c:309
#6 0x0000003073eeb65d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 2 (Thread 0x7f1f4f2b8700 (LWP 24299)):
Python Exception <type 'exceptions.AttributeError'> 'module' object has no attribute 'Command':
#0 0x0000003073ee1bfd in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000043d969 in rd_kafka_transport_poll (rktrans=0x7f1f40000ab0, tmout=1000) at rdkafka_transport.c:1538
#2 0x000000000043d24c in rd_kafka_transport_io_serve (rktrans=0x7f1f40000ab0, timeout_ms=1000) at rdkafka_transport.c:1397
#3 0x000000000042081d in rd_kafka_broker_serve (rkb=0x1b6de40, abs_timeout=435806751453) at rdkafka_broker.c:2293
#4 0x0000000000425e63 in rd_kafka_broker_consumer_serve (rkb=0x1b6de40) at rdkafka_broker.c:3199
#5 0x0000000000426334 in rd_kafka_broker_thread_main (arg=0x1b6de40) at rdkafka_broker.c:3311
#6 0x0000000000475db1 in _thrd_wrapper_function (aArg=0x1b4c370) at tinycthread.c:624
#7 0x0000003074207213 in start_thread (arg=0x7f1f4f2b8700) at pthread_create.c:309
#8 0x0000003073eeb65d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 1 (Thread 0x7f1f514ef780 (LWP 23787)):
Python Exception <type 'exceptions.AttributeError'> 'module' object has no attribute 'Command':
#0 0x0000003074208705 in pthread_join (threadid=139772469946112, thread_return=0x7fff0d9410a8) at pthread_join.c:92
#1 0x0000000000475f28 in thrd_join (thr=139772469946112, res=0x0) at tinycthread.c:749
#2 0x000000000040bc97 in rd_kafka_destroy_app (rk=0x1b6c280, blocking=1) at rdkafka.c:736
#3 0x000000000040bd18 in rd_kafka_destroy (rk=0x1b6c280) at rdkafka.c:749
#4 0x0000000000407718 in closeKafkaConsumerConnection (kafka_cxt=0x1b4c010) at test.c:618
#5 0x0000000000408e02 in main (argc=12, argv=0x7fff0d9412e8) at test.c:1259
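- Following thread 2's call stack into the source, I found a loop in `rd_kafka_broker_consumer_serve`. I set a breakpoint on the loop condition and ran `continue`; sure enough, the breakpoint was hit: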
static void rd_kafka_broker_consumer_serve (rd_kafka_broker_t *rkb) {
        rd_kafka_assert(rkb->rkb_rk, thrd_is_current(rkb->rkb_thread));
        rd_kafka_broker_lock(rkb);
        while (!rd_kafka_broker_terminating(rkb) &&
               rkb->rkb_state == RD_KAFKA_BROKER_STATE_UP) {
                rd_ts_t now;
                rd_ts_t min_backoff;
                rd_kafka_broker_unlock(rkb);
                ……
                rd_kafka_broker_serve(rkb,
                                      now + (rkb->rkb_blocking_max_ms * 1000));
                rd_kafka_broker_lock(rkb);
        }
        ……
}
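- Next, print the loop-control variables. `rd_kafka_broker_terminating(rkb)` is a macro that expands to `(rd_refcnt_get(&(rkb)->rkb_refcnt) <= 1)`, i.e. it checks whether the reference count of `rkb` is at most 1. My understanding is that a count of 1 or less means only the current pointer still references the object, no other pointer refers to that memory, and it can be freed safely.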
Breakpoint 4, rd_kafka_broker_consumer_serve (rkb=0x1b6de40) at rdkafka_broker.c:3159
3159 rkb->rkb_state == RD_KAFKA_BROKER_STATE_UP) {
(gdb) p rkb->rkb_state
$3 = RD_KAFKA_BROKER_STATE_UP
(gdb) p rd_refcnt_get(&(rkb)->rkb_refcnt)
$4 = 16
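The termination check can be illustrated with a small toy model. This is a sketch, not librdkafka's implementation: `toy_broker_t`, `toy_message_t`, and every `toy_*` function are hypothetical names invented here, and the model assumes (as the symptoms suggest) that each message handed to the application keeps one reference alive until it is destroyed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for rd_kafka_broker_t: only the reference count matters here. */
typedef struct {
    int refcnt;
} toy_broker_t;

/* Toy stand-in for a fetched message that pins its broker (an assumption). */
typedef struct {
    toy_broker_t *owner;
} toy_message_t;

/* The broker thread itself holds the first reference. */
static void toy_broker_init(toy_broker_t *b) {
    b->refcnt = 1;
}

/* Handing a message to the application takes one more reference. */
static toy_message_t toy_message_new(toy_broker_t *b) {
    toy_message_t m = { b };
    b->refcnt++;
    return m;
}

/* Destroying the message gives that reference back. */
static void toy_message_destroy(toy_message_t *m) {
    assert(m->owner != NULL && m->owner->refcnt > 1);
    m->owner->refcnt--;
    m->owner = NULL;
}

/* Same shape as the macro: terminate once only our own reference is left. */
static bool toy_broker_terminating(const toy_broker_t *b) {
    return b->refcnt <= 1;
}
```

In this model, creating two messages and destroying only one leaves `refcnt` at 2, so `toy_broker_terminating()` stays false and a loop guarded by it keeps running; destroying the second message brings the count back to 1 and lets the loop exit.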
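- Clearly the loop never exits because the reference count `(rkb)->rkb_refcnt` does not satisfy the `<= 1` condition (it is 16). Common sense says some object has not been released or destroyed. Below is the main shutdown code; I went through it several times and could not see anything wrong with it.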
err = rd_kafka_consumer_close(rk);
if (err)
        log(WARNING, "Failed to close consumer: %s\n", rd_kafka_err2str(err));
else
        log(WARNING, "Consumer closed\n");
rd_kafka_destroy(rk);
rd_kafka_topic_partition_list_destroy(tpl);
/* Let background threads clean up and terminate cleanly. */
int run = 5;
while (run-- > 0 && rd_kafka_wait_destroyed(1000) == -1)
        log(WARNING, "Waiting for librdkafka to decommission\n");
if (run <= 0)
        rd_kafka_dump(stdout, rk);
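- For comparison, I ran the demo that ships with librdkafka several times and it did not show the problem. The main difference is that the demo only bumps a statistics counter for each message and then destroys it, whereas my program parses each message according to business rules. On closer inspection I found an error branch for parse failures where, to save effort during early testing, I had simply written `continue` without handling the error properly and without calling `rd_kafka_message_destroy` to destroy the message object. After adding that call, everything worked.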
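One way to make this kind of leak harder to reintroduce is to give the consume loop a single release point that both the success path and the parse-failure path reach. The sketch below shows only the loop shape and does not link against librdkafka: `msg_t`, `poll_msg`, `msg_destroy`, `parse_msg`, and `consume_all` are hypothetical stand-ins (for `rd_kafka_message_t`, `rd_kafka_consumer_poll`, `rd_kafka_message_destroy`, and the business parsing), and the "empty payload is invalid" rule is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for rd_kafka_message_t. */
typedef struct {
    const char *payload;
    size_t len;
} msg_t;

/* Messages handed out by poll_msg() and not yet destroyed. */
static int live_messages = 0;

/* Stand-in for rd_kafka_consumer_poll(): returns NULL when the queue is drained. */
static msg_t *poll_msg(msg_t *queue, size_t n, size_t *next) {
    if (*next >= n)
        return NULL;
    live_messages++;
    return &queue[(*next)++];
}

/* Stand-in for rd_kafka_message_destroy(). */
static void msg_destroy(msg_t *m) {
    assert(m != NULL && live_messages > 0);
    live_messages--;
}

/* Made-up business rule: an empty payload fails to parse. */
static bool parse_msg(const msg_t *m) {
    return m->len > 0;
}

/* Consumes every queued message; returns how many parsed successfully. */
static int consume_all(msg_t *queue, size_t n) {
    size_t next = 0;
    int parsed = 0;
    msg_t *m;

    while ((m = poll_msg(queue, n, &next)) != NULL) {
        if (parse_msg(m))
            parsed++;
        /* Single release point: reached after success and after a parse
         * failure alike, because no `continue` above can skip it. */
        msg_destroy(m);
    }
    return parsed;
}
```

With a queue of three messages of which one has an empty payload, `consume_all` reports two parsed messages and leaves `live_messages` at zero. If an early `continue` on the failure branch is unavoidable, the destroy call has to be repeated before it.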
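- To summarize: while parsing Kafka messages, a few messages violated the business rules and took the error branch, which never called `rd_kafka_message_destroy`. That leaked memory and kept the reference count on the `rd_kafka_broker_t` object from dropping, so at exit the termination condition was never met and the broker thread looped forever.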