Fixing a pthread_join segmentation fault when using Tair

Original article

2018-08-24 10:39

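Recently a program of mine that accesses Tair kept crashing. Tracing and analysis showed the following cause.
     tair_client_impl::retrieve_server_addr calls:
            thread.start(this, reinterpret_cast<void *>(heart_type));
            response_thread.start(this, reinterpret_cast<void *>(response_type));
    Thread creation failed at this point, but the failure was not handled. Later, tair_client_impl::close calls:
             thread.join();
             response_thread.join();
    Because the thread had never been created, joining it caused a segmentation fault.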
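The failure mode above can be avoided by remembering whether the thread was really created and skipping the join otherwise. The following is a minimal sketch under that assumption; GuardedThread, set_flag and guarded_thread_demo are hypothetical names for illustration, not part of tbsys or the Tair client.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical wrapper (not the tbsys CThread class): it records whether
// pthread_create succeeded so that join() never touches an invalid thread id.
class GuardedThread {
public:
    GuardedThread() : tid_(), started_(false) {}

    // Returns true only if the thread was actually created.
    bool start(void *(*entry)(void *), void *arg) {
        int ret = pthread_create(&tid_, NULL, entry, arg);
        if (ret != 0) {
            // pthread_create returns the error number; it does not set errno.
            std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(ret));
            return false;
        }
        started_ = true;
        return true;
    }

    // Joins only a thread that was started; returns 0 if there is nothing to join.
    int join() {
        if (!started_) {
            return 0;
        }
        started_ = false;
        return pthread_join(tid_, NULL);
    }

private:
    pthread_t tid_;
    bool started_;
};

// Trivial thread body for the usage sketch below.
void *set_flag(void *arg) {
    *static_cast<int *>(arg) = 1;
    return NULL;
}

// Usage sketch: joining a thread that never started is harmless,
// and a started thread is joined exactly once.
void guarded_thread_demo() {
    GuardedThread idle;
    assert(idle.join() == 0);

    int flag = 0;
    GuardedThread worker;
    if (worker.start(set_flag, &flag)) {
        assert(worker.join() == 0);
        assert(flag == 1);
    }
    assert(worker.join() == 0);  // second join is a no-op
}
```

With a guard like this, a close() that always calls join() stays safe even when start() failed earlier.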

The detailed analysis and fix:
1. Debug the core dump with gdb:
        The stack obtained from the core dump:

#0 0x0000003a14c07fc3 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000000004abe6f in join(this=0x7f1df3ffe130) at /home/guojun8/lib/lib/include/tbsys/thread.h:51
#2 tair::tair_client_impl::close(this=0x7f1df3ffe130) at tair_client_api_impl.cpp:247
#3 0x00000000004b07a7 in tair::tair_client_impl::~tair_client_impl(this=0x7f1df3ffe130, __in_chrg=<value optimized out>) at tair_client_api_impl.cpp:83
#4 0x00000000004a58f0 in tair::new_tair_client(master_addr=<value optimized out>, slave_addr=<value optimized out>, group_name=<value optimized out>)
    at tair_client_api.cpp:584
#5 0x00000000004a5b43 in tair::tair_client_api::startup(this=0x7f1dd4001170, master_addr=0x7f1dd40010d8 "127.0.0.1:5198",
    slave_addr=0x7f1dd4001108 "127.0.0.1:5198", group_name=<value optimized out>) at tair_client_api.cpp:72
#6 0x0000000000447126 in imagestorage::Tair_Handler::Connect(this=0x7f1dd4000f90) at imagestorage/tair_handler.cc:10
#7 0x00000000004502cc in imagestorage::ImageHandler::FetchImage(this=0x1e8cb90, image_name=0x7f1dd4000908 "h00731dcfb73d42acc95f5a54e6088df117",
    image_norm_name=0x1e8e9a0 "\270\347\350\001", image_buffer=0x7f1de19fc010 "", image_size=0x7f1df3ffe894, schema="plaza", err_msg="")
    at imagestorage/image_handler.cc:213
....

2. Inspect the frame in gdb:


(gdb) f 2
#2 0x00000000004b3c72 in tair::tair_client_impl::close(this=0x7fa2bf4f3120) at tair_client_api_impl.cpp:248
warning: Source file is more recent than executable.
248 response_thread.join();
(gdb) p response_thread
$1 = {tid = 140336135386880, pid = 0, runnable = 0x7fa2bf4f3128, args = 0x1}    <=== pid = 0
(gdb)

Check the source:


static void *hook(void *arg) {
    CThread *thread = (CThread *) arg;
    // If the thread had started successfully, pid would not be 0,
    // so thread creation most likely failed.
    thread->pid = gettid();
    if (thread->getRunnable()) {
        thread->getRunnable()->run(thread, thread->getArgs());
    }
    return (void *) NULL;
}

3. Add logging:

ret_thread = thread.start(this, reinterpret_cast<void *>(heart_type));
if (!ret_thread) {
    TBSYS_LOG(ERROR, "create thread failed.");
}
ret_thread = response_thread.start(this, reinterpret_cast<void *>(response_type));
if (!ret_thread) {
    TBSYS_LOG(ERROR, "create response_thread failed.");
}

After rerunning, the log output below confirmed that thread creation had failed.

[2013-10-29 18:07:21.531977] WARN parse_invalidate_server (tair_client_api_impl.cpp:3449)[140336971073280] no invalid server info found.
[2013-10-29 18:07:21.532869] ERROR retrieve_server_addr (tair_client_api_impl.cpp:3434)[140336971073280] create response_thread failed.
[2013-10-29 18:07:21.532915] INFO transport.cpp:394[140336976336640] ADDIOC, SOCK: 24, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa270802ea0
[2013-10-29 18:07:21.532941] INFO transport.cpp:394[140337076029184] ADDIOC, SOCK: 25, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa2a0803c50

4. Get the reason pthread_create failed:


int ret = pthread_create(&tid, NULL, CThread::hook, this);
if (ret != 0) {
    // pthread_create returns the error number directly; it does not set errno.
    printf("pthread_create failed, ret = %s\n", strerror(ret));
}
assert(ret == 0);
return 0 == ret;

The resulting output:
     pthread_create failed, ret = Resource temporarily unavailable
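Because pthread functions report failure through their return value rather than through errno, it helps to format that value in one place. A small sketch of such a helper follows; describe_pthread_error is a hypothetical name, not something tbsys provides, and the exact text after the colon depends on the C library.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turns the return value of a pthread_* call into a
// readable message. EAGAIN is named explicitly because it is the value
// pthread_create returns when a resource limit is hit.
std::string describe_pthread_error(int ret) {
    if (ret == 0) {
        return "ok";
    }
    std::string name;
    if (ret == EAGAIN) {
        name = "EAGAIN";
    } else {
        name = "errno " + std::to_string(ret);
    }
    // strerror supplies the library's description of the error number.
    return name + ": " + std::strerror(ret);
}
```

Passing the value returned by pthread_create to this helper gives a log line that names EAGAIN directly, which points straight at a resource limit.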

5. Fix:
Looking up the error gives:
       EAGAIN not enough system resources to create a process for the new
              thread.

       EAGAIN more than PTHREAD_THREADS_MAX threads are already active.

./asm/errno.h:14:#define        EAGAIN          11      /* Try again */

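I suspected that the current user had exceeded its process limit (on Linux, threads count toward this limit as well):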
    [sre@WDDS-DEV-016 ~]$ ulimit -u
    1024
I raised the default value in /etc/security/limits.d/90-nproc.conf to 10240; for details see the article "ulimit limits: the nproc problem".
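For reference, the relevant entry might look like the fragment below after the change. This assumes the stock RHEL/CentOS 6 layout of that file, where the default line is `* soft nproc 1024`; the actual file on a given machine may differ.

```
# /etc/security/limits.d/90-nproc.conf (assumed stock layout)
*          soft    nproc     10240
root       soft    nproc     unlimited
```

The new limit applies to sessions started after the change, so log in again before checking it.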
After the change the value is 10240:
     [sre@WDDS-DEV-016 ~]$ ulimit -u
     10240
After raising the user process limit, the problem was solved.
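The same limit that `ulimit -u` prints can also be read from inside the program, which makes it easy to log at startup and spot a low value before thread creation starts failing. A minimal sketch, assuming Linux; read_nproc_soft_limit and log_nproc_limit are hypothetical helper names.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Hypothetical helper: reads the soft RLIMIT_NPROC value (what `ulimit -u` shows).
// Returns false if getrlimit fails.
bool read_nproc_soft_limit(rlim_t *out) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) {
        return false;
    }
    *out = lim.rlim_cur;
    return true;
}

// Prints the limit so that it ends up in the startup log.
void log_nproc_limit() {
    rlim_t soft = 0;
    if (!read_nproc_soft_limit(&soft)) {
        std::perror("getrlimit(RLIMIT_NPROC)");
        return;
    }
    if (soft == RLIM_INFINITY) {
        std::printf("max user processes: unlimited\n");
    } else {
        std::printf("max user processes: %llu\n",
                    static_cast<unsigned long long>(soft));
    }
}
```

Calling log_nproc_limit() once before the client starts its threads records the effective limit next to any later "create thread failed" message.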
