Recently, a program of ours that accesses Tair was crashing frequently. Tracing and analysis revealed the following cause.
tair_client_impl::retrieve_server_addr makes these calls:
thread.start(this, reinterpret_cast<void *>(heart_type));
response_thread.start(this, reinterpret_cast<void *>(response_type));
Thread creation fails at this point, but the failure is not handled. Later, tair_client_impl::close calls:
thread.join();
response_thread.join();
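Because the thread was never created, the join triggers a segmentation fault.
The detailed analysis and fix are as follows:
(1) Debugging the core dump with gdb:
The stack obtained from the core dump: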
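The failure mode above (joining a thread that was never created) can be guarded against in code. The following is a minimal sketch, not the tbsys CThread implementation: GuardedThread, set_flag and run_demo are hypothetical names introduced only for illustration. It assumes the wrapper owns the pthread_t and simply records whether pthread_create succeeded, so that join() becomes a no-op when there is nothing to join.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical wrapper (NOT the tbsys CThread API): it remembers whether
// pthread_create succeeded, so join() is safe to call unconditionally.
class GuardedThread {
public:
    GuardedThread() : tid_(), started_(false) {}

    // Returns true on success. On failure, logs the reason and leaves
    // the object in the "not started" state.
    bool start(void *(*fn)(void *), void *arg) {
        if (started_) return false;  // already running; join() it first
        int ret = pthread_create(&tid_, NULL, fn, arg);
        if (ret != 0) {
            // pthread_create returns the error number directly.
            std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(ret));
            return false;
        }
        started_ = true;
        return true;
    }

    // Joins only if a thread was actually created.
    // Returns true if a thread was joined, false if there was none.
    bool join() {
        if (!started_) return false;
        int ret = pthread_join(tid_, NULL);
        started_ = false;
        return ret == 0;
    }

    bool started() const { return started_; }

private:
    pthread_t tid_;
    bool started_;
};

// Trivial thread body used only by the usage example below.
void *set_flag(void *arg) {
    *static_cast<int *>(arg) = 1;
    return NULL;
}

// Usage example: returns 1 if the thread ran, 0 if it could not be created.
int run_demo() {
    GuardedThread never_started;
    assert(!never_started.join());  // no thread was created, so nothing is joined

    int flag = 0;
    GuardedThread t;
    if (t.start(set_flag, &flag)) {
        t.join();
    }
    return flag;
}
```

With a guard like this, a close() routine can call join() on every thread object regardless of whether its start() succeeded.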
#0 0x0000003a14c07fc3 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000000004abe6f in join(this=0x7f1df3ffe130) at /home/guojun8/lib/lib/include/tbsys/thread.h:51
#2 tair::tair_client_impl::close(this=0x7f1df3ffe130) at tair_client_api_impl.cpp:247
#3 0x00000000004b07a7 in tair::tair_client_impl::~tair_client_impl(this=0x7f1df3ffe130, __in_chrg=<value optimized out>) at tair_client_api_impl.cpp:83
#4 0x00000000004a58f0 in tair::new_tair_client(master_addr=<value optimized out>, slave_addr=<value optimized out>, group_name=<value optimized out>)
    at tair_client_api.cpp:584
#5 0x00000000004a5b43 in tair::tair_client_api::startup(this=0x7f1dd4001170, master_addr=0x7f1dd40010d8 "127.0.0.1:5198",
    slave_addr=0x7f1dd4001108 "127.0.0.1:5198", group_name=<value optimized out>) at tair_client_api.cpp:72
#6 0x0000000000447126 in imagestorage::Tair_Handler::Connect(this=0x7f1dd4000f90) at imagestorage/tair_handler.cc:10
#7 0x00000000004502cc in imagestorage::ImageHandler::FetchImage(this=0x1e8cb90, image_name=0x7f1dd4000908 "h00731dcfb73d42acc95f5a54e6088df117",
    image_norm_name=0x1e8e9a0 "\270\347\350\001", image_buffer=0x7f1de19fc010 "", image_size=0x7f1df3ffe894, schema="plaza", err_msg="")
    at imagestorage/image_handler.cc:213
....
(gdb) f 2
#2 0x00000000004b3c72 in tair::tair_client_impl::close(this=0x7fa2bf4f3120) at tair_client_api_impl.cpp:248
warning: Source file is more recent than executable.
248 response_thread.join();
(gdb) p response_thread
$1 = {tid = 140336135386880, pid = 0, runnable = 0x7fa2bf4f3128, args = 0x1}    ===============> pid = 0
(gdb)
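The thread entry function hook in CThread is what sets pid:
static void *hook(void *arg) {
    CThread *thread = (CThread *) arg;
    thread->pid = gettid();  // =========> if the thread had started successfully, pid would not be 0, so thread creation is suspected to have failed
    if (thread->getRunnable()) {
        thread->getRunnable()->run(thread, thread->getArgs());
    }
    return (void *) NULL;
}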
To confirm this, the return values of both start() calls were checked and logged:
ret_thread = thread.start(this, reinterpret_cast<void *>(heart_type));
if (!ret_thread) {
    TBSYS_LOG(ERROR, "create thread failed.");
}
ret_thread = response_thread.start(this, reinterpret_cast<void *>(response_type));
if (!ret_thread) {
    TBSYS_LOG(ERROR, "create response_thread failed.");
}
The log confirms that creating response_thread failed:
[2013-10-29 18:07:21.531977] WARN parse_invalidate_server (tair_client_api_impl.cpp:3449) [140336971073280] no invalid server info found.
[2013-10-29 18:07:21.532869] ERROR retrieve_server_addr (tair_client_api_impl.cpp:3434) [140336971073280] create response_thread failed.
[2013-10-29 18:07:21.532915] INFO transport.cpp:394[140336976336640] ADDIOC, SOCK: 24, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa270802ea0
[2013-10-29 18:07:21.532941] INFO transport.cpp:394[140337076029184] ADDIOC, SOCK: 25, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa2a0803c50
Next, the return value of pthread_create was printed:
int ret = pthread_create(&tid, NULL, CThread::hook, this);
if (ret != 0)
    printf("pthread_create failed, ret = %s\n", strerror(ret));
assert(ret == 0);
return 0 == ret;
This printed:
pthread_create failed, ret = Resource temporarily unavailable
5. Solution:
Looking up the error message gives:
EAGAIN not enough system resources to create a process for the new thread.
EAGAIN more than PTHREAD_THREADS_MAX threads are already active.
./asm/errno.h:14:#define EAGAIN 11 /* Try again */
This suggested that the current user had exceeded its limit on the number of processes:
[sre@WDDS-DEV-016 ~]$ ulimit -u
1024
Change the default value in /etc/security/limits.d/90-nproc.conf to 10240 (for details, see the article "The nproc problem with ulimit limits").
After the change, the limit is 10240:
[sre@WDDS-DEV-016 ~]$ ulimit -u
10240
After raising the per-user process limit, the problem was resolved.
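The same limit that `ulimit -u` reports can also be read from inside the process, which may help when logging a pthread_create failure. Below is a minimal sketch assuming Linux, where RLIMIT_NPROC is available and counts threads as well as processes for the real user ID. The helper names read_nproc_limit and print_nproc_limit are hypothetical and are not part of Tair or tbsys.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: reads the soft and hard RLIMIT_NPROC values for the
// calling process. Returns true on success.
bool read_nproc_limit(rlim_t *soft, rlim_t *hard) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {
        std::fprintf(stderr, "getrlimit failed: %s\n", std::strerror(errno));
        return false;
    }
    *soft = rl.rlim_cur;  // the value shown by `ulimit -u`
    *hard = rl.rlim_max;  // ceiling up to which the soft limit may be raised
    return true;
}

// Hypothetical helper: prints the soft limit in a form comparable to `ulimit -u`.
void print_nproc_limit() {
    rlim_t soft = 0, hard = 0;
    if (!read_nproc_limit(&soft, &hard)) return;
    assert(soft <= hard);  // the soft limit never exceeds the hard limit
    if (soft == RLIM_INFINITY) {
        std::printf("nproc soft limit: unlimited\n");
    } else {
        std::printf("nproc soft limit: %llu\n",
                    static_cast<unsigned long long>(soft));
    }
}
```

Calling print_nproc_limit() next to the pthread_create error message would show directly whether the process is running under a low nproc limit such as 1024.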