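My flamingo server (see here for an introduction to flamingo) has recently been crashing now and then while the process is being killed, for example with Ctrl + C or with kill plus the program's pid. The problem has little practical impact: the process is about to die anyway, so a crash on the way out hardly matters. Still, in the spirit of rigor and curiosity, I tracked it down. Below is a record of the debugging process; I hope readers find it useful.
Under normal circumstances my program handles the Ctrl+C signal and goes through its normal exit path, so I expected it not to crash. In practice it still did.
The main thread runs an EventLoop in an infinite loop, and when the program receives Ctrl+C, the signal handler sets the main thread's exit flag. The code is as follows: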
    int main(int argc, char* argv[])
    {
        // Set up signal handling
        signal(SIGCHLD, SIG_DFL);
        signal(SIGPIPE, SIG_IGN);
        signal(SIGINT, prog_exit);
        signal(SIGTERM, prog_exit);

        // Unrelated code omitted...

        g_mainLoop.loop();

        return 0;
    }
The signal handler is as follows:
    void prog_exit(int signo)
    {
        std::cout << "program recv signal [" << signo << "] to exit." << std::endl;

        g_mainLoop.quit();
    }
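As a side note, std::cout is not async-signal-safe, so a more conservative variant of this pattern keeps the handler down to setting a flag that the main loop polls. The following is only a minimal sketch of that idea: g_quitRequested, onExitSignal and installExitHandlers are hypothetical names, and it does not reflect how flamingo's EventLoop::quit() is implemented.

```cpp
#include <signal.h>

// Hypothetical exit flag; the post's real handler calls g_mainLoop.quit().
volatile sig_atomic_t g_quitRequested = 0;

// The handler only sets the flag, which is safe to do from signal context.
extern "C" void onExitSignal(int /*signo*/)
{
    g_quitRequested = 1;
}

// Installs the handler for SIGINT (Ctrl + C) and SIGTERM (default kill).
// Returns true if both registrations succeeded.
bool installExitHandlers()
{
    struct sigaction action = {};
    action.sa_handler = onExitSignal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;

    return sigaction(SIGINT, &action, nullptr) == 0
        && sigaction(SIGTERM, &action, nullptr) == 0;
}
```

A loop would then check g_quitRequested on each iteration and do its logging and cleanup outside the handler.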
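The logs contained nothing that helped explain the crash, so I enabled Linux's core dump generation, and one of the crashes produced a core file (core.9798 in the gdb session below).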
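Core dumps are usually enabled from the shell with ulimit -c unlimited before starting the program. The same limit can also be raised from inside the process. The sketch below shows that variant using the POSIX getrlimit and setrlimit calls; enableCoreDumps is a hypothetical helper name, not something taken from flamingo.

```cpp
#include <sys/resource.h>

// Raises the soft core-file size limit to the hard limit so that a crash
// is allowed to leave a core file. Returns true on success.
bool enableCoreDumps()
{
    struct rlimit limit;
    if (getrlimit(RLIMIT_CORE, &limit) != 0)
        return false;

    // An unprivileged process may raise its soft limit up to the hard limit.
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

If the hard limit itself is 0, this call succeeds but no core file will be written. Where the file ends up and how it is named is controlled by /proc/sys/kernel/core_pattern.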
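I then loaded it into gdb to look at the call stack at the time of the crash (step one: run gdb with the executable name and the core file name; step two: use the bt command to print the crash stack):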
    [zhangyl@iZ238vnojlyZ myimserver]$ gdb mychatserver core.9798
    GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-redhat-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /home/zhangyl/myimserver/mychatserver...done.
    [New LWP 9798]
    [New LWP 9802]
    [New LWP 9804]
    [New LWP 9800]
    [New LWP 9803]
    [New LWP 9801]
    [New LWP 9805]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    Core was generated by `./mychatserver -d'.
    Program terminated with signal 11, Segmentation fault.
    #0 0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
    45 int avail() const { return static_cast<int>(end() - cur_); }
    Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64
    (gdb) bt
    #0 0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
    #1 0x00000000004f4e20 in AsyncLogging::append (this=0x7ffd5d58ad50,
        logline=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
        at /home/zhangyl/myimserver/base/asynclogging.cpp:98
    #2 0x000000000053bfc8 in asyncOutput (msg=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
        at /home/zhangyl/myimserver/chatserversrc/main.cpp:33
    #3 0x00000000004f205d in Logger::~Logger (this=0x7ffd5d58fe60, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/base/logging.cpp:149
    #4 0x0000000000504bf1 in net::EventLoop::~EventLoop (this=0x7d1c00 <g_mainLoop>, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/net/eventloop.cpp:87
    #5 0x00007f7067d79e69 in __run_exit_handlers () from /lib64/libc.so.6
    #6 0x00007f7067d79eb5 in exit () from /lib64/libc.so.6
    #7 0x00007f7067d62b1c in __libc_start_main () from /lib64/libc.so.6
    #8 0x00000000004f0ba9 in _start ()
    (gdb)
The crash occurs in frame #0. Use the frame command (frame followed by the frame number; for example, frame 0 switches to frame #0) to switch to it:
    (gdb) frame 0
    #0 0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
    45 int avail() const { return static_cast<int>(end() - cur_); }
    (gdb) p this
    $1 = (const FixedBuffer<4000000> * const) 0x7f7067564010
    (gdb)
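The print command (p for short) shows the current object's this pointer. The this pointer holds an address, but that does not mean the object at that address is valid. Printing the object itself (p *this) shows that it is no longer valid: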
    (gdb) p *this
    Cannot access memory at address 0x7f7067564010
    (gdb)
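Why is this object invalid? Also, frames #3 and #4 show the destructor of the logging class Logger being called from inside EventLoop. Let's look first at the function in frame #4: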
    EventLoop::~EventLoop()
    {
        LOG_DEBUG << "EventLoop destructs in other thread";

        //std::stringstream ss;
        //ss << "eventloop destructs threadid = " << threadId_;
        //std::cout << ss.str() << std::endl;

        wakeupChannel_->disableAll();
        wakeupChannel_->remove();
        ::close(wakeupFd_);
        t_loopInThisThread = NULL;
    }
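Apart from printing one log line, there is no Logger-related code here, and EventLoop has no member variable of type Logger either (so no implicit destruction is involved). Now look at the code in frame #3: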
    Logger::~Logger()
    {
        impl_.finish();
        const LogStream::Buffer& buf(stream().buffer());
        g_output(buf.data(), buf.length());
        if (impl_.level_ == FATAL)
        {
            g_flush();
            abort();
        }
    }
The crash happens at g_output(buf.data(), buf.length()); and the contents of buf are exactly the log line printed in the EventLoop destructor:
EventLoop destructs in other thread
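Frames #3 and #4 are easier to read with the usual design of stream-style logging macros in mind: the macro creates a temporary logger object, and the finished line is handed to the output function when that temporary is destroyed at the end of the statement. The sketch below assumes that design. MiniLogger, MINI_LOG, captureLine and g_lastLine are hypothetical names for illustration, not flamingo's actual classes.

```cpp
#include <sstream>
#include <string>

// Hypothetical sink that just remembers the last line it received.
std::string g_lastLine;
void captureLine(const std::string& line) { g_lastLine = line; }

// Collects the pieces of one log line and emits them from its destructor.
class MiniLogger
{
public:
    std::ostringstream& stream() { return stream_; }

    // Runs when the temporary dies, i.e. at the end of the logging statement.
    ~MiniLogger() { captureLine(stream_.str()); }

private:
    std::ostringstream stream_;
};

// Each use creates a temporary MiniLogger and streams into it.
#define MINI_LOG MiniLogger().stream()

void logSomething()
{
    MINI_LOG << "EventLoop destructs in other thread";
}   // by the time the statement above has finished, the destructor has run
```

Under this assumption, a single logging statement inside ~EventLoop is enough to put a logger destructor on the stack, even though EventLoop owns no logger member.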
So why does g_output end up calling AsyncLogging::append (see frame #1)? A global search shows that g_output is a global variable that already has an initial value at its definition:
Logger::OutputFunc g_output = defaultOutput;
The initial value, defaultOutput, simply writes the log to the console:
    void defaultOutput(const char* msg, int len)
    {
        size_t n = fwrite(msg, 1, len, stdout);
        //FIXME check n
        (void)n;
    }
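In fwrite, the second argument (1) is the size in bytes of each element and the third (len) is the number of elements to write; neither is a file descriptor. The log goes to the console because the last argument is the stdout stream. (On Linux the file descriptors 0, 1 and 2 refer to standard input, standard output and standard error, but fwrite takes a FILE* stream, not a descriptor.)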
In main, however, we changed the logging output behavior so that logs go both to the console and to a log file:
    EventLoop g_mainLoop;
    AsyncLogging* g_asyncLog = NULL;

    void asyncOutput(const char* msg, int len)
    {
        if (g_asyncLog != NULL)
        {
            g_asyncLog->append(msg, len);
            std::cout << msg << std::endl;
        }
    }

    int main(int argc, char* argv[])
    {
        // Unrelated code omitted
        std::string strLogFileFullPath(logfilepath);
        strLogFileFullPath += logfilename;
        Logger::setLogLevel(Logger::DEBUG);
        int kRollSize = 500 * 1000 * 1000;
        AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
        log.start();
        g_asyncLog = &log;
        Logger::setOutput(asyncOutput);

        // Unrelated code omitted

        g_mainLoop.loop();

        return 0;
    }
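This is where the problem shows up. When main finishes and execution leaves its scope, the local log object is destroyed. The global object g_mainLoop is destroyed only afterwards, during exit() (frames #5 and #6 in the backtrace), and that runs the EventLoop destructor. The destructor prints the log line "EventLoop destructs in other thread", which still goes through asyncOutput. By then g_asyncLog no longer points to a valid object, yet it is not null either (it is a dangling pointer), so the null check passes and the call to append accesses memory that is no longer valid.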
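The ordering can be made visible with a small tracer. Observing real static destruction from inside a running program is awkward, so the sketch below is only a simulation: nested scopes stand in for the two lifetimes, with the outer scope playing the role of static storage and the inner one playing the role of main's stack frame. Tracer and simulateShutdown are hypothetical names.

```cpp
#include <string>
#include <vector>

// Records its own destruction in a shared event list.
struct Tracer
{
    Tracer(const std::string& name, std::vector<std::string>& events)
        : name_(name), events_(events) {}
    ~Tracer() { events_.push_back("~" + name_); }

    std::string name_;
    std::vector<std::string>& events_;
};

// Returns the order of events: "return from main", "~log", "~g_mainLoop".
std::vector<std::string> simulateShutdown()
{
    std::vector<std::string> events;
    {
        Tracer mainLoop("g_mainLoop", events);   // stands in for the global
        {
            Tracer log("log", events);           // stands in for main's local
            events.push_back("return from main");
        }   // main's locals are destroyed first
    }       // objects with static storage are destroyed afterwards
    return events;
}
```

Anything the later destructor does with a pointer to the earlier object, such as logging through g_asyncLog, happens after that object is already gone.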
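How do we fix this? There are several ways. Method one is to add a line after g_mainLoop.loop() that explicitly sets the global pointer g_asyncLog to null. The log line in the EventLoop destructor is then simply not written, because asyncOutput already checks for a null pointer. The code becomes: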
    int main(int argc, char* argv[])
    {
        // Unrelated code omitted
        std::string strLogFileFullPath(logfilepath);
        strLogFileFullPath += logfilename;
        Logger::setLogLevel(Logger::DEBUG);
        int kRollSize = 500 * 1000 * 1000;
        AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
        log.start();
        g_asyncLog = &log;
        Logger::setOutput(asyncOutput);

        // Unrelated code omitted

        g_mainLoop.loop();
        g_asyncLog = NULL;

        return 0;
    }
The drawback of this solution is that some log output is lost. Method two is to make the log object a global variable as well, that is, to turn g_asyncLog into an object rather than a pointer. That is the method used here. The modified code is as follows:
    // g_asyncLog is defined before g_mainLoop on purpose: globals in one
    // translation unit are destroyed in reverse order of definition, so the
    // logger is still alive when ~EventLoop writes its last log line.
    AsyncLogging g_asyncLog;

    EventLoop g_mainLoop;

    void asyncOutput(const char* msg, int len)
    {
        g_asyncLog.append(msg, len);
        std::cout << msg << std::endl;
    }

    int main(int argc, char* argv[])
    {
        // Unrelated code omitted...
        std::string strLogFileFullPath(logfilepath);
        strLogFileFullPath += logfilename;
        Logger::setLogLevel(Logger::DEBUG);
        int kRollSize = 1024 * 1024 * 1024;
        //AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
        g_asyncLog.setBaseName(strLogFileFullPath.c_str());
        g_asyncLog.setRollSize(kRollSize);
        g_asyncLog.start();
        Logger::setOutput(asyncOutput);

        // Unrelated code omitted...

        g_mainLoop.loop();

        return 0;
    }
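Since there are several ways to fix this, here is one more option that this post does not use: keep the logger local to main, and let a small scope guard switch the output function back to its previous value before the logger is destroyed. Late log lines then fall back to stdout instead of being dropped. This is only a sketch. ScopedOutput and g_asyncCalls are hypothetical, and g_output, defaultOutput and asyncOutput are simplified stand-ins for the names used in the post.

```cpp
#include <cstddef>
#include <cstdio>

typedef void (*OutputFunc)(const char* msg, int len);

// Fallback sink: plain stdout, like the post's defaultOutput.
void defaultOutput(const char* msg, int len)
{
    std::fwrite(msg, 1, static_cast<std::size_t>(len), stdout);
}

OutputFunc g_output = defaultOutput;

// Stand-in for the post's asyncOutput; here it only counts calls.
int g_asyncCalls = 0;
void asyncOutput(const char* /*msg*/, int /*len*/) { ++g_asyncCalls; }

// Installs a sink on construction and restores the previous one on destruction.
class ScopedOutput
{
public:
    explicit ScopedOutput(OutputFunc out) : previous_(g_output) { g_output = out; }
    ~ScopedOutput() { g_output = previous_; }

    ScopedOutput(const ScopedOutput&) = delete;
    ScopedOutput& operator=(const ScopedOutput&) = delete;

private:
    OutputFunc previous_;
};
```

In main, the guard would be declared after the local logger so that it is destroyed first. By the time the logger goes away, the output function already points at the fallback again.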
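After the change, rebuild the program and verify the fix by sending it Ctrl + C. By default gdb handles Ctrl + C itself (it interrupts the program and waits for user input), so we change gdb's settings so that it does not stop on this signal and instead passes it on to our program. Run the following command in gdb: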
handle SIGINT nostop print pass
Now the program can respond to Ctrl + C (SIGINT is the signal generated by Ctrl + C).
    (gdb) handle SIGINT nostop print pass
    SIGINT is used by the debugger.
    Are you sure you want to change it? (y or n) y
    Signal        Stop      Print   Pass to program Description
    SIGINT        No        Yes     Yes             Interrupt
    (gdb) r
    Starting program: /home/zhangyl/flamingoserver/chatserver
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x7ffff442b700 (LWP 11460)]
    [New Thread 0x7ffff3c2a700 (LWP 11461)]
    [Thread 0x7ffff3c2a700 (LWP 11461) exited]
    20180820 22:00:27 INFO show databases - DatabaseMysql.cpp:93

    20180820 22:00:27 INFO CMysqlManager::_IsDBExist, find database(flamingo) - MysqlManager.cpp:195

    20180820 22:00:27 INFO desc t_user - DatabaseMysql.cpp:93

    20180820 22:00:27 INFO desc t_user_relationship - DatabaseMysql.cpp:93

    20180820 22:00:27 INFO desc t_chatmsg - DatabaseMysql.cpp:93

    20180820 22:00:27 INFO SELECT f_user_id, f_username, f_nickname, f_password, f_facetype, f_customface, f_gender, f_birthday, f_signature, f_address, f_phonenumber, f_mail, f_teaminfo FROM t_user ORDER BY f_user_id DESC - DatabaseMysql.cpp:93

    20180820 22:00:27 INFO current base userid: 0, current base group id: 268435455 - UserManager.cpp:111

    [New Thread 0x7ffff3c2a700 (LWP 11462)]
    [New Thread 0x7ffff2a74700 (LWP 11463)]
    [New Thread 0x7ffff2273700 (LWP 11464)]
    [New Thread 0x7ffff1a72700 (LWP 11465)]
    [New Thread 0x7ffff1271700 (LWP 11466)]
    [New Thread 0x7ffff0a70700 (LWP 11467)]
    [New Thread 0x7fffd3fff700 (LWP 11468)]
    [New Thread 0x7fffd37fe700 (LWP 11469)]
    20180820 22:00:27 INFO chatserver initialization completed, now you can use client to connect it. - main.cpp:139

    ^C
    Program received signal SIGINT, Interrupt.
    program recv signal [2] to exit.
    Exit loop...
    Exit chatserver....
    [Thread 0x7ffff442b700 (LWP 11460) exited]
    [Thread 0x7fffd3fff700 (LWP 11468) exited]
    [Thread 0x7fffd37fe700 (LWP 11469) exited]
    [Thread 0x7ffff0a70700 (LWP 11467) exited]
    [Thread 0x7ffff1271700 (LWP 11466) exited]
    [Thread 0x7ffff1a72700 (LWP 11465) exited]
    [Thread 0x7ffff2273700 (LWP 11464) exited]
    [Thread 0x7ffff3c2a700 (LWP 11462) exited]
    [Thread 0x7ffff7fe5a80 (LWP 11459) exited]
    [Inferior 1 (process 11459) exited normally]
    (gdb)
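With that, the bug is verified as fixed: whether the program receives Ctrl + C or is killed with kill plus its pid, it now goes through the normal exit path instead of crashing. Did you pick something up along the way?
To sum up: as this example shows, a competent Linux backend developer needs more than familiarity with the business code itself. We also need to be fluent with gdb's commands and to understand some operating system mechanisms (for example, how to make a crashing program produce a core file). If you have not mastered these yet, I strongly recommend practicing with gdb.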