Linux Server Development in Practice (1): Tracking Down a Crash in the Flamingo Server

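My flamingo server (see the project's introduction for background on flamingo) has recently been crashing occasionally when the process is killed, for example with Ctrl + C or with kill followed by the process pid. The crash has little practical impact: the process is about to die anyway, so crashing on the way out hardly matters. Still, in the spirit of rigor and curiosity, I tracked it down. Below is a record of the debugging process, which I hope readers will find useful.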

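The program handles the Ctrl + C signal and is supposed to go through its normal exit path, so I expected it not to crash. In practice it still did.

The main thread runs an EventLoop in an infinite loop. When the program receives the Ctrl + C signal, the handler sets the main loop's quit flag. The code is as follows: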

int main(int argc, char* argv[])
{
    // Set up signal handling
    signal(SIGCHLD, SIG_DFL);
    signal(SIGPIPE, SIG_IGN);
    signal(SIGINT, prog_exit);
    signal(SIGTERM, prog_exit);

    // Unrelated code omitted...

    g_mainLoop.loop();

    return 0;
}

The signal handler is as follows:

void prog_exit(int signo)
{
    std::cout << "program recv signal [" << signo << "] to exit." << std::endl;

    g_mainLoop.quit();
}
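One caveat about handlers like this: std::cout is not async-signal-safe, so printing from inside a signal handler is not guaranteed to be safe. A more conservative pattern only records the request in a flag and leaves logging and cleanup to normal code. The following is a minimal sketch of that pattern, not flamingo's code; the names g_exitRequested, on_exit_signal, install_exit_handlers and exit_requested are made up for illustration, and how the event loop would notice the flag depends on EventLoop internals that are not shown in this article.

```cpp
#include <csignal>

// Set from the handler, read from normal (non-handler) code.
volatile std::sig_atomic_t g_exitRequested = 0;

// Hypothetical handler: storing to a volatile sig_atomic_t is the only work done here.
extern "C" void on_exit_signal(int)
{
    g_exitRequested = 1;
}

// Install the same handler for Ctrl + C (SIGINT) and kill (SIGTERM).
void install_exit_handlers()
{
    std::signal(SIGINT, on_exit_signal);
    std::signal(SIGTERM, on_exit_signal);
}

// Polled from normal code, which can then log and shut down safely.
bool exit_requested()
{
    return g_exitRequested != 0;
}
```

Normal code would call install_exit_handlers() once at startup and check exit_requested() at a point where it is safe to print and quit.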

The logs contained nothing that helped explain the crash, so I enabled Linux's core dump mechanism. One of the crashes then produced a core file (core.9798, used below).
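For reference, core dumps are commonly enabled from the shell with ulimit -c unlimited before starting the program. The same limit can also be raised from inside the process. The following is a minimal sketch, assuming a POSIX system; the function name enable_core_dumps is made up for illustration and is not part of flamingo. Where the core file is written, and under what name, depends on the system's core_pattern setting.

```cpp
#include <sys/resource.h>

// Raise the soft core-file size limit to the hard limit so that a crash
// may produce a core file. Returns true on success.
bool enable_core_dumps()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return false;

    // A process may always raise its soft limit up to its hard limit.
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

If the hard limit itself is 0, this call succeeds but still leaves core dumps disabled, so the shell or service configuration has to raise the hard limit first.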

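I then used gdb to look at the call stack at the time of the crash. The first step is the command gdb followed by the executable name and the core file name; the second is the bt command, which prints the backtrace: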

[zhangyl@iZ238vnojlyZ myimserver]$ gdb mychatserver core.9798
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/zhangyl/myimserver/mychatserver...done.
[New LWP 9798]
[New LWP 9802]
[New LWP 9804]
[New LWP 9800]
[New LWP 9803]
[New LWP 9801]
[New LWP 9805]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./mychatserver -d'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
45              int avail() const { return static_cast<int>(end() - cur_); }
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64
(gdb) bt
#0  0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
#1  0x00000000004f4e20 in AsyncLogging::append (this=0x7ffd5d58ad50, 
    logline=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
    at /home/zhangyl/myimserver/base/asynclogging.cpp:98
#2  0x000000000053bfc8 in asyncOutput (msg=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
    at /home/zhangyl/myimserver/chatserversrc/main.cpp:33
#3  0x00000000004f205d in Logger::~Logger (this=0x7ffd5d58fe60, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/base/logging.cpp:149
#4  0x0000000000504bf1 in net::EventLoop::~EventLoop (this=0x7d1c00 <g_mainLoop>, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/net/eventloop.cpp:87
#5  0x00007f7067d79e69 in __run_exit_handlers () from /lib64/libc.so.6
#6  0x00007f7067d79eb5 in exit () from /lib64/libc.so.6
#7  0x00007f7067d62b1c in __libc_start_main () from /lib64/libc.so.6
#8  0x00000000004f0ba9 in _start ()
(gdb)

The crash happens in frame #0. Use the frame command (frame followed by the frame number; frame 0 selects frame #0) to switch to it:

(gdb) frame 0
#0  0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
45              int avail() const { return static_cast<int>(end() - cur_); }
(gdb) p this
$1 = (const FixedBuffer<4000000> * const) 0x7f7067564010
(gdb)

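The print command (abbreviated p) shows the this pointer of the current object. The pointer holds an address, but that does not mean the object it points to is still valid. Printing the object itself (p *this) shows that it is no longer there: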

(gdb) p *this
Cannot access memory at address 0x7f7067564010
(gdb)

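Why is this object gone? Frames #3 and #4 show that the destructor of the logging class Logger was called from inside the EventLoop destructor. Let's look at the function at frame #4 first: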

EventLoop::~EventLoop()
{
    LOG_DEBUG << "EventLoop destructs in other thread";

    //std::stringstream ss;
    //ss << "eventloop destructs threadid = " << threadId_;
    //std::cout << ss.str() << std::endl;

    wakeupChannel_->disableAll();
    wakeupChannel_->remove();
    ::close(wakeupFd_);
    t_loopInThisThread = NULL;
}

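Apart from printing one log line, nothing here touches the Logger class, and EventLoop has no member of type Logger either, so there is no implicit member destruction. The Logger destructor in frame #3 therefore has to come from the LOG_DEBUG line itself (eventloop.cpp:87 in the backtrace): the this pointer in frame #3 is a stack address, which suggests the macro creates a temporary Logger that is destroyed at the end of the statement. Here is the code at frame #3: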

Logger::~Logger()
{
    impl_.finish();
    const LogStream::Buffer& buf(stream().buffer());
    g_output(buf.data(), buf.length());
    if (impl_.level_ == FATAL)
    {
        g_flush();
        abort();
    }
}

The call that leads to the crash is g_output(buf.data(), buf.length()); and the content of buf is exactly the log line printed in the EventLoop destructor:

EventLoop destructs in other thread

So why does g_output end up calling AsyncLogging::append (see frame #1)? A global search shows that g_output is a global variable that is given an initial value where it is defined:

Logger::OutputFunc g_output = defaultOutput;

That initial value, defaultOutput, simply writes the log to the console:

void defaultOutput(const char* msg, int len)
{
    size_t n = fwrite(msg, 1, len, stdout);
    //FIXME check n
    (void)n;
}

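In this fwrite call, the second argument, 1, is the size in bytes of each element, and the third, len, is the number of elements, so the call writes len bytes. The destination is the fourth argument, stdout. The 1 here has nothing to do with file descriptor numbers, where 0, 1 and 2 stand for standard input, standard output and standard error.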
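To make the meaning of the fwrite arguments and its return value concrete, here is a small sketch. The helper name write_all is made up for illustration; it also performs the check that the FIXME comment in defaultOutput leaves open.

```cpp
#include <cstdio>

// Writes len bytes of msg to stream and reports whether all of them were accepted.
bool write_all(const char* msg, std::size_t len, std::FILE* stream)
{
    // Argument 2 is the element size (1 byte), argument 3 the element count (len).
    // fwrite returns the number of complete elements written.
    std::size_t n = std::fwrite(msg, 1, len, stream);
    return n == len;
}
```

Calling write_all(msg, len, stdout) would behave like defaultOutput while reporting short writes to the caller.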

In main, however, we changed the output behavior so that logs go both to the console and to a log file:

EventLoop g_mainLoop;
AsyncLogging* g_asyncLog = NULL;

void asyncOutput(const char* msg, int len)
{
    if (g_asyncLog != NULL)
    {
        g_asyncLog->append(msg, len);
        std::cout << msg << std::endl;
    }
}

int main(int argc, char* argv[])
{
    // Unrelated code omitted
    std::string strLogFileFullPath(logfilepath);
    strLogFileFullPath += logfilename;
    Logger::setLogLevel(Logger::DEBUG);
    int kRollSize = 500 * 1000 * 1000;
    AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
    log.start();
    g_asyncLog = &log;
    Logger::setOutput(asyncOutput);

    // Unrelated code omitted

    g_mainLoop.loop();

    return 0;
}

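This is where the problem shows itself. When main returns, the local object log goes out of scope and is destroyed. Only after that, during exit processing (frames #5 to #7 in the backtrace), is the global object g_mainLoop destroyed. Its destructor prints the log line "EventLoop destructs in other thread", which still goes through asyncOutput. By then g_asyncLog is not null but points to an object that no longer exists (a dangling pointer), so calling append through it is undefined behavior, which in this case shows up as an occasional segmentation fault.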
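The ordering can be reproduced in miniature. The sketch below is not flamingo's code: Sink, Loop, output and run are hypothetical stand-ins for AsyncLogging, EventLoop, asyncOutput and main, and nested scopes are used in place of a real global so that the order can be observed from a test. The g_sinkAlive flag exists only in the demo; it marks the point where the real program would touch a dead object instead of actually doing so.

```cpp
#include <string>
#include <vector>

std::vector<std::string> g_events;  // what happened, in order
bool g_sinkAlive = false;           // demo-only bookkeeping; real code has no such flag

// Hypothetical stand-in for AsyncLogging.
struct Sink {
    Sink()  { g_sinkAlive = true; }
    ~Sink() { g_sinkAlive = false; g_events.push_back("sink destroyed"); }
};

Sink* g_sink = nullptr;             // plays the role of g_asyncLog

// Plays the role of asyncOutput: only a null check guards the pointer.
void output(const std::string& msg)
{
    if (g_sink == nullptr) {
        g_events.push_back("dropped: " + msg);
    } else if (!g_sinkAlive) {
        // The real program would call append() on a destroyed object here.
        g_events.push_back("dangling: " + msg);
    } else {
        g_events.push_back("logged: " + msg);
    }
}

// Hypothetical stand-in for EventLoop: it logs from its destructor.
struct Loop {
    ~Loop() { output("loop destructs"); }
};

// With resetPointer == true the pointer is nulled before the sink dies.
void run(bool resetPointer)
{
    g_events.clear();
    Loop loop;                      // outlives the sink, like the global g_mainLoop
    {
        Sink sink;                  // like the local `log` in main
        g_sink = &sink;
        output("running");
        if (resetPointer)
            g_sink = nullptr;
    }                               // sink destroyed here
}                                   // loop destroyed last; its destructor logs
```

After run(false), g_events ends with a "dangling:" entry, which corresponds to the crash site; after run(true) the late message is recorded as "dropped:" instead.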

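How can this be fixed? There are several options. Method one is to add a line after g_mainLoop.loop() that explicitly sets the global pointer g_asyncLog to null. The log line in the EventLoop destructor is then simply not printed, because asyncOutput already checks for a null pointer. The code becomes: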

int main(int argc, char* argv[])
{
    // Unrelated code omitted
    std::string strLogFileFullPath(logfilepath);
    strLogFileFullPath += logfilename;
    Logger::setLogLevel(Logger::DEBUG);
    int kRollSize = 500 * 1000 * 1000;
    AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
    log.start();
    g_asyncLog = &log;
    Logger::setOutput(asyncOutput);

    // Unrelated code omitted

    g_mainLoop.loop();
    g_asyncLog = NULL;

    return 0;
}

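The drawback of this approach is that some log lines are lost. Method two is to make the log object a global variable as well, that is, to turn g_asyncLog into an object instead of a pointer. We adopt method two here. The modified code is as follows: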

// Defined before g_mainLoop on purpose: globals in one translation unit are
// destroyed in reverse order of definition, so the logger outlives the
// EventLoop destructor that still writes a log line.
AsyncLogging g_asyncLog;

EventLoop g_mainLoop;

void asyncOutput(const char* msg, int len)
{
    g_asyncLog.append(msg, len);
    std::cout << msg << std::endl;
}

int main(int argc, char* argv[])
{
    // Unrelated code omitted...
    std::string strLogFileFullPath(logfilepath);
    strLogFileFullPath += logfilename;
    Logger::setLogLevel(Logger::DEBUG);
    int kRollSize = 1024 * 1024 * 1024;
    //AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
    g_asyncLog.setBaseName(strLogFileFullPath.c_str());
    g_asyncLog.setRollSize(kRollSize);
    g_asyncLog.start();
    Logger::setOutput(asyncOutput);

    // Unrelated code omitted...

    g_mainLoop.loop();

    return 0;
}

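After making the change and rebuilding, we need to verify it by sending the program a Ctrl + C signal. By default gdb handles Ctrl + C itself (it stops the program and waits for user input). We can change gdb's settings so that it does not stop on this signal and passes it on to our program instead, by running the following command in gdb: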

handle SIGINT nostop print pass 

The program can then respond to Ctrl + C (SIGINT is the signal generated by Ctrl + C).

(gdb)  handle SIGINT nostop print pass 
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop      Print   Pass to program Description
SIGINT        No        Yes     Yes             Interrupt
(gdb) r
Starting program: /home/zhangyl/flamingoserver/chatserver 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff442b700 (LWP 11460)]
[New Thread 0x7ffff3c2a700 (LWP 11461)]
[Thread 0x7ffff3c2a700 (LWP 11461) exited]
20180820 22:00:27 INFO  show databases - DatabaseMysql.cpp:93

20180820 22:00:27 INFO  CMysqlManager::_IsDBExist, find database(flamingo) - MysqlManager.cpp:195

20180820 22:00:27 INFO  desc t_user - DatabaseMysql.cpp:93

20180820 22:00:27 INFO  desc t_user_relationship - DatabaseMysql.cpp:93

20180820 22:00:27 INFO  desc t_chatmsg - DatabaseMysql.cpp:93

20180820 22:00:27 INFO  SELECT f_user_id, f_username, f_nickname, f_password,  f_facetype, f_customface, f_gender, f_birthday, f_signature, f_address, f_phonenumber, f_mail, f_teaminfo FROM t_user ORDER BY  f_user_id DESC - DatabaseMysql.cpp:93

20180820 22:00:27 INFO  current base userid: 0, current base group id: 268435455 - UserManager.cpp:111

[New Thread 0x7ffff3c2a700 (LWP 11462)]
[New Thread 0x7ffff2a74700 (LWP 11463)]
[New Thread 0x7ffff2273700 (LWP 11464)]
[New Thread 0x7ffff1a72700 (LWP 11465)]
[New Thread 0x7ffff1271700 (LWP 11466)]
[New Thread 0x7ffff0a70700 (LWP 11467)]
[New Thread 0x7fffd3fff700 (LWP 11468)]
[New Thread 0x7fffd37fe700 (LWP 11469)]
20180820 22:00:27 INFO  chatserver initialization completed, now you can use client to connect it. - main.cpp:139

^C
Program received signal SIGINT, Interrupt.
program recv signal [2] to exit.
Exit loop...
Exit chatserver....
[Thread 0x7ffff442b700 (LWP 11460) exited]
[Thread 0x7fffd3fff700 (LWP 11468) exited]
[Thread 0x7fffd37fe700 (LWP 11469) exited]
[Thread 0x7ffff0a70700 (LWP 11467) exited]
[Thread 0x7ffff1271700 (LWP 11466) exited]
[Thread 0x7ffff1a72700 (LWP 11465) exited]
[Thread 0x7ffff2273700 (LWP 11464) exited]
[Thread 0x7ffff3c2a700 (LWP 11462) exited]
[Thread 0x7ffff7fe5a80 (LWP 11459) exited]
[Inferior 1 (process 11459) exited normally]
(gdb) 

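With that, the bug is fixed and the fix is verified: when the program receives Ctrl + C or is killed with kill plus its pid, it now goes through the normal exit path instead of crashing. I hope you picked up something along the way.

To sum up: as this example shows, a competent Linux backend developer needs to know more than the business code. It also takes fluency with gdb's commands and some understanding of operating system mechanisms, such as how to make a crashing program produce a core file. If you have not mastered these yet, it is well worth spending time practicing with gdb.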
