Analyzing a coredump with no symbols and a corrupted stack (great read)

https://blog.csdn.net/qazxlf/article/details/50385616

Original post: http://zhangzhibiao02005.blog.163.com/blog/static/37367820201482044137298/

Related: debugging a corrupted stack frame in a coredump file with gdb

https://blog.csdn.net/muclenerd/article/details/48005171

 

During yesterday's release we hit a coredump with no symbols and a corrupted stack. This post summarizes how to track down this kind of core:

1. Open the core:

gdb /tmp/gss core.201409111809.dump.gss

2. Look at the backtrace:

(gdb) bt

#0  0x0000000000000000 in ?? ()

#1  0x0000000000000000 in ?? ()

Ouch... the backtrace is complete garbage.

3. Look at the registers:

(gdb) info registers

rax            0x1      1

rbx            0x715ba81359dbfdee       8168307150231961070

rcx            0x302afc63dc     206879613916

rdx            0x1      1

rsi            0x58323e90       1479687824

rdi            0x39     57

rbp            0x58323ec0       0x58323ec0

rsp            0x58323e80       0x58323e80

r8             0xffffffffffffffff       -1

r9             0x1999faf23e27a68        115298840044796520

r10            0xa      10

r11            0x202    514

r12            0x7f29ceb3ccd0   139817538276560

r13            0x583245af       1479689647

r14            0x21809000       562073600

r15            0x7f24ab8c16e0   139795473635040

rip            0x0      0

eflags         0x10207  66055

cs             0x33     51

ss             0x2b     43

ds             0x0      0

es             0x0      0

fs             0x63     99

gs             0x0      0

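rbp and rsp look plausible.

But rip is 0, which is definitely wrong: it means the next instruction would be fetched from address 0, and that obviously crashes.

This suggests that the current thread's stack has been overwritten.

Now look at the contents of the current thread's stack, starting at rsp: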

(gdb) x/32 0x58323e80

0x58323e80:     0x00000000      0x00000000      0x00000000      0x00000000

0x58323e90:     0x00000001      0x821d183c      0x00007f24      0x00000000

0x58323ea0:     0x00000000      0x00000000      0x00000000      0x00000000

0x58323eb0:     0x00000000      0x00000000      0x00000000      0x00000000

0x58323ec0:     0x00000000      0x00000000      0x00000000      0x00000000

0x58323ed0:     0x00000000      0x00000000      0x00000000      0x00000000

0x58323ee0:     0x00000000      0x00000000      0x00000000      0x00000000

0x58323ef0:     0x00000000      0x00000000      0x00000000      0x00000000

This confirms again that the current thread's stack has been overwritten. So which thread overwrote it? We have to keep digging.

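4. Look at the threads:

(gdb) thread

[Current thread is 1 (process 14554)]

No luck: the core contains no useful stack information at all.

5. Check dmesg, where the kernel records what happened to the process: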

gss[14554]: segfault at 0 ip (null) sp 0000000058323e80 error 14 in gss[400000+112c000]          

 

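// Note on the line above: program name, thread ID, the faulting address (after "at"), ip is the instruction address, sp is the stack pointer, then the program name gss again, followed by the address 400000 where it was loaded and the mapping length 112c000.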
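To make the field layout described above concrete, here is a minimal sketch that splits such a dmesg line into its parts. The function name `parse_segfault` and the dictionary keys are illustrative choices, not anything from the original post, and the pattern is only written against the line format shown in this dmesg output. It assumes the number after "error" is printed in hexadecimal, like the other numeric fields.

```python
import re

# Pattern for the kernel's user-space segfault report, matching the format
# of the dmesg lines quoted in this post (assumption: all numbers are hex).
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<tid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>\(null\)|[0-9a-f]+) sp (?P<sp>\(null\)|[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def _hex(text):
    """Convert a hex field to int; the kernel prints a zero pointer as (null)."""
    return 0 if text == "(null)" else int(text, 16)


def parse_segfault(line):
    """Return the fields of a segfault line as a dict, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    g = m.groupdict()
    return {
        "comm": g["comm"],              # program name
        "tid": int(g["tid"]),           # thread ID
        "fault_addr": _hex(g["addr"]),  # address whose access faulted
        "ip": _hex(g["ip"]),            # instruction pointer
        "sp": _hex(g["sp"]),            # stack pointer
        "error": int(g["error"], 16),   # page-fault error code
        "object": g["obj"],             # mapped object containing ip
        "base": _hex(g["base"]) if g["base"] else None,  # load address
        "size": _hex(g["size"]) if g["size"] else None,  # mapping length
    }


line = "gss[14554]: segfault at 0 ip (null) sp 0000000058323e80 error 14 in gss[400000+112c000]"
print(parse_segfault(line))
```

For the line above, the parsed sp is 0x58323e80, the same value gdb reported for rsp, which ties this dmesg entry to the core being examined.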

                                                                                 

INFO: task gss:4461 blocked for more than 120 seconds.                                                                                                                            

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                                                                                         

gss           D ffff880c40f51800     0  4461  13021 0x00000000                                                                                                                    

 ffff8808bbcf3900 0000000000000082 ffff8808a39cfc20 ffffffff81023a15                                                                                                              

 000000000000002c ffffffff810dedff ffff8808bbcf6ae0 0000008000000000                                                                                                              

 ffff8808bbcf6d88 0000000000012180 0000000000012180 000000000000d698                                                                                                              

Call Trace:                                                                                                                                                                       

 [<ffffffff81023a15>] ? get_user_pages_fast+0xbc/0x170                                                                                                                            

 [<ffffffff810dedff>] ? block_write_end+0x4e/0x59                                                                                                                                 

 [<ffffffff81053cd0>] ? get_futex_key+0x172/0x181                                                                                                                                 

 [<ffffffff81053cfd>] ? get_futex_value_locked+0x1e/0x2d                                                                                                                          

 [<ffffffff810751e2>] ? __delayacct_add_tsk+0x16e/0x17d                                                                                                                           

 [<ffffffff81037171>] ? exit_mm+0x85/0x10e                                                                                                                                        

 [<ffffffff81038c5e>] ? do_exit+0x1f2/0x686                                                                                                                                       

 [<ffffffff81039162>] ? do_group_exit+0x70/0x97                                                                                                                                   

 [<ffffffff81041d36>] ? get_signal_to_deliver+0x308/0x328                                                                                                                         

 [<ffffffff81001ef8>] ? do_signal+0x6c/0x6c2                                                                                                                                      

 [<ffffffff81002573>] ? do_notify_resume+0x25/0x64                                                                                                                                

 [<ffffffff810befb6>] ? sys_write+0x62/0x6e                                                                                                                                       

 [<ffffffff81002c89>] ? int_signal+0x12/0x17                                                                                                                                      

INFO: task gss:4631 blocked for more than 120 seconds.                         

......

INFO: task gss:14525 blocked for more than 120 seconds.                                                                                                                           
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                                                                                         
gss           D ffff88089c4ed698     0 14525  13021 0x00000000                                                                                                                    
 ffff88085cd9e3c0 0000000000000082 0000000000000000 ffff8808bcea4000                                                                                                              
 0000000000000099 ffffffff810dedff ffff8808bbcf4020 0000000c2f62d225                                                                                                              
 ffff8808bbcf42c8 0000000000012180 0000000000012180 000000000000d698                                                                                                              
Call Trace:                                                                                                                                                                       
 [<ffffffff810dedff>] ? block_write_end+0x4e/0x59                                                                                                                                 
 [<ffffffff810834a6>] ? generic_file_buffered_write+0x210/0x297                                                                                                                   
 [<ffffffff81037171>] ? exit_mm+0x85/0x10e                                                                                                                                        
 [<ffffffff81038c5e>] ? do_exit+0x1f2/0x686                                                                                                                                       
 [<ffffffff8138a758>] ? thread_return+0x3e/0x106                                                                                                                                  
 [<ffffffff81039162>] ? do_group_exit+0x70/0x97                                                                                                                                   
 [<ffffffff81041d36>] ? get_signal_to_deliver+0x308/0x328                                                                                                                         
 [<ffffffff81001ef8>] ? do_signal+0x6c/0x6c2                                                                                                                                      
 [<ffffffff8104aede>] ? update_rmtp+0x4b/0x5e                                                                                                                                     
 [<ffffffff8104bb7a>] ? hrtimer_nanosleep+0xbd/0x108                                                                                                                              
 [<ffffffff81002573>] ? do_notify_resume+0x25/0x64                                                                                                                                
 [<ffffffff81002c89>] ? int_signal+0x12/0x17                                                                                                                                      
gss[27354]: segfault at 40333868 ip 00007f9ebc0054c8 sp 0000000040333800 error 6 in appstore-mid.so[7f9ebbe40000+36d000]      

                                                    
gss[13733]: segfault at 0 ip (null) sp 00000000619c7e80 error 14 in gss[400000+112c000]                                                                                           
gss[1246]: segfault at 0 ip (null) sp 0000000063c86e80 error 14 in gss[400000+112c000]                                                                                            
gss[16829]: segfault at 0 ip (null) sp 000000006b009e80 error 14 in gss[400000+112c000]         

In the application log, the last entry from thread 14554 is:

WARNING: 09-11 18:09:10:  gss. * 14554 [  logid:6feb2ab664e183d8  ][  reqip:  ][default_handler.cpp:751]not match any rule

 

One line in the dmesg output above (shown in bold in the original post) is very important:

gss[27354]: segfault at 40333868 ip 00007f9ebc0054c8 sp 0000000040333800 error 6 in appstore-mid.so[7f9ebbe40000+36d000]

Let's analyze it.

6. According to the core file's timestamps, the first core happened at:

$ stat ~work/opdir/coresave/core.201409111809.dump.gss                                                                      
  File: `/home/work/opdir/coresave/core.201409111809.dump.gss'
  Size: 63571140608     Blocks: 124283768  IO Block: 4096   regular file
Device: 803h/2051d      Inode: 82477057    Links: 1
Access: (0644/-rw-r--r--)  Uid: (  500/    work)   Gid: (  502/    work)
Access: 2014-09-11 18:20:16.000000000 +0800
Modify: 2014-09-11 18:12:25.000000000 +0800
Change: 2014-09-11 18:12:25.000000000 +0800

But the wf log (log.wf) shows that the process crashed and restarted again at around 18:18:48:

WARNING: 09-11 18:18:45:  gss. * 27486 [  logid:7045a6705e0ff029  ][  reqip:10.40.26.15 uip:112.97.36.78  ][gss_work.cpp:1005]get_info() fail[6813]
WARNING: 09-11 18:18:45:  gss. * 27486 [  logid:7045a6705e0ff029  ][  reqip:10.40.26.15 uip:112.97.36.78  ][gss_work.cpp:1005]get_info() fail[6814]
WARNING: 09-11 18:18:45:  gss. * 27486 [  logid:7045a6705e0ff029  ][  reqip:10.40.26.15 uip:112.97.36.78  ][gss_work.cpp:1005]get_info() fail[6693]
WARNING: 09-11 18:18:45:  gss. * 27486 [  logid:7045a6705e0ff029  ][  reqip:10.40.26.15 uip:112.97.36.78  ][gss_work.cpp:1005]get_info() fail[6852]
WARNING: 09-11 18:18:48:  gss. * 10875 --------------------------------------------------------------- open log.wf ------------
WARNING: 09-11 18:18:48:  gss. * 10875 [  logid:  ][  reqip:  ][ub_conf.cpp:698]int [_svr_gss_query_read_bufsize] no found, use default value [1024000]
WARNING: 09-11 18:18:48:  gss. * 10875 [  logid:  ][  reqip:  ][ub_conf.cpp:698]int [_svr_gss_query_write_bufsize] no found, use default value [1024000]
WARNING: 09-11 18:18:48:  gss. * 10875 [  logid:  ][  reqip:  ][ub_conf.cpp:698]int [_svr_gss_cmd_bufsize] no found, use default value [1024000]
WARNING: 09-11 18:18:48:  gss. * 10875 [  logid:  ][  reqip:  ][ub_conf.cpp:762]load uint [_svr_gss_query_netio_threadnum] fail, use default value [1]
WARNING: 09-11 18:18:48:  gss. * 10875 [  logid:  ][  reqip:  ][ub_conf.cpp:762]load uint [_svr_gss_query_callback_directly] fail, use default value [0]
WARNING: 09-11 18:18:48:  gss. * 10875 [  logid:  ][  reqip:  ][ub_conf.cpp:499]no found [_svr_gss_cmd_port] range check item

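No core was dumped for this crash, though. The previous core only finished dumping at 18:20, and ops limit back-to-back cores so that they cannot fill the disk, so this one was skipped.

Combining this with the dmesg message above: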

gss[27354]: segfault at 40333868 ip 00007f9ebc0054c8 sp 0000000040333800 error 6 in appstore-mid.so[7f9ebbe40000+36d000]    

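Thread 27354 crashed while accessing address 40333868. The instruction pointer is 00007f9ebc0054c8 and the stack pointer is 0000000040333800. error 6 is the kernel's page-fault error code (here a user-mode write to a page that is not mapped); it is not needed to locate the crash.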
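The error field can still be read as a sanity check. The following is a small sketch that decodes the x86 page-fault error code bits; the function name `decode_pf_error` is illustrative, and it assumes the value in the dmesg line is hexadecimal, so "error 14" is 0x14.

```python
# x86 page-fault error code bits.
PF_PROT = 0x01   # 0: page not present, 1: protection violation
PF_WRITE = 0x02  # 0: read access, 1: write access
PF_USER = 0x04   # 0: kernel mode, 1: user mode
PF_RSVD = 0x08   # reserved bit set in a page-table entry
PF_INSTR = 0x10  # fault happened on an instruction fetch


def decode_pf_error(code):
    """Describe each bit of an x86 page-fault error code as a list of strings."""
    parts = [
        "protection violation" if code & PF_PROT else "page not present",
        "write" if code & PF_WRITE else "read",
        "user mode" if code & PF_USER else "kernel mode",
    ]
    if code & PF_RSVD:
        parts.append("reserved bit set")
    if code & PF_INSTR:
        parts.append("instruction fetch")
    return parts


print(decode_pf_error(0x6))   # the appstore-mid.so crash
print(decode_pf_error(0x14))  # the "ip (null)" crashes in gss

# Distance between the faulting address and the stack pointer.
print(hex(0x40333868 - 0x40333800))
```

Under that reading, error 6 is a user-mode write to an unmapped page, and error 14 is a user-mode instruction fetch from an unmapped page, which fits ip being null. The faulting address also sits 0x68 bytes above sp, the same displacement used by the instruction at 1c54c8 in the disassembly later in this post.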

The crash is inside appstore-mid.so, which was loaded at address 7f9ebbe40000 with a mapping length of 36d000.

 

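Our .so is compiled with -fPIC, so the addresses inside it are relative. Symbols only get their final addresses when gss starts and the dynamic linker loads the library.

So, to find out which line of appstore-mid.so crashed, one more calculation is needed: convert the absolute address into an offset relative to the .so.

This is simple. Subtract the load address from the ip register value: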

00007f9ebc0054c8  -  7f9ebbe40000 = 1C54C8
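The same subtraction as a minimal sketch, with a range check against the mapping length added so that a mismatched ip and load address is caught early. The helper name `so_offset` is hypothetical; the three values come from the dmesg line above.

```python
def so_offset(ip, base, size):
    """Return ip as an offset into a mapping loaded at base with length size."""
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip %#x is outside the mapping at %#x+%#x" % (ip, base, size))
    return offset


# ip, load address and mapping length of appstore-mid.so from dmesg.
offset = so_offset(0x00007f9ebc0054c8, 0x7f9ebbe40000, 0x36d000)
print(hex(offset))  # 0x1c54c8
```

The result is the address to look up in the objdump and addr2line output for the .so.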

 

Then disassemble appstore-mid.so with objdump:

$ objdump -d /tmp/appstore-mid.so |grep -i 1C54C8
  1c54c8:       48 89 7c 24 68          mov    %rdi,0x68(%rsp)

This pins down the instruction at that address.

Open the disassembly in vim and look at the surrounding context:

00000000001c54b0 <_ZN16InnerSearchFrame16pack_data_us_midEP16gss_client_req_t>:

  1c54b0:       41 57                   push   %r15

  1c54b2:       41 56                   push   %r14

  1c54b4:       41 55                   push   %r13

  1c54b6:       41 54                   push   %r12

  1c54b8:       55                      push   %rbp

  1c54b9:       53                      push   %rbx

  1c54ba:       48 81 ec 58 ff 5b 00    sub    $0x5bff58,%rsp

  1c54c1:       48 8b 1d 20 84 2a 00    mov    2786336(%rip),%rbx        # 46d8e8 <_DYNAMIC+0x658>

  1c54c8:       48 89 7c 24 68          mov    %rdi,0x68(%rsp)

  1c54cd:       48 8b 03                mov    (%rbx),%rax

 

The instruction belongs to this function:

int InnerSearchFrame::pack_data_us_mid(gss_client_req_t* csi)

 

To get the exact source line, use addr2line:

$ addr2line -e /tmp/appstore-mid.so 1c54c8 -f 
_ZN16InnerSearchFrame16pack_data_us_midEP16gss_client_req_t
/home/scmbuild/workspaces_cluster/ps.se.gss.kv.kv_1-3-124_BRANCH/ps/se/gss/kv/so/appstore_innersearch.cpp:361
 

The rest is fairly easy. Reviewing the function and its callers shows that this cpp file allocates very large arrays on the stack:

    bs_page_res bps;                                                                                                                                                              

    bs_page_res new_bps;  

By a rough estimate, these two arrays take close to 6 MB.

The stack size limit on our production machines is:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 515362
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 30720
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
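As a cross-check on the rough 6 MB estimate, the frame size can be read straight from the function prologue in the disassembly above (sub $0x5bff58,%rsp) and compared with the 10240 KB limit. This is only a sketch: the helper name `frame_report` is made up for illustration, and the stack size actually given to each worker thread is not shown in this post, so the limit is used purely as a reference point.

```python
FRAME_BYTES = 0x5bff58   # from "sub $0x5bff58,%rsp" in pack_data_us_mid
STACK_LIMIT_KB = 10240   # from "stack size (kbytes, -s) 10240"


def frame_report(frame_bytes, limit_kb):
    """Express a stack frame size in MiB and as a fraction of the stack limit."""
    limit_bytes = limit_kb * 1024
    return {
        "frame_mib": frame_bytes / 2**20,
        "limit_mib": limit_bytes / 2**20,
        "fraction": frame_bytes / limit_bytes,
    }


report = frame_report(FRAME_BYTES, STACK_LIMIT_KB)
print("frame: %.2f MiB, limit: %.2f MiB, fraction: %.0f%%"
      % (report["frame_mib"], report["limit_mib"], report["fraction"] * 100))
```

A single call to this function reserves about 5.75 MiB, more than half of the limit, before counting any of the frames above or below it.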

 

The cause of the core is now located: a thread's stack overflowed and overwrote other threads' data, which eventually led to the core dump.


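Before the release we had actually predicted that this change might cause a core, but neither QA nor the pipeline produced one, so it went out. Production has more resources and heavier traffic, and that finally triggered the coredump.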

 

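Summary:

  1. When writing code, avoid allocating large arrays directly on the stack, and especially avoid writing char arr[BIG_BUF_SIZ] = {0}; that statement is equivalent to a memset of the whole array, and in production it can add several milliseconds of latency all by itself.

  2. This change only touched one macro, but the crash was not in code that uses the macro directly. It was in code that uses a large array whose size depends on the macro, which indirectly enlarged the buffer. When changing macros like this in the future, carefully check every struct and class that depends on them.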
