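Most readers are probably already familiar with signal(), sigaction() and the other signal-handling calls. What I want to cover here is a common trap in using them: a signal handler must be reentrant (in POSIX terms, it may only call async-signal-safe functions). If the handler is not reentrant, all kinds of strange problems can follow.
The "Reentrant Functions" section of Advanced Programming in the UNIX Environment puts it this way:
"But in a signal handler, we can't tell where the process was executing when the signal was caught. What if the process was in the middle of a malloc call, allocating additional memory on its heap, and the signal handler that is inserted at that point calls malloc as well? What happens then?"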
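The concept of a reentrant function is not hard to understand, but many people overlook it when they actually work with signals, especially when the non-reentrant function is well hidden. I have twice run into deadlocks in real projects caused by a signal handler calling a non-reentrant function. In one of them, after the program had been running for a while, the process essentially stopped responding to external commands and almost stopped writing logs (only a few simple polling threads still printed something periodically), yet ps showed the process was still running. My first reaction was that the process had deadlocked. I attached gdb to the process and looked at the stack of every thread. Sure enough, many threads were stuck in a malloc call: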
Thread 152 (Thread 0x7f020abf5700 (LWP 7801)):
#0 0x00000032120f6dde in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000321207c59b in _L_lock_9495 () from /lib64/libc.so.6
#2 0x0000003212079b86 in malloc () from /lib64/libc.so.6
#3 0x00000030142bd09d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#4 0x00000000005b9092 in __gnu_cxx::new_allocator<std::_List_node<Memory::TSmartObjectPtr<CPacketBase> > >::allocate(unsigned long, void const*) ()
#5 0x00000000005b8f10 in std::_List_base<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::_M_g
#6 0x00000000005b8cff in std::list<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::_M_create_
#7 0x00000000005b889b in std::list<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::_M_insert(
#8 0x00000000005b8020 in std::list<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::push_back(
#9 0x00000000006194cd in CProtoParser::parser(char*, unsigned int) ()
#10 0x0000000000618fb5 in CProtoParser::putDataLen(unsigned int) ()
#11 0x00000000006666be in CConnection::handle_input(int) ()
#12 0x000000000069595a in NetFramework::CNetThread::handle_netevent(NetFramework::list_node*) ()
#13 0x0000000000695bbf in NetFramework::CNetThread::ThreadProc(Infra::CThreadLite&) ()
#14 0x00000000006a3f38 in (anonymous namespace)::InternalThreadBody(void*) ()
#15 0x0000003212407851 in start_thread () from /lib64/libpthread.so.0
#16 0x00000032120e767d in clone () from /lib64/libc.so.6
------------------------------------------------------------------------------------------------------------------------
Thread 19 (Thread 0x7f01019ec700 (LWP 7939)):
#0 0x00000032120f6dde in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000321207bede in _L_lock_44 () from /lib64/libc.so.6
#2 0x0000003212074d4c in ptmalloc_lock_all () from /lib64/libc.so.6
#3 0x00000032120ab9a5 in fork () from /lib64/libc.so.6
#4 0x0000003212067c07 in _IO_proc_open@@GLIBC_2.2.5 () from /lib64/libc.so.6
#5 0x0000003212067ef9 in popen@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6 0x0000000000548746 in os::shell(std::basic_ostream<char, std::char_traits<char> >*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#7 0x00000000005f5a5c in CDiskPerfCollector::IsCollectorRunning() ()
#8 0x00000000005f5766 in CDiskPerfCollector::threadProc() ()
#9 0x00000000006a3f38 in (anonymous namespace)::InternalThreadBody(void*) ()
#10 0x0000003212407851 in start_thread () from /lib64/libpthread.so.0
#11 0x00000032120e767d in clone () from /lib64/libc.so.6
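At this point some people might suspect a bug in glibc's malloc. It is not. A closer look reveals what is really going on: one thread (Thread 63 below) was in the middle of malloc when it was diverted into a signal handler, and the handler called a system API that internally called malloc again. The glibc source shows that malloc takes a lock internally, and that lock is not recursive. If the interrupted call already holds the lock and the signal handler then enters malloc again on the same thread, the result is naturally a deadlock. Even when no deadlock occurs, the second call is very likely to corrupt the global state malloc maintains internally, leading to inexplicable crashes later on.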
Thread 63 (Thread 0x7f010b3f9700 (LWP 7890)):
#0 0x00000032120f6dde in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000321207c59b in _L_lock_9495 () from /lib64/libc.so.6
#2 0x0000003212079b86 in malloc () from /lib64/libc.so.6
#3 0x000000321180cb8d in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#4 0x0000003211812a11 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#5 0x000000321180e196 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x000000321181246a in _dl_open () from /lib64/ld-linux-x86-64.so.2
#7 0x00000032121250a0 in do_dlopen () from /lib64/libc.so.6
#8 0x000000321180e196 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00000032121251f7 in __libc_dlopen_mode () from /lib64/libc.so.6
#10 0x00000032120fd5f5 in init () from /lib64/libc.so.6
#11 0x000000321240cb23 in pthread_once () from /lib64/libpthread.so.0
#12 0x00000032120fd6f4 in backtrace () from /lib64/libc.so.6
#13 0x0000000000614363 in printStackTrace() ()
#14 0x000000000061497a in interruptTrigger(int, siginfo*, void*) ()
#15 <signal handler called>
#16 0x0000003212078c33 in _int_malloc () from /lib64/libc.so.6
#17 0x0000003212079b91 in malloc () from /lib64/libc.so.6
#18 0x00000030142bd09d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#19 0x00000000006982ce in NetFramework::CSockAddrStorage::CSockAddrStorage() ()
#20 0x000000000066639d in CConnection::attach(NetFramework::CSockStream&) ()
#21 0x0000000000620be0 in CServiceBase::init(NetFramework::CSockStream&) ()
#22 0x00000000005c1193 in CSession::init(NetFramework::CSockStream&) ()
#23 0x00000000005705a8 in CDNServer::accept(NetFramework::CSockStream&) ()
#24 0x000000000061f3e4 in CServer::Internal::handle_input(int) ()
#25 0x000000000069595a in NetFramework::CNetThread::handle_netevent(NetFramework::list_node*) ()
#26 0x0000000000695bbf in NetFramework::CNetThread::ThreadProc(Infra::CThreadLite&) ()
#27 0x00000000006a3f38 in (anonymous namespace)::InternalThreadBody(void*) ()
#28 0x0000003212407851 in start_thread () from /lib64/libpthread.so.0
#29 0x00000032120e767d in clone () from /lib64/libc.so.6
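Because thread LWP 7890 entered malloc a second time while handling the signal and deadlocked on itself, many other threads got stuck as soon as they reached malloc. Those threads may in turn have been holding application-level locks, so the deadlock spread quickly until almost the entire process was stuck.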
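It is also worth pointing out that calls to malloc can be well hidden, for example behind an assignment to a std::string or a logging call, so it is easy to fall into this trap without noticing. The code involved here was in fact written by some of the more senior engineers on our team. Their intent was to use backtrace and related functions to print the current thread's stack to the log when certain special signals arrived, to make problems easier to locate. Little did they know that this seemingly clever approach would cause a far more complicated problem. The rule that a signal handler must be reentrant therefore deserves the utmost care in real code and should always be kept in mind.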
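If a stack dump from inside the handler is still wanted, one possible mitigation is sketched below. It relies on behaviour documented in the glibc backtrace(3) manual page: backtrace() performs a one-time lazy initialisation on first use (the pthread_once / dlopen / malloc path visible in Thread 63), and backtrace_symbols_fd() writes to a file descriptor without calling malloc. The names warm_up_backtrace, dump_stack_to_fd and MAX_FRAMES are hypothetical and not taken from the project. Neither function is on the POSIX async-signal-safe list, so this is a sketch that assumes glibc, not a guarantee of safety.

```c
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 64   /* hypothetical limit on captured frames */

/* Call once at program start, outside any signal handler. The first call
 * to backtrace() triggers its one-time lazy initialisation (which may
 * allocate memory), so that work is done in a normal context. */
static int warm_up_backtrace(void)
{
    void *frames[MAX_FRAMES];
    return backtrace(frames, MAX_FRAMES);
}

/* Writes the current stack to fd. backtrace_symbols_fd() is used instead of
 * backtrace_symbols() because the latter returns a malloc'ed array.
 * Returns the number of frames captured. */
static int dump_stack_to_fd(int fd)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

A program would call warm_up_backtrace() early in main() and dump_stack_to_fd(STDERR_FILENO) from the handler. Any logging around it that formats strings or allocates memory would reintroduce the original problem.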
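In general, a signal handler should do as little as possible. A common approach is to set a flag and let another thread do the real work once it sees the flag, instead of doing that work directly inside the signal handler.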
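The flag approach can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the names g_dump_requested, on_signal, install_flag_handler and consume_dump_request are all hypothetical. The handler only stores to a volatile sig_atomic_t variable, which is the one kind of object the C standard allows a handler to write safely.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag: set by the handler, consumed by ordinary code. */
static volatile sig_atomic_t g_dump_requested = 0;

/* The handler records that the signal arrived and does nothing else:
 * no malloc, no logging, no locks. */
static void on_signal(int signo)
{
    (void)signo;
    g_dump_requested = 1;
}

/* Installs on_signal for signo. Returns 0 on success, -1 on failure. */
static int install_flag_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* restart interrupted system calls where supported */
    return sigaction(signo, &sa, NULL);
}

/* Called from a normal thread or main loop. Returns 1 and clears the flag
 * if a request is pending, otherwise returns 0. Signals that arrive close
 * together are coalesced into a single request. */
static int consume_dump_request(void)
{
    if (!g_dump_requested)
        return 0;
    g_dump_requested = 0;
    return 1;
}
```

A worker loop would call install_flag_handler(SIGUSR1) once at startup and then poll consume_dump_request() periodically. When it returns 1, the loop can print stacks or write logs freely, because it is running in an ordinary context where malloc is safe to call.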