This article walks through the debugging of a spinlock deadlock. The point is not the bug itself, but to show how a kernel bug can be analyzed and to offer one approach for reference.
I. Symptoms
The kernel panicked and kdump captured a vmcore. The log in the vmcore that directly led to the panic (including the stack on the affected CPU) is:
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 18
Pid: 12410, comm: xxxx Not tainted 2.6.32-220.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff814f8464>] ? panic+0x8b/0x156
[<ffffffff810dac1a>] ? watchdog_overflow_callback+0x1fa/0x200
[<ffffffff8110cb3d>] ? __perf_event_overflow+0x9d/0x230
[<ffffffff8110d0f4>] ? perf_event_overflow+0x14/0x20
[<ffffffff8101e396>] ? intel_pmu_handle_irq+0x336/0x550
[<ffffffff814fe156>] ? kprobe_exceptions_notify+0x16/0x450
[<ffffffff814fcc39>] ? perf_event_nmi_handler+0x39/0xb0
[<ffffffff814fe7a5>] ? notifier_call_chain+0x55/0x80
[<ffffffff814fe80a>] ? atomic_notifier_call_chain+0x1a/0x20
[<ffffffff81097dce>] ? notify_die+0x2e/0x30
[<ffffffff814fc3c3>] ? do_nmi+0x173/0x2c0
[<ffffffff814fbcd0>] ? nmi+0x20/0x30
[<ffffffff814fb465>] ? _spin_lock_irq+0x25/0x40
<<EOE>> [<ffffffff814f95ec>] ? wait_for_common+0x3c/0x180
[<ffffffff814fb58d>] ? _spin_unlock_irqrestore+0x1d/0x20
[<ffffffff814f97c3>] ? wait_for_completion_timeout+0x13/0x20
[<ffffffffa006e8ee>] ? _ctl_do_mpt_command+0x3be/0xce0 [mpt2sas]
[<ffffffff8121bfcb>] ? avc_has_perm_noaudit+0x9b/0x470
[<ffffffff814fb587>] ? _spin_unlock_irqrestore+0x17/0x20
[<ffffffffa0070159>] ? _ctl_ioctl_main+0xdb9/0x12d0 [mpt2sas]
[<ffffffffa0070725>] ? _ctl_ioctl+0x35/0x50 [mpt2sas]
[<ffffffff8118fb72>] ? vfs_ioctl+0x22/0xa0
[<ffffffff8118fd14>] ? do_vfs_ioctl+0x84/0x580
[<ffffffff81190291>] ? sys_ioctl+0x81/0xa0
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
II. Analysis
1. Initial analysis
When the kernel hits an exception and panics, there is always a message stating the direct cause. In this case:
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 18
Reading the kernel code confirms that the nmi watchdog detected a hard lockup (hard LOCKUP). I will not go into how the nmi watchdog works; material on it is easy to google.
A hard lockup detected by the nmi watchdog means this CPU core deadlocked with interrupts disabled.
From the stack, the place that finally deadlocked is _spin_lock_irq, i.e. the CPU is stuck on an interrupt-disabling spin_lock. If that lock can never be acquired, the CPU is indeed deadlocked with interrupts off, which is exactly the condition the nmi watchdog detects. So the preliminary conclusion is that the direct cause is _spin_lock_irq never obtaining the lock.
Note: the stack here is a bit deceptive. Glancing at wait_for_common, you might think the CPU is waiting for a completion. If that were the case, the process would be in D state, would have been scheduled away instead of hogging the CPU, and the nmi watchdog would never fire. So read carefully.
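To make "hard LOCKUP" concrete: a hypothetical toy module like the one below (purely illustrative, unrelated to this case) would produce the same watchdog report, because the second spin_lock_irq spins forever with local interrupts disabled:

#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

static int __init lockup_demo_init(void)
{
	spin_lock_irq(&demo_lock);	/* takes the lock and disables local IRQs */
	spin_lock_irq(&demo_lock);	/* self-deadlock: spins with IRQs off forever;
					 * only the NMI (not maskable) still fires, so
					 * the nmi watchdog reports a hard LOCKUP */
	return 0;
}
module_init(lockup_demo_init);

MODULE_LICENSE("GPL");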
2. In-depth analysis
1) The wrong track
The next step is to figure out why the spinlock could not be acquired, and analyzing that thoroughly is where it gets complicated.
For a deadlock like this, the habitual line of thought is: it is a deadlock, so some process must be holding the lock without releasing it, which is why this process can never get it.
So we go hunting for the holder. The process holding the lock could be in one of the following states (ordered by likelihood):
a. RUNNING, currently executing on some other CPU.
b. RUNNING, but not currently scheduled.
c. D state, waiting for some task to complete.
d. S state.
e. Already exited; the process no longer exists.
a and b are clearly the most likely; c, d and e would be outright kernel bugs. For c and d, a spinlock holder must not sleep, and the kernel checks for and warns about sleeping with a spinlock held. For e, a process would have taken the spinlock and exited without releasing it (on some error path, say), effectively leaking it; the kernel has static code checkers for that, and while such mistakes are easy to make in ordinary user-space programs, they are extremely rare in the kernel. Likelihood aside, for this particular problem that whole line of thought leads into a trap. Following it, the analysis goes roughly as below (this is a common and perfectly reasonable approach; it just does not suit this problem):
(1) See what every CPU is running
crash> bt -a
PID: 18176 TASK: ffff8802b37100c0 CPU: 0 COMMAND: "monitor.s"
[exception RIP: lock_kernel+46]
...
PID: 18371 TASK: ffff880f5e5460c0 CPU: 1 COMMAND: "bash"
[exception RIP: lock_kernel+53]
...
PID: 18334 TASK: ffff880f5f219540 CPU: 2 COMMAND: "monitor.sh"
[exception RIP: lock_kernel+46]
...
PID: 15042 TASK: ffff880ec0015540 CPU: 3 COMMAND: "xxx"
[exception RIP: __bitmap_empty+115]
...
--- <NMI exception stack> ---
#6 [ffff880eba59ddd8] __bitmap_empty at ffffffff81281f93
#7 [ffff880eba59dde0] flush_tlb_others_ipi at ffffffff810480d8
#8 [ffff880eba59de30] native_flush_tlb_others at ffffffff81048156
#9 [ffff880eba59de60] flush_tlb_mm at ffffffff8104832c
#10 [ffff880eba59de80] unmap_region at ffffffff8114744f
#11 [ffff880eba59def0] do_munmap at ffffffff81147aa6
#12 [ffff880eba59df50] sys_munmap at ffffffff81147be6
#13 [ffff880eba59df80] system_call_fastpath at ffffffff8100b0f2
...
PID: 12410 TASK: ffff8818681cd540 CPU: 18 COMMAND: "xxxx"    (the CPU that finally triggered the nmi watchdog)
[exception RIP: _spin_lock_irq+37]
...
--- <NMI exception stack> ---
#13 [ffff88186635bba8] _spin_lock_irq at ffffffff814fb465
#14 [ffff88186635bbb0] wait_for_common at ffffffff814f95ec
#15 [ffff88186635bc40] wait_for_completion_timeout at ffffffff814f97c3
#16 [ffff88186635bc50] _ctl_do_mpt_command at ffffffffa006e8ee [mpt2sas]
#17 [ffff88186635bd30] _ctl_ioctl_main at ffffffffa0070159 [mpt2sas]
#18 [ffff88186635be30] _ctl_ioctl at ffffffffa0070725 [mpt2sas]
#19 [ffff88186635be60] vfs_ioctl at ffffffff8118fb72
#20 [ffff88186635bea0] do_vfs_ioctl at ffffffff8118fd14
#21 [ffff88186635bf30] sys_ioctl at ffffffff81190291
PID: 18122 TASK: ffff880e9f5c94c0 CPU: 19 COMMAND: "xxxxx"
[exception RIP: _spin_lock+33]...
--- <NMI exception stack> ---
#6 [ffff8802b4e05d58] _spin_lock at ffffffff814fb541
#7 [ffff8802b4e05d60] flush_tlb_others_ipi at ffffffff81048019
#8 [ffff8802b4e05db0] native_flush_tlb_others at ffffffff81048156
#9 [ffff8802b4e05de0] flush_tlb_mm at ffffffff8104832c
#10 [ffff8802b4e05e00] mprotect_fixup at ffffffff811491b0
#11 [ffff8802b4e05f20] sys_mprotect at ffffffff811494e5
#12 [ffff8802b4e05f80] system_call_fastpath at ffffffff8100b0f2
...
Except for CPUs 3, 18 and 19, every CPU is blocked on lock_kernel (the Big Kernel Lock; in old kernels most file operations had to take the BKL, which badly hurt performance, and newer kernels have gradually removed it). CPU 18 is exactly the one that triggered the nmi watchdog, and its stack shows the path entered through ioctl, which is precisely a path that takes the BKL:
crash> l vfs_ioctl
...
crash> l
44
45         if (filp->f_op->unlocked_ioctl) {
46                 error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
47                 if (error == -ENOIOCTLCMD)
48                         error = -EINVAL;
49                 goto out;
50         } else if (filp->f_op->ioctl) {
51                 lock_kernel();
52                 error = filp->f_op->ioctl(filp->f_path.dentry->d_inode,
53                                           filp, cmd, arg);
So all CPUs other than 3, 18 and 19 are blocked because of CPU 18.
Now look at CPU 19. It is also blocked on a spinlock, but inside flush_tlb_others_ipi. Taking CPU 3 together, CPU 3 is in the same flow but blocked at a different spot (__bitmap_empty). Reading flush_tlb_others_ipi confirms that __bitmap_empty is reached after the spinlock has been acquired:
crash> l flush_tlb_others_ipi
...
185        /*
186         * Could avoid this lock when
187         * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is
188         * probably not worth checking this for a cache-hot lock.
189         */
190        spin_lock(&f->tlbstate_lock);
191
crash> l
192        f->flush_mm = mm;
193        f->flush_va = va;
194        if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
195                /*
196                 * We have to send the IPI only to
197                 * CPUs affected.
198                 */
199                apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
200                        INVALIDATE_TLB_VECTOR_START + sender);
So CPU 19 is blocked because of CPU 3. Then why is CPU 3 blocked?
Reading flush_tlb_others_ipi and __bitmap_empty further: flush_tlb_others_ipi sends an inter-processor interrupt (IPI) to make the other CPUs flush their TLBs, which happens when page tables are updated and in similar operations. It must then block until every other CPU has handled the IPI and finished the related work (the TLB flush). But CPU 18 (the core that triggered the nmi watchdog) has interrupts disabled (_spin_lock_irq disables them), so it can never respond to the IPI or do that work, and CPU 3 blocks forever.
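For reference, the waiting part of flush_tlb_others_ipi in 2.6.32-era sources looks roughly like this (quoted from memory, so treat it as a sketch): after sending the IPI it spins until every target CPU clears its bit in flush_cpumask from inside the IPI handler, and cpumask_empty is what shows up as __bitmap_empty in the stack:

	apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
	        INVALIDATE_TLB_VECTOR_START + sender);
	while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
		cpu_relax();	/* spins until all target CPUs ack; a CPU with
				 * IRQs disabled (CPU 18 here) never acks */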
In summary, every blocked CPU is blocked because of CPU 18. This round of analysis found no process holding the spinlock, but at least the dependencies are now clear.
(2) Next, look at the other RUNNING processes
crash> ps |grep RU|wc -l
838
More than 800. That looks odd; normally there should never be this many, or the machine would be in serious performance trouble. But with every CPU locked up, a pile of RUNNING processes that never get scheduled is plausible. Far too many to inspect by hand, so script it:
crash> ps |grep RU|awk '{print $1}' > running_task_pid
Edit running_task_pid in Vim and use visual-block mode to prepend "bt " to every line (Ctrl+v, then Shift+i), then run:
crash> < running_task_pid > running_task_stack
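As an aside, if your crash build allows piping through external commands, a hypothetical one-liner can build the command file without the Vim step (the file name running_task_cmds is made up):

crash> ps | grep RU | awk '{print "bt", $1}' > running_task_cmds
crash> < running_task_cmds > running_task_stack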
With the stacks of all RUNNING processes collected, analysis shows the vast majority are sitting in schedule:
crash> bt 27685
PID: 27685 TASK: ffff88185936e080 CPU: 3 COMMAND: "java"
#0 [ffff880d29d49d38] schedule at ffffffff814f8b42
#1 [ffff880d29d49e00] schedule_timeout at ffffffff814f99d4
#2 [ffff880d29d49eb0] sys_epoll_wait at ffffffff811c1019
#3 [ffff880d29d49f80] system_call_fastpath at ffffffff8100b0f2
These were presumably woken but never got scheduled; nothing suspicious. So move on to the stacks of the D-state processes.
(3) Check the stacks of all D-state processes
crash> ps |grep UN|wc -l
46
Going through them one by one turns up nothing suspicious either. Grit the teeth and look at the S-state processes too? That has limited value: even if an S-state process did once take the lock, its stack has most likely left the locking context by now and would show nothing, and there are far too many of them. Keep going anyway, with the script:
(4) Check the stacks of all S-state processes
crash> ps |grep IN|wc -l
3056
Another pass still shows nothing suspicious. Dead end; no way forward. One might go further and suppose a process took the lock and then exited, which is possible, but that can no longer be traced from the vmcore.
In fact, the problem was not what we assumed, a lock held and never released by someone else. Start from that habitual assumption and this is as far as the analysis can go, because the very first step took the wrong road. The root mistake is failing to start "from the final scene the problem left behind", "from the essence of the problem". In a way, analyzing a kernel problem is like detective work on a criminal case: what matters most is the scene, and the clues must be picked out of the scene itself. With no scene, on experience and inference alone, many cases would never be solved.
2) The right approach
Start from the most direct scene the problem left behind:
crash> bt
PID: 12410 TASK: ffff8818681cd540 CPU: 18 COMMAND: "xxxx"
#0 [ffff88109c6c7af0] machine_kexec at ffffffff8103237b
#1 [ffff88109c6c7b50] crash_kexec at ffffffff810ba552
#2 [ffff88109c6c7c20] panic at ffffffff814f846b
#3 [ffff88109c6c7ca0] watchdog_overflow_callback at ffffffff810dac1a
#4 [ffff88109c6c7cd0] __perf_event_overflow at ffffffff8110cb3d
#5 [ffff88109c6c7d70] perf_event_overflow at ffffffff8110d0f4
#6 [ffff88109c6c7d80] intel_pmu_handle_irq at ffffffff8101e396
#7 [ffff88109c6c7e90] perf_event_nmi_handler at ffffffff814fcc39
#8 [ffff88109c6c7ea0] notifier_call_chain at ffffffff814fe7a5
#9 [ffff88109c6c7ee0] atomic_notifier_call_chain at ffffffff814fe80a
#10 [ffff88109c6c7ef0] notify_die at ffffffff81097dce
#11 [ffff88109c6c7f20] do_nmi at ffffffff814fc3c3
#12 [ffff88109c6c7f50] nmi at ffffffff814fbcd0
[exception RIP: _spin_lock_irq+37]
RIP: ffffffff814fb465 RSP: ffff88186635bba8 RFLAGS: 00000002
RAX: 0000000000000000 RBX: 0000000000002710 RCX: 000000000000fc92
RDX: 0000000000000001 RSI: 0000000000002710 RDI: ffff881054610960
RBP: ffff88186635bba8 R8: 0000000000000000 R9: ffff881055c438c0
R10: 0000000000000000 R11: 0000000000000006 R12: ffff881054610958
R13: ffff881054610960 R14: 0000000000000002 R15: ffff881054610938
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff88186635bba8] _spin_lock_irq at ffffffff814fb465
#14 [ffff88186635bbb0] wait_for_common at ffffffff814f95ec
#15 [ffff88186635bc40] wait_for_completion_timeout at ffffffff814f97c3
#16 [ffff88186635bc50] _ctl_do_mpt_command at ffffffffa006e8ee [mpt2sas]
#17 [ffff88186635bd30] _ctl_ioctl_main at ffffffffa0070159 [mpt2sas]
#18 [ffff88186635be30] _ctl_ioctl at ffffffffa0070725 [mpt2sas]
#19 [ffff88186635be60] vfs_ioctl at ffffffff8118fb72
#20 [ffff88186635bea0] do_vfs_ioctl at ffffffff8118fd14
#21 [ffff88186635bf30] sys_ioctl at ffffffff81190291
#22 [ffff88186635bf80] system_call_fastpath at ffffffff8100b0f2
RIP: 0000003d5f8dd847 RSP: 00007f4d34fda528 RFLAGS: 00003202
RAX: 0000000000000010 RBX: ffffffff8100b0f2 RCX: 0000003d5f8dd847
RDX: 00007f4ce0002120 RSI: 00000000c0484c14 RDI: 0000000000000047
Since it is _spin_lock_irq blocking that tripped the nmi watchdog, look at the actual value of the spinlock involved:
crash> dis -l wait_for_common
0xffffffff814f95e4 <wait_for_common+52>:        mov    %r13,%rdi
0xffffffff814f95e7 <wait_for_common+55>:        callq  0xffffffff814fb440 <_spin_lock_irq>
The argument of _spin_lock_irq is passed in rdi (per the x86_64 calling convention, arguments go left to right in rdi, rsi, ...), and rdi is not reused in the subsequent code, so the rdi of the final context still holds the argument: ffff881054610960 (the RDI value in the bt output above).
crash> l _spin_lock_irq
68 EXPORT_SYMBOL(_spin_lock_irqsave);
69 #endif
70
71 #ifndef _spin_lock_irq
72 void __lockfunc _spin_lock_irq(spinlock_t *lock)
73 {
74         __spin_lock_irq(lock);
75 }
76 EXPORT_SYMBOL(_spin_lock_irq);
77 #endif
crash> spinlock_t ffff881054610960
struct spinlock_t {
  raw_lock = {
    slock = 65537
  }
}
crash> eval 65537
hexadecimal: 10001
    decimal: 65537
      octal: 200001
     binary: 0000000000000000000000000000000000000000000000010000000000000001
So the spinlock's value is 0x00010001.
In this kernel version spinlocks are implemented as ticket spinlocks, which work roughly as follows:
The 4-byte lock is split into two halves:
Next (2 bytes) | Owner (2 bytes)
On x86, Next and Owner both start at 0.
To acquire the spinlock, increment the Next field, then compare the pre-increment Next (the ticket) with Owner. If they are equal, the lock is acquired; if not, nop and loop re-reading Owner until it equals the ticket.
To release the spinlock, increment the Owner field. A process spinning on the lock then sees its ticket match Owner and takes the lock; when several processes are waiting, the one that started waiting first gets the lock first. This mechanism fixes the "unfairness" of the older spinlock implementation.
In the initial state Next = Owner = 0, so the first process to try simply gets the lock.
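As a minimal user-space model of the protocol just described (an illustration only; the helper names are invented, and the real implementation is the inline assembly shown further below):

#include <stdint.h>

/* Toy ticket spinlock: high 16 bits of slock are Next, low 16 bits are Owner. */
struct ticket_lock { volatile uint32_t slock; };

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket: Next += 1; my ticket is the old Next (like xaddl $0x10000). */
	uint32_t old = __sync_fetch_and_add(&l->slock, 0x10000);
	uint16_t ticket = (uint16_t)(old >> 16);
	/* Spin until Owner reaches our ticket. */
	while ((uint16_t)(l->slock & 0xffff) != ticket)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Hand the lock to the next ticket holder: Owner += 1. */
	__sync_fetch_and_add(&l->slock, 1);
}

Note that an acquirer bumps Next before it starts spinning, so the value a stuck waiter leaves in memory already includes its own ticket. Keep that in mind when reading the vmcore value below.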
Now back to the lock value in this crash, 0x00010001: Next and Owner are both 1, as if the lock had been locked and unlocked exactly once. That may look puzzling: with Next equal to Owner, why can the lock not be acquired? Look at the exact faulting line:
crash> dis -l ffffffff814fb465
/usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/arch/x86/include/asm/spinlock.h: 127
0xffffffff814fb465 <_spin_lock_irq+37>: movzwl (%rdi),%edx
crash> l /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/arch/x86/include/asm/spinlock.h: 127
122 static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
123 {
124         int inc;
125         int tmp;
126
127         asm volatile("1:\t\n"
128                      "mov $0x10000, %0\n\t"
129                      LOCK_PREFIX "xaddl %0, %1\n"
130                      "movzwl %w0, %2\n\t"
131                      "shrl $16, %0\n\t"
So execution has already reached the inline assembly: the xaddl has run and Next has been incremented. That means before this _spin_lock_irq started, the lock had Next = 0 and Owner = 1. Whoever releases the lock later only ever increments Owner, and Owner was already ahead of Next (impossible under correct spinlock usage), so however long we wait, the ticket can never equal Owner. The lock can never be taken: a genuine deadlock, which explains why the nmi watchdog fired here.
Why would this happen? How can Owner get ahead of Next? Two possibilities:
1. An extra unlock was performed somewhere.
2. The spinlock was modified concurrently, i.e. some other context modified it while it was still in use. A concrete example: on CPU 1, some context does spin_lock and has not yet reached spin_unlock; on CPU 2, another context re-initializes the same spinlock (resetting its value to 0); then CPU 1 executes its unlock. That unlock is now one too many, Owner gets an extra increment, and we end up in exactly this state.
Possibility 1 is extremely unlikely: spin_lock and spin_unlock calls are always paired, and the kernel has static checks and deadlock detection that would catch such a blunt error.
That leaves possibility 2 as the most likely cause. So look carefully at the context of the failure (the source lines are off by four, probably a slightly mismatched vmlinux, but it does no harm):
crash> dis -l ffffffff814f95ec
/usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/kernel/sched.c: 6228
0xffffffff814f95ec <wait_for_common+60>:        mov    (%r12),%eax
crash> l /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/kernel/sched.c: 6228
6223                         timeout = schedule_timeout(timeout);
6224                         spin_lock_irq(&x->wait.lock);
6225                 } while (!x->done && timeout);
6226                 __remove_wait_queue(&x->wait, &wait);
6227                 if (!x->done)
6228                         return timeout;
6229         }
6230         x->done--;
6231         return timeout ?: 1;
6232 }
The spinlock that _spin_lock_irq is taking is x->wait.lock.
Keep reading the code:
6270 unsigned long __sched
6271 wait_for_completion_timeout(struct completion *x, unsigned long timeout)
6272 {
6273         return wait_for_common(x, timeout, TASK_UNINTERRUPTIBLE);
6274 }
6275 EXPORT_SYMBOL(wait_for_completion_timeout);
So x is the completion structure passed into wait_for_completion_timeout. Its definition:
crash> completion
struct completion {
    unsigned int done;
    wait_queue_head_t wait;
}
SIZE: 32
crash> wait_queue_head_t
typedef struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
} wait_queue_head_t;
SIZE: 24
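As a quick cross-check against the registers in the bt output above (an inference from standard struct alignment, not something printed in the dump): done sits at offset 0, and wait should begin at offset 8, whose first member is the lock. crash can confirm the offsets:

crash> struct completion -o
struct completion {
   [0] unsigned int done;
   [8] wait_queue_head_t wait;
}
SIZE: 32

That is consistent with the registers: R12 = ffff881054610958 would be x itself, and RDI = ffff881054610960 = x + 8 is &x->wait.lock, the lock _spin_lock_irq is stuck on.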
And x is passed in from _ctl_do_mpt_command in the mpt2sas driver:
crash> dis -l ffffffffa006e8ee
/usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/drivers/scsi/mpt2sas/mpt2sas_ctl.c: 909
0xffffffffa006e8ee <_ctl_do_mpt_command+958>:   movzbl 0x3(%r14),%eax
crash> l /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/drivers/scsi/mpt2sas/mpt2sas_ctl.c: 909
904         else
905                 timeout = karg.timeout;
906         init_completion(&ioc->ctl_cmds.done);
907         timeleft = wait_for_completion_timeout(&ioc->ctl_cmds.done,
908             timeout*HZ);
909         if (mpi_request->Function == MPI2_FUNCTION_SCSI_TASK_MGMT) {
910                 Mpi2SCSITaskManagementRequest_t *tm_request =
911                     (Mpi2SCSITaskManagementRequest_t *)mpi_request;
912                 mpt2sas_scsih_clear_tm_flag(ioc, le16_to_cpu(
In that code, ioc is an MPT2SAS_ADAPTER structure, defined inside the mpt2sas module, so the module's debug symbols have to be loaded before its contents are visible:
crash> mod -s mpt2sas usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/kernel/drivers/scsi/mpt2sas/mpt2sas.ko.debug
MODULE           NAME     SIZE    OBJECT FILE
ffffffffa007a460 mpt2sas  173472  usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/kernel/drivers/scsi/mpt2sas/mpt2sas.ko.debug
crash> MPT2SAS_ADAPTER
struct MPT2SAS_ADAPTER {
    struct list_head list;
    struct Scsi_Host *shost;
    u8 id;
    u32 pci_irq;
    ...
    struct _internal_cmd ctl_cmds;
    ...
}
crash> _internal_cmd
struct _internal_cmd {
    struct mutex mutex;
    struct completion done;
    void *reply;
    void *sense;
    u16 status;
    u16 smid;
}
SIZE: 88
The next step is to analyze how this completion is used. After init_completion, wait_for_completion_timeout is called to wait on it, i.e. to wait for somewhere else to call complete() and wake this process. The logic looks fine, with one catch: if complete() can run before init_completion is called, there is trouble; if the other context happens to sit between its lock and its unlock at the moment init_completion rewrites the lock, that is precisely case 2 above.
Searching the mpt2sas driver for complete() calls on this completion turns up only two: mpt2sas_ctl_done() and mpt2sas_ctl_reset_handler(). The former runs when a SAS command finishes, which is exactly what _ctl_do_mpt_command is waiting for; the latter is only used during reset and is not relevant here.
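Put on a timeline, the suspected interleaving looks like this (a reconstruction, not something read directly from the vmcore):

CPU A (ioctl, _ctl_do_mpt_command)          CPU B (interrupt, mpt2sas_ctl_done)
----------------------------------          -----------------------------------
mpt2sas_base_put_smid_default()
  ... firmware completes very fast ...
                                            complete(&ioc->ctl_cmds.done):
                                              spin_lock_irqsave(&x->wait.lock)
                                                -> Next: 0 -> 1
init_completion(&ioc->ctl_cmds.done)
  -> re-initializes the lock: Next = Owner = 0
                                              spin_unlock_irqrestore(&x->wait.lock)
                                                -> Owner: 0 -> 1
wait_for_completion_timeout():
  spin_lock_irq(&x->wait.lock)
    -> Next: 0 -> 1, my ticket = 0
    -> spins until Owner == 0; but Owner == 1 and only ever grows: hard lockup
    -> the lock value is now 0x00010001, exactly what the vmcore shows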
Now look at the _ctl_do_mpt_command code again:
crash> l /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/drivers/scsi/mpt2sas/mpt2sas_ctl.c: 899
894                 mpt2sas_base_put_smid_default(ioc, smid);
895                 break;
896         }
897         default:
898                 mpt2sas_base_put_smid_default(ioc, smid);
899                 break;
900         }
901
902         if (karg.timeout < MPT2_IOCTL_DEFAULT_TIMEOUT)
903                 timeout = MPT2_IOCTL_DEFAULT_TIMEOUT;
crash> l /usr/src/debug/kernel-2.6.32-220.el6/linux-2.6.32-220.el6.x86_64/drivers/scsi/mpt2sas/mpt2sas_ctl.c: 909
904         else
905                 timeout = karg.timeout;
906         init_completion(&ioc->ctl_cmds.done);
907         timeleft = wait_for_completion_timeout(&ioc->ctl_cmds.done,
908             timeout*HZ);
909         if (mpi_request->Function == MPI2_FUNCTION_SCSI_TASK_MGMT) {
910                 Mpi2SCSITaskManagementRequest_t *tm_request =
911                     (Mpi2SCSITaskManagementRequest_t *)mpi_request;
912                 mpt2sas_scsih_clear_tm_flag(ioc, le16_to_cpu(
913                     tm_request->DevHandle));
mpt2sas_base_put_smid_default is called before init_completion(). Reading on, that function is what actually issues the SAS command to the controller; once the command completes, the mpt2sas_ctl_done() path can run. In other words, complete() may execute before init_completion(). Since mpt2sas_ctl_done() is asynchronous (interrupt driven), it may well run on another CPU, and if the command completes quickly enough, before init_completion() runs, exactly this problem results.
Clearly, the init_completion() here should be moved earlier.
Googling for a related patch turns one up, sure enough:
[PATCH 06/07] [SCSI] mpt2sas : Rearrange the the code so that the completion queues are initialized prior to sending the request to controller firmware

Rearrange the the code so that the completion queues are initialized prior
to sending the request to controller firmware.

Signed-off-by: Nagalakshmi Nandigama <nagalakshmi.nandigama@xxxxxxx>
---
diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.c b/drivers/scsi/mpt2sas/mpt2sas_base.c
index d0a36c9..e78733f 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_base.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_base.c
@@ -3200,8 +3200,8 @@ mpt2sas_base_sas_iounit_control(struct MPT2SAS_ADAPTER *ioc,
 	if (mpi_request->Operation == MPI2_SAS_OP_PHY_HARD_RESET ||
 	    mpi_request->Operation == MPI2_SAS_OP_PHY_LINK_RESET)
 		ioc->ioc_link_reset_in_progress = 1;
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->base_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,
 	    msecs_to_jiffies(10000));
 	if ((mpi_request->Operation == MPI2_SAS_OP_PHY_HARD_RESET ||
@@ -3302,8 +3302,8 @@ mpt2sas_base_scsi_enclosure_processor(struct MPT2SAS_ADAPTER *ioc,
 	request = mpt2sas_base_get_msg_frame(ioc, smid);
 	ioc->base_cmds.smid = smid;
 	memcpy(request, mpi_request, sizeof(Mpi2SepReply_t));
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->base_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,
 	    msecs_to_jiffies(10000));
 	if (!(ioc->base_cmds.status & MPT2_CMD_COMPLETE)) {
@@ -3810,8 +3810,8 @@ _base_event_notification(struct MPT2SAS_ADAPTER *ioc, int sleep_flag)
 	for (i = 0; i < MPI2_EVENT_NOTIFY_EVENTMASK_WORDS; i++)
 		mpi_request->EventMasks[i] =
 		    cpu_to_le32(ioc->event_masks[i]);
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->base_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->base_cmds.done, 30*HZ);
 	if (!(ioc->base_cmds.status & MPT2_CMD_COMPLETE)) {
 		printk(MPT2SAS_ERR_FMT "%s: timeout\n",
diff --git a/drivers/scsi/mpt2sas/mpt2sas_ctl.c b/drivers/scsi/mpt2sas/mpt2sas_ctl.c
index cffed28..cb8290b 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_ctl.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_ctl.c
@@ -819,6 +819,7 @@ _ctl_do_mpt_command(struct MPT2SAS_ADAPTER *ioc,
 	_ctl_display_some_debug(ioc, smid, "ctl_request", NULL);
 #endif
+	init_completion(&ioc->ctl_cmds.done);
 	switch (mpi_request->Function) {
 	case MPI2_FUNCTION_SCSI_IO_REQUEST:
 	case MPI2_FUNCTION_RAID_SCSI_IO_PASSTHROUGH:
@@ -904,7 +905,6 @@ _ctl_do_mpt_command(struct MPT2SAS_ADAPTER *ioc,
 		timeout = MPT2_IOCTL_DEFAULT_TIMEOUT;
 	else
 		timeout = karg.timeout;
-	init_completion(&ioc->ctl_cmds.done);
 	timeleft = wait_for_completion_timeout(&ioc->ctl_cmds.done,
 	    timeout*HZ);
 	if (mpi_request->Function == MPI2_FUNCTION_SCSI_TASK_MGMT) {
@@ -1478,8 +1478,8 @@ _ctl_diag_register_2(struct MPT2SAS_ADAPTER *ioc,
 		mpi_request->ProductSpecific[i] =
 			cpu_to_le32(ioc->product_specific[buffer_type][i]);
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->ctl_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->ctl_cmds.done,
 	    MPT2_IOCTL_DEFAULT_TIMEOUT*HZ);
@@ -1822,8 +1822,8 @@ _ctl_send_release(struct MPT2SAS_ADAPTER *ioc, u8 buffer_type, u8 *issue_reset)
 	mpi_request->VF_ID = 0; /* TODO */
 	mpi_request->VP_ID = 0;
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->ctl_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->ctl_cmds.done,
 	    MPT2_IOCTL_DEFAULT_TIMEOUT*HZ);
@@ -2096,8 +2096,8 @@ _ctl_diag_read_buffer(void __user *arg, enum block_state state)
 	mpi_request->VF_ID = 0; /* TODO */
 	mpi_request->VP_ID = 0;
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->ctl_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->ctl_cmds.done,
 	    MPT2_IOCTL_DEFAULT_TIMEOUT*HZ);
diff --git a/drivers/scsi/mpt2sas/mpt2sas_transport.c b/drivers/scsi/mpt2sas/mpt2sas_transport.c
index 322285c..d0750eb 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_transport.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_transport.c
@@ -398,8 +398,8 @@ _transport_expander_report_manufacture(struct MPT2SAS_ADAPTER *ioc,
 	dtransportprintk(ioc, printk(MPT2SAS_INFO_FMT "report_manufacture - "
 	    "send to sas_addr(0x%016llx)\n", ioc->name,
 	    (unsigned long long)sas_address));
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->transport_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->transport_cmds.done,
 	    10*HZ);
@@ -1186,8 +1186,8 @@ _transport_get_expander_phy_error_log(struct MPT2SAS_ADAPTER *ioc,
 	dtransportprintk(ioc, printk(MPT2SAS_INFO_FMT "phy_error_log - "
 	    "send to sas_addr(0x%016llx), phy(%d)\n", ioc->name,
 	    (unsigned long long)phy->identify.sas_address, phy->number));
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->transport_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->transport_cmds.done,
 	    10*HZ);
@@ -1511,8 +1511,9 @@ _transport_expander_phy_control(struct MPT2SAS_ADAPTER *ioc,
 	    "send to sas_addr(0x%016llx), phy(%d), opcode(%d)\n", ioc->name,
 	    (unsigned long long)phy->identify.sas_address, phy->number,
 	    phy_operation));
-	mpt2sas_base_put_smid_default(ioc, smid);
+
 	init_completion(&ioc->transport_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->transport_cmds.done,
 	    10*HZ);
@@ -1951,8 +1952,8 @@ _transport_smp_handler(struct Scsi_Host *shost, struct sas_rphy *rphy,
 	dtransportprintk(ioc, printk(MPT2SAS_INFO_FMT "%s - "
 	    "sending smp request\n", ioc->name, __func__));
-	mpt2sas_base_put_smid_default(ioc, smid);
 	init_completion(&ioc->transport_cmds.done);
+	mpt2sas_base_put_smid_default(ioc, smid);
 	timeleft = wait_for_completion_timeout(&ioc->transport_cmds.done,
 	    10*HZ);
With that, the problem is located.
Thinking it over, the essence of the bug is that access to the completion structure is unprotected: it can be touched concurrently from multiple CPUs, so in principle some mechanism (a lock, say) should guard it, yet there is none. The patch instead takes the "serialization" route, guaranteeing serial execution in this scenario. In theory that cannot rule out concurrency in other scenarios; perhaps the mpt2sas driver simply has no other concurrent paths, but I have not dug into its internals and will not jump to conclusions.