Notes on resolving a corruption in the LIO driver

0. Environment

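Recently at work we ran into a problem that tormented us for more than half a month. Our storage servers run CentOS 7 with Linux kernel 3.10.0-229.el7.x86_64 and export block storage through QLogic FC HBA cards. The servers also provide a snapshot service that takes a snapshot every 15 seconds. When an initiator kept issuing heavy IO against a volume that had snapshots, long-running tests ended in a kernel crash.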

1. Symptoms

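With iSCSI under the same load, we saw no problem. But as soon as the storage pool was accessed through the QLogic HBA cards, the crash reproduced reliably after 6 to 10 hours of running. At the same time, dmesg showed many list_del corruption messages. Part of the log follows: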

<4>1 2017-11-20T16:13:18.725115+08:00 localhost kernel - - - list_del corruption, ffff881d52b36890->next is LIST_POISON1 (dead000000100100)
<4>1 2017-11-20T16:13:18.725148+08:00 localhost kernel - - - Modules linked in: target_core_file nfsd auth_rpcgss nfs_acl lockd sunrpc fuse arxcis(OF) ext4 mbcache jbd2 dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter tun bridge stp llc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid1 ses enclosure iscsi_target_mod(OF) tcm_qla2xxx(F) coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr mei_me lpc_ich mei mfd_core i2c_i801 shpchp wmi qla2xxx(F) target_core_iblock target_core_pscsi target_core_mod acpi_power_meter acpi_pad scsi_transport_fc
<4>1 2017-11-20T16:13:18.727391+08:00 localhost kernel - - -  scsi_tgt sg(F) PlxSvc_dbg(OF) Plx8000_NT_dbg(OF) Plx8000_DMA_dbg(OF) ipmi_watchdog(F) ipmi_poweroff(F) ipmi_si(F) ipmi_devintf(F) ipmi_msghandler(F) blktap(OF) uinput ip_tables xfs libcrc32c sd_mod crc_t10dif ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ixgbe ahci libahci igb mpt3sas libata crct10dif_pclmul mdio crct10dif_common crc32c_intel ptp raid_class pps_core scsi_transport_sas i2c_algo_bit i2c_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: arxcis]
<4>1 2017-11-20T16:13:18.727418+08:00 localhost kernel - - - CPU: 30 PID: 262 Comm: kworker/u66:1 Tainted: GF          O--------------   3.10.0-229.el7.x86_64+ #1

2. Analysis

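According to the kernel log, list_del tried to remove a node and found that the node had already been removed (its next pointer held LIST_POISON1), which the kernel reports as a list_del corruption. This list_del is called from the target_tmr_work work item in the target_core_mod driver. That work item is triggered when the client detects an IO timeout and sends a LUN Reset request.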

See spc4r11 (SPC-4 revision 11):

5.6.10.4.2 Failed persistent reservation preempt

If the preempting I_T nexus’ PREEMPT service action or PREEMPT AND ABORT service action fails (e.g., repeated TASK SET FULL status, repeated BUSY status, SCSI transport protocol time-out, or time-out due to the task set being blocked due to failed initiator port or failed SCSI initiator device), the application client may send a LOGICAL UNIT RESET task management function to the failing logical unit to remove blocking tasks and then reissue the preempting service action.

And from the SCSI Primary Commands specification:

5.5.1 Reservations overview
Reservations may be used to allow a device server to execute commands from a selected set of initiator ports and reject commands from initiator ports outside the selected set of initiator ports. The device server uniquely identifies initiator ports using protocol specific mechanisms. Application clients may add or remove initiator ports from the selected set using reservation commands.

Note the following sentence in particular:

If the application clients do not cooperate in the reservation protocol, data may be unexpectedly modified and deadlock conditions may occur.

The scope of a reservation shall be one of the following:
a) Logical unit reservations - a logical unit reservation restricts access to the entire logical unit; and
b) Element reservations

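From the text above, it is not hard to see that a LUN Reset issued by the client requires the client to be compatible with the reservation protocol: the kernel driver and the application have to cooperate for reservations to work correctly. Otherwise data may be modified unexpectedly, leading to deadlock or other anomalies. The symptom in our case also looks like data being modified unexpectedly and corrupting kernel state, so we inferred that the reservation commands issued by the client were probably incompatible with the storage server side.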

3. Solution

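Based on the analysis and hypothesis above, one feasible workaround is to temporarily stop the storage server from acting on LUN Reset commands. We therefore modified the following code in the LIO target_core_mod driver: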

@@ -2870,7 +2870,10 @@ static void target_tmr_work(struct work_struct *work)
         tmr->response = TMR_TASK_MGMT_FUNCTION_NOT_SUPPORTED;
         break;
     case TMR_LUN_RESET:
-        ret = core_tmr_lun_reset(dev, tmr, NULL, NULL);
+        //ret = core_tmr_lun_reset(dev, tmr, NULL, NULL);
+        //tmr->response = (!ret) ? TMR_FUNCTION_COMPLETE :
+        //     TMR_FUNCTION_REJECTED;
+        ret = TMR_FUNCTION_REJECTED;
         tmr->response = (!ret) ? TMR_FUNCTION_COMPLETE :
                      TMR_FUNCTION_REJECTED;
         break;

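We then rebuilt target_core_mod.ko, replaced the old .ko with it, and ran a long test under the same load. The problem disappeared. Even after pushing the load to the maximum, it did not reproduce.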

4. Summary

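Kernel corruption or panics on the IO path are generally hard to analyze, but when one does show up, fear does not help. Start with dmesg or the kernel log to locate the faulty module or driver, work out where that module sits in the IO stack, and understand which command, request, or higher-level service it was executing. Then consult the relevant specifications and protocols, and get to the bottom of the principles, preconditions, and requirements behind them. That is how we deepen our understanding of the nature of a problem and widen the range of workable solutions.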
