Linux Kernel Debugging Techniques: Fault-injection

When developing kernel features or tracking down problems, we often need to simulate various kernel error scenarios, either to verify a program's robustness or to speed up reproduction of an issue, such as memory allocation failures, disk IO errors, and IO timeouts. The Linux kernel integrates a practical facility called "Fault-injection" that lets us inject faults and thereby construct a number of common kernel error scenarios. It can simulate slab allocation failures, page allocation failures, disk IO errors, disk IO timeouts, futex errors, and mmc-specific IO errors, and users can also build their own fault-injection types on top of the same mechanism. This article shows how to use Fault-injection to inject errors, focusing on memory allocation and disk IO, and then analyzes its implementation in detail.

Kernel version: Linux 4.11.y

Test environment: Rpi 3


 

Fault-injection Overview

Fault Injection Types

Fault-injection implements six fault-injection types by default: failslab, fail_page_alloc, fail_futex, fail_make_request, fail_io_timeout and fail_mmc_request. Their functions are as follows:

1) failslab

Injects slab allocator memory-allocation errors, mainly covering kmalloc(), kmem_cache_alloc() and so on.

2) fail_page_alloc

Injects memory page-allocation errors, mainly covering alloc_pages(), get_free_pages() and so on (lower level than failslab).

3) fail_futex

Injects futex deadlock and uaddr errors.

4) fail_make_request

Injects disk IO errors. It injects faults into the block core's generic_make_request() function; injection can be applied to a specific disk or partition through the /sys/block/<device>/make-it-fail or /sys/block/<device>/<partition>/make-it-fail interface.

5) fail_io_timeout

Injects IO timeout errors. It injects faults into blk_complete_request(), the IO-completion function in the IO processing path, discarding the completion "notification". It only works for drivers that use the generic timeout handling path, such as the standard scsi flow.

6) fail_mmc_request

Injects mmc data errors, effective only for mmc devices. By making the mmc core return a data error, it exercises the mmc block driver's error-handling and retry logic; it is configured through the /sys/kernel/debug/mmcx/fail_mmc_request interface.

The six fault-injection types above are implemented in the kernel by default; users can also extend the core framework with their own types as needed, simply by following the existing pattern (see the sketch below). Here I have picked the most heavily used ones, failslab, fail_page_alloc, fail_make_request and fail_io_timeout, for detailed analysis; the other two work much the same way.
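As a rough illustration of that pattern, here is a minimal sketch of a custom injection point, modeled on the implementations analyzed later in this article. The names fail_my_feature and my_feature_do_io() are hypothetical; everything else uses the fault-injection helpers discussed below:

#include <linux/fault-inject.h>

static DECLARE_FAULT_ATTR(fail_my_feature);

/* accept "fail_my_feature=<interval>,<probability>,<space>,<times>" at boot */
static int __init setup_fail_my_feature(char *str)
{
	return setup_fault_attr(&fail_my_feature, str);
}
__setup("fail_my_feature=", setup_fail_my_feature);

/* create /sys/kernel/debug/fail_my_feature/ with the standard attribute files */
static int __init fail_my_feature_debugfs(void)
{
	struct dentry *dir = fault_create_debugfs_attr("fail_my_feature",
						NULL, &fail_my_feature);

	return PTR_ERR_OR_ZERO(dir);
}
late_initcall(fail_my_feature_debugfs);

/* the injection point: fail the operation whenever should_fail() fires */
int my_feature_do_io(size_t bytes)
{
	if (should_fail(&fail_my_feature, bytes))
		return -EIO;
	/* ... normal processing ... */
	return 0;
}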

 

Fault Injection debugfs Configuration

Fault-injection provides a kernel option that enables a debugfs control interface for starting, stopping and tuning fault injection. The main file interfaces are:

1) /sys/kernel/debug/fail*/probability:

Sets the probability of a fault occurring, as a percentage. If the minimum value of 1% is still too frequent, set this to 100 and use interval to tune how often faults trigger. Default: 0.

2) /sys/kernel/debug/fail*/interval:

Sets the interval between faults; to use it, set a value greater than 1 and set probability to 100. Default: 1.

3) /sys/kernel/debug/fail*/times:

Sets the maximum number of faults to inject; beyond this count no more faults occur. Set to -1 for no limit. Default: 1.

4) /sys/kernel/debug/fail*/space:

Sets a size budget for faults. Each time an injection point is reached, the operation's size is subtracted from the budget, and a fault is only injected once the budget is exhausted. What size means differs per fault type: for IO faults it is the byte count of the IO, for memory allocation it is the allocation size. Default: 0.

5) /sys/kernel/debug/fail*/verbose:

Format: { 0 | 1 | 2 }

Sets how kernel log messages are emitted when a fault triggers. 0 produces no log output; 1 prints the basic message starting with "FAULT_INJECTION", including the fault type, interval, probability and so on; 2 additionally prints a backtrace (very useful for pinpointing the affected code path). Default: 2.

6) /sys/kernel/debug/fail*/verbose_ratelimit_interval_ms, /sys/kernel/debug/fail*/verbose_ratelimit_burst:

Control the interval and burst parameters of the log-output ratelimit and can be used to tune how often messages are printed; if faults trigger too frequently, some output is dropped. Defaults: 0 and 10 respectively.

7) /sys/kernel/debug/fail*/task-filter:

Format: { 'Y' | 'N' }

Enables process filtering. N disables filtering; Y restricts injection to processes that have make-it-fail enabled (set via /proc/<pid>/make-it-fail=1) and never injects in interrupt context. Default: N.

8) /sys/kernel/debug/fail*/require-start, /sys/kernel/debug/fail*/require-end, /sys/kernel/debug/fail*/reject-start, /sys/kernel/debug/fail*/reject-end:

Set virtual-address-space filtering on the call chain. A fault is only injected when a code (text) address in the call chain falls inside require-start~require-end and outside reject-start~reject-end; this can restrict injection to one or more specific modules. Defaults: the required range is [0, ULONG_MAX) (the entire virtual address space), the rejected range is [0, 0).

9) /sys/kernel/debug/fail*/stacktrace-depth:

Sets the call-stack depth searched for the [require-start, require-end) and [reject-start, reject-end) checks. Default: 32.

10) /sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem:

Format: { 'Y' | 'N' }

Enables highmem filtering for page allocation; when set to Y, allocations whose gfp flags include __GFP_HIGHMEM are not injected. Default: N.

11) /sys/kernel/debug/failslab/ignore-gfp-wait, /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait:

Format: { 'Y' | 'N' }

Filters memory allocations by allocation mode; when set to Y, faults are only injected into non-sleeping allocations (GFP_ATOMIC). Default: N.

12) /sys/kernel/debug/fail_page_alloc/min-order:

Sets an order filter for page allocation; allocations below this order are not injected. Default: 1.
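Combining several of these knobs, a typical hand-configured session might look like the sketch below (the values are illustrative). Here roughly one in ten qualifying calls fails, there is no cap on the number of faults, and task-filter plus /proc/self/make-it-fail confines injection to the current shell and its children:

[root@centos-rpi3 failslab]# echo 10 > probability
[root@centos-rpi3 failslab]# echo 1 > interval
[root@centos-rpi3 failslab]# echo -1 > times
[root@centos-rpi3 failslab]# echo Y > task-filter
[root@centos-rpi3 failslab]# echo 1 > /proc/self/make-it-fail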

 

Fault Injection Boot Parameter Configuration

The debugfs interfaces above are only available once debugfs is enabled. For the kernel boot phase, or for kernels built without the debugfs configuration option, Fault-injection's default configuration values can be passed via boot parameters, which include the following:

failslab=
fail_page_alloc=
fail_make_request=
fail_futex=
mmc_core.fail_request=<interval>,<probability>,<space>,<times>

Only a limited set of values can be passed this way: currently just the four parameters interval, probability, space and times are accepted (the kernel leaves all other parameters at their defaults), but in most cases this is enough.

For example, to enable failslab at 100% with no limit right from kernel boot, pass the boot parameter:

failslab=1,100,0,-1
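On the Raspberry Pi used here, that would mean appending the parameter to the kernel command line; assuming a stock Rpi image where the command line lives in the boot partition's cmdline.txt (existing parameters elided):

# /boot/cmdline.txt is a single line; append to the end of it
... rootwait failslab=1,100,0,-1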


 

Using Fault-injection

Configuring Kernel Options

The Fault-injection feature involves the following kernel configuration options, one per injection type, to be enabled as needed:

CONFIG_FAULT_INJECTION: master switch for the feature
CONFIG_FAILSLAB: failslab fault injection
CONFIG_FAIL_PAGE_ALLOC: fail_page_alloc fault injection
CONFIG_FAIL_MAKE_REQUEST: fail_make_request fault injection
CONFIG_FAIL_IO_TIMEOUT: fail_io_timeout fault injection
CONFIG_FAIL_MMC_REQUEST: fail_mmc_request fault injection
CONFIG_FAIL_FUTEX: fail_futex fault injection
CONFIG_FAULT_INJECTION_DEBUG_FS: enables the debugfs interface

This article only covers the four memory and IO related fault-injection types, so the five options CONFIG_FAULT_INJECTION, CONFIG_FAILSLAB, CONFIG_FAIL_PAGE_ALLOC, CONFIG_FAIL_IO_TIMEOUT and CONFIG_FAIL_MAKE_REQUEST need to be enabled. For convenience, also enable CONFIG_FAULT_INJECTION_DEBUG_FS to get dynamic configuration via debugfs, then rebuild and install the kernel.


 

Using fail_make_request

Entering the debugfs mount point, the following directories now appear:

[root@centos-rpi3 debug]# ls | grep fail 
fail_futex
fail_io_timeout
fail_make_request
fail_page_alloc
failslab

As the names suggest, each directory configures one fault-injection type. The fail_make_request directory contains the following configuration parameters:

[root@centos-rpi3 fail_make_request]# ls
interval     reject-start   space             times                    verbose_ratelimit_interval_ms
probability  require-end    stacktrace-depth  verbose
reject-end   require-start  task-filter       verbose_ratelimit_burst

These parameters were described above; here we demonstrate triggering make request errors at 100% with no upper limit:

[root@centos-rpi3 fail_make_request]# echo 1 > interval 
[root@centos-rpi3 fail_make_request]# echo -1 > times 
[root@centos-rpi3 fail_make_request]# echo 100 > probability

The trigger probability is set to 100% with no cap on the count, and the remaining parameters can be left at their defaults, which completes the fail_make_request side of the configuration. Next, injection has to be switched on per device:

Every disk block device and each of its partitions exposes a make-it-fail file in its sysfs directory; for example, under sda and mmcblk on my Raspberry Pi:

[root@centos-rpi3 block]# find -name make-it-fail
./sda/sda2/make-it-fail
./sda/make-it-fail
./sda/sda1/make-it-fail

[root@centos-rpi3 mmcblk1]# find -name make-it-fail
./mmcblk1p3/make-it-fail
./make-it-fail
./mmcblk1p1/make-it-fail
./mmcblk1p4/make-it-fail
./mmcblk1p2/make-it-fail

This make-it-fail file is the fault-injection switch for the corresponding block device; once 1 is written to it, injection is formally enabled for that device:

[root@centos-rpi3 sda]# echo 1 > make-it-fail   
[root@centos-rpi3 sda]# dd if=/dev/zero of=/dev/sda2 bs=4k count=1 oflag=direct

[13744.902281] FAULT_INJECTION: forcing a failure.
[13744.902281] name fail_make_request, interval 1, probability 100, space 0, times -1

[13744.922972] CPU: 2 PID: 1649 Comm: dd Not tainted 4.11.0-v7+ #1
[13744.933280] Hardware name: BCM2835
[13744.941091] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[13744.957492] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[13744.973606] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[13744.989915] [<80490414>] (should_fail) from [<80433310>] (should_fail_request+0x28/0x30)
[13745.006664] [<80433310>] (should_fail_request) from [<804334ac>] (generic_make_request_checks+0xe4/0x668)
[13745.025003] [<804334ac>] (generic_make_request_checks) from [<80435868>] (generic_make_request+0x20/0x228)
[13745.043590] [<80435868>] (generic_make_request) from [<80435b18>] (submit_bio+0xa8/0x194)
[13745.060677] [<80435b18>] (submit_bio) from [<802b7cac>] (__blkdev_direct_IO_simple+0x158/0x2e0)
[13745.078294] [<802b7cac>] (__blkdev_direct_IO_simple) from [<802b8224>] (blkdev_direct_IO+0x3c4/0x400)
[13745.096168] [<802b8224>] (blkdev_direct_IO) from [<8021520c>] (generic_file_direct_write+0xac/0x1c0)
[13745.113872] [<8021520c>] (generic_file_direct_write) from [<802153e0>] (__generic_file_write_iter+0xc0/0x204)
[13745.132466] [<802153e0>] (__generic_file_write_iter) from [<802b8e50>] (blkdev_write_iter+0xb0/0x130)
[13745.150240] [<802b8e50>] (blkdev_write_iter) from [<80279d2c>] (__vfs_write+0xd4/0x124)
[13745.166717] [<80279d2c>] (__vfs_write) from [<8027b6d4>] (vfs_write+0xb0/0x1c4)
[13745.182487] [<8027b6d4>] (vfs_write) from [<8027ba40>] (SyS_write+0x4c/0x98)
[13745.198050] [<8027ba40>] (SyS_write) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

As shown, the fail_make_request fault has been successfully injected for the sda2 device. Trying to mount the Ext4 filesystem on the device will now fail:

[root@centos-rpi3 sda]# mount /dev/sda1 /mnt/
mount: /dev/sda1: can't read superblock


 

Using fail_io_timeout

fail_io_timeout is used much like fail_make_request: the fail_io_timeout directory under the debugfs mount point contains the same configuration files, which we configure the same way:

[root@centos-rpi3 fail_io_timeout]# echo 1 > interval 
[root@centos-rpi3 fail_io_timeout]# echo -1 > times
[root@centos-rpi3 fail_io_timeout]# echo 100 > probability

After configuration it must likewise be enabled on the block device, through the /sys/block/sdx/io-timeout-fail interface. Note that this fault can only be injected on a whole disk device (struct gendisk), not on a partition.

[root@centos-rpi3 sda]# echo 1 > io-timeout-fail 
[root@centos-rpi3 sda]# dd if=/dev/zero of=/dev/sda2 bs=4k count=1 oflag=direct

[15198.056490] FAULT_INJECTION: forcing a failure.
[15198.056490] name fail_io_timeout, interval 1, probability 100, space 0, times -1
[15198.081768] CPU: 0 PID: 1405 Comm: usb-storage Not tainted 4.11.0-v7+ #1
[15198.097541] Hardware name: BCM2835
[15198.105454] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[15198.122090] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[15198.138443] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[15198.155013] [<80490414>] (should_fail) from [<8043edbc>] (blk_should_fake_timeout+0x30/0x38)
[15198.172426] [<8043edbc>] (blk_should_fake_timeout) from [<8043ed68>] (blk_complete_request+0x20/0x44)
[15198.190450] [<8043ed68>] (blk_complete_request) from [<8053e2c4>] (scsi_done+0x24/0x98)
[15198.207148] [<8053e2c4>] (scsi_done) from [<805a5b34>] (usb_stor_control_thread+0x130/0x28c)
[15198.224387] [<805a5b34>] (usb_stor_control_thread) from [<8013c0b0>] (kthread+0x12c/0x168)
[15198.241451] [<8013c0b0>] (kthread) from [<80108268>] (ret_from_fork+0x14/0x2c)

Because the completion callback is discarded, ordinary IO will time out and be retried, while the dd command, waiting in the synchronous call path below for a barrier command to return, enters the D state and triggers a hung task:

[15235.156646] INFO: task dd:1738 blocked for more than 120 seconds.
[15235.167371]       Not tainted 4.11.0-v7+ #1
[15235.176039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15235.393326] dd              D    0  1738   1371 0x00000000
[15235.403207] [<80723ccc>] (__schedule) from [<8072447c>] (schedule+0x44/0xa8)
[15235.418950] [<8072447c>] (schedule) from [<80727bdc>] (schedule_timeout+0x1f8/0x338)
[15235.435528] [<80727bdc>] (schedule_timeout) from [<80724f44>] (wait_for_common+0xe8/0x190)
[15235.452572] [<80724f44>] (wait_for_common) from [<8072500c>] (wait_for_completion+0x20/0x24)
[15235.469657] [<8072500c>] (wait_for_completion) from [<80134cac>] (flush_work+0x11c/0x1a0)
[15235.486675] [<80134cac>] (flush_work) from [<801369a0>] (__cancel_work_timer+0x138/0x208)
[15235.503478] [<801369a0>] (__cancel_work_timer) from [<80136a8c>] (cancel_delayed_work_sync+0x1c/0x20)
[15235.521301] [<80136a8c>] (cancel_delayed_work_sync) from [<8044ad58>] (disk_block_events+0x74/0x78)
[15235.538733] [<8044ad58>] (disk_block_events) from [<802b9838>] (__blkdev_get+0x108/0x430)
[15235.555331] [<802b9838>] (__blkdev_get) from [<802b9cfc>] (blkdev_get+0x19c/0x310)
[15235.571404] [<802b9cfc>] (blkdev_get) from [<802ba40c>] (blkdev_open+0x7c/0x88)
[15235.587364] [<802ba40c>] (blkdev_open) from [<80276f08>] (do_dentry_open+0x100/0x30c)
[15235.603973] [<80276f08>] (do_dentry_open) from [<80278530>] (vfs_open+0x60/0x8c)
[15235.620265] [<80278530>] (vfs_open) from [<8028929c>] (path_openat+0x410/0xef4)
[15235.636666] [<8028929c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[15235.653430] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[15235.670269] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[15235.686748] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

Touching a file on the mounted ext4 filesystem then produces the following hung task:

[ 1964.356864] INFO: task touch:1363 blocked for more than 120 seconds.
[ 1964.367450]       Not tainted 4.11.0-v7+ #1
[ 1964.375846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1964.392079] touch           D    0  1363   1078 0x00000000
[ 1964.401834] [<80723ccc>] (__schedule) from [<8072447c>] (schedule+0x44/0xa8)
[ 1964.417429] [<8072447c>] (schedule) from [<80149104>] (io_schedule+0x20/0x40)
[ 1964.432969] [<80149104>] (io_schedule) from [<80724998>] (bit_wait_io+0x1c/0x64)
[ 1964.448858] [<80724998>] (bit_wait_io) from [<80724d08>] (__wait_on_bit+0x94/0xcc)
[ 1964.465131] [<80724d08>] (__wait_on_bit) from [<80724e50>] (out_of_line_wait_on_bit+0x78/0x84)
[ 1964.482734] [<80724e50>] (out_of_line_wait_on_bit) from [<802b1dcc>] (__wait_on_buffer+0x3c/0x44)
[ 1964.500710] [<802b1dcc>] (__wait_on_buffer) from [<80308a44>] (ext4_read_inode_bitmap+0x6b0/0x758)
[ 1964.518775] [<80308a44>] (ext4_read_inode_bitmap) from [<803095b0>] (__ext4_new_inode+0x470/0x15dc)
[ 1964.536764] [<803095b0>] (__ext4_new_inode) from [<8031c064>] (ext4_create+0xb0/0x178)
[ 1964.553559] [<8031c064>] (ext4_create) from [<8028996c>] (path_openat+0xae0/0xef4)
[ 1964.570016] [<8028996c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[ 1964.586650] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[ 1964.603310] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[ 1964.619605] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

Once fault injection is switched off, the hung tasks recover.
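For reference, turning the injection off is simply the reverse of enabling it, for example:

[root@centos-rpi3 fail_io_timeout]# echo 0 > probability
[root@centos-rpi3 sda]# echo 0 > io-timeout-fail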

 

Using failslab for Memory Allocation

The failslab directory under the debugfs mount point has the same set of common configuration files, plus two of its own: ignore-gfp-wait and cache-filter. The former is a switch that filters out __GFP_RECLAIM allocations; the latter restricts injection to the slab caches the user selects, so that enabling failslab does not immediately take down the whole system with errors.

[root@centos-rpi3 failslab]# ls
cache-filter     probability   require-end    stacktrace-depth  verbose
ignore-gfp-wait  reject-end    require-start  task-filter       verbose_ratelimit_burst
interval         reject-start  space          times             verbose_ratelimit_interval_ms

The kmem_cache types to target can be configured via /sys/kernel/slab/xxx/failslab:

[root@centos-rpi3 slab]# ls /sys/kernel/slab
:at-0000016   :t-0001024             dentry                   inotify_inode_mark    nsproxy
:at-0000024   :t-0001536             dio                      ip4-frags             pid
:at-0000032   :t-0002048             discard_cmd              ip_dst_cache          pid_namespace
:at-0000040   :t-0003072             discard_entry            ip_fib_alias          pool_workqueue
:at-0000048   :t-0004032             dmaengine-unmap-2        ip_fib_trie           posix_timers_cache
:at-0000064   :t-0004096             dnotify_mark             ip_mrt_cache          proc_inode_cache
:at-0000072   :t-0008192             dnotify_struct           jbd2_inode            radix_tree_node
:at-0000104   :tA-0000032            dquot                    jbd2_journal_handle   request_queue
:at-0000112   :tA-0000064            eventpoll_epi            jbd2_journal_head     request_sock_TCP
:at-0000184   :tA-0000088            eventpoll_pwq            jbd2_revoke_record_s  rpc_buffers
:at-0000192   :tA-0000128            ext4_allocation_context  jbd2_revoke_table_s   rpc_inode_cache
:atA-0000136  :tA-0000256            ext4_extent_status       jbd2_transaction_s    rpc_tasks
:atA-0000528  :tA-0000448            ext4_free_data           kernfs_node_cache     scsi_data_buffer
:t-0000024    :tA-0000704            ext4_groupinfo_4k        key_jar               scsi_sense_cache
:t-0000032    :tA-0003776            ext4_inode_cache         kioctx                sd_ext_cdb
:t-0000040    PING                   ext4_io_end              kmalloc-1024          secpath_cache
:t-0000048    RAW                    ext4_prealloc_space      kmalloc-128           sgpool-128
:t-0000056    TCP                    ext4_system_zone         kmalloc-192           sgpool-16
:t-0000064    UDP                    f2fs_extent_node         kmalloc-2048          sgpool-32
:t-0000080    UDP-Lite               f2fs_extent_tree         kmalloc-256           sgpool-64
:t-0000088    UNIX                   f2fs_ino_entry           kmalloc-4096          sgpool-8
:t-0000112    aio_kiocb              f2fs_inode_cache         kmalloc-512           shmem_inode_cache
:t-0000120    anon_vma               f2fs_inode_entry         kmalloc-64            sighand_cache
:t-0000128    anon_vma_chain         fanotify_event_info      kmalloc-8192          signal_cache
:t-0000144    bdev_cache             fasync_cache             kmem_cache            sigqueue
:t-0000152    bio-0                  fat_cache                kmem_cache_node       sit_entry_set
:t-0000176    bio-1                  fat_inode_cache          mbcache               skbuff_fclone_cache
:t-0000192    biovec-128             file_lock_cache          mm_struct             skbuff_head_cache
:t-0000208    biovec-16              file_lock_ctx            mnt_cache             sock_inode_cache
:t-0000256    biovec-256             files_cache              mqueue_inode_cache    task_delay_info
:t-0000320    biovec-64              filp                     names_cache           task_group
:t-0000328    blkdev_ioc             flow_cache               nat_entry             task_struct
:t-0000344    blkdev_requests        free_nid                 nat_entry_set         taskstats
:t-0000384    bsg_cmd                fs_cache                 net_namespace         tcp_bind_bucket
:t-0000448    buffer_head            fscache_cookie_jar       nfs_commit_data       trace_event_file
:t-0000512    cachefiles_object_jar  fsnotify_mark            nfs_direct_cache      tw_sock_TCP
:t-0000576    cfq_io_cq              ftrace_event_field       nfs_inode_cache       uid_cache
:t-0000704    cfq_queue              inet_peer_cache          nfs_page              user_namespace
:t-0000768    configfs_dir_cache     inmem_page_entry         nfs_read_data         vm_area_struct
:t-0000904    cred_jar               inode_cache              nfs_write_data        xfrm_dst_cache

Entries named :t-0000xxx mostly represent the fixed-size caches that back kmalloc() allocations, and quite a few other entries are symlinks pointing at them:

[root@centos-rpi3 slab]# ll kmalloc-1024
lrwxrwxrwx 1 root root 0 May 21 09:26 kmalloc-1024 -> :t-0001024

Below, ext4_inode_cache is used as the injection target:

[root@centos-rpi3 failslab]# echo -1 > times
[root@centos-rpi3 failslab]# echo 100 > probability
[root@centos-rpi3 failslab]# echo 1 > cache-filter
[root@centos-rpi3 failslab]# echo 1 > /sys/kernel/slab/ext4_inode_cache/failslab
[root@centos-rpi3 failslab]# echo N >  ignore-gfp-wait

Once enabled, creating files and similar operations on an ext4 filesystem print fault-injection messages like these:

[  157.633204] FAULT_INJECTION: forcing a failure.
[  157.633204] name failslab, interval 1, probability 100, space 0, times -1
[  157.659029] CPU: 1 PID: 379 Comm: in:imjournal Not tainted 4.11.0-v7+ #1
[  157.675420] Hardware name: BCM2835
[  157.683660] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[  157.701176] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[  157.718337] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[  157.735689] [<80490414>] (should_fail) from [<8026950c>] (should_failslab+0x60/0x8c)
[  157.753309] [<8026950c>] (should_failslab) from [<80266120>] (kmem_cache_alloc+0x44/0x230)
[  157.771433] [<80266120>] (kmem_cache_alloc) from [<80323390>] (ext4_alloc_inode+0x24/0x104)
[  157.789647] [<80323390>] (ext4_alloc_inode) from [<80296254>] (alloc_inode+0x2c/0xb0)
[  157.807213] [<80296254>] (alloc_inode) from [<80297d90>] (new_inode_pseudo+0x18/0x5c)
[  157.824793] [<80297d90>] (new_inode_pseudo) from [<80297df0>] (new_inode+0x1c/0x30)
[  157.842200] [<80297df0>] (new_inode) from [<803091d0>] (__ext4_new_inode+0x90/0x15dc)
[  157.859751] [<803091d0>] (__ext4_new_inode) from [<8031c064>] (ext4_create+0xb0/0x178)
[  157.877373] [<8031c064>] (ext4_create) from [<8028996c>] (path_openat+0xae0/0xef4)
[  157.894604] [<8028996c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[  157.911620] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[  157.928965] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[  157.945881] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

 

Using fail_page_alloc for Memory Allocation

The fail_page_alloc directory under the debugfs mount point again has the same common configuration files, plus three of its own: min-order, ignore-gfp-wait and ignore-gfp-highmem. ignore-gfp-wait is a switch that filters out __GFP_DIRECT_RECLAIM allocations, ignore-gfp-highmem is a switch that filters out __GFP_HIGHMEM highmem allocations, and min-order filters injection by minimum allocation order: only allocations of at least that order can have faults injected.

[root@centos-rpi3 fail_page_alloc]# ls
ignore-gfp-highmem  probability   require-start     times
ignore-gfp-wait     reject-end    space             verbose
interval            reject-start  stacktrace-depth  verbose_ratelimit_burst
min-order           require-end   task-filter       verbose_ratelimit_interval_ms

[root@centos-rpi3 fail_page_alloc]# echo 2 > times
[root@centos-rpi3 fail_page_alloc]# echo 100 > probability

[18950.321696] FAULT_INJECTION: forcing a failure.
[18950.321696] name fail_page_alloc, interval 1, probability 100, space 0, times -1
[18950.439402] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O    4.11.0-v7+ #1
[18950.516300] Hardware name: BCM2835
[18950.553611] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[18950.628930] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[18950.703954] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[18950.780256] [<80490414>] (should_fail) from [<8021d208>] (__alloc_pages_nodemask+0xc0/0xf88)
[18950.858001] [<8021d208>] (__alloc_pages_nodemask) from [<8021e210>] (page_frag_alloc+0x68/0x188)
[18950.937115] [<8021e210>] (page_frag_alloc) from [<80625774>] (__netdev_alloc_skb+0xb0/0x154)
[18951.016021] [<80625774>] (__netdev_alloc_skb) from [<8055fe0c>] (rx_submit+0x3c/0x20c)
[18951.094485] [<8055fe0c>] (rx_submit) from [<80560450>] (rx_complete+0x1e0/0x204)
[18951.172404] [<80560450>] (rx_complete) from [<80569aa4>] (__usb_hcd_giveback_urb+0x80/0x154)
[18951.251386] [<80569aa4>] (__usb_hcd_giveback_urb) from [<80569cc8>] (usb_hcd_giveback_urb+0x4c/0xf4)
[18951.331268] [<80569cc8>] (usb_hcd_giveback_urb) from [<805931e8>] (completion_tasklet_func+0x6c/0x98)
[18951.411890] [<805931e8>] (completion_tasklet_func) from [<805a1060>] (tasklet_callback+0x20/0x24)
[18951.493310] [<805a1060>] (tasklet_callback) from [<80122f74>] (tasklet_hi_action+0x74/0x108)
[18951.575189] [<80122f74>] (tasklet_hi_action) from [<8010162c>] (__do_softirq+0x134/0x3ac)
[18951.658448] [<8010162c>] (__do_softirq) from [<80122b44>] (irq_exit+0xf8/0x164)
[18951.741056] [<80122b44>] (irq_exit) from [<80175384>] (__handle_domain_irq+0x68/0xc0)
[18951.824219] [<80175384>] (__handle_domain_irq) from [<801014f0>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xb0)
[18951.909282] [<801014f0>] (bcm2836_arm_irqchip_handle_irq) from [<807293fc>] (__irq_svc+0x5c/0x7c)

Of course, memory-allocation faults are usually not something you want to take effect globally, yet there are not many filters on offer, so in practice you need to confine the effective scope (say, to a particular module or call site) and tune the probability yourself.
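A common way to confine the scope is the module-targeting script from Documentation/fault-injection/fault-injection.txt, adapted here for fail_page_alloc. This sketch assumes CONFIG_FAULT_INJECTION_STACKTRACE_FILTER is enabled and that $1 names a loaded module; it points require-start/require-end at the module's section addresses exported under /sys/module:

#!/bin/bash
# Usage: ./failmodule.sh <module-name>
FAILTYPE=fail_page_alloc
module=$1

echo Y > /sys/kernel/debug/$FAILTYPE/task-filter
echo 10 > /sys/kernel/debug/$FAILTYPE/probability
echo 100 > /sys/kernel/debug/$FAILTYPE/interval
echo -1 > /sys/kernel/debug/$FAILTYPE/times
echo 0 > /sys/kernel/debug/$FAILTYPE/space
echo 2 > /sys/kernel/debug/$FAILTYPE/verbose
echo N > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait
echo N > /sys/kernel/debug/$FAILTYPE/ignore-gfp-highmem
echo 0 > /sys/kernel/debug/$FAILTYPE/min-order
echo $(cat /sys/module/$module/sections/.text) > \
	/sys/kernel/debug/$FAILTYPE/require-start
echo $(cat /sys/module/$module/sections/.data) > \
	/sys/kernel/debug/$FAILTYPE/require-end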

 

Fault-injection Implementation

Core Data Structures

/*
 * For explanation of the elements of this struct, see
 * Documentation/fault-injection/fault-injection.txt
 */
struct fault_attr {
	unsigned long probability;
	unsigned long interval;
	atomic_t times;
	atomic_t space;
	unsigned long verbose;
	bool task_filter;
	unsigned long stacktrace_depth;
	unsigned long require_start;
	unsigned long require_end;
	unsigned long reject_start;
	unsigned long reject_end;

	unsigned long count;
	struct ratelimit_state ratelimit_state;
	struct dentry *dname;
};
This structure is at the heart of the fault-injection implementation, and most of its fields should look familiar :) since they map directly to the configuration files in debugfs. The last three fields are used for internal bookkeeping: count tracks how many times the injection point has been reached, ratelimit_state throttles log output, and dname is the dentry of the fault's debugfs directory, whose name identifies the fault type (fail_make_request, failslab, and so on). Below we follow the code to analyze the implementations of fail_make_request, fail_io_timeout, failslab and fail_page_alloc one by one.

fail_make_request

static DECLARE_FAULT_ATTR(fail_make_request);

static int __init setup_fail_make_request(char *str)
{
	return setup_fault_attr(&fail_make_request, str);
}
__setup("fail_make_request=", setup_fail_make_request);
The code first statically defines a struct fault_attr instance named fail_make_request to describe this fault-injection type; DECLARE_FAULT_ATTR is a macro:

#define FAULT_ATTR_INITIALIZER {					\
		.interval = 1,						\
		.times = ATOMIC_INIT(1),				\
		.require_end = ULONG_MAX,				\
		.stacktrace_depth = 32,					\
		.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,	\
		.verbose = 2,						\
		.dname = NULL,						\
	}

#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
Here fail_make_request's common fields are initialized to their defaults. The __setup macro shows that during kernel initialization the boot parameter "fail_make_request=xxx" is handled by the registered setup_fail_make_request(), which in turn calls the generic setup_fault_attr() to initialize the fail_make_request structure further.
/*
 * setup_fault_attr() is a helper function for various __setup handlers, so it
 * returns 0 on error, because that is what __setup handlers do.
 */
int setup_fault_attr(struct fault_attr *attr, char *str)
{
	unsigned long probability;
	unsigned long interval;
	int times;
	int space;

	/* "<interval>,<probability>,<space>,<times>" */
	if (sscanf(str, "%lu,%lu,%d,%d",
			&interval, &probability, &space, &times) < 4) {
		printk(KERN_WARNING
			"FAULT_INJECTION: failed to parse arguments\n");
		return 0;
	}

	attr->probability = probability;
	attr->interval = interval;
	atomic_set(&attr->times, times);
	atomic_set(&attr->space, space);

	return 1;
}
EXPORT_SYMBOL_GPL(setup_fault_attr);
As noted earlier, the boot-parameter configuration accepts only the four values interval, probability, space and times; setup_fault_attr() parses them and stores them in the fail_make_request structure. Now for the debugfs entry point:

static int __init fail_make_request_debugfs(void)
{
	struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
						NULL, &fail_make_request);

	return PTR_ERR_OR_ZERO(dir);
}

late_initcall(fail_make_request_debugfs);
This function also runs during kernel initialization; it creates an attribute directory named fail_make_request under debugfs, as follows:

struct dentry *fault_create_debugfs_attr(const char *name,
			struct dentry *parent, struct fault_attr *attr)
{
	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
	struct dentry *dir;

	dir = debugfs_create_dir(name, parent);
	if (!dir)
		return ERR_PTR(-ENOMEM);

	if (!debugfs_create_ul("probability", mode, dir, &attr->probability))
		goto fail;
	if (!debugfs_create_ul("interval", mode, dir, &attr->interval))
		goto fail;
	if (!debugfs_create_atomic_t("times", mode, dir, &attr->times))
		goto fail;
	if (!debugfs_create_atomic_t("space", mode, dir, &attr->space))
		goto fail;
	if (!debugfs_create_ul("verbose", mode, dir, &attr->verbose))
		goto fail;
	if (!debugfs_create_u32("verbose_ratelimit_interval_ms", mode, dir,
				&attr->ratelimit_state.interval))
		goto fail;
	if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir,
				&attr->ratelimit_state.burst))
		goto fail;
	if (!debugfs_create_bool("task-filter", mode, dir, &attr->task_filter))
		goto fail;

#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER

	if (!debugfs_create_stacktrace_depth("stacktrace-depth", mode, dir,
				&attr->stacktrace_depth))
		goto fail;
	if (!debugfs_create_ul("require-start", mode, dir,
				&attr->require_start))
		goto fail;
	if (!debugfs_create_ul("require-end", mode, dir, &attr->require_end))
		goto fail;
	if (!debugfs_create_ul("reject-start", mode, dir, &attr->reject_start))
		goto fail;
	if (!debugfs_create_ul("reject-end", mode, dir, &attr->reject_end))
		goto fail;

#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */

	attr->dname = dget(dir);
	return dir;
fail:
	debugfs_remove_recursive(dir);

	return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL_GPL(fault_create_debugfs_attr);
Since the parent passed in is NULL, the fail_make_request directory is created at the debugfs root, and the attribute files seen earlier (probability, interval, times and so on) are then created inside it; finally the directory's dentry is saved in the attr->dname field. Next, the per-block-device switch interface:

#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
#endif
...
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
		       struct device_attribute *attr, char *buf)
{
	struct hd_struct *p = dev_to_part(dev);

	return sprintf(buf, "%d\n", p->make_it_fail);
}

ssize_t part_fail_store(struct device *dev,
			struct device_attribute *attr,
			const char *buf, size_t count)
{
	struct hd_struct *p = dev_to_part(dev);
	int i;

	if (count > 0 && sscanf(buf, "%d", &i) > 0)
		p->make_it_fail = (i == 0) ? 0 : 1;

	return count;
}
#endif
This is a sysfs-based interface: when the user writes a non-zero value to /sys/block/sdx/make-it-fail, the make_it_fail field of the device's struct hd_struct is set to 1 and the switch is on; writing 0 clears the field and the switch is off.

With the configuration interfaces covered, on to the real question: how exactly does fail_make_request inject a fault, and how does it decide whether to inject one?

Start with the generic IO submission path: submit_bio()->generic_make_request()->generic_make_request_checks():

static noinline_for_stack bool
generic_make_request_checks(struct bio *bio)
{
	...

	part = bio->bi_bdev->bd_part;
	if (should_fail_request(part, bio->bi_iter.bi_size) ||
	    should_fail_request(&part_to_disk(part)->part0,
				bio->bi_iter.bi_size))
		goto end_io;

	...
}
In the IO submission path, generic_make_request_checks() calls should_fail_request() to make the injection decision. If it returns true (inject), IO submission stops there and the enclosing generic_make_request() returns the cookie BLK_QC_T_NONE; the fault has been injected.

static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
{
	return part->make_it_fail && should_fail(&fail_make_request, bytes);
}
Whether to inject comes down to should_fail_request(): returning true means inject, false means don't. Its first parameter is the disk device's hd_struct, the second is the byte count of this IO. Here the make_it_fail switch in hd_struct comes into play, ANDed with the result of the generic should_fail() function. should_fail() is the core of the whole injection decision; it evaluates the parameters configured in the struct fault_attr.

/*
 * This code is stolen from failmalloc-1.0
 * http://www.nongnu.org/failmalloc/
 */

bool should_fail(struct fault_attr *attr, ssize_t size)
{
	/* No need to check any other properties if the probability is 0 */
	if (attr->probability == 0)
		return false;

	if (attr->task_filter && !fail_task(attr, current))
		return false;

	if (atomic_read(&attr->times) == 0)
		return false;

	if (atomic_read(&attr->space) > size) {
		atomic_sub(size, &attr->space);
		return false;
	}

	if (attr->interval > 1) {
		attr->count++;
		if (attr->count % attr->interval)
			return false;
	}

	if (attr->probability <= prandom_u32() % 100)
		return false;

	if (!fail_stacktrace(attr))
		return false;

	fail_dump(attr);

	if (atomic_read(&attr->times) != -1)
		atomic_dec_not_zero(&attr->times);

	return true;
}
EXPORT_SYMBOL_GPL(should_fail);
1) If the configured probability is 0, do not inject;

2) If process filtering is enabled, fail_task() decides: if the current process does not have its make_it_fail flag set, or we are in interrupt context, do not inject;

static bool fail_task(struct fault_attr *attr, struct task_struct *task)
{
	return !in_interrupt() && task->make_it_fail;
}

3) If the injection count has already hit its limit, do not inject;

4) If the remaining size budget is larger than this IO's byte count, decrement the budget and do not inject;

5) If the configured interval is greater than 1, count the calls and do not inject until the interval comes around;

6) Apply the injection probability: prandom_u32() % 100 yields a random value below 100, which implements the configured percentage;

7) If the kernel option CONFIG_FAULT_INJECTION_STACKTRACE_FILTER is set, fail_stacktrace() checks the code addresses on the call stack against the configured "[require-start, require-end), [reject-start, reject-end)" ranges (this function shows how to grab the addresses on the call stack; these are rather useful utility functions, worth remembering):

#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER

static bool fail_stacktrace(struct fault_attr *attr)
{
	struct stack_trace trace;
	int depth = attr->stacktrace_depth;
	unsigned long entries[MAX_STACK_TRACE_DEPTH];
	int n;
	bool found = (attr->require_start == 0 && attr->require_end == ULONG_MAX);

	if (depth == 0)
		return found;

	trace.nr_entries = 0;
	trace.entries = entries;
	trace.max_entries = depth;
	trace.skip = 1;

	save_stack_trace(&trace);
	for (n = 0; n < trace.nr_entries; n++) {
		if (attr->reject_start <= entries[n] &&
			       entries[n] < attr->reject_end)
			return false;
		if (attr->require_start <= entries[n] &&
			       entries[n] < attr->require_end)
			found = true;
	}
	return found;
}

#else

static inline bool fail_stacktrace(struct fault_attr *attr)
{
	return true;
}

#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
First, to speed up the default case, the function immediately returns pass (injection allowed) when require_xxx and reject_xxx hold their default values. If the user set non-default values, save_stack_trace() walks up the call stack to the depth given by attr->stacktrace_depth (at most 32 levels), saving each frame's address in the trace.entries array; each address is then checked, first against [reject_start, reject_end), then against [require-start, require-end).

Back in should_fail(): once all of the above checks pass, the fault can be injected; the last thing to do before injecting is to print the log message:

static void fail_dump(struct fault_attr *attr)
{
	if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) {
		printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n"
		       "name %pd, interval %lu, probability %lu, "
		       "space %d, times %d\n", attr->dname,
		       attr->interval, attr->probability,
		       atomic_read(&attr->space),
		       atomic_read(&attr->times));
		if (attr->verbose > 1)
			dump_stack();
	}
}
These messages already appeared in the usage sections above. If verbose is 2, dump_stack() is also called to print the kernel call stack.

That completes the analysis of the fail_make_request fault type, including its core machinery; the remaining three types follow much the same pattern.

 


fail_io_timeout

static DECLARE_FAULT_ATTR(fail_io_timeout);

static int __init setup_fail_io_timeout(char *str)
{
	return setup_fault_attr(&fail_io_timeout, str);
}
__setup("fail_io_timeout=", setup_fail_io_timeout);

fail_io_timeout is likewise defined with the DECLARE_FAULT_ATTR macro, and its boot-parameter initialization is handled by setup_fail_io_timeout().

static int __init fail_io_timeout_debugfs(void)
{
	struct dentry *dir = fault_create_debugfs_attr("fail_io_timeout",
						NULL, &fail_io_timeout);

	return PTR_ERR_OR_ZERO(dir);
}

The debugfs interface is created in the debugfs root by fail_io_timeout_debugfs(). Both points are the same as for fail_make_request.

int blk_should_fake_timeout(struct request_queue *q)
{
	if (!test_bit(QUEUE_FLAG_FAIL_IO, &q->queue_flags))
		return 0;

	return should_fail(&fail_io_timeout, 1);
}

The injection-decision interface is blk_should_fake_timeout(). Before the should_fail() check, it tests the feature's switch, here the QUEUE_FLAG_FAIL_IO flag, which is set through the /sys/block/sdx/io-timeout-fail interface; the corresponding sysfs handlers are:

ssize_t part_timeout_show(struct device *dev, struct device_attribute *attr,
			  char *buf)
{
	struct gendisk *disk = dev_to_disk(dev);
	int set = test_bit(QUEUE_FLAG_FAIL_IO, &disk->queue->queue_flags);

	return sprintf(buf, "%d\n", set != 0);
}

ssize_t part_timeout_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t count)
{
	struct gendisk *disk = dev_to_disk(dev);
	int val;

	if (count) {
		struct request_queue *q = disk->queue;
		char *p = (char *) buf;

		val = simple_strtoul(p, &p, 10);
		spin_lock_irq(q->queue_lock);
		if (val)
			queue_flag_set(QUEUE_FLAG_FAIL_IO, q);
		else
			queue_flag_clear(QUEUE_FLAG_FAIL_IO, q);
		spin_unlock_irq(q->queue_lock);
	}

	return count;
}

When the user writes a non-zero value, part_timeout_store() sets QUEUE_FLAG_FAIL_IO in the queue_flags of the disk's struct request_queue, turning on that disk's fail_io_timeout injection switch; otherwise the flag is cleared (switch off).

blk_should_fake_timeout() is called (i.e. faults are injected) in two places:

1) blk_complete_request

void blk_complete_request(struct request *req)
{
	if (unlikely(blk_should_fake_timeout(req->q)))
		return;
	if (!blk_mark_rq_complete(req))
		__blk_complete_request(req);
}
EXPORT_SYMBOL(blk_complete_request);

This function is called back by the low-level hardware driver once an IO completes or fails. In the normal flow it calls __blk_complete_request(), which then raises a BLOCK_SOFTIRQ softirq:

void __blk_complete_request(struct request *req)
{
	...
do_local:
		if (list->next == &req->ipi_list)
			raise_softirq_irqoff(BLOCK_SOFTIRQ);

	...
}
static __latent_entropy void blk_done_softirq(struct softirq_action *h)
{
	...
	while (!list_empty(&local_list)) {
		struct request *rq;

		rq = list_entry(local_list.next, struct request, ipi_list);
		list_del_init(&rq->ipi_list);
		rq->q->softirq_done_fn(rq);
	}
}

The BLOCK_SOFTIRQ softirq is handled by blk_done_softirq(), which calls back whatever function is registered in the request_queue's softirq_done_fn pointer. For a SCSI device, for example, the subsequent flow is scsi_softirq_done()->scsi_finish_command()->scsi_io_completion()->scsi_end_request()->blk_update_request()->req_bio_endio()->bio_endio(), completing this IO and finally notifying the upper layers.

But if the fault is injected in blk_complete_request(), deliberately dropping the upward propagation of the completion callback, the request will time out. The call flow is:

blk_timeout_work()->blk_rq_check_expired()->blk_rq_timed_out()->scsi_times_out(), where the scsi driver performs the timeout handling and a background workqueue periodically retries the IO.

2) blk_mq_complete_request

This is the completion callback of the blk multi-queue layer; IO takes this branch when the kernel's multiqueue support is enabled.

scsi_mq_done()->blk_mq_complete_request()->__blk_mq_complete_request()->blk_mq_end_request()->blk_update_request()

 

failslab

The failslab control structure:

static struct {
	struct fault_attr attr;
	bool ignore_gfp_reclaim;
	bool cache_filter;
} failslab = {
	.attr = FAULT_ATTR_INITIALIZER,
	.ignore_gfp_reclaim = true,
	.cache_filter = false,
};

failslab wraps the generic struct fault_attr and defines two extra parameters, ignore_gfp_reclaim and cache_filter: the former is a switch that filters out __GFP_RECLAIM allocations, the latter restricts injection to the slab caches the user selects. These two parameters exist so users can target specific kmem_caches, and also so that enabling failslab does not instantly flood the whole system with errors and make further debugging impossible.

failslab's boot-parameter initialization interface:

static int __init setup_failslab(char *str)
{
	return setup_fault_attr(&failslab.attr, str);
}
__setup("failslab=", setup_failslab);

failslab's debugfs configuration interface:

static int __init failslab_debugfs_init(void)
{
	struct dentry *dir;
	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;

	dir = fault_create_debugfs_attr("failslab", NULL, &failslab.attr);
	if (IS_ERR(dir))
		return PTR_ERR(dir);

	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
				&failslab.ignore_gfp_reclaim))
		goto fail;
	if (!debugfs_create_bool("cache-filter", mode, dir,
				&failslab.cache_filter))
		goto fail;

	return 0;
fail:
	debugfs_remove_recursive(dir);

	return -ENOMEM;
}

When creating its debugfs directory and files, failslab creates two extra files, ignore-gfp-wait and cache-filter, corresponding to the ignore_gfp_reclaim and cache_filter parameters in the failslab structure.

The failslab cache_filter configuration interface:

#ifdef CONFIG_FAILSLAB
static ssize_t failslab_show(struct kmem_cache *s, char *buf)
{
	return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB));
}

static ssize_t failslab_store(struct kmem_cache *s, const char *buf,
							size_t length)
{
	if (s->refcount > 1)
		return -EINVAL;

	s->flags &= ~SLAB_FAILSLAB;
	if (buf[0] == '1')
		s->flags |= SLAB_FAILSLAB;
	return length;
}
SLAB_ATTR(failslab);
#endif

This interface lives at /sys/kernel/slab/xxx/failslab; writing a non-zero value sets the SLAB_FAILSLAB bit in the kmem_cache's flags (otherwise the bit is cleared). The SLAB_FAILSLAB flag is then checked in the injection-decision function should_failslab():

bool should_failslab(struct kmem_cache *s, gfp_t gfpflags)
{
	/* No fault-injection for bootstrap cache */
	if (unlikely(s == kmem_cache))
		return false;

	if (gfpflags & __GFP_NOFAIL)
		return false;

	if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
		return false;

	if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB))
		return false;

	return should_fail(&failslab.attr, s->object_size);
}

First, if ignore_gfp_reclaim is enabled, __GFP_RECLAIM allocations are automatically skipped; then, if the cache_filter is enabled, allocations from caches without the SLAB_FAILSLAB flag are skipped; finally should_fail() performs the generic checks.

Next, where does failslab actually inject the fault?

static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
						     gfp_t flags)
{
	flags &= gfp_allowed_mask;
	lockdep_trace_alloc(flags);
	might_sleep_if(gfpflags_allow_blocking(flags));

	if (should_failslab(s, flags))
		return NULL;

	if (memcg_kmem_enabled() &&
	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
		return memcg_kmem_get_cache(s);

	return s;
}

The normal allocation path is kmem_cache_alloc()->slab_alloc()->slab_alloc_node()->slab_pre_alloc_hook() (the kmalloc allocation path is similar). If injection is decided here, NULL is returned directly, meaning the allocation fails; the fault has been injected.

 

fail_page_alloc

fail_page_alloc injects at a lower level than failslab, directly in the buddy allocator. It too has its own structure:

static struct {
	struct fault_attr attr;

	bool ignore_gfp_highmem;
	bool ignore_gfp_reclaim;
	u32 min_order;
} fail_page_alloc = {
	.attr = FAULT_ATTR_INITIALIZER,
	.ignore_gfp_reclaim = true,
	.ignore_gfp_highmem = true,
	.min_order = 1,
};

On top of the generic struct fault_attr parameters it adds ignore_gfp_reclaim, ignore_gfp_highmem and min_order: the first is a switch that filters out __GFP_DIRECT_RECLAIM allocations, the second is a switch that filters out __GFP_HIGHMEM highmem allocations, and the last filters injection by minimum allocation order, so only allocations of at least that order can have faults injected.

fail_page_alloc's boot-parameter configuration interface:

static int __init setup_fail_page_alloc(char *str)
{
	return setup_fault_attr(&fail_page_alloc.attr, str);
}
__setup("fail_page_alloc=", setup_fail_page_alloc);

fail_page_alloc's debugfs configuration interface:

static int __init fail_page_alloc_debugfs(void)
{
	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
	struct dentry *dir;

	dir = fault_create_debugfs_attr("fail_page_alloc", NULL,
					&fail_page_alloc.attr);
	if (IS_ERR(dir))
		return PTR_ERR(dir);

	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
				&fail_page_alloc.ignore_gfp_reclaim))
		goto fail;
	if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
				&fail_page_alloc.ignore_gfp_highmem))
		goto fail;
	if (!debugfs_create_u32("min-order", mode, dir,
				&fail_page_alloc.min_order))
		goto fail;

	return 0;
fail:
	debugfs_remove_recursive(dir);

	return -ENOMEM;
}

Beyond the common files it creates three extra configuration files, ignore-gfp-wait, ignore-gfp-highmem and min-order, corresponding to the ignore_gfp_reclaim, ignore_gfp_highmem and min_order parameters of the fail_page_alloc structure.

Now for the injection-decision function:

static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
{
	if (order < fail_page_alloc.min_order)
		return false;
	if (gfp_mask & __GFP_NOFAIL)
		return false;
	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
		return false;
	if (fail_page_alloc.ignore_gfp_reclaim &&
			(gfp_mask & __GFP_DIRECT_RECLAIM))
		return false;

	return should_fail(&fail_page_alloc.attr, 1 << order);
}

1) First, if the order of this allocation is below the min_order filter value, do not inject;

2) If the allocation has the __GFP_NOFAIL flag set, do not inject either;

3) If the highmem switch is on, do not inject for highmem allocations with __GFP_HIGHMEM set;

4) If the ignore_gfp_reclaim switch is on, do not inject for allocations with __GFP_DIRECT_RECLAIM set;

5) Finally, run the generic should_fail() checks.

fail_page_alloc's injection point is in prepare_alloc_pages() in the page-allocation path:

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
		struct zonelist *zonelist, nodemask_t *nodemask,
		struct alloc_context *ac, gfp_t *alloc_mask,
		unsigned int *alloc_flags)
{
	ac->high_zoneidx = gfp_zone(gfp_mask);
	ac->zonelist = zonelist;
	ac->nodemask = nodemask;
	ac->migratetype = gfpflags_to_migratetype(gfp_mask);

	if (cpusets_enabled()) {
		*alloc_mask |= __GFP_HARDWALL;
		if (!ac->nodemask)
			ac->nodemask = &cpuset_current_mems_allowed;
		else
			*alloc_flags |= ALLOC_CPUSET;
	}

	lockdep_trace_alloc(gfp_mask);

	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

	if (should_fail_alloc_page(gfp_mask, order))
		return false;

	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
		*alloc_flags |= ALLOC_CMA;

	return true;
}

One normal page-allocation path is page_frag_alloc()->alloc_pages_node()->__alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->prepare_alloc_pages(). This function returns true when pages may be allocated; if the fault is injected it returns false, no pages can be allocated, and the caller has to handle the failure.

 

Summary

When writing and debugging kernel code, it is easy to consider only the normal execution paths and to leave the rarer error paths without effective handling; the resulting lack of robustness often ends, for one reason or another, in hangs, panics and other outcomes users do not want to see. This article introduced the kernel's four common fault-injection facilities for disk IO and memory allocation; during debugging, or when locating and reproducing a problem, they can be used to simulate failure scenarios and greatly improve the efficiency of software development and verification.

 

References:

Documentation/fault-injection/fault-injection.txt

