disk-io

http://linuxperf.com/?p=161
Contents
1. BLOCK-related tracepoints
  1.1 Events matched against VTune's Disk I/O sampling events
  1.2 Explanation of the sampled events
    1.2.1 block_bio_queue
    1.2.2 block_rq_insert
    1.2.3 sched_switch
2. Sampling experiments
  2.1 I/O time sampling
    2.1.1 What counts as I/O time
    2.1.2 Test command
    2.1.3 Result notes
  2.2 I/O Summary
    2.2.1 Calculation method
    2.2.2 Test command
    2.2.3 Results (read, write)
  2.3 I/O Depth
    2.3.1 The I/O depth concept
    2.3.2 Test command
    2.3.3 Results
  2.4 I/O data transfer
    2.4.1 Calculation method
    2.4.2 Test command
    2.4.3 Results
  2.5 I/O wait
  2.6 Thread
  2.7 Page fault
    2.7.1 x86
    2.7.2 ARM

  1. BLOCK-related tracepoints
    Tracepoint                  Description
    block:block_touch_buffer    mark a buffer accessed
    block:block_dirty_buffer    mark a buffer dirty
    block:block_rq_abort        abort a block IO request
    block:block_rq_requeue      place a block IO request back on the queue
    block:block_rq_complete     block IO operation completed by the device driver
    block:block_rq_insert       insert a block operation request into the queue
    block:block_rq_issue        issue a pending block IO request to the device driver (from queue to device driver)
    block:block_bio_bounce      a bounce buffer was used when processing the block operation
    block:block_bio_complete    all work on the block operation completed
    block:block_bio_backmerge   merge a block operation into the end of an existing operation
    block:block_bio_frontmerge  merge a block operation into the beginning of an existing operation
    block:block_bio_queue       put a new block IO operation in the queue
    block:block_getrq           get a free request entry in the queue for block IO operations
    block:block_sleeprq         wait for a free request entry in the queue for a block IO operation
    block:block_plug            hold operation requests in the request queue
    block:block_unplug          release operation requests from the request queue
    block:block_split           split a single bio struct into two bio structs
    block:block_bio_remap       map a request for a logical device to the raw device
    block:block_rq_remap        map a block operation request

Note:
TRACE_EVENT = DEFINE_EVENT + DECLARE_EVENT_CLASS

1.1 Events matched against VTune's Disk I/O sampling events

Note:
block_bio_complete: see above
block_bio_queue: see above
block_rq_insert: see above
block_rq_complete: see above
block_rq_issue: see above
page_fault_kernel
page_fault_user
sched_switch: tracepoint for task switches performed by the scheduler
tracing_mark_write

1.2 Explanation of the sampled events
1.2.1 block_bio_queue
1.2.1.1 Source definition:
TRACE_EVENT(block_bio_queue,
	TP_PROTO(struct request_queue *q, struct bio *bio),
	TP_ARGS(q, bio),
	TP_STRUCT__entry(
		__field( dev_t, dev )
		__field( sector_t, sector )
		__field( unsigned int, nr_sector )
		__array( char, rwbs, RWBS_LEN )
		__array( char, comm, TASK_COMM_LEN )
	),
	TP_fast_assign(
		__entry->dev = bio->bi_bdev->bd_dev;
		__entry->sector = bio->bi_iter.bi_sector;
		__entry->nr_sector = bio_sectors(bio);
		blk_fill_rwbs(__entry->rwbs, bio->bi_opf, bio->bi_iter.bi_size);
		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
	),
	TP_printk("%d,%d %s %llu + %u [%s]",
		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
		(unsigned long long)__entry->sector,
		__entry->nr_sector, __entry->comm)
);
1.2.1.2 Explanation of the output fields:
Command: perf record -e block:block_bio_queue -a sleep 10

kworker/u98:1 16697 [014] 2245745.995877: block:block_bio_queue: 8,0 W 7717467464 + 8 [kworker/u98:1]
Fields, in order: executable name (excluding path), PID, core number, timestamp, tracepoint, major device number, minor device number, operation flags (rwbs), sector address, sector count, executable name (excluding path).
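The field layout above can be checked with a small parser. This is a sketch only: the regular expression is an assumption based on the sample line, and real `perf script` output can differ slightly between perf versions.

```python
import re

# Parse one perf line for block:block_bio_queue into the fields listed
# above. The regex is an assumption modeled on the sample output.
LINE_RE = re.compile(
    r"(?P<comm>\S+)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>[\d.]+):\s+block:block_bio_queue:\s+"
    r"(?P<major>\d+),(?P<minor>\d+)\s+(?P<rwbs>\S+)\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<nr_sector>\d+)\s+\[(?P<task>[^\]]+)\]"
)

def parse_bio_queue(line: str) -> dict:
    m = LINE_RE.search(line)
    if not m:
        raise ValueError("not a block_bio_queue line: " + line)
    d = m.groupdict()
    for k in ("pid", "cpu", "major", "minor", "sector", "nr_sector"):
        d[k] = int(d[k])
    d["ts"] = float(d["ts"])
    return d

sample = ("kworker/u98:1 16697 [014] 2245745.995877: "
          "block:block_bio_queue: 8,0 W 7717467464 + 8 [kworker/u98:1]")
print(parse_bio_queue(sample))
```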

1.2.1.3 Viewing the output format
cat /sys/kernel/debug/tracing/events/block/block_bio_queue/format
name: block_bio_queue
ID: 771
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:dev_t dev;    offset:8;   size:4; signed:0;
field:sector_t sector;  offset:16;  size:8; signed:0;
field:unsigned int nr_sector;   offset:24;  size:4; signed:0;
field:__u8 rwbs[15];    offset:28;  size:8; signed:1;
field:__u8 comm[15];    offset:36;  size:16;    signed:1;

print fmt: "%d,%d %s %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, (unsigned long long)REC->sector, REC->nr_sector, REC->comm
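The print format decodes `dev_t` by bit arithmetic: the major number lives above bit 20 and the minor number in the low 20 bits. A quick sketch of the same arithmetic:

```python
# Decode a dev_t value the same way the print fmt above does:
# major in the high bits (>> 20), minor in the low 20 bits.
MINORBITS = 20
MINORMASK = (1 << MINORBITS) - 1

def dev_major(dev: int) -> int:
    return dev >> MINORBITS

def dev_minor(dev: int) -> int:
    return dev & MINORMASK

# /dev/sda is conventionally major 8, minor 0, so dev_t == 8 << 20.
dev = (8 << MINORBITS) | 0
print(dev_major(dev), dev_minor(dev))  # → 8 0
```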
1.2.2 block_rq_insert
1.2.2.1 Source definition
DEFINE_EVENT(block_rq, block_rq_insert,
	TP_PROTO(struct request_queue *q, struct request *rq),
	TP_ARGS(q, rq)
);

DECLARE_EVENT_CLASS(block_rq,
	TP_PROTO(struct request_queue *q, struct request *rq),
	TP_ARGS(q, rq),
	TP_STRUCT__entry(
		__field( dev_t, dev )
		__field( sector_t, sector )
		__field( unsigned int, nr_sector )
		__field( unsigned int, bytes )
		__array( char, rwbs, RWBS_LEN )
		__array( char, comm, TASK_COMM_LEN )
		__dynamic_array( char, cmd, blk_cmd_buf_len(rq) )
	),
	TP_fast_assign(
		__entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
		__entry->sector = (rq->cmd_type == REQ_TYPE_BLOCK_PC) ? 0 : blk_rq_pos(rq);
		__entry->nr_sector = (rq->cmd_type == REQ_TYPE_BLOCK_PC) ? 0 : blk_rq_sectors(rq);
		__entry->bytes = (rq->cmd_type == REQ_TYPE_BLOCK_PC) ? blk_rq_bytes(rq) : 0;
		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags, blk_rq_bytes(rq));
		blk_dump_cmd(__get_str(cmd), rq);
		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
	),
	TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
		MAJOR(__entry->dev), MINOR(__entry->dev),
		__entry->rwbs, __entry->bytes, __get_str(cmd),
		(unsigned long long)__entry->sector,
		__entry->nr_sector, __entry->comm)
);
name: block_rq_insert
ID: 777
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:dev_t dev;    offset:8;   size:4; signed:0;
field:sector_t sector;  offset:16;  size:8; signed:0;
field:unsigned int nr_sector;   offset:24;  size:4; signed:0;
field:unsigned int bytes;   offset:28;  size:4; signed:0;
field:__u8 rwbs[15];    offset:32;  size:8; signed:1;
field:__u8 comm[15];    offset:40;  size:16;    signed:1;
field:__data_loc char[] cmd;    offset:56;  size:4; signed:1;

print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd), (unsigned long long)REC->sector, REC->nr_sector, REC->comm
1.2.2.2 Sample output
dd 26317 [030] 2415313.166505: block:block_rq_insert: 8,0 R 0 () 1051648 + 1024 [dd]
Fields, in order: executable name (excluding path), PID, core number, timestamp, tracepoint, major device number, minor device number, operation flags (rwbs), byte count, command string in parentheses, sector address, sector count, executable name (excluding path).

1.2.2.3 Viewing the output format
cat /sys/kernel/debug/tracing/events/block/block_rq_insert/format

1.2.3 sched_switch
1.2.3.1 Source definition
TRACE_EVENT(sched_switch,
	TP_PROTO(bool preempt,
		 struct task_struct *prev,
		 struct task_struct *next),
	TP_ARGS(preempt, prev, next),
	TP_STRUCT__entry(
		__array( char, prev_comm, TASK_COMM_LEN )
		__field( pid_t, prev_pid )
		__field( int, prev_prio )
		__field( long, prev_state )
		__array( char, next_comm, TASK_COMM_LEN )
		__field( pid_t, next_pid )
		__field( int, next_prio )
	),
	TP_fast_assign(
		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
		__entry->prev_pid = prev->pid;
		__entry->prev_prio = prev->prio;
		__entry->prev_state = __trace_sched_switch_state(preempt, prev); /* prev->state */
		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
		__entry->next_pid = next->pid;
		__entry->next_prio = next->prio;
		/* XXX SCHED_DEADLINE */
	),
	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
		__entry->prev_state & (TASK_STATE_MAX-1) ?
		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
				{ 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" },
				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
				{ 128, "K" }, { 256, "W" }, { 512, "P" },
				{ 1024, "N" }) : "R",
		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
		__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
1.2.3.2 Sample output
dd 24961 [037] 3195292.316309: sched:sched_switch: dd:24961 [120] D|W ==> swapper/37:0 [120]
Fields, in order: executable name (excluding path), PID, core number, timestamp, tracepoint, then prev_comm:prev_pid [prev_prio, range 0-139] prev_state ==> next_comm:next_pid [next_prio].
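The prev_state letters in this line come from the flag table in the TP_printk above. A sketch of that decoding (the bit values follow the printk shown here; they differ between kernel versions):

```python
# Replicate the prev_state decoding in the sched_switch TP_printk above:
# each set bit below TASK_STATE_MAX maps to a letter; no bits set means "R".
STATE_FLAGS = [
    (1, "S"), (2, "D"), (4, "T"), (8, "t"),
    (16, "Z"), (32, "X"), (64, "x"),
    (128, "K"), (256, "W"), (512, "P"), (1024, "N"),
]
TASK_STATE_MAX = 4096

def decode_prev_state(state: int) -> str:
    bits = state & (TASK_STATE_MAX - 1)
    s = "|".join(ch for bit, ch in STATE_FLAGS if bits & bit) or "R"
    if state & TASK_STATE_MAX:  # the printk appends "+" for this bit
        s += "+"
    return s

print(decode_prev_state(2 | 256))  # → D|W, as in the sample line above
```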

1.2.3.3 Linux process states
/* Used in tsk->state: */
#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED          4
#define __TASK_TRACED           8

/* Used in tsk->exit_state: */
#define EXIT_DEAD               16
#define EXIT_ZOMBIE             32
#define EXIT_TRACE              (EXIT_ZOMBIE | EXIT_DEAD)

/* Used in tsk->state again: */
#define TASK_DEAD               64
#define TASK_WAKEKILL           128
#define TASK_WAKING             256
#define TASK_PARKED             512
#define TASK_NOLOAD             1024
#define TASK_NEW                2048
#define TASK_STATE_MAX          4096

#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"

State  Meaning
R      Running or runnable
S      Interruptible sleep (waiting for an event)
D      Uninterruptible sleep (typically waiting on disk I/O)
t      Being traced (stopped by a debugger)
T      Stopped (by a job-control signal) or traced
X      Dead (about to be reaped)
Z      Zombie (terminated; only the task image remains)
x      TASK_DEAD
K      TASK_WAKEKILL
W      TASK_WAKING
P      TASK_PARKED
N      TASK_NOLOAD
n      TASK_NEW
1.2.3.4 Viewing the output format
cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
2 Sampling experiments
2.1 I/O time sampling
2.1.1 What counts as I/O time
The duration of a bio request, or of a request (?)
2.1.2 Test command
Note: when memory is large enough that the data already sits in the page cache, read operations may not be captured.
First run: echo 3 > /proc/sys/vm/drop_caches
perf record -e block:block_bio_queue -e block:block_bio_complete -a dd oflag=direct if=/dev/sda2 of=/home/lxb/test.txt bs=1G count=1
2.1.3 Result notes
Issuing and completing a read request (bio; writes are similar):

Note: several contiguous bios may be merged into a single request before being issued. In that case the corresponding bio_complete events are not emitted; only rq_complete is. Sampling bio_complete directly is therefore unreliable.

The time deltas between the sampling points compare as follows:
2.1.3.1 Reads:

block_bio_queue -> block_rq_insert (0.0007) -> block_rq_complete (0.03) -> block_bio_complete (?)
The gap between block_bio_queue and block_rq_insert is two orders of magnitude smaller than the total, and block_bio_complete closely tracks block_rq_complete, so block_rq_complete can stand in for block_bio_complete.
2.1.3.2 Writes:

block_bio_queue -> block_rq_insert (0.00001) -> block_rq_complete (0.004) -> block_bio_complete (?)
Again the block_bio_queue-to-block_rq_insert gap is two orders of magnitude below the total, and block_bio_complete closely tracks block_rq_complete, so block_rq_complete can also stand in for block_bio_complete.
2.1.3.3 Conclusion
The duration of a bio request can be approximated as the time from block_bio_queue to block_rq_complete.
2.2 I/O Summary
2.2.1 Calculation method
Compute the execution time of each I/O request (T_rq_complete - T_rq_insert, using the sector address to pair each insert with its matching complete), then bucket the execution times into time ranges.
X axis: time taken to execute a request; Y axis: number of requests falling in each time range.
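The pairing and bucketing described above can be sketched as follows. The `events` tuple format is illustrative, not actual perf output; a real script would first parse `perf script` lines into it, and the bucket boundaries here are arbitrary:

```python
from collections import defaultdict

# Match block_rq_insert to block_rq_complete by sector address and
# bucket the resulting latencies into a simple histogram.
def io_latency_histogram(events):
    pending = {}                  # sector -> insert timestamp
    histogram = defaultdict(int)  # bucket label -> request count
    for ts, name, sector in sorted(events):
        if name == "block_rq_insert":
            pending[sector] = ts
        elif name == "block_rq_complete" and sector in pending:
            latency = ts - pending.pop(sector)
            if latency < 1e-3:
                histogram["<1ms"] += 1
            elif latency < 1e-2:
                histogram["1-10ms"] += 1
            else:
                histogram[">=10ms"] += 1
    return dict(histogram)

events = [
    (0.000, "block_rq_insert",   1051648),
    (0.004, "block_rq_complete", 1051648),
    (0.010, "block_rq_insert",   2048),
    (0.040, "block_rq_complete", 2048),
]
print(io_latency_histogram(events))  # → {'1-10ms': 1, '>=10ms': 1}
```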
2.2.2 測試命令
perf record -e block:block_rq_insert -e block:block_rq_complete -a dd oflag=direct if=/dev/sda2 of=/home/lxb/test.txt bs=1G count=1
2.2.3 Results (read, write)
VTune:

Own:

2.3 I/O Depth
2.3.1 The I/O depth concept
The I/O queue depth determines the maximum number of concurrent I/Os that can be in flight for a given block device. It is equivalent to the number of requests currently waiting in the I/O queue to be processed.
View the queue depth limit: cat /sys/block/sdc/queue/nr_requests
Recurrence: the queue depth at time T is N_depth(T) = N_depth(T-1) + N_insert(T) - N_complete(T), where the last two terms count requests inserted and completed during the interval ending at T.
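The recurrence can be sketched as a running counter: each block_rq_insert raises the depth by one, each block_rq_complete lowers it. The event format here is illustrative:

```python
# Compute a running I/O depth series from insert/complete events.
def io_depth_series(events):
    depth = 0
    series = []  # (timestamp, depth after the event)
    for ts, name in sorted(events):
        if name == "block_rq_insert":
            depth += 1
        elif name == "block_rq_complete":
            depth -= 1
        series.append((ts, depth))
    return series

events = [
    (0.001, "block_rq_insert"),
    (0.002, "block_rq_insert"),
    (0.003, "block_rq_complete"),
    (0.004, "block_rq_complete"),
]
print(io_depth_series(events))  # depth peaks at 2, returns to 0
```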
