http://www.oenhan.com/ext3-fs-directio
DirectIO is an option on the write path (the O_DIRECT flag): it sends data straight to disk instead of through the page cache, so that critical data is on disk even if the system crashes. For the general file-write mechanism, see the earlier post <Linux內核寫文件流程> (the Linux kernel file-write flow); the DirectIO path branches off from it.
When the kernel reaches __generic_file_aio_write, it tests file->f_flags & O_DIRECT and enters the DirectIO branch:
```c
if (unlikely(file->f_flags & O_DIRECT)) {
	loff_t endbyte;
	ssize_t written_buffered;

	written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
						ppos, count, ocount);
	if (written < 0 || written == count)
		goto out;
	/*
	 * direct-io write to a hole: fall through to buffered I/O
	 * for completing the rest of the request.
	 */
	pos += written;
	count -= written;
	written_buffered = generic_file_buffered_write(iocb, iov,
					nr_segs, pos, ppos, count,
					written);
	/*
	 * If generic_file_buffered_write() retuned a synchronous error
	 * then we want to return the number of bytes which were
	 * direct-written, or the error code if that was zero. Note
	 * that this differs from normal direct-io semantics, which
	 * will return -EFOO even if some bytes were written.
	 */
	if (written_buffered < 0) {
		err = written_buffered;
		goto out;
	}

	/*
	 * We need to ensure that the page cache pages are written to
	 * disk and invalidated to preserve the expected O_DIRECT
	 * semantics.
	 */
	endbyte = pos + written_buffered - written - 1;
	err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
	if (err == 0) {
		written = written_buffered;
		invalidate_mapping_pages(mapping,
					 pos >> PAGE_CACHE_SHIFT,
					 endbyte >> PAGE_CACHE_SHIFT);
	} else {
		/*
		 * We don't know how much we wrote, so just return
		 * the number of bytes which were direct-written
		 */
	}
}
```
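The fall-through logic can be modelled in user space. Below is a minimal sketch, not kernel code: write_with_fallback, fake_direct and fake_buffered are hypothetical stand-ins, with the "direct" writer stopping at a hole and the remainder completed by a "buffered" writer, mirroring the `written < count` check above.

```c
#include <stddef.h>

typedef long (*writer_fn)(long pos, long count);

/* Stand-in for a direct writer that can only reach byte 4096
 * (everything past that is a "hole" it cannot fill directly). */
static long fake_direct(long pos, long count)
{
    long limit = 4096 - pos;
    if (limit <= 0)
        return 0;
    return count < limit ? count : limit;
}

/* Stand-in for the buffered path: always completes the request. */
static long fake_buffered(long pos, long count)
{
    (void)pos;
    return count;
}

/* Mirrors the control flow of the O_DIRECT branch above:
 * try direct first; on a short write, finish via the cache. */
long write_with_fallback(writer_fn direct, writer_fn buffered,
                         long pos, long count)
{
    long written = direct(pos, count);
    if (written < 0 || written == count)
        return written;          /* error, or fully direct-written */

    pos += written;              /* fall through to buffered I/O   */
    count -= written;
    long written_buffered = buffered(pos, count);
    if (written_buffered < 0)
        return written ? written : written_buffered;
    return written + written_buffered;
}
```

Note that, just as in the kernel, the caller cannot tell afterwards which bytes went direct and which went through the cache.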
Start with generic_file_direct_write; the real work is done by filemap_write_and_wait_range, invalidate_inode_pages2_range, and mapping->a_ops->direct_IO.
filemap_write_and_wait_range flushes the dirty pages under the mapping; the write-out happens in __filemap_fdatawrite_range, which calls do_writepages:
```c
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	int ret;

	if (wbc->nr_to_write <= 0)
		return 0;
	if (mapping->a_ops->writepages)
		ret = mapping->a_ops->writepages(mapping, wbc);
	else
		ret = generic_writepages(mapping, wbc);
	return ret;
}
```
If filemap_write_and_wait_range returns nonzero (a write-back error), generic_file_direct_write returns at once and the next two functions never run. My reading of why the flush comes first: data cached earlier for the same range must reach the disk together with the direct write; otherwise the direct_IO data would be on disk while older buffered data is not, and after a system crash the file would be inconsistent.
If the flush succeeds, and mapping->nrpages says the mapping has cached pages, invalidate_inode_pages2_range runs: it checks whether any cached pages cover the range about to be written by direct_IO and, if so, marks them invalid. The reason is that direct_IO data is not cached: if a clean cached page covered the range before the write, then after direct_IO finishes the cache and the disk would disagree, and an unprotected read served from the cache would not return the on-disk data. If some pages cannot be invalidated, an error is returned and the rest of the path is not executed.
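The invalidation range is computed from byte offsets with simple shifts. A small sketch (hypothetical helper; 4 KiB pages, i.e. PAGE_SHIFT = 12, is assumed) of the same arithmetic as `pos >> PAGE_CACHE_SHIFT` .. `endbyte >> PAGE_CACHE_SHIFT`:

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages assumed */

struct page_range {
    uint64_t first;     /* first page index to invalidate */
    uint64_t last;      /* last page index, inclusive     */
};

/* Page-index range covered by a write of `len` bytes at byte `pos`;
 * endbyte = pos + len - 1 is inclusive, as in the kernel snippet. */
struct page_range pages_to_invalidate(uint64_t pos, uint64_t len)
{
    struct page_range r;
    r.first = pos >> PAGE_SHIFT;
    r.last  = (pos + len - 1) >> PAGE_SHIFT;
    return r;
}
```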
Only then comes the real subject, mapping->a_ops->direct_IO, defined in struct address_space_operations ext3_ordered_aops as ext3_direct_IO. Its core is __blockdev_direct_IO: direct_io_worker assembles the dio structure, and dio_bio_submit hands it to the IO layer, essentially via submit_bio(dio->rw, bio). Compared with ordinary reads and writes, direct_io skips the buffer layer: no pdflush or kjournald threads periodically flushing to the IO layer in between. Even so, the data is not necessarily on the platter at this point; direct_IO simply assumes the device driver below has no large latency.
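From user space, the price of skipping the buffer layer is that O_DIRECT imposes alignment rules on the caller: buffer address, file offset and transfer length must be aligned to the logical block size. A hedged sketch (the helper names are mine; 4096 is assumed as a safe alignment, and the write helper may fail on filesystems without O_DIRECT support):

```c
#define _GNU_SOURCE     /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 040000 /* glibc value on x86 Linux, normally from _GNU_SOURCE */
#endif

#define DIO_ALIGN 4096

/* Allocate a zeroed buffer whose address satisfies O_DIRECT
 * alignment; returns NULL on failure. */
void *alloc_dio_buffer(size_t size)
{
    void *buf = NULL;
    if (posix_memalign(&buf, DIO_ALIGN, size) != 0)
        return NULL;
    memset(buf, 0, size);
    return buf;
}

/* Write one aligned block with O_DIRECT; returns bytes written,
 * or -1 (e.g. the filesystem may not support O_DIRECT at all). */
ssize_t dio_write_block(const char *path, const void *buf)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, DIO_ALIGN);
    close(fd);
    return n;
}
```

A misaligned buffer or length makes the write fail with EINVAL, which is the same contract __blockdev_direct_IO enforces on the kernel side.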
Once mapping->a_ops->direct_IO has finished, invalidate_inode_pages2_range is run one more time, for the reason given in the kernel's own comment:
/* Finally, try again to invalidate clean pages which might have been cached by non-direct readahead, or faulted in by get_user_pages() if the source of the write was an mmap'ed region of the file we're writing. Either one is a pretty crazy thing to do, so we don't support it 100%. If this invalidation fails, tough, the write still worked... */
When a system is this complex, it is hard to find a fully rigorous, step-by-step guarantee for every path; sometimes a crude, best-effort approach is simple and effective.
Back in __generic_file_aio_write:
```c
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
					ppos, count, ocount);
if (written < 0 || written == count)
	goto out;
/*
 * direct-io write to a hole: fall through to buffered I/O
 * for completing the rest of the request.
 */
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
				nr_segs, pos, ppos, count,
				written);
```
If generic_file_direct_write returns less than count (and not an error), the remainder is redone as a buffered write through generic_file_buffered_write; per the kernel comment, this is the case of a direct-io write into a hole, where the direct path completes only part of the request.
So the so-called direct_IO does not strictly guarantee bypassing the buffer: under some conditions it falls back to a buffered write. Where strict directIO behaviour is required, these cases have to be avoided by keeping the cache mapping under control.
For that kind of page-cache control, the small tool vmtouch is simple and effective.