Analysis of the Linux DirectIO Mechanism

http://www.oenhan.com/ext3-fs-directio


DirectIO is an option on the write path (the O_DIRECT flag) that makes data go straight to disk instead of the page cache, guaranteeing that critical data reaches the disk even if the system goes down. For the general mechanics of writing a file, see the earlier post <Linux內核寫文件流程> (Linux kernel file-write flow); the DirectIO path picks up from there.
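As a quick user-space illustration of what requesting this behavior looks like, here is a minimal sketch (my own example, not from the original post). Note that O_DIRECT is actually a flag passed to open, and that the buffer, offset, and length typically must be aligned to the device's logical block size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    if (fd < 0)
        return 1;
    /* O_DIRECT requires the buffer, offset and length to be aligned,
     * commonly to 512 or 4096 bytes depending on the device */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 'A', 4096);
    if (write(fd, buf, 4096) != 4096)   /* bypasses the page cache */
        return 1;
    free(buf);
    close(fd);
    return 0;
}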

When the kernel reaches __generic_file_aio_write, it tests file->f_flags & O_DIRECT and, if set, takes the DirectIO branch:

if (unlikely(file->f_flags & O_DIRECT)) {
    loff_t endbyte;
    ssize_t written_buffered;
 
    written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                        ppos, count, ocount);
    if (written < 0 || written == count)
        goto out;
    /*
     * direct-io write to a hole: fall through to buffered I/O
     * for completing the rest of the request.
     */
    pos += written;
    count -= written;
    written_buffered = generic_file_buffered_write(iocb, iov,
                    nr_segs, pos, ppos, count,
                    written);
    /*
     * If generic_file_buffered_write() returned a synchronous error
     * then we want to return the number of bytes which were
     * direct-written, or the error code if that was zero.  Note
     * that this differs from normal direct-io semantics, which
     * will return -EFOO even if some bytes were written.
     */
    if (written_buffered < 0) {
        err = written_buffered;
        goto out;
    }
 
    /*
     * We need to ensure that the page cache pages are written to
     * disk and invalidated to preserve the expected O_DIRECT
     * semantics.
     */
    endbyte = pos + written_buffered - written - 1;
    err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
    if (err == 0) {
        written = written_buffered;
        invalidate_mapping_pages(mapping,
                     pos >> PAGE_CACHE_SHIFT,
                     endbyte >> PAGE_CACHE_SHIFT);
    } else {
        /*
         * We don't know how much we wrote, so just return
         * the number of bytes which were direct-written
         */
    }
}

Start with generic_file_direct_write. Three calls do the real work: filemap_write_and_wait_range, invalidate_inode_pages2_range, and mapping->a_ops->direct_IO.
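For orientation, here is an abridged skeleton of generic_file_direct_write as it looked in 2.6-era kernels (mm/filemap.c); declarations and bookkeeping are elided (write_len and end are derived from pos and the iovec length), so read it as a sketch rather than the verbatim source:

ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
        unsigned long *nr_segs, loff_t pos, loff_t *ppos,
        size_t count, size_t ocount)
{
    struct file *file = iocb->ki_filp;
    struct address_space *mapping = file->f_mapping;
    ssize_t written;
    ...
    /* step 1: flush dirty pages covering the target range */
    written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
    if (written)
        goto out;

    /* step 2: invalidate clean cached pages covering the range */
    if (mapping->nrpages) {
        written = invalidate_inode_pages2_range(mapping,
                    pos >> PAGE_CACHE_SHIFT, end);
        if (written) {
            if (written == -EBUSY)
                return 0;   /* caller falls back to buffered write */
            goto out;
        }
    }

    /* step 3: hand the request to the filesystem's direct_IO method */
    written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
    ...
out:
    return written;
}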

filemap_write_and_wait_range flushes the mapping's dirty pages; through __filemap_fdatawrite_range it ends up calling do_writepages:

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
    int ret;
 
    if (wbc->nr_to_write <= 0)
        return 0;
    if (mapping->a_ops->writepages)
        ret = mapping->a_ops->writepages(mapping, wbc);
    else
        ret = generic_writepages(mapping, wbc);
    return ret;
}

If filemap_write_and_wait_range returns non-zero, generic_file_direct_write returns immediately and the two calls that follow never execute. My understanding is that before a direct write, any related data sitting in the cache must be flushed to disk along with it; otherwise the direct_IO data would be on disk while earlier cached data was not, and after a system crash the filesystem would be left broken.

If the flush reports nothing wrong, mapping->nrpages decides whether to enter invalidate_inode_pages2_range, which checks whether the range about to be written by direct_IO currently has cached pages in memory and, if so, marks them invalid. The motivation: data written by direct_IO is not cached, so if clean cached pages covered the range beforehand, cache and disk would disagree once direct_IO completed, and a read served from the cache without this protection would not return what is actually on disk. If some pages cannot be invalidated, the function returns and the steps after it are not executed.

Only after that do we reach the real subject: mapping->a_ops->direct_IO. For ext3 it is defined in struct address_space_operations ext3_ordered_aops as ext3_direct_IO, implemented at its core by __blockdev_direct_IO: direct_io_worker assembles the dio structure, and dio_bio_submit pushes it down to the IO layer, essentially by calling submit_bio(dio->rw, bio). Compared with ordinary reads and writes, direct_IO simply crosses over the buffer layer; there is no pdflush or kjournald thread in the middle periodically flushing to the IO layer. Even at this point the data is not necessarily on disk; direct_IO just assumes up front that the device driver below has no large latency.
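For reference, the hook is wired up in fs/ext3/inode.c (2.6-era kernels) roughly like this, with unrelated fields elided:

static const struct address_space_operations ext3_ordered_aops = {
    .readpage       = ext3_readpage,
    .writepage      = ext3_ordered_writepage,
    ...
    .direct_IO      = ext3_direct_IO,
    ...
};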

Once mapping->a_ops->direct_IO completes, invalidate_inode_pages2_range is run one more time, for the reason given in this kernel comment:

/*
 * Finally, try again to invalidate clean pages which might have been
 * cached by non-direct readahead, or faulted in by get_user_pages()
 * if the source of the write was an mmap'ed region of the file
 * we're writing.  Either one is a pretty crazy thing to do,
 * so we don't support it 100%.  If this invalidation
 * fails, tough, the write still worked...
 */

When a system is this complex, it is hard to guarantee every step with a fully rigorous, formal argument; sometimes a crude best-effort fix is simple and effective.

Returning once more to __generic_file_aio_write:

written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                    ppos, count, ocount);
if (written < 0 || written == count)
    goto out;
/*
 * direct-io write to a hole: fall through to buffered I/O
 * for completing the rest of the request.
 */
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
                nr_segs, pos, ppos, count,
                written);

If generic_file_direct_write returns something other than count, the buffered write generic_file_buffered_write is executed for the rest. As analyzed above, if the data being written had related dirty pages, or had corresponding cached pages that, although clean, could not be invalidated, the amount written falls short of the expected count and the remainder has to be written again through the cache (the comment in the code also calls out a direct-io write into a hole as one such case).
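To make the arithmetic concrete (illustrative numbers, not from the post): suppose count is 8192 bytes and the direct write manages only written = 4096. Then pos advances by 4096, count drops to 4096, and generic_file_buffered_write writes those remaining 4096 bytes, returning the cumulative total written_buffered = 8192. endbyte = pos + written_buffered - written - 1 then lands exactly on the last byte of the buffered portion, so the filemap_write_and_wait_range/invalidate_mapping_pages pair that follows flushes and drops precisely the pages the buffered fallback dirtied, restoring the expected O_DIRECT semantics.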

The upshot: so-called direct_IO does not fully guarantee crossing over the buffer; under certain conditions it is a buffered write after all. Where strict directIO is a hard requirement, those two situations have to be designed around by keeping the file's cache mapping under control.

The small tool vmtouch remains simple and effective for this kind of page-cache control.
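For completeness, here is a minimal C sketch of the same idea (my own illustration using the standard posix_fadvise call; the helper name drop_file_cache is made up): flush a file's dirty pages and ask the kernel to drop its clean cached pages before an O_DIRECT writer touches the file, so the fallback cases above are much less likely to trigger. POSIX_FADV_DONTNEED is advisory, not a guarantee.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical helper, not part of any library */
int drop_file_cache(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    fsync(fd);          /* flush dirty pages to disk first */
    /* then ask the kernel to drop the file's clean cached pages */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
        perror("posix_fadvise");
    close(fd);
    return 0;
}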

