http://www.oenhan.com/ext3-fs-directio
DirectIO is an option on the write path (the O_DIRECT flag): it sends data straight to disk instead of through the page cache, so that critical data is on disk even if the system crashes. For the general file-write mechanism, see the earlier post <Linux內核寫文件流程> (the Linux kernel file-write flow); the DirectIO path branches off from it.
When the kernel reaches __generic_file_aio_write, it tests file->f_flags & O_DIRECT and enters the DirectIO branch:
```c
if (unlikely(file->f_flags & O_DIRECT)) {
	loff_t endbyte;
	ssize_t written_buffered;

	written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
						ppos, count, ocount);
	if (written < 0 || written == count)
		goto out;
	/*
	 * direct-io write to a hole: fall through to buffered I/O
	 * for completing the rest of the request.
	 */
	pos += written;
	count -= written;
	written_buffered = generic_file_buffered_write(iocb, iov,
					nr_segs, pos, ppos, count,
					written);
	/*
	 * If generic_file_buffered_write() retuned a synchronous error
	 * then we want to return the number of bytes which were
	 * direct-written, or the error code if that was zero. Note
	 * that this differs from normal direct-io semantics, which
	 * will return -EFOO even if some bytes were written.
	 */
	if (written_buffered < 0) {
		err = written_buffered;
		goto out;
	}

	/*
	 * We need to ensure that the page cache pages are written to
	 * disk and invalidated to preserve the expected O_DIRECT
	 * semantics.
	 */
	endbyte = pos + written_buffered - written - 1;
	err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
	if (err == 0) {
		written = written_buffered;
		invalidate_mapping_pages(mapping,
					 pos >> PAGE_CACHE_SHIFT,
					 endbyte >> PAGE_CACHE_SHIFT);
	} else {
		/*
		 * We don't know how much we wrote, so just return
		 * the number of bytes which were direct-written
		 */
	}
}
```
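The fall-through logic can be modelled in user space. Below is a minimal sketch, not kernel code: write_with_fallback, fake_direct and fake_buffered are hypothetical stand-ins, with the "direct" writer stopping at a hole and the remainder completed by a "buffered" writer, mirroring the `written < count` check above.

```c
#include <stddef.h>

typedef long (*writer_fn)(long pos, long count);

/* Stand-in for a direct writer that can only reach byte 4096
 * (everything past that is a "hole" it cannot fill directly). */
static long fake_direct(long pos, long count)
{
    long limit = 4096 - pos;
    if (limit <= 0)
        return 0;
    return count < limit ? count : limit;
}

/* Stand-in for the buffered path: always completes the request. */
static long fake_buffered(long pos, long count)
{
    (void)pos;
    return count;
}

/* Mirrors the control flow of the O_DIRECT branch above:
 * try direct first; on a short write, finish via the cache. */
long write_with_fallback(writer_fn direct, writer_fn buffered,
                         long pos, long count)
{
    long written = direct(pos, count);
    if (written < 0 || written == count)
        return written;          /* error, or fully direct-written */

    pos += written;              /* fall through to buffered I/O   */
    count -= written;
    long written_buffered = buffered(pos, count);
    if (written_buffered < 0)
        return written ? written : written_buffered;
    return written + written_buffered;
}
```

Note that, just as in the kernel, the caller cannot tell afterwards which bytes went direct and which went through the cache.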
Start with generic_file_direct_write; the real work is done by filemap_write_and_wait_range, invalidate_inode_pages2_range, and mapping->a_ops->direct_IO.
filemap_write_and_wait_range flushes the dirty pages under the mapping; the write-out happens in __filemap_fdatawrite_range, which calls do_writepages:
```c
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	int ret;

	if (wbc->nr_to_write <= 0)
		return 0;
	if (mapping->a_ops->writepages)
		ret = mapping->a_ops->writepages(mapping, wbc);
	else
		ret = generic_writepages(mapping, wbc);
	return ret;
}
```
If filemap_write_and_wait_range returns nonzero (a write-back error), generic_file_direct_write returns at once and the next two functions never run. My reading of why the flush comes first: data cached earlier for the same range must reach the disk together with the direct write; otherwise the direct_IO data would be on disk while older buffered data is not, and after a system crash the file would be inconsistent.
If the flush succeeds, and mapping->nrpages says the mapping has cached pages, invalidate_inode_pages2_range runs: it checks whether any cached pages cover the range about to be written by direct_IO and, if so, marks them invalid. The reason is that direct_IO data is not cached: if a clean cached page covered the range before the write, then after direct_IO finishes the cache and the disk would disagree, and an unprotected read served from the cache would not return the on-disk data. If some pages cannot be invalidated, an error is returned and the rest of the path is not executed.
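The invalidation range is computed from byte offsets with simple shifts. A small sketch (hypothetical helper; 4 KiB pages, i.e. PAGE_SHIFT = 12, is assumed) of the same arithmetic as `pos >> PAGE_CACHE_SHIFT` .. `endbyte >> PAGE_CACHE_SHIFT`:

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages assumed */

struct page_range {
    uint64_t first;     /* first page index to invalidate */
    uint64_t last;      /* last page index, inclusive     */
};

/* Page-index range covered by a write of `len` bytes at byte `pos`;
 * endbyte = pos + len - 1 is inclusive, as in the kernel snippet. */
struct page_range pages_to_invalidate(uint64_t pos, uint64_t len)
{
    struct page_range r;
    r.first = pos >> PAGE_SHIFT;
    r.last  = (pos + len - 1) >> PAGE_SHIFT;
    return r;
}
```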
Only then comes the real subject, mapping->a_ops->direct_IO, defined in struct address_space_operations ext3_ordered_aops as ext3_direct_IO. Its core is __blockdev_direct_IO: direct_io_worker assembles the dio structure, and dio_bio_submit hands it to the IO layer, essentially via submit_bio(dio->rw, bio). Compared with ordinary reads and writes, direct_io skips the buffer layer: no pdflush or kjournald threads periodically flushing to the IO layer in between. Even so, the data is not necessarily on the platter at this point; direct_IO simply assumes the device driver below has no large latency.
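From user space, the price of skipping the buffer layer is that O_DIRECT imposes alignment rules on the caller: buffer address, file offset and transfer length must be aligned to the logical block size. A hedged sketch (the helper names are mine; 4096 is assumed as a safe alignment, and the write helper may fail on filesystems without O_DIRECT support):

```c
#define _GNU_SOURCE     /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 040000 /* glibc value on x86 Linux, normally from _GNU_SOURCE */
#endif

#define DIO_ALIGN 4096

/* Allocate a zeroed buffer whose address satisfies O_DIRECT
 * alignment; returns NULL on failure. */
void *alloc_dio_buffer(size_t size)
{
    void *buf = NULL;
    if (posix_memalign(&buf, DIO_ALIGN, size) != 0)
        return NULL;
    memset(buf, 0, size);
    return buf;
}

/* Write one aligned block with O_DIRECT; returns bytes written,
 * or -1 (e.g. the filesystem may not support O_DIRECT at all). */
ssize_t dio_write_block(const char *path, const void *buf)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, DIO_ALIGN);
    close(fd);
    return n;
}
```

A misaligned buffer or length makes the write fail with EINVAL, which is the same contract __blockdev_direct_IO enforces on the kernel side.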
Once mapping->a_ops->direct_IO has finished, invalidate_inode_pages2_range is run one more time, for the reason given in the kernel's own comment:
/* Finally, try again to invalidate clean pages which might have been cached by non-direct readahead, or faulted in by get_user_pages() if the source of the write was an mmap'ed region of the file we're writing. Either one is a pretty crazy thing to do, so we don't support it 100%. If this invalidation fails, tough, the write still worked... */
When a system is this complex, it is hard to find a fully rigorous, step-by-step guarantee for every path; sometimes a crude, best-effort approach is simple and effective.
Back in __generic_file_aio_write:
```c
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
					ppos, count, ocount);
if (written < 0 || written == count)
	goto out;
/*
 * direct-io write to a hole: fall through to buffered I/O
 * for completing the rest of the request.
 */
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
				nr_segs, pos, ppos, count,
				written);
```
If generic_file_direct_write returns less than count (and not an error), the remainder is redone as a buffered write through generic_file_buffered_write; per the kernel comment, this is the case of a direct-io write into a hole, where the direct path completes only part of the request.
So the so-called direct_IO does not strictly guarantee bypassing the buffer: under some conditions it falls back to a buffered write. Where strict directIO behaviour is required, these cases have to be avoided by keeping the cache mapping under control.
For that kind of page-cache control, the small tool vmtouch is simple and effective.