Asynchronous vectored I/O operations

The English original is https://lwn.net/Articles/170954/, which describes vectored operations. Analysis of the corresponding kernel code has been inserted into the middle of the original text below.

The file_operations structure contains pointers to the basic I/O operations exported by filesystems and char device drivers. This structure currently contains three different methods for performing a read operation:

    ssize_t (*read) (struct file *filp, char __user *buffer, size_t size, 
                     loff_t *pos);
    ssize_t (*readv) (struct file *filp, const struct iovec *iov, 
                      unsigned long niov, loff_t *pos);
    ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer, 
                         size_t size, loff_t pos);

Normal read operations end up with a call to the read() method, which reads a single segment from the source into the supplied buffer. The readv() method implements the system call by the same name; it will read one segment and scatter it into several user buffers, each of which is described by an iovec structure. Finally, aio_read() is invoked in response to asynchronous I/O requests; it reads a single segment into the supplied buffer, possibly returning before the operation is complete. There is a similar set of three methods for write operations.

========================= Added commentary ============================

The operations above, which read a single segment into a single user buffer, correspond to the opcode IO_CMD_PREAD. For the vectored read operation the opcode is IO_CMD_PREADV; it also reads a single segment (belonging to a disk partition or file), but scatters the data into multiple user buffers, each of which is described by a struct iovec:

    struct iovec iov = {
        .iov_base = buf,   /* user buffer that receives the data */
        .iov_len  = count  /* number of bytes to read into this buffer, i.e. the buffer's length */
    };

As can also be seen from io_prep_preadv() below, only a single contiguous segment of the file is read: data is read starting at the given offset in the file referred to by fd, and is scattered into the multiple user buffers pointed to by struct iovec *iov (completing the I/O actually involves DMA moving the data from the backing device into the kernel pages that the user buffers map to); the number of user buffers is given by the iovcnt parameter:
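
A sketch of that libaio helper, paraphrased from the libaio headers (the exact code may differ slightly between libaio versions):

    static inline void io_prep_preadv(struct iocb *iocb, int fd,
                                      const struct iovec *iov, int iovcnt,
                                      long long offset)
    {
        memset(iocb, 0, sizeof(*iocb));
        iocb->aio_fildes     = fd;
        iocb->aio_lio_opcode = IO_CMD_PREADV;   /* vectored read opcode */
        iocb->aio_reqprio    = 0;
        iocb->u.c.buf        = (void *)iov;     /* the iovec array, not a data buffer */
        iocb->u.c.nbytes     = iovcnt;          /* number of iovec entries, not a byte count */
        iocb->u.c.offset     = offset;          /* single starting offset in the file */
    }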

The aio_setup_rw() function is called on the AIO path; for vectored operations it calls import_iovec() to build an iterator over the multiple user buffer + length pairs, which is then used by the subsequent I/O:
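
A sketch of that function, paraphrased from fs/aio.c in kernels around the 4.x series (the exact code and helper names vary between kernel versions):

    static ssize_t aio_setup_rw(int rw, struct iocb *iocb, struct iovec **iovec,
                                bool vectored, bool compat, struct iov_iter *iter)
    {
        void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
        size_t len = iocb->aio_nbytes;

        if (!vectored) {
            /* Single-buffer case (IOCB_CMD_PREAD/PWRITE): wrap the one
             * user buffer in an iov_iter. */
            ssize_t ret = import_single_range(rw, buf, len, *iovec, iter);
            *iovec = NULL;
            return ret;
        }
    #ifdef CONFIG_COMPAT
        if (compat)
            return compat_import_iovec(rw, buf, len, UIO_FASTIOV, iovec, iter);
    #endif
        /* Vectored case (IOCB_CMD_PREADV/PWRITEV): here buf points at the
         * user iovec array and len is the number of entries.  import_iovec()
         * copies and validates the array and initializes the iov_iter that
         * the rest of the I/O path consumes. */
        return import_iovec(rw, buf, len, UIO_FASTIOV, iovec, iter);
    }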

For a vectored read, the total amount of data to read is the sum of what all of the user buffers in the array can hold, i.e. the sum of the iov_len fields of the individual struct iovec descriptors:

    struct iovec iov = { .iov_base = buf /* user buffer */, .iov_len = count /* bytes to read into this buffer */ };

The rw_copy_check_uvector() function copies the user-space iovec array into the kernel, validates each segment, and returns the total number of bytes to transfer (the sum of the iov_len fields):
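
A condensed sketch of that function, paraphrased from fs/read_write.c in older kernels (details such as the access_ok() signature differ between versions):

    ssize_t rw_copy_check_uvector(int type, const struct iovec __user *uvector,
                                  unsigned long nr_segs, unsigned long fast_segs,
                                  struct iovec *fast_pointer,
                                  struct iovec **ret_pointer)
    {
        struct iovec *iov = fast_pointer;
        unsigned long seg;
        ssize_t ret = 0;

        if (nr_segs == 0)
            goto out;
        if (nr_segs > UIO_MAXIOV) {
            ret = -EINVAL;
            goto out;
        }
        /* Fall back to a heap allocation when the caller's small on-stack
         * array (fast_pointer/fast_segs) cannot hold all the segments. */
        if (nr_segs > fast_segs) {
            iov = kmalloc(nr_segs * sizeof(struct iovec), GFP_KERNEL);
            if (iov == NULL) {
                ret = -ENOMEM;
                goto out;
            }
        }
        if (copy_from_user(iov, uvector, nr_segs * sizeof(*uvector))) {
            ret = -EFAULT;
            goto out;
        }

        /* The total transfer size is the sum of iov_len over all segments;
         * each segment is validated and the sum is capped at MAX_RW_COUNT. */
        for (seg = 0; seg < nr_segs; seg++) {
            void __user *buf = iov[seg].iov_base;
            ssize_t len = (ssize_t)iov[seg].iov_len;

            if (len < 0) {
                ret = -EINVAL;
                goto out;
            }
            if (type >= 0 && unlikely(!access_ok(vrfy_dir(type), buf, len))) {
                ret = -EFAULT;
                goto out;
            }
            if (len > MAX_RW_COUNT - ret) {
                len = MAX_RW_COUNT - ret;
                iov[seg].iov_len = len;
            }
            ret += len;
        }
    out:
        *ret_pointer = iov;
        return ret;
    }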

The purpose and advantages of vectored I/O:

Vectored I/O (also known as scatter/gather I/O) operates on a vector of buffers (made up of multiple user buffers) and allows each system call to read data from or write data to disk using several buffers at once.

When performing a vectored read, bytes are first read from the source into the first buffer, up to its length. Then bytes from the source, starting at the offset given by the first buffer's length and continuing for the length of the second buffer, are read into the second buffer, and so on, as if the source filled the buffers one after another. Vectored writes work in a similar way: the buffers are written out as if they had been concatenated before the write.

This approach helps by allowing smaller chunks to be read, so that no large contiguous memory region has to be allocated for one big block, while at the same time reducing the number of system calls needed to fill all of those buffers with data from disk. Another advantage is that both reads and writes are atomic: the kernel prevents other processes from performing I/O on the same descriptor during the read or write, guaranteeing the integrity of the data.

From a developer's point of view, if the data is laid out in the file in a particular way, for example split into a fixed-size header followed by several fixed-size blocks, a single call can be issued to fill the separate buffers allocated for those parts; that is, with vectored I/O's multiple user buffers, the data of the different blocks is read into the corresponding buffers. The amount each buffer holds is whatever its struct iovec specifies; it simply has to match the layout of the data in the file (as sketched in the example below).

This sounds useful, but somehow only a few databases use vectored I/O. That is probably because general-purpose databases work with a large number of files at the same time, trying to guarantee liveness for every running operation and to reduce their latency, so data is accessed and cached in blocks. Vectored I/O is more useful for analytical workloads and/or column-oriented databases, where data is stored contiguously on disk and can be processed in parallel in sparse blocks. One example is Apache Arrow.
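
As a hypothetical illustration of the layout-matching point above (the structures, sizes, and file name are all made up), a single readv() call can fill one buffer for the header and another for the records:

    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/uio.h>

    struct file_header { char magic[8]; unsigned int nrecords; };  /* hypothetical on-disk header */
    struct record      { unsigned int key; char payload[60]; };    /* hypothetical fixed-size record */

    int main(void)
    {
        struct file_header hdr;
        struct record recs[4];
        struct iovec iov[2] = {
            { .iov_base = &hdr, .iov_len = sizeof(hdr)  },
            { .iov_base = recs, .iov_len = sizeof(recs) },
        };
        int fd = open("data.bin", O_RDONLY);   /* hypothetical file laid out as header + records */
        if (fd < 0)
            return 1;

        /* The kernel fills iov[0] first, then continues into iov[1], as if
         * the two buffers were one contiguous region. */
        ssize_t n = readv(fd, iov, 2);
        printf("read %zd bytes\n", n);
        return 0;
    }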

======================================================= 

Back in November, Zach Brown posted a vectored AIO patch intended to provide a combination of the vectored (readv()/writev()) operations and asynchronous I/O. To that end, it defined a couple of new AIO operations for user space, and added two more file_operations methods: aio_readv() and aio_writev(). There was some resistance to the idea of creating yet another pair of operations, and a feeling that there was a better way. The result, after work by Christoph Hellwig and Badari Pulavarty, is a new vectored AIO patch with a much simpler interface - at the cost of a significant API change.

The observation was made that a number of subsystems use vectored I/O operations internally in all cases, even in the case of a "scalar" read() or write() call. For example, the read() function in the current mainline pipe driver is:

    static ssize_t
    pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
    {
	struct iovec iov = { .iov_base = buf, .iov_len = count };
	return pipe_readv(filp, &iov, 1, ppos);
    }

Here, the read() method is essentially superfluous; it is provided simply because the API requires it. So, it was asked, rather than adding more vectored I/O operations, why not just "vectorize" the standard API? The resulting patch set brings about that change in a couple of steps.

The first of those is to change the prototypes for the asynchronous I/O methods to:

    ssize_t (*aio_read) (struct kiocb *iocb, const struct iovec *iov, 
             unsigned long niov, loff_t pos);
    ssize_t (*aio_write) (struct kiocb *iocb, const struct iovec *iov,  
             unsigned long niov, loff_t pos);

Thus, the single buffer has been replaced with an array of iovec structures, each describing one segment of the I/O operation. For the current single-buffer AIO read and write commands, the new code creates a single-entry iovec array and passes it to the new methods. (It's worth noting that, as the code is currently written, that iovec array is no longer valid after aio_read() or aio_write() returns; that array will need to be copied for any operation which remains outstanding when those functions finish).
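
A minimal sketch of what that implies for a driver (mydev_*, struct mydev_req, and mydev_start_io() are hypothetical names, not part of the patch): an aio_read() method that leaves the request outstanding must copy the iovec array before returning:

    #include <linux/aio.h>
    #include <linux/fs.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/uio.h>

    /* Hypothetical per-request state for a driver that queues AIO requests. */
    struct mydev_req {
        struct kiocb *iocb;
        loff_t pos;
        unsigned long niov;
        struct iovec iov[];
    };

    void mydev_start_io(struct mydev_req *req);   /* hypothetical: kicks off the transfer */

    static ssize_t mydev_aio_read(struct kiocb *iocb, const struct iovec *iov,
                                  unsigned long niov, loff_t pos)
    {
        struct mydev_req *req;

        req = kmalloc(sizeof(*req) + niov * sizeof(*iov), GFP_KERNEL);
        if (!req)
            return -ENOMEM;
        req->iocb = iocb;
        req->pos  = pos;
        req->niov = niov;
        /* The caller's iovec array may not survive this call, so keep a
         * private copy for the part of the operation that completes later. */
        memcpy(req->iov, iov, niov * sizeof(*iov));

        mydev_start_io(req);        /* completion eventually calls aio_complete() */
        return -EIOCBQUEUED;
    }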

The prototypes of a couple of VFS helper functions (generic_file_aio_read() and generic_file_aio_write()) have been changed in a similar manner. These changes ripple through every driver and filesystem providing AIO methods, making the patch reasonably large. A second patch then adds two new AIO operations (IOCB_CMD_PREADV and IOCB_CMD_PWRITEV) to the user-space interface, making vectored asynchronous I/O available to applications.

The patch set then goes one step further by eliminating the readv() and writev() methods altogether. With this patch in place, any filesystem or driver which wishes to provide vectored I/O operations must do so via aio_read() and aio_write() instead. Note that this change does not imply that asynchronous operations themselves must be supported - it is entirely permissible (if suboptimal) for aio_read() and aio_write() to operate synchronously at all times. But this patch does make it necessary for modules wishing to provide vectored operations to, at a minimum, provide the file_operations methods for asynchronous I/O. If the AIO methods are not available for a given device or filesystem, a call to readv() or writev() will be emulated through multiple calls to read() or write(), as usual.

Finally, with this patch in place, it is possible for a driver or filesystem to omit the read() and write() methods altogether if the asynchronous versions are provided. If, for example, only aio_read() is provided, all read() and readv() system calls will be handled by the aio_read() method. If, someday, all code implements the AIO methods, the regular read() and write() methods could be removed altogether. That would result in an interface which contained only one method for all read operations (and one more for writes). This change would also realize the vision expressed at the 2003 Kernel Summit that all I/O paths inside the kernel would, in the end, be made asynchronous.

There has been little discussion of the current patch set, so it is hard to predict what may ultimately become of it. Given that it simplifies a core kernel API while simultaneously making it more powerful, however, chances are that some version of this patch will find its way into the kernel eventually.

(For more information on the AIO interface, see this Driver Porting Series article or chapter 15 of LDD3).

 
