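The page cache is explained in detail in Section 5.6, "Writing and Reading Files", of 《Linux內核情景分析》 (Linux Kernel Scenario Analysis); the key points are excerpted here.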
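The filesystem layer has three main data structures: the file structure, the dentry structure, and the inode structure.
file structure: represents one context on the target file. Different processes can establish different contexts on the same file, and a single process can also establish several contexts by opening the same file more than once. The buffer queue therefore cannot be attached to the file structure, because these file structures are not shared with one another.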
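dentry structure: this is the file-name structure. Through hard links, several dentry structures can correspond to one file (a symbolic link, by contrast, is a separate file with its own inode). dentry structures and files are therefore not in a one-to-one relationship either, so the buffer queue cannot be attached to this structure.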
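inode structure: that leaves only the inode structure. An inode and a file are in a one-to-one relationship; one can say that the inode is the file. The inode structure has an i_mapping pointer that points to an address_space structure, which is normally inode->i_data, and the buffer queue lives in that structure.
What hangs on the buffer queue is memory pages, not record blocks. So when a process calls mmap() to map a file into its user space, it only needs to set up the corresponding page-table entries, and the cached pages are mapped naturally into the process's user space. That is why the field is named i_mapping.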
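One way to observe that mmap() and read() are served from the same cached pages is the user-space sketch below. It assumes a Linux system with a writable /tmp; the function name mmap_store_visible_to_read and the temp-file template are made up for illustration. POSIX by itself does not promise this visibility without msync(), so treat the result as a property of Linux's shared page cache rather than a portable guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a store made through a MAP_SHARED mapping is seen by a
 * following pread() on the same file, 0 if it is not, -1 on setup failure.
 */
int mmap_store_visible_to_read(void)
{
    char path[] = "/tmp/pagecache-demo-XXXXXX";   /* hypothetical template */
    char buf[6] = {0};
    char *map;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                      /* the file lives on until fd is closed */

    if (write(fd, "hello", 5) != 5)    /* populates the file's cached page */
        goto out;

    map = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    map[0] = 'J';                      /* the store lands in the cached page */

    if (pread(fd, buf, 5, 0) == 5)     /* read() path looks at the same page */
        result = (memcmp(buf, "Jello", 5) == 0);

    munmap(map, 5);
out:
    close(fd);
    return result;
}
```

A caller could check the behaviour with assert(mmap_store_visible_to_read() == 1). No msync() is issued between the store and the read, which is the point of the experiment.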
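The radix tree concept is also needed here. See the figure first (taken from 《深入linux內核架構》, Professional Linux Kernel Architecture).
A radix tree is not a balanced tree. The tree is built from two different data structures: the root and the non-leaf nodes. The root is a simple structure that holds the height of the tree and a pointer to the first node that makes up the tree. A node is essentially an array: count is the number of pointers in use in that node, and the other entries are pointers to nodes of the next level. The leaves are pointers to page instances.
Nodes also carry search tags, such as the dirty tag and the writeback tag, so that the parts of the tree holding tagged pages can be located quickly.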
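The description above (a root with a height and a first-node pointer, array-like nodes with a count, pages at the leaves, and per-node search tags) can be sketched as a small user-space model. This is only an illustration under stated assumptions: the names rnode, rroot, rtree_insert, rtree_lookup, rtree_tag_dirty and rtree_any_dirty are hypothetical, the fan-out of 64 slots is assumed, only one tag is modelled, and nodes are never freed. The kernel's own radix tree differs in many details.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_SHIFT 6                     /* 64 slots per node (assumed) */
#define MAP_SIZE  (1UL << MAP_SHIFT)

struct rnode {
    unsigned int count;                 /* number of slots in use */
    uint64_t dirty;                     /* one "dirty" search tag bit per slot */
    void *slots[MAP_SIZE];              /* child nodes, or pages at level 1 */
};

struct rroot {
    unsigned int height;                /* levels in the tree; 0 means empty */
    struct rnode *rnode;                /* first node of the tree */
};

/* Largest index that a tree of the given height can hold. */
static unsigned long max_index(unsigned int height)
{
    unsigned int bits = height * MAP_SHIFT;
    return bits >= sizeof(unsigned long) * 8 ? ~0UL : (1UL << bits) - 1;
}

/* Slot offset of index at the given level (level 1 is the lowest). */
static unsigned int slot_of(unsigned long index, unsigned int level)
{
    return (index >> ((level - 1) * MAP_SHIFT)) & (MAP_SIZE - 1);
}

int rtree_insert(struct rroot *root, unsigned long index, void *page)
{
    /* Grow upward until the index fits under the root. */
    while (root->height == 0 || index > max_index(root->height)) {
        struct rnode *top = calloc(1, sizeof(*top));
        if (!top)
            return -1;
        if (root->rnode) {
            top->slots[0] = root->rnode;
            top->count = 1;
            top->dirty = root->rnode->dirty ? 1 : 0;  /* keep tags consistent */
        }
        root->rnode = top;
        root->height++;
    }
    struct rnode *node = root->rnode;
    for (unsigned int level = root->height; level > 1; level--) {
        void **slot = &node->slots[slot_of(index, level)];
        if (!*slot) {
            if (!(*slot = calloc(1, sizeof(*node))))
                return -1;
            node->count++;
        }
        node = *slot;
    }
    void **slot = &node->slots[slot_of(index, 1)];
    if (*slot)
        return -1;                      /* index already occupied */
    *slot = page;
    node->count++;
    return 0;
}

void *rtree_lookup(const struct rroot *root, unsigned long index)
{
    if (root->height == 0 || index > max_index(root->height))
        return NULL;
    void *p = root->rnode;
    for (unsigned int level = root->height; level > 0 && p; level--)
        p = ((struct rnode *)p)->slots[slot_of(index, level)];
    return p;
}

/* Tag the page at index dirty, marking every node on the path to it. */
int rtree_tag_dirty(struct rroot *root, unsigned long index)
{
    if (!rtree_lookup(root, index))
        return -1;
    struct rnode *node = root->rnode;
    for (unsigned int level = root->height; level > 0; level--) {
        unsigned int off = slot_of(index, level);
        node->dirty |= (uint64_t)1 << off;
        if (level > 1)
            node = node->slots[off];
    }
    return 0;
}

/* Because tags propagate upward, the root alone answers this question. */
int rtree_any_dirty(const struct rroot *root)
{
    return root->rnode && root->rnode->dirty != 0;
}
```

Starting from struct rroot root = {0, NULL}, inserting indices 3, 70 and 5000 grows the height to 1, 2 and 3 in turn. After rtree_tag_dirty(&root, 70), the tag bit for slot 0 is set in the top node, so a search for dirty pages can skip every subtree whose slot is untagged.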
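Block buffers
A block buffer consists structurally of two parts:
1. The buffer head: holds all management data related to the buffer's state, such as the block number, the block size, and the usage counter. The data itself is not stored directly after the buffer head; it sits in a separate area of physical memory that a pointer in the buffer head refers to.
2. The useful data is kept in specially allocated pages, which may also reside in the page cache at the same time.
Buffer head: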
/*
* Historically, a buffer_head was used to map a single block
* within a page, and of course as the unit of I/O through the
* filesystem and block layers. Nowadays the basic I/O unit
* is the bio, and buffer_heads are used for extracting block
* mappings (via a get_block_t call), for tracking state within
* a page (via a page_mapping) and for wrapping bio submission
* for backward compatibility reasons (e.g. submit_bh).
*/
struct buffer_head {
    unsigned long b_state;              /* buffer state bitmap (see bh_state_bits below) */
    struct buffer_head *b_this_page;    /* circular list of page's buffers: next buffer head */
    struct page *b_page;                /* the page this bh is mapped to (its page descriptor) */
    sector_t b_blocknr;                 /* start block number: logical block number on the block device */
    size_t b_size;                      /* size of mapping: the block size */
    char *b_data;                       /* pointer to data within the page */
    struct block_device *b_bdev;        /* the block device descriptor */
    bh_end_io_t *b_end_io;              /* I/O completion callback */
    void *b_private;                    /* reserved for b_end_io: data argument of the callback */
    struct list_head b_assoc_buffers;   /* associated with another mapping */
    struct address_space *b_assoc_map;  /* mapping this buffer is associated with */
    atomic_t b_count;                   /* users using this buffer_head (usage counter) */
};
Common buffer head state flags:
enum bh_state_bits {
    BH_Uptodate,        /* Contains valid data */
    BH_Dirty,           /* Is dirty */
    BH_Lock,            /* Is locked */
    BH_Req,             /* Has been submitted for I/O */
    BH_Uptodate_Lock,   /* Used by the first bh in a page, to serialise
                         * IO completion of other buffers in the page
                         */
    BH_Mapped,          /* Has a disk mapping: b_bdev and b_blocknr are valid */
    BH_New,             /* Disk mapping was newly created by get_block; not accessed yet */
    BH_Async_Read,      /* Is under end_buffer_async_read I/O */
    BH_Async_Write,     /* Is under end_buffer_async_write I/O */
    BH_Delay,           /* Buffer is not yet allocated on disk */
    BH_Boundary,        /* Block is followed by a discontiguity */
    BH_Write_EIO,       /* I/O error on write */
    BH_Unwritten,       /* Buffer is allocated on disk but not written */
    BH_Quiet,           /* Buffer Error Prinks to be quiet */
    BH_Meta,            /* Buffer contains metadata */
    BH_Prio,            /* Buffer should be submitted with REQ_PRIO */
    BH_PrivateStart,    /* not a state bit, but the first bit available
                         * for private allocation by other entities
                         */
};
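The enum values above are bit positions inside b_state, not masks. A minimal user-space sketch of how such flags are set and tested follows. It is loosely modelled on the kernel's BUFFER_FNS() helper macro, but the struct bh_state_model type and the BH_MODEL_FNS macro are hypothetical stand-ins, only the first four flags are included, and the operations here are not atomic.

```c
#include <assert.h>

/* First four flags only; each value is a bit position in b_state. */
enum bh_state_bits { BH_Uptodate, BH_Dirty, BH_Lock, BH_Req };

/* Hypothetical stand-in for the b_state field of struct buffer_head. */
struct bh_state_model {
    unsigned long b_state;
};

/* Generate set/clear/test helpers for one flag (non-atomic sketch). */
#define BH_MODEL_FNS(bit, name)                                          \
static inline void set_buffer_##name(struct bh_state_model *bh)         \
{ bh->b_state |= 1UL << BH_##bit; }                                      \
static inline void clear_buffer_##name(struct bh_state_model *bh)       \
{ bh->b_state &= ~(1UL << BH_##bit); }                                   \
static inline int buffer_##name(const struct bh_state_model *bh)        \
{ return (int)((bh->b_state >> BH_##bit) & 1); }

BH_MODEL_FNS(Uptodate, uptodate)
BH_MODEL_FNS(Dirty, dirty)
BH_MODEL_FNS(Lock, locked)
```

With a zeroed struct bh_state_model, calling set_buffer_uptodate() and set_buffer_dirty() leaves b_state equal to 3 (bits 0 and 1), and clear_buffer_dirty() brings it back to 1.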
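If a page is used as a buffer page, all buffer heads belonging to its block buffers are collected in a singly linked circular list. The private field of the buffer page's descriptor points to the buffer head of the first block in the page; each buffer head's b_this_page field points to the next buffer head in the list, and each buffer head's b_page field points to the descriptor of the buffer page it belongs to.
As the figure above shows, one buffer page corresponds to 4 buffers (for example, a 4 KiB page split into 1 KiB blocks), and this is how the page cache and the buffer cache are unified. A change made through a buffer or through the buffer page is visible through the other.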
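The linkage between a buffer page and its buffer heads can be sketched in user space as follows. This is a simplified model, not kernel code: page_model, bh_model, attach_buffers and count_buffers are hypothetical names, all heads come from a single calloc() call, and the blocks are assumed to be contiguous on disk starting at first_block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct page_model;

struct bh_model {
    struct bh_model *b_this_page;   /* next buffer head of the same page */
    struct page_model *b_page;      /* page this buffer belongs to */
    unsigned long b_blocknr;        /* block number on the device */
    size_t b_size;                  /* block size */
    char *b_data;                   /* where the block sits inside the page */
};

struct page_model {
    struct bh_model *private_bh;    /* stands in for the page's private field */
    char *data;                     /* the page contents */
};

/* Split one page into page_size / block_size buffers and link the heads. */
int attach_buffers(struct page_model *page, size_t page_size,
                   size_t block_size, unsigned long first_block)
{
    assert(block_size > 0 && page_size % block_size == 0);
    size_t n = page_size / block_size;
    struct bh_model *heads = calloc(n, sizeof(*heads));

    if (!heads)
        return -1;
    for (size_t i = 0; i < n; i++) {
        heads[i].b_page = page;
        heads[i].b_blocknr = first_block + i;       /* assumed contiguous */
        heads[i].b_size = block_size;
        heads[i].b_data = page->data + i * block_size;
        heads[i].b_this_page = &heads[(i + 1) % n]; /* last wraps to first */
    }
    page->private_bh = heads;
    return 0;
}

/* Walk the circular list once, starting and ending at the first head. */
size_t count_buffers(const struct page_model *page)
{
    const struct bh_model *head = page->private_bh;
    const struct bh_model *bh = head;
    size_t n = 0;

    if (!head)
        return 0;
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != head);
    return n;
}
```

For a 4096-byte page and 1024-byte blocks, attach_buffers() creates four heads and count_buffers() returns 4. Because every b_data pointer refers into the page's own memory, writing through a buffer changes the page contents and the reverse.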
address_space structure:
struct address_space {
    struct inode *host;                 /* owner: inode, block_device (the inode of the host file) */
    struct radix_tree_root page_tree;   /* radix tree of all pages (the tree root) */
    spinlock_t tree_lock;               /* and lock protecting it */
    unsigned int i_mmap_writable;       /* count VM_SHARED mappings */
    struct rb_root i_mmap;              /* tree of private and shared mappings */
    struct list_head i_mmap_nonlinear;  /* list VM_NONLINEAR mappings */
    struct mutex i_mmap_mutex;          /* protect tree, count, list */
    /* Protected by tree_lock together with the radix tree */
    unsigned long nrpages;              /* number of total pages */
    pgoff_t writeback_index;            /* writeback starts here */
    const struct address_space_operations *a_ops;   /* methods (table of function pointers) */
    unsigned long flags;                /* error bits/gfp mask */
    struct backing_dev_info *backing_dev_info;  /* device readahead, etc */
    spinlock_t private_lock;            /* for use by the address_space */
    struct list_head private_list;      /* ditto */
    void *private_data;                 /* ditto */
} __attribute__((aligned(sizeof(long))));
The members struct inode *host and struct radix_tree_root page_tree tie the file to its memory pages.
struct address_space_operations {
    /* Write: from the page to the owner's disk image */
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    /* Read: from the owner's disk image into the page */
    int (*readpage)(struct file *, struct page *);

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *);

    /* Set a page dirty.  Return true if this dirtied it */
    int (*set_page_dirty)(struct page *page);

    /* Read a list of the owner's pages from disk */
    int (*readpages)(struct file *filp, struct address_space *mapping,
            struct list_head *pages, unsigned nr_pages);

    int (*write_begin)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned flags,
            struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
            loff_t pos, unsigned len, unsigned copied,
            struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage) (struct page *, unsigned long);
    int (*releasepage) (struct page *, gfp_t);
    void (*freepage)(struct page *);
    ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
            loff_t offset, unsigned long nr_segs);
    int (*get_xip_mem)(struct address_space *, pgoff_t, int,
            void **, unsigned long *);
    /*
     * migrate the contents of a page to the specified target. If sync
     * is false, it must not block.
     */
    int (*migratepage) (struct address_space *,
            struct page *, struct page *, enum migrate_mode);
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
            unsigned long);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
            sector_t *span);
    void (*swap_deactivate)(struct file *file);
};
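The role of the a_ops table is that generic cache code can fill or flush a page without knowing which filesystem backs it. The sketch below models that dispatch in user space with a toy backing store. All names (page_m, aspace_m, aspace_ops_m, ram_readpage, ram_writepage, generic_fill, generic_flush) and the 16-byte page size are assumptions for illustration; the signatures are simplified and do not match the kernel's.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SZ 16                      /* tiny page size, for the demo only */

struct page_m {
    unsigned long index;                /* page offset within the file */
    int uptodate;                       /* page holds valid data */
    int dirty;                          /* page differs from backing store */
    char data[PAGE_SZ];
};

struct aspace_m;

struct aspace_ops_m {
    int (*readpage)(struct aspace_m *, struct page_m *);
    int (*writepage)(struct aspace_m *, struct page_m *);
};

struct aspace_m {
    void *host;                         /* owner; here a flat byte array */
    const struct aspace_ops_m *a_ops;   /* methods supplied by the "filesystem" */
};

/* Toy filesystem: the backing store is plain memory. */
static int ram_readpage(struct aspace_m *m, struct page_m *p)
{
    memcpy(p->data, (char *)m->host + p->index * PAGE_SZ, PAGE_SZ);
    p->uptodate = 1;
    return 0;
}

static int ram_writepage(struct aspace_m *m, struct page_m *p)
{
    memcpy((char *)m->host + p->index * PAGE_SZ, p->data, PAGE_SZ);
    p->dirty = 0;
    return 0;
}

const struct aspace_ops_m ram_aops = { ram_readpage, ram_writepage };

/* Generic layer: it only ever goes through a_ops. */
int generic_fill(struct aspace_m *m, struct page_m *p)
{
    return p->uptodate ? 0 : m->a_ops->readpage(m, p);
}

int generic_flush(struct aspace_m *m, struct page_m *p)
{
    return p->dirty ? m->a_ops->writepage(m, p) : 0;
}
```

Swapping ram_aops for another table changes where pages come from and go to while generic_fill() and generic_flush() stay the same, which is the design idea behind the real operations table.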