Source Code Analysis of the writeback Mechanism

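Original work. Reposting is permitted, provided that the original source, the author information and this notice are indicated by hyperlink; otherwise legal action will be pursued. http://alanwu.blog.51cto.com/3652632/1110046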

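writeback-related data structures

The main data structures involved in writeback are:

1. backing_dev_info: describes everything about a backing device. A block device's request queue normally contains a backing_dev_info object.

2. bdi_writeback: encapsulates the writeback kernel thread and the queues of inodes it has to process.

3. wb_writeback_work: encapsulates a single writeback work item.

The relationships among these data structures are shown in the figure below:

Each data structure is briefly introduced below.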

bdi information

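A bdi object must be registered on the system-wide bdi list when a block device is added. For ext3, the bdi object of the underlying block device has to be associated with the ext3 root_inode at mount time. The bdi structure is defined as follows: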

struct backing_dev_info {
    struct list_head bdi_list;
    unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
    unsigned long state; /* Always use atomic bitops on this */
    unsigned int capabilities; /* Device capabilities */
    congested_fn *congested_fn; /* Function pointer if device is md/dm */
    void *congested_data; /* Pointer to aux data for congested func */
    char *name;
    struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
    unsigned long bw_time_stamp; /* last time write bw is updated */
    unsigned long dirtied_stamp;
    unsigned long written_stamp; /* pages written at bw_time_stamp */
    unsigned long write_bandwidth; /* the estimated write bandwidth */
    unsigned long avg_write_bandwidth; /* further smoothed write bw */
    /*
     * The base dirty throttle rate, re-calculated on every 200ms.
     * All the bdi tasks' dirty rate will be curbed under it.
     * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
     * in small steps and is much more smooth/stable than the latter.
     */
    unsigned long dirty_ratelimit;
    unsigned long balanced_dirty_ratelimit;
    struct prop_local_percpu completions;
    int dirty_exceeded;
    unsigned int min_ratio;
    unsigned int max_ratio, max_prop_frac;
    struct bdi_writeback wb; /* default writeback info for this bdi (the writeback object) */
    spinlock_t wb_lock; /* protects work_list */
    /* list of pending work items */
    struct list_head work_list;
    struct device *dev;
    /* timer used in laptop mode */
    struct timer_list laptop_mode_wb_timer;
#ifdef CONFIG_DEBUG_FS
    struct dentry *debug_dir;
    struct dentry *debug_stats;
#endif
};

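The bdi structure embeds a writeback object, which describes the writeback kernel thread and encapsulates the queues of inodes to be processed. The bdi structure also carries a work_list, which holds the work items the writeback kernel thread has to process. If there is no work on that list, the writeback kernel thread goes to sleep and waits.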
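To make the work_list behaviour a bit more concrete, here is a minimal user-space sketch of a FIFO of pending work items. The names toy_bdi, toy_work, toy_queue_work and toy_get_next_work are hypothetical stand-ins invented for this illustration, not kernel APIs; the kernel itself uses struct list_head, takes wb_lock around the list, and wakes the thread when work is queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for wb_writeback_work: only what the sketch needs. */
struct toy_work {
    long nr_pages;          /* pages this work item asks to write */
    struct toy_work *next;  /* link in the pending FIFO */
};

/* Hypothetical stand-in for the work_list part of backing_dev_info. */
struct toy_bdi {
    struct toy_work *head;  /* oldest pending work item */
    struct toy_work *tail;  /* newest pending work item */
};

/* Append a work item; the real kernel would also wake the writeback thread. */
void toy_queue_work(struct toy_bdi *bdi, struct toy_work *work)
{
    assert(bdi != NULL && work != NULL);
    work->next = NULL;
    if (bdi->tail)
        bdi->tail->next = work;
    else
        bdi->head = work;
    bdi->tail = work;
}

/* Pop the oldest item, or return NULL, which is the "go to sleep" case. */
struct toy_work *toy_get_next_work(struct toy_bdi *bdi)
{
    struct toy_work *work = bdi->head;

    if (!work)
        return NULL;
    bdi->head = work->next;
    if (!bdi->head)
        bdi->tail = NULL;
    work->next = NULL;
    return work;
}
```

A NULL return from toy_get_next_work corresponds to the case in which the writeback thread has nothing to do and can sleep.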

writeback

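The writeback object encapsulates the kernel thread (task) and the queues of inodes to be processed. When the page cache/buffer cache needs to flush the dirty pages that an inode tracks in its radix tree, the inode can be linked onto the b_dirty list of the writeback object and the writeback thread woken up. During processing, the inode is moved to the b_io list. Using several lists reduces the amount of shared state that multiple threads contend on. The writeback structure is defined as follows: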

struct bdi_writeback {
    struct backing_dev_info *bdi; /* our parent bdi */
    unsigned int nr;
    unsigned long last_old_flush; /* last old data flush */
    unsigned long last_active; /* last time bdi thread was active */
    struct task_struct *task; /* writeback thread */
    struct timer_list wakeup_timer; /* used for delayed bdi thread wakeup */
    struct list_head b_dirty; /* dirty inodes */
    struct list_head b_io; /* parked for writeback */
    struct list_head b_more_io; /* parked for more writeback */
    spinlock_t list_lock; /* protects the b_* lists */
};
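The way inodes travel between the three lists can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel's code: toy_inode, toy_list, toy_wb and toy_queue_io are hypothetical names, the lists are plain singly linked FIFOs, b_dirty is assumed to be ordered oldest first, and an inode counts as "expired" when its dirtied_when value is not newer than the cutoff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced inode: just an id and a dirtied timestamp. */
struct toy_inode {
    unsigned long ino;
    unsigned long dirtied_when;
    struct toy_inode *next;
};

struct toy_list {
    struct toy_inode *head;
    struct toy_inode *tail;
};

/* The three queues of bdi_writeback, modelled as simple FIFOs. */
struct toy_wb {
    struct toy_list b_dirty;    /* dirty inodes, oldest first (assumption) */
    struct toy_list b_io;       /* inodes selected for this writeback pass */
    struct toy_list b_more_io;  /* inodes that still need another pass */
};

void toy_list_add_tail(struct toy_list *l, struct toy_inode *inode)
{
    inode->next = NULL;
    if (l->tail)
        l->tail->next = inode;
    else
        l->head = inode;
    l->tail = inode;
}

struct toy_inode *toy_list_pop(struct toy_list *l)
{
    struct toy_inode *inode = l->head;

    if (!inode)
        return NULL;
    l->head = inode->next;
    if (!l->head)
        l->tail = NULL;
    inode->next = NULL;
    return inode;
}

/*
 * Refill b_io: first everything parked on b_more_io, then every dirty
 * inode that is old enough. Returns the number of inodes moved.
 */
int toy_queue_io(struct toy_wb *wb, unsigned long older_than)
{
    struct toy_inode *inode;
    int moved = 0;

    assert(wb != NULL);
    while ((inode = toy_list_pop(&wb->b_more_io)) != NULL) {
        toy_list_add_tail(&wb->b_io, inode);
        moved++;
    }
    /* b_dirty is oldest first, so stop at the first inode that is too young. */
    while (wb->b_dirty.head && wb->b_dirty.head->dirtied_when <= older_than) {
        inode = toy_list_pop(&wb->b_dirty);
        toy_list_add_tail(&wb->b_io, inode);
        moved++;
    }
    return moved;
}
```

Keeping the "selected for this pass" set on its own list means that newly dirtied inodes can keep arriving on b_dirty without disturbing the pass that is already in progress.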

writeback work

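The wb_writeback_work structure encapsulates a writeback task; different tasks can use different flushing policies. wb_writeback_work items are what the writeback thread operates on. If the work queue is empty, the kernel thread can go to sleep. The wb_writeback_work structure is defined as follows: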

struct wb_writeback_work {
    long nr_pages;
    struct super_block *sb; /* superblock object */
    unsigned long *older_than_this;
    enum writeback_sync_modes sync_mode;
    unsigned int tagged_writepages:1;
    unsigned int for_kupdate:1;
    unsigned int range_cyclic:1;
    unsigned int for_background:1;
    enum wb_reason reason; /* why was writeback initiated? */
    struct list_head list; /* pending work list, linked into bdi->work_list */
    struct completion *done; /* set if the caller waits; signalled when the work completes */
};

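Analysis of the main writeback functions

The main functions of the writeback mechanism fall into two groups:

1. Managing bdi objects and forking the corresponding writeback kernel threads that flush cached data.

2. The handler function of the writeback kernel thread, which performs the actual flushing of dirty pages.

writeback thread management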

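Linux has a kernel daemon thread that manages the system-wide bdi list and is responsible for creating a writeback thread for each block device. When a bdi has dirty pages but no kernel thread has been allocated for it yet, bdi_forker_thread allocates one; when a writeback thread has been idle for a long time, bdi_forker_thread releases it.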
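The fork/kill policy just described can be condensed into a small pure function. The following is only a user-space sketch: toy_bdi_state, toy_action and toy_forker_decide are hypothetical names, and the timestamps are plain tick counters rather than jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_action { TOY_NO_ACTION, TOY_FORK_THREAD, TOY_KILL_THREAD };

/* Hypothetical snapshot of the per-bdi state the forker thread looks at. */
struct toy_bdi_state {
    bool has_thread;            /* a writeback thread already exists */
    bool has_dirty_io;          /* pending work items or dirty inodes */
    unsigned long last_active;  /* tick at which the thread was last active */
};

/*
 * Decide what the forker should do for one bdi.
 * The unsigned subtraction tolerates counter wrap-around, similar in
 * spirit to the kernel's time_after() helper.
 */
enum toy_action toy_forker_decide(const struct toy_bdi_state *s,
                                  unsigned long now,
                                  unsigned long longest_inactive)
{
    assert(s != NULL);
    if (!s->has_thread && s->has_dirty_io)
        return TOY_FORK_THREAD;
    if (s->has_thread && !s->has_dirty_io &&
        now - s->last_active > longest_inactive)
        return TOY_KILL_THREAD;
    return TOY_NO_ACTION;
}
```

A bdi that has both a thread and dirty data yields TOY_NO_ACTION here, because the existing thread is expected to handle the work on its own.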

The thread-management routine is analyzed below:

static int bdi_forker_thread(void *ptr)
{
    struct bdi_writeback *me = ptr;

    current->flags |= PF_SWAPWRITE;
    set_freezable();
    /*
     * Our parent may run at a different priority, just set us to normal
     */
    set_user_nice(current, 0);

    for (;;) {
        struct task_struct *task = NULL;
        struct backing_dev_info *bdi;
        enum {
            NO_ACTION, /* Nothing to do */
            FORK_THREAD, /* Fork bdi thread */
            KILL_THREAD, /* Kill inactive bdi thread */
        } action = NO_ACTION;

        /*
         * Temporary measure, we want to make sure we don't see
         * dirty data on the default backing_dev_info
         */
        if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
            del_timer(&me->wakeup_timer);
            wb_do_writeback(me, 0);
        }

        spin_lock_bh(&bdi_lock);
        /*
         * In the following loop we are going to check whether we have
         * some work to do without any synchronization with tasks
         * waking us up to do work for them. Set the task state here
         * so that we don't miss wakeups after verifying conditions.
         */
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * Walk all bdi objects and check whether each one has dirty
         * data. If it does, a thread has to be forked for it so that
         * writeback can be performed.
         */
        list_for_each_entry(bdi, &bdi_list, bdi_list) {
            bool have_dirty_io;

            if (!bdi_cap_writeback_dirty(bdi) ||
                bdi_cap_flush_forker(bdi))
                continue;

            WARN(!test_bit(BDI_registered, &bdi->state),
                 "bdi %p/%s is not registered!\n", bdi, bdi->name);

            /* Check whether there is dirty data */
            have_dirty_io = !list_empty(&bdi->work_list) ||
                            wb_has_dirty_io(&bdi->wb);

            /*
             * If the bdi has work to do, but the thread does not
             * exist - create it.
             */
            if (!bdi->wb.task && have_dirty_io) {
                /*
                 * Set the pending bit - if someone will try to
                 * unregister this bdi - it'll wait on this bit.
                 */
                /* Dirty data but no thread yet: FORK a thread next */
                set_bit(BDI_pending, &bdi->state);
                action = FORK_THREAD;
                break;
            }

            spin_lock(&bdi->wb_lock);
            /*
             * If there is no work to do and the bdi thread was
             * inactive long enough - kill it. The wb_lock is taken
             * to make sure no-one adds more work to this bdi and
             * wakes the bdi thread up.
             */
            /*
             * If a bdi has had no dirty data for a long time, KILL
             * the writeback thread that belongs to it.
             */
            if (bdi->wb.task && !have_dirty_io &&
                time_after(jiffies, bdi->wb.last_active +
                           bdi_longest_inactive())) {
                task = bdi->wb.task;
                bdi->wb.task = NULL;
                spin_unlock(&bdi->wb_lock);
                set_bit(BDI_pending, &bdi->state);
                action = KILL_THREAD;
                break;
            }
            spin_unlock(&bdi->wb_lock);
        }
        spin_unlock_bh(&bdi_lock);

        /* Keep working if default bdi still has things to do */
        if (!list_empty(&me->bdi->work_list))
            __set_current_state(TASK_RUNNING);

        /* Carry out the FORK or KILL decided above */
        switch (action) {
        case FORK_THREAD:
            /* FORK a bdi_writeback_thread; the thread is named flush-major:minor */
            __set_current_state(TASK_RUNNING);
            task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                  "flush-%s", dev_name(bdi->dev));
            if (IS_ERR(task)) {
                /*
                 * If thread creation fails, force writeout of
                 * the bdi from the thread. Hopefully 1024 is
                 * large enough for efficient IO.
                 */
                writeback_inodes_wb(&bdi->wb, 1024,
                                    WB_REASON_FORKER_THREAD);
            } else {
                /*
                 * The spinlock makes sure we do not lose
                 * wake-ups when racing with 'bdi_queue_work()'.
                 * And as soon as the bdi thread is visible, we
                 * can start it.
                 */
                spin_lock_bh(&bdi->wb_lock);
                bdi->wb.task = task;
                spin_unlock_bh(&bdi->wb_lock);
                wake_up_process(task);
            }
            bdi_clear_pending(bdi);
            break;

        case KILL_THREAD:
            /* KILL a thread */
            __set_current_state(TASK_RUNNING);
            kthread_stop(task);
            bdi_clear_pending(bdi);
            break;

        case NO_ACTION:
            /* Nothing to do: put this thread to sleep for a while */
            if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
                /*
                 * There are no dirty data. The only thing we
                 * should now care about is checking for
                 * inactive bdi threads and killing them. Thus,
                 * let's sleep for longer time, save energy and
                 * be friendly for battery-driven devices.
                 */
                schedule_timeout(bdi_longest_inactive());
            else
                schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
            try_to_freeze();
            break;
        }
    }

    return 0;
}
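The sleep time chosen in the NO_ACTION branch is worth a quick worked example. dirty_writeback_interval is expressed in centiseconds, which is why the code multiplies it by 10 before calling msecs_to_jiffies(). The sketch below redoes that arithmetic in user space, in milliseconds. All toy_* names are hypothetical, and the five-minute floor on the inactivity timeout is an assumption about what bdi_longest_inactive() returns, since that function is not shown in this article.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for this sketch: idle threads are reaped after at least 5 minutes. */
#define TOY_MIN_INACTIVE_MS (5UL * 60UL * 1000UL)

/* Convert the centisecond interval to milliseconds (the "* 10" in the listing). */
unsigned long toy_interval_ms(unsigned long interval_centisecs)
{
    return interval_centisecs * 10UL;
}

/* Assumed shape of the inactivity timeout: never shorter than the floor. */
unsigned long toy_longest_inactive_ms(unsigned long interval_centisecs)
{
    unsigned long interval = toy_interval_ms(interval_centisecs);

    return interval > TOY_MIN_INACTIVE_MS ? interval : TOY_MIN_INACTIVE_MS;
}

/* Mirrors the NO_ACTION branch: long sleep when idle, short sleep otherwise. */
unsigned long toy_forker_sleep_ms(bool default_bdi_dirty,
                                  unsigned long interval_centisecs)
{
    unsigned long ms;

    if (!default_bdi_dirty || interval_centisecs == 0)
        ms = toy_longest_inactive_ms(interval_centisecs);
    else
        ms = toy_interval_ms(interval_centisecs);
    assert(ms > 0);
    return ms;
}
```

With an interval of 500 centiseconds, the short sleep works out to 5000 ms, while the idle case falls back to the assumed five-minute floor.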

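writeback thread

The writeback thread is created by bdi_forker_thread, and its job is to process pending data-flush tasks. Its thread function is bdi_writeback_thread, which calls wb_do_writeback to do the actual work. That function is analyzed below: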

long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
{
    struct backing_dev_info *bdi = wb->bdi;
    struct wb_writeback_work *work;
    long wrote = 0;

    set_bit(BDI_writeback_running, &wb->bdi->state);

    /* Process pending work; every pending work item sits on bdi->work_list */
    while ((work = get_next_work_item(bdi)) != NULL) {
        /*
         * Override sync mode, in case we must wait for completion
         * because this thread is exiting now.
         */
        if (force_wait)
            work->sync_mode = WB_SYNC_ALL;

        trace_writeback_exec(bdi, work);

        /* Call wb_writeback() to process the corresponding inodes */
        wrote += wb_writeback(wb, work);

        /*
         * Notify the caller of completion if this is a synchronous
         * work item, otherwise just free it.
         */
        /* Tell the upper layer that this work item has completed */
        if (work->done)
            complete(work->done);
        else
            kfree(work);
    }

    /*
     * Check for periodic writeback, kupdated() style
     */
    /*
     * Handle periodic flushing of dirty pages; this is the path the
     * buffer cache takes. The functions below create a work item and
     * call wb_writeback() to process it.
     */
    wrote += wb_check_old_data_flush(wb);
    wrote += wb_check_background_flush(wb);
    clear_bit(BDI_writeback_running, &wb->bdi->state);

    return wrote;
}
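The ownership rule at the heart of this loop (signal the waiter for a synchronous item, free an asynchronous one) can be tried out in user space. The sketch below is an illustration under assumptions rather than kernel code: toy_completion, toy_work, toy_bdi, toy_writeback and toy_do_writeback are hypothetical names, the "completion" is just a flag, and the pretend writer reports every requested page as written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct completion: a flag the waiter can poll. */
struct toy_completion {
    bool done;
};

struct toy_work {
    long nr_pages;
    struct toy_completion *done;  /* non-NULL when a caller is waiting */
    struct toy_work *next;
};

struct toy_bdi {
    struct toy_work *head;        /* pending work, oldest first */
};

/* Pretend writer: reports every requested page as written. */
long toy_writeback(const struct toy_work *work)
{
    return work->nr_pages;
}

/* Drain the pending list, mirroring the while loop of wb_do_writeback(). */
long toy_do_writeback(struct toy_bdi *bdi)
{
    struct toy_work *work;
    long wrote = 0;

    assert(bdi != NULL);
    while ((work = bdi->head) != NULL) {
        bdi->head = work->next;
        wrote += toy_writeback(work);
        if (work->done)
            work->done->done = true;  /* the waiting caller still owns the item */
        else
            free(work);               /* fire-and-forget item was heap allocated */
    }
    return wrote;
}
```

In this model a waiting caller can keep its work item in its own storage, because the worker never frees an item that has a completion attached; only items nobody waits for must come from the heap.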