Reading the Linux Kernel Source: flashcache, Facebook's Disk Accelerator, Part 4

This section covers flashcache's read/write entry point and the basic building blocks of its read and write paths.

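Whether you are reading a module or a program, the first step is to find the entry point. In user-space code you usually start from main; in the kernel you start from module_init. Tracing an IO flow is no different. flashcache is a dm_target, so its entry point is the map function of struct target_type, which here is flashcache_map: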

1581/*
1582 * Decide the mapping and perform necessary cache operations for a bio request.
1583 */
1584int 
1585flashcache_map(struct dm_target *ti, struct bio *bio,
1586	       union map_info *map_context)
1587{
1588	struct cache_c *dmc = (struct cache_c *) ti->private;
1589	int sectors = to_sector(bio->bi_size);
1590	int queued;
1591	
1592	if (sectors <= 32)
1593		size_hist[sectors]++;
1594
1595	if (bio_barrier(bio))
1596		return -EOPNOTSUPP;
1597
1598	VERIFY(to_sector(bio->bi_size) <= dmc->block_size);
1599
1600	if (bio_data_dir(bio) == READ)
1601		dmc->reads++;
1602	else
1603		dmc->writes++;
1604
1605	spin_lock_irq(&dmc->cache_spin_lock);
1606	if (unlikely(sysctl_pid_do_expiry && 
1607		     (dmc->whitelist_head || dmc->blacklist_head)))
1608		flashcache_pid_expiry_all_locked(dmc);
1609	if ((to_sector(bio->bi_size) != dmc->block_size) ||
1610	    (bio_data_dir(bio) == WRITE && flashcache_uncacheable(dmc))) {
1611		queued = flashcache_inval_blocks(dmc, bio);
1612		spin_unlock_irq(&dmc->cache_spin_lock);
1613		if (queued) {
1614			if (unlikely(queued < 0))
1615				flashcache_bio_endio(bio, -EIO);
1616		} else {
1617			/* Start uncached IO */
1618			flashcache_start_uncached_io(dmc, bio);
1619		}
1620	} else {
1621		spin_unlock_irq(&dmc->cache_spin_lock);		
1622		if (bio_data_dir(bio) == READ)
1623			flashcache_read(dmc, bio);
1624		else
1625			flashcache_write(dmc, bio);
1626	}
1627	return DM_MAPIO_SUBMITTED;
1628}

Line 1588 reads dmc = ti->private. When was this pointer stored? Look at the constructor, flashcache_ctr:

1350     ti->split_io = dmc->block_size;
1351     ti->private = dmc;

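Line 1351 assigns private. There is a bonus on line 1350: it tells the dm layer to split IO into pieces of at most the given size before dispatching them to the dm_target device. That is why line 1609 of flashcache_map checks whether bio->bi_size equals block_size. Lines 1606 and 1610 belong to the pid whitelist/blacklist management, which controls which processes or groups bypass flashcache. We will skip it for now; if you are interested, look at flashcache_ioctl.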

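Why is a bio whose size is not block_size sent straight to disk? Because flashcache only caches data in units of block_size. Since ti->split_io is set to block_size, no bio that reaches flashcache_map is larger than block_size: the dm layer splits larger bios into pieces of at most block_size before passing them down. So what is wrong with caching small blocks? For efficiency, flashcache writes to disk in block_size units. If a smaller piece were cached, it would have to be padded out to a full block_size before it could be written back, and the missing part could only come from reading the disk. flashcache is therefore selective about what it caches, which in turn means the workload above it should not consist of small IOs; otherwise flashcache sends them straight to disk and provides no caching benefit.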
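The effect of split_io can be illustrated with a small user-space model. This is only a sketch, not the dm layer's code: it assumes block_size is a power of two and that splitting happens at block_size-aligned boundaries, and the names next_chunk_len, split_request and split_stats are hypothetical. It counts how many pieces of a request arrive as full blocks (cacheable) and how many as partial pieces (sent uncached).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: length in sectors of the next piece of a request,
 * cut so that it never crosses a block_size-aligned boundary.
 * Assumes block_size is a power of two. */
static uint64_t next_chunk_len(uint64_t sector, uint64_t remaining,
                               uint64_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    /* Distance from 'sector' to the next aligned boundary. */
    uint64_t boundary = ((sector + block_size) & ~(block_size - 1)) - sector;
    return remaining < boundary ? remaining : boundary;
}

struct split_stats {
    unsigned full_blocks;     /* pieces of exactly block_size: cacheable */
    unsigned partial_chunks;  /* smaller pieces: would go uncached */
};

/* Walk a request of 'len' sectors starting at 'sector' and classify
 * every piece the model would hand to the map function. */
static struct split_stats split_request(uint64_t sector, uint64_t len,
                                        uint64_t block_size)
{
    struct split_stats st = { 0, 0 };

    while (len > 0) {
        uint64_t n = next_chunk_len(sector, len, block_size);

        if (n == block_size)
            st.full_blocks++;
        else
            st.partial_chunks++;
        sector += n;
        len -= n;
    }
    return st;
}
```

With a block_size of 8 sectors, an aligned 24-sector request yields three full blocks, while a 16-sector request starting at sector 4 yields only one full block plus two partial pieces, which is why unaligned or small IO gets little benefit from the cache.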

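For a small bio, line 1611 calls flashcache_inval_blocks to mark every cache block that overlaps the bio as INVALID, since those blocks will no longer hold the latest data. Unfortunately, even invalidating a cache block can fail. Intuitively, shouldn't it just be a matter of setting a flag? As Murphy's law reminds us, we tend to be too optimistic. We will not go into the error handling here, because it makes little sense to discuss exceptions before the normal flow is understood.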
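The overlap test behind this invalidation can be sketched in user space. This is not flashcache_inval_blocks itself: the real function works on the hashed cache sets and has to cope with busy blocks, whereas the sketch below simply scans a flat array. The names cache_block, block_state and invalidate_overlapping are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum block_state { BLK_INVALID = 0, BLK_VALID = 1 };

/* Hypothetical, simplified cache block descriptor. */
struct cache_block {
    uint64_t dbn;            /* first disk sector cached by this block */
    enum block_state state;
};

/* Mark every valid block that overlaps [io_start, io_start + io_len)
 * as invalid and return how many blocks were invalidated. */
static size_t invalidate_overlapping(struct cache_block *blocks,
                                     size_t nr_blocks, uint64_t block_size,
                                     uint64_t io_start, uint64_t io_len)
{
    size_t invalidated = 0;
    uint64_t io_end = io_start + io_len;

    assert(block_size > 0);
    for (size_t i = 0; i < nr_blocks; i++) {
        struct cache_block *b = &blocks[i];

        if (b->state != BLK_VALID)
            continue;
        /* Two half-open ranges overlap when each starts before the
         * other one ends. */
        if (b->dbn < io_end && io_start < b->dbn + block_size) {
            b->state = BLK_INVALID;
            invalidated++;
        }
    }
    return invalidated;
}
```

Because a bio is never larger than block_size, it can overlap at most two aligned cache blocks, for example a 4-sector IO starting at sector 6 with 8-sector blocks touches the blocks at sectors 0 and 8.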

That quickly gives us the entry points for the normal flow: line 1623 is the read entry and line 1625 is the write entry.
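The branching in flashcache_map (lines 1609 to 1625) can be condensed into a small decision function. This is a user-space sketch for illustration only; the enum route and the function route_bio are hypothetical names, and the uncacheable flag stands in for the result of flashcache_uncacheable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three paths a bio can take. */
enum route { ROUTE_UNCACHED, ROUTE_CACHED_READ, ROUTE_CACHED_WRITE };

static enum route route_bio(uint64_t sectors, uint64_t block_size,
                            bool is_write, bool uncacheable)
{
    /* Mirrors the VERIFY on line 1598: dm never hands us more than
     * one block at a time. */
    assert(sectors <= block_size);

    /* Partial blocks, and writes from uncacheable processes, bypass the
     * cache. In the real code overlapping cache blocks are invalidated
     * first. */
    if (sectors != block_size || (is_write && uncacheable))
        return ROUTE_UNCACHED;

    return is_write ? ROUTE_CACHED_WRITE : ROUTE_CACHED_READ;
}
```

Note that, as in the original condition, the uncacheable check only applies to writes: a full-block read still goes through the cached read path.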

Before looking at the read and write implementations, let's go over how flashcache reads and writes the disks.

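Disk IO in flashcache falls into two categories:

1) Transfers between a disk and memory

2) Transfers between one disk and another

A read miss reads directly from the disk, which is case 1. What about a read hit? That is also case 1, except that the read comes from the SSD. Disk-to-disk transfers are used for writing back dirty data, copying dirty cache blocks from the SSD to the disk. The interface functions for both cases are introduced next, so that when they show up in the read and write paths they will look familiar and it will be clear where the data is flowing from and to.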

The first case is implemented by flashcache_dm_io_sync_vm:

571int
572flashcache_dm_io_sync_vm(struct cache_c *dmc, struct dm_io_region *where, int rw, void *data)
573{
574	unsigned long error_bits = 0;
575	int error;
576	struct dm_io_request io_req = {
577		.bi_rw = rw,
578		.mem.type = DM_IO_VMA,
579		.mem.ptr.vma = data,
580		.mem.offset = 0,
581		.notify.fn = NULL,
582		.client = dmc->io_client,
583	};
584
585	error = dm_io(&io_req, 1, where, &error_bits);
586	if (error)
587		return error;
588	if (error_bits)
589		return error_bits;
590	return 0;
591}

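Here we only care about how dm_io is used, not how it is implemented, since that belongs to the dm layer.

dmc is flashcache's in-memory management structure.

where is the target device region to read or write.

rw is the direction, read or write.

data is the corresponding memory address.

Take reading the flash_superblock in flashcache_md_create as an example: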

	header = (struct flash_superblock *)vmalloc(512);
	if (!header) {
		DMERR("flashcache_md_create: Unable to allocate sector");
		return 1;
	}
	where.bdev = dmc->cache_dev->bdev;
	where.sector = 0;
	where.count = 1;
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
	error = flashcache_dm_io_sync_vm(&where, READ, header);
#else
	error = flashcache_dm_io_sync_vm(dmc, &where, READ, header);
#endif

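The first argument is dmc. The second sets the device to the SSD, i.e. cache_dev->bdev, starting at sector 0 and one sector long. The operation is READ and the destination buffer is header. Because line 581 of flashcache_dm_io_sync_vm sets notify.fn = NULL, the call is synchronous.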

Now for the second category, disk-to-disk transfers. Here is the function prototype:

int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
unsigned num_dests, struct dm_io_region *dests,
unsigned flags, dm_kcopyd_notify_fn fn, void *context);

The first parameter is a dm_kcopyd_client. Before using the asynchronous kcopyd copy service you must create a client, which is done in flashcache_ctr:

     r = dm_kcopyd_client_create(FLASHCACHE_COPY_PAGES, &dmc->kcp_client);
     if (r) {
          ti->error = "Failed to initialize kcopyd client\n";
          dm_io_client_destroy(dmc->io_client);
          goto bad3;
     }

The second parameter, a dm_io_region, is the source; the fourth parameter is the destination. The structure is defined as follows:

struct dm_io_region {
     struct block_device *bdev;
     sector_t sector;
     sector_t count;          /* If this is zero the region is ignored. */
};

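dm_kcopyd_notify_fn fn is the callback invoked once kcopyd has finished the request.

context is the argument passed to the callback; flashcache always sets it to the corresponding kcached_job.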

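To sum up, the two kinds of functions are essentially the same: the caller fills in the source and the destination, which may be in memory or on a device, calls the function, and then waits for completion (through the callback in the asynchronous case). It is like shopping online: you log in with your account (the dm client), fill in the order (dm_io_region), and do not care about manufacturing or shipping. All you care about is that when the doorbell rings (dm_kcopyd_notify_fn), what you ordered is at the door.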
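The "fill in the regions, call, wait for the callback" pattern can be imitated in user space. This is a sketch under loud assumptions: the "devices" are plain memory buffers, the copy runs immediately in the caller's context (real kcopyd completes asynchronously), and mem_dev, region, copy_region, job and job_done are all hypothetical names. Only the callback's argument list (read error, write error, context) is modelled on dm_kcopyd_notify_fn.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical in-memory stand-in for a block device. */
struct mem_dev {
    unsigned char *data;
    size_t nr_sectors;
};

/* Hypothetical analogue of dm_io_region: device, start sector, length. */
struct region {
    struct mem_dev *dev;
    size_t sector;
    size_t count;
};

/* Completion callback, shaped like dm_kcopyd_notify_fn. */
typedef void (*copy_notify_fn)(int read_err, unsigned long write_err,
                               void *context);

/* Copy 'from' to 'to', then report the outcome through 'fn'.
 * Returns -1 without calling 'fn' if the request is malformed. */
static int copy_region(const struct region *from, const struct region *to,
                       copy_notify_fn fn, void *context)
{
    int read_err = 0;
    unsigned long write_err = 0;

    if (from->count != to->count)
        return -1;
    if (from->sector + from->count > from->dev->nr_sectors)
        read_err = 1;
    if (to->sector + to->count > to->dev->nr_sectors)
        write_err = 1;
    if (!read_err && !write_err)
        memcpy(to->dev->data + to->sector * SECTOR_SIZE,
               from->dev->data + from->sector * SECTOR_SIZE,
               from->count * SECTOR_SIZE);
    if (fn)
        fn(read_err, write_err, context);
    return 0;
}

/* Hypothetical per-request state, playing the role of kcached_job. */
struct job {
    int done;
    int failed;
};

static void job_done(int read_err, unsigned long write_err, void *context)
{
    struct job *j = context;

    j->done = 1;
    j->failed = (read_err != 0 || write_err != 0);
}
```

The caller only prepares the two regions and a job, passes job_done with the job as context, and later inspects the job. Everything about how the data actually moves stays hidden behind copy_region, which is the point of the shopping analogy.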
