Reading the Linux Kernel Source: Facebook's Flashcache Disk Accelerator, Part 7


This section covers how the cache writes data back to disk. Allow me to give a grand introduction to the two heroes working behind the scenes:
1724/*
1725 * Sync all dirty blocks. We pick off dirty blocks, sort them, merge them with
1726 * any contigous blocks we can within the set and fire off the writes.
1727 */
1728void
1729flashcache_sync_blocks(struct cache_c *dmc)

Sync all dirty blocks: pick the dirty blocks out of the cache, sort them, merge what can be merged, and issue the writes to disk.
And the second cache-flushing hero:
1004/*
1005 * Clean dirty blocks in this set as needed.
1006 *
1007 * 1) Select the n blocks that we want to clean (choosing whatever policy), sort them.
1008 * 2) Then sweep the entire set looking for other DIRTY blocks that can be tacked onto
1009 * any of these blocks to form larger contigous writes. The idea here is that if you
1010 * are going to do a write anyway, then we might as well opportunistically write out
1011 * any contigous blocks for free (Bob's idea).
1012 */
1013void
1014flashcache_clean_set(struct cache_c *dmc, int set)

Sync the dirty blocks of one set to disk. According to the cache-cleaning policy (FIFO/LRU), it picks a suitable number of dirty blocks and sorts them.

The first function, flashcache_sync_blocks, is mainly called when the flashcache device is about to be removed or when the user flushes the cache manually. It walks through all the cache blocks, notes down every dirty one it sees, and then writes them back set by set.
1724/*
1725 * Sync all dirty blocks. We pick off dirty blocks, sort them, merge them with
1726 * any contigous blocks we can within the set and fire off the writes.
1727 */
1728void
1729flashcache_sync_blocks(struct cache_c *dmc)
1730{
1731     unsigned long flags;
1732     int index;
1733     struct dbn_index_pair *writes_list;
1734     int nr_writes;
1735     int i, set;
1736     struct cacheblock *cacheblk;
1737
1738     /*
1739     * If a (fast) removal of this device is in progress, don't kick off
1740     * any more cleanings. This isn't sufficient though. We still need to
1741     * stop cleanings inside flashcache_dirty_writeback_sync() because we could
1742     * have started a device remove after tested this here.
1743     */
1744     if (atomic_read(&dmc->fast_remove_in_prog) || sysctl_flashcache_stop_sync)
1745          return;
1746     writes_list = kmalloc(dmc->assoc * sizeof(struct dbn_index_pair), GFP_NOIO);
1747     if (writes_list == NULL) {
1748          dmc->memory_alloc_errors++;
1749          return;
1750     }
1751     nr_writes = 0;
1752     set = -1;
1753     spin_lock_irqsave(&dmc->cache_spin_lock, flags);    
1754     index = dmc->sync_index;
1755     while (index < dmc->size &&
1756            (nr_writes + dmc->clean_inprog) < dmc->max_clean_ios_total) {
1757          VERIFY(nr_writes <= dmc->assoc);
1758          if (((index % dmc->assoc) == 0) && (nr_writes > 0)) {
1759               /*
1760               * Crossing a set, sort/merge all the IOs collected so
1761               * far and issue the writes.
1762               */
1763               VERIFY(set != -1);
1764               flashcache_merge_writes(dmc, writes_list, &nr_writes, set);
1765               spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1766               for (i = 0 ; i < nr_writes ; i++)
1767                    flashcache_dirty_writeback_sync(dmc, writes_list[i].index);
1768               nr_writes = 0;
1769               set = -1;
1770               spin_lock_irqsave(&dmc->cache_spin_lock, flags);
1771          }
1772          cacheblk = &dmc->cache[index];
1773          if ((cacheblk->cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
1774               cacheblk->cache_state |= DISKWRITEINPROG;
1775               writes_list[nr_writes].dbn = cacheblk->dbn;
1776               writes_list[nr_writes].index = cacheblk - &dmc->cache[0];
1777               set = index / dmc->assoc;
1778               nr_writes++;
1779          }
1780          index++;
1781     }
1782     dmc->sync_index = index;
1783     if (nr_writes > 0) {
1784          VERIFY(set != -1);
1785          flashcache_merge_writes(dmc, writes_list, &nr_writes, set);
1786          spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1787          for (i = 0 ; i < nr_writes ; i++)
1788               flashcache_dirty_writeback_sync(dmc, writes_list[i].index);
1789     } else
1790          spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1791     kfree(writes_list);
1792}

First, line 1744 checks whether a fast removal of this device is in progress; if so, we return at once without flushing the cache. Next a memory block, writes_list, is allocated for recording dirty blocks. The record type is struct dbn_index_pair, a very simple structure with only two fields:
393struct dbn_index_pair {
394     sector_t     dbn;
395     int          index;
396};

The dbn field records the disk sector the cache block maps to, and serves as the sort key before writing back; the other field, index, is the cache block's subscript in dmc->cache. On to line 1751: nr_writes counts the cache blocks in each batch of disk writes and is initialized to 0. At line 1754, index picks up the scan position from dmc->sync_index, so that the next call to this function resumes flushing from the first cache block not yet scanned.
The while loop at line 1755 exits when either of the following holds:
1) the last cache block has been scanned, or
2) the number of dirty blocks in flight has reached the system-wide limit.
Line 1758 checks whether we have crossed into the next set; if any dirty blocks have been collected by then, they are written out to disk. Then comes sorting and merging of the dirty blocks; on a rotating disk, sorted writes complete noticeably faster.
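Before looking at flashcache_merge_writes itself, it is worth pinning down the set arithmetic that lines 1758 and 1777 rely on: cache blocks are laid out set after set in dmc->cache[], so a block's set is its index divided by the associativity, and an index divisible by the associativity marks the first block of a new set. A minimal sketch (the helper names are mine, not flashcache's):

static inline int index_to_set(struct cache_c *dmc, int index)
{
	return index / dmc->assoc;        /* which set this block lives in */
}

static inline int crossed_set_boundary(struct cache_c *dmc, int index)
{
	return (index % dmc->assoc) == 0; /* first block of a new set */
}

With that in mind, here is flashcache_merge_writes, which does the sorting and merging: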
402void
403flashcache_merge_writes(struct cache_c *dmc, struct dbn_index_pair *writes_list,
404               int *nr_writes, int set)
405{    
406     int start_index = set * dmc->assoc;
407     int end_index = start_index + dmc->assoc;
408     int old_writes = *nr_writes;
409     int new_inserts = 0;
410     struct dbn_index_pair *set_dirty_list = NULL;
411     int ix, nr_set_dirty;
412    
413     if (unlikely(*nr_writes == 0))
414          return;
415     sort(writes_list, *nr_writes, sizeof(struct dbn_index_pair),
416          cmp_dbn, swap_dbn_index_pair);
417     if (sysctl_flashcache_write_merge == 0)
418          return;
419     set_dirty_list = kmalloc(dmc->assoc * sizeof(struct dbn_index_pair), GFP_ATOMIC);
420     if (set_dirty_list == NULL) {
421          dmc->memory_alloc_errors++;
422          goto out;
423     }
424     nr_set_dirty = 0;
425     for (ix = start_index ; ix < end_index ; ix++) {
426          struct cacheblock *cacheblk = &dmc->cache[ix];
427
428          /*
429          * Any DIRTY block in "writes_list" will be marked as
430          * DISKWRITEINPROG already, so we'll skip over those here.
431          */
432          if ((cacheblk->cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
433               set_dirty_list[nr_set_dirty].dbn = cacheblk->dbn;
434               set_dirty_list[nr_set_dirty].index = ix;
435               nr_set_dirty++;
436          }
437     }
438     if (nr_set_dirty == 0)
439          goto out;

Line 415 sorts by dbn as the key. The loop at line 425 then sweeps this set looking for dirty cache blocks, and if there are none it bails out via line 439. At first this puzzled me: before entering this function, hadn't we already scanned all the cache blocks? Why scan again? What's more, the outer scan was done under the spinlock, so surely no new dirty blocks can turn up here? After rereading the context I found the answer. On the first call dmc->sync_index is 0, so flashcache_merge_writes finds no new dirty blocks; the same is true whenever the outer scan started from the first block of a set. But when one pass stops at a block in the middle of a set, the next pass resumes from dmc->sync_index, and entering flashcache_merge_writes then amounts to rescanning the whole set to pick up the dirty blocks that the resumed pass skipped over.
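Back to the sort at line 415 for a moment. The kernel's sort() from <linux/sort.h> takes a comparator and an optional swap callback; the flashcache versions are not quoted above, but they presumably look something like this minimal sketch, ordering records by ascending dbn so the disk sees near-sequential writes:

#include <linux/sort.h>

/* Order write records by ascending disk sector. */
static int cmp_dbn(const void *a, const void *b)
{
	const struct dbn_index_pair *pa = a, *pb = b;

	if (pa->dbn < pb->dbn)
		return -1;
	return (pa->dbn > pb->dbn) ? 1 : 0;
}

/* Plain element swap for sort(); size is sizeof(struct dbn_index_pair). */
static void swap_dbn_index_pair(void *a, void *b, int size)
{
	struct dbn_index_pair tmp = *(struct dbn_index_pair *)a;

	*(struct dbn_index_pair *)a = *(struct dbn_index_pair *)b;
	*(struct dbn_index_pair *)b = tmp;
}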
Since flashcache_merge_writes rescans the set anyway, couldn't flashcache_sync_blocks save itself the effort and just let the former do the scanning? No: flashcache_merge_writes also has a switch, sysctl_flashcache_write_merge, that controls whether dirty blocks are merged at all, and with merging off it does nothing beyond the sort. So flashcache_sync_blocks still has to do its own scan and its own job properly; whether flashcache_merge_writes merges or not is its own affair, and the caller only relies on getting back the final list of dirty blocks to write out.
That brings us back to line 1766 of flashcache_sync_blocks, which writes the dirty blocks back to disk. By now this function looks like small fry to us, but do we truly understand it? Look at the prototype:
1670static void
1671flashcache_dirty_writeback_sync(struct cache_c *dmc, int index)

From the prototype we can infer that for a dirty block, its index alone suffices to derive both the source and the destination address, that is, its location on the SSD and its location on disk. I won't spell out the answer here, since it was already shown in an earlier section and repeating it would waste a chance for independent thought; try reconstructing it yourself before peeking at the sketch below.
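For checking your answer afterwards, here is a sketch of how both addresses fall out of the index. The helper is hypothetical (mine, not flashcache's), and the SSD-offset formula is modelled on the INDEX_TO_CACHE_ADDR macro in flashcache.h, which assumes cache blocks are laid out linearly after md_sectors of on-SSD metadata:

static void dirty_block_addresses(struct cache_c *dmc, int index,
                                  sector_t *ssd_sector, sector_t *disk_sector)
{
	struct cacheblock *cacheblk = &dmc->cache[index];

	/* Source: the block's position on the SSD, past the metadata area. */
	*ssd_sector = ((sector_t)index << dmc->block_shift) + dmc->md_sectors;
	/* Destination: the disk sector recorded when the block was cached. */
	*disk_sector = cacheblk->dbn;
}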
After the while loop exits, line 1782 records the subscript of the last cache block visited in this pass. Next we check whether the loop exited because the number of dirty blocks in flight hit the maximum, in which case some recorded dirty blocks may not have been issued yet: if nr_writes is greater than 0, the remaining dirty blocks still have to be sent down.
That concludes one round of flushing dirty blocks. If the round ended because the in-flight maximum was reached, how does the next round get triggered?
Naturally, it has to wait for the earlier writebacks to finish before this function is called again to issue fresh flushes; working out the exact flow from the code yourself is more satisfying.

Now for flashcache_clean_set. A search through the code turns up plenty of call sites, which boil down to:
1) when no usable cache block can be found
2) when writing to a cache block
3) when a disk write completes
1004/*
1005 * Clean dirty blocks in this set as needed.
1006 *
1007 * 1) Select the n blocks that we want to clean (choosing whatever policy), sort them.
1008 * 2) Then sweep the entire set looking for other DIRTY blocks that can be tacked onto
1009 * any of these blocks to form larger contigous writes. The idea here is that if you
1010 * are going to do a write anyway, then we might as well opportunistically write out
1011 * any contigous blocks for free (Bob's idea).
1012 */
1013void
1014flashcache_clean_set(struct cache_c *dmc, int set)
1015{
1016     unsigned long flags;
1017     int to_clean = 0;
1018     struct dbn_index_pair *writes_list;
1019     int nr_writes = 0;
1020     int start_index = set * dmc->assoc;
1021    
1022     /*
1023     * If a (fast) removal of this device is in progress, don't kick off
1024     * any more cleanings. This isn't sufficient though. We still need to
1025     * stop cleanings inside flashcache_dirty_writeback() because we could
1026     * have started a device remove after tested this here.
1027     */
1028     if (atomic_read(&dmc->fast_remove_in_prog))
1029          return;
1030     writes_list = kmalloc(dmc->assoc * sizeof(struct dbn_index_pair), GFP_NOIO);
1031     if (unlikely(sysctl_flashcache_error_inject & WRITES_LIST_ALLOC_FAIL)) {
1032          if (writes_list)
1033               kfree(writes_list);
1034          writes_list = NULL;
1035          sysctl_flashcache_error_inject &= ~WRITES_LIST_ALLOC_FAIL;
1036     }
1037     if (writes_list == NULL) {
1038          dmc->memory_alloc_errors++;
1039          return;
1040     }
1041     dmc->clean_set_calls++;
1042     spin_lock_irqsave(&dmc->cache_spin_lock, flags);
1043     if (dmc->cache_sets[set].nr_dirty < dmc->dirty_thresh_set) {
1044          dmc->clean_set_less_dirty++;
1045          spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1046          kfree(writes_list);
1047          return;
1048     } else
1049          to_clean = dmc->cache_sets[set].nr_dirty - dmc->dirty_thresh_set;
1050     if (sysctl_flashcache_reclaim_policy == FLASHCACHE_FIFO) {
1051          int i, scanned;
1052          int start_index, end_index;
1053
1054          start_index = set * dmc->assoc;
1055          end_index = start_index + dmc->assoc;
1056          scanned = 0;
1057          i = dmc->cache_sets[set].set_clean_next;
1058          DPRINTK("flashcache_clean_set: Set %d", set);
1059          while (scanned < dmc->assoc &&
1060                 ((dmc->cache_sets[set].clean_inprog + nr_writes) < dmc->max_clean_ios_set) &&
1061                 ((nr_writes + dmc->clean_inprog) < dmc->max_clean_ios_total) &&
1062                 nr_writes < to_clean) {
1063               if ((dmc->cache[i].cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {    
1064                    dmc->cache[i].cache_state |= DISKWRITEINPROG;
1065                    writes_list[nr_writes].dbn = dmc->cache[i].dbn;
1066                    writes_list[nr_writes].index = i;
1067                    nr_writes++;
1068               }
1069               scanned++;
1070               i++;
1071               if (i == end_index)
1072                    i = start_index;
1073          }
1074          dmc->cache_sets[set].set_clean_next = i;
1075     } else { /* flashcache_reclaim_policy == FLASHCACHE_LRU */
1076          struct cacheblock *cacheblk;
1077          int lru_rel_index;
1078
1079          lru_rel_index = dmc->cache_sets[set].lru_head;
1080          while (lru_rel_index != FLASHCACHE_LRU_NULL &&
1081                 ((dmc->cache_sets[set].clean_inprog + nr_writes) < dmc->max_clean_ios_set) &&
1082                 ((nr_writes + dmc->clean_inprog) < dmc->max_clean_ios_total) &&
1083                 nr_writes < to_clean) {
1084               cacheblk = &dmc->cache[lru_rel_index + start_index];              
1085               if ((cacheblk->cache_state & (DIRTY | BLOCK_IO_INPROG)) == DIRTY) {
1086                    cacheblk->cache_state |= DISKWRITEINPROG;
1087                    writes_list[nr_writes].dbn = cacheblk->dbn;
1088                    writes_list[nr_writes].index = cacheblk - &dmc->cache[0];
1089                    nr_writes++;
1090               }
1091               lru_rel_index = cacheblk->lru_next;
1092          }
1093     }
1094     if (nr_writes > 0) {
1095          int i;
1096
1097          flashcache_merge_writes(dmc, writes_list, &nr_writes, set);
1098          dmc->clean_set_ios += nr_writes;
1099          spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1100          for (i = 0 ; i < nr_writes ; i++)
1101               flashcache_dirty_writeback(dmc, writes_list[i].index);
1102     } else {
1103          int do_delayed_clean = 0;
1104
1105          if (dmc->cache_sets[set].nr_dirty > dmc->dirty_thresh_set)
1106               do_delayed_clean = 1;
1107          spin_unlock_irqrestore(&dmc->cache_spin_lock, flags);
1108          if (dmc->cache_sets[set].clean_inprog >= dmc->max_clean_ios_set)
1109               dmc->set_limit_reached++;
1110          if (dmc->clean_inprog >= dmc->max_clean_ios_total)
1111               dmc->total_limit_reached++;
1112          if (do_delayed_clean)
1113               schedule_delayed_work(&dmc->delayed_clean, 1*HZ);
1114          dmc->clean_set_fails++;
1115     }
1116     kfree(writes_list);
1117}

First the parameters: dmc, plus set, the subscript of the set to clean.
Line 1028 checks for fast removal; if one is in progress, do nothing.
Line 1030 allocates the write-record buffer; we have just seen struct dbn_index_pair.
Line 1031 is test-only code: it deliberately simulates an allocation failure to verify the program copes correctly.
Line 1037: if the allocation failed, return.
Line 1043 checks whether the set's dirty-block count has reached the watermark; below it there is no point flushing so eagerly.
Line 1049 computes how many dirty blocks need cleaning this round (the sketch after this walkthrough spells out the arithmetic).
Line 1050: if the cleaning policy is FIFO, the set is traversed in FIFO order and dirty-block information recorded.
Line 1075 is the LRU policy.
Neither policy is inherently better than the other; it is simply that one suits a given workload better.
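As promised, a concrete look at the watermark arithmetic behind lines 1043-1049. The quoted code reads dmc->dirty_thresh_set directly; deriving it from a percentage sysctl is my assumption, modelled on how the upstream source initializes that field (the exact sysctl name may differ):

/* Hypothetical helper: how many blocks should this round clean? */
static int blocks_to_clean(struct cache_c *dmc, int set)
{
	/* Assumed derivation: e.g. assoc = 512, thresh_pct = 20 => thresh = 102 */
	int thresh = (dmc->assoc * sysctl_flashcache_dirty_thresh_pct) / 100;
	int nr_dirty = dmc->cache_sets[set].nr_dirty;

	if (nr_dirty < thresh)
		return 0;             /* below the watermark: leave it alone */
	return nr_dirty - thresh;     /* clean only the excess */
}

So with 150 dirty blocks in a 512-way set and a 20% threshold, to_clean comes out as 150 - 102 = 48.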
Is there anything else worth comparing? Let's look at the memory overhead of the two policies.
FIFO's overhead is the set_clean_next and set_fifo_next fields added to each set's management structure, cache_set.
LRU's overhead is lru_head and lru_tail in each set, plus lru_prev and lru_next in every cache block.
Note that lru_prev and lru_next are 16-bit unsigned integers; compared with a 64-bit pointer on a 64-bit system, each field saves 48 bits.
The price is that a set can hold at most 2^16 cache blocks.
In some applications, representing links as subscripts is far superior to using pointers. In a project I once worked on, the program had to recover immediately after a crash without disturbing the clients it was serving, which meant a restart had to restore the exact prior running state. None of the data to be restored could contain pointers, because after a restart every pointer is invalid; that is exactly where subscript representation comes into its own, as the sketch below illustrates.
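A generic sketch of the idea (not flashcache's exact code): blocks link to each other by 16-bit indices relative to the set's first block, with 0xFFFF as the null value, so the structure stays valid no matter where the array lands in memory after a restart:

#include <stdint.h>

#define LRU_NULL 0xFFFF                  /* sentinel: no neighbour */

struct node {
	uint16_t lru_prev;               /* index within the set, not a pointer */
	uint16_t lru_next;
};

/* Walk a set's LRU list. Because the links are relative indices, the walk
 * works even if `blocks` was reloaded at a different address. */
static void walk_lru(struct node *blocks, int set_base, uint16_t lru_head)
{
	uint16_t rel = lru_head;

	while (rel != LRU_NULL) {
		struct node *n = &blocks[set_base + rel];
		/* ... inspect or age the block here ... */
		rel = n->lru_next;
	}
}

This mirrors the traversal at lines 1079-1092 above, where lru_rel_index + start_index converts the relative link back into an absolute subscript.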
With the dirty-block records gathered, line 1094 issues the writebacks.
Line 1102 handles two cases: either no dirty blocks were found, or issuing has hit a limit; in the latter case line 1113 schedules another cleaning pass one second later.
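The one-second re-arm at line 1113 uses the kernel's delayed-workqueue API; flashcache wires dmc->delayed_clean to a cleaning handler when the cache is created. A minimal, self-contained sketch of the pattern (the handler and function names here are mine):

#include <linux/workqueue.h>

/* Runs in process context roughly one second after being scheduled. */
static void clean_work_fn(struct work_struct *work)
{
	/* Re-check the set and kick off cleaning if still over the watermark. */
}

static DECLARE_DELAYED_WORK(clean_work, clean_work_fn);

static void arm_delayed_clean(void)
{
	/* If the work is already pending, this call leaves it untouched. */
	schedule_delayed_work(&clean_work, 1 * HZ);
}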
Having come this far, it may seem we have toured the entire system. But we have been too optimistic all along: a more important show is still waiting for us to explore.