【翻譯】Chromium 網絡棧 disk cache 設計原理

英文原文鏈接

1、概覽

The disk cache stores resources fetched from the web so that they can be accessed quickly at a later time if needed. The main characteristics of
 the Chromium disk cache are:
 
    ·The cache should not grow unbounded so there must be an algorithm for deciding when to remove old entries.
    ·While it is not critical to lose some data from the cache, having to discard the whole cache should be minimized. 
       The current design should be able to gracefully handle application crashes, no matter what is going on at that time, 
       only discarding the resources that were open at that time. However, if the whole computer crashes while we are updating 
       the cache, everything on the cache probably will be discarded.
    ·Access to previously stored data should be reasonably efficient, and it should be possible to use synchronous or asynchronous operations.
    ·We should be able to avoid conflicts that prevent us from storing two given resources simultaneously. In other words, the design 
       should avoid cache trashing.
    ·It should be possible to remove a given entry from the cache, and keep working with a given entry while at the same time making 
       it inaccessible to other requests (as if it was never stored).
    ·The cache should not be using explicit multithread synchronization because it will always be called from the same thread. However, 
       callbacks should avoid reentrancy problems so they must be issued through the thread's message loop.

disk_cache 存儲從 web 獲取到的資源,以便下次需要時快速訪問,它的主要特徵包括:

    ·緩存不會無限增長,因此必須有相關算法來決定什麼時候移除舊的入口對象
    ·雖然丟失緩存中的部分數據並不致命,但應儘量避免不得不丟棄整個緩存的情況。目前的設計應能優雅地處理應用崩潰,不管崩潰時應用正在做什麼,
      只會丟棄當時處於打開狀態的資源。然而,如果在更新緩存時整個計算機崩潰了,很可能全部緩存都會被丟棄。
    ·應能以高效的方式訪問之前存儲的數據,並且能夠支持數據的同步或異步操作。
    ·應避免那些妨礙我們同時存儲兩個給定資源的衝突,換言之,設計上應避免 cache trashing。
    ·應能從緩存中移除指定的入口對象;在保持指定的入口對象有效的同時能阻止其他請求來訪問整個入口對象(對於其他請求的表現是沒有該入口對象)。
    ·緩存不應使用顯式的多線程同步,因爲它總是被同一個線程調用。然而,回調函數應避免重入問題,因此它們必須通過線程的 message loop 來發出。

2、External Interface

Any implementation of Chromium's cache exposes two interfaces: disk_cache::Backend and disk_cache::Entry (see src/net/disk_cache/disk_cache.h). 
The Backend provides methods to enumerate the resources stored on the cache (a.k.a. Entries), open old entries, create new ones, etc. Operations specific
 to a given resource are handled with the Entry interface.

An entry is identified by its key, which is just the name of the resource (for example http://www.google.com/favicon.ico ). Once an entry is created, the data for 
that particular resource is stored in separate chunks or data streams: one for the HTTP headers and another one for the actual resource data, so the index for
 the required stream is an argument to the Entry::ReadData and Entry::WriteData methods.

Chromium cache 的所有實現均對外暴露兩個接口:disk_cache::Backend 與 disk_cache::Entry(參見 src/net/disk_cache/disk_cache.h)。前者提供列舉存儲在 cache 中的資源(即入口對象)、
打開舊入口對象、創建新入口對象等方法;針對某個具體資源的操作則由 Entry 接口來處理。

每個入口對象由它的 key 唯一標識,key 就是資源的名字(例如 http://www.google.com/favicon.ico )。一旦入口對象創建成功,該資源的數據會被存儲在獨立的 chunk(即 data stream)中:
一個用於 HTTP 頭,另一個用於實際的資源數據,因此讀寫數據時需要把所需 stream 的索引號作爲參數傳給 Entry::ReadData 和 Entry::WriteData。
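上面描述的 Backend/Entry 分工可以用一段極簡的 C++ 草圖來示意。SketchBackend、SketchEntry 等名字均爲本文虛構,並非 Chromium 真實 API(真實接口見 src/net/disk_cache/disk_cache.h),這裏只演示「key 定位入口對象、stream 索引定位數據」這兩層關係:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy sketch (not the real Chromium API): an entry keyed by the resource
// name, holding separate data streams -- index 0 for the HTTP headers,
// index 1 for the payload -- mirroring the stream-index argument of
// Entry::ReadData / Entry::WriteData.
struct SketchEntry {
  std::string key;  // resource name, e.g. the URL
  std::vector<std::string> streams = std::vector<std::string>(2);

  void WriteData(int index, const std::string& data) { streams.at(index) = data; }
  std::string ReadData(int index) const { return streams.at(index); }
};

struct SketchBackend {
  std::map<std::string, SketchEntry> entries;  // toy stand-in for the index

  SketchEntry* CreateEntry(const std::string& key) {
    SketchEntry& e = entries[key];
    e.key = key;
    return &e;
  }
  SketchEntry* OpenEntry(const std::string& key) {
    auto it = entries.find(key);
    return it == entries.end() ? nullptr : &it->second;
  }
};
```

真實實現中這些操作可以是異步的,並且數據最終落在磁盤上的 block-file 或獨立文件中,後文會展開。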

3、Disk Structure

All the files that store Chromium's disk cache live in a single folder (you guessed it, it is called cache), and every file inside that folder is considered 
to be part of the cache (so it may be deleted by Chromium at some point!).

Chromium uses at least five files: one index file and four data files. If any of those files is missing or corrupt, the whole set of files is recreated. The index
 file contains the main hash table used to locate entries on the cache, and the data files contain all sorts of interesting data, from bookkeeping information
 to the actual HTTP headers and data of a given request. These data files are also known as block-files, because their file format is optimized to store information 
on fixed-size “blocks”. For instance, a given block-file may store blocks of 256 bytes and it will be used to store data that can span from one to four such blocks,
 in other words, data with a total size of 1 KB or less.

When the size of a piece of data is bigger than disk_cache::kMaxBlockSize (16 KB), it will no longer be stored inside one of our standard block-files. In this case, 
it will be stored in a “separate file”, which is a file that has no special headers and contains only the data we want to save. The name of a separate file follows 
the form f_xx, where xx is just the hexadecimal number that identifies the file.

所有存儲 Chromium disk cache 的文件都放在同一個文件夾下(你猜對了,它就叫 cache),文件夾內的每個文件都被視爲 cache 的一部分(因此它們可能在某個時間點被 Chromium 刪除!)。

Chromium 至少會使用 5 個文件:一個索引文件,四個數據文件。這 5 個文件中的任意一個丟失或損壞,都會導致全部文件重新創建。索引文件包含了用於定位 
cache 入口對象的主哈希表;數據文件則包含了各種各樣的數據,從簿記信息到給定請求的實際 HTTP 頭和數據。這些數據文件也被稱爲塊文件(block-file),因爲它們的文件格式針對
固定大小的「塊」做了優化。例如,某個塊文件的塊大小是 256B,用於存儲佔用 1 至 4 個這種塊的數據,也就是總大小不超過 1KB 的數據。

當一段數據的大小超過 disk_cache::kMaxBlockSize (16KB) 時,它將不再被存儲在標準的塊文件中,而是存儲到一個「獨立文件」裏。獨立文件沒有特殊的文件頭,只包含我們想保存的數據,
其命名格式爲 f_xx,xx 是標識該文件的十六進制數字。

3.1 Cache Address

Every piece of data stored by the disk cache has a given “cache address”. The cache address is simply a 32-bit number that describes exactly where the
 data is actually located.
A cache entry will have an address; the HTTP headers will have another address, the actual request data will have a different address, the entry name (key) 
may have another address and auxiliary information for the entry (such as the rankings info for the eviction algorithm) will have another address. This allows 
us to reuse the same infrastructure to efficiently store different types of data while at the same time keeping frequently modified data together, so that we
 can leverage the underlying operating system to reduce access latency.
The structure of a cache address is defined on disk_cache/addr.h, and basically tells if the required data is stored inside a block-file or as a separate file and
 the number of the file (block file or otherwise). If the data is part of a block-file, the cache address also has the number of the first block with the data, the
 number of blocks used and the type of block file.
These are few examples of valid addresses:
0x00000000: not initialized
0x8000002A: external file f_00002A
0xA0010003: block-file number 1 (data_1), initial block number 3, 1 block of length.

disk cache 存儲的每一段數據都有對應的 cache address,它是一個 32-bit 的數字,精確描述了數據實際存儲的位置。cache 入口對象、入口對象的 key、入口對象的輔助信息(比如逐出算法
用到的 rankings 信息)、HTTP 頭部以及實際的請求數據等都各有自己的 cache address。這樣設計的好處是允許我們複用同樣的基礎設施,在高效存儲不同類型數據的同時,把經常需要修改的數據放在
一起,從而可以藉助底層操作系統減少訪問時延。cache address 的結構定義在 disk_cache/addr.h 中,通過它可以知道所需的數據是存儲在 block-file 中還是一個獨立文件中,
以及對應文件的編號。如果數據位於 block-file 中,那麼 cache address 還包含數據的第一個塊的塊號、使用的總塊數以及 block-file 的類型。下面列舉
幾個有效地址的例子:

0x00000000: not initialized
0x8000002A: external file f_00002A
0xA0010003: block-file number 1 (data_1), initial block number 3, 1 block of length.
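這三個例子的位域解析可以用下面的 C++ 草圖覈對。各掩碼按 disk_cache/addr.h 的大致佈局寫出(bit 31 爲初始化標誌,bits 28-30 爲文件類型,0 表示獨立文件),僅作說明用途,可能與當前源碼有出入:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of decoding a 32-bit cache address. Mask values are illustrative,
// following the rough layout of disk_cache/addr.h: bit 31 = initialized,
// bits 28-30 = file type (0 means "separate file"); for block-file records,
// bits 24-25 hold the block count minus one, bits 16-23 the block-file
// number and bits 0-15 the first block.
struct DecodedAddr {
  bool initialized;
  uint32_t file_type;    // 0: separate file, otherwise a block-file type
  uint32_t file_number;  // f_xx number, or data_n number for block-files
  uint32_t start_block;  // first block inside the block-file
  uint32_t num_blocks;   // 1..4 blocks used by the record
};

DecodedAddr Decode(uint32_t addr) {
  DecodedAddr d{};
  d.initialized = (addr & 0x80000000u) != 0;
  d.file_type = (addr & 0x70000000u) >> 28;
  if (d.file_type == 0) {                // separate ("external") file
    d.file_number = addr & 0x0FFFFFFFu;  // e.g. 0x2A -> f_00002A
  } else {                               // block-file record
    d.num_blocks = ((addr & 0x03000000u) >> 24) + 1;
    d.file_number = (addr & 0x00FF0000u) >> 16;
    d.start_block = addr & 0x0000FFFFu;
  }
  return d;
}
```

按這種切分方式,0x8000002A 解析爲獨立文件 f_00002A,0xA0010003 解析爲 block-file 1(data_1)中從塊 3 開始、長度爲 1 塊的記錄,與上面的例子一致。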

3.2 Index File Structure

The index file structure is specified on disk_cache/disk_format.h. Basically, it is just a disk_cache::IndexHeader structure followed by the actual hash table. 
The number of entries in the table is at least disk_cache::kIndexTablesize (65536), but the actual size is controlled by the table_len member of the header.
The whole file is memory mapped to allow fast translation between the hash of the name of a resource (the key), and the cache address that stores the resource. 
The low order bits of the hash are used to index the table, and the content of the table is the address of the first stored resource with the same low order bits on the hash.
One of the things that must be verified when dealing with the disk cache files (the index and every block-file) is that the magic number on the header matches the 
expected value, and that the version is correct. The version has a major and a minor part, and the expected behavior is that any change on the major number 
means that the format is now incompatible with older formats.

索引文件結構定義在 disk_cache/disk_format.h 中,它由一個 disk_cache::IndexHeader 結構以及緊隨其後的哈希表組成。哈希表的元素數量至少是 disk_cache::kIndexTablesize (65536),
但實際數量由 IndexHeader 的成員 table_len 決定。整個文件被映射到內存中,從而能夠在資源名字(key)的哈希值與存儲該資源的 cache address 之間快速轉換。哈希值的低位部分用作哈希表的索引,索
引對應的值就是哈希低位與之相同的第一個資源的存儲地址。在處理 disk cache 文件(索引文件和每個 block-file)時,必須驗證文件頭中的魔數(magic number)與版本號是否符合預期。
版本號分爲主版本號和次版本號,主版本號的任何變化都意味着新格式與舊格式不兼容。
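「用哈希低位作爲表索引」這一步可以草繪如下。BucketFor 是示意函數名,默認表長取 disk_cache::kIndexTablesize 的值 65536;真實實現中表長來自 IndexHeader 的 table_len:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the index lookup: the low-order bits of the key's hash select a
// bucket, and the bucket holds the cache address of the first entry whose
// hash shares those bits (further entries are chained via the entry's
// "next" pointer). 65536 matches disk_cache::kIndexTablesize.
const uint32_t kIndexTableSize = 65536;  // must be a power of two

uint32_t BucketFor(uint32_t key_hash, uint32_t table_len = kIndexTableSize) {
  // Only the low-order bits of the hash are used as the table index.
  return key_hash & (table_len - 1);
}
```

哈希低位相同的兩個 key 會落到同一個桶裏,這正是後文 3.4 節中入口對象需要「指向下一個同桶入口對象」指針的原因。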

3.3 Block File Structure

The block-file structure is specified on disk_cache/disk_format.h. Basically, it is just a file header (disk_cache::BlockFileHeader) followed by a variable number of 
fixed-size data blocks. Block files are named data_n, where n is the decimal file number.
The header of the file (8 KB) is memory mapped to allow efficient creation and deletion of elements of the file. The bulk of the header is actually a bitmap that
 identifies used blocks on the file. The maximum number of blocks that can be stored on a single file is thus a little less than 64K.
Whenever there are not enough free blocks on a file to store more data, the file is grown by 1024 blocks until the maximum number of blocks is reached. At that 
moment, a new block-file of the same type is created, and the two files are linked together using the next_file member of the header. The type of the block-file is 
simply the size of the blocks that the file stores, so all files that store blocks of the same size are linked together. Keep in mind that even if there are multiple block-files
 chained together, the cache address points directly to the file that stores a given record. The chain is only used when looking for space to allocate a new record.

block-file 的結構定義在 disk_cache/disk_format.h 中,它由文件頭(disk_cache::BlockFileHeader)和數量不定、大小固定的數據塊組成。block-file 以 data_n 的方式命名,
n 是十進制的文件編號。文件頭(8KB)被映射到內存中,從而可以高效地創建和刪除文件中的元素。文件頭的主體是一個 bitmap,用於標識文件中哪些塊已被使用,因此單個文件可以存儲的塊數略少於 64K
(頭大小爲 8KB,8x1024x8)。如果文件中可用的塊數不夠存儲更多的數據,那麼文件會以 1024 個塊爲單位增長,直到達到最大塊數。此時會創
建一個新的同類型 block-file,兩個文件通過文件頭中的 next_file 成員鏈接起來。block-file 的類型就是該文件所存儲塊的大小,因此所有存儲相同大小塊的文件會被鏈接在一起。要注意的是,即使多個 block-file
 鏈接在一起,cache address 也是直接指向存儲給定記錄的那個文件。只有在爲新記錄尋找可分配空間時纔會用到這條
 block-file 鏈(由於對齊原因,各個 block-file 的塊不一定全部用完了)。

To simplify allocation of disk space, it is only possible to store records that use from one to four actual blocks. If the total size of the record is bigger than that, another 
type of block-file must be used. For example, to store a string of 2420 bytes, three blocks of 1024 bytes are needed, so that particular string will go to the block-file 
that has blocks of 1KB.
Another simplification of the allocation algorithm is that a piece of data is not going to cross the four block alignment boundary. In other words, if the bitmap says that
 block 0 is used, and everything else is free (state A), and we want to allocate space for four blocks, the new record will use blocks 4 through 7 (state B), leaving three
 unused blocks in the middle. However, if after that we need to allocate just two blocks instead of four, the new record will use blocks 1 and 2 (state C).

爲了簡化磁盤空間的分配,一條記錄只允許使用 1 至 4 個塊來存儲。如果記錄的大小超過 4 個塊,就必須使用另一種類型(塊更大)的 block-file。比如,要存儲 2420 個字節的字符串,需要 3 個 1024B 
大小的塊,因此這個字符串會被放進塊大小爲 1KB 的 block-file 中。分配算法的另一個簡化點是,一段數據不會跨越 4 塊對齊的邊界。換言之,如果 bitmap 顯示塊 0 已被使用而其餘塊都空閒(狀態 A),
此時要分配 4 個塊的空間,新記錄會使用塊 4 至塊 7(狀態 B),中間留下 3 個未用的塊;如果之後只需要分配 2 個塊而不是 4 個,新記錄則會使用塊 1 和塊 2(狀態 C)。
There are a couple of fields on the header to help the process of allocating space for a new record. The empty field stores counters of available space per 
block type and hints stores the last scanned location per block type. In this context, a block type is the number of blocks requested for the allocation. When
 a file is empty, it can store up to X records of four blocks each (X being close to 64K / 4). After a record of one block is allocated, it will be able to store
 X-1 records of four blocks, and one record of three blocks. If after that, a record of two blocks is allocated, the new capacity is X-1 records of four blocks
 and one record of one block, because the space that was available to store the record of three blocks was used to store the new record (two blocks), leaving
 one empty block.
It is important to realize that once a record has been allocated, its size cannot be increased. The only way to grow a record that was already saved is to read it, 
then delete it from the file and allocate a new record of the required size.

From the reliability point of view, having the header memory mapped allows us to detect scenarios when the application crashes while we are in the middle of
 modifying the allocation bitmap. The updating field of the header provides a way to signal that we are updating something on the headers, so that if the field
 is set when the file is open, the header must be checked for consistency.

文件頭中有幾個成員用於輔助爲新記錄分配空間:empty 按塊類型存儲了可用空間的計數,hints 按塊類型存儲了最後一次掃描到的位置。這裏的塊類型指的是分配時請求的塊數。
如果一個文件是空的,它最多可以存儲 X 個 4 塊大小的記錄(X 接近 64K/4);在分配了一個 1 塊大小的記錄之後,該文件還能存儲 X-1 個 4 塊大小的記錄和一個 3 塊大小的記錄;
如果接着又分配了一個 2 塊大小的記錄,新的容量就是 X-1 個 4 塊大小的記錄和一個 1 塊大小的記錄,因爲原本能放 3 塊記錄的空間被新記錄(2 塊)佔用了,只剩下一個空閒塊。
需要注意,一旦一條記錄被分配,它的大小就不能再增加。想要增大一條已保存記錄的唯一辦法是先讀取它,再從文件中刪除,然後按所需大小重新分配一條新記錄。

從可靠性的角度來說,把文件頭映射到內存中可以讓我們檢測到「應用在修改分配 bitmap 的過程中崩潰」的場景。文件頭中的 updating 成員提供了一種方式來標記我們正在更新文件頭,因此
如果文件打開時該成員處於置位狀態,就必須檢查文件頭的一致性。

3.4 Cache Entry

An entry is basically a complete entity stored by the cache. It is divided in two main parts: the disk_cache::EntryStore stores the part that fully identifies the 
entry and doesn’t change very often, and the disk_cache::RankingsNode stores the part that changes often and is used to implement the eviction algorithm.

The RankingsNode is always the same size (36 bytes), and it is stored on a dedicated type of block files (with blocks of 36 bytes). On the other hand, the 
EntryStore can use from one to four blocks of 256 bytes each, depending on the actual size of the key (name of the resource). In case the key is too long
 to be stored directly as part of the EntryStore structure, the appropriate storage will be allocated and the address of the key will be saved on the long_key 
field, instead of the full key.
The other things stored within EntryStore are addresses of the actual data streams associated with this entry, the key’s hash and a pointer to the next entry
 that has the same low-order hash bits (and thus shares the same position on the index table).
Whenever an entry is in use, its RankingsNode is marked as in-use so that when a new entry is read from disk we can tell if it was properly closed or not.

入口對象基本上就是緩存所存儲的一個完整實體,它分爲兩個主要部分:disk_cache::EntryStore 存儲能完整標識該入口對象且不經常變化的部分;disk_cache::RankingsNode 則存儲經常變化的部分,
用於實現逐出(eviction)算法。

RankingsNode 的大小固定爲 36B,存儲在一種專門類型的 block-file 中(塊大小爲 36B)。而 EntryStore 可以使用 1 到 4 個塊(塊大小爲 256B),塊數取決於 key(即
資源的名字)的實際大小。如果 key 太長而無法直接作爲 EntryStore 結構的一部分存儲,那麼會分配合適大小的空間來存儲 key,並把 key 的地址而非完整的 key 保存在 long_key 成員中。
EntryStore 中還存儲了與該入口對象關聯的各數據 stream 的地址、key 的哈希值,以及下一個哈希低位相同(因而共享索引表同一位置)的入口對象的指針。
每當一個入口對象被使用時,它的 RankingsNode 會被標記爲 in-use,這樣當從磁盤讀取一個入口對象時,我們就能判斷它之前是否被正確關閉。
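EntryStore 的大致形態與「key 內聯還是走 long_key」的判斷可以草繪如下。字段名稱參考 disk_format.h 的描述,但字段大小與 96 字節的頭部開銷均爲本文的示意假設,並非真實佈局:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch loosely mirroring disk_cache::EntryStore (field set follows the
// text; sizes are illustrative). The record occupies 1-4 blocks of 256
// bytes, so short keys live inline and long ones go through long_key.
struct EntryStoreSketch {
  uint32_t hash;           // full hash of the key
  uint32_t next;           // cache address of next entry in the same bucket
  uint32_t rankings_node;  // cache address of the 36-byte RankingsNode
  uint32_t data_addr[4];   // cache addresses of the data streams
  uint32_t long_key;       // cache address of the key when stored externally
  std::string key;         // inline key, when it fits
};

// Assumed overhead and inline-key limit for this sketch only: with 256-byte
// blocks and at most 4 blocks per record, longer keys must use long_key.
const size_t kHeaderOverhead = 96;
const size_t kMaxInlineKey = 4 * 256 - kHeaderOverhead;

int BlocksForEntry(size_t key_len) {
  if (key_len > kMaxInlineKey) return 1;  // key external; entry fits 1 block
  size_t total = kHeaderOverhead + key_len;
  return static_cast<int>((total + 255) / 256);  // round up to 256B blocks
}
```

這個草圖也解釋了 3.5 節的例子:一個佔用兩個 256B 塊的外部 key,其長度必然落在 1KB 到 2KB 之間(按 1KB 塊的 block-file 存儲)。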

3.5 The Big Picture

This diagram shows a disk cache with 7 files on disk: the index file, 5 block-files and one separate file. data_1 and data_4 are chained together so they store
 blocks of the same size (256 bytes), while data_2 stores blocks of 1KB and data_3 stores blocks of 4 KB. The depicted entry has its key stored outside the
 EntryStore structure, and given that it uses two blocks, it must be between one and two kilobytes. This entry also has two data streams, one for the HTTP 
headers (less than 256 bytes) and another one for the actual payload (more than 16 KB so it lives on a dedicated file). All blue arrows indicate that a cache
 address is used to locate another piece of data.

下圖展示了磁盤上一個包含 7 個文件的 disk cache:1 個索引文件、5 個 block-file 以及 1 個獨立文件。data_1 和 data_4 被鏈接在一起,因此它們存儲的塊大小相同(256B),
而 data_2 的塊大小爲 1KB,data_3 則爲 4KB。圖中這個入口對象的 key 被存儲在 EntryStore 結構之外,由於 key 佔用了兩個塊,其大小必定在
1KB 到 2KB 之間。這個入口對象還有兩個數據 stream,一個是 HTTP 頭(小於 256B),另一個是實際負載(超過 16KB,因此存儲在一個專門的文件中)。所有藍色箭頭都表示
用一個 cache address 來定位另一段數據。

4、Implementation Notes

Chromium has two different implementations of the cache interfaces: while the main one is used to store info on a given disk, there is also a very simple
 implementation that doesn’t use a hard drive at all, and stores everything in memory. The in-memory implementation is used for the Incognito mode so 
that even if the application crashes it will be quite difficult to extract any information that was accessed while browsing in that mode.
There are a few different types of caches (see net/base/cache_type.h), mainly defined by their intended use: there is a media specific cache, the general 
purpose disk cache, and another one that serves as the back end storage for AppCache, in addition to the in-memory type already mentioned. All types of 
caches behave in a similar way, with the exception that the eviction algorithm used by the general purpose cache is not the same LRU used by the others.
The regular cache implementation is located on disk_cache/backend_impl.cc and disk_cache/entry_impl.cc. Most of the files on that folder are actually related
 to the main implementation, except for a few that implement the in-memory cache: disk_cache/mem_backend_impl.cc and disk_cache/mem_entry_impl.cc.

Chromium 對 cache 接口有兩種不同的實現:主實現用於把信息存儲在磁盤上,另一種非常簡單的實現則完全不使用硬盤,把所有信息存儲在內存中。後者用於無痕模式
(Incognito mode),因此即使應用崩潰,也很難提取到在該模式下瀏覽時訪問過的任何信息。
net/base/cache_type.h 中定義了幾種不同的緩存類型,主要按用途劃分:除了前面提到的內存型緩存,還有媒體專用的緩存、通用的磁盤緩存,以及作爲 AppCache 後端存儲的緩存。所有類型的緩存
行爲都很相似,唯一的例外是通用緩存使用的逐出算法與其他類型使用的 LRU 並不相同。
常規緩存的實現位於 disk_cache/backend_impl.cc 和 disk_cache/entry_impl.cc。該目錄下的大部分文件都與主實現相關,少數實現了內存型緩存
(disk_cache/mem_backend_impl.cc 和 disk_cache/mem_entry_impl.cc)。

4.1 Lower Interface

The lower interface of the disk cache (the one that deals with the OS) is handled mostly by two files: disk_cache/file.h and disk_cache/mapped_file.h, with separate 
implementations per operating system. The most notable requirement is support for partially memory-mapped files, but asynchronous interfaces and a decent file 
system level cache go a long way towards performance (we don’t want to replicate the work of the OS).
To deal with all the details about block-file access, the disk cache keeps a single object that deals with all of them: a disk_cache::BlockFiles object. This object enables 
allocation and deletion of disk space, and provides disk_cache::File object pointers to other people so that they can access the information that they need.
A StorageBlock is a simple template that represents information stored on a block-file, and it provides methods to load and store the required data from disk (based 
on the record’s cache address). We have two instantiations of the template, one for dealing with the EntryStore structure and another one for dealing with the 
RankingsNode structure. With this template, it is common to find code like entry->rankings()->Store().

4.2 Eviction

Support for the eviction algorithm of the cache is implemented on disk_cache/rankings (and mem_rankings for the in-memory one), and the eviction itself is implemented
 on disk_cache/eviction. Right now we have a simple Least Recently Used algorithm that just starts deleting old entries once a certain limit is exceeded, and a second 
algorithm that takes reuse and age into account before evicting an entry. We also have the concept of transaction when one of the lists is modified so that if the 
application crashes in the middle of inserting or removing an entry, next time we will roll the change back or forward so that the list is always consistent.
In order to support reuse as a factor for evictions, we keep multiple lists of entries depending on their type: not reused, with low reuse and highly reused. We also have
 a list of recently evicted entries so that if we see them again we can adjust their eviction next time we need the space. There is a time-target for each list and we try to 
avoid eviction of entries without having the chance to see them again. If the cache uses only LRU, all lists except the not-reused are empty.

緩存逐出算法的支持實現在 disk_cache/rankings 中(內存版本在 mem_rankings),逐出本身則實現在 disk_cache/eviction 中。目前有一個簡單的 LRU 算法,一旦超過某個上限就開始刪除舊的
入口對象;還有第二個算法,在逐出入口對象之前會把複用情況和存留時長考慮進去。在修改這些鏈表時我們還引入了事務的概念:若應用在插入或移除入口對象的過程中崩潰,下一次我們會把該修改
回滾或前滾,從而保證鏈表總是一致的。
爲了把複用作爲逐出的考慮因素,我們按入口對象的使用情況維護了多個鏈表:未被複用的、低頻複用的和高頻複用的。我們還維護了一個最近被逐出的入口對象的鏈表,這樣再次遇到它們時,
下次需要空間就可以調整其逐出決策。每個鏈表都有一個時間目標,我們儘量避免在入口對象還沒有機會被再次訪問之前就將其逐出。如果緩存只使用 LRU,那麼除了 not-reused
鏈表,其他都是空鏈表。
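按複用程度分多條鏈表、優先從低價值鏈表逐出的思路可以示意如下。劃分閾值(1 次和 3 次)是本文虛構的,真實策略見 disk_cache/eviction:

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy sketch of reuse-aware eviction: entries sit on one of three lists by
// reuse count, and under pressure we evict the oldest entry from the
// least-reused non-empty list first. Thresholds are made up for
// illustration; the real policy lives in disk_cache/eviction.
struct EvEntry { std::string key; int reuse_count; };

struct ReuseLists {
  std::deque<EvEntry> no_use, low_use, high_use;

  void Insert(const EvEntry& e) {
    if (e.reuse_count == 0) no_use.push_back(e);
    else if (e.reuse_count < 3) low_use.push_back(e);
    else high_use.push_back(e);
  }
  std::string Evict() {
    std::deque<EvEntry>* lists[] = {&no_use, &low_use, &high_use};
    for (std::deque<EvEntry>* l : lists) {
      if (!l->empty()) {
        std::string key = l->front().key;  // oldest entry on this list
        l->pop_front();
        return key;
      }
    }
    return "";  // nothing left to evict
  }
};
```

純 LRU 模式相當於所有入口對象都落在 no_use 一條鏈表上,逐出順序退化爲插入順序。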

4.3 Buffering

When we start writing data for a new entry we allocate a buffer of 16 KB where we keep the first part of the data. If the total length is less than the buffer size, we 
only write the information to disk when the entry is closed; however, if we receive more than 16 KB, then we start growing that buffer until we reach a limit for this 
stream (1 MB), or for the total size of all the buffers that we have. This scheme gives us immediate response when receiving small entries (we just copy the data), and 
works well with the fact that the total record size is required in order to create a new cache address for it. It also minimizes the number of writes to disk so it improves 
performance and reduces disk fragmentation.

當我們開始爲一個新的入口對象寫數據時,會申請一個 16KB 的緩衝區來存放第一部分數據。如果數據的總長度小於緩衝區大小,那麼只有當該入口對象被關閉時纔會把數據寫入磁盤;然而
當我們接收的數據超過 16KB 時,我們會開始增大這個緩衝區,直到達到單個 stream 的上限(1MB),或者達到所有緩衝區總大小的上限。這個方案在接收小的入口對象時可以立即響應(只需拷貝數據),
並且與「創建新的 cache address 之前需要知道記錄總大小」這一點配合良好。它還能最小化磁盤寫入次數,從而既提升性能又減少磁盤碎片。
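緩衝區從 16KB 增長到 1MB 上限的策略可以草繪如下。按倍增方式增長只是本文的假設,真實的增長步長與「所有緩衝區總量」的限制見 entry 的實現:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the write-buffering policy: start with a 16 KB buffer, grow it
// while data keeps arriving, and fall back to writing through once the
// per-stream limit (1 MB) is exceeded. The doubling growth step is an
// assumption made for this sketch.
const size_t kInitialBuffer = 16 * 1024;
const size_t kStreamLimit = 1024 * 1024;

// Returns the buffer size to hold `total_bytes` of pending data, or 0 when
// the data no longer fits in a buffer and must go to disk.
size_t BufferFor(size_t total_bytes) {
  size_t buf = kInitialBuffer;
  while (buf < total_bytes && buf < kStreamLimit) buf *= 2;  // grow buffer
  return total_bytes <= buf ? buf : 0;
}
```

小入口對象始終停留在初始的 16KB 緩衝區裏,直到關閉時一次性寫盤;超過 1MB 的 stream 則不再完整緩衝。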

4.4 Deleting Entries

To delete entries from the cache, one of the Doom*() methods can be used. All that they do is to mark a given entry to be deleted once all users have closed the entry. 
Of course, this means that it is possible to open a given entry multiple times (and read and write to it simultaneously). When an entry is doomed (marked for deletion), 
it is removed from the index table so that any attempt to open it again will fail (and creating the entry will succeed), even when an already created Entry object can still 
be used to read and write the old entry.
When two objects are open at the same time, both users will see what the other is doing with the entry (there is only one “real” entry, and they see a consistent state
 of it). That’s even true if the entry is doomed after it was open twice. However, once the entry is created after it was doomed, we end up with basically two separate 
entries, one for the old, doomed entry, and another one for the newly created one.

使用 Doom*() 系列方法可以從緩存中刪除入口對象。這些方法只是把給定入口對象標記爲待刪除,等所有用戶都關閉該入口對象後纔會真正刪除。當然,這意味着同一個入口對象可以被多次打開
(並被同時讀寫)。當一個入口對象被 doom(標記爲待刪除)後,它會從索引表中移除,因此任何再次打開它的嘗試都會失敗(而創建同名入口對象則會成功),即使已有的 Entry 對象仍然可以
用於讀寫這個舊的入口對象。當兩個對象同時被打開時,雙方都能看到對方對入口對象所做的操作(實際只有一個入口對象,雙方看到的是它的一致狀態),即使入口對象是在被打開兩次之後才被
doom 的也是如此。然而,一旦在 doom 之後又創建了同名入口對象,我們實際上就有了兩個獨立的入口對象:一個是舊的、已被 doom 的,另一個是新創建的。
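Doom 的語義(從索引摘除、舊句柄仍可用、可再創建同名入口)可以用一個玩具模型覈對。DoomBackend 等名字均爲示意,並非真實 API:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Toy model of Doom semantics: dooming unlinks the entry from the index, so
// OpenEntry fails and CreateEntry makes a brand-new entry, while existing
// handles keep reading and writing the old, doomed one.
struct DoomEntry {
  std::string key, data;
  bool doomed = false;
};

struct DoomBackend {
  std::map<std::string, std::shared_ptr<DoomEntry>> index;

  std::shared_ptr<DoomEntry> CreateEntry(const std::string& key) {
    auto e = std::make_shared<DoomEntry>();
    e->key = key;
    index[key] = e;  // replaces any doomed predecessor in the index
    return e;
  }
  std::shared_ptr<DoomEntry> OpenEntry(const std::string& key) {
    auto it = index.find(key);
    return it == index.end() ? nullptr : it->second;
  }
  void DoomEntryByKey(const std::string& key) {
    auto it = index.find(key);
    if (it == index.end()) return;
    it->second->doomed = true;  // really deleted once every user closes it
    index.erase(it);            // no new opens can reach it any more
  }
};
```

注意 doom 之後再 CreateEntry 得到的是一個全新的對象:舊句柄與新句柄從此各自指向互不相干的入口對象,正如正文所述。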

4.5 Enumerations

A good example of enumerating the entries stored by the cache is located at src/net/url_request/url_request_view_cache_job.cc . It should be noted that this interface
 is not making any statements about the order in which the entries are enumerated, so it is not a good idea to make assumptions about it. Also, it could take a long time 
to go through all the info stored on disk.

4.6 Sparse Data

An entry can be used to store sparse data instead of a single, continuous stream. In this case, only two streams can be stored by the entry, a regular one
 (the first one), and a sparse one (the second one). Internally, the cache will distribute sparse chunks among a set of dedicated entries (child entries) that are
 linked together from the main entry (the parent entry). Each child entry will store a particular range of the sparse data, and inside that range we could have
 "holes" that have not been written yet.
 This design allows the user to store files of any length (even bigger than the total size of the cache), while the cache is in fact simultaneously evicting parts of that
 file, according to the regular eviction policy. Most of this logic is implemented on disk_cache/sparse_control (and disk_cache/mem_entry_impl for the in-memory case).

入口對象可以用來存儲稀疏數據,而不是單一且連續的 stream。這種情況下,入口對象只能存儲兩個 stream:一個是常規 stream(第一個),一個是稀疏 stream(第二個)。在內部,緩存會把稀疏的
chunk 分佈到一組專門的入口對象(子入口對象)中,這些子入口對象由主入口對象(父入口對象)鏈接起來;每個子入口對象存儲稀疏數據的一個特定範圍,範圍內尚未寫入數據的部分稱爲「洞」(hole)。
這種設計允許用戶存儲任意長度的文件(甚至大於緩存的總大小),而實際上緩存會按常規逐出策略同時逐出該文件的某些部分。大部分邏輯實現在 disk_cache/sparse_control 中
(內存版本見 disk_cache/mem_entry_impl)。

4.7 Dedicated Thread

We have a dedicated thread to perform most of the work, while the public API is called on the regular network thread (the browser's IO thread).

The reason for this dedicated thread is to be able to remove any potentially blocking call from the IO thread, because that thread serves the IPC 
messages with all the renderer and plugin processes: even if the browser's UI remains responsive when the IO thread is blocked, there is no way to 
talk to any renderer so tabs look unresponsive. On the other hand, if the computer's IO subsystem is under heavy load, any disk access can block for a long time.

Note that it may be possible to extend the use of asynchronous IO and just keep using the same thread. However, we are not really using asynchronous
 IO for Posix (due to compatibility issues), and even in Windows, not every operation can be performed asynchronously; for instance, opening and closing 
a file are always synchronous operations, so they are subject to significant delays under the proper set of circumstances.

有專門的線程處理了大部分的事情,公共 API 則是在常規的網絡線程(瀏覽器的 IO 線程)中調用。

使用專門線程的原因是可以移除潛在的來自 IO 線程的阻塞式調用,因爲 IO 線程需要處理來自渲染器和插件的 IPC 消息:即使瀏覽器的 UI 在 IO 線程被阻塞時保持響應,
也無法與渲染器保持通信,從而瀏覽器的 Tab 頁看起來是未響應的。另一方面,如果計算機的 IO 子系統處於高負載狀態,任何磁盤訪問被阻塞的時間會較長。

注意,也有可能通過擴展異步 IO 的使用而繼續沿用同一個線程。然而出於兼容性問題,我們在 Posix 上並沒有真正使用異步 IO;即使在 Windows 上也不是所有操作都可以異步完成,
例如打開和關閉文件始終是同步操作,因此在特定情況下它們會有明顯的延遲。

Another thing to keep in mind is that we tend to perform a large number of IO operations, knowing that most of the time they just end up being completed by 
the system's cache. It would have been possible to use asynchronous operations all the time, but the code would have been much harder to understand because
 that means a lot of very fragmented state machines. And of course that doesn't solve the problem with Open/Close.

As a result, we have a mechanism to post tasks from the main thread (IO thread), to a background thread (Cache thread), and back, and we forward most of the 
API to the actual implementation that runs on the background thread. See disk_cache/in_flight_io and disk_cache/in_flight_backend_io. There are a few methods 
that are not forwarded to the dedicated thread, mostly because they don't interact with the files, and only provide state information. There is no locking to access
 the cache, so these methods are generally racing with actual modifications, but that non-racy guarantee is not made by the API. For example, getting the size of 
a data stream (entry::GetDataSize()) is racing with any pending WriteData operation, so it may return the value before or after the write completes.

Note that we have multiple instances of disk-caches, and they all have the same background thread.

另一個要記住的點是,我們往往會發起大量 IO 操作,並且知道它們大多數最終只是由系統緩存完成。本可以一直使用異步操作,但那樣代碼會難懂得多,因爲會存在非常多碎片化的狀態機,
而且也並未解決文件打開/關閉的同步問題。

因此,我們實現了一種機制,把任務從主線程(IO 線程)提交到後臺線程(cache 線程)再返回,並把大部分 API 轉發給運行在後臺線程上的實際實現(參考 
disk_cache/in_flight_io 和 disk_cache/in_flight_backend_io)。有少數方法不會被轉發到專用線程,主要因爲它們不與文件交互,只提供狀態信息。
訪問 cache 沒有加鎖,因此這些方法通常會與實際的修改操作產生競爭,但 API 本身並不承諾無競爭;例如,獲取數據 stream 大小的 entry::GetDataSize() 會與任何尚未完成的 WriteData 
操作競爭,因此它返回的可能是寫入完成前或完成後的值。

所有的 disk-cache 實例共享同一個後臺線程。
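IO 線程與 cache 線程之間「提交任務、回調返回」的模式可以用一個單進程的玩具隊列示意。真實實現基於消息循環和真正的線程,見 disk_cache/in_flight_io 與 disk_cache/in_flight_backend_io;這裏只演示控制流:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>

// Single-process sketch of the in-flight pattern: the IO thread posts
// operations to the cache thread's queue and gets its completion callback
// posted back to its own loop, so no disk call ever blocks the IO thread.
struct Loops {
  std::queue<std::function<void()>> cache_tasks;  // drained by cache thread
  std::queue<std::function<void()>> io_tasks;     // drained by IO thread
};

void PostOperation(Loops& loops, const std::function<std::string()>& disk_op,
                   const std::function<void(std::string)>& callback) {
  loops.cache_tasks.push([&loops, disk_op, callback] {
    std::string result = disk_op();  // blocking work, off the IO thread
    // Post the completion back to the IO thread's message loop.
    loops.io_tasks.push([callback, result] { callback(result); });
  });
}

void Drain(std::queue<std::function<void()>>& q) {
  while (!q.empty()) {
    q.front()();
    q.pop();
  }
}
```

回調經由隊列投遞而不是就地執行,也呼應了第 1 節的要求:回調通過線程的 message loop 發出,避免重入問題。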

5、Data Integrity

There is a balance to achieve between performance and crash resilience. At one extreme, every unexpected failure will lead to unrecoverable corrupt information and at the other
 extreme every action has to be flushed to disk before moving on to be able to guarantee the correct ordering of operations. We didn’t want to add the complexity of a journaling 
system given that the data stored by the cache is not critical by definition, and doing that implies some performance degradation.
The current system relies heavily on the presence of an OS-wide file system cache that provides adequate performance characteristics at the price of losing some deterministic 
guarantees about when the data actually reaches the disk (we just know that at some point, some part of the OS will actually decide that it is time to write the information to disk, 
but not if page X will get to disk before page Y).
Some critical parts of the system are directly memory mapped so that, besides providing optimum performance, even if the application crashes the latest state will be flushed to
 disk by the system. Of course, if the computer crashes we’ll end up on a pretty bad state because we don’t know if some part of the information reached disk or not (each memory 
page can be in a different state).
The most common problem if the system crashes is that the lists used by the eviction algorithm will be corrupt because some pages will have reached disk while others will effectively 
be on a “previous” state, still linking to entries that were removed etc. In this case, the corruption will not be detected at start up (although individual dirty entries will be detected and 
handled correctly), but at some point we’ll find out and proceed to discard the whole cache. It could be possible to start “saving” individual good entries from the cache, but the benefit
 is probably not worth the increased complexity.