[Translation] Design of the Chromium network stack disk cache

Link to the original English article

1. Overview

The disk cache stores resources fetched from the web so that they can be accessed quickly at a later time if needed. The main characteristics of
 the Chromium disk cache are:
 
    ·The cache should not grow unbounded, so there must be an algorithm for deciding when to remove old entries.
    ·While losing some data from the cache is not critical, having to discard the whole cache should be minimized.
       The current design should be able to gracefully handle application crashes, no matter what is going on at that time,
       only discarding the resources that were open at that time. However, if the whole computer crashes while we are updating
       the cache, everything in the cache will probably be discarded.
    ·Access to previously stored data should be reasonably efficient, and it should be possible to use synchronous or asynchronous operations.
    ·We should be able to avoid conflicts that prevent us from storing two given resources simultaneously. In other words, the design
       should avoid cache thrashing.
    ·It should be possible to remove a given entry from the cache, and to keep working with a given entry while at the same time making
       it inaccessible to other requests (as if it were never stored).
    ·The cache should not use explicit multithread synchronization because it will always be called from the same thread. However,
       callbacks should avoid reentrancy problems, so they must be issued through the thread's message loop.


2. External Interface

Any implementation of Chromium's cache exposes two interfaces: disk_cache::Backend and disk_cache::Entry (see src/net/disk_cache/disk_cache.h).
The Backend provides methods to enumerate the resources stored in the cache (a.k.a. Entries), open old entries, or create new ones. Operations specific
 to a given resource are handled with the Entry interface.

An entry is identified by its key, which is just the name of the resource (for example http://www.google.com/favicon.ico ). Once an entry is created, the data for 
that particular resource is stored in separate chunks or data streams: one for the HTTP headers and another one for the actual resource data, so the index for
 the required stream is an argument to the Entry::ReadData and Entry::WriteData methods.


3. Disk Structure

All the files that store Chromium's disk cache live in a single folder (you guessed it, it is called cache), and every file inside that folder is considered 
to be part of the cache (so it may be deleted by Chromium at some point!).

Chromium uses at least five files: one index file and four data files. If any of those files is missing or corrupt, the whole set of files is recreated. The index
 file contains the main hash table used to locate entries on the cache, and the data files contain all sorts of interesting data, from bookkeeping information
 to the actual HTTP headers and data of a given request. These data files are also known as block-files, because their file format is optimized to store information 
on fixed-size “blocks”. For instance, a given block-file may store blocks of 256 bytes and it will be used to store data that can span from one to four such blocks,
 in other words, data with a total size of 1 KB or less.

When the size of a piece of data is bigger than disk_cache::kMaxBlockSize (16 KB), it will no longer be stored inside one of our standard block-files. In this case, 
it will be stored in a “separate file”, which is a file that has no special headers and contains only the data we want to save. The name of a separate file follows 
the form f_xx, where xx is just the hexadecimal number that identifies the file.
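To make the size tiers concrete, here is a small sketch (not the actual Chromium code) of how a record size would select a block-file type under the rules above: each block-file stores records of one to four of its fixed-size blocks, and anything larger than disk_cache::kMaxBlockSize goes to a separate f_xx file. The 256 B / 1 KB / 4 KB tiers match the data files pictured in the "Big Picture" section.

```cpp
#include <cassert>

// Illustrative helper: returns the block size of the block-file that would
// hold `size` bytes (a record may span 1 to 4 blocks), or 0 when the data
// must live in a separate f_xx file instead.
constexpr int kMaxBlockSize = 16 * 1024;  // disk_cache::kMaxBlockSize

int BlockSizeFor(int size) {
  if (size > kMaxBlockSize) return 0;  // stored as a separate external file
  for (int block = 256; block <= 4096; block *= 4) {
    if (size <= 4 * block) return block;  // fits in 1-4 blocks of this size
  }
  return 0;  // unreachable for positive sizes <= kMaxBlockSize
}
```

For example, the 2420-byte string mentioned later needs three 1 KB blocks, so it selects the 1 KB block-file.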


3.1 Cache Address

Every piece of data stored by the disk cache has a given “cache address”. The cache address is simply a 32-bit number that describes exactly where the
 data is actually located.
A cache entry will have an address; the HTTP headers will have another address, the actual request data will have a different address, the entry name (key) 
may have another address and auxiliary information for the entry (such as the rankings info for the eviction algorithm) will have another address. This allows 
us to reuse the same infrastructure to efficiently store different types of data while at the same time keeping frequently modified data together, so that we
 can leverage the underlying operating system to reduce access latency.
The structure of a cache address is defined on disk_cache/addr.h, and basically tells if the required data is stored inside a block-file or as a separate file and
 the number of the file (block file or otherwise). If the data is part of a block-file, the cache address also has the number of the first block with the data, the
 number of blocks used and the type of block file.
These are a few examples of valid addresses:
0x00000000: not initialized
0x8000002A: external file f_00002A
0xA0010003: block-file number 1 (data_1), initial block number 3, 1 block of length.
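The examples above can be decoded mechanically. The sketch below mirrors the bit layout of disk_cache/addr.h as far as these examples reveal it (top bit = initialized, next three bits = file type, then block count, file number, and start block for block-file addresses); treat the exact masks as an approximation rather than the authoritative definition.

```cpp
#include <cassert>
#include <cstdint>

// Decoded view of a 32-bit cache address (approximate layout).
struct CacheAddr {
  bool initialized;      // bit 31
  uint32_t file_type;    // bits 30-28: 0 = separate file, others = block-files
  uint32_t file_number;  // separate file: bits 27-0; block-file: bits 23-16
  uint32_t start_block;  // block-file only: bits 15-0
  uint32_t num_blocks;   // block-file only: bits 25-24 hold (count - 1)
};

CacheAddr Decode(uint32_t addr) {
  CacheAddr out{};
  out.initialized = (addr & 0x80000000) != 0;
  out.file_type = (addr & 0x70000000) >> 28;
  if (out.file_type == 0) {  // separate ("external") file f_xx
    out.file_number = addr & 0x0FFFFFFF;
  } else {                   // one of the block-files
    out.num_blocks = ((addr & 0x03000000) >> 24) + 1;
    out.file_number = (addr & 0x00FF0000) >> 16;
    out.start_block = addr & 0x0000FFFF;
  }
  return out;
}
```

Decoding 0xA0010003 with this sketch yields block-file number 1, start block 3, one block of length, matching the example above.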


3.2 Index File Structure

The index file structure is specified in disk_cache/disk_format.h. Basically, it is just a disk_cache::IndexHeader structure followed by the actual hash table.
The number of entries in the table is at least disk_cache::kIndexTablesize (65536), but the actual size is controlled by the table_len member of the header.
The whole file is memory mapped to allow fast translation between the hash of the name of a resource (the key) and the cache address that stores the resource.
The low-order bits of the hash are used to index the table, and the content of the table is the address of the first stored resource with the same low-order bits in its hash.
One of the things that must be verified when dealing with the disk cache files (the index and every block-file) is that the magic number in the header matches the
expected value, and that the version is correct. The version has a major and a minor part, and the expected behavior is that any change in the major number
means that the format is now incompatible with older formats.
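The lookup described above can be sketched as a chained hash table (the types here are illustrative stand-ins, not the real Chromium structures): the low-order bits of the key's hash select a table slot, and collisions are resolved by chaining through the entries themselves.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for a stored entry: each entry remembers its full
// hash and the cache address of the next entry whose hash shares the same
// low-order bits (0 meaning end of chain).
struct StoredEntry {
  uint32_t full_hash;
  uint32_t next;
};

// The table length is a power of two (at least kIndexTablesize == 65536),
// so masking keeps exactly the low-order bits of the hash.
uint32_t BucketFor(uint32_t hash, uint32_t table_len) {
  return hash & (table_len - 1);
}
```

To open an entry, the cache reads the address stored at BucketFor(hash, table_len) and walks the chain comparing full hashes (and then keys) until it finds a match.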


3.3 Block File Structure

The block-file structure is specified in disk_cache/disk_format.h. Basically, it is just a file header (disk_cache::BlockFileHeader) followed by a variable number of
fixed-size data blocks. Block-files are named data_n, where n is the decimal file number.
The header of the file (8 KB) is memory mapped to allow efficient creation and deletion of elements of the file. The bulk of the header is actually a bitmap that
identifies used blocks in the file. The maximum number of blocks that can be stored in a single file is thus a little less than 64K (an 8 KB header can hold at most 8 × 1024 × 8 = 64K bitmap bits).
Whenever there are not enough free blocks in a file to store more data, the file is grown by 1024 blocks until the maximum number of blocks is reached. At that
moment, a new block-file of the same type is created, and the two files are linked together using the next_file member of the header. The type of the block-file is
simply the size of the blocks that the file stores, so all files that store blocks of the same size are linked together. Keep in mind that even if there are multiple block-files
chained together, the cache address points directly to the file that stores a given record. The chain is only used when looking for space to allocate a new record.


To simplify allocation of disk space, it is only possible to store records that use from one to four actual blocks. If the total size of the record is bigger than that, another 
type of block-file must be used. For example, to store a string of 2420 bytes, three blocks of 1024 bytes are needed, so that particular string will go to the block-file 
that has blocks of 1KB.
Another simplification of the allocation algorithm is that a piece of data is not going to cross the four block alignment boundary. In other words, if the bitmap says that
 block 0 is used, and everything else is free (state A), and we want to allocate space for four blocks, the new record will use blocks 4 through 7 (state B), leaving three
 unused blocks in the middle. However, if after that we need to allocate just two blocks instead of four, the new record will use blocks 1 and 2 (state C).
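The A → B → C transitions above can be reproduced with a toy first-fit allocator that never lets a record cross an aligned group of four blocks. This is an illustrative model only, not the real allocator in the block-files code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allocates `n` contiguous blocks (1 <= n <= 4) inside a single aligned
// group of four, first-fit over the bitmap. Returns the first block index,
// or -1 when no group can hold the record.
int Allocate(std::vector<bool>& used, int n) {
  for (size_t group = 0; group + 4 <= used.size(); group += 4) {
    int run = 0;  // length of the current free run inside this group
    for (size_t i = group; i < group + 4; ++i) {
      run = used[i] ? 0 : run + 1;
      if (run == n) {
        size_t start = i + 1 - n;
        for (size_t j = start; j <= i; ++j) used[j] = true;
        return static_cast<int>(start);
      }
    }
  }
  return -1;
}
```

Starting from state A (block 0 used), a four-block request skips the partially used first group and lands on blocks 4-7 (state B); a later two-block request fits in blocks 1-2 (state C).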

There are a couple of fields in the header that help the process of allocating space for a new record. The empty field stores counters of available space per
block type, and hints stores the last scanned location per block type. In this context, a block type is the number of blocks requested for the allocation. When
a file is empty, it can store up to X records of four blocks each (X being close to 64K / 4). After a record of one block is allocated, it will be able to store
X-1 records of four blocks, and one record of three blocks. If after that a record of two blocks is allocated, the new capacity is X-1 records of four blocks
and one record of one block, because the space that was available to store the record of three blocks was used to store the new record (two blocks), leaving
one empty block.
It is important to realize that once a record has been allocated, its size cannot be increased. The only way to grow a record that was already saved is to read it, 
then delete it from the file and allocate a new record of the required size.

From the reliability point of view, having the header memory mapped allows us to detect scenarios where the application crashes while we are in the middle of
modifying the allocation bitmap. The updating field of the header provides a way to signal that we are updating something in the header, so that if the field
is set when the file is opened, the header must be checked for consistency.
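The idea can be sketched in a few lines. The struct below is a simplified stand-in for disk_cache::BlockFileHeader, and the bitmap bookkeeping is faked with a counter:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for disk_cache::BlockFileHeader: the `updating` flag
// is raised before touching the allocation bitmap and lowered afterwards.
// Because the header is memory mapped, the flag reaches disk even if the
// application dies mid-update.
struct BlockFileHeader {
  uint32_t updating = 0;
  uint32_t allocated = 0;  // toy stand-in for the real bitmap bookkeeping
};

void AllocateBlock(BlockFileHeader& header) {
  header.updating = 1;  // signal: a header modification is in progress
  header.allocated += 1;  // ... the bitmap would be modified here ...
  header.updating = 0;  // modification finished cleanly
}

// On open: a still-set flag means the previous run crashed mid-update, so
// the header must be verified for consistency before use.
bool NeedsConsistencyCheck(const BlockFileHeader& header) {
  return header.updating != 0;
}
```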


3.4 Cache Entry

An entry is basically a complete entity stored by the cache. It is divided into two main parts: the disk_cache::EntryStore stores the part that fully identifies the
entry and doesn't change very often, and the disk_cache::RankingsNode stores the part that changes often and is used to implement the eviction algorithm.

The RankingsNode is always the same size (36 bytes), and it is stored in a dedicated type of block-file (with blocks of 36 bytes). On the other hand, the
EntryStore can use from one to four blocks of 256 bytes each, depending on the actual size of the key (the name of the resource). In case the key is too long
to be stored directly as part of the EntryStore structure, the appropriate storage will be allocated and the address of the key will be saved in the long_key
field, instead of the full key.
The other things stored within the EntryStore are the addresses of the actual data streams associated with this entry, the key's hash, and a pointer to the next entry
that has the same low-order hash bits (and thus shares the same position on the index table).
Whenever an entry is in use, its RankingsNode is marked as in-use so that when a new entry is read from disk we can tell whether it was properly closed or not.
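The key-placement rule above can be sketched as follows. The inline capacity constant is an assumption for illustration only; the exact figure (256-byte blocks minus the fixed EntryStore fields) lives in disk_format.h:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed inline capacity: up to four 256-byte EntryStore blocks, minus an
// assumed ~96 bytes of fixed fields (stream addresses, hash, next pointer).
constexpr size_t kInlineKeyCapacity = 4 * 256 - 96;

// A short key is stored inline in the EntryStore record; a longer key gets
// separately allocated storage whose cache address goes in long_key.
bool KeyIsStoredInline(const std::string& key) {
  return key.size() <= kInlineKeyCapacity;
}
```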


3.5 The Big Picture

This diagram shows a disk cache with 7 files on disk: the index file, 5 block-files and one separate file. data_1 and data_4 are chained together so they store
 blocks of the same size (256 bytes), while data_2 stores blocks of 1KB and data_3 stores blocks of 4 KB. The depicted entry has its key stored outside the
 EntryStore structure, and given that it uses two blocks, it must be between one and two kilobytes. This entry also has two data streams, one for the HTTP 
headers (less than 256 bytes) and another one for the actual payload (more than 16 KB so it lives on a dedicated file). All blue arrows indicate that a cache
 address is used to locate another piece of data.


4. Implementation Notes

Chromium has two different implementations of the cache interfaces: while the main one is used to store info on a given disk, there is also a very simple
 implementation that doesn’t use a hard drive at all, and stores everything in memory. The in-memory implementation is used for the Incognito mode so 
that even if the application crashes it will be quite difficult to extract any information that was accessed while browsing in that mode.
There are a few different types of caches (see net/base/cache_type.h), mainly defined by their intended use: there is a media specific cache, the general 
purpose disk cache, and another one that serves as the back end storage for AppCache, in addition to the in-memory type already mentioned. All types of 
caches behave in a similar way, with the exception that the eviction algorithm used by the general purpose cache is not the same LRU used by the others.
The regular cache implementation is located in disk_cache/backend_impl.cc and disk_cache/entry_impl.cc. Most of the files in that folder are actually related
 to the main implementation, except for a few that implement the in-memory cache: disk_cache/mem_backend_impl.cc and disk_cache/mem_entry_impl.cc.


4.1 Lower Interface

The lower interface of the disk cache (the one that deals with the OS) is handled mostly by two files: disk_cache/file.h and disk_cache/mapped_file.h, with separate 
implementations per operating system. The most notable requirement is support for partially memory-mapped files, but asynchronous interfaces and a decent file 
system level cache go a long way towards performance (we don’t want to replicate the work of the OS).
To deal with all the details about block-file access, the disk cache keeps a single object that deals with all of them: a disk_cache::BlockFiles object. This object enables 
allocation and deletion of disk space, and provides disk_cache::File object pointers to other people so that they can access the information that they need.
A StorageBlock is a simple template that represents information stored on a block-file, and it provides methods to load and store the required data from disk (based 
on the record’s cache address). We have two instantiations of the template, one for dealing with the EntryStore structure and another one for dealing with the 
RankingsNode structure. With this template, it is common to find code like entry->rankings()->Store().
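The StorageBlock pattern can be sketched as below. The real template is instantiated for EntryStore and RankingsNode and talks to a disk_cache::File; this stand-in fakes the "disk" with an in-memory map keyed by cache address, purely for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Fake backing store: cache address -> raw record bytes.
using FakeDisk = std::map<uint32_t, std::vector<uint8_t>>;

// Binds a fixed-size record type to its cache address and knows how to
// load and store it, mirroring the Load()/Store() interface described above.
template <typename T>
class StorageBlock {
 public:
  StorageBlock(FakeDisk* disk, uint32_t address)
      : disk_(disk), address_(address) {}

  T* Data() { return &data_; }

  bool Load() {
    auto it = disk_->find(address_);
    if (it == disk_->end() || it->second.size() != sizeof(T)) return false;
    std::memcpy(&data_, it->second.data(), sizeof(T));
    return true;
  }

  bool Store() {
    std::vector<uint8_t> bytes(sizeof(T));
    std::memcpy(bytes.data(), &data_, sizeof(T));
    (*disk_)[address_] = bytes;
    return true;
  }

 private:
  FakeDisk* disk_;
  uint32_t address_;
  T data_{};
};

// Stand-in for the 36-byte RankingsNode record.
struct RankingsNode36 { uint8_t raw[36]; };
```

With this shape, code like entry->rankings()->Store() is just a Store() call on the StorageBlock instantiation that wraps the entry's RankingsNode.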

4.2 Eviction

Support for the eviction algorithm of the cache is implemented on disk_cache/rankings (and mem_rankings for the in-memory one), and the eviction itself is implemented
 on disk_cache/eviction. Right now we have a simple Least Recently Used algorithm that just starts deleting old entries once a certain limit is exceeded, and a second 
algorithm that takes reuse and age into account before evicting an entry. We also have the concept of transaction when one of the lists is modified so that if the 
application crashes in the middle of inserting or removing an entry, next time we will roll the change back or forward so that the list is always consistent.
In order to support reuse as a factor for evictions, we keep multiple lists of entries depending on their type: not reused, with low reuse and highly reused. We also have
 a list of recently evicted entries so that if we see them again we can adjust their eviction next time we need the space. There is a time-target for each list and we try to 
avoid eviction of entries without having the chance to see them again. If the cache uses only LRU, all lists except the not-reused are empty.
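The reuse-based split can be sketched as a simple classifier. The three-list division comes from the text above; the threshold values here are invented for illustration:

```cpp
#include <cassert>

// The three reuse tiers described above: entries move between lists as
// their reuse count grows. Threshold values are assumptions, not the real
// constants from the eviction code.
enum class List { kNoUse, kLowUse, kHighUse };

List ListFor(int reuse_count) {
  if (reuse_count == 0) return List::kNoUse;
  return reuse_count < 4 ? List::kLowUse : List::kHighUse;
}
```

When only LRU is in effect, every entry stays in the not-reused list and the other lists remain empty.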


4.3 Buffering

When we start writing data for a new entry we allocate a buffer of 16 KB where we keep the first part of the data. If the total length is less than the buffer size, we 
only write the information to disk when the entry is closed; however, if we receive more than 16 KB, then we start growing that buffer until we reach a limit for this 
stream (1 MB), or for the total size of all the buffers that we have. This scheme gives us immediate response when receiving small entries (we just copy the data), and 
works well with the fact that the total record size is required in order to create a new cache address for it. It also minimizes the number of writes to disk so it improves 
performance and reduces disk fragmentation.
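The scheme can be modeled with a toy per-stream buffer. The 16 KB and 1 MB thresholds come from the text; the doubling growth policy and the class itself are assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Toy model: writes accumulate in memory, the buffer grows from 16 KB up
// to a 1 MB per-stream limit, and data reaches "disk" (flushed_) only when
// the limit is exceeded or the entry is closed.
class StreamBuffer {
 public:
  static constexpr size_t kInitialSize = 16 * 1024;
  static constexpr size_t kStreamLimit = 1024 * 1024;

  // Returns true while the data is still held in memory only.
  bool Write(size_t bytes) {
    buffered_ += bytes;
    while (capacity_ < buffered_ && capacity_ < kStreamLimit) capacity_ *= 2;
    if (buffered_ > kStreamLimit) {  // past the limit: write through to disk
      flushed_ += buffered_;
      buffered_ = 0;
      return false;
    }
    return true;
  }

  // Small entries are only written out when the entry is closed.
  void Close() { flushed_ += buffered_; buffered_ = 0; }

  size_t flushed() const { return flushed_; }

 private:
  size_t capacity_ = kInitialSize;
  size_t buffered_ = 0;
  size_t flushed_ = 0;
};
```

Buffering the whole record also means its total size is known before a cache address has to be allocated for it, which is what the text above relies on.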


4.4 Deleting Entries

To delete entries from the cache, one of the Doom*() methods can be used. All that they do is to mark a given entry to be deleted once all users have closed the entry. 
Of course, this means that it is possible to open a given entry multiple times (and read and write to it simultaneously). When an entry is doomed (marked for deletion), 
it is removed from the index table so that any attempt to open it again will fail (and creating the entry will succeed), even when an already created Entry object can still 
be used to read and write the old entry.
When two objects are open at the same time, both users will see what the other is doing with the entry (there is only one “real” entry, and they see a consistent state
 of it). That’s even true if the entry is doomed after it was open twice. However, once the entry is created after it was doomed, we end up with basically two separate 
entries, one for the old, doomed entry, and another one for the newly created one.


4.5 Enumerations

A good example of enumerating the entries stored by the cache is located at src/net/url_request/url_request_view_cache_job.cc . It should be noted that this interface
 is not making any statements about the order in which the entries are enumerated, so it is not a good idea to make assumptions about it. Also, it could take a long time 
to go through all the info stored on disk.

4.6 Sparse Data

An entry can be used to store sparse data instead of a single, continuous stream. In this case, only two streams can be stored by the entry, a regular one
 (the first one), and a sparse one (the second one). Internally, the cache will distribute sparse chunks among a set of dedicated entries (child entries) that are
 linked together from the main entry (the parent entry). Each child entry will store a particular range of the sparse data, and inside that range we could have
 "holes" that have not been written yet.
 This design allows the user to store files of any length (even bigger than the total size of the cache), while the cache is in fact simultaneously evicting parts of that
 file, according to the regular eviction policy. Most of this logic is implemented on disk_cache/sparse_control (and disk_cache/mem_entry_impl for the in-memory case).
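The parent-to-child mapping can be sketched as fixed-size ranges: each child entry covers one range of the sparse stream, and an offset maps to a (child, offset-within-child) pair. The 1 MB range size below is an assumption for illustration, not the actual constant from sparse_control:

```cpp
#include <cassert>
#include <cstdint>

// Assumed size of the sparse-data range covered by one child entry.
constexpr int64_t kChildSize = 1024 * 1024;

// Which child entry holds the byte at `offset` of the sparse stream.
int64_t ChildFor(int64_t offset) { return offset / kChildSize; }

// Where that byte lives inside the child's own range.
int64_t OffsetInChild(int64_t offset) { return offset % kChildSize; }
```

Ranges that were never written simply have no child entry (or holes inside one), which is how files larger than the cache itself can be partially kept while eviction drops individual children.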


4.7 Dedicated Thread

We have a dedicated thread to perform most of the work, while the public API is called on the regular network thread (the browser's IO thread).

The reason for this dedicated thread is to be able to remove any potentially blocking call from the IO thread, because that thread serves the IPC 
messages with all the renderer and plugin processes: even if the browser's UI remains responsive when the IO thread is blocked, there is no way to 
talk to any renderer so tabs look unresponsive. On the other hand, if the computer's IO subsystem is under heavy load, any disk access can block for a long time.

Note that it may be possible to extend the use of asynchronous IO and just keep using the same thread. However, we are not really using asynchronous
 IO for Posix (due to compatibility issues), and even in Windows, not every operation can be performed asynchronously; for instance, opening and closing 
a file are always synchronous operations, so they are subject to significant delays under the proper set of circumstances.


Another thing to keep in mind is that we tend to perform a large number of IO operations, knowing that most of the time they just end up being completed by 
the system's cache. It would have been possible to use asynchronous operations all the time, but the code would have been much harder to understand because
 that means a lot of very fragmented state machines. And of course that doesn't solve the problem with Open/Close.

As a result, we have a mechanism to post tasks from the main thread (IO thread), to a background thread (Cache thread), and back, and we forward most of the 
API to the actual implementation that runs on the background thread. See disk_cache/in_flight_io and disk_cache/in_flight_backend_io. There are a few methods 
that are not forwarded to the dedicated thread, mostly because they don't interact with the files, and only provide state information. There is no locking to access
 the cache, so these methods are generally racing with actual modifications, but that non-racy guarantee is not made by the API. For example, getting the size of 
a data stream (entry::GetDataSize()) is racing with any pending WriteData operation, so it may return the value before or after the write completes.

Note that we have multiple instances of disk-caches, and they all have the same background thread.


5. Data Integrity

There is a balance to achieve between performance and crash resilience. At one extreme, every unexpected failure will lead to unrecoverable corrupt information, and at the other
extreme, every action has to be flushed to disk before moving on to be able to guarantee the correct ordering of operations. We didn't want to add the complexity of a journaling
system given that the data stored by the cache is not critical by definition, and doing that implies some performance degradation.

The current system relies heavily on the presence of an OS-wide file system cache that provides adequate performance characteristics at the price of losing some deterministic
guarantees about when the data actually reaches the disk (we just know that at some point, some part of the OS will actually decide that it is time to write the information to disk,
but not if page X will get to disk before page Y).

Some critical parts of the system are directly memory mapped so that, besides providing optimum performance, even if the application crashes the latest state will be flushed to
disk by the system. Of course, if the computer crashes we'll end up in a pretty bad state because we don't know if some part of the information reached disk or not (each memory
page can be in a different state).

The most common problem if the system crashes is that the lists used by the eviction algorithm will be corrupt because some pages will have reached disk while others will effectively
be in a "previous" state, still linking to entries that were removed, etc. In this case, the corruption will not be detected at startup (although individual dirty entries will be detected and
handled correctly), but at some point we'll find out and proceed to discard the whole cache. It could be possible to start "saving" individual good entries from the cache, but the benefit
is probably not worth the increased complexity.