The Google File System: Part 2, Design Overview

2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. 
We alluded to some key observations earlier and now lay out our assumptions in more detail.

(1)
The system is built from many inexpensive commodity components that often fail. 
It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
(2)
The system stores a modest number of large files. 
We expect a few million files, each typically 100 MB or larger in size. 
Multi-GB files are the common case and should be managed efficiently.
Small files must be supported, but we need not optimize for them.
(3)
The workloads primarily consist of two kinds of reads:
large streaming reads and small random reads. 
In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more.
Successive operations from the same client often read through a contiguous region of a file. 
A small random read typically reads a few KBs at some arbitrary offset. 
Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
(4)
The workloads also have many large, sequential writes that append data to files. 
Typical operation sizes are similar to those for reads. 
Once written, files are seldom modified again. 
Small writes at arbitrary positions in a file are supported but do not have to be efficient.
(5)
The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. 
Our files are often used as producer-consumer queues or for many-way merging. 
Hundreds of producers, running one per machine, will concurrently append to a file. 
Atomicity with minimal synchronization overhead is essential. 
The file may be read later, or a consumer may be reading through the file simultaneously.
(6)
High sustained bandwidth is more important than low latency. 
Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.

2.2 Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. 
Files are organized hierarchically in directories and identified by path names. 
We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. 
Snapshot creates a copy of a file or a directory tree at low cost. 
Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. 
It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. 
We have found these types of files to be invaluable in building large distributed applications. 
Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
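
As an illustrative sketch only (GFS's real client library is not public, and every class and method name below is hypothetical), the interface described above could be exposed to applications roughly as follows:

    # Hypothetical client-side stubs mirroring the operations listed above.
    class GFSClient:
        def create(self, path: str) -> None: ...
        def delete(self, path: str) -> None: ...
        def open(self, path: str) -> "GFSFile": ...
        def snapshot(self, src_path: str, dst_path: str) -> None:
            """Copy a file or directory tree at low cost (Section 3.4)."""

    class GFSFile:
        def read(self, offset: int, length: int) -> bytes: ...
        def write(self, offset: int, data: bytes) -> None: ...
        def record_append(self, data: bytes) -> int:
            """Append atomically at an offset GFS chooses; return that offset (Section 3.3)."""
        def close(self) -> None: ...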

2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. 


Each of these is typically a commodity Linux machine running a user-level server process. 
It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. 
Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. 
For reliability, each chunk is replicated on multiple chunkservers. 
By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. 
This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. 
The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. 
Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. 
We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data.
Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. 
Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.)
Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
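
The chunkserver's side of this arrangement can be sketched as follows; the directory layout, file naming, and function names are illustrative assumptions rather than GFS's actual on-disk format:

    import os

    CHUNK_DIR = "/var/gfs/chunks"        # hypothetical local directory

    def chunk_path(handle: int) -> str:
        # the 64-bit chunk handle, rendered as fixed-width hex, names the local Linux file
        return os.path.join(CHUNK_DIR, f"{handle:016x}")

    def read_chunk(handle: int, offset: int, length: int) -> bytes:
        with open(chunk_path(handle), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def write_chunk(handle: int, offset: int, data: bytes) -> None:
        # assumes the chunk file already exists; a real chunkserver also creates
        # chunks on demand and checksums their contents (Section 5.2)
        with open(chunk_path(handle), "r+b") as f:
            f.seek(offset)
            f.write(data)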

2.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. 
However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. 
Clients never read and write file data through the master. 

Instead, a client asks the master which chunkservers it should contact. 
It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with reference to Figure 1. 
First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
Then, it sends the master a request containing the file name and chunk index. 
The master replies with the corresponding chunk handle and locations of the replicas. 
The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. 
The request specifies the chunk handle and a byte range within that chunk. 
Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.
In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. 
This extra information sidesteps several future client-master interactions at practically no extra cost.
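
A minimal sketch of this read path, assuming the 64 MB chunk size of Section 2.5 and stand-in master and replica calls, looks like this:

    CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB, Section 2.5

    chunk_cache = {}     # (file name, chunk index) -> (chunk handle, replica locations)

    def read(file_name, offset, length, master, pick_replica):
        # translate the application-specified byte offset into a chunk index
        chunk_index = offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        if key not in chunk_cache:
            # one request to the master; the reply (and any extra chunks the
            # master volunteers) is cached under (file name, chunk index)
            chunk_cache[key] = master.lookup(file_name, chunk_index)
        handle, replicas = chunk_cache[key]
        replica = pick_replica(replicas)          # e.g. the closest one
        # a read spanning a chunk boundary would repeat these steps per chunk
        return replica.read(handle, offset % CHUNK_SIZE, length)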

2.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. 
Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.
Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size offers several important advantages.
First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. 
The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. 
Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. 
Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. 
Third, it reduces the size of the metadata stored on the master. 
This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
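
To make the first advantage concrete, here is a back-of-the-envelope comparison; the 1 TB file size and the smaller 4 MB alternative are assumptions chosen purely for illustration:

    MB, TB = 1024 ** 2, 1024 ** 4
    file_size = 1 * TB                            # assumed file size

    for chunk_size in (4 * MB, 64 * MB):          # a smaller hypothetical size vs GFS's 64 MB
        lookups = file_size // chunk_size         # one location lookup per chunk
        print(f"{chunk_size // MB:>3} MB chunks -> {lookups:,} chunk lookups")

    # prints:
    #   4 MB chunks -> 262,144 chunk lookups
    #  64 MB chunks -> 16,384 chunk lookups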
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. 
A small file consists of a small number of chunks, perhaps just one. 
The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. 
In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: 
an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. 
The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. 
We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. 
A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata
The master stores three major types of metadata: 
the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. 
The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. 
Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. 
The master does not store chunk location information persistently. 
Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
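
These three kinds of metadata can be pictured as the following maps (the field names are illustrative); only the first two are made persistent through the operation log:

    from dataclasses import dataclass, field

    @dataclass
    class MasterMetadata:
        # 1. file and chunk namespaces, with access control info (persisted via the operation log)
        namespace: dict = field(default_factory=dict)        # path -> file attributes
        # 2. mapping from files to chunks (persisted via the operation log)
        file_chunks: dict = field(default_factory=dict)      # path -> list of 64-bit chunk handles
        # 3. current replica locations (NOT persisted; polled from chunkservers
        #    at master startup and whenever a chunkserver joins, Section 2.6.2)
        chunk_locations: dict = field(default_factory=dict)  # chunk handle -> set of chunkservers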

2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. 
Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background.
This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. 
Sections 4.3 and 4.4 will discuss these activities further.

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. 
This is not a serious limitation in practice. 
The master maintains less than 64 bytes of metadata for each 64 MB chunk. 
Most chunks are full because most files contain many chunks, only the last of which may be partially filled. 
Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
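
The arithmetic behind this claim, applied to an assumed 1 PB of chunk data, works out roughly as follows:

    MB, PB = 1024 ** 2, 1024 ** 5
    CHUNK_SIZE = 64 * MB
    META_PER_CHUNK = 64                  # bytes, the paper's stated upper bound

    stored = 1 * PB                      # assumed amount of (full-chunk) file data
    chunks = stored // CHUNK_SIZE
    print(f"{chunks:,} chunks -> {chunks * META_PER_CHUNK // MB:,} MB of chunk metadata")
    # 16,777,216 chunks -> 1,024 MB of chunk metadata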

2.6.2 Chunk Locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. 
It simply polls chunkservers for that information at startup. 
The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. 
This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. 
In a cluster with hundreds of servers, these events happen all too often.
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. 
There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause
chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
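
A minimal sketch of this approach, with assumed map and message shapes, is shown below; whatever a chunkserver reports is simply taken as the truth:

    chunk_locations = {}                 # chunk handle -> set of chunkserver ids

    def handle_chunk_report(chunkserver_id, reported_handles):
        """Process a chunkserver's list of chunks (sent at startup and polled thereafter)."""
        # the chunkserver has the final word: drop whatever it claimed before...
        for holders in chunk_locations.values():
            holders.discard(chunkserver_id)
        # ...and record only what it reports now; chunks on a failed disk or a
        # renamed server simply stop being reported and vanish from the map
        for handle in reported_handles:
            chunk_locations.setdefault(handle, set()).add(chunkserver_id)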

2.6.3 Operation Log
The operation log contains a historical record of critical metadata changes. 
It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. 
Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. 
Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. 
Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. 
The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
The master recovers its file system state by replaying the operation log. 
To minimize startup time, we must keep the log small. 
The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. 
The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. 
This further speeds up recovery and improves availability.
Because building a checkpoint can take a while, the master’s internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. 
The master switches to a new log file and creates the new checkpoint in a separate thread. 
The new checkpoint includes all mutations before the switch. 
It can be created in a minute or so for a cluster with a few million files. 
When completed, it is written to disk both locally and remotely.
Recovery needs only the latest complete checkpoint and subsequent log files. 
Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. 
A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
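
Recovery from a checkpoint plus log replay can be sketched as follows. The JSON file formats and field names are purely illustrative: GFS's checkpoint is a memory-mappable B-tree and its log records are binary.

    import json, os

    def recover(checkpoint_path, log_path):
        # 1. load the latest complete checkpoint, if one exists
        state, applied = {}, 0
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                snap = json.load(f)
            state, applied = snap["state"], snap["records_applied"]
        # 2. replay only the log records written after that checkpoint
        if os.path.exists(log_path):
            with open(log_path) as f:
                for i, line in enumerate(f):
                    if i < applied:
                        continue
                    mutation = json.loads(line)          # e.g. {"op": "create", "path": "/a"}
                    state[mutation["path"]] = mutation   # stand-in for applying the mutation
        return state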

2.7 Consistency Model
GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. 
We now discuss GFS’s guarantees and what they mean to applications. 
We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

2.7.1 Guarantees by GFS
File namespace mutations (e.g., file creation) are atomic.
They are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1);
the master’s operation log defines a global total order of these operations (Section 2.6.3).
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. 

Table 1 summarizes the result.

                          Write                        Record Append
  Serial success          defined                      defined interspersed with inconsistent
  Concurrent successes    consistent but undefined     defined interspersed with inconsistent
  Failure                 inconsistent                 inconsistent
Table 1: File region state after mutation.

A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. 
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. 
When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): 
all clients will always see what the mutation has written. 
Concurrent successful mutations leave the region undefined but consistent: 
all clients see the same data, but it may not reflect what any one mutation has written. 
Typically, it consists of mingled fragments from multiple mutations. 
A failed mutation makes the region inconsistent (hence also undefined): 
different clients may see different data at different times. 
We describe below how our applications can distinguish defined regions from undefined regions. 
The applications do not need to further distinguish between different kinds of undefined regions.
Data mutations may be writes or record appends. 
A write causes data to be written at an application-specified file offset. 
A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). 
(In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) 
The offset is returned to the client and marks the beginning of a defined region that contains the record.
In addition, GFS may insert padding or record duplicates in between. 
They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. 
GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5).
Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. 
They are garbage collected at the earliest opportunity.
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. 
This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. 
Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. 
When a reader retries and contacts the master, it will immediately get current chunk locations.
Long after a successful mutation, component failures can of course still corrupt or destroy data. 
GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming
(Section 5.2). 
Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). 
A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. 
Even in this case, it becomes unavailable, not corrupted: 
applications receive clear errors rather than corrupt data.
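
Point (b) above can be sketched with illustrative data structures as follows; Section 4.5 describes the actual version-number mechanism:

    current_version = {}      # chunk handle -> latest version number known to the master
    replica_version = {}      # (chunk handle, chunkserver id) -> version held by that replica

    def live_replicas(handle, locations):
        """Return only the replica locations that are up to date for this chunk."""
        latest = current_version[handle]
        return [cs for cs in locations
                if replica_version.get((handle, cs)) == latest]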

2.7.2 Implications for Applications
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: 
relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
Practically all our applications mutate files by appending rather than overwriting. 
In one typical use, a writer generates a file from beginning to end. 
It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. 
Checkpoints may also include application-level checksums. 
Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. 
Regardless of consistency and concurrency issues, this approach has served us well. 
Appending is far more efficient and more resilient to application failures than random writes. 
Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.
In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. 

Record append’s append-at-least-once semantics preserves each writer’s output. 
Readers deal with the occasional padding and duplicates as follows. 
Each record prepared by the writer contains extra information like checksums so that its validity can be verified. 
A reader can identify and discard extra padding and record fragments using the checksums. 
If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. 
These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. 
With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
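
A minimal sketch of this record discipline, assuming a simple framing format of length, checksum, and unique record id (not GFS's actual format), might look like this:

    import struct, zlib

    def encode_record(record_id: int, payload: bytes) -> bytes:
        # writer side: prefix a unique id, then frame with length + checksum
        body = struct.pack(">Q", record_id) + payload
        return struct.pack(">II", len(body), zlib.crc32(body)) + body

    def decode_records(data: bytes):
        # reader side: verify each checksum, skip padding/fragments, drop duplicates
        seen, pos = set(), 0
        while pos + 8 <= len(data):
            length, crc = struct.unpack_from(">II", data, pos)
            body = data[pos + 8 : pos + 8 + length]
            if 8 <= length == len(body) and zlib.crc32(body) == crc:
                record_id = struct.unpack_from(">Q", body)[0]
                if record_id not in seen:                # filter the occasional duplicate
                    seen.add(record_id)
                    yield body[8:]
                pos += 8 + length
            else:
                pos += 1      # padded or corrupt region: resynchronize byte by byte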
