The Google File System: Part 3, SYSTEM INTERACTIONS

3. SYSTEM INTERACTIONS
We designed the system to minimize the master’s involvement in all operations. 
With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order
A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. 
Each mutation is performed at all the chunk’s replicas.
We use leases to maintain a consistent mutation order across replicas. 
The master grants a chunk lease to one of the replicas, which we call the primary. 
The primary picks a serial order for all mutations to the chunk. 
All replicas follow this order when applying mutations. 

Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.
The lease mechanism is designed to minimize management overhead at the master. 
A lease has an initial timeout of 60 seconds. 
However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. 
These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers.
The master may sometimes try to revoke a lease before it expires 
(e.g., when the master wants to disable mutations on a file that is being renamed). 
Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
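To make the lease bookkeeping concrete, here is a minimal master-side sketch in Python. The class, the method names, and the LEASE_TIMEOUT constant are hypothetical; only the rules come from the text above: a 60-second initial timeout, extensions piggybacked on HeartBeat messages while the chunk is being mutated, and no new grant until any outstanding lease has been revoked or has expired.

```python
import time

LEASE_TIMEOUT = 60.0  # initial lease timeout from the text; the name is illustrative

class ChunkLease:
    """Master-side bookkeeping for one chunk's lease (illustrative sketch)."""

    def __init__(self):
        self.primary = None     # chunkserver currently holding the lease, if any
        self.expires_at = 0.0   # time at which the current lease lapses

    def grant(self, replica, now=None):
        """Grant the lease to `replica` unless an unexpired lease is outstanding."""
        now = time.time() if now is None else now
        if self.primary is not None and now < self.expires_at:
            return self.primary          # someone else still holds a valid lease
        self.primary = replica
        self.expires_at = now + LEASE_TIMEOUT
        return replica

    def maybe_extend(self, replica, now=None):
        """Extension request piggybacked on a HeartBeat from the primary."""
        now = time.time() if now is None else now
        if replica == self.primary and now < self.expires_at:
            self.expires_at = now + LEASE_TIMEOUT
            return True
        return False

    def revoke(self):
        """Used, e.g., before disabling mutations on a file being renamed or snapshotted."""
        self.primary = None
        self.expires_at = 0.0
```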
In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.

1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. 
If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. 
The client caches this data for future mutations. 
It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas.
A client can do so in any order.
Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out.
By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary.
Section 3.2 discusses this further.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary.
The request identifies the data pushed earlier to all of the replicas.
The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization.
It applies the mutation to its own local state in serial number order.

5. The primary forwards the write request to all secondary replicas.
Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary indicating that they have completed the operation.

7. The primary replies to the client.
Any errors encountered at any of the replicas are reported to the client.
In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. 
(If it had failed at the primary, it would not have been assigned a serial number and forwarded.)
The client request is considered to have failed, and the modified region is left in an inconsistent state. 
Our client code handles such errors by retrying the failed mutation. 
It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. 
They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. 
This leaves the file region in a consistent but undefined state as noted in Section 2.7.
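The whole write path, including the retry behavior just described, can be summarized in a short client-side sketch. The RPC stubs (find_lease_holder, push_data, write) and the retry count are hypothetical stand-ins rather than GFS's real interfaces; the sketch only mirrors steps (1) through (7) above.

```python
MAX_INNER_RETRIES = 3   # hypothetical; the text only says "a few attempts"

def write_chunk(master, chunk_handle, data, lease_cache):
    """Illustrative client-side control flow for a write (steps 1-7)."""
    # Steps 1-2: ask the master for the primary and secondaries, caching the reply.
    if chunk_handle not in lease_cache:
        lease_cache[chunk_handle] = master.find_lease_holder(chunk_handle)
    primary, secondaries = lease_cache[chunk_handle]

    for _ in range(MAX_INNER_RETRIES):
        # Step 3: push the data to all replicas, in any order.
        pushed = all(r.push_data(chunk_handle, data)
                     for r in [primary] + secondaries)
        if pushed:
            # Step 4: send the write request to the primary, which assigns a
            # serial number; steps 5-7 (forwarding to the secondaries, their
            # acknowledgements, and the final reply) happen behind this call.
            if primary.write(chunk_handle):
                return True
        # A failed attempt may leave the region inconsistent; retry steps 3-7.

    # Fall back to restarting from the beginning of the write: drop the cached
    # lease holder so the next attempt asks the master again (steps 1-2).
    lease_cache.pop(chunk_handle, None)
    return False
```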

3.2 Data Flow


We decouple the flow of data from the flow of control to use the network efficiently. 
While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. 
Our goals are to fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.
To fully utilize each machine’s network bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). 
Thus, each machine’s full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.
To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the “closest” machine in the network topology that has not received it. 
Suppose the client is pushing data to chunkservers S1 through S4. 
It sends the data to the closest chunkserver, say S1. 
S1 forwards it to whichever of chunkservers S2 through S4 is closest to S1, say S2. 
Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. 
Our network topology is simple enough that “distances” can be accurately estimated from IP addresses.
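The paper does not spell out the distance metric, so the snippet below is only one hypothetical reading of "distances estimated from IP addresses": each forwarder picks, from the chunkservers that have not yet received the data, the one sharing the longest address prefix with it.

```python
import ipaddress

def next_hop(sender_ip, pending_ips):
    """Pick the 'closest' chunkserver still waiting for the data.

    Hypothetical metric: the longest shared bit prefix of the IPv4 addresses,
    which in a simply structured network approximates switch/rack locality.
    """
    sender = int(ipaddress.IPv4Address(sender_ip))

    def shared_prefix_bits(ip):
        return 32 - (sender ^ int(ipaddress.IPv4Address(ip))).bit_length()

    return max(pending_ips, key=shared_prefix_bits)

# S1 forwards to whichever of S2 through S4 is closest to it.
print(next_hop("10.1.2.11", ["10.1.2.17", "10.1.9.5", "10.3.0.2"]))  # -> 10.1.2.17
```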

Finally, we minimize latency by pipelining the data transfer over TCP connections. 
Once a chunkserver receives some data, it starts forwarding immediately. 
Pipelining is especially helpful to us because we use a switched network with full-duplex links. 
Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput and L is the latency to transfer bytes between two machines. 
Our network links are typically 100 Mbps (T), and L is far below 1 ms.
Therefore, 1 MB can ideally be distributed in about 80 ms.
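Plugging the paper's numbers into B/T + RL confirms this estimate; the replica count of three is an assumption here, and RL is nearly negligible next to B/T.

```python
# Ideal pipelined transfer time B/T + R*L, with no network congestion.
B = 1_000_000        # bytes to distribute (1 MB)
T = 100e6 / 8        # per-link throughput in bytes/s (100 Mbps)
L = 1e-3             # generous per-hop latency; the text says L is far below 1 ms
R = 3                # replicas in the chain (assumed typical replication factor)
print(f"{(B / T + R * L) * 1e3:.0f} ms")   # -> 83 ms, i.e. roughly the 80 ms quoted
```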


3.3 Atomic Record Appends
GFS provides an atomic append operation called record append. 
In a traditional write, the client specifies the offset at which data is to be written. 
Concurrent writes to the same region are not serializable: 
the region may end up containing data fragments from multiple clients. 
In a record append, however, the client specifies only the data. 
GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client. 
This is similar to writing to a file opened in O_APPEND mode in Unix without the race conditions when multiple writers do so concurrently.
Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently. 
Clients would need additional complicated and expensive synchronization, for example through a distributed lock manager, if they do so with traditional writes. 
In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.
Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. 
The client pushes the data to all replicas of the last chunk of the file. Then, it sends its request to the primary. 
The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). 
If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. 
(Record append is restricted to be at most one-fourth of the maximum chunk size to keep worst-case fragmentation at an acceptable level.) 
If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.
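A minimal sketch of that primary-side decision follows; the chunk and secondary interfaces, helper names, and padding bytes are assumptions, while the 64 MB limit, the one-fourth restriction, and the pad-then-retry behavior come from the text.

```python
MAX_CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks
MAX_APPEND_SIZE = MAX_CHUNK_SIZE // 4    # record appends limited to 1/4 of a chunk

def primary_record_append(chunk, record, secondaries):
    """Illustrative primary-side logic for record append."""
    assert len(record) <= MAX_APPEND_SIZE, "oversized appends are rejected up front"

    if chunk.size + len(record) > MAX_CHUNK_SIZE:
        # Pad this chunk to its maximum size everywhere and have the client
        # retry on the next chunk, so no record straddles a chunk boundary.
        offset = chunk.size
        padding = b"\0" * (MAX_CHUNK_SIZE - offset)   # padding contents are an assumption
        chunk.append(padding)
        for s in secondaries:
            s.append_at(chunk.handle, offset, padding)
        return "retry_next_chunk", None

    # Common case: append locally, then tell the secondaries to write the
    # record at exactly the same offset, and report that offset back.
    offset = chunk.size
    chunk.append(record)
    for s in secondaries:
        s.append_at(chunk.handle, offset, record)
    return "ok", offset
```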

If a record append fails at any replica, the client retries the operation. 
As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. 
GFS does not guarantee that all replicas are bytewise identical. 
It only guarantees that the data is written at least once as an atomic unit. 
This property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. 
Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunk even if a different replica later becomes the primary. 
In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). 
Our applications can deal with inconsistent regions as we discussed in Section 2.7.2.
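On the reader side, the filtering alluded to in Section 2.7.2 might look like the sketch below; the record framing (a per-record checksum and unique identifier added by the writer) is an assumption used only for illustration.

```python
import zlib

def valid_records(raw_records, seen_ids):
    """Yield each record's payload once, skipping padding, fragments, and duplicates.

    `raw_records` is assumed to be an iterable of (record_id, checksum, payload)
    tuples as framed by the writer; `seen_ids` is a set shared across chunks.
    """
    for record_id, checksum, payload in raw_records:
        if checksum != zlib.crc32(payload):
            continue                 # padding or a partial/corrupted fragment
        if record_id in seen_ids:
            continue                 # duplicate left behind by a retried append
        seen_ids.add(record_id)
        yield payload
```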

3.4 Snapshot
The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations. 
Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.
Like AFS [5], we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. 
This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. 
This will give the master an opportunity to create a new copy of the chunk first.
After the leases have been revoked or have expired, the master logs the operation to disk. 
It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. 
The newly created snapshot files point to the same chunks as the source files.
The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. 
The master notices that the reference count for chunk C is greater than one. 
It defers replying to the client request and instead picks a new chunk handle C’. 
It then asks each chunkserver that has a current replica of C to create a new chunk called C’. 
By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network (our disks are about three times as fast as our 100 Mb Ethernet links). 
From this point, request handling is no different from that for any chunk: 
the master grants one of the replicas a lease on the new chunk C’ and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.
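The copy-on-write step at the master can be sketched as below; master.chunks, new_chunk_handle, clone_chunk, register_chunk, and grant_lease are hypothetical helpers, and only the reference-count check, the local cloning on the same chunkservers, and the deferred lease grant follow the text.

```python
def handle_lease_request_after_snapshot(master, chunk_handle):
    """Illustrative master-side copy-on-write when a client wants to write chunk C."""
    chunk = master.chunks[chunk_handle]
    if chunk.refcount <= 1:
        return master.grant_lease(chunk_handle)   # not shared with any snapshot

    # Chunk C is still referenced by a snapshot: defer the client's request and
    # clone the chunk first under a fresh handle C'.
    new_handle = master.new_chunk_handle()
    for server in chunk.replica_locations:
        # Each chunkserver holding C creates C' locally, so the copy goes to
        # its own disk rather than across the network.
        server.clone_chunk(src=chunk_handle, dst=new_handle)

    chunk.refcount -= 1
    master.register_chunk(new_handle, replicas=chunk.replica_locations)
    # From here on, handling is the usual lease grant on the new chunk C'.
    return master.grant_lease(new_handle)
```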
