Part 2: CHAPTER 9 Consistency and Consensus

Questions

  • It turns out that there are deep connections between ordering, linearizability, and consensus. Clarify the relationship between them?
  • How do distributed transactions differ from single-node transactions? How are they implemented?
  • Understand the implementations of Google Spanner (transactions) and ZooKeeper (consensus)
  • Understand the CAP theorem

Resources

http://www.bailis.org/blog/linearizability-versus-serializability

What this chapter is about: In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant distributed systems.

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.

Summary

Consistency:

Linearizability: a stronger consistency model; it makes replicated data appear as though there were only a single copy, and makes all operations act on it atomically

  • Pros: easy to understand
  • Cons: sensitive to network problems; slower performance

Causality: unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model

  • Pros: not sensitive to network problems
  • Cons: more limited use cases

Consensus

Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant (able to make progress as long as a majority of nodes are working and reachable). They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way.

Definition: Deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable.

Use cases:

  • Linearizable compare-and-set registers
  • Atomic transaction commit
  • Total order broadcast
  • Locks and leases
  • Membership/coordination service
  • Uniqueness constraint

Zookeeper:

  • providing an “outsourced” consensus
  • failure detection
  • membership service

Consistency Guarantees

In a distributed database, data inconsistency will inevitably occur due to the network and other factors, so most replicated databases provide at least eventual consistency.

eventual consistency: the data on all replicas will eventually converge to the same value.

  1. this is a very weak guarantee—it doesn't say anything about when the replicas will converge
  2. hard to use and test: you need to be constantly aware of its limitations and not accidentally assume too much

stronger consistency: all replicas always hold consistent data

  • Cons: worse performance, less fault-tolerant
  • Pros: easier to use correctly

distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

Linearizability:

Three characteristics:

  • Recency guarantee
  • single operations on a single object
  • time dependency: operations always move forward in time

Definition:

Linearizability (atomic consistency) is a guarantee about single operations on single objects. It provides a real-time (i.e., wall-clock) guarantee on the behavior of a set of single operations (often reads and writes) on a single object.

Linearizability is a recency guarantee(once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again) on reads and writes of a register (an individual object). It doesn’t group operations together into transactions.

Vs Serializability:

linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.

[http://www.bailis.org/blog/linearizability-versus-serializability/]

Use cases:

  • Locking and leader election: They use consensus algorithms to implement linearizable operations in a fault-tolerant way
  • Constraints and uniqueness guarantees
  • Cross-channel timing dependencies
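As a concrete illustration of the uniqueness-guarantee use case, here is a minimal single-process sketch (class and method names are my own, not from the book) of a linearizable compare-and-set register; on a single node, a mutex is enough to make each operation atomic:

```python
import threading

class LinearizableRegister:
    """A single-node register whose operations are made atomic by a lock,
    illustrating the compare-and-set primitive that locking and
    uniqueness constraints are built on."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        # Atomically: set to `new` only if the current value == `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# Uniqueness constraint: the first client to CAS None -> its own name
# "claims" the username; all later attempts fail.
username = LinearizableRegister()
assert username.compare_and_set(None, "alice") is True
assert username.compare_and_set(None, "bob") is False  # already taken
```

In a distributed setting the same operation must be implemented on top of a consensus protocol to stay both linearizable and fault-tolerant.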

Implementing Linearizable Systems

The most common approach to making a system fault-tolerant is to use replication. How linearizable is each replication approach?

  • Single-leader replication (potentially linearizable): if you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable.

  • Consensus algorithms (linearizable): consensus protocols contain measures to prevent split brain and stale replicas.

  • Multi-leader replication (not linearizable): because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes.

  • Leaderless replication (probably not linearizable): it is sometimes claimed that you can obtain "strong consistency" by requiring quorum reads and writes (w + r > n), but this is not necessarily true.

Naively using quorums is not necessarily linearizable even when w + r > n is satisfied, for example:

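The scenario can be simulated in a few lines. This toy model (the replica/version layout is an assumption for illustration) uses n = 3, w = 2, r = 2, so w + r > n holds, yet a read that starts after another read has finished can still observe an older value:

```python
# n = 3 replicas; the quorum condition w + r > n holds with w = 2, r = 2.
replicas = [
    {"value": "old", "version": 1},
    {"value": "old", "version": 1},
    {"value": "old", "version": 1},
]

def quorum_read(indexes):
    """Read from r replicas and return the value with the highest version."""
    return max((replicas[i] for i in indexes), key=lambda r: r["version"])["value"]

# A write of "new" (version 2) is still in flight: so far it has only
# reached replica 0, because replication to each node happens independently.
replicas[0] = {"value": "new", "version": 2}

# Reader A's quorum {0, 1} overlaps the partial write and sees "new".
assert quorum_read([0, 1]) == "new"

# Reader B starts *after* A finished, yet its quorum {1, 2} holds only
# stale copies, so it reads "old" -- a violation of the recency
# guarantee, hence not linearizable even though w + r > n.
assert quorum_read([1, 2]) == "old"
```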

The Cost of Linearizability

The CAP theorem:
In general, network partitions cannot be avoided, so we can assume that the P in CAP always holds. The CAP theorem then tells us that the remaining C and A cannot both be achieved at the same time.

The formal definition of CAP is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, or nodes that are alive but disconnected from each other). It doesn't say anything about network delays, dead nodes, or other trade-offs.


Linearizability and network delays:
A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems.

Ordering Guarantees

It turns out that there are deep connections between ordering, linearizability, and consensus.

Ordering and Causality

Ordering helps preserve causality:

  • causal dependency: e.g., the order in which events depend on one another
  • happened before relationship
  • Cross-channel timing dependencies

Causality imposes an ordering on events. These chains of causally dependent operations define the causal order in the system—i.e., what happened before what.

The causal order is not a total order

Causality: operations that have no causal dependency on each other can execute concurrently, and such operations cannot be compared, so the causal order is a partial order.

Linearizability: In a linearizable system, we have a total order of operations. Therefore, according to this definition, there are no concurrent operations in a linearizable datastore.

Linearizability is stronger than causal consistency

The fact that linearizability implies causality is what makes linearizable systems simple to understand and appealing.

Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.

Capturing causal dependencies

In order to maintain causality, you need to know which operation happened before which other operation.

In order to determine causal dependencies, we need some way of describing the “knowledge” of a node in the system.

Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key

In order to determine the causal ordering, the database needs to know which version of the data was read by the application
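One common way to represent that "knowledge" is a version vector per value (version vectors are covered earlier in the book; this comparison function is my own sketch). Comparing two vectors shows directly why the causal order is only a partial order:

```python
def compare(v1, v2):
    """Compare two version vectors (dicts mapping node -> counter).
    Returns 'equal', 'before', 'after', or 'concurrent'. Causal order
    is only a partial order, so some pairs are incomparable."""
    keys = set(v1) | set(v2)
    le = all(v1.get(k, 0) <= v2.get(k, 0) for k in keys)  # v1 happened-before-or-equal v2
    ge = all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)  # v2 happened-before-or-equal v1
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"

# Causally ordered: the second write saw the first.
assert compare({"A": 1}, {"A": 2}) == "before"
# Neither vector dominates the other: the writes were concurrent.
assert compare({"A": 2, "B": 0}, {"A": 1, "B": 1}) == "concurrent"
```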

Sequence Number Ordering

Numbering events with a logical clock: we can use sequence numbers or timestamps to order events, and they provide a total order.

We can create sequence numbers in a total order that is consistent with causality: an event that happened earlier gets a smaller sequence number.

Lamport timestamps: guarantee causality


Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID). It provides a total ordering.

How it works:

  • [https://jameshfisher.com/2017/02/12/what-are-lamport-timestamps/]
  • [https://www.cnblogs.com/bangerlee/p/5448766.html]
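A minimal sketch of the mechanism described above (class and method names are my own):

```python
class LamportClock:
    """Lamport clock sketch: a timestamp is (counter, node_id).
    Comparing the tuples gives a total order consistent with causality;
    ties on the counter are broken arbitrarily by node ID."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # On a local event or before sending: increment, then stamp.
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, timestamp):
        # On receiving a message: advance the local counter to the
        # maximum seen so far, then tick.
        self.counter = max(self.counter, timestamp[0])
        return self.tick()

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()        # A stamps a message before sending it
t2 = b.receive(t1)   # B's timestamp must come after A's
assert t1 < t2       # the causal dependency survives in the total order
```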

Total Order Broadcast

This idea of knowing when your total order is finalized is captured in the topic of total order broadcast.

Total order broadcast is usually described as a protocol for exchanging messages between nodes. Two safety properties must always be satisfied:

  1. Reliable delivery: messages must be delivered to all nodes
  2. Totally ordered delivery: messages are delivered to every node in the same order.
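The two properties can be illustrated with a toy sequencer-based implementation (a deliberately non-fault-tolerant sketch of my own; real systems derive the order from a consensus protocol rather than a single sequencer):

```python
class SequencerTOB:
    """Toy total order broadcast: a single sequencer assigns a global
    sequence number to every message, and every node delivers messages
    in sequence-number order. The sequencer is a single point of
    failure, which is why real systems use consensus instead."""
    def __init__(self, node_names):
        self.seq = 0
        self.logs = {n: [] for n in node_names}

    def broadcast(self, msg):
        self.seq += 1
        # Deliver to every node (reliable delivery), appending in the
        # same global order everywhere (totally ordered delivery).
        for log in self.logs.values():
            log.append((self.seq, msg))

tob = SequencerTOB(["n1", "n2", "n3"])
tob.broadcast("set x=1")
tob.broadcast("set x=2")
# Every node's log is identical: same messages, same order.
assert tob.logs["n1"] == tob.logs["n2"] == tob.logs["n3"]
```

Applying such a log to a state machine on every node is exactly state machine replication.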

This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!

Distributed Transactions and Consensus

Atomic commit, i.e., guaranteeing the atomicity of a distributed transaction, relies on a consensus algorithm; 2PC is a kind of consensus algorithm that solves atomic commit.

Atomic Commit and Two-Phase Commit (2PC)

Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state.

From single-node to distributed atomic commit

Guarantee at the storage hardware level: it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.

Guarantee at the node level: a node must only commit once it is certain that all other nodes in the transaction are also going to commit. A transaction commit must be irrevocable.

Guarantee at the application level: however, from the database's point of view this is a separate transaction, and thus any cross-transaction correctness requirements are the application's problem.

Introduction to two-phase commit

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort.

In a distributed system, each node knows whether its own operation succeeded or failed, but it cannot know the outcome of operations on other nodes. When a transaction spans multiple nodes, in order to preserve the transaction's atomicity and consistency, a coordinator is introduced to collect the operation results of all participants and instruct them whether to actually commit those results or roll back.

As its name suggests, 2PC consists of two phases:

  • Voting phase: participants report the results of their operations to the coordinator;
  • Commit phase: after receiving the participants' responses, the coordinator decides the outcome based on the collected feedback and notifies each participant whether to commit or roll back.
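The two phases above can be sketched in a few lines (an in-memory toy of my own; a real implementation must force both participant votes and the coordinator's decision to disk before replying):

```python
class Participant:
    """Toy 2PC participant: votes in phase 1, then obeys the
    coordinator's decision in phase 2."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: a yes vote is a promise to commit if asked
        # (in reality this promise must be made durable first).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's irrevocable decision.
        self.state = decision

def two_phase_commit(participants):
    # Phase 1 (voting): collect votes from all participants.
    votes = [p.prepare() for p in participants]
    # The commit point: commit only if *every* participant voted yes.
    # (A real coordinator writes this decision to its log before phase 2.)
    decision = "committed" if all(votes) else "aborted"
    # Phase 2 (commit/abort): inform every participant of the outcome.
    for p in participants:
        p.finish(decision)
    return decision

ok = [Participant("db1"), Participant("db2")]
assert two_phase_commit(ok) == "committed"

bad = [Participant("db1"), Participant("db2", can_commit=False)]
assert two_phase_commit(bad) == "aborted"
assert all(p.state == "aborted" for p in bad)
```

Note what the sketch omits: if the coordinator crashes between the two phases, prepared participants are stuck "in doubt", which is exactly the failure mode discussed below.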


Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery [88], and the additional network round-trips.

Coordinator failure

2PC can become stuck waiting for the coordinator to recover.


The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don't have a commit record in the coordinator's log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.

Distributed Transactions in Practice

Two types of distributed transactions are often conflated:

  • Database-internal distributed transactions: work well as usual
  • Heterogeneous distributed transactions: more challenging

Exactly-once message processing

Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded.

Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol.
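When a shared atomic commit protocol is not available, a commonly used workaround (not part of this chapter's text) is to record the message ID and its side effects in the same database transaction, which makes redelivery harmless. A sketch using SQLite, with an assumed schema:

```python
import sqlite3

# Assumed schema: the "processed message" marker and the side effect
# live in the same database, committed in one transaction, so
# "message handled" and "effect applied" are atomic together.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (msg_id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('alice', 100)")
db.commit()

def process(msg_id, amount):
    try:
        with db:  # one transaction: both rows are written, or neither
            # The PRIMARY KEY on msg_id turns a redelivery into an error,
            # which rolls back the whole transaction.
            db.execute("INSERT INTO processed VALUES (?)", (msg_id,))
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'alice'",
                (amount,))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the effect is not applied twice

assert process(1, 10) is True
assert process(1, 10) is False   # retry of the same message is ignored
balance = db.execute(
    "SELECT balance FROM accounts WHERE name='alice'").fetchone()[0]
assert balance == 110
```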

XA transactions


The transaction coordinator implements the XA API.

Holding locks while in doubt

The database cannot release those locks until the transaction commits or aborts. Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt.

This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.

Recovering from coordinator failure

Orphaned in-doubt transactions (e.g., when the transaction log is lost) cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions; they can only be resolved manually by an administrator.

Many XA implementations have an emergency escape hatch called heuristic decisions: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator.

Limitations of distributed transactions

For XA transactions, the key realization is that the transaction coordinator is itself a kind of database (in which transaction outcomes are stored), and so it needs to be approached with the same care as any other important database.

Fault-Tolerant Consensus

Core idea: everyone decides on the same outcome, and once you have decided, you cannot change your mind. A consensus algorithm must satisfy the following properties:

  • Uniform agreement: no two nodes decide differently.
  • Integrity: no node decides twice.
  • Validity: if a node decides value v, then v was proposed by some node.
  • Termination: this captures the idea of fault tolerance: even if some nodes crash, a decision can still be reached.

Fault tolerance:

  • To guarantee termination, we must assume that once a node crashes, it suddenly disappears and never comes back, so that the algorithm does not wait forever for the node to recover.
  • To guarantee termination, a quorum can be used so that the algorithm makes progress even if some nodes crash.
  • The safety properties are always guaranteed, so even when termination cannot be achieved (e.g., because many nodes have crashed), the system will never make an incorrect decision.
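The quorum arithmetic behind the termination property is simple: any two majorities must overlap in at least one node, so two conflicting decisions can never both reach a majority. A quick sketch:

```python
def majority(n):
    """Smallest quorum size such that any two quorums share a node."""
    return n // 2 + 1

# With 5 nodes, progress is possible while up to 2 nodes are down.
assert majority(5) == 3
# An even cluster size buys no extra fault tolerance: 4 nodes still
# need 3 votes, so only 1 failure is tolerated.
assert majority(4) == 3
# Overlap: two majorities always intersect, preventing split decisions.
assert majority(5) + majority(5) > 5
```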

Consensus algorithm and total order broadcast

Consensus algorithms: VSR, Paxos, Raft, Zab…

These algorithms do not use the single-value consensus model above directly; instead, they decide on a sequence of values, which makes them total order broadcast algorithms.

So, total order broadcast is equivalent to repeated rounds of consensus (each consensus decision corresponding to one message delivery).

Example: [https://www.cnblogs.com/j-well/p/7061091.html]

How is the leader elected?

Consensus algorithms use a leader in some form or another, but they don't guarantee that the leader is unique, so the leader-election problem still has to be solved.

Since multiple nodes may believe themselves to be the leader, a node must confirm its leadership by collecting votes before each decision. Thus we have two rounds of voting: once to choose a leader, and a second time to vote on the leader's proposal.

Consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met.

Limitations of consensus

  • Every vote on a proposal is a kind of synchronous replication, which hurts performance.
  • Because majority votes are required, a minimum number of machines is needed (e.g., the remaining two out of three form a majority); if the network partitions, the nodes in the minority partition become unavailable.
  • Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can't just add or remove nodes in the cluster.
  • Consensus systems generally rely on timeouts to detect failed nodes, which can cause frequent leader elections when the network is unreliable.
  • Sometimes, consensus algorithms are particularly sensitive to network problems.

Membership and Coordination Services

ZooKeeper use cases:

  • Linearizable atomic operations
  • Total order broadcast
  • Failure detection
  • Change notifications