Consistency Algorithms: Paxos, Raft, and ZAB

Preface

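1. The general approach to fault tolerance in distributed systems is state machine replication.
2. A more precise name for a "distributed consistency algorithm" is a consensus algorithm for state machine replication.
3. Paxos is really a consensus algorithm. The eventual consistency of the system depends not only on reaching consensus but also on the behavior of the client.
4. Multiple nodes in a distributed system have to communicate, and there are two communication models: shared memory and message passing. The algorithms discussed below all assume message passing. The underlying assumption is that messages between processes may be lost, delayed, or duplicated, but are never corrupted. The algorithms below exist so that processes in such a system can agree on a value using message passing alone.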

I. Consistency Overview

Consistency models in current industrial use

1.1 Weak consistency (eventual consistency)

DNS（Domain Name System）
Gossip (the communication protocol used by Cassandra and Redis)

1.2 Strong consistency

There are roughly two approaches:

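1.2.1 Master-slave synchronization

Basic idea:
Synchronous master-slave replication:
1. The Master accepts a write request.
2. The Master replicates the log to the slaves.
3. The Master waits until all slaves have responded.

Problem:
If a single node fails, the Master blocks and the whole cluster becomes unavailable. Consistency is preserved, but availability drops sharply.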

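1.2.2 Majority quorum

Basic idea:
Every write must reach more than N/2 nodes, and every read must consult more than N/2 nodes, where N is the number of nodes. Any two such majorities share at least one node, so a read always reaches a node that holds the latest write.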
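
The overlap property behind the majority rule can be checked by brute force for small clusters. The following is a minimal sketch; the function names majority and quorums_always_overlap are illustrative and do not come from any library.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of nodes that is strictly greater than n / 2."""
    return n // 2 + 1


def quorums_always_overlap(n: int) -> bool:
    """Check by enumeration that any two majority quorums share a node."""
    size = majority(n)
    nodes = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(nodes, size)
        for b in combinations(nodes, size)
    )


for n in (3, 4, 5):
    print(n, majority(n), quorums_always_overlap(n))
```

For 3, 4, and 5 nodes the majority sizes are 2, 3, and 3, and in each case every pair of majorities intersects.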

Related algorithms:
Paxos
Raft (a Multi-Paxos-style protocol)
ZAB (a Multi-Paxos-style protocol)

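II. Paxos

Paxos is a message-passing consensus algorithm proposed by Leslie Lamport around 1990 (the paper describing it, The Part-Time Parliament, was published in 1998).
Paxos variants: Basic Paxos, Multi-Paxos, Fast Paxos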

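2.1 Basic Paxos

2.1.1 Roles

Client: an external role that issues requests. Comparable to a citizen.
Proposer: accepts Client requests, proposes them to the cluster, and mediates when conflicts occur. Comparable to a legislator who submits a bill on behalf of citizens.
Acceptor: votes on and accepts proposals. A proposal is finally accepted only when a quorum (usually a majority) is reached. Comparable to the legislature.
Learner: learns accepted proposals and serves as a backup; it has little effect on cluster consistency. Comparable to a record keeper.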

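2.1.2 Phases

1. Phase 1a: Prepare
The proposer creates a proposal numbered N, where N is greater than any proposal number this proposer has used before, and sends a Prepare request to a quorum of acceptors.
2. Phase 1b: Promise
If N is greater than every proposal number the acceptor has previously received, the acceptor promises to ignore lower-numbered proposals and reports any value it has already accepted; otherwise it rejects the request.
3. Phase 2a: Accept
If the proposer receives promises from a majority, it sends an Accept request containing the proposal number N and the proposal value (the value of the highest-numbered proposal reported in the promises if there is one, otherwise its own value).
4. Phase 2b: Accepted
If the acceptor has not received any proposal numbered higher than N in the meantime, it accepts the value; otherwise it ignores the request.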
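
The four phases can be sketched in a few lines of Python. This is a simplified, single-process illustration rather than a real implementation: messages are plain function calls that never get lost, and the names Acceptor, prepare, accept, and propose are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = 0                            # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None   # (number, value) last accepted

    def prepare(self, n: int):
        """Phase 1b: promise if n is higher than anything promised before."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n: int, value: str) -> str:
        """Phase 2b: accept unless a higher-numbered proposal was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one round (phases 1a to 2b); return the chosen value or None."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1a: send Prepare(n) and collect the promises.
    replies = [a.prepare(n) for a in acceptors]
    promises = [prev for kind, prev in replies if kind == "promise"]
    if len(promises) < quorum:
        return None
    # Adopt the value of the highest-numbered proposal already accepted, if any.
    previous = [p for p in promises if p is not None]
    if previous:
        value = max(previous, key=lambda p: p[0])[1]
    # Phase 2a: send Accept!(n, value) and count the Accepted replies.
    acks = [a.accept(n, value) for a in acceptors]
    return value if acks.count("accepted") >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "V"))  # V
print(propose(acceptors, 2, "W"))  # V, the already accepted value is kept
print(propose(acceptors, 1, "X"))  # None, the stale number is rejected
```

The second call shows why a later proposer cannot overwrite a chosen value: it has to adopt whatever the promises report.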

2.1.3 Basic flows

2.1.3.1 Normal flow

There is 1 Client, 1 Proposer, 3 Acceptors (i.e. the Quorum size is 3) and 2 Learners (represented by the 2 vertical lines).

This diagram represents the case of a first round, which is successful (i.e. no process in the network fails).

Client   Proposer      Acceptor     Learner  
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

2.1.3.2 One Acceptor fails

In the following diagram, one of the Acceptors in the Quorum fails, so the Quorum size becomes 2. In this case, 
the Basic Paxos protocol still succeeds.

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |          |  |  !       |  |  !! FAIL !!
   |         |<---------X--X          |  |  Promise(1,{Va, Vb, null})
   |         X--------->|->|          |  |  Accept!(1,V)
   |         |<---------X--X--------->|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |          |  |

2.1.3.3 One Learner fails

In the following case, one of the (redundant) Learners fails, but the Basic Paxos protocol still succeeds.

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |         |          |  |  |       |  !  !! FAIL !!
   |<---------------------------------X     Response
   |         |          |  |  |       |

2.1.3.4 One Proposer fails

In this case, a Proposer fails after proposing a value, but before the agreement is reached. Specifically, it fails in the middle of the Accept 
message, so only one Acceptor of the Quorum receives the value. Meanwhile, a new Leader (a Proposer) is elected (but this is not shown in detail).
Note that there are 2 rounds in this case (rounds proceed vertically, from the top to the bottom).

Client  Proposer        Acceptor     Learner
   |      |             |  |  |       |  |
   X----->|             |  |  |       |  |  Request
   |      X------------>|->|->|       |  |  Prepare(1)
   |      |<------------X--X--X       |  |  Promise(1,{Va, Vb, Vc})
   |      |             |  |  |       |  |
   |      |             |  |  |       |  |  !! Leader fails during broadcast !!
   |      X------------>|  |  |       |  |  Accept!(1,V)
   |      !             |  |  |       |  |
   |         |          |  |  |       |  |  !! NEW LEADER !!
   |         X--------->|->|->|       |  |  Prepare(2)
   |         |<---------X--X--X       |  |  Promise(2,{V, null, null})
   |         X--------->|->|->|       |  |  Accept!(2,V)
   |         |<---------X--X--X------>|->|  Accepted(2,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

2.1.4 Potential problems

2.1.4.1 Livelock (dueling proposers)

How the livelock arises:

The most complex case is when multiple Proposers believe themselves to be Leaders. For instance, the current leader may fail and later recover, 
but the other Proposers have already re-selected a new leader. The recovered leader has not learned this yet and attempts to begin one round in
conflict with the current leader. In the diagram below, 4 unsuccessful rounds are shown, but there could be more (as suggested at the bottom of
the diagram). 

Client   Leader         Acceptor     Learner
|      |             |  |  |       |  |
X----->|             |  |  |       |  |  Request
|      X------------>|->|->|       |  |  Prepare(1)
|      |<------------X--X--X       |  |  Promise(1,{null,null,null})
|      !             |  |  |       |  |  !! LEADER FAILS
|         |          |  |  |       |  |  !! NEW LEADER (knows last number was 1)
|         X--------->|->|->|       |  |  Prepare(2)
|         |<---------X--X--X       |  |  Promise(2,{null,null,null})
|      |  |          |  |  |       |  |  !! OLD LEADER recovers
|      |  |          |  |  |       |  |  !! OLD LEADER tries 2, denied
|      X------------>|->|->|       |  |  Prepare(2)
|      |<------------X--X--X       |  |  Nack(2)
|      |  |          |  |  |       |  |  !! OLD LEADER tries 3
|      X------------>|->|->|       |  |  Prepare(3)
|      |<------------X--X--X       |  |  Promise(3,{null,null,null})
|      |  |          |  |  |       |  |  !! NEW LEADER proposes, denied
|      |  X--------->|->|->|       |  |  Accept!(2,Va)
|      |  |<---------X--X--X       |  |  Nack(3)
|      |  |          |  |  |       |  |  !! NEW LEADER tries 4
|      |  X--------->|->|->|       |  |  Prepare(4)
|      |  |<---------X--X--X       |  |  Promise(4,{null,null,null})
|      |  |          |  |  |       |  |  !! OLD LEADER proposes, denied
|      X------------>|->|->|       |  |  Accept!(3,Vb)
|      |<------------X--X--X       |  |  Nack(4)
|      |  |          |  |  |       |  |  ... and so on ...

Solution: when a conflict occurs, the Proposer waits for a random timeout (typically a few seconds) before submitting its proposal again.
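
The random wait can be sketched as a small retry loop. The helper below is hypothetical: try_round stands for one full Prepare/Accept attempt that returns True on success, and the defaults for max_attempts and max_wait are assumptions, not values from any Paxos implementation.

```python
import random
import time


def propose_with_backoff(try_round, max_attempts=5, max_wait=3.0,
                         rng=random.Random(), sleep=time.sleep):
    """Call try_round() until it succeeds, sleeping a random time after each conflict.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if try_round():
            return attempt
        # Different proposers pick different waits, so one of them usually
        # finishes a whole round before the other starts again.
        sleep(rng.uniform(0, max_wait))
    return None


# Simulated proposer that loses two rounds and wins the third.
outcomes = iter([False, False, True])
waits = []
print(propose_with_backoff(lambda: next(outcomes), sleep=waits.append))  # 3
print(len(waits))  # 2
```

Passing sleep as a parameter keeps the sketch testable without actually waiting.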

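2.1.4.2 Hard to implement and inefficient (2 RTTs)

1. Basic Paxos is notoriously hard to understand and to implement.
2. Agreeing on the proposal number and then on the proposal (log) content takes two round trips (RTTs), which is inefficient.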

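2.2 Multi-Paxos

2.2.1 Roles

Fewer roles, simpler steps:
Basic Paxos suffers from livelock, whose root cause is having multiple Proposers, so Multi-Paxos introduces a new concept: the Leader. Basic Paxos is also inefficient because of its two RTTs, so Multi-Paxos combines the Leader role with an instance number I carried in each message (the round number I is included along with each value which is incremented in each round by the same Leader). As a result, the two-RTT exchange happens only when a Leader is being elected; every other request needs a single RTT.

Leader: the sole Proposer; all requests must go through the Leader.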
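
The saving can be made concrete with a small count of round trips. This is a back-of-the-envelope sketch that assumes no conflicts, no lost messages, and no retries; both function names are illustrative.

```python
def basic_paxos_round_trips(requests: int) -> int:
    """Every request runs Prepare/Promise and Accept/Accepted: 2 RTTs each."""
    return 2 * requests


def multi_paxos_round_trips(requests: int, leader_changes: int = 0) -> int:
    """One Prepare/Promise per elected Leader, then 1 RTT per request."""
    elections = 1 + leader_changes
    return elections + requests


print(basic_paxos_round_trips(100))   # 200
print(multi_paxos_round_trips(100))   # 101
```

With a stable Leader the cost per request approaches one round trip as the number of requests grows.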

2.2.2 Basic flows

2.2.2.1 Leader election flow

1. From the perspective of the Basic Paxos roles:

In the following diagram, only one instance (or "execution") of the basic Paxos protocol, with an initial Leader (a Proposer), 
is shown. Note that a Multi-Paxos consists of several instances of the basic Paxos protocol.

Client   Proposer      Acceptor     Learner
|         |          |  |  |       |  | --- First Request ---
X-------->|          |  |  |       |  |  Request
|         X--------->|->|->|       |  |  Prepare(N)
|         |<---------X--X--X       |  |  Promise(N,I,{Va,Vb,Vc})
|         X--------->|->|->|       |  |  Accept!(N,I,V)
|         |<---------X--X--X------>|->|  Accepted(N,I,V)
|<---------------------------------X--X  Response
|         |          |  |  |       |  |

where V = last of (Va, Vb, Vc).

2. From the perspective of the Multi-Paxos roles:
A common deployment of the Multi-Paxos consists in collapsing the role of the Proposers, Acceptors and Learners to "Servers". 
So, in the end, there are only "Clients" and "Servers".

Client      Servers
|         |  |  | --- First Request ---
X-------->|  |  |  Request
|         X->|->|  Prepare(N)
|         |<-X--X  Promise(N, I, {Va, Vb})
|         X->|->|  Accept!(N, I, Vn)
|         X<>X<>X  Accepted(N, I)
|<--------X  |  |  Response
|         |  |  |

2.2.2.2 Normal request flow

1. From the perspective of the Basic Paxos roles:

In this case, subsequent instances of the basic Paxos protocol (represented by I+1) use the same leader, so phase 1 of these subsequent instances, which consists of the Prepare and Promise sub-phases, is skipped. Note that the Leader should be stable,
i.e. it should not crash or change.

The following diagram shows such a subsequent instance with the Proposer, Acceptor and Learner roles drawn separately.

Client   Proposer       Acceptor     Learner
|         |          |  |  |       |  |  --- Following Requests ---
X-------->|          |  |  |       |  |  Request
|         X--------->|->|->|       |  |  Accept!(N,I+1,W)
|         |<---------X--X--X------>|->|  Accepted(N,I+1,W)
|<---------------------------------X--X  Response
|         |          |  |  |       |  |

2. From the perspective of the Multi-Paxos roles:

In the subsequent instances of the basic Paxos protocol, with the same leader as in the previous instances of the basic Paxos protocol, 
the phase 1 can be skipped.

Client      Servers
X-------->|  |  |  Request
|         X->|->|  Accept!(N,I+1,W)
|         X<>X<>X  Accepted(N,I+1)
|<--------X  |  |  Response
|         |  |  |

III. Raft

Raft can be regarded as a consensus algorithm that is simpler than Multi-Paxos.

3.1 Concepts in the Raft protocol

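3.1.1 Roles

Leader:
The primary node. The cluster has only one Leader at a time, and all write requests are sent to the Followers through the Leader.

Follower:
A subordinate node that follows the Leader.

Candidate:
When the Leader crashes or its messages fail to arrive and the cluster is left without a Leader, Followers stop receiving the Leader's heartbeats and start competing to become Leader. While competing, their role is Candidate. Candidate is only a transitional state and does not persist.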

3.1.2 Term

Each Leader's tenure is identified by a unique Term, a number that increases with every new election.

3.2 Basic operation

3.2.1 Raft splits state machine replication into three subproblems

1. Leader Election
2. Log Replication
3. Safety

3.2.2 Leader Election

Leader Election is triggered when the cluster starts or when the Leader's heartbeats fail to reach the Followers.

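3.2.3 Log Replication

1. All write requests go through the Leader.
2. The Leader sends the write request to the Followers in the same messages it uses as heartbeats (in Raft, a heartbeat is an AppendEntries message with no entries).
3. Once the Leader has received replies from a majority, it commits the write locally and notifies the Followers to commit.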
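
The three steps above can be sketched as follows. This is a heavily simplified single-process model, not Raft itself: there are no terms, no log-consistency checks, and no retries, and the names Node and replicate are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    up: bool = True
    log: List[str] = field(default_factory=list)
    commit_index: int = 0   # number of log entries known to be committed

    def append(self, entry: str) -> bool:
        """Store the entry and acknowledge, unless the node is down."""
        if not self.up:
            return False
        self.log.append(entry)
        return True


def replicate(leader: Node, followers: List[Node], entry: str) -> bool:
    """Return True once the entry is stored on a majority and committed."""
    leader.log.append(entry)
    acks = 1 + sum(f.append(entry) for f in followers)  # the leader counts itself
    cluster_size = 1 + len(followers)
    if acks <= cluster_size // 2:
        return False  # no majority: the entry stays uncommitted
    leader.commit_index = len(leader.log)
    for f in followers:
        if f.up:
            # Commit notification; a follower never commits past its own log.
            f.commit_index = min(leader.commit_index, len(f.log))
    return True


followers = [Node(f"S{i}") for i in range(1, 5)]
leader = Node("S5")
followers[0].up = followers[1].up = False      # 2 of 5 nodes down
print(replicate(leader, followers, "op-1"))    # True: 3 of 5 acknowledged
followers[2].up = False                        # 3 of 5 nodes down
print(replicate(leader, followers, "op-2"))    # False: only 2 of 5 acknowledged
```

In the second call the entry sits in the logs of two nodes but is not committed, which is the situation a client sees as a timeout.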

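3.2.4 Safety guarantees

1. Detecting Leader failure:
a. Raft uses timeouts so that Followers reliably detect a Leader crash or lost messages, which triggers the Followers to stand for election.
b. The Leader must send heartbeats to the Followers; data is carried in the same messages.

2. Split votes
If a Leader Election ends in a tie, each of the tied Candidates waits for a random timeout and then sends its next round of vote requests.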

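3. Split brain (a large and a small partition):
The small partition cannot get replies from a majority, so its writes fail.
The large partition elects a new Leader, which has its own new Term, and its writes succeed.
When the small partition rejoins, its Term is lower than that of the large partition, so it synchronizes its state from the large partition.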
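
The Term comparison that resolves the split can be sketched as below. Copying the whole log is a deliberate simplification (Raft only replaces the conflicting suffix), and Server and on_leader_message are illustrative names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    term: int
    log: List[str] = field(default_factory=list)
    role: str = "follower"


def on_leader_message(receiver: Server, sender_term: int, sender_log: List[str]) -> bool:
    """Reject messages from a stale Term; otherwise follow the sender."""
    if sender_term < receiver.term:
        return False                    # the sender is an outdated Leader
    receiver.term = sender_term
    receiver.role = "follower"          # an old Leader steps down here
    receiver.log = list(sender_log)     # simplification: take the sender's log
    return True


old_leader = Server(term=1, log=["a", "x"], role="leader")   # small partition
new_leader = Server(term=2, log=["a", "b", "c"], role="leader")

print(on_leader_message(new_leader, old_leader.term, old_leader.log))  # False
print(on_leader_message(old_leader, new_leader.term, new_leader.log))  # True
print(old_leader)
```

After the partition heals, the uncommitted entry "x" on the old Leader is discarded and it continues as a Follower in Term 2.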

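3.3 Consistency does not imply complete correctness

3.3.1 A Client request has three possible outcomes: success, failure, and unknown (timeout)

Understanding unknown (timeout)
Scenario: a Client issues a write and the Leader replicates the log to the Followers while 3 nodes in the cluster are down and 2 are alive. What is the result? Let the nodes be S1, S2, S3, S4, and S5 (the Leader).

Suppose S5 and S4 are alive. The Client's Nth write request is operation I. Because the Leader cannot get replies from a majority, operation I reaches only S4, and the Leader returns unknown to the Client, since it does not know whether the log entry will later be written to a majority.
Outcome 1: after the Leader has replied to the Client, the failed Followers S1, S2, and S3 recover. The Leader resends the Nth write (operation I), receives replies from a majority, and commits the entry. The write ultimately succeeds.
Outcome 2: after the Leader has replied to the Client, S5 and S4 crash while S1, S2, and S3 recover. S1, S2, and S3 elect a new Leader and the cluster becomes available again. If the Client then issues its (N+1)th request, operation I+1, and S5 and S4 recover after that request succeeds, operation I stored on S5 and S4 is deleted and both nodes sync operation I+1 locally. The Nth write (operation I) has failed.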

Summary: consistency has to be guaranteed jointly by the client and the consensus algorithm.
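
The three outcomes a client has to handle can be sketched with a small classifier. This is only an illustration of the scenario above: write_outcome and its rejected flag (for example, a request refused outright) are hypothetical and not part of any Raft API.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"


def write_outcome(acks: int, cluster_size: int, rejected: bool = False) -> Outcome:
    """Classify a write from the Leader's point of view when it replies."""
    if rejected:
        return Outcome.FAILURE
    if acks > cluster_size // 2:
        return Outcome.SUCCESS
    # No majority yet: the entry may still be committed later or be discarded.
    return Outcome.UNKNOWN


print(write_outcome(acks=2, cluster_size=5))  # Outcome.UNKNOWN
print(write_outcome(acks=3, cluster_size=5))  # Outcome.SUCCESS
```

A client that receives UNKNOWN cannot assume either result, so it has to check the state or retry in a way that is safe to repeat.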

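IV. ZAB

ZAB stands for ZooKeeper Atomic Broadcast and is the consistency protocol used inside ZooKeeper. It is largely the same as Raft.
Some of the terminology differs:
for example, ZAB calls a leader's period an epoch, whereas Raft calls it a Term.
There are also some implementation differences:
Raft guarantees log continuity, and its heartbeats go from Leader to Follower. ZAB is the reverse.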

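V. Consensus algorithms in practice

5.1 Components that use Paxos

Chubby (Google's first application of Multi-Paxos in production engineering)

5.2 Components that use Raft

Redis Cluster (Raft-like election for failover), etcd

5.3 Components that use ZAB

ZooKeeper (open-sourced by Yahoo)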

Appendix

References:
https://raft.github.io/
https://www.bilibili.com/video/av21667358?t=3887
https://en.wikipedia.org/wiki/Paxos_(computer_science)
https://en.wikipedia.org/wiki/Raft_(computer_science)