時光粒子 (Timequark) Source Code
Discussion of open-source technologies such as distributed consensus and distributed storage. GitHub: https://timequark.github.io/
First, the role transition diagram from the Raft paper:
Below is my own hand-drawn version of the transition diagram:
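Raft defines three roles:
- Leader
The Leader is responsible for:
(1) Handling read/write requests
(2) Storing Log data
(3) Sending heartbeat requests to the other nodes in the cluster to confirm that cluster communication is healthy
(4) Sending Log Entry data to Followers to complete replication
(5) Tracking the replication state of each Follower
(6) Log compaction (incomplete in raftos at present)
(7) Snapshots (incomplete in raftos at present)
The Leader continuously sends heartbeats to the other nodes, and every heartbeat request carries an ID (an incrementing int). When append_entries_response messages arrive from a majority of nodes, the Leader resets step_down_timer. If a majority fails to respond for step_down_missed_heartbeats heartbeat intervals in a row, step_down_timer fires and the Leader steps down to Follower.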
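- Candidate
The Candidate exists only to run an election.
It first increments term by 1, sets voted_for to its own ID, casts one vote for itself, and then broadcasts a request_vote request. Once it has received responses with vote_granted set to True from a majority, it is promoted to Leader. If the election timer fires before it has won a majority, it switches straight back to the Follower role.
The sections below look at the parameters carried by the request_vote request in detail.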
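- Follower
The Follower receives append_entries requests from the Leader and request_vote requests from Candidates. Note the following points:
(1) In Follower.start, init_storage sets term to 0 only on the very first load, but it resets voted_for every time.
(2) on_receive_append_entries resets election_timer only when the message passes both the @validate_term and @validate_commit_index checks. Otherwise the timer may fire and the Follower may become a Candidate and start a new election.
(3) on_receive_request_vote replies with vote_granted set to True only when the Follower has not voted yet and the Candidate's last_log_term and last_log_index are valid, that is, at least as up to date as the Follower's own log.
(4) on_receive_request_vote does not reset election_timer. The Follower cannot know whether this election will actually produce a new Leader; it can only learn that a Leader exists through a valid on_receive_append_entries.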
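The transitions described for the three roles can be summarized as a small lookup table. This is only an illustrative sketch: the `Role` enum, the event names, and `next_role` are hypothetical labels for the triggers described above, not identifiers from raftos.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# (current role, event) -> next role.
# Event names are hypothetical labels for the triggers described in the text.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "majority_votes"): Role.LEADER,
    (Role.CANDIDATE, "election_timeout"): Role.FOLLOWER,
    (Role.CANDIDATE, "leader_discovered"): Role.FOLLOWER,
    (Role.CANDIDATE, "higher_term_seen"): Role.FOLLOWER,
    (Role.LEADER, "step_down_timeout"): Role.FOLLOWER,
    (Role.LEADER, "higher_term_seen"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after `event`; unknown events leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)
```

Note that in this table a Candidate whose election timer fires goes back to Follower, as the text above describes for raftos, whereas the Raft paper has the Candidate start a new election directly.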
Leader
state.py
class Leader(BaseRole):
    """Raft Leader
    Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server;
    repeat during idle periods to prevent election timeouts

    - If command received from client: append entry to local log, respond after entry applied to state machine
    - If last log index ≥ next_index for a follower: send AppendEntries RPC with log entries starting at next_index
    - If successful: update next_index and match_index for follower
    - If AppendEntries fails because of log inconsistency: decrement next_index and retry
    - If there exists an N such that N > commit_index, a majority of match_index[i] ≥ N,
      and log[N].term == self term: set commit_index = N
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.heartbeat_timer = Timer(config.heartbeat_interval, self.heartbeat)
        self.step_down_timer = Timer(
            config.step_down_missed_heartbeats * config.heartbeat_interval,
            self.state.to_follower
        )

        # Incremented by 1 on every heartbeat
        self.request_id = 0
        # When an append_entries_response arrives, request_id is used to decide
        # whether a majority of Followers have responded
        self.response_map = {}

    def start(self):
        self.init_log()
        # LIUHAO: Trigger the leader's 'append_entries' call automatically
        self.heartbeat()
        self.heartbeat_timer.start()

        self.step_down_timer.start()

    def stop(self):
        self.heartbeat_timer.stop()
        self.step_down_timer.stop()
    # init_log is called from start() rather than from __init__.
    # When a Candidate is promoted to Leader, only next_index and match_index
    # are re-initialized; all other data stays unchanged.
    def init_log(self):
        # LIUHAO
        # - Initialize next_index of each follower to the leader's last_log_index + 1.
        #   The leader will try to broadcast 'append_entries' to each follower with the latest log data.
        #   If a follower does not reply 'success', next_index is decreased automatically.
        #   If a follower replies 'success', the leader updates 'match_index' to the follower's 'last_log_index'.
        # - 'self.state.cluster' doesn't include this node; refer to register.py:register
        self.log.next_index = {
            follower: self.log.last_log_index + 1 for follower in self.state.cluster
        }

        # LIUHAO
        # - Initialize match_index to 0. match_index catches up with each server's 'next_index'
        #   after the leader broadcasts 'append_entries' and receives 'success' responses.
        self.log.match_index = {
            follower: 0 for follower in self.state.cluster
        }
    async def append_entries(self, destination=None):
        """AppendEntries RPC - replicate log entries / heartbeat
        Args:
            destination - destination id

        Request params:
            term - leader's term
            leader_id - so follower can redirect clients
            prev_log_index - index of log entry immediately preceding new ones
            prev_log_term - term of prev_log_index entry
            commit_index - leader's commit_index

            entries[] - log entries to store (empty for heartbeat)
        """

        # Send AppendEntries RPC to destination if specified or broadcast to everyone
        # (supports both sending to a single node and broadcasting)
        destination_list = [destination] if destination else self.state.cluster
        for destination in destination_list:
            data = {
                'type': 'append_entries',

                'term': self.storage.term,
                # LIUHAO: This is just the leader's id. When a Follower receives an
                # 'append_entries' message, it updates its own Leader property.
                'leader_id': self.id,
                'commit_index': self.log.commit_index,

                'request_id': self.request_id
            }

            next_index = self.log.next_index[destination]
            prev_index = next_index - 1

            if self.log.last_log_index >= next_index:
                # The Follower is behind: only ONE entry is synced per request here
                data['entries'] = [self.log[next_index]]

            else:
                # Heartbeat: carries no data
                data['entries'] = []

            # The Follower checks whether the index and term of its previous Log Entry
            # match the Leader's, which keeps the Follower's data consistent
            data.update({
                'prev_log_index': prev_index,
                'prev_log_term': self.log[prev_index]['term'] if self.log and prev_index else 0
            })

            asyncio.ensure_future(self.state.send(data, destination), loop=self.loop)
    @validate_commit_index
    @validate_term
    def on_receive_append_entries_response(self, data):
        sender_id = self.state.get_sender_id(data['sender'])

        # Count all unique responses per particular heartbeat interval
        # and step down via <step_down_timer> if leader doesn't get majority of responses for
        # <step_down_missed_heartbeats> heartbeats

        if data['request_id'] in self.response_map:
            self.response_map[data['request_id']].add(sender_id)

            if self.state.is_majority(len(self.response_map[data['request_id']]) + 1):
                # A majority has responded: reset step_down_timer and drop the
                # record for this request_id from response_map
                self.step_down_timer.reset()
                del self.response_map[data['request_id']]

        if not data['success']:
            # LIUHAO: next_index is decreasing, presumably to tolerate a follower
            # that is recovering its log data and catching up with the Leader.
            # next_index[follower] is decremented by 1 for the next append_entries call
            self.log.next_index[sender_id] = max(self.log.next_index[sender_id] - 1, 1)

        else:
            # LIUHAO: Track next_index and match_index for the follower inside the Leader.
            # When append_entries succeeds:
            #   next_index[follower_id] becomes the Follower's last_log_index + 1,
            #   match_index[follower_id] becomes the Follower's last_log_index
            self.log.next_index[sender_id] = data['last_log_index'] + 1
            self.log.match_index[sender_id] = data['last_log_index']

            # Update commit_index
            self.update_commit_index()

        # Send AppendEntries RPC to continue updating fast-forward log (data['success'] == False)
        # or in case there are new entries to sync (data['success'] == data['updated'] == True)
        if self.log.last_log_index >= self.log.next_index[sender_id]:
            # LIUHAO: Continue sending data to the follower
            asyncio.ensure_future(self.append_entries(destination=sender_id), loop=self.loop)
    def update_commit_index(self):
        commited_on_majority = 0
        # Walk the range [commit_index + 1, last_log_index + 1). commit_index is advanced to
        # an index when, according to match_index, a majority (Followers plus the Leader itself)
        # holds that index, and log[index]['term'] equals the current storage.term.
        for index in range(self.log.commit_index + 1, self.log.last_log_index + 1):
            commited_count = len([
                1 for follower in self.log.match_index
                if self.log.match_index[follower] >= index
            ])

            # If index is matched on at least half + self for current term - commit
            # That may cause commit fails upon restart with stale logs
            is_current_term = self.log[index]['term'] == self.storage.term
            if self.state.is_majority(commited_count + 1) and is_current_term:
                commited_on_majority = index

            else:
                break

        if commited_on_majority > self.log.commit_index:
            self.log.commit_index = commited_on_majority
    # Write interface
    async def execute_command(self, command):
        """Write to log & send AppendEntries RPC"""
        self.apply_future = asyncio.Future(loop=self.loop)

        entry = self.log.write(self.storage.term, command)
        asyncio.ensure_future(self.append_entries(), loop=self.loop)
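To make the commit rule in update_commit_index easier to follow, here is a standalone sketch of the same loop over plain Python data. The function names, the `log_terms` list, and in particular the majority rule (strictly more than half of all nodes, Leader included) are assumptions made for illustration; they are not taken from raftos.

```python
def is_majority(count, cluster_size):
    """Assumed rule: strictly more than half of all nodes, leader included."""
    return count > cluster_size // 2


def compute_commit_index(commit_index, log_terms, match_index, current_term):
    """Return the new commit index for a leader.

    log_terms   -- log_terms[i - 1] is the term of the entry at 1-based index i
    match_index -- {follower_id: highest index known to be replicated there}
    """
    cluster_size = len(match_index) + 1  # followers plus the leader itself
    new_commit = commit_index
    for index in range(commit_index + 1, len(log_terms) + 1):
        replicated = sum(1 for m in match_index.values() if m >= index)
        in_current_term = log_terms[index - 1] == current_term
        if is_majority(replicated + 1, cluster_size) and in_current_term:
            new_commit = index
        else:
            # Same early exit as the listing: stop at the first index that fails
            break
    return new_commit


# Entries 1-2 are from term 1, entries 3-4 from the current term 2.
stalled = compute_commit_index(0, [1, 1, 2, 2], {"b": 4, "c": 2}, 2)
advanced = compute_commit_index(2, [1, 1, 2, 2], {"b": 4, "c": 2}, 2)
```

In this sketch `stalled` stays at 0 because the loop breaks on the first entry from an older term, even though later entries of the current term are already replicated on a majority. That appears to be the situation the "stale logs" comment in the listing refers to.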
Candidate
state.py
class Candidate(BaseRole):
    """Raft Candidate
    - On conversion to candidate, start election:
        - Increment self term
        - Vote for self
        - Reset election timer
        - Send RequestVote RPCs to all other servers
    - If votes received from majority of servers: become leader
    - If AppendEntries RPC received from new leader: convert to follower
    - If election timeout elapses: start new election
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # When the election times out, switch back to Follower automatically
        self.election_timer = Timer(self.election_interval, self.state.to_follower)
        self.vote_count = 0

    def start(self):
        """Increment current term, vote for herself & send vote requests"""
        # When the election starts, increment term by 1 and vote for itself
        self.storage.update({
            'term': self.storage.term + 1,
            'voted_for': self.id
        })

        self.vote_count = 1
        # Send the vote requests
        self.request_vote()
        # Start the election timer
        self.election_timer.start()

    def stop(self):
        self.election_timer.stop()
    def request_vote(self):
        """RequestVote RPC - gather votes
        Arguments:
            term - candidate's term
            candidate_id - candidate requesting vote
            last_log_index - index of candidate's last log entry
            last_log_term - term of candidate's last log entry
        """
        data = {
            'type': 'request_vote',

            'term': self.storage.term,
            'candidate_id': self.id,
            'last_log_index': self.log.last_log_index,
            'last_log_term': self.log.last_log_term
        }
        #
        # Broadcast the request_vote message to every other node in the cluster,
        # whether its Role is Leader, Follower, or Candidate.
        # Each node acts according to its own role and its own timing,
        # so BaseRole provides empty implementations of the following methods
        # to cover every kind of message a node may receive:
        #   - on_receive_request_vote(self, data)
        #   - on_receive_request_vote_response(self, data)
        #   - on_receive_append_entries(self, data)
        #   - on_receive_append_entries_response(self, data)
        #
        self.state.broadcast(data)
    @validate_term
    def on_receive_request_vote_response(self, data):
        """Receives response for vote request.
        If the vote was granted then check if we got majority and may become Leader
        """

        if data.get('vote_granted'):
            self.vote_count += 1

            # With a majority of votes, the Candidate switches to Leader
            if self.state.is_majority(self.vote_count):
                self.state.to_leader()

    @validate_term
    def on_receive_append_entries(self, data):
        """If we discover a Leader with the same term - step down"""
        # LIUHAO
        # Confusion here. When 'storage.term' < data['term'], @validate_term updates
        # 'storage.term' and switches self to Follower. Then the code below switches
        # self to Follower a second time. I suspect this is related to the 'split vote' case.
        # It does not seem to cause any problem.
        #
        # The double switch to Follower happens in this scenario:
        #   Two or more Candidates are running elections, say A and B, with A.term > B.term.
        #   A wins and becomes Leader, then immediately sends append_entries to B.
        #   In Candidate B's on_receive_append_entries, @validate_term sets B.term := A.term
        #   and switches B to Follower; the check below then sees B.term == A.term
        #   and switches B to Follower again.
        #
        # This scenario can occur, but because each Follower's election_interval is
        # randomized, and assuming the network is healthy, its probability is low.
        if self.storage.term == data['term']:
            self.state.to_follower()

    @staticmethod
    def election_interval():
        return random.uniform(*config.election_interval)
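The double switch to Follower discussed in the comments above can be reproduced with a toy model. Everything here is hypothetical: `validate_term` below is only a guess at the behaviour those comments describe (adopt a newer term and fall back to Follower, ignore messages with an older term), and `ToyCandidate` is not a raftos class.

```python
import functools


def validate_term(handler):
    """Hypothetical term check wrapped around a message handler."""
    @functools.wraps(handler)
    def wrapper(node, data):
        if node.term < data["term"]:
            # A newer term was seen: adopt it and fall back to Follower.
            node.term = data["term"]
            node.to_follower()
        elif node.term > data["term"]:
            # Stale message from an older term: ignore it.
            return None
        return handler(node, data)
    return wrapper


class ToyCandidate:
    def __init__(self, term):
        self.term = term
        self.role = "candidate"
        self.follower_switches = 0  # counts calls to to_follower()

    def to_follower(self):
        self.role = "follower"
        self.follower_switches += 1

    @validate_term
    def on_receive_append_entries(self, data):
        # Same equality check as the Candidate handler in the listing.
        if self.term == data["term"]:
            self.to_follower()
        return "handled"


# Candidate B (term 2) receives append_entries from the new Leader A (term 3).
b = ToyCandidate(term=2)
result = b.on_receive_append_entries({"term": 3})
```

After this call `b.follower_switches` is 2: once from the decorator and once from the handler body. In the toy model the second switch is harmless because `to_follower` is idempotent apart from the counter.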
Follower
state.py
class Follower(BaseRole):
    """Raft Follower
    - Respond to RPCs from candidates and leaders
    - If election timeout elapses without receiving AppendEntries RPC from current leader
      or granting vote to candidate: convert to candidate
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Note that election_interval is generated randomly; see config.py for the range
        self.election_timer = Timer(self.election_interval, self.start_election)

    def start(self):
        # Initialize storage (term, voted_for)
        self.init_storage()
        self.election_timer.start()

    def stop(self):
        self.election_timer.stop()

    def init_storage(self):
        """Set current term to zero upon initialization & voted_for to None"""
        # term is set to 0 only on the very first start; once the storage file
        # exists, this branch is never entered again
        if not self.storage.exists('term'):
            self.storage.update({
                'term': 0,
            })

        # Clear voted_for
        self.storage.update({
            'voted_for': None
        })

    @staticmethod
    def election_interval():
        return random.uniform(*config.election_interval)
    @validate_commit_index
    @validate_term
    def on_receive_append_entries(self, data):
        # LIUHAO: Store 'leader_id' in the 'leader' property of class State.
        # See the description in class State:
        #
        #   # <Leader object> if state is leader
        #   # <state_id> if state is follower
        #   # <None> if leader is not chosen yet
        #   leader = None
        self.state.set_leader(data['leader_id'])

        # Reply False if log doesn't contain an entry at prev_log_index whose term matches prev_log_term
        try:
            prev_log_index = data['prev_log_index']
            # Check whether the prev_log_index and prev_log_term supplied by the Leader
            # are valid compared with the local log. If not, reply False right away.
            # Note:
            #   The Raft paper mentions that, on a mismatch, the Follower may include its own
            #   last_log_index in the reply, so the Leader can locate the Follower's next_index
            #   quickly and avoid many useless append_entries round trips.
            if prev_log_index > self.log.last_log_index or (
                prev_log_index and self.log[prev_log_index]['term'] != data['prev_log_term']
            ):
                response = {
                    'type': 'append_entries_response',
                    'term': self.storage.term,
                    'success': False,

                    'request_id': data['request_id']
                }
                # Reply to the Leader asynchronously
                asyncio.ensure_future(self.state.send(response, data['sender']), loop=self.loop)
                return
        except IndexError:
            pass

        # If an existing entry conflicts with a new one (same index but different terms),
        # delete the existing entry and all that follow it
        # The entries sent by the Leader are stored in the Log starting at new_index
        new_index = data['prev_log_index'] + 1
        try:
            # On conflict, erase everything up to the tail and align with the Leader
            if self.log[new_index]['term'] != data['term'] or (
                self.log.last_log_index != prev_log_index
            ):
                self.log.erase_from(new_index)
        except IndexError:
            pass

        # LIUHAO: TODO
        # 'log.write' always appends entries to the tail. Should we reply False to the Leader here?
        # It's always one entry for now
        for entry in data['entries']:
            self.log.write(entry['term'], entry['command'])

        # Update commit index if necessary
        # Note the condition: commit_index is updated only when the Follower's commit_index
        # is smaller than the Leader's.
        # Question:
        #   What happens when the Follower's commit_index is larger than the Leader's?
        # Thoughts:
        #   That could happen if the Follower used to be a Leader with a newer commit_index
        #   and was demoted to Follower for some reason.
        #   However, that does not hold up either: a Leader advances its commit_index only after
        #   receiving append_entries_response from a majority of Followers. So a Follower's
        #   commit_index stays below the Leader's until the Leader has synced the entry at
        #   last_log_index, at which point the two are equal (update_commit_index walks
        #   [commit_index + 1, last_log_index + 1), so the largest index is last_log_index).
        if self.log.commit_index < data['commit_index']:
            self.log.commit_index = min(data['commit_index'], self.log.last_log_index)

        # Respond True since entry matching prev_log_index and prev_log_term was found
        response = {
            'type': 'append_entries_response',
            'term': self.storage.term,
            'success': True,

            # LIUHAO: 'log.last_log_index' already reflects the entries appended above
            # (possibly more than one)
            'last_log_index': self.log.last_log_index,
            'request_id': data['request_id']
        }
        asyncio.ensure_future(self.state.send(response, data['sender']), loop=self.loop)

        # Reset the election timer
        self.election_timer.reset()
    @validate_term
    def on_receive_request_vote(self, data):
        # LIUHAO: Ensure that the Follower has not voted for any Candidate yet
        if self.storage.voted_for is None and not data['type'].endswith('_response'):

            # Candidates' log has to be up-to-date

            # If the logs have last entries with different terms,
            # then the log with the later term is more up-to-date. If the logs end with the same term,
            # then whichever log is longer is more up-to-date.

            if data['last_log_term'] != self.log.last_log_term:
                up_to_date = data['last_log_term'] > self.log.last_log_term
            else:
                up_to_date = data['last_log_index'] >= self.log.last_log_index

            if up_to_date:
                self.storage.update({
                    'voted_for': data['candidate_id']
                })

            response = {
                'type': 'request_vote_response',
                'term': self.storage.term,
                'vote_granted': up_to_date
            }

            asyncio.ensure_future(self.state.send(response, data['sender']), loop=self.loop)

    def start_election(self):
        self.state.to_candidate()
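The consistency check and conflict handling performed by the Follower in on_receive_append_entries can be sketched on a plain list. This is a simplified illustration, not the raftos code: the function names are hypothetical, indices are 1-based with 0 meaning "no previous entry", and a conflict is detected by comparing each incoming entry's own term with the stored entry, as the Raft paper describes, rather than with the Leader's current term as in the listing.

```python
def prev_entry_matches(log, prev_log_index, prev_log_term):
    """log is a list of {'term', 'command'} dicts; index 0 means an empty prefix."""
    if prev_log_index == 0:
        return True
    if prev_log_index > len(log):
        return False
    return log[prev_log_index - 1]["term"] == prev_log_term


def apply_append_entries(log, prev_log_index, prev_log_term, entries):
    """Return (success, new_log) without mutating the input list."""
    if not prev_entry_matches(log, prev_log_index, prev_log_term):
        return False, log

    new_log = list(log)
    for offset, entry in enumerate(entries):
        index = prev_log_index + 1 + offset  # 1-based index of this entry
        if index <= len(new_log):
            if new_log[index - 1]["term"] != entry["term"]:
                # Conflict: drop this entry and everything after it, then append.
                del new_log[index - 1:]
                new_log.append(entry)
            # Otherwise the entry is already present; nothing to do.
        else:
            new_log.append(entry)
    return True, new_log


log = [
    {"term": 1, "command": "a"},
    {"term": 1, "command": "b"},
    {"term": 2, "command": "c"},
]
ok, merged = apply_append_entries(log, 2, 1, [{"term": 3, "command": "x"}])
```

Here the third stored entry (term 2) conflicts with the incoming entry (term 3), so it is replaced and `merged` ends with command "x". A request whose prev_log_index lies beyond the end of the log, or whose prev_log_term does not match, is rejected with `False`.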
def leader_required(func):

    @functools.wraps(func)
    async def wrapped(cls, *args, **kwargs):
        # Make sure a Leader exists in the cluster, waiting for one if necessary
        await cls.wait_for_election_success()
        # If the Leader is not this node, raise an exception
        if not isinstance(cls.leader, Leader):
            raise NotALeaderException(
                'Leader is {}!'.format(cls.leader or 'not chosen yet')
            )

        return await func(cls, *args, **kwargs)
    return wrapped
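The guard pattern used by leader_required can be tried out in isolation with a toy node. This is a self-contained sketch: `ToyNode`, its `leader_id` and `node_id` fields, and `try_write` are hypothetical stand-ins, and the leadership check compares IDs instead of using `isinstance(cls.leader, Leader)` as in the listing.

```python
import asyncio
import functools


class NotALeaderException(Exception):
    pass


def leader_required(func):
    @functools.wraps(func)
    async def wrapped(node, *args, **kwargs):
        # Wait until some leader is known, then refuse if it is not this node.
        await node.wait_for_election_success()
        if node.leader_id != node.node_id:
            raise NotALeaderException("Leader is {}!".format(node.leader_id))
        return await func(node, *args, **kwargs)
    return wrapped


class ToyNode:
    def __init__(self, node_id, leader_id):
        self.node_id = node_id
        self.leader_id = leader_id
        self.log = []

    async def wait_for_election_success(self):
        # The toy cluster already has a leader, so just yield control once.
        await asyncio.sleep(0)

    @leader_required
    async def write(self, command):
        self.log.append(command)
        return len(self.log)


def try_write(node, command):
    """Run one write and return either the new log length or the error text."""
    try:
        return asyncio.run(node.write(command))
    except NotALeaderException as exc:
        return str(exc)
```

Calling `try_write(ToyNode("a", "a"), "x")` performs the write because the node is its own leader, while `try_write(ToyNode("b", "a"), "x")` returns the error text, which a client could use to redirect the request to node "a".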