1. Nacos cluster election strategy
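The Raft protocol defines three node roles:
- Leader: handles requests from clients
- Candidate: a transitional role used while electing a Leader
- Follower: responds to requests from the Leader or from Candidates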
An election is triggered in two situations:
- when the service starts
- when the leader goes down
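Every node starts in the follower state. If a follower receives no heartbeat from a leader within a certain period (either because there is no leader yet or because the leader has gone down), it becomes a Candidate and starts an election. Before the election it increments its term; the term plays the same role as the epoch in ZooKeeper.
The candidate votes for itself, sends its vote to the other nodes, and waits for their replies.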
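During this process, one of several outcomes is possible:
- It receives votes from more than half of the nodes and becomes the leader.
- It is told that another node has already become the leader, so it switches back to follower.
- It does not receive a majority of votes within a certain period, so it starts a new election.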
Constraint:
- In any given term, a single node can cast at most one vote.
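The role transitions and the one-vote-per-term constraint described above can be sketched as a small state machine. This is only an illustrative sketch, not Nacos code: the class `RaftNodeSketch` and all of its methods are hypothetical names, and networking, timers, and log replication are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of Raft role transitions; not Nacos code. */
class RaftNodeSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    final String id;
    final int clusterSize;
    Role role = Role.FOLLOWER;          // every node starts as a follower
    long term = 0;                      // incremented before each election
    int votesReceived = 0;
    // term -> candidate voted for; enforces "at most one vote per term"
    final Map<Long, String> votedFor = new HashMap<>();

    RaftNodeSketch(String id, int clusterSize) {
        this.id = id;
        this.clusterSize = clusterSize;
    }

    /** No leader heartbeat arrived in time: become a candidate and vote for self. */
    void onElectionTimeout() {
        term++;
        role = Role.CANDIDATE;
        votedFor.put(term, id);
        votesReceived = 1;
    }

    /** Grants the vote only if this node has not voted for someone else in that term. */
    boolean requestVote(String candidateId, long candidateTerm) {
        if (candidateTerm < term) {
            return false;               // stale candidate
        }
        if (candidateTerm > term) {
            term = candidateTerm;       // newer term seen: step down
            role = Role.FOLLOWER;
        }
        String previous = votedFor.putIfAbsent(candidateTerm, candidateId);
        return previous == null || previous.equals(candidateId);
    }

    /** A peer granted its vote; a strict majority makes this node the leader. */
    void onVoteGranted() {
        if (role == Role.CANDIDATE && ++votesReceived > clusterSize / 2) {
            role = Role.LEADER;
        }
    }

    /** Another node announced itself as leader for a term at least as new as ours. */
    void onLeaderAnnounced(long leaderTerm) {
        if (leaderTerm >= term) {
            term = leaderTerm;
            role = Role.FOLLOWER;
        }
    }

    public static void main(String[] args) {
        RaftNodeSketch a = new RaftNodeSketch("a", 3);
        RaftNodeSketch b = new RaftNodeSketch("b", 3);
        a.onElectionTimeout();
        if (b.requestVote(a.id, a.term)) {
            a.onVoteGranted();
        }
        System.out.println(a.role);
    }
}
```

Keying the vote record by term keeps the constraint explicit: a second candidate asking for a vote in the same term is refused, while a candidate with a newer term can still be granted one.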
2. Nacos Raft source code analysis
2.1 RaftCore.init()
When Nacos Server starts, it calls RaftCore.init(), which sets up the cluster leader election and the heartbeat mechanism between nodes.
/**
 * @author nacos
 */
@Component
public class RaftCore {
    @PostConstruct
    public void init() throws Exception {
        Loggers.RAFT.info("initializing Raft sub-system");
        executor.submit(notifier);
        long start = System.currentTimeMillis();
        // load persisted datums and the last known term from disk
        raftStore.loadDatums(notifier, datums);
        setTerm(NumberUtils.toLong(raftStore.loadMeta().getProperty("term"), 0L));
        Loggers.RAFT.info("cache loaded, datum count: {}, current term: {}", datums.size(), peers.getTerm());
        // wait until the notifier has no pending tasks left
        while (true) {
            if (notifier.tasks.size() <= 0) {
                break;
            }
            Thread.sleep(1000L);
        }
        initialized = true;
        Loggers.RAFT.info("finish to load data from disk, cost: {} ms.", (System.currentTimeMillis() - start));
        // register the leader election task
        GlobalExecutor.registerMasterElection(new MasterElection());
        // register the cluster heartbeat task
        GlobalExecutor.registerHeartbeat(new HeartBeat());
        Loggers.RAFT.info("timer started: leader timeout ms: {}, heart-beat timeout ms: {}",
            GlobalExecutor.LEADER_TIMEOUT_MS, GlobalExecutor.HEARTBEAT_INTERVAL_MS);
    }
}
In init(), the call GlobalExecutor.registerMasterElection(new MasterElection()) sets up the election.
registerMasterElection() starts a scheduled task that runs the logic in MasterElection, which is examined next:
2.2 new MasterElection()
public class MasterElection implements Runnable {
    @Override
    public void run() {
        try {
            if (!peers.isReady()) {
                return;
            }
            // get the RaftPeer of the local node
            RaftPeer local = peers.local();
            // count down the election timeout by one tick
            local.leaderDueMs -= GlobalExecutor.TICK_PERIOD_MS;
            if (local.leaderDueMs > 0) {
                return;
            }
            // reset timeout
            // reset the election timeout and the heartbeat timer
            local.resetLeaderDue();
            local.resetHeartbeatDue();
            // send the vote to the other Nacos nodes
            sendVote();
        } catch (Exception e) {
            Loggers.RAFT.warn("[RAFT] error while master election {}", e);
        }
    }
}
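The MasterElection task first fetches the RaftPeer of the local Nacos node. A RaftPeer contains the following information:
- ip: the node's IP address
- voteFor: the node this peer has voted for
- term: can be understood as the election round, like the logical clock in ZooKeeper
- state: the node's role, follower by default
On each tick the task subtracts TICK_PERIOD_MS from leaderDueMs and returns if the timeout has not expired yet. Once it has expired, the task resets the election timeout and the heartbeat timer, then calls sendVote() to start the election.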
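The countdown in MasterElection.run() can be isolated into a small, deterministic sketch. The class `ElectionTimerSketch` is hypothetical, and the constant values and the random jitter in `resetLeaderDue()` are assumptions chosen for illustration; they are not taken from the excerpt above.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of a tick-driven election timeout; not Nacos code. */
class ElectionTimerSketch {
    static final long TICK_PERIOD_MS = 500L;        // assumed tick length
    static final long LEADER_TIMEOUT_MS = 15_000L;  // assumed base timeout
    static final long RANDOM_MS = 5_000L;           // assumed jitter range

    long leaderDueMs;

    ElectionTimerSketch(long initialDueMs) {
        this.leaderDueMs = initialDueMs;
    }

    /** Re-arms the timeout; the jitter makes simultaneous candidacies less likely. */
    void resetLeaderDue() {
        leaderDueMs = LEADER_TIMEOUT_MS + ThreadLocalRandom.current().nextLong(0, RANDOM_MS);
    }

    /** One scheduler tick. Returns true when an election should be started. */
    boolean tick() {
        leaderDueMs -= TICK_PERIOD_MS;
        if (leaderDueMs > 0) {
            return false;       // leader is not considered lost yet
        }
        resetLeaderDue();       // re-arm before voting, as run() does
        return true;
    }

    public static void main(String[] args) {
        ElectionTimerSketch timer = new ElectionTimerSketch(1_200L);
        int ticks = 1;
        while (!timer.tick()) {
            ticks++;
        }
        System.out.println("election started after " + ticks + " ticks");
    }
}
```

Counting down in fixed ticks, rather than scheduling one long timer, lets any received heartbeat push the deadline back simply by resetting the counter.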
2.3 RaftCore.sendVote()
public void sendVote() {
    // 1. get the RaftPeer of the local Nacos node
    RaftPeer local = peers.get(NetUtils.localServer());
    Loggers.RAFT.info("leader timeout, start voting,leader: {}, term: {}",
        JSON.toJSONString(getLeader()), local.term);
    // 2. clear the current leader and every peer's voteFor (both set to null)
    peers.reset();
    // 3. increment the local term
    local.term.incrementAndGet();
    // 4. vote for the local node itself
    local.voteFor = local.ip;
    // 5. switch the local node to the CANDIDATE state
    local.state = RaftPeer.State.CANDIDATE;
    Map<String, String> params = new HashMap<>(1);
    // 6. serialize the local RaftPeer as the vote payload
    params.put("vote", JSON.toJSONString(local));
    // 7. send the vote to every other node in the cluster over HTTP
    for (final String server : peers.allServersWithoutMySelf()) {
        final String url = buildURL(server, API_VOTE);
        try {
            HttpClient.asyncHttpPost(url, null, params, new AsyncCompletionHandler<Integer>() {
                @Override
                public Integer onCompleted(Response response) throws Exception {
                    if (response.getStatusCode() != HttpURLConnection.HTTP_OK) {
                        Loggers.RAFT.error("NACOS-RAFT vote failed: {}, url: {}", response.getResponseBody(), url);
                        return 1;
                    }
                    // 8. parse the peer's reply to the vote sent above
                    RaftPeer peer = JSON.parseObject(response.getResponseBody(), RaftPeer.class);
                    Loggers.RAFT.info("received approve from peer: {}", JSON.toJSONString(peer));
                    // 9. tally the votes and decide the leader
                    peers.decideLeader(peer);
                    return 0;
                }
            });
        } catch (Exception e) {
            Loggers.RAFT.warn("error while sending vote to server: {}", server);
        }
    }
}
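The main steps in sendVote() are:
- Get the RaftPeer of the local Nacos node (ip, term, voteFor, state).
- Call peers.reset(), which sets the cluster leader to null and clears the voteFor of every peer.
- Update the local RaftPeer: increment term, set voteFor to itself, and change state to CANDIDATE.
- Put the serialized local RaftPeer into a HashMap.
- Send the vote to the other nodes in the Nacos cluster over HTTP.
- Receive each node's reply to that vote.
- Decide the Leader from the collected votes.
The logic of peers.reset() is: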
public void reset() {
    leader = null;
    for (RaftPeer peer : peers.values()) {
        peer.voteFor = null;
    }
}
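The local vote is sent to the other nodes over HTTP and each node returns its own vote in the response. On the receiving side the request is handled by RaftController.vote():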
@NeedAuth
@PostMapping("/vote")
public JSONObject vote(HttpServletRequest request, HttpServletResponse response) throws Exception {
    RaftPeer peer = raftCore.receivedVote(
        JSON.parseObject(WebUtils.required(request, "vote"), RaftPeer.class));
    return JSON.parseObject(JSON.toJSONString(peer));
}
vote() mainly delegates to RaftCore.receivedVote().
2.4 RaftCore.receivedVote()
This is the method in which a Nacos node receives another node's vote and returns its own vote as the result.
public synchronized RaftPeer receivedVote(RaftPeer remote) {
    if (!peers.contains(remote)) {
        throw new IllegalStateException("can not find peer: " + remote.ip);
    }
    RaftPeer local = peers.get(NetUtils.localServer());
    if (remote.term.get() <= local.term.get()) {
        String msg = "received illegitimate vote" +
            ", voter-term:" + remote.term + ", votee-term:" + local.term;
        Loggers.RAFT.info(msg);
        if (StringUtils.isEmpty(local.voteFor)) {
            local.voteFor = local.ip;
        }
        return local;
    }
    local.resetLeaderDue();
    local.state = RaftPeer.State.FOLLOWER;
    local.voteFor = remote.ip;
    local.term.set(remote.term.get());
    Loggers.RAFT.info("vote {} as leader, term: {}", remote.ip, remote.term);
    return local;
}
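The logic of this method is straightforward:
- First check whether the RaftPeerSet contains the remote RaftPeer (a RaftPeer can be thought of as a Nacos node object, and the RaftPeerSet as the collection of nodes in the Nacos cluster).
- Then get the RaftPeer of the local node.
- Compare the term of the local node with the term of the remote node to decide the vote. If the remote term is less than or equal to the local term, the request is rejected: the local node keeps its current vote (voting for itself if it has not voted yet) and returns its own RaftPeer to the remote node. Otherwise the local node resets its leader timeout, becomes a follower, votes for the remote node, and adopts the remote term.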
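The term comparison in receivedVote() reduces to a small pure function. The sketch below is a hypothetical restatement for illustration (the class `VoteDecisionSketch` and its parameters are not Nacos names); it returns only the resulting voteFor and omits the state and timeout updates.

```java
/** Hypothetical restatement of the vote decision in receivedVote(); not Nacos code. */
class VoteDecisionSketch {

    /**
     * Returns the ip the local node ends up voting for after a vote request.
     * localVoteFor may be null or empty when the local node has not voted yet.
     */
    static String decide(String localIp, long localTerm, String localVoteFor,
                         String remoteIp, long remoteTerm) {
        if (remoteTerm <= localTerm) {
            // request from an equal or older term is rejected:
            // keep the existing vote, or vote for self if there is none
            boolean notVotedYet = localVoteFor == null || localVoteFor.isEmpty();
            return notVotedYet ? localIp : localVoteFor;
        }
        // strictly newer term: grant the vote to the remote candidate
        return remoteIp;
    }

    public static void main(String[] args) {
        System.out.println(decide("10.0.0.1", 3, null, "10.0.0.2", 4));
    }
}
```

Note that an equal term is also rejected, so two candidates that started an election in the same term cannot obtain each other's vote.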
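2.5 RaftPeerSet.decideLeader()
In RaftCore.sendVote() (section 2.3), each Nacos node sends its vote to the other nodes in the cluster. The request reaches RaftController.vote() on each of those nodes, which calls RaftCore.receivedVote() (section 2.4) to evaluate the incoming vote and return that node's own vote to the sender.
Once RaftCore.sendVote() has received another node's vote, it calls decideLeader() to determine the Leader node: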
public RaftPeer decideLeader(RaftPeer candidate) {
    peers.put(candidate.ip, candidate);
    SortedBag ips = new TreeBag();
    int maxApproveCount = 0;
    String maxApprovePeer = null;
    for (RaftPeer peer : peers.values()) {
        if (StringUtils.isEmpty(peer.voteFor)) {
            continue;
        }
        ips.add(peer.voteFor);
        if (ips.getCount(peer.voteFor) > maxApproveCount) {
            maxApproveCount = ips.getCount(peer.voteFor);
            maxApprovePeer = peer.voteFor;
        }
    }
    if (maxApproveCount >= majorityCount()) {
        RaftPeer peer = peers.get(maxApprovePeer);
        peer.state = RaftPeer.State.LEADER;
        if (!Objects.equals(leader, peer)) {
            leader = peer;
            applicationContext.publishEvent(new LeaderElectFinishedEvent(this, leader));
            Loggers.RAFT.info("{} has become the LEADER", leader.ip);
        }
    }
    return leader;
}
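This method first records the reply it just received, then finds the node with the most votes and its vote count. It then checks whether that count reaches the majority of the Nacos cluster (majorityCount()). If it does not, the method returns the current leader field unchanged, which is null right after peers.reset(). If it does, the method marks that node as LEADER, stores it as the leader, and returns it.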
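The tally in decideLeader() can be sketched with plain JDK collections instead of the TreeBag used in the original. The class `LeaderTallySketch` is hypothetical, and the majority formula `clusterSize / 2 + 1` is an assumption about what majorityCount() computes, since that method is not shown above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the vote tally in decideLeader(); not Nacos code. */
class LeaderTallySketch {

    /**
     * votes holds one voteFor entry per known peer (null or empty = no vote yet).
     * Returns the ip with a majority of votes, or null if nobody has one.
     */
    static String decideLeader(List<String> votes, int clusterSize) {
        int majority = clusterSize / 2 + 1;   // assumed definition of a majority
        Map<String, Integer> counts = new HashMap<>();
        String top = null;
        int topCount = 0;
        for (String voteFor : votes) {
            if (voteFor == null || voteFor.isEmpty()) {
                continue;                     // peer has not voted
            }
            int count = counts.merge(voteFor, 1, Integer::sum);
            if (count > topCount) {
                topCount = count;
                top = voteFor;
            }
        }
        return topCount >= majority ? top : null;
    }

    public static void main(String[] args) {
        List<String> votes = Arrays.asList("10.0.0.1", "10.0.0.1", null);
        System.out.println(decideLeader(votes, 3));
    }
}
```

Because peers that have not replied yet still count toward the cluster size, a candidate cannot win simply by being the only one with votes recorded so far.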
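2.6 Cluster heartbeat mechanism: GlobalExecutor.registerHeartbeat(new HeartBeat())
Besides the election set up above, RaftCore.init() immediately goes on to set up the cluster heartbeat. It likewise registers a scheduled task, new HeartBeat(), which sends a heartbeat every 5s: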
public class HeartBeat implements Runnable {
    @Override
    public void run() {
        try {
            if (!peers.isReady()) {
                return;
            }
            RaftPeer local = peers.local();
            // count down the heartbeat timer by one tick
            local.heartbeatDueMs -= GlobalExecutor.TICK_PERIOD_MS;
            if (local.heartbeatDueMs > 0) {
                return;
            }
            local.resetHeartbeatDue();
            sendBeat();
        } catch (Exception e) {
            Loggers.RAFT.warn("[RAFT] error while sending beat {}", e);
        }
    }
}
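This method first fetches the RaftPeer of the local node and counts down heartbeatDueMs. Once the timer expires, it resets the heartbeat timer and calls sendBeat() to send the heartbeat: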
public void sendBeat() throws IOException, InterruptedException {
    RaftPeer local = peers.local();
    // only the leader sends heartbeats (a standalone server always does)
    if (local.state != RaftPeer.State.LEADER && !STANDALONE_MODE) {
        return;
    }
    if (Loggers.RAFT.isDebugEnabled()) {
        Loggers.RAFT.debug("[RAFT] send beat with {} keys.", datums.size());
    }
    local.resetLeaderDue();
    // build data
    JSONObject packet = new JSONObject();
    packet.put("peer", local);
    JSONArray array = new JSONArray();
    if (switchDomain.isSendBeatOnly()) {
        Loggers.RAFT.info("[SEND-BEAT-ONLY] {}", String.valueOf(switchDomain.isSendBeatOnly()));
    }
    if (!switchDomain.isSendBeatOnly()) {
        // attach the key and timestamp of every datum
        for (Datum datum : datums.values()) {
            JSONObject element = new JSONObject();
            if (KeyBuilder.matchServiceMetaKey(datum.key)) {
                element.put("key", KeyBuilder.briefServiceMetaKey(datum.key));
            } else if (KeyBuilder.matchInstanceListKey(datum.key)) {
                element.put("key", KeyBuilder.briefInstanceListkey(datum.key));
            }
            element.put("timestamp", datum.timestamp);
            array.add(element);
        }
    }
    packet.put("datums", array);
    // broadcast
    Map<String, String> params = new HashMap<String, String>(1);
    params.put("beat", JSON.toJSONString(packet));
    String content = JSON.toJSONString(params);
    // gzip the payload before sending
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(out);
    gzip.write(content.getBytes(StandardCharsets.UTF_8));
    gzip.close();
    byte[] compressedBytes = out.toByteArray();
    String compressedContent = new String(compressedBytes, StandardCharsets.UTF_8);
    if (Loggers.RAFT.isDebugEnabled()) {
        Loggers.RAFT.debug("raw beat data size: {}, size of compressed data: {}",
            content.length(), compressedContent.length());
    }
    for (final String server : peers.allServersWithoutMySelf()) {
        try {
            final String url = buildURL(server, API_BEAT);
            if (Loggers.RAFT.isDebugEnabled()) {
                Loggers.RAFT.debug("send beat to server " + server);
            }
            HttpClient.asyncHttpPostLarge(url, null, compressedBytes, new AsyncCompletionHandler<Integer>() {
                @Override
                public Integer onCompleted(Response response) throws Exception {
                    if (response.getStatusCode() != HttpURLConnection.HTTP_OK) {
                        Loggers.RAFT.error("NACOS-RAFT beat failed: {}, peer: {}",
                            response.getResponseBody(), server);
                        MetricsMonitor.getLeaderSendBeatFailedException().increment();
                        return 1;
                    }
                    // refresh the local view of the peer that answered
                    peers.update(JSON.parseObject(response.getResponseBody(), RaftPeer.class));
                    if (Loggers.RAFT.isDebugEnabled()) {
                        Loggers.RAFT.debug("receive beat response from: {}", url);
                    }
                    return 0;
                }
                @Override
                public void onThrowable(Throwable t) {
                    Loggers.RAFT.error("NACOS-RAFT error while sending heart-beat to peer: {} {}", server, t);
                    MetricsMonitor.getLeaderSendBeatFailedException().increment();
                }
            });
        } catch (Exception e) {
            Loggers.RAFT.error("error while sending heart-beat to peer: {} {}", server, e);
            MetricsMonitor.getLeaderSendBeatFailedException().increment();
        }
    }
}
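The overall flow of this method is:
It first checks whether the local node is the Leader and returns immediately if it is not (unless the server runs in standalone mode). If it is the Leader, it packs its RaftPeer together with the key and timestamp of each datum, gzip-compresses the payload, and sends it over HTTP to the other nodes in the Nacos cluster. The request reaches RaftController.beat(), which calls RaftCore.receivedBeat() and returns that node's RaftPeer to the sender. The sender then updates its RaftPeerSet with the returned RaftPeer, which keeps the node information in the Nacos cluster consistent.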
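The gzip step in sendBeat() can be tried in isolation with only JDK classes. The sketch below is illustrative: `BeatCompressionSketch` and its methods are hypothetical names, and the `decompress` side is an assumption about what a receiver would have to do, since the receiving code is not shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical gzip round trip for a heartbeat payload; not Nacos code. */
class BeatCompressionSketch {

    /** Compresses the UTF-8 bytes of content with gzip. */
    static byte[] compress(String content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();   // complete only after the gzip stream is closed
    }

    /** Reverses compress(): gunzips the bytes and decodes them as UTF-8. */
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String content = "{\"beat\":\"{\\\"datums\\\":[]}\"}";
        byte[] compressed = compress(content);
        System.out.println("raw bytes: " + content.getBytes(StandardCharsets.UTF_8).length
            + ", compressed bytes: " + compressed.length);
        System.out.println(decompress(compressed).equals(content));
    }
}
```

When comparing sizes, the byte array length is the reliable figure: gzip output is binary, so decoding it as a UTF-8 string and measuring that string does not give the compressed size.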
2.7 RaftCore.receivedBeat()
Core code:
if (local.state != RaftPeer.State.FOLLOWER) {
    Loggers.RAFT.info("[RAFT] make remote as leader, remote peer: {}", JSON.toJSONString(remote));
    // mk follower
    local.state = RaftPeer.State.FOLLOWER;
    local.voteFor = remote.ip;
}
final JSONArray beatDatums = beat.getJSONArray("datums");
local.resetLeaderDue();
local.resetHeartbeatDue();
peers.makeLeader(remote);
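This method checks whether the local node (the one receiving the heartbeat) is a follower. If it is not, the local node steps down to follower and sets its vote to the remote node that sent the heartbeat. It then resets the leader timeout and the heartbeat timer and calls makeLeader() to record the remote node as the leader and bring other non-follower nodes in line. Finally a RaftPeer is returned to the sender.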
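The step-down behaviour in the excerpt can be sketched as follows. This is a hypothetical model, not Nacos code: `BeatReceiverSketch` and its fields are invented for illustration, and the rule that a heartbeat from an older term is ignored is an assumption that does not appear in the excerpt above.

```java
/** Hypothetical sketch of how a node reacts to a leader heartbeat; not Nacos code. */
class BeatReceiverSketch {
    enum State { FOLLOWER, CANDIDATE, LEADER }

    State state = State.CANDIDATE;
    long term;
    String voteFor;
    String leaderIp;

    /** Handles a heartbeat from remoteIp. Returns false if the beat is ignored. */
    boolean onBeat(String remoteIp, long remoteTerm) {
        if (remoteTerm < term) {
            return false;               // assumption: beats from an older term are ignored
        }
        if (state != State.FOLLOWER) {
            state = State.FOLLOWER;     // a live leader exists, so step down
            voteFor = remoteIp;
        }
        leaderIp = remoteIp;            // corresponds to peers.makeLeader(remote)
        return true;
    }

    public static void main(String[] args) {
        BeatReceiverSketch node = new BeatReceiverSketch();
        node.term = 2;
        node.onBeat("10.0.0.2", 2);
        System.out.println(node.state + " following " + node.leaderIp);
    }
}
```

In the real method the election timeout is also reset at this point, which is what keeps a healthy follower from starting a new election.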