References
"從PAXOS到ZOOKEEPER分佈式一致性原理與實踐" (From Paxos to ZooKeeper: Distributed Consistency Principles and Practice)
zookeeper-3.0.0
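Leader role initialization
Once the election described in the previous article has finished, each ZooKeeper instance takes on the role assigned to it by the election result. This article covers how the Leader role is initialized.
Leader initialization flow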
case LEADING:
LOG.info("LEADING");
try {
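setLeader(makeLeader(logFactory)); // create the Leader instance and store it on this peer
leader.lead(); // run the leader main loop; returns only when leadership is lost
setLeader(null); // leadership is over, so clear the reference
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) { // lead() threw: shut the leader down and clear the reference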
leader.shutdown("Forcing shutdown");
setLeader(null);
}
setPeerState(ServerState.LOOKING);
}
break;
When the peer's state becomes LEADING, it creates a Leader instance and calls that instance's lead method to start acting as leader. First, look at makeLeader.
protected Leader makeLeader(FileTxnSnapLog logFactory) throws IOException {
return new Leader(this, new LeaderZooKeeperServer(logFactory,
this,new ZooKeeperServer.BasicDataTreeBuilder()));
}
This constructs a new Leader, passing in the QuorumPeer instance together with a newly created LeaderZooKeeperServer.
The Leader constructor is as follows:
Leader(QuorumPeer self,LeaderZooKeeperServer zk) throws IOException {
this.self = self;
try {
ss = new ServerSocket(self.getQuorumAddress().getPort()); // listen on this peer's quorum port
} catch (BindException e) {
LOG.error("Couldn't bind to port "
+ self.getQuorumAddress().getPort());
throw e;
}
this.zk=zk;
}
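The constructor simply stores the two references and starts listening on the local quorum port. Next, the lead method is called.
The leader.lead method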
void lead() throws IOException, InterruptedException {
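self.tick = 0; // reset the tick counter
zk.loadData(); // load the persisted data, dropping dead sessions and keeping live ones
zk.startup(); // create the session tracker and set up the request processor chain
long epoch = self.getLastLoggedZxid() >> 32L; // the epoch is the high 32 bits of the last logged zxid
epoch++; // start a new epoch
zk.setZxid(epoch << 32L); // first zxid of the new epoch: the low 32 bits (counter) are zero
zk.dataTree.lastProcessedZxid = zk.getZxid(); // record it as the last processed zxid
lastProposed = zk.getZxid();
newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
null, null); // build the NEWLEADER packet
if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) { // the low 32 bits should be zero
LOG.warn("NEWLEADER proposal has Zxid of "
+ newLeaderProposal.packet.getZxid());
}
outstandingProposals.add(newLeaderProposal); // track it as an outstanding proposal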
// Start thread that waits for connection requests from
// new followers.
cnxAcceptor = new FollowerCnxAcceptor(); // thread that accepts follower connections
cnxAcceptor.start();
// We have to get at least a majority of servers in sync with
// us. We do this by waiting for the NEWLEADER packet to get
// acknowledged
newLeaderProposal.ackCount++; // the leader counts as one ack, so add 1 up front
while (newLeaderProposal.ackCount <= self.quorumPeers.size() / 2) { // wait until more than half of the ensemble has acked
if (self.tick > self.initLimit) { // waited longer than initLimit ticks
// Followers aren't syncing fast enough,
// renounce leadership!
shutdown("Waiting for " + (self.quorumPeers.size() / 2) // give up leadership and return, which leads to a new election
+ " followers, only synced with "
+ newLeaderProposal.ackCount);
if (followers.size() >= self.quorumPeers.size() / 2) {
LOG.warn("Enough followers present. "+
"Perhaps the initTicks need to be increased.");
}
return;
}
Thread.sleep(self.tickTime); // sleep for one tick
self.tick++; // advance the tick counter
}
if (!System.getProperty("zookeeper.leaderServes", "yes").equals("no")) { // unless zookeeper.leaderServes is "no", the leader also serves client connections
self.cnxnFactory.setZooKeeperServer(zk);
}
// Everything is a go, simply start counting the ticks
// WARNING: I couldn't find any wait statement on a synchronized
// block that would be notified by this notifyAll() call, so
// I commented it out
//synchronized (this) {
// notifyAll();
//}
// We ping twice a tick, so we only update the tick every other
// iteration
boolean tickSkip = true;
while (true) {
Thread.sleep(self.tickTime / 2); // sleep for half a tick
if (!tickSkip) {
self.tick++;
}
int syncedCount = 0;
// lock on the followers when we use it.
synchronized (followers) { // count the followers that are still in sync and ping each one
for (FollowerHandler f : followers) {
if (f.synced()) {
syncedCount++;
}
f.ping(); // send a PING
}
}
if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) { // too few synced followers: quorum is lost, so shut down and go back to election
// Lost quorum, shutdown
shutdown("Only " + syncedCount + " followers, need "
+ (self.quorumPeers.size() / 2));
// make sure the order is the same!
// the leader goes to looking
return;
}
tickSkip = !tickSkip;
}
}
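The lead method first loads the persisted data and sessions, then sets up the request processor chain. It then waits for followers to acknowledge the NEWLEADER packet and leaves the wait loop once more than half of the ensemble, counting the leader itself, has acknowledged. After that it loops twice per tick: each iteration pings every follower, and every other iteration checks that enough followers are still in sync. If they are not, the leader shuts down and the peer returns to election.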
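The zxid arithmetic and the majority test in lead can be tried out in isolation. The following is a minimal sketch, not ZooKeeper code: the class ZxidSketch and its method names are hypothetical, and only the bit layout (epoch in the high 32 bits, counter in the low 32 bits) and the `ackCount <= size / 2` loop condition are taken from the listing above.

```java
// Hypothetical helper class for illustration; not part of ZooKeeper.
final class ZxidSketch {

    // The epoch is the high 32 bits of a zxid.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // The per-epoch counter is the low 32 bits of a zxid.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Mirrors lead(): bump the epoch and reset the counter to zero.
    static long firstZxidOfNextEpoch(long lastLoggedZxid) {
        long epoch = epochOf(lastLoggedZxid) + 1;
        return epoch << 32L;
    }

    // lead() keeps waiting while ackCount <= ensembleSize / 2,
    // so it proceeds once this returns true.
    static boolean hasMajority(int ackCount, int ensembleSize) {
        return ackCount > ensembleSize / 2;
    }

    public static void main(String[] args) {
        long lastLogged = (3L << 32) | 0x2aL; // epoch 3, counter 42
        long next = firstZxidOfNextEpoch(lastLogged);
        System.out.println("next zxid = 0x" + Long.toHexString(next));
        System.out.println("epoch = " + epochOf(next) + ", counter = " + counterOf(next));
        System.out.println("2 of 5 is a majority: " + hasMajority(2, 5));
        System.out.println("3 of 5 is a majority: " + hasMajority(3, 5));
    }
}
```

Because the leader adds its own ack before entering the loop, a five-node ensemble needs acknowledgements from only two followers to proceed.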
FollowerCnxAcceptor: accepting follower connections
class FollowerCnxAcceptor extends Thread{
private volatile boolean stop = false;
@Override
public void run() {
try {
while (!stop) { // keep running until halted
try{
Socket s = ss.accept(); // accept a connection from a follower
s.setSoTimeout(self.tickTime * self.syncLimit); // read timeout for this connection
s.setTcpNoDelay(true); // enable TCP_NODELAY (disable Nagle's algorithm)
new FollowerHandler(s, Leader.this); // create a FollowerHandler for the connection
} catch (SocketException e) {
if (stop) {
LOG.info("exception while shutting down acceptor: "
+ e);
// When Leader.shutdown() calls ss.close(),
// the call to accept throws an exception.
// We catch and set stop to true.
stop = true;
} else {
throw e;
}
}
}
} catch (Exception e) {
LOG.warn("Exception while accepting follower", e);
}
}
public void halt() {
stop = true;
}
}
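A dedicated thread accepts the follower connections. For each connection it sets a read timeout of tickTime * syncLimit, enables TCP_NODELAY so that small packets are sent immediately instead of being held back by Nagle's algorithm, and creates a FollowerHandler.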
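The accept loop pattern can be reproduced with the plain java.net classes. The sketch below is an illustration under assumptions, not the ZooKeeper implementation: AcceptorSketch, its accepted list, and the readTimeoutMs parameter are hypothetical names, and unlike the original (where Leader.shutdown() closes the server socket) the halt method here closes the socket itself to unblock accept.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for FollowerCnxAcceptor; it only records the sockets.
final class AcceptorSketch extends Thread {
    private final ServerSocket ss;
    private final int readTimeoutMs;
    private volatile boolean stop = false;
    final List<Socket> accepted = new CopyOnWriteArrayList<>();

    AcceptorSketch(ServerSocket ss, int readTimeoutMs) {
        super("AcceptorSketch");
        this.ss = ss;
        this.readTimeoutMs = readTimeoutMs;
    }

    @Override
    public void run() {
        while (!stop) {
            try {
                Socket s = ss.accept();          // blocks until a peer connects
                s.setSoTimeout(readTimeoutMs);   // reads fail after this many ms of silence
                s.setTcpNoDelay(true);           // send small packets immediately
                accepted.add(s);                 // the original creates a FollowerHandler here
            } catch (SocketException e) {
                // Closing the server socket makes accept() throw; that is the normal exit path.
                if (!stop) {
                    System.err.println("accept failed: " + e);
                }
                return;
            } catch (IOException e) {
                System.err.println("accept failed: " + e);
                return;
            }
        }
    }

    // Assumption: this sketch closes the sockets itself to unblock accept().
    void halt() throws IOException {
        stop = true;
        ss.close();
        for (Socket s : accepted) {
            s.close();
        }
    }

    public static void main(String[] args) throws Exception {
        ServerSocket ss = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        AcceptorSketch acceptor = new AcceptorSketch(ss, 2000);
        acceptor.start();
        try (Socket client = new Socket(InetAddress.getLoopbackAddress(), ss.getLocalPort())) {
            Thread.sleep(200); // crude wait so the acceptor thread can register the connection
            System.out.println("accepted connections: " + acceptor.accepted.size());
        }
        acceptor.halt();
        acceptor.join();
    }
}
```

The read timeout matters for the handler thread that later blocks on the socket: a follower that stays silent for longer than the timeout causes the read to fail, which ends that handler.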
FollowerHandler processing flow
FollowerHandler is the class that handles the message exchange with a single follower.
FollowerHandler(Socket sock, Leader leader) throws IOException {
super("FollowerHandler-" + sock.getRemoteSocketAddress());
this.sock = sock;
this.leader = leader;
leader.addFollowerHandler(this); // add this handler to the leader's list of followers
start(); // start the thread, which executes run()
}
Because the class extends Thread, calling start executes the run method on a new thread:
@Override
public void run() {
try {
ia = BinaryInputArchive.getArchive(new BufferedInputStream(sock
.getInputStream())); // set up the input archive
bufferedOutput = new BufferedOutputStream(sock.getOutputStream()); // set up the output stream
oa = BinaryOutputArchive.getArchive(bufferedOutput);
QuorumPacket qp = new QuorumPacket(); // empty packet to read into
ia.readRecord(qp, "packet"); // read the first packet from the follower
if (qp.getType() != Leader.LASTZXID) { // the first packet must be of type LASTZXID
LOG.error("First packet " + qp.toString()
+ " is not LASTZXID!"); // otherwise log an error and give up on this follower
return;
}
long peerLastZxid = qp.getZxid(); // the follower's last zxid
int packetToSend = Leader.SNAP;
boolean logTxns = true;
long zxidToSend = 0;
// we are sending the diff
synchronized(leader.zk.committedLog) {
if (leader.zk.committedLog.size() != 0) { // the leader holds committed proposals in memory
if ((leader.zk.maxCommittedLog >= peerLastZxid) // the follower's zxid is at or below the newest committed zxid
&& (leader.zk.minCommittedLog <= peerLastZxid)) { // and at or above the oldest one
packetToSend = Leader.DIFF;
zxidToSend = leader.zk.maxCommittedLog; // bring the follower up to the newest committed zxid
for (Proposal propose: leader.zk.committedLog) { // walk the committed log
if (propose.packet.getZxid() > peerLastZxid) { // this proposal is newer than what the follower has
queuePacket(propose.packet); // queue the proposal for the follower
QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
null, null);
queuePacket(qcommit); // followed by its COMMIT
}
}
}
}
else {
logTxns = false;
}
}
long leaderLastZxid = leader.startForwarding(this, peerLastZxid); // register this handler so it receives new proposals
QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
leaderLastZxid, null, null); // build the NEWLEADER packet
oa.writeRecord(newLeaderQP, "packet"); // send it
bufferedOutput.flush();
// a special case when both the ids are the same
if (peerLastZxid == leaderLastZxid) { // the follower is already up to date
packetToSend = Leader.DIFF; // so an empty DIFF is enough
zxidToSend = leaderLastZxid;
}
//check if we decided to send a diff or we need to send a truncate
// we avoid using epochs for truncating because epochs make things
// complicated. Two epochs might have the last 32 bits as same.
// only if we know that there is a committed zxid in the queue that
// is less than the one the peer has we send a trunc else to make
// things simple we just send snapshot.
if (logTxns && (peerLastZxid > leader.zk.maxCommittedLog)) {
// this is the only case that we are sure that
// we can ask the follower to truncate the log
packetToSend = Leader.TRUNC; // tell the follower to truncate its log
zxidToSend = leader.zk.maxCommittedLog;
}
oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet"); // send the chosen sync type and zxid
bufferedOutput.flush();
// only if we are not truncating or fast syncing
if (packetToSend == Leader.SNAP) { // a full snapshot is required
LOG.warn("Sending snapshot last zxid of peer is 0x"
+ Long.toHexString(peerLastZxid) + " "
+ " zxid of leader is 0x"
+ Long.toHexString(leaderLastZxid));
// Dump data to follower
leader.zk.serializeSnapshot(oa); // serialize the snapshot to the follower
oa.writeString("BenWasHere", "signature");
}
bufferedOutput.flush();
//
// Mutation packets will be queued during the serialize,
// so we need to mark when the follower can actually start
// using the data
//
queuedPackets
.add(new QuorumPacket(Leader.UPTODATE, -1, null, null)); // queue UPTODATE behind the sync data
// Start sending packets
new Thread() {
public void run() {
Thread.currentThread().setName(
"Sender-" + sock.getRemoteSocketAddress());
try {
sendPackets(); // runs on the new thread and drains the send queue
} catch (InterruptedException e) {
LOG.warn("Interrupted",e);
}
}
}.start();
while (true) {
qp = new QuorumPacket(); // empty packet to read into
ia.readRecord(qp, "packet"); // block until the follower sends a packet
long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
if (qp.getType() == Leader.PING) {
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
ZooTrace.logQuorumPacket(LOG, traceMask, 'i', qp);
tickOfLastAck = leader.self.tick;
ByteBuffer bb;
long sessionId;
int cxid;
int type;
switch (qp.getType()) { // dispatch on the packet type
case Leader.ACK:
leader.processAck(qp.getZxid(), sock.getLocalSocketAddress()); // count this follower's ACK for the proposal
break;
case Leader.PING:
// Process the touches
ByteArrayInputStream bis = new ByteArrayInputStream(qp
.getData());
DataInputStream dis = new DataInputStream(bis); // the PING payload carries session ids and timeouts
while (dis.available() > 0) {
long sess = dis.readLong();
int to = dis.readInt();
leader.zk.touch(sess, to); // refresh that session's expiry
}
break;
case Leader.REVALIDATE:
bis = new ByteArrayInputStream(qp.getData()); // check whether the session is still alive
dis = new DataInputStream(bis);
long id = dis.readLong();
int to = dis.readInt();
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
dos.writeLong(id);
boolean valid = leader.zk.touch(id, to);
ZooTrace.logTraceMessage(LOG,
ZooTrace.SESSION_TRACE_MASK,
"Session 0x" + Long.toHexString(id)
+ " is valid: "+ valid);
dos.writeBoolean(valid);
qp.setData(bos.toByteArray());
queuedPackets.add(qp);
break;
case Leader.REQUEST:
bb = ByteBuffer.wrap(qp.getData()); // a client request forwarded by the follower
sessionId = bb.getLong();
cxid = bb.getInt();
type = bb.getInt();
bb = bb.slice();
if(type == OpCode.sync){
leader.zk.submitRequest(new FollowerSyncRequest(this, sessionId, cxid, type, bb,
qp.getAuthinfo())); // sync requests are wrapped together with this handler
} else {
leader.zk.submitRequest(null, sessionId, type, cxid, bb,
qp.getAuthinfo()); // everything else is submitted directly for processing
}
break;
default:
}
}
} catch (IOException e) {
if (sock != null && !sock.isClosed()) {
LOG.error("FIXMSG",e);
}
} catch (InterruptedException e) {
LOG.error("FIXMSG",e);
} finally {
LOG.warn("******* GOODBYE "
+ (sock != null ? sock.getRemoteSocketAddress() : "<null>") // log which follower disconnected
+ " ********");
// Send the packet of death
try {
queuedPackets.put(proposalOfDeath); // tell the sender thread to stop
} catch (InterruptedException e) {
LOG.error("FIXMSG",e);
}
shutdown(); // close the socket and remove this handler from the leader
}
}
public void shutdown() {
try {
if (sock != null && !sock.isClosed()) { // close the socket if it is still open
sock.close();
}
} catch (IOException e) {
LOG.error("FIXMSG",e);
}
leader.removeFollowerHandler(this); // remove this handler from the leader
}
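The run method begins by synchronizing data. It compares the follower's last zxid with the leader's committed log and, depending on the gap, sends the missing proposals (DIFF), asks the follower to truncate its log (TRUNC), or sends a full snapshot (SNAP). It then starts a separate thread that sends the queued packets to the follower, while the handler thread itself keeps reading and processing the packets the follower sends. The REQUEST branch also shows that followers forward client requests to the leader. In ZooKeeper these are the requests a follower cannot complete on its own, such as writes and sync; reads are served locally by the follower.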
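The choice between DIFF, TRUNC and SNAP is easier to follow as a pure function. The sketch below restates the branch order from the run listing under assumptions: SyncDecisionSketch, Mode, Decision and decide are hypothetical names, and the boolean hasCommittedLog stands in for both `committedLog.size() != 0` and the logTxns flag, which carry the same value in the listing.

```java
// Hypothetical restatement of the sync decision in FollowerHandler.run(); not ZooKeeper code.
final class SyncDecisionSketch {

    enum Mode { DIFF, TRUNC, SNAP }

    static final class Decision {
        final Mode mode;
        final long zxidToSend;

        Decision(Mode mode, long zxidToSend) {
            this.mode = mode;
            this.zxidToSend = zxidToSend;
        }
    }

    static Decision decide(long peerLastZxid, long leaderLastZxid,
                           boolean hasCommittedLog,
                           long minCommittedLog, long maxCommittedLog) {
        Mode mode = Mode.SNAP; // default: send a full snapshot
        long zxidToSend = 0;

        // The follower's zxid falls inside the in-memory committed log: send only the gap.
        if (hasCommittedLog
                && minCommittedLog <= peerLastZxid
                && peerLastZxid <= maxCommittedLog) {
            mode = Mode.DIFF;
            zxidToSend = maxCommittedLog;
        }

        // Special case: the follower is already at the leader's last zxid.
        if (peerLastZxid == leaderLastZxid) {
            mode = Mode.DIFF;
            zxidToSend = leaderLastZxid;
        }

        // The follower is ahead of everything the leader has committed: truncate.
        if (hasCommittedLog && peerLastZxid > maxCommittedLog) {
            mode = Mode.TRUNC;
            zxidToSend = maxCommittedLog;
        }
        return new Decision(mode, zxidToSend);
    }

    public static void main(String[] args) {
        long[][] cases = { {5, 10}, {1, 10}, {12, 10} };
        for (long[] c : cases) {
            Decision d = decide(c[0], c[1], true, 3, 10);
            System.out.println("peer=" + c[0] + " -> " + d.mode
                    + " up to zxid " + d.zxidToSend);
        }
    }
}
```

A follower whose zxid is older than minCommittedLog falls through every branch and gets a snapshot, which matches the comment in the listing about keeping things simple.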
private void sendPackets() throws InterruptedException {
long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
while (true) {
QuorumPacket p;
p = queuedPackets.take(); // block until a packet is queued
if (p == proposalOfDeath) { // the packet of death tells this loop to stop
// Packet of death!
break;
}
if (p.getType() == Leader.PING) { // pick the trace mask by packet type
traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
}
ZooTrace.logQuorumPacket(LOG, traceMask, 'o', p);
try {
oa.writeRecord(p, "packet"); // write the packet to the follower
bufferedOutput.flush();
} catch (IOException e) {
if (!sock.isClosed()) {
LOG.warn("Unexpected exception",e);
}
break;
}
}
}
Once started, this dedicated thread waits on the send queue and writes each packet it takes to the follower. That completes the main flow of the Leader role.
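The stop mechanism in sendPackets is the familiar "poison pill" pattern: a sentinel object is placed on the queue and recognized by reference. The following is a minimal sketch of that pattern with java.util.concurrent; SenderSketch, Packet and the sent list are hypothetical names, and the list stands in for writing to the socket.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of the sender loop; not ZooKeeper code.
final class SenderSketch {

    static final class Packet {
        final String payload;

        Packet(String payload) {
            this.payload = payload;
        }
    }

    // Sentinel compared by reference, like proposalOfDeath in the listing.
    static final Packet PACKET_OF_DEATH = new Packet("death");

    final BlockingQueue<Packet> queue = new LinkedBlockingQueue<>();
    // Stand-in for the socket: records the payloads that would have been written.
    final List<String> sent = Collections.synchronizedList(new ArrayList<String>());

    void sendPackets() throws InterruptedException {
        while (true) {
            Packet p = queue.take(); // blocks until something is queued
            if (p == PACKET_OF_DEATH) {
                break;               // stop; anything queued later is never sent
            }
            sent.add(p.payload);
        }
    }

    Thread startSender() {
        Thread t = new Thread(() -> {
            try {
                sendPackets();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
        }, "Sender-sketch");
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        SenderSketch s = new SenderSketch();
        Thread t = s.startSender();
        s.queue.put(new Packet("a"));
        s.queue.put(new Packet("b"));
        s.queue.put(PACKET_OF_DEATH);
        t.join();
        System.out.println("sent: " + s.sent);
    }
}
```

Using the queue itself to deliver the stop signal means the sender finishes everything queued before the sentinel, and no separate flag or interrupt is needed to wake a thread blocked in take.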
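Summary
This article walked through how the Leader role starts up. The leader first restores its local log and transaction data, then accepts connections from followers and compares each follower's data with its own; where they differ, it sends data to bring the follower in sync. After that it listens for and processes the packets followers send, including the client requests that followers forward to it. That is the basic flow. My knowledge is limited, so corrections are welcome.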