ZooKeeper leader election

Election environment

QuorumCnxManager

QuorumCnxManager
    QuorumCnxManager.Listener
    QuorumCnxManager.SendWorker
        final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;
    QuorumCnxManager.RecvWorker
        public final ArrayBlockingQueue<Message> recvQueue;

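QuorumCnxManager, Listener, SendWorker and RecvWorker have a clear division of labor. More precisely, the responsibility of the QuorumCnxManager class itself is clear:
it listens on the election port, sends messages, and reads messages. Specifically:

  • Listener accepts connections that other servers open to this one, subject to one condition, (sid < this.mySid), which took me a while to understand.
    It is a simple election-time rule: a server closes any connection opened by a node whose myid is smaller than its own and connects back to that node itself, so each pair of servers ends up with a single connection.
  • SendWorker sends (vote) messages to the corresponding server over the connection that Listener saved.
  • RecvWorker receives (vote) messages from other servers and puts them into a queue.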
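
The tie-break in the Listener bullet can be sketched as follows. This is only an illustration: the class ConnectionTieBreak and the method keepIncoming are hypothetical names, and the real handling in QuorumCnxManager also shuts down worker threads and reconnects.

```java
// Hypothetical illustration of the sid-based tie-break; not ZooKeeper's actual code.
class ConnectionTieBreak {
    /**
     * Returns true if the server with id mySid keeps a connection
     * that was opened by the server with id remoteSid.
     */
    static boolean keepIncoming(long remoteSid, long mySid) {
        // Mirrors the (sid < this.mySid) condition: connections opened by a
        // smaller sid are dropped, and the larger side connects back itself.
        return !(remoteSid < mySid);
    }

    public static void main(String[] args) {
        long mySid = 2;
        for (long remote : new long[] {1, 3}) {
            String action = keepIncoming(remote, mySid) ? "keep" : "drop and connect back";
            System.out.println("connection from sid " + remote + " to sid " + mySid + ": " + action);
        }
    }
}
```

With this rule, whichever side dials first, the surviving connection between two servers is always the one opened by the server with the larger sid.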

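Among QuorumCnxManager's inner classes there is only one that models data, Message.
QuorumCnxManager is only responsible for exchanging messages with other servers. It neither generates nor interprets them; processing the data is left to the election algorithm.
ZooKeeper provides several election algorithms, but the older ones are deprecated.
FastLeaderElection is the default, which corresponds to electionAlg=3 in the configuration file.
All message handling takes place in the election algorithm, and the server's state is changed there as well.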

QuorumPeer.createElectionAlgorithm
 protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;          
        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                listener.start();
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

FastLeaderElection

Message body definition
static public class ToSend {
    static enum mType {crequest, challenge, notification, ack}

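     ToSend(mType type,          // message type, one of the enum values above
            long leader,         // proposed leader, obtained from QuorumPeer
            long zxid,           // proposed transaction id, obtained from QuorumPeer
            long electionEpoch,  // logical clock
            ServerState state,   // server state
            long sid,            // sid of the server this message is addressed to
            long peerEpoch)      // epoch of the proposed leader

peerEpoch initially comes from QuorumPeer.getCurrentEpoch():
    public long getCurrentEpoch() throws IOException {
        if (currentEpoch == -1) {
            currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
        }
        return currentEpoch;
    }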
Message assembly (40 bytes in total)
static ByteBuffer buildMsg(int state,
            long leader,
            long zxid,
            long electionEpoch,
            long epoch) {
        byte requestBytes[] = new byte[40];
        ByteBuffer requestBuffer = ByteBuffer.wrap(requestBytes);

        /*
         * Building notification packet to send 
         */

        requestBuffer.clear();
        requestBuffer.putInt(state);
        requestBuffer.putLong(leader);
        requestBuffer.putLong(zxid);
        requestBuffer.putLong(electionEpoch);
        requestBuffer.putLong(epoch);
        requestBuffer.putInt(Notification.CURRENTVERSION);
        
        return requestBuffer;
    }
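
To make the 40-byte figure concrete, here is a small sketch that writes the same fields in the same order and reads them back. NotificationLayout is a hypothetical class, and CURRENT_VERSION is an assumed stand-in for Notification.CURRENTVERSION, not a value taken from the source above.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the notification layout: 4 + 8 + 8 + 8 + 8 + 4 = 40 bytes.
class NotificationLayout {
    // Assumed placeholder for Notification.CURRENTVERSION.
    static final int CURRENT_VERSION = 0x1;

    static ByteBuffer build(int state, long leader, long zxid,
                            long electionEpoch, long peerEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(40);
        buf.putInt(state);            // 4 bytes: sender state
        buf.putLong(leader);          // 8 bytes: proposed leader
        buf.putLong(zxid);            // 8 bytes: proposed zxid
        buf.putLong(electionEpoch);   // 8 bytes: logical clock
        buf.putLong(peerEpoch);       // 8 bytes: epoch of the proposed leader
        buf.putInt(CURRENT_VERSION);  // 4 bytes: message version
        buf.flip();                   // make the buffer readable from position 0
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = build(0, 3L, 0x100000005L, 7L, 1L);
        System.out.println("bytes=" + buf.remaining());
        System.out.println("state=" + buf.getInt() + " leader=" + buf.getLong()
                + " zxid=0x" + Long.toHexString(buf.getLong()));
    }
}
```

Unlike buildMsg above, this sketch flips the buffer before returning it, so a reader can consume it directly.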

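Two threads
  • WorkerSender takes messages from sendqueue and hands them to QuorumCnxManager, which puts each one on the queue for the target sid in queueSendMap to be sent.
  • WorkerReceiver does some light processing of each received message, inspects it, and where needed replies to the sending server with this server's current vote.
    In this version a message is 40 bytes in total.

These two threads handle sending and collecting messages, and both rely on QuorumCnxManager: outgoing messages go into its queueSendMap, and incoming messages are taken from its recvQueue.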

Main logic walkthrough

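1. response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS); ==>QuorumCnxManager.recvQueue.poll(timeout, unit);  
   This takes a message from QuorumCnxManager's receive queue.
2. Check whether the sender's myid is part of the voting view built from the configuration file. If it is not, reply to that server with the current vote.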
 .............................   
QuorumPeer
    public Map<Long,QuorumPeer.QuorumServer> getView() {
        return Collections.unmodifiableMap(this.quorumPeers);
    }

    /**
     * Observers are not contained in this view, only nodes with 
     * PeerType=PARTICIPANT.
     */
    public Map<Long,QuorumPeer.QuorumServer> getVotingView() {
        return QuorumPeer.viewToVotingView(getView());
    }

QuorumPeerMain
    quorumPeer.setQuorumPeers(config.getServers());   
.............................
    if(!self.getVotingView().containsKey(response.sid)){
          Vote current = self.getCurrentVote();
           ToSend notmsg = new ToSend(ToSend.mType.notification,
          current.getId(),
          current.getZxid(),
          logicalclock.get(),
          self.getPeerState(),
          response.sid,
          current.getPeerEpoch());
          sendqueue.offer(notmsg); 
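 3. If the sender is in the voting view, continue with the logic below.
    1. Validate the message. In earlier versions a message was 28 bytes, so anything shorter than 28 bytes is discarded; otherwise the buffer is reset with buffer.clear() (position=0).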
   /*
    * We check for 28 bytes for backward compatibility
   */
    if (response.buffer.capacity() < 28) {
          LOG.error("Got a short response: "
            + response.buffer.capacity());
               continue;
          }
        boolean backCompatibility = (response.buffer.capacity() == 28);
       response.buffer.clear();
    2. Read the message out of the buffer
        Notification n = new Notification();
                            // State of peer that sent this message
         QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
        switch (response.buffer.getInt()) {
        case 0:
          ackstate = QuorumPeer.ServerState.LOOKING;
            break;
         case 1:
          ackstate = QuorumPeer.ServerState.FOLLOWING;
           break;
         case 2:
          ackstate = QuorumPeer.ServerState.LEADING;
             break;
         case 3:
          ackstate = QuorumPeer.ServerState.OBSERVING;
              break;
          default:
                 continue;
            }
                            
           n.leader = response.buffer.getLong();
           n.zxid = response.buffer.getLong();
           n.electionEpoch = response.buffer.getLong();
           n.state = ackstate;
           n.sid = response.sid;
           if(!backCompatibility){
                 n.peerEpoch = response.buffer.getLong();
                } else {
                if(LOG.isInfoEnabled()){
                     LOG.info("Backward compatibility mode, server id=" + n.sid);
                    }
                n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
              }

              /*
                * Version added in 3.4.6
                */
           n.version = (response.buffer.remaining() >= 4) ? response.buffer.getInt() : 0x0;

     3. Handle the message according to the server states
          If this server's own state is LOOKING

                          if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                                recvqueue.offer(n);

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                 // If the sender is also LOOKING and its logical clock is behind ours, send it a notification carrying the leader we currently vote for (not necessarily ourselves).
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock.get())){
                                    Vote v = getVote();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock.get(),
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }
                            }
        If this server's own state is not LOOKING
                   
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                // If the requesting server is LOOKING, send it our current vote.
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + " recipient=" +
                                                response.sid + " zxid=0x" +
                                                Long.toHexString(current.getZxid()) +
                                                " leader=" + current.getId());
                                    }
                                    
                                    ToSend notmsg;
                                    if(n.version > 0x0) {
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                current.getId(),
                                                current.getZxid(),
                                                current.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                current.getPeerEpoch());
                                        
                                    } 
                                    else {
                                        Vote bcVote = self.getBCVote();
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                bcVote.getId(),
                                                bcVote.getZxid(),
                                                bcVote.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                bcVote.getPeerEpoch());
                                    }
                                    sendqueue.offer(notmsg);
                                }
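
The validation and parsing in steps 3.1 and 3.2 can be condensed into the sketch below. NotificationParser and Parsed are hypothetical names. For 28-byte messages the peer epoch is taken from the upper 32 bits of the zxid, which is what ZxidUtils.getEpochFromZxid is assumed to compute here.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of parsing a notification with backward compatibility.
class NotificationParser {
    static final class Parsed {
        int state;
        long leader, zxid, electionEpoch, peerEpoch;
        int version;
    }

    /** Returns null for messages shorter than the legacy 28-byte format. */
    static Parsed parse(ByteBuffer buf) {
        if (buf.capacity() < 28) {
            return null; // too short, discard
        }
        boolean backCompatibility = (buf.capacity() == 28);
        buf.clear(); // position = 0, limit = capacity
        Parsed p = new Parsed();
        p.state = buf.getInt();
        p.leader = buf.getLong();
        p.zxid = buf.getLong();
        p.electionEpoch = buf.getLong();
        // Legacy messages carry no peer epoch: derive it from the zxid (assumption).
        p.peerEpoch = backCompatibility ? (p.zxid >> 32L) : buf.getLong();
        // The version field only exists in newer messages.
        p.version = (buf.remaining() >= 4) ? buf.getInt() : 0x0;
        return p;
    }

    public static void main(String[] args) {
        ByteBuffer legacy = ByteBuffer.allocate(28)
                .putInt(1).putLong(2L).putLong((5L << 32) | 9L).putLong(4L);
        Parsed p = parse(legacy);
        System.out.println("peerEpoch=" + p.peerEpoch + " version=" + p.version);
    }
}
```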
                              
       

Election process

QuorumPeer.run()
{
           while (running) {
               switch (getPeerState()) {
               case LOOKING:
                   LOG.info("LOOKING");
                  ...
                  else {
                       try {
                           setBCVote(null);
                           setCurrentVote(makeLEStrategy().lookForLeader());
                       } catch (Exception e) {
                           LOG.warn("Unexpected exception", e);
                           setPeerState(ServerState.LOOKING);
                       }
                   }
                   break;
               case OBSERVING:
                   try {
                       LOG.info("OBSERVING");
                       setObserver(makeObserver(logFactory));
                       observer.observeLeader();
                   } catch (Exception e) {
                       LOG.warn("Unexpected exception",e );                        
                   } finally {
                       observer.shutdown();
                       setObserver(null);
                       setPeerState(ServerState.LOOKING);
                   }
                   break;
               case FOLLOWING:
                   try {
                       LOG.info("FOLLOWING");
                       setFollower(makeFollower(logFactory));
                       follower.followLeader();
                   } catch (Exception e) {
                       LOG.warn("Unexpected exception",e);
                   } finally {
                       follower.shutdown();
                       setFollower(null);
                       setPeerState(ServerState.LOOKING);
                   }
                   break;
               case LEADING:
                   LOG.info("LEADING");
                   try {
                       setLeader(makeLeader(logFactory));
                       leader.lead();
                       setLeader(null);
                   } catch (Exception e) {
                       LOG.warn("Unexpected exception",e);
                   } finally {
                       if (leader != null) {
                           leader.shutdown("Forcing shutdown");
                           setLeader(null);
                       }
                       setPeerState(ServerState.LOOKING);
                   }
                   break;
}

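makeLEStrategy().lookForLeader() is where the election actually starts

Main logic walkthrough

 1. Initialize some state
    HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();  // votes received in this round
    HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
    int notTimeout = finalizeWait;  // wait time, 200 ms by default
    synchronized(this){
           logicalclock.incrementAndGet(); // advance the logical clock
           updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()); // reset the current vote to ourselves
        }

2. Send out our own vote (to every voting server, including this one via loopback)
    sendNotifications();  // at this point every field, including the proposed leader id, is our own
       sendqueue.offer(notmsg);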
  WorkerSender.run 
        public void run() {
                while (!stop) {
                    try {
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        if(m == null) continue;
                        process(m);
=============================================================
                      manager.toSend(m.sid, requestBuffer);   
      public void toSend(Long sid, ByteBuffer b) {
        /*
         * If sending message to myself, then simply enqueue it (loopback).
         */
        if (this.mySid == sid) {  // the target is this server itself: put the message straight onto the receive queue
             b.position(0);
             addToRecvQueue(new Message(b.duplicate(), sid));
            /*
             * Otherwise send to the corresponding thread to send.
             */
        } else {
             /*
              * Start a new connection if doesn't have one already.
              */
             ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
             ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
             if (bqExisting != null) {
                 addToSendQueue(bqExisting, b);
             } else {
                 addToSendQueue(bq, b);  // ====> queue.add(buffer); queueSendMap maps each sid to its outgoing message queue
             }
             connectOne(sid);
                
        }
    }
        
          ...
 }    
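
The routing decision in toSend can be reduced to the sketch below. SendRouter is a hypothetical class, and both the capacity of 1 and the drop-oldest behaviour when a queue is full are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of loopback vs. per-sid send queues.
class SendRouter {
    static final int SEND_CAPACITY = 1; // assumed capacity per destination

    final long mySid;
    final ArrayBlockingQueue<ByteBuffer> recvQueue = new ArrayBlockingQueue<ByteBuffer>(100);
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>>();

    SendRouter(long mySid) {
        this.mySid = mySid;
    }

    void toSend(long sid, ByteBuffer b) {
        if (sid == mySid) {
            // Loopback: a message to ourselves never touches the network.
            b.position(0);
            recvQueue.offer(b.duplicate());
            return;
        }
        // Create the destination queue atomically if it does not exist yet.
        ArrayBlockingQueue<ByteBuffer> fresh = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
        ArrayBlockingQueue<ByteBuffer> existing = queueSendMap.putIfAbsent(sid, fresh);
        ArrayBlockingQueue<ByteBuffer> queue = (existing != null) ? existing : fresh;
        // Assumption: when the queue is full, the oldest message is dropped.
        if (!queue.offer(b)) {
            queue.poll();
            queue.offer(b);
        }
    }

    public static void main(String[] args) {
        SendRouter router = new SendRouter(1L);
        router.toSend(1L, ByteBuffer.allocate(40));
        router.toSend(2L, ByteBuffer.allocate(40));
        System.out.println("loopback=" + router.recvQueue.size()
                + " queuedFor2=" + router.queueSendMap.get(2L).size());
    }
}
```

Using putIfAbsent means two threads sending to the same sid for the first time still end up sharing one queue.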

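 3. This step takes messages from recvqueue, which WorkerReceiver fills from QuorumCnxManager's recvQueue, and when necessary
     asks QuorumCnxManager to connect and send to the other servers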
  while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
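
The timeout growth in this step is a capped doubling. A minimal sketch follows, with NotificationBackoff as a hypothetical class; the 200 ms start comes from finalizeWait above, while the 60000 ms cap is an assumed value for maxNotificationInterval.

```java
// Hypothetical sketch of the capped exponential backoff for notTimeout.
class NotificationBackoff {
    static final int FINALIZE_WAIT = 200;               // initial wait in ms
    static final int MAX_NOTIFICATION_INTERVAL = 60000; // assumed cap in ms

    /** Doubles the timeout until it reaches the cap. */
    static int next(int notTimeout) {
        int doubled = notTimeout * 2;
        return (doubled < MAX_NOTIFICATION_INTERVAL) ? doubled : MAX_NOTIFICATION_INTERVAL;
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT;
        for (int i = 0; i < 10; i++) {
            System.out.println("poll timeout: " + timeout + " ms");
            timeout = next(timeout);
        }
    }
}
```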

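 4. Process the messages returned by other servers in the ensemble. Messages from servers that are not in the configured voting view are skipped and a warning is logged.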
                     
                    if(self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) { case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break; 

                         case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }

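    4.1   First compare our logical clock with the one in the received message
                       If our logical clock is smaller: clear the votes collected so far (recvset), update our proposal, then send notifications
                       If our logical clock is larger: ignore the message
                       If the logical clocks are equal: compare the votes, and if the received one wins, adopt it and send notifications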
                           if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }
   4.2 Store the received vote in the recvset map, sid->vote
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
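
The comparison that step 4.1 relies on, totalOrderPredicate, orders votes by epoch, then zxid, then server id. The sketch below shows only that ordering. VoteOrder and supersedes are hypothetical names, and the real method may apply extra checks (for example on quorum weights) that are left out here.

```java
// Hypothetical sketch of the vote ordering: epoch first, then zxid, then server id.
class VoteOrder {
    /** Returns true if the new vote should replace the current one. */
    static boolean supersedes(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || (newEpoch == curEpoch
                    && (newZxid > curZxid
                        || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // Same epoch and zxid: the larger server id wins.
        System.out.println(supersedes(3, 9, 1, 1, 9, 1));
        // A higher zxid beats a larger server id.
        System.out.println(supersedes(1, 10, 1, 3, 9, 1));
    }
}
```

Because the ordering is total, all servers that see the same set of votes converge on the same proposal.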

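   5.  Check whether the votes received so far are enough to end this round. There are two quorum strategies, but a simple majority (more than half of the votes) is the one normally used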
         termPredicate =>self.getQuorumVerifier().containsQuorum(set);
         ==> 
          public boolean containsQuorum(HashSet<Long> set){
            return (set.size() > half);
         }

        if (termPredicate(recvset,
                 new Vote(proposedLeader, proposedZxid,
                     logicalclock.get(), proposedEpoch))) {
                            
                            // If we have more than half of the votes, wait a little longer to see whether the vote changes.
                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */

                            // This is where the server's state is changed.
                            // With votes from more than half of the servers, it can normally decide which role to take.
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;  // return the final vote
                            }
                        }
                        break;   
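
The majority test in step 5 can be sketched as: count the servers in recvset whose vote equals the current proposal, and compare against half of the ensemble. MajorityCheck and its nested Vote are hypothetical simplifications, and ensembleSize is assumed to be the number of voting servers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a majority-based termination check.
class MajorityCheck {
    static final class Vote {
        final long leader, zxid, electionEpoch, peerEpoch;

        Vote(long leader, long zxid, long electionEpoch, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Vote)) {
                return false;
            }
            Vote v = (Vote) o;
            return leader == v.leader && zxid == v.zxid
                    && electionEpoch == v.electionEpoch && peerEpoch == v.peerEpoch;
        }

        @Override
        public int hashCode() {
            return 31 * Long.hashCode(leader) + Long.hashCode(zxid);
        }
    }

    /** True if strictly more than half of the ensemble voted for the proposal. */
    static boolean hasQuorum(Map<Long, Vote> recvset, Vote proposal, int ensembleSize) {
        int half = ensembleSize / 2;
        int supporters = 0;
        for (Vote v : recvset.values()) {
            if (proposal.equals(v)) {
                supporters++;
            }
        }
        return supporters > half;
    }

    public static void main(String[] args) {
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();
        Vote proposal = new Vote(3, 9, 1, 1);
        recvset.put(1L, new Vote(3, 9, 1, 1));
        recvset.put(2L, new Vote(3, 9, 1, 1));
        recvset.put(3L, new Vote(2, 8, 1, 1));
        System.out.println("quorum reached: " + hasQuorum(recvset, proposal, 3));
    }
}
```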


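    6. 
     FOLLOWING and LEADING
       are handled by the same branch
          If the notification's election epoch equals our logical clock, record it in recvset and check whether a quorum backs its leader
          Otherwise, for example when this server is just joining an existing ensemble, put the notification into
          outofelection and verify that a majority follows the same leader; once verified, update our own state and return the final vote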
                     case FOLLOWING:
                     case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;                      
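
The idea behind the check in step 6 is that a server joining a running ensemble accepts a leader only if a majority reports following it and the leader itself confirms that it is leading. The sketch below is a deliberate simplification: JoinCheck, Report and canJoin are hypothetical names, support is counted by leader id only, and the election-epoch comparison that applies when the leader is this server is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sketch of the "join an established ensemble" check.
class JoinCheck {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    static final class Report {
        final long leader;  // leader this server says it follows
        final State state;  // state this server reports for itself

        Report(long leader, State state) {
            this.leader = leader;
            this.state = state;
        }
    }

    /** reports maps sid -> what that server reported. */
    static boolean canJoin(Map<Long, Report> reports, long leader, long selfId, int ensembleSize) {
        int supporters = 0;
        for (Report r : reports.values()) {
            if (r.leader == leader) {
                supporters++;
            }
        }
        boolean majority = supporters > ensembleSize / 2;

        boolean leaderConfirmed;
        if (leader == selfId) {
            leaderConfirmed = true; // simplified, see lead-in
        } else {
            Report fromLeader = reports.get(leader);
            leaderConfirmed = fromLeader != null && fromLeader.state == State.LEADING;
        }
        return majority && leaderConfirmed;
    }

    public static void main(String[] args) {
        Map<Long, Report> reports = new HashMap<Long, Report>();
        reports.put(1L, new Report(1L, State.LEADING));
        reports.put(2L, new Report(1L, State.FOLLOWING));
        System.out.println("server 3 may follow 1: " + canJoin(reports, 1L, 3L, 3));
    }
}
```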
