ZooKeeper Source Code Analysis: Leader Role Initialization

References

<<From Paxos to ZooKeeper: Distributed Consistency Principles and Practice>>
zookeeper-3.0.0

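Leader Role Initialization

After the election described in the previous article completes, every zk instance enters the role assigned to it by the election result. This article covers how the Leader is initialized.

Leader Initialization Flow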

case LEADING:
                LOG.info("LEADING");
                try {
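                    setLeader(makeLeader(logFactory));  // create the Leader and register it on this peer
                    leader.lead();  // run the leader main loop; returns only when leadership is given up
                    setLeader(null);  // leadership lost: clear the leader reference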
                } catch (Exception e) {
                    LOG.warn("Unexpected exception",e);
                } finally {
                    if (leader != null) {  // shut down any leftover leader and clear it
                        leader.shutdown("Forcing shutdown");
                        setLeader(null);
                    }
                    setPeerState(ServerState.LOOKING);
                }
                break;

When the peer's state becomes LEADING, it creates a Leader instance and calls that instance's lead method to start acting as the leader. First, look at the makeLeader method.

protected Leader makeLeader(FileTxnSnapLog logFactory) throws IOException {
        return new Leader(this, new LeaderZooKeeperServer(logFactory,
                this,new ZooKeeperServer.BasicDataTreeBuilder()));
    }

This creates a new Leader, passing in the QuorumPeer instance together with a newly created LeaderZooKeeperServer instance.

The Leader constructor is as follows:

    Leader(QuorumPeer self,LeaderZooKeeperServer zk) throws IOException {
        this.self = self;
        try {
            ss = new ServerSocket(self.getQuorumAddress().getPort());  // listen on the local quorum port
        } catch (BindException e) {
            LOG.error("Couldn't bind to port "
                    + self.getQuorumAddress().getPort());
            throw e;
        }
        this.zk=zk;
    }

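The constructor mainly stores the given instances and listens on the local quorum port. Next, the lead method of this class is executed.

The leader.lead Method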

    void lead() throws IOException, InterruptedException {
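        self.tick = 0;  // reset the tick counter
        zk.loadData();  // load the data and clean up sessions that are no longer alive
        zk.startup();  // create the session tracker and set up the request processor chain
        long epoch = self.getLastLoggedZxid() >> 32L;  // the epoch is the upper 32 bits of the last logged zxid
        epoch++;  // start a new epoch
        zk.setZxid(epoch << 32L);  // new zxid: new epoch, counter reset to 0
        zk.dataTree.lastProcessedZxid = zk.getZxid();  // record it as the last processed zxid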
        lastProposed = zk.getZxid();
        newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
                null, null);  // build the NEWLEADER packet
        if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {  // the counter (lower 32 bits) is expected to be 0
            LOG.warn("NEWLEADER proposal has Zxid of "
                    + newLeaderProposal.packet.getZxid());
        }
        outstandingProposals.add(newLeaderProposal);  // track it as an outstanding proposal
        
        // Start thread that waits for connection requests from 
        // new followers.
        cnxAcceptor = new FollowerCnxAcceptor();  // start the thread that accepts follower connections
        cnxAcceptor.start();
        
        // We have to get at least a majority of servers in sync with
        // us. We do this by waiting for the NEWLEADER packet to get
        // acknowledged
        newLeaderProposal.ackCount++;  // the leader counts as one ACK
        while (newLeaderProposal.ackCount <= self.quorumPeers.size() / 2) {  // wait until more than half of the ensemble has ACKed
            if (self.tick > self.initLimit) {  // initLimit ticks elapsed without a quorum
                // Followers aren't syncing fast enough,
                // renounce leadership!
                shutdown("Waiting for " + (self.quorumPeers.size() / 2)  // give up leadership; the peer returns to the election
                        + " followers, only synced with "
                        + newLeaderProposal.ackCount);
                if (followers.size() >= self.quorumPeers.size() / 2) {              
                    LOG.warn("Enough followers present. "+
                            "Perhaps the initTicks need to be increased.");
                }
                return;
            }
            Thread.sleep(self.tickTime);  // sleep for one tick
            self.tick++;  // advance the tick counter
        }
        if (!System.getProperty("zookeeper.leaderServes", "yes").equals("no")) {  // unless zookeeper.leaderServes is "no", the leader also serves client connections
            self.cnxnFactory.setZooKeeperServer(zk);
        }
        // Everything is a go, simply start counting the ticks
        // WARNING: I couldn't find any wait statement on a synchronized
        // block that would be notified by this notifyAll() call, so
        // I commented it out
        //synchronized (this) {
        //    notifyAll();
        //}
        // We ping twice a tick, so we only update the tick every other
        // iteration
        boolean tickSkip = true;

        while (true) {
            Thread.sleep(self.tickTime / 2);  // sleep for half a tick
            if (!tickSkip) { 
                self.tick++;
            }
            int syncedCount = 0;
            // lock on the followers when we use it.
            synchronized (followers) {  // count the followers that are in sync and ping each of them
                for (FollowerHandler f : followers) {
                    if (f.synced()) {
                        syncedCount++;
                    }
                    f.ping();  // send a ping
                }
            }
            if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) {  // too few synced followers: shut down and go back to the election
                // Lost quorum, shutdown
                shutdown("Only " + syncedCount + " followers, need "
                        + (self.quorumPeers.size() / 2));
                // make sure the order is the same!
                // the leader goes to looking
                return;
            }
            tickSkip = !tickSkip;
        }
    }

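The lead method first loads the data and session information and then sets up the request processor chain. It then waits for the followers to acknowledge the NEWLEADER packet and leaves the wait loop once more than half of the ensemble, counting the leader itself, has acknowledged. If that has not happened after initLimit ticks, it shuts down and returns so that a new election can start. After that, twice per tick it pings every follower and counts the ones that are still in sync; once per tick, if fewer than quorumPeers.size() / 2 followers are in sync, it shuts down and the peer goes back to the election.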
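
The zxid arithmetic and the majority test used in lead can be tried in isolation. The following is only a minimal sketch: the class ZxidSketch and its method names are made up for illustration and are not part of ZooKeeper; only the shift, mask and comparison expressions are taken from the code above.

```java
/** Hypothetical helper that mirrors the zxid arithmetic and quorum test in lead(). */
final class ZxidSketch {
    // The upper 32 bits of a zxid hold the epoch.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // The lower 32 bits of a zxid hold the per-epoch counter.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // First zxid of the next epoch: bump the epoch, counter starts at 0.
    static long firstZxidOfNextEpoch(long lastLoggedZxid) {
        return (epochOf(lastLoggedZxid) + 1) << 32L;
    }

    // lead() keeps waiting while ackCount <= peers / 2,
    // so a quorum is reached once ackCount > peers / 2.
    static boolean hasQuorum(int ackCount, int peers) {
        return ackCount > peers / 2;
    }

    public static void main(String[] args) {
        long last = (3L << 32) | 42L;                       // epoch 3, counter 42
        long next = firstZxidOfNextEpoch(last);
        System.out.println("0x" + Long.toHexString(next));  // prints 0x400000000
        System.out.println(counterOf(next));                // prints 0
        System.out.println(hasQuorum(2, 3));                // prints true
        System.out.println(hasQuorum(2, 4));                // prints false
    }
}
```

With an ensemble of 3 the leader needs its own ACK plus one follower; with 4 it needs its own ACK plus two followers.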

FollowerCnxAcceptor: Accepting Connections from Followers

class FollowerCnxAcceptor extends Thread{
        private volatile boolean stop = false;
        
        @Override
        public void run() {
            try {
                while (!stop) {  // loop until halt() is called
                    try{
                        Socket s = ss.accept();  // accept a connection from a follower
                        s.setSoTimeout(self.tickTime * self.syncLimit);  // read timeout for this connection
                        s.setTcpNoDelay(true);  // enable TCP_NODELAY (disables Nagle's algorithm)
                        new FollowerHandler(s, Leader.this);  // create a FollowerHandler for this connection
                    } catch (SocketException e) {
                        if (stop) {
                            LOG.info("exception while shutting down acceptor: "
                                    + e);

                            // When Leader.shutdown() calls ss.close(),
                            // the call to accept throws an exception.
                            // We catch and set stop to true.
                            stop = true;
                        } else {
                            throw e;
                        }
                    }
                }
            } catch (Exception e) {
                LOG.warn("Exception while accepting follower", e);
            }
        }
        
        public void halt() {
            stop = true;
        }
    }

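A dedicated thread accepts connections from the followers. For each accepted connection it sets the read timeout to tickTime * syncLimit, enables TCP_NODELAY, and creates a FollowerHandler. Enabling TCP_NODELAY disables Nagle's algorithm, so small packets are sent immediately instead of being held back until earlier data has been acknowledged; this lowers latency at the cost of sending more, smaller segments.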
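
The per-connection setup can be reproduced with plain java.net sockets. This is a sketch under stated assumptions, not ZooKeeper code: AcceptorSketch and acceptConfigured are hypothetical names, the values 2000 and 5 for tickTime and syncLimit are arbitrary, and the server binds to an ephemeral loopback port instead of the quorum port.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical stand-in for the socket setup done in FollowerCnxAcceptor.run(). */
final class AcceptorSketch {
    // Accept one connection and configure it the way the acceptor thread does.
    static Socket acceptConfigured(ServerSocket ss, int tickTime, int syncLimit)
            throws IOException {
        Socket s = ss.accept();
        s.setSoTimeout(tickTime * syncLimit); // blocking reads time out after this many ms
        s.setTcpNoDelay(true);                // disable Nagle's algorithm
        return s;
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port (assumption for the demo only).
        try (ServerSocket ss = new ServerSocket(0, 50, loopback);
             Socket client = new Socket(loopback, ss.getLocalPort());
             Socket accepted = acceptConfigured(ss, 2000, 5)) {
            System.out.println("read timeout ms: " + accepted.getSoTimeout());
            System.out.println("TCP_NODELAY: " + accepted.getTcpNoDelay());
        }
    }
}
```

The read timeout matters for failure detection: a follower that sends nothing for tickTime * syncLimit milliseconds makes the handler's blocking read fail, which ends that handler.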

FollowerHandler Processing Flow

The FollowerHandler class performs the actual work of exchanging messages with a single follower.

    FollowerHandler(Socket sock, Leader leader) throws IOException {
        super("FollowerHandler-" + sock.getRemoteSocketAddress()); 
        this.sock = sock;
        this.leader = leader;
        leader.addFollowerHandler(this);  // add this handler to the leader's followers list
        start();  // start the thread, which executes run()
    }

Because this class extends Thread, calling start executes the run method on a new thread:

    @Override
    public void run() {
        try {

            ia = BinaryInputArchive.getArchive(new BufferedInputStream(sock
                    .getInputStream()));  // set up the input stream
            bufferedOutput = new BufferedOutputStream(sock.getOutputStream());  // set up the output stream
            oa = BinaryOutputArchive.getArchive(bufferedOutput);

            QuorumPacket qp = new QuorumPacket();  // packet to read into
            ia.readRecord(qp, "packet");  // read the first packet from the follower
            if (qp.getType() != Leader.LASTZXID) {  // the first packet must be of type LASTZXID
                LOG.error("First packet " + qp.toString()
                        + " is not LASTZXID!");  // otherwise log an error and return
                return;
            }
            long peerLastZxid = qp.getZxid();  // the follower's last zxid
            int packetToSend = Leader.SNAP;
            boolean logTxns = true;

            long zxidToSend = 0;
            // we are sending the diff
            synchronized(leader.zk.committedLog) {  // lock the committed log while inspecting it
                if (leader.zk.committedLog.size() != 0) {  // the committed log is not empty
                    if ((leader.zk.maxCommittedLog >= peerLastZxid)  // the follower's zxid is not beyond the newest committed entry
                            && (leader.zk.minCommittedLog <= peerLastZxid)) {  // and not older than the oldest one
                        packetToSend = Leader.DIFF;
                        zxidToSend = leader.zk.maxCommittedLog;  // the follower will end up at the newest committed zxid
                        for (Proposal propose: leader.zk.committedLog) {  // walk the committed proposals
                            if (propose.packet.getZxid() > peerLastZxid) {  // only those the follower has not seen
                                queuePacket(propose.packet);  // queue the proposal for the follower
                                QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
                                        null, null);
                                queuePacket(qcommit);  // followed by its COMMIT

                            }
                        }
                    }
                }
                else {
                    logTxns = false;
                }
            }
            long leaderLastZxid = leader.startForwarding(this, peerLastZxid);  // start forwarding proposals to this follower
            QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
                    leaderLastZxid, null, null);  // build the NEWLEADER packet
            oa.writeRecord(newLeaderQP, "packet");  // send it
            bufferedOutput.flush();
            // a special case when both the ids are the same
            if (peerLastZxid == leaderLastZxid) {  // the follower is already at the leader's last zxid
                packetToSend = Leader.DIFF;  // nothing is missing, so a DIFF is enough
                zxidToSend = leaderLastZxid;
            }
            //check if we decided to send a diff or we need to send a truncate
            // we avoid using epochs for truncating because epochs make things
            // complicated. Two epochs might have the last 32 bits as same.
            // only if we know that there is a committed zxid in the queue that
            // is less than the one the peer has we send a trunc else to make
            // things simple we just send sanpshot.
            if (logTxns && (peerLastZxid > leader.zk.maxCommittedLog)) {
                // this is the only case that we are sure that
                // we can ask the follower to truncate the log
                packetToSend = Leader.TRUNC;  // ask the follower to truncate its log
                zxidToSend = leader.zk.maxCommittedLog;

            }
            oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");  // send the chosen packet type and zxid
            bufferedOutput.flush();
            // only if we are not truncating or fast syncing
            if (packetToSend == Leader.SNAP) {  // a full snapshot is needed
                LOG.warn("Sending snapshot last zxid of peer is 0x"
                        + Long.toHexString(peerLastZxid) + " " 
                        + " zxid of leader is 0x"
                        + Long.toHexString(leaderLastZxid));
                // Dump data to follower
                leader.zk.serializeSnapshot(oa);  // serialize the snapshot to the follower
                oa.writeString("BenWasHere", "signature"); 
            }
            bufferedOutput.flush();
            //
            // Mutation packets will be queued during the serialize,
            // so we need to mark when the follower can actually start
            // using the data
            //
            queuedPackets
                    .add(new QuorumPacket(Leader.UPTODATE, -1, null, null));  // queue UPTODATE so the follower knows when it may use the data

            // Start sending packets
            new Thread() {
                public void run() {
                    Thread.currentThread().setName(
                            "Sender-" + sock.getRemoteSocketAddress());
                    try {
                        sendPackets();  // runs on the new sender thread
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted",e);
                    }
                }
            }.start();

            while (true) {
                qp = new QuorumPacket();  // packet to read into
                ia.readRecord(qp, "packet");  // read the next packet from the follower

                long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
                if (qp.getType() == Leader.PING) {
                    traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
                }
                ZooTrace.logQuorumPacket(LOG, traceMask, 'i', qp);
                tickOfLastAck = leader.self.tick;


                ByteBuffer bb;
                long sessionId;
                int cxid;
                int type;

                switch (qp.getType()) {  // dispatch on the packet type
                case Leader.ACK:
                    leader.processAck(qp.getZxid(), sock.getLocalSocketAddress());  // record the follower's ACK for this zxid
                    break;
                case Leader.PING:
                    // Process the touches
                    ByteArrayInputStream bis = new ByteArrayInputStream(qp
                            .getData());
                    DataInputStream dis = new DataInputStream(bis);  // the ping payload is a list of (session id, timeout) pairs
                    while (dis.available() > 0) {
                        long sess = dis.readLong();
                        int to = dis.readInt();
                        leader.zk.touch(sess, to);  // refresh the session
                    }
                    break;
                case Leader.REVALIDATE:
                    bis = new ByteArrayInputStream(qp.getData());  // check whether the session is still alive
                    dis = new DataInputStream(bis);
                    long id = dis.readLong();
                    int to = dis.readInt();
                    ByteArrayOutputStream bos = new ByteArrayOutputStream();
                    DataOutputStream dos = new DataOutputStream(bos);
                    dos.writeLong(id);
                    boolean valid = leader.zk.touch(id, to);
                    ZooTrace.logTraceMessage(LOG,
                                             ZooTrace.SESSION_TRACE_MASK,
                                             "Session 0x" + Long.toHexString(id)
                                             + " is valid: "+ valid);
                    dos.writeBoolean(valid);
                    qp.setData(bos.toByteArray());
                    queuedPackets.add(qp);
                    break;
                case Leader.REQUEST:
                    bb = ByteBuffer.wrap(qp.getData());  // a request forwarded by the follower
                    sessionId = bb.getLong();
                    cxid = bb.getInt();
                    type = bb.getInt();
                    bb = bb.slice();
                    if(type == OpCode.sync){
                        leader.zk.submitRequest(new FollowerSyncRequest(this, sessionId, cxid, type, bb,
                                qp.getAuthinfo()));  // sync: submit it as a FollowerSyncRequest
                    } else {
                        leader.zk.submitRequest(null, sessionId, type, cxid, bb,
                                qp.getAuthinfo());  // anything else: submit it directly for processing
                    }
                    }
                    break;
                default:
                }
            }
        } catch (IOException e) {
            if (sock != null && !sock.isClosed()) {
                LOG.error("FIXMSG",e);
            }
        } catch (InterruptedException e) {
            LOG.error("FIXMSG",e);
        } finally {
            LOG.warn("******* GOODBYE " 
                    + (sock != null ? sock.getRemoteSocketAddress() : "<null>")  // log the disconnect
                    + " ********");
            // Send the packet of death
            try {
                queuedPackets.put(proposalOfDeath);  // tell the sender thread to stop
            } catch (InterruptedException e) {
                LOG.error("FIXMSG",e);
            }
            shutdown();  // close the socket and remove this handler from the leader
        }
    }

    public void shutdown() {
        try {
            if (sock != null && !sock.isClosed()) {  // close the socket if it is still open
                sock.close();
            }
        } catch (IOException e) {
            LOG.error("FIXMSG",e);
        }
        leader.removeFollowerHandler(this);  // remove this handler from the leader
    }

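The run method first synchronizes data. It reads the follower's last zxid from the first packet and compares it with the leader's committed log; depending on the result it sends the missing proposals (DIFF), tells the follower to truncate its log (TRUNC), or sends a full snapshot (SNAP). Once that is decided, it starts a separate thread that sends the queued packets to the follower, while the handler thread itself keeps reading the follower's packets and processing them. This flow also shows that a follower forwards client requests to the leader as REQUEST packets, so requests that have to go through the leader, such as writes and sync, are processed by the leader alone.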
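
The choice between SNAP, DIFF and TRUNC in run can be written as a pure function, which makes the cases easier to check. The following is a sketch that follows the branch structure of the code above; SyncDecisionSketch, PacketType and choose are hypothetical names, and the committed-log state is passed in as plain parameters instead of being read from leader.zk.

```java
/** Hypothetical extraction of the packet-type decision made in FollowerHandler.run(). */
final class SyncDecisionSketch {
    enum PacketType { SNAP, DIFF, TRUNC }

    static PacketType choose(long peerLastZxid, long leaderLastZxid,
                             boolean committedLogEmpty,
                             long minCommittedLog, long maxCommittedLog) {
        PacketType packetToSend = PacketType.SNAP;   // default: full snapshot
        boolean logTxns = true;
        if (!committedLogEmpty) {
            // The follower's zxid lies inside the committed log window.
            if (maxCommittedLog >= peerLastZxid && minCommittedLog <= peerLastZxid) {
                packetToSend = PacketType.DIFF;
            }
        } else {
            logTxns = false;
        }
        // Special case: the follower is already at the leader's last zxid.
        if (peerLastZxid == leaderLastZxid) {
            packetToSend = PacketType.DIFF;
        }
        // The follower is ahead of everything committed: ask it to truncate.
        if (logTxns && peerLastZxid > maxCommittedLog) {
            packetToSend = PacketType.TRUNC;
        }
        return packetToSend;
    }

    public static void main(String[] args) {
        // Committed log window is [3, 10] and the leader's last zxid is 10.
        System.out.println(choose(5, 10, false, 3, 10));   // prints DIFF
        System.out.println(choose(1, 10, false, 3, 10));   // prints SNAP
        System.out.println(choose(12, 10, false, 3, 10));  // prints TRUNC
        System.out.println(choose(7, 10, true, 0, 0));     // prints SNAP
    }
}
```

With an empty committed log the function never returns TRUNC, because logTxns is false in that case, matching the comment in the source that a truncate is only sent when the leader is sure it is safe.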

    private void sendPackets() throws InterruptedException {
        long traceMask = ZooTrace.SERVER_PACKET_TRACE_MASK;
        while (true) {
            QuorumPacket p;
            p = queuedPackets.take();  // block until a packet is queued

            if (p == proposalOfDeath) {  // sentinel: stop the loop
                // Packet of death!
                break;
            }
            if (p.getType() == Leader.PING) {  // pings are traced under a separate mask
                traceMask = ZooTrace.SERVER_PING_TRACE_MASK;
            }
            ZooTrace.logQuorumPacket(LOG, traceMask, 'o', p);
            try {
                oa.writeRecord(p, "packet");  // send the packet
                bufferedOutput.flush();
            } catch (IOException e) {
                if (!sock.isClosed()) {
                    LOG.warn("Unexpected exception",e);
                }
                break;
            }
        }
    }

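Once started, a separate thread watches the send queue, taking packets from it and sending them to the follower. At this point the main flow of the Leader role is essentially complete.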
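
The way sendPackets stops, by comparing each dequeued packet with a special sentinel instance, is the common "poison pill" pattern for blocking queues. Below is a minimal sketch of that pattern with java.util.concurrent; PoisonPillSketch, Packet and drain are hypothetical names, and the packets are collected into a list instead of being written to a socket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical illustration of the sentinel-terminated sender loop. */
final class PoisonPillSketch {
    static final class Packet {
        final String type;
        Packet(String type) { this.type = type; }
    }

    // Sentinel compared by identity, like proposalOfDeath in FollowerHandler.
    static final Packet PACKET_OF_DEATH = new Packet("DEATH");

    // Take packets until the sentinel shows up; return the types that were "sent".
    static List<String> drain(LinkedBlockingQueue<Packet> queue) throws InterruptedException {
        List<String> sent = new ArrayList<>();
        while (true) {
            Packet p = queue.take();          // blocks while the queue is empty
            if (p == PACKET_OF_DEATH) {
                break;                        // stop without sending the sentinel
            }
            sent.add(p.type);
        }
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Packet> queue = new LinkedBlockingQueue<>();
        List<String> result = new ArrayList<>();
        Thread sender = new Thread(() -> {
            try {
                result.addAll(drain(queue));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "Sender-sketch");
        sender.start();
        queue.put(new Packet("NEWLEADER"));
        queue.put(new Packet("UPTODATE"));
        queue.put(PACKET_OF_DEATH);
        queue.put(new Packet("IGNORED"));     // queued after the sentinel, never sent
        sender.join();
        System.out.println(result);           // prints [NEWLEADER, UPTODATE]
    }
}
```

Because the sentinel travels through the same queue as the data, everything queued before it is still sent in order, and the sender thread needs no separate stop flag.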

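Summary

This article analyzed the startup flow of the Leader role. The leader first restores its local log and transaction data, then accepts connections from the followers and checks whether each follower's data matches its own; if it does not, the leader sends data to the follower to bring it in sync. It then listens for the followers' responses and requests and processes them, including client requests that a follower receives and forwards to the leader. That is the basic flow. My knowledge is limited, so corrections are welcome.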
