Zookeeper Source Code Analysis: Server-Client Network Communication

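1. Overview

The previous post analyzed the client-side network communication code in Zookeeper. This post turns to the server side. Server-side communication falls into two parts: communication between servers, and communication between a server and its clients. To keep the walkthrough clear, this post covers only the server-to-client part of the server code, and relies mainly on annotated source listings.

While reading the source I also came across some rather interesting comments.

All articles and diagrams on this blog are original. If you repost, please include a link to the original article.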

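2. Core Classes Involved

2.1 Core class overview

ZooKeeperServerMain: the core startup class of the ZkServer;
ServerCnxnFactory: the factory that manages server-side connections (factory pattern);
NettyServerCnxnFactory: the Netty implementation of the server-side connection factory (a concrete ServerCnxnFactory);
NettyServerCnxn: the Netty implementation of the server-side manager for a single connection;
RequestProcessor: the request processor interface; classes that implement it can serve as nodes in the Request Processor Pipeline;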

3. Core Source Walkthrough

3.1 Establishing the Netty network connection in standalone mode

// ZooKeeperServerMain.java
public static void main(String[] args) {
    ZooKeeperServerMain main = new ZooKeeperServerMain();
    try {
        // Initialize and run the Zookeeper server from the command-line arguments
        main.initializeAndRun(args);
    }
    // catch blocks omitted...
    System.exit(0);
}

// ZooKeeperServerMain.java
protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    try {
        ManagedUtil.registerLog4jMBeans();
    }
    // catch block omitted...

    // Parse the command-line arguments into a ServerConfig instance
    ServerConfig config = new ServerConfig();
    if (args.length == 1) {
        config.parse(args[0]);
    } else {
        config.parse(args);
    }

    // Create and start the Zookeeper server
    runFromConfig(config);
}

// ZooKeeperServerMain.java
public void runFromConfig(ServerConfig config) throws IOException, AdminServerException {
    FileTxnSnapLog txnLog = null;
    try {
        // Create the local file-based transaction log and snapshot store
        txnLog = new FileTxnSnapLog(config.dataLogDir, config.dataDir);
        // Create the Zookeeper server instance
        final ZooKeeperServer zkServer = new ZooKeeperServer(txnLog, config.tickTime, config.minSessionTimeout, config.maxSessionTimeout, null);
        // Hand the server's statistics object to the transaction log store
        txnLog.setServerStats(zkServer.serverStats());

        boolean needStartZKServer = true;
        if (config.getClientPortAddress() != null) {
            // Create a ServerCnxnFactory through the static createFactory method
            // (cnxnFactory is a field of ZooKeeperServerMain)
            cnxnFactory = ServerCnxnFactory.createFactory();
            // Configure the client port from the ClientPortAddress setting
            cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), false);
            // Start zkServer through the ServerCnxnFactory
            cnxnFactory.startup(zkServer);
            // zkServer is already started here, so secureCnxnFactory does not need to start it again
            needStartZKServer = false;
        }

        // Initialization and configuration of the other components, including secureCnxnFactory, omitted...
    } finally {
        if (txnLog != null) {
            txnLog.close();
        }
    }
}

// ServerCnxnFactory.java
static public ServerCnxnFactory createFactory() throws IOException {
    // Read the class name of the ServerCnxnFactory to create from a JVM system property
    String serverCnxnFactoryName = System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);

    if (serverCnxnFactoryName == null) {
        // If the system property is not set, default to NIOServerCnxnFactory, the JDK NIO implementation
        serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
    }
    try {
        // Instantiate the ServerCnxnFactory by calling its no-argument constructor through reflection
        ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName).getDeclaredConstructor().newInstance();
        return serverCnxnFactory;
    }
    // catch block omitted...
}
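The reflection-based selection in createFactory can be tried out in isolation. The following is a minimal sketch using only the JDK; ConnectionFactory, NioConnectionFactory, NettyConnectionFactory and the property key sketch.connectionFactory are hypothetical stand-ins invented for illustration, not ZooKeeper classes.

```java
// Hypothetical stand-in for ServerCnxnFactory.
interface ConnectionFactory {
    String name();
}

// Hypothetical default implementation (plays the role of NIOServerCnxnFactory).
final class NioConnectionFactory implements ConnectionFactory {
    public String name() {
        return "nio";
    }
}

// Hypothetical alternative implementation (plays the role of NettyServerCnxnFactory).
final class NettyConnectionFactory implements ConnectionFactory {
    public String name() {
        return "netty";
    }
}

final class FactorySelectionSketch {
    // Hypothetical property key, used only by this sketch.
    static final String FACTORY_PROPERTY = "sketch.connectionFactory";

    static ConnectionFactory createFactory() {
        // Read the implementation class name from a JVM system property.
        String className = System.getProperty(FACTORY_PROPERTY);
        if (className == null) {
            // Fall back to the default implementation when the property is unset.
            className = NioConnectionFactory.class.getName();
        }
        try {
            // Instantiate through the no-argument constructor, as createFactory does.
            return (ConnectionFactory) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Couldn't instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(createFactory().name());
        System.setProperty(FACTORY_PROPERTY, NettyConnectionFactory.class.getName());
        System.out.println(createFactory().name());
    }
}
```

In ZooKeeper itself the property behind ZOOKEEPER_SERVER_CNXN_FACTORY is zookeeper.serverCnxnFactory, so the Netty implementation is typically selected by starting the JVM with -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory.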

// ServerCnxnFactory.java
public void startup(ZooKeeperServer zkServer) throws IOException, InterruptedException {
    // Delegate to the two-argument overload and ask it to start zkServer as well
    startup(zkServer, true);
}

// NettyServerCnxnFactory.java
public void startup(ZooKeeperServer zks, boolean startServer) throws IOException, InterruptedException {
    // Bind the Netty listening port
    start();
    // Link zkServer and the ServerCnxnFactory to each other
    setZooKeeperServer(zks);
    if (startServer) {
        // Start zkServer
        zks.startdata();
        zks.startup();
    }
}

// NettyServerCnxnFactory.java
public void start() {
    // Bind the listening port
    parentChannel = bootstrap.bind(localAddress).syncUninterruptibly().channel();

    // If the requested port was 0, the actual port is only known after bind returns,
    // so update localAddress to pick up the real port
    localAddress = (InetSocketAddress) parentChannel.localAddress();
}
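The reason start() re-reads localAddress after binding is that port 0 asks the operating system to pick a free port. A minimal sketch of that behaviour, using a plain java.net.ServerSocket instead of Netty (EphemeralPortSketch and bindAndReportPort are hypothetical names used only here):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

final class EphemeralPortSketch {
    // Binds to the requested port on the loopback address and returns the port
    // that was actually bound. Passing 0 lets the OS choose a free port.
    static int bindAndReportPort(int requestedPort) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), requestedPort));
            // Only after bind() does the socket know its real port.
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("requested 0, bound " + bindAndReportPort(0));
    }
}
```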

// NettyServerCnxnFactory.java
final public void setZooKeeperServer(ZooKeeperServer zks) {
    // Link zkServer and the ServerCnxnFactory to each other
    this.zkServer = zks;
    if (zks != null) {
        if (secure) {
            zks.setSecureServerCnxnFactory(this);
        } else {
            zks.setServerCnxnFactory(this);
        }
    }
}

3.2 Configuring Netty

// NettyServerCnxnFactory.java
NettyServerCnxnFactory() {
    x509Util = new ClientX509Util();

    // Create one boss thread per local network address that clients can reach,
    // preferring the higher-performance EpollEventLoopGroup when it is available
    EventLoopGroup bossGroup = NettyUtils.newNioOrEpollEventLoopGroup(NettyUtils.getClientReachableLocalInetAddressCount());
    // Create the workerGroup, again preferring EpollEventLoopGroup
    EventLoopGroup workerGroup = NettyUtils.newNioOrEpollEventLoopGroup();
    ServerBootstrap bootstrap = new ServerBootstrap()
            .group(bossGroup, workerGroup)
            .channel(NettyUtils.nioOrEpollServerSocketChannel())
            // Options for the parent (listening) Channel
            .option(ChannelOption.SO_REUSEADDR, true)
            // Options for the child (per-connection) Channels
            .childOption(ChannelOption.TCP_NODELAY, true)
            .childOption(ChannelOption.SO_LINGER, -1)
            .childHandler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) throws Exception {
                    ChannelPipeline pipeline = ch.pipeline();
                    if (secure) {
                        initSSL(pipeline);
                    }
                    // Add the channelHandler to the pipeline
                    pipeline.addLast("servercnxnfactory", channelHandler);
                }
            });
    this.bootstrap = configureBootstrapAllocator(bootstrap);
    this.bootstrap.validate();
}

// NettyServerCnxnFactory.CnxnChannelHandler.java
public void channelActive(ChannelHandlerContext ctx) throws Exception {
    // Netty calls this method once the Channel has become active

    // Create a NettyServerCnxn bound to the current Channel and zkServer
    NettyServerCnxn cnxn = new NettyServerCnxn(ctx.channel(), zkServer, NettyServerCnxnFactory.this);
    // Store the NettyServerCnxn in a Channel attribute (it is looked up when requests arrive)
    ctx.channel().attr(CONNECTION_ATTRIBUTE).set(cnxn);

    if (secure) {
        SslHandler sslHandler = ctx.pipeline().get(SslHandler.class);
        Future<Channel> handshakeFuture = sslHandler.handshakeFuture();
        handshakeFuture.addListener(new CertificateVerifier(sslHandler, cnxn));
    } else {
        // Track the Channel and the NettyServerCnxn in their respective collections
        allChannels.add(ctx.channel());
        addCnxn(cnxn);
    }
}

// NettyServerCnxnFactory.CnxnChannelHandler.java
public void channelInactive(ChannelHandlerContext ctx) throws Exception {
    // Netty calls this method once the Channel has become inactive

    // Remove the Channel from the allChannels collection
    allChannels.remove(ctx.channel());
    // Unbind the NettyServerCnxn from the Channel
    NettyServerCnxn cnxn = ctx.channel().attr(CONNECTION_ATTRIBUTE).getAndSet(null);
    if (cnxn != null) {
        // Close the NettyServerCnxn
        cnxn.close();
    }
}

3.3 Receiving and processing requests

// NettyServerCnxnFactory.CnxnChannelHandler.java
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    try {
        try {
            // channelActive registered the NettyServerCnxn as a Channel attribute during setup
            NettyServerCnxn cnxn = ctx.channel().attr(CONNECTION_ATTRIBUTE).get();
            if (cnxn == null) {
                LOG.error("channelRead() on a closed or closing NettyServerCnxn");
            } else {
                // The NettyServerCnxn is neither closed nor closing, so let processMessage handle the request
                cnxn.processMessage((ByteBuf) msg);
            }
        }
        // catch block omitted...
    } finally {
        // Release the buffer
        ReferenceCountUtil.release(msg);
    }
}

// NettyServerCnxn.java
void processMessage(ByteBuf buf) {
    checkIsInEventLoop("processMessage");
    if (throttled.get()) {
        // While throttled, queue the incoming data instead of processing it
        if (queuedBuffer == null) {
            queuedBuffer = channel.alloc().compositeBuffer();
        }
        // Append to queuedBuffer
        appendToQueuedBuffer(buf.retainedDuplicate());
    } else {
        if (queuedBuffer != null) {
            // Data is already queued, so the new data is appended behind it
            appendToQueuedBuffer(buf.retainedDuplicate());
            // This method also handles the case where the Channel is closing,
            // but the queued data itself is still processed through receiveMessage
            processQueuedBuffer();
        } else {
            // Process the incoming data with receiveMessage
            receiveMessage(buf);
            // Check again whether the channel is closing, because an error inside
            // receiveMessage may have caused close() to be called
            if (!closingChannel && buf.isReadable()) {
                if (queuedBuffer == null) {
                    queuedBuffer = channel.alloc().compositeBuffer();
                }
                appendToQueuedBuffer(buf.retainedSlice(buf.readerIndex(), buf.readableBytes()));
            }
        }
    }
}

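About Netty's CompositeByteBuf:

CompositeByteBuf is used for aggregation: when several buffers need to be merged, it combines the ByteBuf instances without copying their contents, and exposes a single readerIndex and writerIndex over the whole sequence;
in the Netty version examined here, CompositeByteBuf keeps the aggregated ByteBuf instances in a ComponentList, which extends ArrayList; the allocator's default limit is 16 components, beyond which Netty consolidates them;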

// NettyServerCnxn.java
private void receiveMessage(ByteBuf message) {
	checkIsInEventLoop("receiveMessage");
    try {
		while(message.isReadable() && !throttled.get()) {
			if (bb != null) {
				if (bb.remaining() > message.readableBytes()) {
					int newLimit = bb.position() + message.readableBytes();
					bb.limit(newLimit);
				}
				// Read as many bytes from the packet as bb still has room for
				message.readBytes(bb);
				bb.limit(bb.capacity());
            
            	if (bb.remaining() == 0) {
            		packetReceived();
                	bb.flip();

                	ZooKeeperServer zks = this.zkServer;
					if (zks == null || !zks.isRunning()) {
						throw new IOException("ZK down");
					}
 					if (initialized) {
                        // Note from the source: if zks.processPacket() took a ByteBuffer[], zero-copy queueing could be implemented
                        // Call processPacket to handle the actual payload of the packet
						zks.processPacket(this, bb);

						if (zks.shouldThrottle(outstandingCount.incrementAndGet())) {
							disableRecvNoWait();
						}
					} else {
						zks.processConnectRequest(this, bb);
                        initialized = true;
                    }
                    bb = null;
                }
            } else {
				if (message.readableBytes() < bbLen.remaining()) {
					bbLen.limit(bbLen.position() + message.readableBytes());
				}
				// bbLen is a 4-byte ByteBuffer that receives the first 4 bytes of the packet, which record the payload length
                message.readBytes(bbLen);
                bbLen.limit(bbLen.capacity());
                if (bbLen.remaining() == 0) {
                    bbLen.flip();
                    // Read the int value encoded in the first 4 bytes
                    int len = bbLen.getInt();
                    bbLen.clear();
                    if (!initialized) {
                    	if (checkFourLetterWord(channel, message, len)) {
                        	return;
                        }
                    }
                    if (len < 0 || len > BinaryInputArchive.maxBuffer) {
                    	throw new IOException("Len error " + len);
                    }
                    // Allocate bb as a ByteBuffer whose size is the length read from the first 4 bytes
                    bb = ByteBuffer.allocate(len);
                }
            }
        }
    }
    // catch block omitted...
}

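The logic in receiveMessage above is somewhat involved, but the flow mirrors the client side. The main steps are:

When a packet first enters the method, bb is null, so execution goes straight to the second branch (the else block);
the second branch reads 4 bytes from the packet into bbLen (bbLen = ByteBuffer.allocate(4)) and converts them into an int value len;
a ByteBuffer of length len is allocated and assigned to bb, which ends this iteration of the loop;
on the next iteration bb is a ByteBuffer of length len (the length of the payload in the packet), so execution enters the first branch;
the first branch reads up to len bytes from the incoming packet into bb; if the packet holds fewer bytes than that, bb keeps its partial contents and is filled by later calls;
once bb is full, the payload is passed to processPacket (or to processConnectRequest for the first packet on a connection);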
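The steps above amount to a length-prefixed frame decoder that keeps state between calls. The following is a minimal sketch of that idea using only JDK ByteBuffer; FrameDecoderSketch, feed, transfer and the 1 MiB cap are hypothetical names and values chosen for illustration, and throttling, four-letter-word checks and connection setup are left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FrameDecoderSketch {
    // Hypothetical cap; ZooKeeper checks against BinaryInputArchive.maxBuffer instead.
    private static final int MAX_FRAME_BYTES = 1024 * 1024;

    // Plays the role of bbLen: collects the 4-byte length prefix.
    private final ByteBuffer lengthBuffer = ByteBuffer.allocate(4);
    // Plays the role of bb: null while the length prefix is still being read.
    private ByteBuffer bodyBuffer;

    // Consumes one chunk of the byte stream and returns every frame it completes.
    List<byte[]> feed(ByteBuffer chunk) {
        List<byte[]> frames = new ArrayList<>();
        while (true) {
            if (bodyBuffer == null) {
                transfer(chunk, lengthBuffer);
                if (lengthBuffer.hasRemaining()) {
                    return frames; // prefix incomplete, wait for more data
                }
                lengthBuffer.flip();
                int length = lengthBuffer.getInt();
                lengthBuffer.clear();
                if (length < 0 || length > MAX_FRAME_BYTES) {
                    throw new IllegalArgumentException("Len error " + length);
                }
                bodyBuffer = ByteBuffer.allocate(length);
            }
            transfer(chunk, bodyBuffer);
            if (bodyBuffer.hasRemaining()) {
                return frames; // body incomplete, keep the partial contents
            }
            frames.add(bodyBuffer.array());
            bodyBuffer = null; // next bytes start a new length prefix
        }
    }

    // Moves as many bytes as both buffers allow from src to dst.
    private static void transfer(ByteBuffer src, ByteBuffer dst) {
        int count = Math.min(src.remaining(), dst.remaining());
        ByteBuffer window = src.duplicate();
        window.limit(window.position() + count);
        dst.put(window);
        src.position(src.position() + count);
    }

    public static void main(String[] args) {
        ByteBuffer stream = ByteBuffer.allocate(9);
        stream.putInt(5);
        stream.put("hello".getBytes(StandardCharsets.US_ASCII));
        byte[] bytes = stream.array();
        FrameDecoderSketch decoder = new FrameDecoderSketch();
        // The frame arrives split across two chunks.
        System.out.println(decoder.feed(ByteBuffer.wrap(bytes, 0, 6)).size());
        System.out.println(decoder.feed(ByteBuffer.wrap(bytes, 6, 3)).size());
    }
}
```

Keeping lengthBuffer and bodyBuffer as fields is what lets a frame that is split across several reads be completed later, which is the same reason bbLen and bb are fields of NettyServerCnxn.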

// ZooKeeperServer.java
public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOException {
    // We have the request, now process and setup for next
    InputStream bais = new ByteBufferInputStream(incomingBuffer);
    BinaryInputArchive bia = BinaryInputArchive.getArchive(bais);
    // Deserialize the request header into the temporary variable h
    RequestHeader h = new RequestHeader();
    h.deserialize(bia, "header");
    // Create a new byte buffer that starts at the current position of the original buffer
    incomingBuffer = incomingBuffer.slice();
    if (h.getType() == OpCode.auth) {
        AuthPacket authPacket = new AuthPacket();
        ByteBufferInputStream.byteBuffer2Record(incomingBuffer, authPacket);
        // Authentication handling omitted...
    } else {
        // Other branches omitted...
        // Assemble the payload of the packet into a Request
        Request si = new Request(cnxn, cnxn.getSessionId(), h.getXid(), h.getType(), incomingBuffer, cnxn.getAuthInfo());
        si.setOwner(ServerCnxn.me);
        // Always treat a request from a client as a possible local request
        setLocalSessionFlag(si);
        // Hand the assembled Request to the upper layer through submitRequest
        submitRequest(si);
    }
    cnxn.incrOutstandingRequests(h);
}

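About ByteBuffer.slice:

It creates a new byte buffer whose content is a shared subsequence of the original buffer's content, starting at the original buffer's current position;
changes to the original buffer's content are visible in the new buffer and vice versa, but the two buffers' position, limit and mark values are independent;
the new buffer's position is zero, its capacity and limit are the number of bytes remaining in the original buffer (remaining = limit - position), and its mark is undefined. The new buffer is direct if and only if the original buffer is direct, and read-only if and only if the original buffer is read-only;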
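These properties of slice can be checked directly with the JDK. The sketch below skips a header and slices off the payload, similar in spirit to what processPacket does after deserializing the RequestHeader; SliceSketch and payloadAfterHeader are hypothetical names, and the fixed headerLength parameter is an assumption of the sketch (in ZooKeeper the position advances as a side effect of deserialization).

```java
import java.nio.ByteBuffer;

final class SliceSketch {
    // Advances past headerLength bytes and returns a view of the remaining bytes.
    static ByteBuffer payloadAfterHeader(ByteBuffer packet, int headerLength) {
        packet.position(packet.position() + headerLength);
        // The slice starts at the packet's current position and shares its content.
        return packet.slice();
    }

    public static void main(String[] args) {
        ByteBuffer packet = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6});
        ByteBuffer payload = payloadAfterHeader(packet, 2);
        System.out.println("payload position = " + payload.position());
        System.out.println("payload limit    = " + payload.limit());
        // Writing through the slice changes the shared content...
        payload.put(0, (byte) 42);
        System.out.println("packet byte 2    = " + packet.get(2));
        // ...but leaves the original buffer's position untouched.
        System.out.println("packet position  = " + packet.position());
    }
}
```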

// ZooKeeperServer.java
public void submitRequest(Request si) {
    if (firstProcessor == null) {
        synchronized (this) {
            try {
                // Every request is handed to the request processors, so wait here
                // until the processor chain is set up; once it is, the state becomes RUNNING
                while (state == State.INITIAL) {
                    wait(1000);
                }
            }
            // catch block omitted...
            if (firstProcessor == null || state != State.RUNNING) {
                throw new RuntimeException("Not started");
            }
        }
    }
    try {
        // Touch the session behind this connection to keep it alive
        touch(si.cnxn);
        // Check whether the request type is valid
        boolean validpacket = Request.isValid(si.type);
        if (validpacket) {
            // A valid request goes to the first processor of the Request Processor Pipeline
            firstProcessor.processRequest(si);
            if (si.cnxn != null) {
                incInProcess();
            }
        } else {
            // The request has an unknown type
            new UnimplementedRequestProcessor().processRequest(si);
        }
    }
    // catch blocks omitted...
}

// RequestProcessor.java
public interface RequestProcessor {
    @SuppressWarnings("serial")
    public static class RequestProcessorException extends Exception {
        public RequestProcessorException(String msg, Throwable t) {
            super(msg, t);
        }
    }

    void processRequest(Request request) throws RequestProcessorException;

    void shutdown();
}
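How implementations of such an interface link into a pipeline can be illustrated with a small JDK-only sketch. StageProcessor, LoggingStage, FinalStage and ProcessorChainSketch are hypothetical names invented here; the real chain uses ZooKeeper's own processors and Request type, and some of its stages hand requests over between threads, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified counterpart of RequestProcessor.
interface StageProcessor {
    void process(String request);
}

// An intermediate stage: does its own work, then hands the request to the next stage.
final class LoggingStage implements StageProcessor {
    private final StageProcessor next;
    private final List<String> events;

    LoggingStage(StageProcessor next, List<String> events) {
        this.next = next;
        this.events = events;
    }

    public void process(String request) {
        events.add("seen " + request);
        next.process(request);
    }
}

// The last stage: has no successor and produces the response.
final class FinalStage implements StageProcessor {
    private final List<String> events;

    FinalStage(List<String> events) {
        this.events = events;
    }

    public void process(String request) {
        events.add("response to " + request);
    }
}

final class ProcessorChainSketch {
    // Builds a two-stage chain and pushes every request through its first stage.
    static List<String> run(List<String> requests) {
        List<String> events = new ArrayList<>();
        StageProcessor first = new LoggingStage(new FinalStage(events), events);
        for (String request : requests) {
            first.process(request);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("create", "getData")));
    }
}
```

The caller only ever holds a reference to the first stage, which is why submitRequest needs nothing more than firstProcessor.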

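3.4 Sending responses

Zookeeper processes requests through a Request Processor Pipeline, so assembling and sending the response falls to the last processor in the chain, FinalRequestProcessor. The analysis below therefore starts from the processRequest method of FinalRequestProcessor, whose argument is the Request as processed by the previous Request Processor. The focus is on the overall flow of sending a response rather than on how responses are generated for specific operations.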

// FinalRequestProcessor.java
public void processRequest(Request request) {
    // The focus is on sending the response, so the many per-type branches that
    // process the request and build the reply header hdr and reply body rsp are omitted...
    try {
        // With the request processed, send the generated header and body through sendResponse
        cnxn.sendResponse(hdr, rsp, "response");
        // A closeSession request triggers the close logic
        if (request.type == OpCode.closeSession) {
            cnxn.sendCloseSession();
        }
    }
    // catch blocks omitted...
}

// ServerCnxn.java
public void sendResponse(ReplyHeader h, Record r, String tag) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    BinaryOutputArchive bos = BinaryOutputArchive.getArchive(baos);
    try {
        // Reserve the first 4 bytes for the length of the payload
        baos.write(fourBytes);
        // Write the reply header
        bos.writeRecord(h, "header");
        if (r != null) {
            // Write the reply body
            bos.writeRecord(r, tag);
        }
        // Close the output stream
        baos.close();
    }
    // catch block omitted...

    // Convert the output stream into a byte array
    byte b[] = baos.toByteArray();
    // Record the payload size (total length minus the 4-byte prefix) in the server statistics
    serverStats().updateClientResponseSize(b.length - 4);
    // Wrap the byte array in a ByteBuffer
    ByteBuffer bb = ByteBuffer.wrap(b);
    // Write the payload length as an int into the first 4 bytes, then rewind to the start
    bb.putInt(b.length - 4).rewind();
    // Send the assembled ByteBuffer
    sendBuffer(bb);
}
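The "reserve four bytes, serialize, then backfill the length" technique used by sendResponse can be reproduced with the JDK alone. In this sketch the header and body are plain byte arrays standing in for the serialized ReplyHeader and Record; ResponseFramingSketch and frame are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

final class ResponseFramingSketch {
    // Placeholder for the length prefix, like fourBytes in ServerCnxn.
    private static final byte[] FOUR_BYTES = new byte[4];

    // Builds [4-byte length][header][body], where length excludes the prefix itself.
    static ByteBuffer frame(byte[] header, byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Reserve room for the length before the payload size is known.
        out.write(FOUR_BYTES, 0, FOUR_BYTES.length);
        out.write(header, 0, header.length);
        if (body != null) {
            out.write(body, 0, body.length);
        }
        byte[] bytes = out.toByteArray();
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        // Backfill the prefix with the payload length...
        buffer.putInt(bytes.length - 4);
        // ...and move back to position 0 so the whole frame gets sent.
        buffer.rewind();
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer framed = frame(new byte[] {1, 2}, new byte[] {3, 4, 5});
        System.out.println("frame bytes = " + framed.remaining());
        System.out.println("payload len = " + framed.getInt());
    }
}
```

Writing the prefix last avoids serializing the payload twice just to learn its size, and it produces exactly the layout that the receiving side's 4-byte length read expects.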

// NettyServerCnxn.java
public void sendBuffer(ByteBuffer sendBuffer) {
    // If the ByteBuffer is the closeConn marker, call close() and enter the close logic
    if (sendBuffer == ServerCnxnFactory.closeConn) {
        close();
        return;
    }
    // Otherwise write the ByteBuffer's data to the Channel and flush it out
    channel.writeAndFlush(Unpooled.wrappedBuffer(sendBuffer)).addListener(onSendBufferDoneListener);
}

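4. Source Summary

4.1 Receiving requests

The server receives a request in the channelRead method of the Netty channel handler and hands it to the NettyServerCnxn bound to the current Channel by calling that object's processMessage method;
in processMessage, if the connection is throttled the request buffer is appended to queuedBuffer and waits there; otherwise receiveMessage is called to process the request further, and any bytes it leaves unread are queued;
the main job of receiveMessage is to read the payload from the incoming ByteBuf (the first 4 bytes of a packet record the payload length), copy it into a ByteBuffer, and pass that to ZooKeeperServer's processPacket;
processPacket parses the request header and request body out of the ByteBuffer, wraps them in a Request, and passes it up to submitRequest;
submitRequest waits until the first Request Processor has been initialized, validates the request, and then passes a valid request to the first Request Processor in the Request Processor Pipeline;

4.2 Sending responses

Sending the response is the job of the last Request Processor in the pipeline: the processRequest method of FinalRequestProcessor handles the request according to its type, generates the reply header and reply body, and passes them to ServerCnxn's sendResponse method;
sendResponse assembles the header and body into a complete response, converts it into a ByteBuffer, and passes it to NettyServerCnxn's sendBuffer method;
sendBuffer checks whether the ByteBuffer is the closeConn marker; if so it enters the close logic, otherwise it sends the response through the Channel's writeAndFlush method;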

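5. Content Summary

5.1 ByteBuffer.slice()

It creates a new byte buffer whose content is a shared subsequence of the original buffer's content, starting at the original buffer's current position;
changes to the original buffer's content are visible in the new buffer and vice versa, but the two buffers' position, limit and mark values are independent;
the new buffer's position is zero, its capacity and limit are the number of bytes remaining in the original buffer (remaining = limit - position), and its mark is undefined. The new buffer is direct if and only if the original buffer is direct, and read-only if and only if the original buffer is read-only;

5.2 Netty's CompositeByteBuf

CompositeByteBuf is used for aggregation: when several buffers need to be merged, it combines the ByteBuf instances without copying their contents, and exposes a single readerIndex and writerIndex over the whole sequence;
in the Netty version examined here, CompositeByteBuf keeps the aggregated ByteBuf instances in a ComponentList, which extends ArrayList; the allocator's default limit is 16 components, beyond which Netty consolidates them;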

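5.3 Zero-copy queueing

Zero-copy queueing comes up in many places in the Zookeeper networking code, and in some places a comment directly suggests it as an optimization. Because it touches on quite a few techniques, I plan to cover it as a whole in a separate article when analyzing the Netty source; it is recorded here only as a note.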

// NettyServerCnxn.receiveMessage()

// TODO: if zks.processPacket() is changed to take a ByteBuffer[],
// we could implement zero-copy queueing.
zks.processPacket(this, bb);

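6. Reflections

6.1 Why does Zookeeper choose ByteBuffer rather than ByteBuf

Reading the source shows that Zookeeper mixes the JDK NIO ByteBuffer and Netty's ByteBuf, and while reading I kept wondering why. On closer inspection, the Netty part of the server still uses ByteBuf at the lower level for efficiency, but once a request reaches the processPacket method of the upper-level ZooKeeperServer it has been converted into a ByteBuffer (the conversion happens in receiveMessage). My guess is that the conversion exists mainly to stay compatible with the original JDK NIO implementation.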