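Microservices Series (2)(3): ZooKeeper Source Code Analysis, Part 2

The previous article traced the source code to analyze the ZooKeeper server's initialization, its communication mechanism, and leader election. This article goes back into the source to examine ZooKeeper's storage mechanism.

From the call chain traced in the previous article, we know that the core storage class in ZooKeeper is org.apache.zookeeper.server.ZKDatabase.

Let's analyze its implementation: what data structures does it keep in memory, and what strategy does it use to write them to files?

First, recall where it is initialized: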

...
quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
...
quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
...

As shown, it is initialized by calling the org.apache.zookeeper.server.ZKDatabase#ZKDatabase constructor, which is passed a FileTxnSnapLog object:

public ZKDatabase(FileTxnSnapLog snapLog) {
    dataTree = createDataTree();
    sessionsWithTimeouts = new ConcurrentHashMap<Long, Integer>();
    this.snapLog = snapLog;

    try {
        snapshotSizeFactor = Double.parseDouble(
            System.getProperty(SNAPSHOT_SIZE_FACTOR,
                    Double.toString(DEFAULT_SNAPSHOT_SIZE_FACTOR)));
        if (snapshotSizeFactor > 1) {
            snapshotSizeFactor = DEFAULT_SNAPSHOT_SIZE_FACTOR;
            LOG.warn("The configured {} is invalid, going to use " +
                    "the default {}", SNAPSHOT_SIZE_FACTOR,
                    DEFAULT_SNAPSHOT_SIZE_FACTOR);
        }
    } catch (NumberFormatException e) {
        LOG.error("Error parsing {}, using default value {}",
                SNAPSHOT_SIZE_FACTOR, DEFAULT_SNAPSHOT_SIZE_FACTOR);
        snapshotSizeFactor = DEFAULT_SNAPSHOT_SIZE_FACTOR;
    }
    LOG.info("{} = {}", SNAPSHOT_SIZE_FACTOR, snapshotSizeFactor);
}

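Here we find a system property, zookeeper.snapshotSizeFactor, which defaults to 0.33; a configured value greater than 1 is rejected and replaced by the default. I'm flagging it for later, since it looks like an important parameter.

Next, step into createDataTree():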

public DataTree createDataTree() {
    return new DataTree();
}

public DataTree() {
    /* Rather than fight it, let root have an alias */
    nodes.put("", root);
    nodes.put(rootZookeeper, root);

    /** add the proc node and quota node */
    root.addChild(procChildZookeeper);
    nodes.put(procZookeeper, procDataNode);

    procDataNode.addChild(quotaChildZookeeper);
    nodes.put(quotaZookeeper, quotaDataNode);

    addConfigNode();

    nodeDataSize.set(approximateDataSize());
    try {
        dataWatches = WatchManagerFactory.createWatchManager();
        childWatches = WatchManagerFactory.createWatchManager();
    } catch (Exception e) {
        LOG.error("Unexpected exception when creating WatchManager, " +
                "exiting abnormally", e);
        System.exit(ExitCode.UNEXPECTED_ERROR.getValue());
    }
}

This introduces another class, DataTree. Start with its official Javadoc:

/**
 * This class maintains the tree data structure. It doesn't have any networking
 * or client connection code in it so that it can be tested in a stand alone
 * way.
 * <p>
 * The tree maintains two parallel data structures: a hashtable that maps from
 * full paths to DataNodes and a tree of DataNodes. All accesses to a path is
 * through the hashtable. The tree is traversed only when serializing to disk.
 * 
 */

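In short, DataTree maintains a tree internally, plus a hash table from path to node (in Java this is simply a Map whose key is the full path and whose value is the node).

The constructor also creates several default nodes: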

/zookeeper
/zookeeper/quota
/zookeeper/config

It then records the approximate current data size of the stored nodes

and creates the two watch managers:

dataWatches = WatchManagerFactory.createWatchManager();
childWatches = WatchManagerFactory.createWatchManager();

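These clearly relate to ZooKeeper's watcher feature, whose internals will be covered later.

At this point it becomes clear that a ZooKeeper node, an org.apache.zookeeper.server.DataNode object, records its children by storing their path names as strings, and that is how the parent-child relationship is expressed. It is also why two nodes with the same name cannot exist under the same path.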
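To make the two parallel structures concrete, here is a minimal sketch. MiniDataTree and its Node class are hypothetical stand-ins written for this article, not ZooKeeper's real DataTree and DataNode. The sketch assumes that children are kept as a set of names and that every lookup goes through the path map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for DataTree/DataNode (illustration only).
class MiniDataTree {
    static final class Node {
        byte[] data;
        // Children are recorded by name, not by object reference.
        final Set<String> children = new HashSet<>();

        Node(byte[] data) {
            this.data = data;
        }
    }

    // Full path -> node. Every lookup goes through this map.
    private final Map<String, Node> nodes = new HashMap<>();

    MiniDataTree() {
        nodes.put("/", new Node(new byte[0]));
    }

    void create(String path, byte[] data) {
        if (!path.startsWith("/") || path.length() < 2 || path.endsWith("/")) {
            throw new IllegalArgumentException("invalid path: " + path);
        }
        int slash = path.lastIndexOf('/');
        String parentPath = (slash == 0) ? "/" : path.substring(0, slash);
        String name = path.substring(slash + 1);

        Node parent = nodes.get(parentPath);
        if (parent == null) {
            throw new IllegalArgumentException("no parent: " + parentPath);
        }
        // A set of names cannot hold the same name twice,
        // so a duplicate sibling is rejected here.
        if (!parent.children.add(name)) {
            throw new IllegalStateException("node exists: " + path);
        }
        nodes.put(path, new Node(data));
    }

    Node get(String path) {
        return nodes.get(path);
    }

    Set<String> childrenOf(String path) {
        return nodes.get(path).children;
    }
}
```

With this sketch, creating "/zookeeper" and then "/zookeeper/quota" leaves "zookeeper" in the root's child set and "quota" in the child set of "/zookeeper", while a second create of "/zookeeper" throws.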

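The API has one core method: org.apache.zookeeper.server.DataTree#processTxn(TxnHeader, org.apache.jute.Record, boolean)

This method applies a transaction to the in-memory store.

The caller wraps a remote or locally generated request in a TxnHeader that carries the request metadata, uses a Record as the payload, and invokes the method.

The method reads clientId, cxid, zxid, type, and other fields from the TxnHeader, then branches on type, for example:

create: creates a node (does not carry stat information)

create2: creates a node (carries stat information)

setData: updates a node's data

...and so on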
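The dispatch-on-type idea can be sketched as follows. MiniTxnProcessor, its Header class, and the TxnType enum are hypothetical names invented for illustration; the real processTxn works on TxnHeader and jute Record objects and handles many more operation types.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of type-based transaction dispatch (not ZooKeeper's API).
class MiniTxnProcessor {
    enum TxnType { CREATE, SET_DATA, DELETE }

    // Simplified header: only the fields this illustration needs.
    static final class Header {
        final long zxid;
        final TxnType type;

        Header(long zxid, TxnType type) {
            this.zxid = zxid;
            this.type = type;
        }
    }

    private final Map<String, String> nodes = new HashMap<>();
    private long lastProcessedZxid = 0;

    // Applies one transaction to the in-memory state.
    // Returns false when the operation does not apply (for example, a duplicate create).
    boolean processTxn(Header hdr, String path, String data) {
        boolean ok;
        switch (hdr.type) {
            case CREATE:
                ok = nodes.putIfAbsent(path, data) == null;
                break;
            case SET_DATA:
                ok = nodes.replace(path, data) != null;
                break;
            case DELETE:
                ok = nodes.remove(path) != null;
                break;
            default:
                ok = false;
        }
        // Remember the highest transaction id seen so far.
        if (hdr.zxid > lastProcessedZxid) {
            lastProcessedZxid = hdr.zxid;
        }
        return ok;
    }

    String get(String path) {
        return nodes.get(path);
    }

    long lastProcessedZxid() {
        return lastProcessedZxid;
    }
}
```

Everything here happens in memory; nothing in this sketch touches disk.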

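This raises a question: ZKDatabase is the in-memory "database" that other components interact with, so why does it not offer a persistence operation?

It does. I just missed it at first, because the method name makes it hard to spot.

org.apache.zookeeper.server.ZKDatabase#append

Why do I believe this is the method that persists data? For these reasons:

The previous article mentioned the processPacket() method, which the Learner subclasses Follower and Observer each implement differently. Taking Follower as the example, its handling of PROPOSAL messages contains fzk.logRequest(hdr, txn);, which eventually calls org.apache.zookeeper.server.SyncRequestProcessor#processRequest, and the trail ends at zks.getZKDatabase().append(si). That confirms the guess (ZooKeeper's request processors are covered in detail in the next article).
The official comment on the method reads: append to the underlying transaction log.
In the underlying implementation it ends up calling org.apache.zookeeper.server.persistence.FileTxnLog#append, and that method contains the file stream operations.

Setting aside for now how ZooKeeper's request processors forward requests, go straight to the logic that executes the sync request and see what ZooKeeper does besides storing node data.

A single thread runs the following loop: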

while (true) {
    Request si = null;
    if (toFlush.isEmpty()) {
        si = queuedRequests.take();
    } else {
        si = queuedRequests.poll();
        if (si == null) {
            flush(toFlush);
            continue;
        }
    }
    if (si == requestOfDeath) {
        break;
    }
    if (si != null) {
        // track the number of records written to the log
        if (zks.getZKDatabase().append(si)) {
            logCount++;
            if (logCount > (snapCount / 2 + randRoll)) {
                randRoll = r.nextInt(snapCount/2);
                // roll the log
                zks.getZKDatabase().rollLog();
                // take a snapshot
                if (snapInProcess != null && snapInProcess.isAlive()) {
                    LOG.warn("Too busy to snap, skipping");
                } else {
                    snapInProcess = new ZooKeeperThread("Snapshot Thread") {
                            public void run() {
                                try {
                                    zks.takeSnapshot();
                                } catch(Exception e) {
                                    LOG.warn("Unexpected exception", e);
                                }
                            }
                        };
                    snapInProcess.start();
                }
                logCount = 0;
            }
        } else if (toFlush.isEmpty()) {
            // optimization for read heavy workloads
            // iff this is a read, and there are no pending
            // flushes (writes), then just pass this to the next
            // processor
            if (nextProcessor != null) {
                nextProcessor.processRequest(si);
                if (nextProcessor instanceof Flushable) {
                    ((Flushable)nextProcessor).flush();
                }
            }
            continue;
        }
        toFlush.add(si);
        if (toFlush.size() > 1000) {
            flush(toFlush);
        }
    }
}

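To understand this logic, you need to understand these objects:

toFlush: an in-memory list of requests that have not yet been flushed, used as a buffer. A write request is appended to the transaction log and then added to this list; a read request is also added when the list is not empty, so it is handled after the writes ahead of it. When a flush condition is met, flush processes the whole list as one batch. The flush conditions are: 1. the request queue is empty and the list is not empty (nothing more is arriving, so the pending requests are committed); 2. the list has grown beyond 1000 entries (requests keep arriving, so they are committed in batches).

randRoll: a random value that determines when a snapshot is taken.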
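The batching behavior of toFlush can be sketched in isolation. BatchFlushSketch is a hypothetical single-threaded model written for this article: it uses a tiny batch limit instead of 1000, strings instead of Request objects, and a list of flushed batches in place of a real commit to disk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single-threaded model of the toFlush batching idea.
class BatchFlushSketch {
    static final int MAX_BATCH = 3; // the loop above uses 1000

    final Queue<String> queuedRequests = new ArrayDeque<>();
    final List<String> toFlush = new ArrayList<>();
    // Each inner list stands in for one commit of a whole batch.
    final List<List<String>> flushedBatches = new ArrayList<>();

    void flush() {
        if (toFlush.isEmpty()) {
            return;
        }
        flushedBatches.add(new ArrayList<>(toFlush));
        toFlush.clear();
    }

    // Processes everything currently queued without blocking.
    void drain() {
        while (true) {
            String si = queuedRequests.poll();
            if (si == null) {
                // Condition 1: the queue ran dry, so commit whatever is pending.
                flush();
                return;
            }
            toFlush.add(si);
            // Condition 2: the batch grew past the limit, so commit it now.
            if (toFlush.size() > MAX_BATCH) {
                flush();
            }
        }
    }
}
```

Queueing five requests and calling drain() yields two batches in this model: one of four requests (the limit was exceeded) and one of the single remaining request (the queue ran dry).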

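This logic therefore does the following:

A write is appended to the transaction log; a read, when no writes are pending, is passed straight to the next processor.
On every write, logCount is incremented and compared with a random threshold (between 0.5*snapCount and snapCount). Once the threshold is exceeded, a snapshot is taken. In other words, a snapshot follows every so many transaction log writes, and it runs asynchronously on a separate thread; if the previous snapshot thread is still alive, this round is skipped. (Note that rollLog() is called before the snapshot logic. Think of it as log segmentation: alongside the periodic snapshot a new transaction log file is started, named log.${zxid}.)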
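The randomized threshold can be isolated into a small sketch. SnapshotTrigger is a hypothetical class, not part of ZooKeeper; it assumes snapCount is at least 2 and only reproduces the counting logic, leaving out the log roll and the snapshot thread.

```java
import java.util.Random;

// Hypothetical sketch of the randomized snapshot threshold.
class SnapshotTrigger {
    private final int snapCount;
    private final Random random;
    private int randRoll;
    private int logCount = 0;

    SnapshotTrigger(int snapCount, Random random) {
        this.snapCount = snapCount;
        this.random = random;
        this.randRoll = random.nextInt(snapCount / 2);
    }

    // Call once per appended transaction.
    // Returns true when the log should roll and a snapshot should start.
    boolean onAppend() {
        logCount++;
        if (logCount > snapCount / 2 + randRoll) {
            // Pick a fresh random offset for the next interval.
            randRoll = random.nextInt(snapCount / 2);
            logCount = 0;
            return true;
        }
        return false;
    }
}
```

With snapCount = 100, randRoll falls in 0..49, so each trigger fires after 51 to 100 appends. Because every server draws its own random offset, the servers in an ensemble are unlikely to all snapshot at the same moment.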

Next, look at what the snapshot actually saves.

The trail ends at org.apache.zookeeper.server.persistence.FileTxnSnapLog#save:

public void save(DataTree dataTree,
                 ConcurrentHashMap<Long, Integer> sessionsWithTimeouts,
                 boolean syncSnap)
    throws IOException {
    long lastZxid = dataTree.lastProcessedZxid;
    File snapshotFile = new File(snapDir, Util.makeSnapshotName(lastZxid));
    LOG.info("Snapshotting: 0x{} to {}", Long.toHexString(lastZxid),
            snapshotFile);
    try {
        snapLog.serialize(dataTree, sessionsWithTimeouts, snapshotFile, syncSnap);
    } catch (IOException e) {
        if (snapshotFile.length() == 0) {
            /* This may be caused by a full disk. In such a case, the server
             * will get stuck in a loop where it tries to write a snapshot
             * out to disk, and ends up creating an empty file instead.
             * Doing so will eventually result in valid snapshots being
             * removed during cleanup. */
            if (snapshotFile.delete()) {
                LOG.info("Deleted empty snapshot file: " +
                         snapshotFile.getAbsolutePath());
            } else {
                LOG.warn("Could not delete empty snapshot file: " +
                         snapshotFile.getAbsolutePath());
            }
        } else {
            /* Something else went wrong when writing the snapshot out to
             * disk. If this snapshot file is invalid, when restarting,
             * ZooKeeper will skip it, and find the last known good snapshot
             * instead. */
        }
        throw e;
    }
}

As shown, it serializes the in-memory DataTree together with the session information and writes the result to a snapshot file.
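As a rough illustration of "serialize the tree and the sessions to a file", the sketch below writes two maps through a DataOutputStream layered over a BufferedOutputStream and a FileOutputStream, then reads them back. SnapshotSketch and its binary layout are invented for this example; ZooKeeper's real snapshot format (jute records, headers, checksums) is different.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;

// Hypothetical snapshot writer/reader with a made-up binary layout.
class SnapshotSketch {
    // Writes sessions (id -> timeout) followed by nodes (path -> data).
    static void save(Map<String, byte[]> nodes, Map<Long, Integer> sessions, File file)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(sessions.size());
            for (Map.Entry<Long, Integer> e : sessions.entrySet()) {
                out.writeLong(e.getKey());
                out.writeInt(e.getValue());
            }
            out.writeInt(nodes.size());
            for (Map.Entry<String, byte[]> e : nodes.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().length);
                out.write(e.getValue());
            }
        }
    }

    // Reads a file produced by save() into the supplied (empty) maps.
    static void load(File file, Map<String, byte[]> nodes, Map<Long, Integer> sessions)
            throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int sessionCount = in.readInt();
            for (int i = 0; i < sessionCount; i++) {
                long id = in.readLong();
                int timeout = in.readInt();
                sessions.put(id, timeout);
            }
            int nodeCount = in.readInt();
            for (int i = 0; i < nodeCount; i++) {
                String path = in.readUTF();
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                nodes.put(path, data);
            }
        }
    }
}
```

Restoring state on restart would then amount to loading the most recent valid snapshot and replaying the transaction log entries written after it; the sketch covers only the first half of that.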

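In addition, the source tracing in the previous article showed that when the ZooKeeper server starts, it also starts a log-cleanup thread, which keeps logs from piling up without bound and slowing down processing.

From the source tracing and analysis above, ZooKeeper's storage mechanism can be summarized as follows:

The create, delete, update, and read operations that a ZooKeeper client performs on nodes map to the DataTree data structure inside the server's ZKDatabase.
ZooKeeper keeps node data in memory.
When ZooKeeper executes a transactional request it writes to a log file; the transaction log is segmented at irregular intervals based on the number of transactions, and each file is named log.${zxid}.
ZooKeeper takes snapshots at irregular intervals by serializing the in-memory DataTree and the session information and writing them to a snapshot file.
ZooKeeper writes files with the JDK I/O stream API, layering DataOutputStream and BufferedOutputStream over a FileOutputStream.

That wraps up, for now, this source-level look at ZooKeeper's storage mechanism.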
