Implementation of Read/Insert/Delete Operations in Cassandra

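Overview of the Cassandra data model

For the basic concepts of the data model, first see http://wiki.apache.org/cassandra/DataModel . Note that the descriptions of Column Families and Rows in that article are inaccurate, and it omits the important concept of the column family store (CFS). Column families are explained further below.

The column-family data model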

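To keep things simple, we ignore SuperColumn for now.

In Cassandra, a Keyspace can be viewed as a two-level index structure:

The first-level index uses the key (row-key, cf-name): a row-key together with a column-family-name locates one column family. A column family is a set of columns.
The second-level index uses the key column-name: a column-name locates one column within a column family.

The column is the most basic unit of data in this model. It is a triple (name, value, timestamp). Within a column family, all columns are sorted by column-name, so a column can be found quickly by its column-name.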
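The two-level index can be pictured as nested maps. The following is a minimal sketch, assuming Java 16 or later for records; the class and method names (KeyspaceSketch, CfKey, put, get) are hypothetical and are not Cassandra's own classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical model of the two-level index; not Cassandra's real classes. */
class KeyspaceSketch {
    /** A column is the triple (name, value, timestamp). */
    record Column(String name, String value, long timestamp) {}

    /** First-level key: (row-key, cf-name). */
    record CfKey(String rowKey, String cfName) {}

    // Level 1: (row-key, cf-name) -> column family.
    // Level 2: column-name -> column, kept sorted by column name.
    private final Map<CfKey, SortedMap<String, Column>> index = new HashMap<>();

    void put(String rowKey, String cfName, Column column) {
        index.computeIfAbsent(new CfKey(rowKey, cfName), k -> new TreeMap<>())
             .put(column.name(), column);
    }

    /** Returns the column, or null if the column family or column is absent. */
    Column get(String rowKey, String cfName, String columnName) {
        SortedMap<String, Column> cf = index.get(new CfKey(rowKey, cfName));
        return cf == null ? null : cf.get(columnName);
    }
}
```

A lookup such as get("user42", "Profile", "age") first resolves the column family and then the column inside it, mirroring the two index levels.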

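Locating data

Cassandra locates a column family by the pair (row-key, cf-name), as follows:

First, a hash value is computed from the row-key, and the hash value determines which nodes in the network hold the column family.
On each node that holds the column family, column families with the same cf-name are stored together in what is called a column family store (CFS). Within a CFS, column families are sorted by row-key. So on that node, the cf-name is used first to find the CFS, and then the row-key is used to look up the column family inside that CFS.

Note that the row-key is used twice in this process: once to choose the nodes and once to search within the CFS. In summary, from the local point of view of each node, a CFS corresponds to a database table, a column family to a row in that table, and a column to a field in that row.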

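Local storage of data

Before reading the code analysis of the local storage implementation, it is advisable to read Google's Bigtable paper, "Bigtable: A Distributed Storage System for Structured Data". To simplify the discussion, super columns are not considered below.

Data structures

The data structures related to the data model fall into two groups: in-memory structures and on-disk structures.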

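Each Cassandra node maintains a hierarchy of data structures in memory.

At the top is the Table class; one Table corresponds to one keyspace. It is a collection of ColumnFamilyStore objects, and the CFS for a column family can be looked up by the column family's name.
The ColumnFamilyStore class has two main fields:
- A Memtable object. It holds the most recent modifications to the column families in this CFS that have not yet been flushed to disk.
- An SSTableTracker object. This is a collection of SSTableReader objects, each of which maintains the necessary information about the CF data on disk.

A Memtable object is a collection of modifications, of which there are two kinds: insertions and deletions. In Cassandra, a deletion does not remove data directly; it only adds a special modification record. Likewise, modifying a column does not change the existing data in place; it inserts a new record. Each modification record corresponds to one ColumnFamily object. Note that in the Cassandra code a ColumnFamily object represents a modification to a column family, not the entire contents of that CF. The Memtable class contains a field of type NonBlockingHashMap<DecoratedKey, ColumnFamily>; this hash table maps a row-key to the corresponding ColumnFamily object.

A ColumnFamily object is a collection of Column objects. Similarly, each Column object represents a modification to one column.

Under certain conditions, the modification records in a Memtable are flushed to disk, and a new Memtable is created to replace it. Data is stored on disk in the SSTable (Sorted String Table) format. An SSTable consists of three files:

cfname-seqno-Data.db: the main data file, which holds the modification records;
cfname-seqno-Index.db: an index from each row-key to the offset of the corresponding ColumnFamily in the Data.db file;
cfname-seqno-Filter.db: a Bloom filter over the set of row-keys, used to check quickly whether a CF modification for a given row-key exists in cfname-seqno-Data.db.

Here cfname is the name of the column family, and seqno distinguishes the multiple SSTables of one CFS; seqno increases monotonically.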
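To illustrate what the Filter.db file is for, here is a minimal Bloom filter sketch. It is an assumption-laden illustration, not the on-disk format or hash scheme Cassandra uses: the class name RowKeyBloomFilter, the use of String.hashCode, and the double-hashing step are all chosen here for brevity.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch; not the format of Cassandra's Filter.db file. */
class RowKeyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    RowKeyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force odd so successive positions differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String rowKey) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(rowKey, i));
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String rowKey) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(rowKey, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A reader would consult mightContain before touching Data.db: a false answer lets it skip the SSTable entirely, while a true answer still requires the index lookup because false positives are possible.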

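The Data.db file stores all the CF modifications from a Memtable in row-key order. Each ColumnFamily in the Memtable is stored in the Data.db file in the following format:

bloom filter length
bloom filter: a Bloom filter over column names, used to check quickly whether a given column has been modified;
index size
index info table. Because columns are stored in name order, this index table makes lookups fast. Each index info entry contains:
1. first column name: the first column of the block
2. last column name: the last column of the block
3. offset: the offset of the first column within the column table
4. width: the total size of all columns from the first column to the last column
column table: the column modification data.

Logically, all CFs in an SSTable are sorted by row-key and the columns of each CF are sorted by column name, so the multi-level index structure above can locate a column quickly.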
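The index info table lends itself to a binary search. The sketch below assumes column names compare as plain strings (Cassandra lets the column family choose its comparator) and uses hypothetical names (ColumnIndexSketch, IndexInfo, findBlock); it only shows how the four fields described above could be used to find the block holding a column.

```java
import java.util.List;

class ColumnIndexSketch {
    /** One index entry: columns firstName..lastName stored at [offset, offset + width). */
    record IndexInfo(String firstName, String lastName, long offset, long width) {}

    /**
     * Binary search over entries sorted by name range.
     * Returns the entry whose range covers columnName, or null if none does.
     */
    static IndexInfo findBlock(List<IndexInfo> index, String columnName) {
        int lo = 0;
        int hi = index.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            IndexInfo info = index.get(mid);
            if (columnName.compareTo(info.firstName()) < 0) {
                hi = mid - 1;          // target sorts before this block
            } else if (columnName.compareTo(info.lastName()) > 0) {
                lo = mid + 1;          // target sorts after this block
            } else {
                return info;           // firstName <= target <= lastName
            }
        }
        return null;
    }
}
```

Once the block is found, only the bytes from offset to offset + width need to be read and scanned for the column.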

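Implementation of local read/insert/delete operations

Cassandra supports three main operations: read, insert, and remove. The basic unit of data these operations act on is a column.

In a traditional file system or database, an insert must update both the system's metadata and its data, and a remove must update the metadata. Cassandra's design differs from these systems: both insert and remove append to the existing data and never modify it. This design is called log-structured storage. As the name suggests, the data in the system exists in the form of a log, so new data is only ever appended after existing data. Log-structured storage has two main advantages:

Writes and deletes are very efficient. A traditional storage system has to update both metadata and data, so the disk head must move back and forth repeatedly, which is time-consuming. A log-structured system writes sequentially and can make full use of the file system cache, so it is efficient;
Error recovery is simple. Because the data itself is stored as a log and old data is never overwritten, the journal design does not need to consider undo, which simplifies error recovery.

However, log-structured storage also introduces an important problem: the complexity and performance of reads. In principle, a read has to scan the data from newest to oldest to find the latest version of a record. Compared with a traditional storage system, this is time-consuming.

Insert

The function that implements a single insert in Cassandra is CassandraServer.insert, whose prototype is: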

public void insert(String table, String key, ColumnPath column_path, byte[] value, long timestamp, int consistency_level)

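Here the ColumnPath class describes the path, within Cassandra's index hierarchy, of the data to be inserted. It has three main fields: column_family, super_column, and column. The parameters of this function uniquely determine where the data is to be inserted. Cassandra first uses the table, key, and column_family parameters to determine the nodes that store the data, then packs table, key, column_family, super_column, and column into a RowMutation data structure and dispatches it to the selected nodes, which perform the insert.

Let us first look at the RowMutation data structure. It has three main fields:

table: the table the data belongs to;
key: the row-key of the column family;
modifications: of type Map<String, ColumnFamily>; a collection of CFs, each of which is a modification to one column family.

After the RowMutation object has been sent to each node, the node calls Table.apply to insert the modification into local storage. The local insert proceeds as follows:

The modification is recorded in the commit log. Note that the commit log is not flushed to disk immediately after each modification is appended, so this step does not block;
For each CF in the RowMutation, the corresponding CFS is found and ColumnFamilyStore.apply is called. This function adds the CF to the memtable. If the memtable's size or the number of objects it contains exceeds a threshold, the function creates a new memtable and returns the current one so that its contents can be flushed to an SSTable; otherwise it returns null;
For every non-null memtable returned by a CFS's apply function, Table.apply must flush it to an SSTable for persistent storage. The flush is performed by ColumnFamilyStore.maybeSwitchMemtable, roughly as follows:
1. submitFlush is called to flush the memtable to an SSTable. The flush has two phases: (1) the CFs in the memtable are sorted by row-key; (2) the sorted CFs are written to the SSTable. The first phase is CPU-bound and the second is I/O-bound, so the two are assigned to two different thread pools to increase parallelism.
2. Once the memtable's contents are known to be on disk, the corresponding modification records in the commit log must be deleted. submitFlush returns a condition, which the flushing thread signals when it has finished writing to disk. So, after submitFlush returns, switchMemtable starts a new thread that waits for the condition to be signaled, after which the commit log entries can be safely deleted.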
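The local write path above can be condensed into a small single-threaded sketch. Everything here is a simplification made for illustration: WritePathSketch and its methods are hypothetical, the commit log is an in-memory list, the flush threshold counts rows only, and the flush runs synchronously instead of on separate thread pools with a condition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-threaded sketch of commit log -> memtable -> SSTable; not Cassandra's code. */
class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>(); // stand-in for the on-disk log
    private Map<String, Map<String, String>> memtable = new HashMap<>();
    private final List<TreeMap<String, Map<String, String>>> sstables = new ArrayList<>();
    private final int flushThreshold; // assumption: threshold counts rows only

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void insert(String rowKey, String column, String value) {
        // Step 1: append the modification to the commit log.
        commitLog.add(rowKey + ":" + column + "=" + value);
        // Step 2: apply the modification to the memtable.
        memtable.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
        // Step 3: if the threshold is reached, switch to a new memtable and flush the old one.
        if (memtable.size() >= flushThreshold) {
            Map<String, Map<String, String>> full = memtable;
            memtable = new HashMap<>();
            flush(full);
        }
    }

    private void flush(Map<String, Map<String, String>> full) {
        // Phase 1 (CPU-bound): sort by row key. Phase 2 (I/O-bound): write the SSTable.
        sstables.add(new TreeMap<>(full));
        // Only after the data is durable may the matching commit log entries be dropped.
        commitLog.clear();
    }

    int sstableCount() { return sstables.size(); }
    int commitLogSize() { return commitLog.size(); }
    int memtableRows() { return memtable.size(); }
}
```

With a threshold of 2, the first insert leaves one commit log entry and no SSTable; the second insert triggers the switch, produces one SSTable, and empties both the memtable and the commit log.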

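Delete

Deleting a column really just inserts a tombstone for that column; the original column is not removed directly. The tombstone is recorded in the Memtable and the SSTable as a modification to the CF. The content of the tombstone is the time at which the delete request was executed. This is the local time of the storage node that accepted the client request at the moment it executed the request, and is called the local delete time. Take care to distinguish the local delete time from the timestamp: every CF modification record has a timestamp, which can be understood as the modification time of the column and is supplied by the client.

Because a deleted column is not removed from disk immediately, the disk space used by the system keeps growing. A garbage collection mechanism is therefore needed to periodically remove columns marked with tombstones. Garbage collection is done during compaction, which is described later.

Read

Cassandra has the following five read interfaces: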

public ColumnOrSuperColumn get(String table, String key, ColumnPath column_path, ConsistencyLevel consistency_level)
public List<ColumnOrSuperColumn> get_slice(String keyspace, String key, ColumnParent column_parent, SlicePredicate predicate,
 ConsistencyLevel consistency_level)
public Map<String, ColumnOrSuperColumn> multiget(String table, List<String> keys, ColumnPath column_path, ConsistencyLevel consistency_level)
public Map<String, List<ColumnOrSuperColumn>> multiget_slice(String keyspace, List<String> keys, ColumnParent column_parent,
 SlicePredicate predicate, ConsistencyLevel consistency_level)
public List<KeySlice> get_range_slices(String keyspace, ColumnParent column_parent, SlicePredicate predicate, KeyRange range,
 ConsistencyLevel consistency_level)

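Although these interfaces differ in form, their underlying operation is the same: for each k_i in the client-specified key set keys = {k_1, k_2, ...}, use (k_i, cf-name) to identify a column family, then select a set of columns from it according to some criterion and return that set.

The storage node first uses cf-name to find the corresponding CFS locally. This CFS contains one or more Memtables and multiple SSTables, and each Memtable or SSTable may contain some of the keys in the key set. A local read has two main steps:

For each Memtable or SSTable, read a result set r = {(k, cf)}, where every k belongs to the set keys and the columns in cf satisfy the client's conditions. Each result set is already sorted by k.
Merge the result sets from the previous step into a single result set: the cfs with equal k are combined into one cf, giving the final result set R = {(k_1, cf_1), ....}.

Because each result set r is already sorted by k, a multiway merge can find all the cfs with equal k across the sets. Cassandra uses the IteratorUtils.collatedIterator method from Apache Commons Collections for the multiway merge. The resulting iterator behaves like a min-heap: from a given set of iterators it finds the one whose next value is smallest and returns that value. Cassandra obtains an iterator for each set r and passes them to collatedIterator. Because every call to next() returns the smallest value among the given iterators, all cfs with equal k are returned consecutively. The remaining step is to combine those cfs.

Each column in a cf is one modification to that column, and we want the latest modification, so combining cfs means keeping, among columns with the same name, the one with the newest timestamp.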
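The two read steps can be sketched as follows. This is not Cassandra's code: it swaps in java.util.PriorityQueue for the Commons Collections collating iterator so that the example has no external dependency, and the names MergeReadSketch, Cursor, and merge are hypothetical.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.SortedMap;
import java.util.TreeMap;

class MergeReadSketch {
    record Column(String name, String value, long timestamp) {}

    /** Wraps one sorted source and remembers its current (smallest unread) entry. */
    private static final class Cursor {
        final Iterator<Map.Entry<String, Map<String, Column>>> it;
        Map.Entry<String, Map<String, Column>> head;

        Cursor(Iterator<Map.Entry<String, Map<String, Column>>> it) {
            this.it = it;
            advance();
        }

        void advance() {
            head = it.hasNext() ? it.next() : null;
        }
    }

    /** Multiway merge of per-source results; for equal names the newest timestamp wins. */
    static SortedMap<String, Map<String, Column>> merge(
            List<SortedMap<String, Map<String, Column>>> sources) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head.getKey()));
        for (SortedMap<String, Map<String, Column>> s : sources) {
            Cursor c = new Cursor(s.entrySet().iterator());
            if (c.head != null) heap.add(c);
        }
        SortedMap<String, Map<String, Column>> result = new TreeMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll(); // entries with equal row keys are polled consecutively
            Map<String, Column> merged =
                result.computeIfAbsent(c.head.getKey(), k -> new TreeMap<>());
            for (Column col : c.head.getValue().values()) {
                merged.merge(col.name(), col,
                    (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
            }
            c.advance();
            if (c.head != null) heap.add(c);
        }
        return result;
    }
}
```

Each source stands for one Memtable or SSTable result set r; the returned map plays the role of the final set R.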

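As this shows, Cassandra increases the complexity of reads in order to improve write performance, so read performance is relatively low. Read performance can, however, be improved by adding a cache on each storage node, and this optimization was already present in Cassandra 0.6.1. The design reflects the thinking of the original proposers of log-structured file systems: writes are the main performance bottleneck, whereas reads can be sped up by caching, so the file system should be optimized for writes as far as possible.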

Compaction

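Put simply, compaction merges multiple SSTables in a CFS into one. The compaction mechanism in Cassandra serves three main purposes:

Garbage collection: as noted above, Cassandra does not delete data directly, so disk usage keeps growing; compaction physically removes the data marked as deleted;
Improving read efficiency: compaction merges multiple SSTables into one, which makes reads more efficient;
Generating a MerkleTree: during the merge, a MerkleTree over the data in the CFS is generated, which is used to compare with other storage nodes and to repair data.

Compaction in Cassandra comes in several kinds: minor, major, cleanup, readonly, and anti-entropy. Each implements one or more of the purposes above. Major compaction implements all three, so it is the only kind described here. MerkleTree and data repair are covered in the discussion of data consistency in Cassandra.

Each time a new SSTable is added, Cassandra calls CompactionManager.submitMinorIfNeeded. This function triggers a compaction when the number of SSTables in the CFS reaches a threshold; the compaction itself is carried out in doCompaction. A compaction merges all the SSTables of a CFS into one SSTable and then deletes the original SSTables. In a major compaction, columns or CFs marked as deleted are not retained during the merge.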
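A minimal sketch of the merge-and-drop behaviour follows. It is an illustration under simplifying assumptions: each SSTable is flattened to a sorted map from a "row-key/column-name" string to a cell, a null value stands for a tombstone, and tombstones are dropped unconditionally in a major compaction. CompactionSketch, Cell, and compact are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class CompactionSketch {
    /** A cell version; value == null marks a tombstone (an assumption of this sketch). */
    record Cell(String value, long timestamp) {
        boolean isTombstone() { return value == null; }
    }

    /**
     * Merges several SSTables into one, keeping the newest version of each cell.
     * When major is true, cells whose newest version is a tombstone are dropped.
     */
    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> sstables, boolean major) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                    (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
            }
        }
        if (major) {
            merged.values().removeIf(Cell::isTombstone);
        }
        return merged;
    }
}
```

The output map stands for the single new SSTable; the input SSTables would then be deleted.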

Commit Log

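In the database field, commit logs can be classified as undo logs, redo logs, and undo-redo logs. Because Cassandra never overwrites existing data, it has no need for undo, so its commit log is a redo log. The file format of the commit log and the related operations are described in some detail in the comment at the top of CommitLog.java: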

/*
 * Commit Log tracks every write operation into the system. The aim
 * of the commit log is to be able to successfully recover data that was
 * not stored to disk via the Memtable. Every Commit Log maintains a
 * header represented by the abstraction CommitLogHeader. The header
 * contains a bit array and an array of longs and both the arrays are
 * of size, #column families for the Table, the Commit Log represents.
 *
 * Whenever a ColumnFamily is written to, for the first time its bit flag
 * is set to one in the CommitLogHeader. When it is flushed to disk by the
 * Memtable its corresponding bit in the header is set to zero. This helps
 * track which CommitLogs can be thrown away as a result of Memtable flushes.
 * Additionally, when a ColumnFamily is flushed and written to disk, its
 * entry in the array of longs is updated with the offset in the Commit Log
 * file where it was written. This helps speed up recovery since we can seek
 * to these offsets and start processing the commit log.
 *
 * Every Commit Log is rolled over everytime it reaches its threshold in size;
 * the new log inherits the "dirty" bits from the old.
 *
 * Over time there could be a number of commit logs that would be generated.
 * To allow cleaning up non-active commit logs, whenever we flush a column family and update its bit flag in
 * the active CL, we take the dirty bit array and bitwise & it with the headers of the older logs.
 * If the result is 0, then it is safe to remove the older file. (Since the new CL
 * inherited the old's dirty bitflags, getting a zero for any given bit in the anding
 * means that either the CF was clean in the old CL or it has been flushed since the
 * switch in the new.)
 */

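Here we only add a few points that the comment does not cover:

All Tables in Cassandra share one commit log, so how are the commit log records of different CFs told apart? When Cassandra loads its configuration file at startup, it assigns every CF a global ID. This ID is used in the commit log to distinguish the records of different CFs, and it also serves as the index into the two arrays in CommitLogHeader.
Every time Cassandra restarts, it recovers data from the commit log and then deletes the commit log files.
If the Cassandra configuration file is modified to add or remove a CF, the existing commit log can no longer be used, and data may be lost. The safe procedure is therefore:
- Before changing the configuration, restart Cassandra once and make sure no writes take place; after this restart the previous commit log has been deleted;
- Change the configuration and restart Cassandra again, which ensures that the commit log header has the correct format.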
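The dirty-bit bookkeeping described in the CommitLog.java comment can be sketched with java.util.BitSet. The class below is a hypothetical illustration (CommitLogHeaderSketch and its methods are not Cassandra's API) and leaves out the array of offsets; it only models the rule that an older log may be removed once the bitwise AND of its dirty bits with the active log's dirty bits is zero.

```java
import java.util.BitSet;

/** Sketch of the per-log dirty bits, indexed by the global CF ID. */
class CommitLogHeaderSketch {
    private final int cfCount;
    private final BitSet dirty;

    CommitLogHeaderSketch(int cfCount) {
        this.cfCount = cfCount;
        this.dirty = new BitSet(cfCount);
    }

    /** A CF was written to in this log: set its bit. */
    void markWritten(int cfId) {
        dirty.set(cfId);
    }

    /** The CF's memtable was flushed to disk: clear its bit. */
    void markFlushed(int cfId) {
        dirty.clear(cfId);
    }

    /** On rollover the new log inherits the dirty bits of the old one. */
    CommitLogHeaderSketch rollOver() {
        CommitLogHeaderSketch next = new CommitLogHeaderSketch(cfCount);
        next.dirty.or(dirty);
        return next;
    }

    /** The older log is removable when (older.dirty AND active.dirty) is all zero. */
    static boolean canDiscard(CommitLogHeaderSketch older, CommitLogHeaderSketch active) {
        return !older.dirty.intersects(active.dirty);
    }
}
```

Because the array index is the CF ID assigned at startup, adding or removing a CF in the configuration shifts what each bit means, which is why the two-restart procedure above is the safe one.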

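Distributed read/insert/delete operations

One feature of Cassandra is that the user can specify a consistency level for each read, insert, or delete. The Cassandra API currently supports the following three consistency levels:

ZERO: meaningful only for inserts and deletes. The node executing the operation sends the modification to all replica nodes but does not wait for an acknowledgement from any of them, so no consistency is guaranteed.
ONE: for inserts and deletes, the executing node guarantees that the modification has been written to the commit log and Memtable of one storage node; for reads, the executing node returns the result as soon as it has obtained the data from one storage node.
QUORUM: suppose the data object has n replica nodes. For inserts and deletes, the modification is guaranteed to be written to at least n/2+1 storage nodes; for reads, n/2+1 storage nodes are queried and the data with the newest timestamp is returned.

If the user chooses QUORUM for both reads and writes, every read is guaranteed to see the latest change, because any n/2+1 nodes read must overlap the n/2+1 nodes written in at least one node. In addition, Cassandra 0.6 and later support the ANY level for inserts and deletes, which guarantees that the data has been written to one storage node. It differs from ONE in that ANY counts a write to a hinted handoff node as success, whereas ONE requires the write to reach a final destination node.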
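The quorum arithmetic is small enough to check directly. In this sketch, QuorumSketch, quorum, and readSeesLatestWrite are hypothetical helper names; n/2+1 uses integer division as in the text.

```java
class QuorumSketch {
    /** Number of replicas that make up a quorum of n: floor(n / 2) + 1. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /**
     * A read of r replicas is guaranteed to overlap a write acknowledged by w replicas
     * exactly when r + w > n (pigeonhole over the n replicas).
     */
    static boolean readSeesLatestWrite(int n, int r, int w) {
        return r + w > n;
    }
}
```

For n = 3 the quorum is 2, and 2 + 2 > 3, so a QUORUM read always reaches at least one replica that took the latest QUORUM write; with ONE for both, 1 + 1 > 3 fails and a stale read is possible.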

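Maintaining eventual consistency

Put simply, eventual consistency is maintained by periodically checking whether the replicas of the data agree and, if they do not, synchronizing them promptly. Cassandra uses three mechanisms to guarantee eventual consistency: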

anti-entropy
read repair
hinted handoff

anti-entropy

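Anti-entropy is Cassandra's main mechanism for maintaining eventual consistency. In physics, entropy measures the degree of disorder or inconsistency; anti-entropy therefore means maintaining consistency.

Cassandra uses a distributed hash table (DHT) to determine the nodes that store a given data object. In the DHT, both the storage nodes and the data objects are assigned a token. A token can only take values in a fixed range; for example, if MD5 is used as the token, the range is [0, 2^128-1]. Storage nodes and objects are arranged in a ring by token order, so that the largest token is followed by the smallest; for MD5, the token after 2^128-1 is 0. Cassandra distributes data with the following algorithm:

First, each storage node is assigned a random token, which represents its position on the DHT ring;
Then, the user specifies a key (the row-key) for the data object; Cassandra computes a hash of this key as the token, and the token determines the object's position on the DHT ring;
Next, the data object is stored by the node on the ring that has the smallest token larger than the object's token;
Finally, according to the replication strategy the user specified in the configuration, the data object is replicated to N-1 other nodes, so that N replicas of the object exist in the network in total.

Each storage node is therefore responsible at least for the data objects that lie on the ring between it and the preceding storage node, and all of these objects are replicated to the same nodes. We call the region between any two points on the DHT ring a range; each storage node thus has to store the range between itself and the preceding storage node.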
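The placement rule can be sketched with a sorted map standing in for the ring. Several things here are assumptions for illustration: TokenRingSketch and its methods are hypothetical, a token equal to a node's token is treated as owned by that node, and the other N-1 replicas are simply the next nodes clockwise, which is only one possible replication strategy.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TokenRingSketch {
    // Node tokens in ring order; the map wraps around from the last key to the first.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    /** MD5 of the row key, read as a non-negative integer in [0, 2^128 - 1]. */
    static BigInteger tokenFor(String rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    /** Primary owner first, then the following nodes clockwise, up to n replicas. */
    List<String> replicasFor(BigInteger token, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        BigInteger current = ring.ceilingKey(token);      // smallest node token >= object token
        if (current == null) current = ring.firstKey();   // wrap past the largest token
        int wanted = Math.min(n, ring.size());
        while (result.size() < wanted) {
            result.add(ring.get(current));
            current = ring.higherKey(current);
            if (current == null) current = ring.firstKey();
        }
        return result;
    }
}
```

With nodes at tokens 10, 20, and 30, an object with token 15 lands on the node at 20, and an object with token 35 wraps around to the node at 10.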

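Because Cassandra replicates in units of ranges, each node has to check periodically with the nodes that hold the same range to see whether there are any inconsistencies. The simplest approach is as follows:

Compute a message digest for every data object in the range, then send the set of (token, digest) pairs {(token, digest)} for all objects in the range to the neighbor to be checked;
The neighbor also computes message digests for the objects it holds in that range; after both lists are sorted by token, a merge finds all inconsistent and missing data objects;
The neighbor sends these inconsistent and missing objects to the node that initiated the repair.

The advantage of this algorithm is precision, down to the individual object. The drawback is that the network cost is too high, because a range may contain a great many objects. Since the proportion of inconsistent objects is usually low, sending the digests of all objects generally costs more than subsequently sending the objects that actually need repair.

To reduce the cost, Cassandra's anti-entropy algorithm lowers the precision of the comparison. The range is divided into sub-ranges, and the XOR of the digests of all objects in a sub-range is taken as the digest of that sub-range. Only the sub-ranges and their digests are compared, and if a sub-range is found to be inconsistent, all objects in that sub-range are repaired. How, then, are sub-ranges divided and compared? Cassandra uses a MerkleTree.

The MerkleTree in Cassandra is a binary tree. The whole tree represents a complete range, each leaf represents a sub-range and records the digest of that sub-range, and the digest of each inner node is the XOR of the digests of its two children. Only two MerkleTrees that represent the same range can be compared meaningfully, but in Cassandra, because inconsistencies exist, the start or end token of two replicas of the same range may differ. For example, the range on node A may be {3, 4, 5, 6, 8, 10}, while node B, although it also replicates the data in this range, may because of an inconsistency hold the token set {4, 5, 6, 8, 10}. If two trees were built directly from these two ranges, the comparison between them would be inaccurate. To solve this problem, every MerkleTree in Cassandra represents the entire DHT ring; with MD5 tokens, for example, the range is [0, 2^128) (equivalent to [0, 0]).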

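Building a MerkleTree from a range takes two steps:

Divide the range into sub-ranges and build the tree structure;
Compute the digests of the leaf nodes.

The division into sub-ranges is done in AntiEntropyService.Validator.prepare(). The algorithm is: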

    int numkeys = keys.size();
    Random random = new Random();
    // sample the column family using random keys from the index
    while (true)
    {
        DecoratedKey dk = keys.get(random.nextInt(numkeys));
        if (!tree.split(dk.token))
            break;
    }

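keys is the list of tokens of the data objects in the range. On each iteration the algorithm picks one token at random. This random sampling estimates the distribution of tokens in the range, and the range is divided according to that distribution: regions where tokens are dense are divided finely and regions where tokens are sparse are divided coarsely, so that each sub-range contains roughly the same number of tokens. The division itself is done by MerkleTree.split():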

    /**
     * Splits the range containing the given token, if no tree limits would be
     * violated. If the range would be split to a depth below hashdepth, or if
     * the tree already contains maxsize subranges, this operation will fail.
     *
     * @return True if the range was successfully split.
     */
    public boolean split(Token t)
    {
        if (!(size < maxsize))
            return false;
 
        Token mintoken = partitioner.getMinimumToken();
        try
        {
            root = splitHelper(root, mintoken, mintoken, (byte)0, t);
        }
        catch (StopRecursion.TooDeep e)
        {
            return false;
        }
        return true;
    }
 
    private Hashable splitHelper(Hashable hashable, Token pleft, Token pright, byte depth, Token t) throws StopRecursion.TooDeep
    {
        if (depth >= hashdepth)
            throw new StopRecursion.TooDeep();
 
        if (hashable instanceof Leaf)
        {
            // split
            size++;
            Token midpoint = partitioner.midpoint(pleft, pright);
            return new Inner(midpoint, new Leaf(), new Leaf());
        }
        // else: node.
 
        // recurse on the matching child
        Inner node = (Inner)hashable;
        if (Range.contains(pleft, node.token, t))
            // left child contains token
            node.lchild(splitHelper(node.lchild, pleft, node.token, inc(depth), t));
        else
            // else: right child contains token
            node.rchild(splitHelper(node.rchild, node.token, pright, inc(depth), t));
        return node;
    }

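This function calls splitHelper to divide recursively. If a token falls on a leaf node, the sub-range of that leaf is split at its midpoint into two sub-ranges, which ensures that the denser the tokens, the smaller the sub-ranges. Note in particular that every sub-range is split at its midpoint, not at the position of the token; this makes the later tree comparison algorithm simpler. As a result, if the total length of the DHT ring is 2^n, the length of every sub-range is 2^k (1<=k<=n), and k may differ from one sub-range to another.

Once the tree structure has been built, AntiEntropyService.Validator.add() and AntiEntropyService.Validator.complete() compute the digests of the leaf nodes. For a sub-range that contains tokens, the SHA-256 digest of the object for each token is computed first, and these digests are then XORed together to give the digest of the sub-range. For a sub-range with no tokens, the digest is set to 0.

Given two MerkleTrees, MerkleTree.difference() compares them and returns the list of maximal contiguous inconsistent sub-ranges in the two trees. The trees passed to this function must have been built with the prepare function described above, that is, every sub-range must have been split at its midpoint. The algorithm first compares the digests of the two root nodes. If they differ, it compares the digests of the two left subtrees and of the two right subtrees, and recurses into each pair of subtrees whose digests differ. During the recursion the algorithm collects all the maximal contiguous inconsistent sub-ranges.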
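The construction and comparison can be illustrated with a much smaller tree. This sketch makes several simplifications that are not in Cassandra's MerkleTree: the tree is split uniformly to a fixed depth instead of by sampling, tokens and digests are 64-bit longs standing in for MD5 tokens and SHA-256 digests, and difference returns the differing leaf ranges without joining adjacent ones. All names (MerkleSketch, Node, build, add, difference) are hypothetical.

```java
import java.util.List;

class MerkleSketch {
    /** A node covering the token range [lo, hi); a leaf has no children. */
    static final class Node {
        final long lo;
        final long hi;
        Node left;
        Node right;
        long hash; // XOR of the digests of all tokens in [lo, hi); 0 if there are none

        Node(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    /** Builds a complete tree of the given depth, always splitting at the midpoint. */
    static Node build(long lo, long hi, int depth) {
        Node node = new Node(lo, hi);
        if (depth > 0) {
            long mid = lo + (hi - lo) / 2;
            node.left = build(lo, mid, depth - 1);
            node.right = build(mid, hi, depth - 1);
        }
        return node;
    }

    /** XORs the digest into the leaf containing the token and into every ancestor. */
    static void add(Node node, long token, long digest) {
        node.hash ^= digest;
        if (!node.isLeaf()) {
            add(token < node.left.hi ? node.left : node.right, token, digest);
        }
    }

    /** Appends each differing leaf range as {lo, hi}; both trees must have the same shape. */
    static void difference(Node a, Node b, List<long[]> out) {
        if (a.hash == b.hash) return; // equal digests: treat the whole subtree as consistent
        if (a.isLeaf() || b.isLeaf()) {
            out.add(new long[] {a.lo, a.hi});
            return;
        }
        difference(a.left, b.left, out);
        difference(a.right, b.right, out);
    }
}
```

Because both trees split at the same midpoints, their nodes line up one to one, which is what lets difference compare them pairwise.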

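That covers the principle of the MerkleTree. Together with the Anti-Entropy article on the Cassandra wiki and the source code, this should be enough to understand how AE works in Cassandra.

The last question is how to repair the inconsistent sub-ranges once they have been found. The node performs an AntiCompaction on the inconsistent sub-ranges to produce a temporary SSTable, and then sends these files through the Streaming API to the replica node that needs repair. The replica node only has to save the received SSTable as a new SSTable; subsequent reads and compactions will then see the new data.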

read repair

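Read repair means that when a client reads a column, the storage node executing the client's request is responsible for checking whether the replicas of that column agree and, if they do not, repairing them. When the repair happens depends on the consistency level the user specified. With ONE, the node returns to the client as soon as it has the first result from a storage node, and the repair is done in the background. With QUORUM, the node follows this protocol: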

    // 1. Get the N nodes from storage service where the data needs to be
    //    replicated
    // 2. Construct a message for read/write
    // 3. Set one of the messages to get the data and the rest to get the digest
    // 4. SendRR ( to all the nodes above )
    // 5. Wait for a response from at least X nodes where X <= N and the data node
    // 6. If the digest matches return the data.
    // 7. else carry out read repair by getting data from all the nodes.
    // 8. return success

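The executing node sends a read request to all N replica nodes, but asks only one of them to return the data; the others only have to return a digest of the data. The executing node checks whether the data matches the digests and, if not, starts a read repair. (In Cassandra 0.6.1 the implementation of this protocol does not appear to match the description: the executing node compares the digest of the data with the digest returned by only one of the storage nodes, instead of comparing the data against the digests returned by all X nodes. Such an implementation may miss some inconsistencies.)

The read repair itself works as follows: read the data from all N replica nodes, select the most recent version, and write that version back to these replica nodes so that their versions agree. Which version is more recent is decided by timestamp.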
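The repair step can be sketched over an in-memory map of replicas. This is only an illustration of "pick the newest timestamp and write it back"; ReadRepairSketch, Versioned, and repair are hypothetical names, and the network round trips and digest comparison are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReadRepairSketch {
    /** One replica's copy of a column value together with its timestamp. */
    record Versioned(String value, long timestamp) {}

    /**
     * Overwrites every stale replica with the newest version.
     * The map (replica name -> version) must be mutable; returns the replicas that were repaired.
     */
    static List<String> repair(Map<String, Versioned> replicas) {
        Versioned newest = null;
        for (Versioned v : replicas.values()) {
            if (newest == null || v.timestamp() > newest.timestamp()) {
                newest = v;
            }
        }
        List<String> repaired = new ArrayList<>();
        for (Map.Entry<String, Versioned> e : replicas.entrySet()) {
            if (!e.getValue().equals(newest)) {
                e.setValue(newest); // stands in for writing the newest version to that replica
                repaired.add(e.getKey());
            }
        }
        return repaired;
    }
}
```

After the call, every entry in the map holds the same version, which is the state read repair aims for across the N replicas.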

hinted handoff

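For background, first see the introductory article on hinted handoff on the Cassandra wiki.

The node acting as the relay stores the relayed data together with its own data and then adds some references in the System Table. For details, see the comment at the top of HintedHandoffManager.java: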

/**
 * For each table (keyspace), there is a row in the system hints CF.
 * SuperColumns in that row are keys for which we have hinted data.
 * Subcolumns names within that supercolumn are host IPs. Subcolumn values are always empty.
 * Instead, we store the row data "normally" in the application table it belongs in.
 *
 * So when we deliver hints we look up endpoints that need data delivered
 * on a per-key basis, then read that entire row out and send it over.
 * (TODO handle rows that have incrementally grown too large for a single message.)
 *
 * HHM never deletes the row from Application tables; there is no way to distinguish that
 * from hinted tombstones!  instead, rely on cleanup compactions to remove data
 * that doesn't belong on this node.  (Cleanup compactions may be started manually
 * -- on a per node basis -- with "nodeprobe cleanup.")
 *
 * TODO this avoids our hint rows from growing excessively large by offloading the
 * message data into application tables.  But, this means that cleanup compactions
 * will nuke HH data.  Probably better would be to store the RowMutation messages
 * in a HHData (non-super) CF, modifying the above to store a UUID value in the
 * HH subcolumn value, which we use as a key to a [standard] HHData system CF
 * that would contain the message bytes.
 *
 * There are two ways hinted data gets delivered to the intended nodes.
 *
 * runHints() runs periodically and pushes the hinted data on this node to
 * every intended node.
 *
 * runDelieverHints() is called when some other node starts up (potentially
 * from a failure) and delivers the hinted data just to that node.
 */

The distributed delete problem

In a fully distributed system, deleting data is a hard problem, and Cassandra's strategy for deleting data is an instructive one.
