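In the previous chapters we got familiar with LevelDB's basic operations: creating a database, reading data, and writing data. Now it is time to look closely at the structures that actually hold the data. We started this series with the skiplist implementation, and MemTable is essentially built on top of that skiplist. MemTable is the in-memory storage structure, and reads and writes go to it first. sstable is the on-disk storage structure. This part is also where the essence of LevelDB lies.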
MemTable
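MemTable has a fairly simple structure: its get and put operations translate directly into operations on the skiplist. One thing to note is that MemTable carries a built-in reference count that works much like a smart pointer: the object is destroyed only when the count drops to 0. The difference is that the count has to be managed by hand. The initial count is 0, so after creating a MemTable the caller must call Ref(), and to release it the caller calls Unref() rather than deleting it directly.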
class MemTable {
 public:
  // MemTables are reference counted. The initial reference count
  // is zero and the caller must call Ref() at least once.
  explicit MemTable(const InternalKeyComparator& comparator);
  MemTable(const MemTable&) = delete;
  MemTable& operator=(const MemTable&) = delete;

  // Increase reference count.
  void Ref() { ++refs_; }

  // Drop reference count. Delete if no more references exist.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }

  // Add an entry into memtable that maps key to value at the
  // specified sequence number and with the specified type.
  // Typically value will be empty if type==kTypeDeletion.
  void Add(SequenceNumber seq, ValueType type, const Slice& key,
           const Slice& value);

  // If memtable contains a value for key, store it in *value and return true.
  // If memtable contains a deletion for key, store a NotFound() error
  // in *status and return true.
  // Else, return false.
  bool Get(const LookupKey& key, std::string* value, Status* s);
  ...
 private:
  typedef SkipList<const char*, KeyComparator> Table;

  ~MemTable();  // Private since only Unref() should be used to delete it

  KeyComparator comparator_;
  int refs_;
  Arena arena_;
  // MemTable stores everything in a single skiplist.
  // Note that this is the object itself, not a pointer: the MemTable
  // constructor simply invokes the SkipList constructor.
  Table table_;
  ...
};
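To make the manual reference counting concrete, here is a minimal sketch of the same pattern on a stand-in class. `RefCounted`, `live_objects`, and `DemoRefCounting` are hypothetical names made up for this example; only the `Ref()`/`Unref()` logic and the private destructor mirror the MemTable listing above.

```cpp
#include <cassert>

// Hypothetical stand-in for MemTable: only the reference-counting part.
class RefCounted {
 public:
  // live_objects is an external counter used to observe destruction.
  explicit RefCounted(int* live_objects)
      : refs_(0), live_objects_(live_objects) {
    ++*live_objects_;
  }
  RefCounted(const RefCounted&) = delete;
  RefCounted& operator=(const RefCounted&) = delete;

  // Increase the reference count.
  void Ref() { ++refs_; }

  // Decrease the reference count and self-destruct when it reaches zero.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }

 private:
  // Private so that callers cannot write `delete obj` and bypass Unref().
  ~RefCounted() { --*live_objects_; }

  int refs_;
  int* live_objects_;
};

// Returns (live objects after the first Unref) * 10 + (live objects at end).
int DemoRefCounting() {
  int live = 0;
  RefCounted* obj = new RefCounted(&live);  // the count starts at 0
  obj->Ref();                               // first owner
  obj->Ref();                               // second owner
  obj->Unref();                             // one owner left, still alive
  const int alive_mid = live;
  obj->Unref();                             // last owner gone, object deleted
  return alive_mid * 10 + live;
}
```

Because the destructor is private, forgetting `Ref()` after `new` or calling `delete` directly is either a leak or a compile error, which is exactly the discipline the MemTable comments ask for.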
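The interface comments above are clear enough, so let's look at how Add is implemented. Roughly: it first computes the encoded length of the key-value entry (the format is listed in the comment at the top of the function), then allocates that much memory from the arena, fills in the fields in that format, and finally inserts the buffer into the skiplist.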
void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
                   const Slice& value) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  // Total size of the entry in the format described above
  const size_t encoded_len = VarintLength(internal_key_size) +
                             internal_key_size + VarintLength(val_size) +
                             val_size;
  // Allocate encoded_len bytes from the arena
  char* buf = arena_.Allocate(encoded_len);
  // Write the internal key length
  char* p = EncodeVarint32(buf, internal_key_size);
  // Write the key: the user key bytes first, then the 8-byte tag, which
  // packs the sequence number (upper 56 bits) and the type (low 8 bits)
  memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (s << 8) | type);
  p += 8;
  // Write the value size, then the value bytes
  p = EncodeVarint32(p, val_size);
  memcpy(p, value.data(), val_size);
  assert(p + val_size == buf + encoded_len);
  // Finally insert the buffer into the skiplist
  table_.Insert(buf);
}
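To see what an entry looks like byte by byte, here is a small self-contained sketch of the same layout built into a `std::string` instead of an arena buffer. `PutVarint32`, `PutFixed64`, and `EncodeEntry` are helper names written for this example (they are not the LevelDB functions used above), and the sketch assumes the fixed-width tag is stored little-endian.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends v as a varint32: 7 payload bits per byte, with the high bit set
// on every byte except the last one.
void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends v as 8 bytes, least significant byte first (assumed byte order).
void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Builds one entry:
//   varint32(user_key.size() + 8) | user_key | tag | varint32(value.size()) | value
// where tag = (seq << 8) | type.
std::string EncodeEntry(uint64_t seq, uint8_t type, const std::string& user_key,
                        const std::string& value) {
  std::string out;
  PutVarint32(&out, static_cast<uint32_t>(user_key.size() + 8));
  out.append(user_key);
  PutFixed64(&out, (seq << 8) | type);
  PutVarint32(&out, static_cast<uint32_t>(value.size()));
  out.append(value);
  return out;
}
```

For example, `EncodeEntry(5, 1, "foo", "bar")` produces 16 bytes: one length byte (11), the three key bytes, eight tag bytes, one length byte (3), and the three value bytes.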
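Next, the lookup. Get creates an Iterator over the skiplist and seeks to the lookup key. If no entry for that user key is found, it returns false right away. If one is found, it is decoded according to the format above. One more detail: the type is taken from the low byte of the tag (tag & 0xff). If it is kTypeDeletion, the key has been deleted, so Get stores a NotFound status in *s and returns true. Otherwise it assigns the value to the value parameter and returns true.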
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();
  // Seek the skiplist with an iterator. Get returns true if an entry for
  // this user key is found (a value or a deletion marker), false otherwise.
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key.  We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {
        // kTypeValue: the entry was written by a put
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        // kTypeDeletion: the key has been deleted
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}
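The comment about Seek() skipping "overly large sequence numbers" is easier to follow with a toy model. The sketch below is not the skiplist: it uses a sorted `std::vector`, and it assumes entries are ordered by user key ascending and then by tag descending (newest first), with kTypeDeletion = 0 and kTypeValue = 1. `Entry`, `MakeEntry`, `InternalLess`, and `ToyGet` are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum ValueType : uint8_t { kTypeDeletion = 0x0, kTypeValue = 0x1 };

struct Entry {
  std::string user_key;
  uint64_t tag;  // (sequence << 8) | type
  std::string value;
};

Entry MakeEntry(const std::string& user_key, uint64_t seq, ValueType type,
                const std::string& value) {
  return Entry{user_key, (seq << 8) | static_cast<uint64_t>(type), value};
}

// Assumed ordering: user key ascending, then tag descending (newer first).
bool InternalLess(const Entry& a, const Entry& b) {
  if (a.user_key != b.user_key) return a.user_key < b.user_key;
  return a.tag > b.tag;
}

// Looks up user_key as of snapshot_seq in a vector sorted by InternalLess.
// Returns 1 if a value was found, 0 if the key is deleted, -1 if absent.
int ToyGet(const std::vector<Entry>& sorted, const std::string& user_key,
           uint64_t snapshot_seq, std::string* value) {
  // The probe sorts before every entry of user_key whose sequence number is
  // <= snapshot_seq, so lower_bound lands on the newest visible entry.
  const Entry probe = MakeEntry(user_key, snapshot_seq, kTypeValue, "");
  auto it = std::lower_bound(sorted.begin(), sorted.end(), probe, InternalLess);
  if (it == sorted.end() || it->user_key != user_key) return -1;
  switch (static_cast<ValueType>(it->tag & 0xff)) {
    case kTypeValue:
      *value = it->value;
      return 1;
    case kTypeDeletion:
      return 0;
  }
  return -1;
}
```

With three versions of the same key (put at sequence 1, delete at 2, put at 3), a lookup at snapshot 3 returns the newest value, a lookup at snapshot 2 reports the deletion, and a lookup at snapshot 1 still sees the old value.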
sstable
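An sstable is a file on disk that persistently stores the contents of an Immutable MemTable. The TableBuilder class is what builds it, and its key piece is the internal struct Rep.
The data in an sstable falls into the following categories by purpose: DataBlock, FilterBlock, MetaIndexBlock, IndexBlock, and Footer. A DataBlock stores key-value pairs sorted by key in ascending order; on disk each block is laid out as three fields: data, type, and crc. We will skip FilterBlock for now. The MetaIndexBlock stores the index of the filter, the IndexBlock stores the index of the data blocks, and the Footer stores the index of the indexes: it holds the location of the MetaIndexBlock and the location of the IndexBlock.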
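Every "index" mentioned above ultimately points at a block through a handle, which is just an (offset, size) pair. As a rough sketch, and assuming the handle is serialized as two varint64 values (this is my recollection of the format rather than something shown in this post), encoding and decoding could look like this. `ToyBlockHandle`, `EncodeHandle`, and `DecodeHandle` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v as a varint64: 7 payload bits per byte, high bit = "more bytes".
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Reads a varint64 from src starting at *pos and advances *pos.
// Returns false if the input ends before the varint is complete.
bool GetVarint64(const std::string& src, size_t* pos, uint64_t* v) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *v = result;
      return true;
    }
  }
  return false;
}

// Hypothetical stand-in for a block handle: where a block lives in the file.
struct ToyBlockHandle {
  uint64_t offset;
  uint64_t size;
};

std::string EncodeHandle(const ToyBlockHandle& h) {
  std::string out;
  PutVarint64(&out, h.offset);
  PutVarint64(&out, h.size);
  return out;
}

bool DecodeHandle(const std::string& s, ToyBlockHandle* h) {
  size_t pos = 0;
  return GetVarint64(s, &pos, &h->offset) && GetVarint64(s, &pos, &h->size);
}
```

An index block entry then maps a key to an encoded handle of a data block, and the footer holds the handles of the metaindex block and the index block.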
struct TableBuilder::Rep {
  Rep(const Options& opt, WritableFile* f)
      : options(opt),
        index_block_options(opt),
        file(f),
        offset(0),
        data_block(&options),
        index_block(&index_block_options),
        num_entries(0),
        closed(false),
        filter_block(opt.filter_policy == nullptr
                         ? nullptr
                         : new FilterBlockBuilder(opt.filter_policy)),
        pending_index_entry(false) {
    index_block_options.block_restart_interval = 1;
  }

  Options options;
  Options index_block_options;
  WritableFile* file;       // the .sst file being written
  uint64_t offset;          // current write offset in the file
  Status status;
  BlockBuilder data_block;  // data block currently being built
  BlockBuilder index_block; // index block
  std::string last_key;     // last key added; used to check that keys arrive in sorted order
  int64_t num_entries;
  bool closed;              // whether the builder has been closed
  FilterBlockBuilder* filter_block;
  // True only while data_block is empty: a data block has just been flushed
  // and its index entry is deferred until the next key is seen.
  bool pending_index_entry;
  BlockHandle pending_handle;     // offset and size of the just-flushed data block
  std::string compressed_output;  // scratch buffer holding the compressed block contents
};
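Writing data into an sstable goes through Add() -> Flush() -> WriteBlock() -> WriteRawBlock(). Here is a summary of what each function does.
- Add() first checks that the key of the new key-value pair is greater than last_key (this guarantees that keys in the sstable are in increasing order). If a data block has just been flushed (pending_index_entry is true), it first adds an entry for that block to the index block. It then inserts the pair into the data block and updates last_key and num_entries. If the estimated size of the data block is still below the threshold (block_size, 4 KB by default) it is done; otherwise it calls Flush() to write the block to disk.
- Flush() calls WriteBlock() to write the contents of the data block to the underlying file.
- WriteBlock() first takes the finished contents out of the data block, compresses them if configured, arranges them in the following format, and then calls WriteRawBlock()
block_data: uint8[n] (data)
type: uint8 (compression type)
crc: uint32 (checksum)
- WriteRawBlock() is the function that actually appends the data to the file.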
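The flow in the list above can be modeled in a few lines. This is a toy sketch, not LevelDB's TableBuilder: blocks are plain concatenated strings, the "file" is a `std::string`, there is no compression or checksum, and the index is keyed by the last key of each block instead of a shortened separator. `ToyTableBuilder` and `ToyIndexEntry` are hypothetical names. What it does show is the deferred index entry: a block's index entry is only emitted when the next key arrives (or at the end).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy model of the Add -> Flush -> write pipeline.
struct ToyIndexEntry {
  std::string key;  // last key of the indexed block
  uint64_t offset;  // where the block starts in the file
  uint64_t size;    // block length in bytes
};

class ToyTableBuilder {
 public:
  explicit ToyTableBuilder(size_t block_size) : block_size_(block_size) {}

  void Add(const std::string& key, const std::string& value) {
    assert(num_entries_ == 0 || key > last_key_);  // keys must be increasing
    if (pending_index_entry_) {
      // The previous block is already in the file; index it now.
      index_.push_back(ToyIndexEntry{last_key_, pending_offset_, pending_size_});
      pending_index_entry_ = false;
    }
    last_key_ = key;
    ++num_entries_;
    block_ += key;
    block_ += value;
    if (block_.size() >= block_size_) Flush();
  }

  // Appends the current block to the file and remembers its location.
  void Flush() {
    if (block_.empty()) return;
    pending_offset_ = file_.size();
    pending_size_ = block_.size();
    file_ += block_;
    block_.clear();
    pending_index_entry_ = true;
  }

  // Flushes the last block and emits its index entry.
  void Finish() {
    Flush();
    if (pending_index_entry_) {
      index_.push_back(ToyIndexEntry{last_key_, pending_offset_, pending_size_});
      pending_index_entry_ = false;
    }
  }

  const std::vector<ToyIndexEntry>& index() const { return index_; }
  const std::string& file() const { return file_; }

 private:
  size_t block_size_;
  std::string block_;
  std::string file_;
  std::string last_key_;
  std::vector<ToyIndexEntry> index_;
  uint64_t pending_offset_ = 0;
  uint64_t pending_size_ = 0;
  int num_entries_ = 0;
  bool pending_index_entry_ = false;
};
```

With a block size of 8, adding ("a", "1111"), ("b", "2222"), ("c", "3333") and then calling `Finish()` yields two blocks: the first holds "a" and "b" at offset 0 with size 10, the second holds "c" at offset 10 with size 5.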
void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->num_entries > 0) {
    // Once at least one entry has been added, every new key must be greater
    // than last_key, which keeps the keys in the table sorted
    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
  }
  if (r->pending_index_entry) {
    // A data block has just been flushed, so data_block is empty.
    // pending_handle already holds that block's offset and size; add its
    // entry to the index block now, keyed by a short separator that lies
    // between last_key and the new key
    assert(r->data_block.empty());
    r->options.comparator->FindShortestSeparator(&r->last_key, key);
    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));
    r->pending_index_entry = false;
  }
  if (r->filter_block != nullptr) {
    r->filter_block->AddKey(key);
  }
  // Update last_key and num_entries, and insert into the data block
  r->last_key.assign(key.data(), key.size());
  r->num_entries++;
  r->data_block.Add(key, value);
  // When the estimated size of the data block reaches the threshold
  // (block_size, 4 KB by default), flush it to disk
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {
    Flush();
  }
}
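The FindShortestSeparator call in Add() is why index keys can be shorter than the real keys. Here is a minimal sketch of the idea for plain bytewise-ordered strings. `ShortestSeparator` is a name made up for this example, and the comparator TableBuilder actually uses may handle more cases than this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// If possible, replaces *start with a shorter string k such that
// *start <= k < limit in bytewise order. Otherwise leaves *start unchanged.
void ShortestSeparator(std::string* start, const std::string& limit) {
  const size_t min_len = std::min(start->size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && (*start)[diff] == limit[diff]) ++diff;
  if (diff >= min_len) return;  // one string is a prefix of the other

  const uint8_t byte = static_cast<uint8_t>((*start)[diff]);
  // Bump the first differing byte if the result still sorts below limit.
  if (byte < 0xff && byte + 1 < static_cast<uint8_t>(limit[diff])) {
    (*start)[diff] = static_cast<char>(byte + 1);
    start->resize(diff + 1);
  }
}
```

For instance, if the last key of one block is "abcdefg" and the first key of the next block is "abzzz", the index entry can use "abd", which still separates the two blocks correctly while taking less space.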
void TableBuilder::Flush() {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->data_block.empty()) return;
  assert(!r->pending_index_entry);
  WriteBlock(&r->data_block, &r->pending_handle);
  if (ok()) {
    r->pending_index_entry = true;
    r->status = r->file->Flush();
  }
  if (r->filter_block != nullptr) {
    r->filter_block->StartBlock(r->offset);
  }
}
void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {
  // File format contains a sequence of blocks where each block has:
  //    block_data: uint8[n]
  //    type: uint8
  //    crc: uint32
  assert(ok());
  Rep* r = rep_;
  Slice raw = block->Finish();  // Finish() returns the finished block contents
  Slice block_contents;
  CompressionType type = r->options.compression;
  // TODO(postrelease): Support more compression options: zlib?
  switch (type) {
    case kNoCompression:
      block_contents = raw;
      break;
    case kSnappyCompression: {
      std::string* compressed = &r->compressed_output;
      if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
          compressed->size() < raw.size() - (raw.size() / 8u)) {
        block_contents = *compressed;
      } else {
        // Snappy not supported, or compressed less than 12.5%, so just
        // store uncompressed form
        block_contents = raw;
        type = kNoCompression;
      }
      break;
    }
  }
  WriteRawBlock(block_contents, type, handle);
  // After the write, clear the compression buffer and reset the block
  r->compressed_output.clear();
  block->Reset();
}
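The 12.5% rule in WriteBlock() is just integer arithmetic on the two sizes. A tiny sketch of the same check, with `WorthStoringCompressed` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>

// Returns true when the compressed form saves more than 1/8 (12.5%) of the
// raw size, mirroring the condition used in WriteBlock().
bool WorthStoringCompressed(size_t raw_size, size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}
```

For a 4096-byte block the cutoff is 4096 - 512 = 3584 bytes: a compressed size of 3583 is kept, while 3584 or more falls back to storing the block uncompressed.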
void TableBuilder::WriteRawBlock(const Slice& block_contents,
                                 CompressionType type, BlockHandle* handle) {
  Rep* r = rep_;
  handle->set_offset(r->offset);
  handle->set_size(block_contents.size());
  // Append to the underlying file: the block contents first, then the
  // trailer (type + crc)
  r->status = r->file->Append(block_contents);
  if (r->status.ok()) {
    char trailer[kBlockTrailerSize];
    trailer[0] = type;  // byte 0 is the type, the crc follows
    uint32_t crc = crc32c::Value(block_contents.data(), block_contents.size());
    crc = crc32c::Extend(crc, trailer, 1);  // Extend crc to cover block type
    EncodeFixed32(trailer + 1, crc32c::Mask(crc));
    r->status = r->file->Append(Slice(trailer, kBlockTrailerSize));  // append type and crc
    if (r->status.ok()) {
      r->offset += block_contents.size() + kBlockTrailerSize;
    }
  }
}
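To make the trailer concrete, here is a self-contained sketch that builds the 5 trailing bytes (1 type byte plus a 4-byte crc) for a block. It uses a slow bitwise CRC-32C rather than an optimized one, and the masking step (rotate right by 15 bits, then add the constant 0xa282ead8) is written from my memory of LevelDB's crc32c helpers, so treat it as an assumption. `Crc32cExtend`, `MaskCrc`, `UnmaskCrc`, `DecodeFixed32LE`, and `MakeTrailer` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Passing a previous result as crc continues the checksum over more data.
uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; ++i) {
    crc ^= static_cast<uint8_t>(data[i]);
    for (int k = 0; k < 8; ++k) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Assumed masking scheme: rotate right by 15 bits and add a constant.
const uint32_t kMaskDelta = 0xa282ead8u;
uint32_t MaskCrc(uint32_t crc) { return ((crc >> 15) | (crc << 17)) + kMaskDelta; }
uint32_t UnmaskCrc(uint32_t masked) {
  const uint32_t rot = masked - kMaskDelta;
  return (rot >> 17) | (rot << 15);
}

// Reads a 32-bit little-endian integer from p.
uint32_t DecodeFixed32LE(const char* p) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<uint32_t>(static_cast<uint8_t>(p[i])) << (8 * i);
  }
  return v;
}

// Builds the trailer: [type][masked crc over block_contents + type].
std::string MakeTrailer(const std::string& block_contents, uint8_t type) {
  const char type_byte = static_cast<char>(type);
  uint32_t crc = Crc32cExtend(0, block_contents.data(), block_contents.size());
  crc = Crc32cExtend(crc, &type_byte, 1);  // the crc also covers the type byte
  const uint32_t masked = MaskCrc(crc);
  std::string trailer(1, type_byte);
  for (int i = 0; i < 4; ++i) {
    trailer.push_back(static_cast<char>((masked >> (8 * i)) & 0xff));
  }
  return trailer;
}
```

A reader would do the reverse: recompute the CRC over the block contents plus the type byte, unmask the stored value, and compare the two before trusting the block.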
Finally, there is the Finish() function, called when the TableBuilder is done, which writes the data block, filter block, metaindex block, index block, and footer to the file on disk.
Reference blog post:
- https://www.cnblogs.com/ym65536/p/7751229.html