LevelDB Source Code Walkthrough: MemTable and sstable

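In the previous chapters we got familiar with LevelDB's basic operations, such as creating a database, reading data, and writing data. Now it is time to look closely at the structures that actually store the data. We started this series with the skiplist implementation, and MemTable is essentially built on top of that skiplist. MemTable is the in-memory storage structure, and basic read operations consult it first, while sstable is the storage structure on disk. This section covers what is arguably the essence of LevelDB.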

MemTable

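MemTable's structure is fairly simple: its get and put operations are translated directly into operations on the skip list. One thing to note is that MemTable has a built-in reference count. It works much like a smart pointer, in that the object may only be destroyed once the count drops to 0. The difference is that the count has to be maintained manually: every time you create a MemTable object you must call Ref() yourself, and when you are done with it you call Unref(), which deletes the object once no references remain.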

class MemTable {
 public:
  // MemTables are reference counted.  The initial reference count
  // is zero and the caller must call Ref() at least once.
  explicit MemTable(const InternalKeyComparator& comparator);
  MemTable(const MemTable&) = delete;
  MemTable& operator=(const MemTable&) = delete;
  // Increase reference count.
  void Ref() { ++refs_; }
  // Drop reference count.  Delete if no more references exist.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }
  // Add an entry into memtable that maps key to value at the
  // specified sequence number and with the specified type.
  // Typically value will be empty if type==kTypeDeletion.
  void Add(SequenceNumber seq, ValueType type, const Slice& key,
           const Slice& value);
  // If memtable contains a value for key, store it in *value and return true.
  // If memtable contains a deletion for key, store a NotFound() error
  // in *status and return true.
  // Else, return false.
  bool Get(const LookupKey& key, std::string* value, Status* s);
  ...
  
 private:
  typedef SkipList<const char*, KeyComparator> Table;
  ~MemTable();  // Private since only Unref() should be used to delete it
  KeyComparator comparator_;
  int refs_;
  Arena arena_;
  // MemTable stores its data in a skip list.
  // Note that table_ is held by value rather than by pointer, so the MemTable
  // constructor simply invokes the SkipList constructor.
  Table table_; 
  ...
};
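
To make the manual reference counting concrete, here is a minimal sketch that isolates just that part of the class. RefCounted and its destroyed flag are hypothetical names invented for this illustration; they are not part of LevelDB. The private destructor mirrors the idea that only Unref() is allowed to delete the object.

```cpp
#include <cassert>

// Hypothetical stand-in for MemTable that keeps only the reference counting.
class RefCounted {
 public:
  // `destroyed` is an illustration-only hook so callers can observe deletion.
  explicit RefCounted(bool* destroyed) : refs_(0), destroyed_(destroyed) {}
  RefCounted(const RefCounted&) = delete;
  RefCounted& operator=(const RefCounted&) = delete;

  // The count starts at zero, so the creator must call Ref() at least once.
  void Ref() { ++refs_; }

  // Drop one reference and delete the object when none remain.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }

  int refs() const { return refs_; }

 private:
  // Private so that stack allocation and a direct `delete` do not compile.
  ~RefCounted() { *destroyed_ = true; }

  int refs_;
  bool* destroyed_;
};
```

A caller would write `auto* t = new RefCounted(&flag); t->Ref();` and later `t->Unref();`. Unlike std::shared_ptr, nothing happens automatically, so every Ref() needs a matching Unref().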

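The interfaces above are well documented by their comments, so next let's look at the implementation of Add(). The rough flow: first compute the encoded length of the key-value entry (the format is shown in the figure below and in the comment at the top of the function), then allocate memory, fill the data into that memory according to the format, and finally insert the buffer into the skip list.
[Figure: layout of an entry inside MemTable]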

void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
                   const Slice& value) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  // The total size is encoded_len, following the format above
  const size_t encoded_len = VarintLength(internal_key_size) +
                             internal_key_size + VarintLength(val_size) +
                             val_size;
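  // Allocate encoded_len bytes from the arena
  char* buf = arena_.Allocate(encoded_len);
  // Write the internal key length into the buffer
  char* p = EncodeVarint32(buf, internal_key_size);
  // Copy the user key first, then append the 8-byte tag
  // (sequence number in the upper 56 bits, type in the low 8 bits)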
  memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (s << 8) | type);
  p += 8;
  // Write the value size, then the value bytes
  p = EncodeVarint32(p, val_size);
  memcpy(p, value.data(), val_size);
  assert(p + val_size == buf + encoded_len);
  // Finally insert the buffer into the skip list
  table_.Insert(buf);
}
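
The entry layout is easier to see with a tiny standalone encoder. The sketch below is not LevelDB code: PutVarint32, PutFixed64 and EncodeEntry are hypothetical helpers that write into a std::string instead of an arena buffer. It assumes the same layout as the comment in Add(), with the fixed-width tag stored little-endian.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a varint32: 7 payload bits per byte, high bit set means
// "more bytes follow".
void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append v as 8 little-endian bytes.
void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Hypothetical helper: build one entry as
//   varint32(user_key.size() + 8) | user_key | tag | varint32(value.size()) | value
// where tag = (seq << 8) | type.
std::string EncodeEntry(uint64_t seq, uint8_t type, const std::string& user_key,
                        const std::string& value) {
  std::string out;
  PutVarint32(&out, static_cast<uint32_t>(user_key.size() + 8));
  out += user_key;
  PutFixed64(&out, (seq << 8) | type);
  PutVarint32(&out, static_cast<uint32_t>(value.size()));
  out += value;
  return out;
}
```

For example, EncodeEntry(5, 1, "foo", "bar") yields 16 bytes: one length byte (11), the three key bytes, the eight tag bytes, one length byte (3), and the three value bytes.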

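Next, the lookup path. Get() first creates an Iterator on the skip list and seeks to the lookup key. If no matching entry is found, it returns false. If one is found, the entry is decoded according to the format above. One more point: once the tag is decoded, its type is checked. If the entry is a deletion marker (type == kTypeDeletion), *s is set to NotFound and Get() returns true; otherwise the value is copied into the value parameter and Get() returns true.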

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();
  // Use an iterator to seek to the key; return true if an entry for this user key is found, false otherwise
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key.  We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {
        // kTypeValue: the entry was written by a put
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        // kTypeDeletion: the entry is a deletion marker
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}
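
The decoding that Get() performs can be sketched in isolation as follows. ParsedEntry, GetVarint32 and ParseEntry are hypothetical names for this illustration and operate on a std::string rather than on the skip list's raw buffer. The point to notice is how the 8-byte tag splits into a sequence number (tag >> 8) and a type (tag & 0xff).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct ParsedEntry {
  std::string user_key;
  uint64_t sequence;
  uint8_t type;
  std::string value;
};

// Decode a varint32 starting at p. Returns a pointer just past it, or
// nullptr if the input is truncated or malformed.
const char* GetVarint32(const char* p, const char* limit, uint32_t* value) {
  uint32_t result = 0;
  for (uint32_t shift = 0; shift <= 28 && p < limit; shift += 7) {
    uint32_t byte = static_cast<unsigned char>(*p++);
    if (byte & 0x80) {
      result |= (byte & 0x7f) << shift;  // more bytes follow
    } else {
      result |= byte << shift;
      *value = result;
      return p;
    }
  }
  return nullptr;
}

// Hypothetical helper: split one encoded entry into its fields.
bool ParseEntry(const std::string& entry, ParsedEntry* out) {
  const char* p = entry.data();
  const char* limit = p + entry.size();
  uint32_t klen = 0;
  p = GetVarint32(p, limit, &klen);
  if (p == nullptr || klen < 8 || static_cast<size_t>(limit - p) < klen) {
    return false;
  }
  out->user_key.assign(p, klen - 8);
  // The last 8 bytes of the internal key are the little-endian tag.
  uint64_t tag = 0;
  for (int i = 0; i < 8; ++i) {
    tag |= static_cast<uint64_t>(static_cast<unsigned char>(p[klen - 8 + i]))
           << (8 * i);
  }
  out->sequence = tag >> 8;
  out->type = static_cast<uint8_t>(tag & 0xff);
  p += klen;
  uint32_t vlen = 0;
  p = GetVarint32(p, limit, &vlen);
  if (p == nullptr || static_cast<size_t>(limit - p) < vlen) return false;
  out->value.assign(p, vlen);
  return true;
}
```

In the real code the comparison on the user key and the switch on the type happen right after this decoding step.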

sstable

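sstable is a file on disk that persists the data of an Immutable MemTable. The TableBuilder class is what builds it, and it contains a key struct, Rep.
The data in an sstable falls into the following categories by purpose: DataBlock, FilterBlock, MetaIndexBlock, IndexBlock, and Footer. A DataBlock stores key-value pairs sorted by key in ascending order; on disk a block consists of three fields: data, type, and crc. FilterBlock is set aside for now. MetaIndexBlock stores the index of the filter, IndexBlock stores the index of the data blocks, and Footer stores the index of the indexes: it holds the handle of the MetaIndexBlock and the handle of the IndexBlock.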
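
Every "index" mentioned above boils down to a handle, that is, an (offset, size) pair pointing at a block in the file. As a rough sketch, assuming the handle is serialized as two varint64 values (my understanding of LevelDB's BlockHandle, but treat it as an assumption here), encoding and decoding could look like this. EncodeHandle and DecodeHandle are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a varint64 (7 payload bits per byte, high bit = continuation).
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Decode a varint64; returns a pointer past it or nullptr on bad input.
const char* GetVarint64(const char* p, const char* limit, uint64_t* value) {
  uint64_t result = 0;
  for (uint32_t shift = 0; shift <= 63 && p < limit; shift += 7) {
    uint64_t byte = static_cast<unsigned char>(*p++);
    if (byte & 0x80) {
      result |= (byte & 0x7f) << shift;
    } else {
      result |= byte << shift;
      *value = result;
      return p;
    }
  }
  return nullptr;
}

// Hypothetical helper: serialize a block handle as varint64(offset), varint64(size).
std::string EncodeHandle(uint64_t offset, uint64_t size) {
  std::string out;
  PutVarint64(&out, offset);
  PutVarint64(&out, size);
  return out;
}

// Hypothetical helper: inverse of EncodeHandle.
bool DecodeHandle(const std::string& in, uint64_t* offset, uint64_t* size) {
  const char* p = in.data();
  const char* limit = p + in.size();
  p = GetVarint64(p, limit, offset);
  if (p == nullptr) return false;
  p = GetVarint64(p, limit, size);
  return p != nullptr;
}
```

A handle for a block at offset 0 with size 4096 takes only three bytes this way, which is why small tables have very compact index entries.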

struct TableBuilder::Rep {
  Rep(const Options& opt, WritableFile* f)
      : options(opt),
        index_block_options(opt),
        file(f),
        offset(0),
        data_block(&options),
        index_block(&index_block_options),
        num_entries(0),
        closed(false),
        filter_block(opt.filter_policy == nullptr
                         ? nullptr
                         : new FilterBlockBuilder(opt.filter_policy)),
        pending_index_entry(false) {
    index_block_options.block_restart_interval = 1;
  }
  Options options;
  Options index_block_options;
  WritableFile* file;           // the .sst file being written
  uint64_t offset;
  Status status;
  BlockBuilder data_block;      // data block
  BlockBuilder index_block;     // index block
  std::string last_key;         // the last key added, used to check that keys in the sstable are ordered
  int64_t num_entries;
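  bool closed;                  // whether the builder has been closed
  FilterBlockBuilder* filter_block;
  bool pending_index_entry;     // true when a data block has been flushed but its index entry is not added yet; only true while data_block is empty
  BlockHandle pending_handle;   // offset and size in the file of the flushed data block, to be added to the index block
  std::string compressed_output;// scratch buffer holding the compressed contents of a block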
};

The path for writing data into an sstable is: Add() -> Flush() -> WriteBlock() -> WriteRawBlock(). Here is a summary of what each function does.

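  1. Add() first checks that the key of the new key-value pair is greater than last_key (this guarantees that the keys in the sstable are in strictly increasing order). If pending_index_entry is set, meaning the previous data block has just been flushed and the current data block is still empty, it first adds the index entry for that flushed block to the index block. It then inserts the pair into the data block and updates last_key and num_entries. If the estimated size of the data block is still below the threshold (block_size, 4 KB by default), Add() returns; otherwise it calls Flush() to write the block to disk.
  2. Flush() calls WriteBlock() to write the contents of the data block to the underlying file.
  3. WriteBlock() takes the finished contents of the data block, compresses them if compression is enabled, and then calls WriteRawBlock(). Each block ends up in the file in the following format:

block_data: uint8[n] (the data)
type: uint8 (compression type)
crc: uint32 (checksum)

  4. WriteRawBlock() is what actually appends the data to the file.
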
void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->num_entries > 0) {
    // Once the table already has entries, the new key must be greater than last_key so that keys stay sorted
    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
  }

  if (r->pending_index_entry) {
    // The previous data block has been flushed and pending_handle holds its offset and size.
    // Now that the first key of the next block is known, add the index entry for that block.
    assert(r->data_block.empty());
    r->options.comparator->FindShortestSeparator(&r->last_key, key);
    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));
    r->pending_index_entry = false;
  }

  if (r->filter_block != nullptr) {
    r->filter_block->AddKey(key);
  }
  // Insert into the data block and update last_key and num_entries
  r->last_key.assign(key.data(), key.size());
  r->num_entries++;
  r->data_block.Add(key, value);

  // When the estimated size of the data block reaches the threshold (4 KB by default), flush it to disk
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {
    Flush();
  }
}
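
The call to FindShortestSeparator() in Add() is worth a closer look: it shortens the key stored in the index block to something that is still >= every key in the flushed block and < the first key of the next block. The sketch below is modeled on how a plain bytewise comparator could implement it; ShortestSeparator is a hypothetical free function, not LevelDB's API, and it returns the result instead of modifying its argument in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Return a short string s with start <= s < limit (bytewise order),
// or start unchanged when no shortening is possible.
std::string ShortestSeparator(std::string start, const std::string& limit) {
  const size_t min_length = std::min(start.size(), limit.size());
  size_t diff = 0;
  // Skip the common prefix.
  while (diff < min_length && start[diff] == limit[diff]) {
    ++diff;
  }
  if (diff < min_length) {
    // Neither string is a prefix of the other: try bumping the first
    // differing byte and cutting everything after it.
    const uint8_t b = static_cast<uint8_t>(start[diff]);
    if (b < 0xff && b + 1 < static_cast<uint8_t>(limit[diff])) {
      start[diff] = static_cast<char>(b + 1);
      start.resize(diff + 1);
      assert(start < limit);
    }
  }
  return start;
}
```

With start = "the quick brown fox" and limit = "the who", the separator becomes "the r", so the index entry stores 5 bytes instead of 19.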

void TableBuilder::Flush() {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->data_block.empty()) return;
  assert(!r->pending_index_entry);
  WriteBlock(&r->data_block, &r->pending_handle);
  if (ok()) {
    r->pending_index_entry = true;
    r->status = r->file->Flush();
  }
  if (r->filter_block != nullptr) {
    r->filter_block->StartBlock(r->offset);
  }
}
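
The interplay between Add() and Flush(), in particular the delayed index entry, can be tried out with a toy model. Everything in the sketch below (ToyTableBuilder, ToyIndexEntry, the in-memory "file", the plain concatenation of key and value) is a hypothetical simplification for illustration, not how LevelDB lays out blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ToyIndexEntry {
  std::string key;  // a key >= every key in the block
  size_t offset;    // where the block starts in the file
  size_t size;      // block length in bytes
};

class ToyTableBuilder {
 public:
  explicit ToyTableBuilder(size_t block_size) : block_size_(block_size) {}

  void Add(const std::string& key, const std::string& value) {
    assert(num_entries_ == 0 || key > last_key_);  // keys must be increasing
    if (pending_index_entry_) {
      // The previous block was flushed earlier; emit its index entry now.
      index_.push_back(ToyIndexEntry{last_key_, pending_offset_, pending_size_});
      pending_index_entry_ = false;
    }
    last_key_ = key;
    ++num_entries_;
    block_ += key;
    block_ += value;
    if (block_.size() >= block_size_) {
      Flush();
    }
  }

  void Flush() {
    if (block_.empty()) return;
    // Remember where the block went, then "write" it and start a new one.
    pending_offset_ = file_.size();
    pending_size_ = block_.size();
    file_ += block_;
    block_.clear();
    pending_index_entry_ = true;
  }

  const std::vector<ToyIndexEntry>& index() const { return index_; }
  const std::string& file() const { return file_; }

 private:
  size_t block_size_;
  std::string block_;  // current data block being filled
  std::string file_;   // stands in for the .sst file
  std::string last_key_;
  size_t num_entries_ = 0;
  bool pending_index_entry_ = false;
  size_t pending_offset_ = 0;
  size_t pending_size_ = 0;
  std::vector<ToyIndexEntry> index_;
};
```

With a block size of 8, adding ("a", "1111") and ("b", "2222") triggers a flush on the second call, but the index stays empty until a third key arrives; only then is the entry for the first block recorded.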

void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {
  // File format contains a sequence of blocks where each block has:
  //    block_data: uint8[n]
  //    type: uint8
  //    crc: uint32
  assert(ok());
  Rep* r = rep_;
  Slice raw = block->Finish();  // finalize the block and get its serialized contents

  Slice block_contents;
  CompressionType type = r->options.compression;
  // TODO(postrelease): Support more compression options: zlib?
  switch (type) {
    case kNoCompression:
      block_contents = raw;
      break;

    case kSnappyCompression: {
      std::string* compressed = &r->compressed_output;
      if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
          compressed->size() < raw.size() - (raw.size() / 8u)) {
        block_contents = *compressed;
      } else {
        // Snappy not supported, or compressed less than 12.5%, so just
        // store uncompressed form
        block_contents = raw;
        type = kNoCompression;
      }
      break;
    }
  }
  WriteRawBlock(block_contents, type, handle);
  // After the block has been written, clear the compression buffer and reset the data block
  r->compressed_output.clear();
  block->Reset();
}
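
The condition in the kSnappyCompression branch is compact, so here it is pulled out on its own. KeepCompressed is a hypothetical helper name; the arithmetic is copied from the condition above: the compressed form is kept only if it saves more than one eighth (12.5%) of the raw size.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the compressed block is small enough to be worth storing.
// Otherwise the caller would store the raw block and mark it uncompressed.
bool KeepCompressed(size_t raw_size, size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}
```

For an 800-byte block the cutoff is 700 bytes: a 699-byte result is kept, while a 700-byte result is discarded in favor of the raw data.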

void TableBuilder::WriteRawBlock(const Slice& block_contents,
                                 CompressionType type, BlockHandle* handle) {
  Rep* r = rep_;
  handle->set_offset(r->offset);
  handle->set_size(block_contents.size());
  // Append to the underlying file: first the block contents, then the trailer (type + crc)
  r->status = r->file->Append(block_contents);
  if (r->status.ok()) {
    char trailer[kBlockTrailerSize];
    trailer[0] = type;    // the first byte is the type, followed by the crc
    uint32_t crc = crc32c::Value(block_contents.data(), block_contents.size());
    crc = crc32c::Extend(crc, trailer, 1);  // Extend crc to cover block type
    EncodeFixed32(trailer + 1, crc32c::Mask(crc));
    r->status = r->file->Append(Slice(trailer, kBlockTrailerSize)); // append the type and crc
    if (r->status.ok()) {
      r->offset += block_contents.size() + kBlockTrailerSize;
    }
  }
}
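
To see what ends up in those trailer bytes, here is a self-contained sketch of the trailer computation. It uses a slow bitwise CRC32C instead of LevelDB's optimized crc32c module, and the masking constants follow my understanding of crc32c::Mask(), so treat them as an assumption. Crc32c, MaskCrc, UnmaskCrc and MakeTrailer are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
uint32_t Crc32c(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Rotate and add a constant, so that a CRC stored inside data which is
// itself checksummed later does not interact badly with the outer CRC.
uint32_t MaskCrc(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + 0xa282ead8u;
}

uint32_t UnmaskCrc(uint32_t masked) {
  const uint32_t rot = masked - 0xa282ead8u;
  return (rot >> 17) | (rot << 15);
}

// Build the 5-byte trailer: 1 byte of type, then the masked CRC of
// (block contents + type byte) as 4 little-endian bytes.
std::string MakeTrailer(const std::string& contents, uint8_t type) {
  std::string covered = contents;
  covered.push_back(static_cast<char>(type));  // the CRC also covers the type
  const uint32_t masked = MaskCrc(Crc32c(covered));
  std::string trailer(1, static_cast<char>(type));
  for (int i = 0; i < 4; ++i) {
    trailer.push_back(static_cast<char>((masked >> (8 * i)) & 0xff));
  }
  return trailer;
}
```

A reader of the file would do the reverse: read the last 5 bytes of the block, unmask the stored CRC, recompute it over the contents plus the type byte, and compare.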

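Finally, there is the Finish() function. It is called when the TableBuilder is done, and writes the remaining data block, the filter block, the metaindex block, the index block, and the footer to the file.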

Reference blog:

  1. https://www.cnblogs.com/ym65536/p/7751229.html