RocksDB's Put and WriteBatch operations

Start by cloning the RocksDB source from GitHub.

The Put operation is declared in the header include/rocksdb/db.h:

  // Set the database entry for "key" to "value".
  // If "key" already exists, it will be overwritten.
  // Returns OK on success, and a non-OK status on error.
  // Note: consider setting options.sync = true.
  virtual Status Put(const WriteOptions& options,
                     ColumnFamilyHandle* column_family, const Slice& key,
                     const Slice& value) = 0;
  virtual Status Put(const WriteOptions& options, const Slice& key,
                     const Slice& value) {
    return Put(options, DefaultColumnFamily(), key, value);
  }

The comment says that calling Put on a key that already exists overwrites its value.

If no column_family is given, the default column family is used.
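The two declarations follow a common pattern: one pure virtual "core" method plus a convenience overload that forwards to it with a default argument filled in. Here is a minimal sketch of that pattern. MiniDB, MapDB and FamilyHandle are hypothetical names made up for this illustration; none of them are RocksDB types, and the bool return value only stands in for Status.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a column family handle; not RocksDB's type.
struct FamilyHandle {
  std::uint32_t id;
};

class MiniDB {
 public:
  virtual ~MiniDB() = default;

  // Core operation: every subclass has to implement this one.
  virtual bool Put(FamilyHandle* family, const std::string& key,
                   const std::string& value) = 0;

  // Convenience overload: forwards to the core one with the default family.
  virtual bool Put(const std::string& key, const std::string& value) {
    return Put(DefaultFamily(), key, value);
  }

  virtual FamilyHandle* DefaultFamily() = 0;
};

// Toy in-memory subclass, only here so the pattern can be exercised.
class MapDB : public MiniDB {
 public:
  using MiniDB::Put;  // keep the two-argument overload visible in MapDB

  bool Put(FamilyHandle* family, const std::string& key,
           const std::string& value) override {
    data_[std::make_pair(family->id, key)] = value;  // insert or overwrite
    return true;
  }

  FamilyHandle* DefaultFamily() override { return &default_family_; }

  // Returns nullptr when the key is absent from the given family.
  const std::string* Get(std::uint32_t family_id,
                         const std::string& key) const {
    auto it = data_.find(std::make_pair(family_id, key));
    return it == data_.end() ? nullptr : &it->second;
  }

 private:
  FamilyHandle default_family_{0};
  std::map<std::pair<std::uint32_t, std::string>, std::string> data_;
};
```

Calling db.Put("k", "v") on a MapDB stores the value under family 0, and a second Put on the same key replaces it, which matches the overwrite behaviour the comment describes.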

Put is implemented in db/db_impl_write.cc:

// Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, ColumnFamilyHandle* column_family,
               const Slice& key, const Slice& value) {
  // Pre-allocate size of write batch conservatively.
  // 8 bytes are taken by header, 4 bytes for count, 1 byte for type,
  // and we allocate 11 extra bytes for key length, as well as value length.
  WriteBatch batch(key.size() + value.size() + 24);
  Status s = batch.Put(column_family, key, value);
  if (!s.ok()) {
    return s;
  }
  return Write(opt, &batch);
}
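The 24 in the constructor argument is explained by the comment as 8 + 4 + 1 + 11. A quick way to convince yourself it is conservative is to compute what a single default-column-family record needs. The sketch below assumes the key and value lengths are stored as varint32 prefixes (at most 5 bytes each); the helper names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes a varint32 encoding of v occupies (7 payload bits per byte).
inline std::size_t VarintLength(std::uint32_t v) {
  std::size_t len = 1;
  while (v >= 128) {
    v >>= 7;
    ++len;
  }
  return len;
}

// Bytes reserved by DB::Put for its one-record batch, as in the excerpt above.
inline std::size_t ReservedPutSize(std::size_t key_size,
                                   std::size_t value_size) {
  return key_size + value_size + 24;
}

// Bytes one Put record for the default column family needs, under the
// assumption that both lengths are written as varint32 prefixes.
inline std::size_t EncodedPutSize(std::uint32_t key_size,
                                  std::uint32_t value_size) {
  const std::size_t header = 8 + 4;  // fixed64 sequence + fixed32 count
  const std::size_t type = 1;        // one-byte record type tag
  return header + type + VarintLength(key_size) + key_size +
         VarintLength(value_size) + value_size;
}
```

With a 3-byte key and a 5-byte value the record needs 23 bytes while 32 are reserved. In the worst case the two length prefixes take 10 bytes, so the overhead is 12 + 1 + 10 = 23, which still fits in 24.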

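There is something interesting here: Put hands the work to a WriteBatch. In other words, a single Put is simply a WriteBatch holding one record, which then goes through Write.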

Next, let's see how WriteBatch's Put works, in db/write_batch.cc:

Status WriteBatch::Put(ColumnFamilyHandle* column_family, const Slice& key,
                       const Slice& value) {
  return WriteBatchInternal::Put(this, GetColumnFamilyID(column_family), key,
                                 value);
}

This calls the internal Put:

Status WriteBatchInternal::Put(WriteBatch* b, uint32_t column_family_id,
                               const Slice& key, const Slice& value) {
  if (key.size() > size_t{port::kMaxUint32}) {
    return Status::InvalidArgument("key is too large");
  }
  if (value.size() > size_t{port::kMaxUint32}) {
    return Status::InvalidArgument("value is too large");
  }

  LocalSavePoint save(b);
  WriteBatchInternal::SetCount(b, WriteBatchInternal::Count(b) + 1);
  if (column_family_id == 0) {
    b->rep_.push_back(static_cast<char>(kTypeValue));
  } else {
    b->rep_.push_back(static_cast<char>(kTypeColumnFamilyValue));
    PutVarint32(&b->rep_, column_family_id);
  }
  PutLengthPrefixedSlice(&b->rep_, key);
  PutLengthPrefixedSlice(&b->rep_, value);
  b->content_flags_.store(
      b->content_flags_.load(std::memory_order_relaxed) | ContentFlags::HAS_PUT,
      std::memory_order_relaxed);
  return save.commit();
}

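From this we can see how rep_ is laid out.

It begins with a 12-byte header: the sequence number (fixed64) followed by the record count (fixed32). The sequence number is the WriteBatch's sequence number (the global sequence number its operations are tagged with), and the four bytes after it hold the number of records in the batch.

The records then follow one after another. For a Put, the code above appends a one-byte type tag (kTypeValue for the default column family; otherwise kTypeColumnFamilyValue followed by the column family ID as a varint32), then the length-prefixed key, then the length-prefixed value.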

[Figure: layout of rep_]
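To make the layout concrete, here is a small standalone encoder that builds the same kind of byte string. It is a sketch, not RocksDB code: the helpers are reimplemented from scratch, the fixed-width integers are assumed to be little-endian, and the two tag values (0x1 and 0x5) are assumptions meant to mirror RocksDB's ValueType enum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed tag values, written out here only for illustration.
constexpr char kTypeValue = 0x1;
constexpr char kTypeColumnFamilyValue = 0x5;
constexpr std::size_t kCountOffset = 8;  // count follows the 8-byte sequence

// Append v as a little-endian fixed-width integer of `bytes` bytes.
inline void PutFixed(std::string* dst, std::uint64_t v, int bytes) {
  for (int i = 0; i < bytes; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Append v as a varint32: 7 bits per byte, high bit set on all but the last.
inline void PutVarint32(std::string* dst, std::uint32_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append a varint32 length followed by the bytes of s.
inline void PutLengthPrefixed(std::string* dst, const std::string& s) {
  PutVarint32(dst, static_cast<std::uint32_t>(s.size()));
  dst->append(s);
}

// Header of an empty batch: fixed64 sequence, then fixed32 count = 0.
inline std::string NewBatchRep(std::uint64_t sequence) {
  std::string rep;
  PutFixed(&rep, sequence, 8);
  PutFixed(&rep, 0, 4);
  return rep;
}

// Read the record count back out of the header.
inline std::uint32_t Count(const std::string& rep) {
  std::uint32_t n = 0;
  for (int i = 0; i < 4; ++i) {
    n |= static_cast<std::uint32_t>(
             static_cast<unsigned char>(rep[kCountOffset + i]))
         << (8 * i);
  }
  return n;
}

// Append one Put record, bumping the count in the header first.
inline void AppendPut(std::string* rep, std::uint32_t cf_id,
                      const std::string& key, const std::string& value) {
  const std::uint32_t n = Count(*rep) + 1;
  for (int i = 0; i < 4; ++i) {
    (*rep)[kCountOffset + i] = static_cast<char>((n >> (8 * i)) & 0xff);
  }
  if (cf_id == 0) {
    rep->push_back(kTypeValue);
  } else {
    rep->push_back(kTypeColumnFamilyValue);
    PutVarint32(rep, cf_id);
  }
  PutLengthPrefixed(rep, key);
  PutLengthPrefixed(rep, value);
}
```

Starting from NewBatchRep(7) and calling AppendPut(&rep, 0, "k1", "v1") gives a 19-byte string: 12 bytes of header, the type tag, then 0x02 "k1" and 0x02 "v1".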

Once the data is in the WriteBatch, it is written out by calling Write:

Status DBImpl::Write(const WriteOptions& write_options, WriteBatch* my_batch) {
  return WriteImpl(write_options, my_batch, nullptr, nullptr);
}

which calls WriteImpl:

// The main write queue. This is the only write queue that updates LastSequence.
// When using one write queue, the same sequence also indicates the last
// published sequence.
Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch, WriteCallback* callback,
                         uint64_t* log_used, uint64_t log_ref,
                         bool disable_memtable, uint64_t* seq_used,
                         size_t batch_cnt,
                         PreReleaseCallback* pre_release_callback) {
    ......................
}

This function is roughly 300 lines long and is the core routine that writes data into the database.

Briefly, it does the following:

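It maintains write groups internally, so that several batches can be merged and written together.
It creates a WriteThread::Writer for the batch and adds that writer to a write group.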

To guard against data loss, RocksDB writes to the WAL file first and only then to the memtable.
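The reason for this ordering is easiest to see in a toy model. The sketch below is not how RocksDB implements it: TinyStore is a hypothetical class, the "log" is an in-memory vector standing in for the WAL file, and Crash() only simulates losing the in-memory table.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

class TinyStore {
 public:
  void Put(const std::string& key, const std::string& value) {
    log_.emplace_back(key, value);  // 1. record the write in the log first
    memtable_[key] = value;         // 2. then apply it to the in-memory table
  }

  // Simulated crash: the in-memory table is lost, the log survives.
  void Crash() { memtable_.clear(); }

  // Rebuild the in-memory table by replaying the log in order.
  void Recover() {
    for (const auto& rec : log_) {
      memtable_[rec.first] = rec.second;
    }
  }

  // Returns false when the key is not in the in-memory table.
  bool Get(const std::string& key, std::string* value) const {
    auto it = memtable_.find(key);
    if (it == memtable_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::vector<std::pair<std::string, std::string>> log_;  // stand-in for WAL
  std::map<std::string, std::string> memtable_;
};
```

Because every write reaches the log before the table, replaying the log after a crash restores each key to its latest value. If the order were reversed, a crash between the two steps could leave a write visible in memory but missing from the log.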

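It checks whether all memtable writes for the group have finished. If they have, the last writer to finish exits in the leader's role
and selects the leader for the next batch group.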
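The grouping idea can be sketched without any threads. The following is a single-threaded simulation under heavy simplifying assumptions: PendingWriter and GroupCommitQueue are hypothetical names, the "log" is a vector of strings, and none of RocksDB's real synchronization or memtable handling is modelled. It only shows one leader writing on behalf of a whole group, with the next queued writer leading the following group.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical writer: one pending batch, here just a list of record strings.
struct PendingWriter {
  std::vector<std::string> records;
  bool done = false;
  bool was_leader = false;
};

class GroupCommitQueue {
 public:
  void Enqueue(PendingWriter* w) { queue_.push_back(w); }

  // The writer at the head of the queue acts as leader: it gathers every
  // writer currently queued into one group, does a single log append for all
  // of them, and marks the whole group done. Whoever is enqueued afterwards
  // ends up at the head and leads the next group.
  // Returns the number of writers committed in this group.
  std::size_t CommitOneGroup() {
    if (queue_.empty()) return 0;
    std::vector<PendingWriter*> group(queue_.begin(), queue_.end());
    queue_.clear();
    group.front()->was_leader = true;

    std::string merged;
    for (PendingWriter* w : group) {
      for (const std::string& r : w->records) {
        merged += r;
        merged += '\n';
      }
    }
    log_appends_.push_back(merged);  // one append covers the whole group

    for (PendingWriter* w : group) {
      w->done = true;
    }
    return group.size();
  }

  const std::vector<std::string>& log_appends() const { return log_appends_; }

 private:
  std::deque<PendingWriter*> queue_;
  std::vector<std::string> log_appends_;
};
```

If two writers are queued when CommitOneGroup() runs, both finish after a single log append and only the first is marked as leader. A writer enqueued later becomes the leader of its own group on the next call.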
