Clone the RocksDB source code directly from GitHub.
RocksDB's Put operation is declared in the header file include/rocksdb/db.h:
// Set the database entry for "key" to "value".
// If "key" already exists, it will be overwritten.
// Returns OK on success, and a non-OK status on error.
// Note: consider setting options.sync = true.
virtual Status Put(const WriteOptions& options,
                   ColumnFamilyHandle* column_family, const Slice& key,
                   const Slice& value) = 0;
virtual Status Put(const WriteOptions& options, const Slice& key,
                   const Slice& value) {
  return Put(options, DefaultColumnFamily(), key, value);
}
The comment explains that a Put on a key that already exists overwrites its value.
If no column_family is specified, the default column family is used.
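To make these two behaviors concrete, here is a minimal sketch built on a hypothetical ToyDB. It is not rocksdb::DB: ToyDB, ToyColumnFamily and Get are names invented for illustration, and the storage is just a std::map. It only mirrors the shape of the two overloads, where the short form forwards to the full form with the default column family.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a column family handle: just a numeric id.
struct ToyColumnFamily {
  uint32_t id;
};

// ToyDB is NOT rocksdb::DB. It only mirrors the shape of the two Put overloads.
class ToyDB {
 public:
  // Full form: the caller names the column family explicitly.
  bool Put(const ToyColumnFamily& cf, const std::string& key,
           const std::string& value) {
    data_[cf.id][key] = value;  // operator[] overwrites an existing entry
    return true;
  }

  // Convenience form: forwards to the full form with the default family.
  bool Put(const std::string& key, const std::string& value) {
    return Put(DefaultColumnFamily(), key, value);
  }

  const ToyColumnFamily& DefaultColumnFamily() const { return default_cf_; }

  // Returns the stored value, or an empty string if the key is absent.
  std::string Get(const ToyColumnFamily& cf, const std::string& key) const {
    auto family = data_.find(cf.id);
    if (family == data_.end()) return "";
    auto entry = family->second.find(key);
    return entry == family->second.end() ? "" : entry->second;
  }

 private:
  ToyColumnFamily default_cf_{0};
  std::map<uint32_t, std::map<std::string, std::string>> data_;
};
```

Calling Put twice with the same key through the short form leaves only the second value in the default family, and a Put into another family does not touch it.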
The Put function is implemented in db/db_impl_write.cc:
// Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, ColumnFamilyHandle* column_family,
               const Slice& key, const Slice& value) {
  // Pre-allocate size of write batch conservatively.
  // 8 bytes are taken by header, 4 bytes for count, 1 byte for type,
  // and we allocate 11 extra bytes for key length, as well as value length.
  WriteBatch batch(key.size() + value.size() + 24);
  Status s = batch.Put(column_family, key, value);
  if (!s.ok()) {
    return s;
  }
  return Write(opt, &batch);
}
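Here we see something very interesting: Put hands the operation to a WriteBatch. In other words, a single Put is just a WriteBatch holding one record, submitted through the same Write path as any other batch.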
Next, let's look at how WriteBatch's Put works, in db/write_batch.cc:
Status WriteBatch::Put(ColumnFamilyHandle* column_family, const Slice& key,
                       const Slice& value) {
  return WriteBatchInternal::Put(this, GetColumnFamilyID(column_family), key,
                                 value);
}
This calls the internal Put:
Status WriteBatchInternal::Put(WriteBatch* b, uint32_t column_family_id,
                               const Slice& key, const Slice& value) {
  if (key.size() > size_t{port::kMaxUint32}) {
    return Status::InvalidArgument("key is too large");
  }
  if (value.size() > size_t{port::kMaxUint32}) {
    return Status::InvalidArgument("value is too large");
  }
  LocalSavePoint save(b);
  WriteBatchInternal::SetCount(b, WriteBatchInternal::Count(b) + 1);
  if (column_family_id == 0) {
    b->rep_.push_back(static_cast<char>(kTypeValue));
  } else {
    b->rep_.push_back(static_cast<char>(kTypeColumnFamilyValue));
    PutVarint32(&b->rep_, column_family_id);
  }
  PutLengthPrefixedSlice(&b->rep_, key);
  PutLengthPrefixedSlice(&b->rep_, value);
  b->content_flags_.store(
      b->content_flags_.load(std::memory_order_relaxed) | ContentFlags::HAS_PUT,
      std::memory_order_relaxed);
  return save.commit();
}
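From this we can see the format of rep_. It starts with a sequence number (fixed64) and a record count (fixed32), followed by the records one after another:

rep_   := sequence (fixed64) | count (fixed32) | record[count]
record := kTypeValue key value
        | kTypeColumnFamilyValue column_family_id (varint32) key value

key and value are each stored as a varint32 length followed by the raw bytes. The sequence number is the sequence number of the WriteBatch (that is, the global sequence number its operations correspond to), and the four bytes after it hold the number of records in the current batch.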
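The layout can be reproduced with a short, self-contained sketch that mirrors the steps of WriteBatchInternal::Put shown above. Everything prefixed with Toy is a hypothetical helper written for this example, not RocksDB code. The tag values and the little-endian byte order of the count field are assumptions made for illustration, and a plain std::string stands in for rep_.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Tag values are illustrative only; do not rely on them matching RocksDB.
constexpr char kToyTypeValue = 0x1;
constexpr char kToyTypeColumnFamilyValue = 0x5;
constexpr size_t kToyHeaderSize = 12;  // 8-byte sequence + 4-byte count

// Appends v as a varint32: 7 bits per byte, high bit set on all but the last.
void ToyPutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends a varint32 length followed by the raw bytes.
void ToyPutLengthPrefixed(std::string* dst, const std::string& s) {
  ToyPutVarint32(dst, static_cast<uint32_t>(s.size()));
  dst->append(s);
}

// Writes the record count into bytes 8..11 of the header (little-endian here).
void ToySetCount(std::string* rep, uint32_t n) {
  for (int i = 0; i < 4; ++i) {
    (*rep)[8 + i] = static_cast<char>((n >> (8 * i)) & 0xff);
  }
}

// Reads the record count back from bytes 8..11 of the header.
uint32_t ToyCount(const std::string& rep) {
  uint32_t n = 0;
  for (int i = 0; i < 4; ++i) {
    n |= static_cast<uint32_t>(static_cast<unsigned char>(rep[8 + i]))
         << (8 * i);
  }
  return n;
}

// Starts an empty batch: a zeroed 12-byte header and no records.
std::string ToyNewBatch() { return std::string(kToyHeaderSize, '\0'); }

// Bumps the count, then appends tag, optional column family id, key, value.
void ToyBatchPut(std::string* rep, uint32_t column_family_id,
                 const std::string& key, const std::string& value) {
  ToySetCount(rep, ToyCount(*rep) + 1);
  if (column_family_id == 0) {
    rep->push_back(kToyTypeValue);
  } else {
    rep->push_back(kToyTypeColumnFamilyValue);
    ToyPutVarint32(rep, column_family_id);
  }
  ToyPutLengthPrefixed(rep, key);
  ToyPutLengthPrefixed(rep, value);
}
```

With this sketch, putting "key" and "value" into an empty batch for column family 0 yields 12 + 1 + (1 + 3) + (1 + 5) = 23 bytes, which fits inside the key.size() + value.size() + 24 = 32 bytes that DB::Put pre-allocates.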
Once the data is in the WriteBatch, it is written by calling Write:
Status DBImpl::Write(const WriteOptions& write_options, WriteBatch* my_batch) {
  return WriteImpl(write_options, my_batch, nullptr, nullptr);
}
Write in turn calls WriteImpl:
// The main write queue. This is the only write queue that updates LastSequence.
// When using one write queue, the same sequence also indicates the last
// published sequence.
Status DBImpl::WriteImpl(const WriteOptions& write_options,
                         WriteBatch* my_batch, WriteCallback* callback,
                         uint64_t* log_used, uint64_t log_ref,
                         bool disable_memtable, uint64_t* seq_used,
                         size_t batch_cnt,
                         PreReleaseCallback* pre_release_callback) {
  ......................
}
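This function is about 300 lines long and is the core of writing data into the database.
Briefly, it does the following:
It maintains write groups internally, so that several batches are merged and written together.
It creates a WriteThread::Writer for the batch and joins it to a write group.
To prevent data loss, RocksDB writes to the WAL file first and only then to the memtable.
It checks whether all memtable writes of the group have completed. If they have, the last writer exits in the leader's role and picks the leader for the next batch.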
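The steps above can be sketched with a deliberately simplified, single-threaded model. ToyWriter and ToyWriteQueue are hypothetical names invented for this example. Real RocksDB coordinates concurrent threads through its WriteThread, whereas here the "threads" are just entries in a queue, the WAL is a vector of strings, and the memtable is a std::map. The sketch only shows the two ideas from the list: several batches are committed as one group, and the log is written before the memtable.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One pending write: a list of key/value pairs (a stand-in for a WriteBatch).
struct ToyWriter {
  std::vector<std::pair<std::string, std::string>> batch;
  bool done = false;
};

// Hypothetical single-threaded model of group commit, not RocksDB code.
class ToyWriteQueue {
 public:
  // A writer joins the queue; the writer at the front acts as the leader.
  void Join(ToyWriter* w) { pending_.push_back(w); }

  // The leader commits every queued writer as one group: first the log,
  // then the in-memory table. Returns the number of writers in the group.
  size_t CommitGroupAsLeader() {
    std::vector<ToyWriter*> group(pending_.begin(), pending_.end());
    pending_.clear();
    if (group.empty()) return 0;

    // Step 1: append one log record covering the whole group.
    std::string record;
    for (ToyWriter* w : group) {
      for (const auto& kv : w->batch) {
        record += kv.first + "=" + kv.second + ";";
      }
    }
    wal_.push_back(record);

    // Step 2: apply to the memtable only after the log append.
    for (ToyWriter* w : group) {
      for (const auto& kv : w->batch) memtable_[kv.first] = kv.second;
      w->done = true;
    }
    return group.size();
  }

  const std::vector<std::string>& wal() const { return wal_; }
  const std::map<std::string, std::string>& memtable() const {
    return memtable_;
  }

 private:
  std::deque<ToyWriter*> pending_;
  std::vector<std::string> wal_;
  std::map<std::string, std::string> memtable_;
};
```

Writers that join after a group has been taken simply wait in the queue and form the next group, with the one at the front as its leader. Writing the log first means that a crash between the two steps loses nothing that was logged, which is the reason the text gives for the WAL-then-memtable order.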