Concurrency Control in MongoDB (MVCC)

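Databases that support row-level or document-level concurrency usually rely on MVCC (multi-version concurrency control) to increase concurrency further.
MVCC improves performance by letting reads proceed without taking locks. Locks are a limited system resource: acquiring and releasing them takes time, and a request for a lock that is not available has to wait. With a mutex, the system has to signal the waiting thread once the lock becomes available; with a spinlock, the waiter keeps polling until the resource is ready, which wastes CPU.
MVCC provides a snapshot as of some point in time (a point of view, POV), so a transaction always sees data consistent with the moment it started, no matter how long it runs. As a result, different transactions may see different data for the same row at the same instant; in other words, a row may have multiple versions.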
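To make the idea of "multiple versions per row" concrete, here is a minimal sketch of a versioned key/value store. It is a toy model, not MongoDB or WiredTiger code: the class VersionedStore and its methods are hypothetical names, and it assumes every version is tagged with a single commit timestamp.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy store: each key maps to its versions, ordered by commit timestamp.
class VersionedStore {
public:
    // Record a committed version of `key` at commit timestamp `ts`.
    void put(const std::string& key, std::uint64_t ts, std::string value) {
        assert(ts != 0);  // assumption of this sketch: 0 is never a valid timestamp
        _versions[key][ts] = std::move(value);
    }

    // Return the newest version of `key` committed at or before `readTs`,
    // or nullptr if the key did not exist yet at that point in time.
    const std::string* get(const std::string& key, std::uint64_t readTs) const {
        auto it = _versions.find(key);
        if (it == _versions.end()) return nullptr;
        const std::map<std::uint64_t, std::string>& byTs = it->second;
        auto newer = byTs.upper_bound(readTs);  // first version newer than the snapshot
        if (newer == byTs.begin()) return nullptr;
        return &std::prev(newer)->second;
    }

private:
    std::map<std::string, std::map<std::uint64_t, std::string>> _versions;
};
```

A reader that keeps using the same readTs keeps getting the same answer even while newer versions are being added, and it never has to take a lock on the row to do so.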

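In WiredTiger, transactional consistency is driven mainly by timestamps. Each timestamp corresponds to a snapshot: when a transaction starts it obtains its timestamp, and the storage engine holds a matching snapshot. The snapshot captures the state the transaction sees while it runs and records the range [min_snap_id, max_snap_id] of transactions that had not yet committed. It is copied from global_snapshot at the moment the snapshot is created, and it lists the transactions that were still uncommitted at that time. Transactions before min_snap_id have already committed, so their data is readable; transactions greater than min_snap_id and less than or equal to max_snap_id are still running.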
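The visibility rule described above can be sketched as a small function. This is an illustration under stated assumptions, not WiredTiger's implementation: the Snapshot struct and its field names are hypothetical, the exact treatment of the boundary ids is an assumption, and the rule that a transaction always sees its own writes is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical snapshot: the id range plus the ids that were uncommitted
// when the snapshot was created.
struct Snapshot {
    std::uint64_t minId;                 // smallest id still running at creation
    std::uint64_t maxId;                 // largest id known at creation
    std::vector<std::uint64_t> running;  // sorted list of uncommitted ids
};

// Decide whether data written by `txnId` is visible through `snap`.
bool isVisible(const Snapshot& snap, std::uint64_t txnId) {
    assert(std::is_sorted(snap.running.begin(), snap.running.end()));
    if (txnId < snap.minId) return true;   // committed before the snapshot
    if (txnId > snap.maxId) return false;  // started after the snapshot
    // Inside the range: visible only if it was not running at creation time.
    return !std::binary_search(snap.running.begin(), snap.running.end(), txnId);
}
```

Keeping the list sorted lets the common cases (clearly older or clearly newer ids) be answered with two comparisons, and only ids inside the range need the search.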

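The implicit transaction framework

In MongoDB 4.0, every CRUD operation runs inside a transaction whether or not one was started explicitly; this is implemented through WriteUnitOfWork. The WriteUnitOfWork class is only a wrapper that sets and clears a few flags. The real work is done in WiredTigerRecoveryUnit, which encapsulates how the server uses storage-engine transactions.
Every CRUD operation has to use a key and a cursor to locate the key/value pair on the corresponding in-memory B-tree page. The cursor first obtains the WiredTiger session from the recovery unit; at that point the recovery unit opens a new transaction and sets its read_timestamp.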

WiredTigerCursor::WiredTigerCursor(const std::string& uri,
                                   uint64_t tableId,
                                   bool forRecordStore,
                                   OperationContext* opCtx) {
    _tableID = tableId;
    _ru = WiredTigerRecoveryUnit::get(opCtx);
    _session = _ru->getSession();
    _cursor = _session->getCursor(uri, tableId, forRecordStore);
}

WiredTigerSession* WiredTigerRecoveryUnit::getSession() {
    if (!_active) {
        _txnOpen();  // open a new transaction
    }
    return _session.get();
}
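The pattern in the excerpt above (a thin wrapper on the server side, and a recovery unit that opens the storage transaction lazily on the first getSession call) can be sketched as follows. ToyRecoveryUnit and ToyWriteUnitOfWork are hypothetical stand-ins written for this illustration; the rollback-on-destruction behaviour is an assumption of the sketch, not something the excerpt shows.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for WiredTigerRecoveryUnit: the storage transaction
// is opened lazily, the first time a session is requested.
class ToyRecoveryUnit {
public:
    int getSession() {
        if (!_active) {
            _active = true;  // plays the role of _txnOpen()
            log.push_back("begin");
        }
        return _sessionId;
    }
    void commit() { if (_active) log.push_back("commit"); _active = false; }
    void abort() { if (_active) log.push_back("rollback"); _active = false; }
    bool active() const { return _active; }

    std::vector<std::string> log;  // trace of transaction events, for inspection

private:
    bool _active = false;
    int _sessionId = 1;
};

// Hypothetical stand-in for WriteUnitOfWork: an RAII wrapper that only tracks
// a flag and delegates the real work to the recovery unit.
class ToyWriteUnitOfWork {
public:
    explicit ToyWriteUnitOfWork(ToyRecoveryUnit& ru) : _ru(ru) {}
    ~ToyWriteUnitOfWork() { if (!_committed) _ru.abort(); }
    void commit() {
        assert(!_committed);  // a unit of work is committed at most once
        _ru.commit();
        _committed = true;
    }

private:
    ToyRecoveryUnit& _ru;
    bool _committed = false;
};
```

Opening the transaction lazily means an operation that never touches a cursor never pays for a storage transaction, and repeated getSession calls inside one unit of work share the same transaction.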

WiredTigerRecoveryUnit::_txnOpen() opens a new transaction and sets its read_timestamp. By default no read timestamp is set, so how is the WiredTiger layer told which snapshot to read from?
In MongoDB this is done by specifying a ReadSource. The available values are:

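ReadSource	Description
kUnset	The default; no timestamp is specified
kNoTimestamp	Explicitly read without a timestamp
kMajorityCommitted	Read from WiredTigerSnapshotManager::_committedSnapshot
kLastApplied	Read from WiredTigerSnapshotManager::_localSnapshot; each new transaction picks up the current value
kLastAppliedSnapshot	Read from WiredTigerSnapshotManager::_localSnapshot; the timestamp is read once and reused by subsequent transactions
kAllCommittedSnapshot	Read at a point in time before which all transactions have committed
kProvided	Read at an explicitly specified timestamp, set by calling setTimestampReadSource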

void WiredTigerRecoveryUnit::_txnOpen() {
    invariant(!_active);
    _ensureSession();

    // Only start a timer for transaction's lifetime if we're going to log it.
    if (shouldLog(kSlowTransactionSeverity)) {
        _timer.reset(new Timer());
    }
    WT_SESSION* session = _session->getSession();

    switch (_timestampReadSource) {
        case ReadSource::kUnset:
        case ReadSource::kNoTimestamp: {
            WiredTigerBeginTxnBlock txnOpen(session, _ignorePrepared);

            if (_isOplogReader) {
                auto status =
                    txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                         WiredTigerBeginTxnBlock::RoundToOldest::kRound);
                fassert(50771, status);
            }
            txnOpen.done();
            break;
        }
        case ReadSource::kMajorityCommitted: {
            // We reset _majorityCommittedSnapshot to the actual read timestamp used when the
            // transaction was started.
            _majorityCommittedSnapshot =
                _sessionCache->snapshotManager().beginTransactionOnCommittedSnapshot(session);
            break;
        }
        case ReadSource::kLastApplied: {
            if (_sessionCache->snapshotManager().getLocalSnapshot()) {
                _readAtTimestamp = _sessionCache->snapshotManager().beginTransactionOnLocalSnapshot(
                    session, _ignorePrepared);
            } else {
                WiredTigerBeginTxnBlock(session, _ignorePrepared).done();
            }
            break;
        }
        case ReadSource::kAllCommittedSnapshot: {
            if (_readAtTimestamp.isNull()) {
                _readAtTimestamp = _beginTransactionAtAllCommittedTimestamp(session);
                break;
            }
            // Intentionally continue to the next case to read at the _readAtTimestamp.
        }
        case ReadSource::kLastAppliedSnapshot: {
            // Only ever read the last applied timestamp once, and continue reusing it for
            // subsequent transactions.
            if (_readAtTimestamp.isNull()) {
                _readAtTimestamp = _sessionCache->snapshotManager().beginTransactionOnLocalSnapshot(
                    session, _ignorePrepared);
                break;
            }
            // Intentionally continue to the next case to read at the _readAtTimestamp.
        }
        case ReadSource::kProvided: {
            WiredTigerBeginTxnBlock txnOpen(session, _ignorePrepared);
            auto status = txnOpen.setTimestamp(_readAtTimestamp);

            if (!status.isOK() && status.code() == ErrorCodes::BadValue) {
                uasserted(ErrorCodes::SnapshotTooOld,
                          str::stream() << "Read timestamp " << _readAtTimestamp.toString()
                                        << " is older than the oldest available timestamp.");
            }
            uassertStatusOK(status);
            txnOpen.done();
            break;
        }
    }

    LOG(3) << "WT begin_transaction for snapshot id " << _mySnapshotId;
    _active = true;
}
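The difference between the read sources in the switch above is easiest to see in a reduced model. The sketch below is hypothetical (ToyReader and ToyReadSource are names made up here) and covers only four of the cases; timestamps are plain integers, with 0 standing for "null", and kAllCommittedSnapshot is left out.

```cpp
#include <cassert>
#include <cstdint>

enum class ToyReadSource { kNoTimestamp, kLastApplied, kLastAppliedSnapshot, kProvided };

// Hypothetical model of how a read timestamp is picked when a transaction opens.
struct ToyReader {
    ToyReadSource source = ToyReadSource::kNoTimestamp;
    std::uint64_t readAt = 0;  // plays the role of _readAtTimestamp; 0 means null

    // Returns the timestamp the new transaction reads at (0 = no timestamp).
    std::uint64_t open(std::uint64_t lastApplied) {
        switch (source) {
            case ToyReadSource::kNoTimestamp:
                return 0;
            case ToyReadSource::kLastApplied:
                readAt = lastApplied;  // always move to the newest local snapshot
                return readAt;
            case ToyReadSource::kLastAppliedSnapshot:
                if (readAt == 0) readAt = lastApplied;  // chosen once, then reused
                return readAt;
            case ToyReadSource::kProvided:
                assert(readAt != 0);  // the caller must have supplied a timestamp
                return readAt;
        }
        return 0;
    }
};
```

With kLastApplied, two consecutive transactions may read at different timestamps as the node applies more writes; with kLastAppliedSnapshot the second transaction reads at the same timestamp as the first, which is what the fall-through into the kProvided case achieves in the excerpt.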

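Insert operations

InsertCmd creates an OperationContext and calls insertDocuments, which constructs a transaction through a WriteUnitOfWork object to carry out the whole process:

Writing the data

WiredTigerRecordStore::insertRecords writes the given documents into the given collection.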

   Status WiredTigerRecordStore::_insertRecords(OperationContext* opCtx,
                                           Record* records,
                                           const Timestamp* timestamps,
                                           size_t nRecords) {
  dassert(opCtx->lockState()->isWriteLocked());

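  // Create the WT cursor. Its constructor calls getSession(), which opens the
  // transaction through _txnOpen() if none is active and determines the read timestamp.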
  WiredTigerCursor curwrap(_uri, _tableId, true, opCtx);
  curwrap.assertInActiveTxn();
  WT_CURSOR* c = curwrap.get();
  invariant(c);

  ...

  for (size_t i = 0; i < nRecords; i++) {
      auto& record = records[i];
      Timestamp ts = timestamps[i];
      if (!ts.isNull()) {
          LOG(4) << "inserting record with timestamp " << ts;
          // For each inserted document, set the transaction's commit_timestamp
          fassert(39001, opCtx->recoveryUnit()->setTimestamp(ts));
      }
      setKey(c, record.id);
      WiredTigerItem value(record.data.data(), record.data.size());
      c->set_value(c, value.Get());
      int ret = WT_OP_CHECK(c->insert(c));
      if (ret)
          return wtRCToStatus(ret, "WiredTigerRecordStore::insertRecord");
  }

  // Record the changes and register them with the recovery unit's transaction
  _changeNumRecords(opCtx, nRecords);
  _increaseDataSize(opCtx, totalLength);

  
  if (_oplogStones) {
      _oplogStones->updateCurrentStoneAfterInsertOnCommit(
          opCtx, totalLength, highestId, nRecords);
  } else {
      cappedDeleteAsNeeded(opCtx, highestId);
  }

  return Status::OK();
}
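The last step of _insertRecords registers changes with the recovery unit so that in-memory counters stay consistent with the outcome of the transaction. The following sketch shows one way such a mechanism can work. It is an assumption-laden illustration: ToyChangeTracker and ToyRecordStore are hypothetical, and the choice to update the counter eagerly and undo it on rollback is an assumption of this sketch rather than something the excerpt confirms.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical miniature of a recovery unit that tracks in-memory side effects.
class ToyChangeTracker {
public:
    // Register an action that undoes a side effect if the transaction rolls back.
    void onRollback(std::function<void()> undo) { _undo.push_back(std::move(undo)); }

    // On commit the side effects simply stay in place.
    void commit() { _undo.clear(); }

    // On rollback the side effects are undone in reverse order of registration.
    void rollback() {
        for (auto it = _undo.rbegin(); it != _undo.rend(); ++it) (*it)();
        _undo.clear();
    }

private:
    std::vector<std::function<void()>> _undo;
};

// Hypothetical record store that keeps a cached record count.
struct ToyRecordStore {
    long long numRecords = 0;

    void changeNumRecords(ToyChangeTracker& ru, long long diff) {
        assert(numRecords + diff >= 0);  // the count can never become negative
        numRecords += diff;              // applied immediately
        ru.onRollback([this, diff] { numRecords -= diff; });
    }
};
```

Tying the undo action to the transaction keeps metadata such as the record count from drifting when an insert is rolled back, for example after a write conflict.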

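Writing the indexes

IndexCatalogImpl::_indexRecords updates the index entries for the inserted documents. Index data is stored separately from the documents, in a table of its own, and is updated as follows: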

Status IndexCatalogImpl::_indexRecords(OperationContext* opCtx,
                                    IndexCatalogEntry* index,
                                    const std::vector<BsonRecord>& bsonRecords,
                                    int64_t* keysInsertedOut) {
 const MatchExpression* filter = index->getFilterExpression();
 if (!filter)
     return _indexFilteredRecords(opCtx, index, bsonRecords, keysInsertedOut);

 std::vector<BsonRecord> filteredBsonRecords;
 for (auto bsonRecord : bsonRecords) {
     if (filter->matchesBSON(*(bsonRecord.docPtr)))
         filteredBsonRecords.push_back(bsonRecord);
 }

 return _indexFilteredRecords(opCtx, index, filteredBsonRecords, keysInsertedOut);
}

Status IndexCatalogImpl::_indexFilteredRecords(OperationContext* opCtx,
                                            IndexCatalogEntry* index,
                                            const std::vector<BsonRecord>& bsonRecords,
                                            int64_t* keysInsertedOut) {
 InsertDeleteOptions options;
 prepareInsertDeleteOptions(opCtx, index->descriptor(), &options);

 for (auto bsonRecord : bsonRecords) {
     int64_t inserted;
     invariant(bsonRecord.id != RecordId());

     if (!bsonRecord.ts.isNull()) {
         Status status = opCtx->recoveryUnit()->setTimestamp(bsonRecord.ts);
         if (!status.isOK())
             return status;
     }

     Status status = index->accessMethod()->insert(
         opCtx, *bsonRecord.docPtr, bsonRecord.id, options, &inserted);
     if (!status.isOK())
         return status;

     if (keysInsertedOut) {
         *keysInsertedOut += inserted;
     }
 }
 return Status::OK();
}

Status IndexAccessMethod::insert(OperationContext* opCtx,
                              const BSONObj& obj,
                              const RecordId& loc,
                              const InsertDeleteOptions& options,
                              int64_t* numInserted) {
 invariant(numInserted);
 *numInserted = 0;
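 BSONObjSet keys = SimpleBSONObjComparator::kInstance.makeBSONObjSet();
 // Note: the code that fills `keys` from `obj` and declares `multikeyPaths`
 // is not shown in this excerpt.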

 Status ret = Status::OK();
 for (BSONObjSet::const_iterator i = keys.begin(); i != keys.end(); ++i) {
     Status status = _newInterface->insert(opCtx, *i, loc, options.dupsAllowed);

     ...
 }

 if (*numInserted > 1 || isMultikeyFromPaths(multikeyPaths)) {
     _btreeState->setMultikey(opCtx, multikeyPaths);
 }

 return ret;
}

Status WiredTigerIndex::insert(OperationContext* opCtx,
                            const BSONObj& key,
                            const RecordId& id,
                            bool dupsAllowed) {
 dassert(opCtx->lockState()->isWriteLocked());
 invariant(id.isNormal());
 dassert(!hasFieldNames(key));

 Status s = checkKeySize(key);
 if (!s.isOK())
     return s;

 WiredTigerCursor curwrap(_uri, _tableId, false, opCtx);
 curwrap.assertInActiveTxn();
 WT_CURSOR* c = curwrap.get();

 return _insert(opCtx, c, key, id, dupsAllowed);
}

Status WiredTigerIndexStandard::_insert(OperationContext* opCtx,
                                     WT_CURSOR* c,
                                     const BSONObj& keyBson,
                                     const RecordId& id,
                                     bool dupsAllowed) {
 invariant(dupsAllowed);

 TRACE_INDEX << " key: " << keyBson << " id: " << id;

 KeyString key(keyStringVersion(), keyBson, _ordering, id);
 WiredTigerItem keyItem(key.getBuffer(), key.getSize());

 WiredTigerItem valueItem = key.getTypeBits().isAllZeros()
     ? emptyItem
     : WiredTigerItem(key.getTypeBits().getBuffer(), key.getTypeBits().getSize());

 setKey(c, keyItem.Get());
 c->set_value(c, valueItem.Get());
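 int ret = WT_OP_CHECK(c->insert(c));

 // Do not silently drop the result: a duplicate key means the entry is already
 // in the index, which is acceptable here; anything else is reported as an error.
 if (ret != 0 && ret != WT_DUPLICATE_KEY)
     return wtRCToStatus(ret, "WiredTigerIndexStandard::_insert");

 return Status::OK();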
}
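In WiredTigerIndexStandard::_insert the RecordId is folded into the KeyString, so two documents with the same indexed value still produce distinct index entries. A minimal sketch of that idea, using a std::set of (key, record id) pairs, is shown below; ToyStandardIndex is a hypothetical name, and a plain string stands in for the encoded key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical non-unique index: entries are ordered by (key, record id).
class ToyStandardIndex {
public:
    void insert(const std::string& key, std::int64_t recordId) {
        assert(recordId > 0);  // mirrors the invariant that the id is a normal RecordId
        // Inserting an entry that is already present is a no-op.
        _entries.insert(std::make_pair(key, recordId));
    }

    // Return the record ids of all documents whose indexed value equals `key`.
    std::vector<std::int64_t> find(const std::string& key) const {
        std::vector<std::int64_t> ids;
        auto it = _entries.lower_bound(
            std::make_pair(key, std::numeric_limits<std::int64_t>::min()));
        for (; it != _entries.end() && it->first == key; ++it) ids.push_back(it->second);
        return ids;
    }

    std::size_t size() const { return _entries.size(); }

private:
    std::set<std::pair<std::string, std::int64_t>> _entries;
};
```

Because the record id is part of the ordering, a lookup by value becomes a short range scan over adjacent entries, and the value stored next to each key can stay tiny, as with the type bits in the excerpt.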

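Writing the oplog

The oplog entries are written through OpObserverImpl::onInserts.

WT_INSERT in WiredTiger

Each insert adds a WT_INSERT entry to the in-memory page, linked into the page's WT_INSERT_HEAD list, which tracks the different versions of the key. Every WT_INSERT carries a WT_UPDATE structure, together with the offset and size that locate its key.
Each WT_UPDATE contains the id of the transaction that made the update and the associated timestamp.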
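The relationship between an inserted key and its versions can be sketched as a linked chain of updates, newest first. The structs below are hypothetical simplifications (ToyInsert and ToyUpdate are not the WiredTiger definitions), and visibility is reduced to a timestamp comparison; a real check would also consult the transaction id against the reader's snapshot.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified counterpart of WT_UPDATE: one version of a value.
struct ToyUpdate {
    std::uint64_t txnId;              // transaction that wrote this version
    std::uint64_t timestamp;          // commit timestamp of this version
    std::string value;
    std::unique_ptr<ToyUpdate> next;  // next older version, if any
};

// Hypothetical, simplified counterpart of WT_INSERT: a key plus its update chain.
struct ToyInsert {
    std::string key;
    std::unique_ptr<ToyUpdate> head;  // newest version first

    // Prepend a new version; timestamps must not go backwards.
    void addVersion(std::uint64_t txnId, std::uint64_t ts, std::string value) {
        assert(!head || ts >= head->timestamp);
        std::unique_ptr<ToyUpdate> u(
            new ToyUpdate{txnId, ts, std::move(value), std::move(head)});
        head = std::move(u);
    }

    // Walk from newest to oldest and return the first version committed at or
    // before `readTs`, or nullptr if no version is visible.
    const ToyUpdate* read(std::uint64_t readTs) const {
        for (const ToyUpdate* u = head.get(); u != nullptr; u = u->next.get()) {
            if (u->timestamp <= readTs) return u;
        }
        return nullptr;
    }
};
```

Writers only ever prepend to the chain, so a reader with an older read timestamp simply walks past the newer entries and is never blocked by them.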
