Concurrency Control (MVCC) in MongoDB

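Databases that support row-level or document-level concurrency usually rely on MVCC (multi-version concurrency control) to raise concurrency further.
MVCC improves performance by serving reads without taking locks. Locks are a limited system resource: acquiring and releasing them takes time, and a request for a lock that is unavailable has to wait. With a mutex, the system must signal between threads when the lock becomes available; with a spinlock, the waiter keeps polling until the resource is ready, which wastes CPU.
MVCC provides a snapshot as of some point in time (POV, point-of-view), so that a transaction always sees data consistent with the moment it started, no matter how long it runs. As a result, different transactions may see different data for the same row at the same instant; in other words, a row can have multiple versions.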
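
The idea of "one row, several versions" can be sketched in a few lines. This is a minimal illustration and not MongoDB or WiredTiger code: the class name VersionedRow and the use of plain integers as timestamps are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration type; not part of MongoDB or WiredTiger.
class VersionedRow {
public:
    // Record a version that becomes visible at commitTs.
    void write(std::uint64_t commitTs, std::string value) {
        _versions[commitTs] = std::move(value);
    }

    // Return the newest version committed at or before readTs,
    // or nullptr if the row did not exist yet at that point in time.
    const std::string* read(std::uint64_t readTs) const {
        auto it = _versions.upper_bound(readTs);  // first version newer than readTs
        if (it == _versions.begin()) {
            return nullptr;
        }
        return &std::prev(it)->second;
    }

private:
    std::map<std::uint64_t, std::string> _versions;  // commit timestamp -> value
};
```

A reader that started at timestamp 15 keeps seeing the version written at 10 even after another version is committed at 20, and it never has to take a lock to do so.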

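In WiredTiger, transactional consistency is built mainly on timestamps. Each timestamp corresponds to a snapshot: when a transaction starts it obtains its timestamp, and the storage engine associates a snapshot with it. The snapshot records the state of the transaction while it runs, together with the range [min_snap_id, max_snap_id] of transactions that had not yet committed. This information is copied from global_snapshot when the snapshot is created, and it lists the transactions that were still uncommitted at that moment. Transactions before min_snap_id have already committed, so their data can be read; transactions with an ID greater than min_snap_id and less than or equal to max_snap_id are the ones still running.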
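
The snapshot check described above can be sketched as follows. This is a simplified model under stated assumptions, not WiredTiger's implementation: the names Snapshot, takeSnapshot and isVisible are hypothetical, and the boundary convention (snapMax is the next ID to be allocated, so it is treated as exclusive) is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model of a transaction snapshot.
struct Snapshot {
    std::uint64_t snapMin;  // every ID below this had committed at snapshot time
    std::uint64_t snapMax;  // every ID at or above this started after the snapshot
    std::vector<std::uint64_t> concurrent;  // sorted IDs still running at snapshot time
};

// Build a snapshot from the list of running transactions and the next ID
// that the (hypothetical) global state would hand out.
Snapshot takeSnapshot(std::vector<std::uint64_t> running, std::uint64_t nextId) {
    std::sort(running.begin(), running.end());
    Snapshot s;
    s.snapMin = running.empty() ? nextId : running.front();
    s.snapMax = nextId;
    s.concurrent = std::move(running);
    return s;
}

// A change is visible if its transaction committed before the snapshot was taken.
bool isVisible(const Snapshot& snap, std::uint64_t txnId) {
    if (txnId >= snap.snapMax) {
        return false;  // started after the snapshot
    }
    if (txnId < snap.snapMin) {
        return true;  // committed before any running transaction began
    }
    // In between: visible only if it was not running when the snapshot was taken.
    return !std::binary_search(snap.concurrent.begin(), snap.concurrent.end(), txnId);
}
```

With transactions 5 and 7 running and 9 as the next ID, changes made by transactions 4, 6 and 8 are visible to the snapshot, while those made by 5, 7 and 9 are not.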

The implicit transaction framework

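In MongoDB 4.0, whether or not a transaction is started explicitly, every CRUD operation is wrapped in a transaction by default. This is done through WriteUnitOfWork. The WriteUnitOfWork class is only a wrapper that sets and clears a few flags; the real work is implemented in WiredTigerRecoveryUnit, which encapsulates how the server uses transactions.
Every CRUD operation uses a key and a cursor to locate the KV value on the relevant in-memory page of the btree. The cursor first has to obtain the WiredTiger session from the recovery unit; at that point the recovery unit creates a new transaction and sets its read_timestamp.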

WiredTigerCursor::WiredTigerCursor(const std::string& uri,
                                   uint64_t tableId,
                                   bool forRecordStore,
                                   OperationContext* opCtx) {
    _tableID = tableId;
    _ru = WiredTigerRecoveryUnit::get(opCtx);
    _session = _ru->getSession();
    _cursor = _session->getCursor(uri, tableId, forRecordStore);
}

WiredTigerSession* WiredTigerRecoveryUnit::getSession() {
    if (!_active) {
        _txnOpen();  // open a new transaction
    }
    return _session.get();
}
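
The split between the thin wrapper and the recovery unit can be illustrated with a small RAII sketch. This is a simplified model, not MongoDB's code: ToyRecoveryUnit and ToyWriteUnitOfWork are hypothetical names, writes are just strings, and the assumption is that a unit of work which goes out of scope without commit() is rolled back.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a recovery unit; not MongoDB's real class.
class ToyRecoveryUnit {
public:
    void beginUnitOfWork() {
        assert(!_inUnitOfWork);
        _inUnitOfWork = true;
    }

    // Buffers a write. The storage transaction is opened lazily on first use,
    // mirroring the !_active check in getSession() above.
    void write(const std::string& doc) {
        if (!_active) {
            _active = true;
            ++txnsOpened;
        }
        _pending.push_back(doc);
    }

    void commitUnitOfWork() {
        committed.insert(committed.end(), _pending.begin(), _pending.end());
        _reset();
    }

    void abortUnitOfWork() {
        _reset();  // buffered writes are discarded
    }

    std::vector<std::string> committed;  // writes made durable by a commit
    int txnsOpened = 0;                  // number of storage transactions opened

private:
    void _reset() {
        _pending.clear();
        _active = false;
        _inUnitOfWork = false;
    }

    std::vector<std::string> _pending;
    bool _active = false;
    bool _inUnitOfWork = false;
};

// Thin wrapper: it only marks the begin and end of the unit of work.
class ToyWriteUnitOfWork {
public:
    explicit ToyWriteUnitOfWork(ToyRecoveryUnit& ru) : _ru(ru) {
        _ru.beginUnitOfWork();
    }
    ~ToyWriteUnitOfWork() {
        if (!_committed) {
            _ru.abortUnitOfWork();
        }
    }
    ToyWriteUnitOfWork(const ToyWriteUnitOfWork&) = delete;
    ToyWriteUnitOfWork& operator=(const ToyWriteUnitOfWork&) = delete;

    void commit() {
        _ru.commitUnitOfWork();
        _committed = true;
    }

private:
    ToyRecoveryUnit& _ru;
    bool _committed = false;
};
```

Several writes inside one ToyWriteUnitOfWork share a single lazily opened transaction, and leaving the scope early (for example through an exception) discards them without extra cleanup code.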

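WiredTigerRecoveryUnit::_txnOpen() opens a new transaction and sets its read_timestamp. By default no such point in time is set, so how do we tell the WiredTiger layer which snapshot to read from?
In MongoDB this is done by specifying a ReadSource. The available values are: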

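ReadSource	Description
kUnset	The default behavior; no timestamp is specified.
kNoTimestamp	Read explicitly without a timestamp.
kMajorityCommitted	Read from WiredTigerSnapshotManager::_committedSnapshot.
kLastApplied	Read from WiredTigerSnapshotManager::_localSnapshot.
kLastAppliedSnapshot	Read from WiredTigerSnapshotManager::_localSnapshot; the timestamp is read once and reused by later transactions.
kAllCommittedSnapshot	Read at a point in time before which all transactions have committed.
kProvided	Read at an explicitly specified timestamp, set by calling setTimestampReadSource.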

void WiredTigerRecoveryUnit::_txnOpen() {
    invariant(!_active);
    _ensureSession();

    // Only start a timer for transaction's lifetime if we're going to log it.
    if (shouldLog(kSlowTransactionSeverity)) {
        _timer.reset(new Timer());
    }
    WT_SESSION* session = _session->getSession();

    switch (_timestampReadSource) {
        case ReadSource::kUnset:
        case ReadSource::kNoTimestamp: {
            WiredTigerBeginTxnBlock txnOpen(session, _ignorePrepared);

            if (_isOplogReader) {
                auto status =
                    txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                         WiredTigerBeginTxnBlock::RoundToOldest::kRound);
                fassert(50771, status);
            }
            txnOpen.done();
            break;
        }
        case ReadSource::kMajorityCommitted: {
            // We reset _majorityCommittedSnapshot to the actual read timestamp used when the
            // transaction was started.
            _majorityCommittedSnapshot =
                _sessionCache->snapshotManager().beginTransactionOnCommittedSnapshot(session);
            break;
        }
        case ReadSource::kLastApplied: {
            if (_sessionCache->snapshotManager().getLocalSnapshot()) {
                _readAtTimestamp = _sessionCache->snapshotManager().beginTransactionOnLocalSnapshot(
                    session, _ignorePrepared);
            } else {
                WiredTigerBeginTxnBlock(session, _ignorePrepared).done();
            }
            break;
        }
        case ReadSource::kAllCommittedSnapshot: {
            if (_readAtTimestamp.isNull()) {
                _readAtTimestamp = _beginTransactionAtAllCommittedTimestamp(session);
                break;
            }
            // Intentionally continue to the next case to read at the _readAtTimestamp.
        }
        case ReadSource::kLastAppliedSnapshot: {
            // Only ever read the last applied timestamp once, and continue reusing it for
            // subsequent transactions.
            if (_readAtTimestamp.isNull()) {
                _readAtTimestamp = _sessionCache->snapshotManager().beginTransactionOnLocalSnapshot(
                    session, _ignorePrepared);
                break;
            }
            // Intentionally continue to the next case to read at the _readAtTimestamp.
        }
        case ReadSource::kProvided: {
            WiredTigerBeginTxnBlock txnOpen(session, _ignorePrepared);
            auto status = txnOpen.setTimestamp(_readAtTimestamp);

            if (!status.isOK() && status.code() == ErrorCodes::BadValue) {
                uasserted(ErrorCodes::SnapshotTooOld,
                          str::stream() << "Read timestamp " << _readAtTimestamp.toString()
                                        << " is older than the oldest available timestamp.");
            }
            uassertStatusOK(status);
            txnOpen.done();
            break;
        }
    }

    LOG(3) << "WT begin_transaction for snapshot id " << _mySnapshotId;
    _active = true;
}
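
The effect of the switch above can be condensed into a small model that only answers one question: which read timestamp does a new transaction get? This is a simplified sketch, not the real function. ToySnapshotState, ToyReadState and openReadTimestamp are hypothetical names, timestamps are plain integers with 0 meaning "no timestamp", and the oplog-reader branch, the prepared-transaction flag and error handling are left out.

```cpp
#include <cassert>
#include <cstdint>

enum class ReadSource {
    kUnset,
    kNoTimestamp,
    kMajorityCommitted,
    kLastApplied,
    kLastAppliedSnapshot,
    kAllCommittedSnapshot,
    kProvided
};

// Hypothetical view of the values the snapshot manager would supply.
struct ToySnapshotState {
    std::uint64_t committedSnapshot;  // majority-committed point
    std::uint64_t localSnapshot;      // last-applied point
    std::uint64_t allCommitted;       // point before which everything has committed
};

// Hypothetical per-recovery-unit state.
struct ToyReadState {
    ReadSource source = ReadSource::kUnset;
    std::uint64_t readAtTimestamp = 0;  // 0 means "not chosen yet"
};

// Returns the read timestamp for the transaction being opened (0 = none).
std::uint64_t openReadTimestamp(ToyReadState& rs, const ToySnapshotState& snap) {
    switch (rs.source) {
        case ReadSource::kUnset:
        case ReadSource::kNoTimestamp:
            return 0;
        case ReadSource::kMajorityCommitted:
            return snap.committedSnapshot;
        case ReadSource::kLastApplied:
            return snap.localSnapshot;  // re-read for every new transaction
        case ReadSource::kAllCommittedSnapshot:
            if (rs.readAtTimestamp == 0) {
                rs.readAtTimestamp = snap.allCommitted;  // chosen once
            }
            return rs.readAtTimestamp;
        case ReadSource::kLastAppliedSnapshot:
            if (rs.readAtTimestamp == 0) {
                rs.readAtTimestamp = snap.localSnapshot;  // chosen once
            }
            return rs.readAtTimestamp;
        case ReadSource::kProvided:
            return rs.readAtTimestamp;  // set explicitly by the caller
    }
    return 0;
}
```

The difference between kLastApplied and kLastAppliedSnapshot shows up when the local snapshot advances between two transactions: the first follows it, the second keeps reading at the timestamp it picked the first time.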

Insert operations

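InsertCmd creates an OperationContext and calls insertDocuments, which uses a WriteUnitOfWork object to set up a transaction that covers the whole process: writing the data, writing the indexes, and writing the oplog.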

Writing the data

The function WiredTigerRecordStore::insertRecords writes the given documents into the given collection. Its helper _insertRecords is shown here.

   Status WiredTigerRecordStore::_insertRecords(OperationContext* opCtx,
                                           Record* records,
                                           const Timestamp* timestamps,
                                           size_t nRecords) {
  dassert(opCtx->lockState()->isWriteLocked());

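  // Create the WT cursor; its constructor calls _txnOpen (through getSession) when no
  // transaction is active yet, which determines the read timestamp.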
  WiredTigerCursor curwrap(_uri, _tableId, true, opCtx);
  curwrap.assertInActiveTxn();
  WT_CURSOR* c = curwrap.get();
  invariant(c);

  ...

  for (size_t i = 0; i < nRecords; i++) {
      auto& record = records[i];
      Timestamp ts = timestamps[i];
      if (!ts.isNull()) {
          LOG(4) << "inserting record with timestamp " << ts;
          // For each inserted document, set the transaction's commit_timestamp.
          fassert(39001, opCtx->recoveryUnit()->setTimestamp(ts));
      }
      setKey(c, record.id);
      WiredTigerItem value(record.data.data(), record.data.size());
      c->set_value(c, value.Get());
      int ret = WT_OP_CHECK(c->insert(c));
      if (ret)
          return wtRCToStatus(ret, "WiredTigerRecordStore::insertRecord");
  }

  // Record the changes and register them with the recovery unit's transaction.
  _changeNumRecords(opCtx, nRecords);
  _increaseDataSize(opCtx, totalLength);

  
  if (_oplogStones) {
      _oplogStones->updateCurrentStoneAfterInsertOnCommit(
          opCtx, totalLength, highestId, nRecords);
  } else {
      cappedDeleteAsNeeded(opCtx, highestId);
  }

  return Status::OK();
}

Writing the indexes

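The function IndexCatalogImpl::_indexRecords updates the index entries for the inserted documents. Index data is stored in a separate table from the collection's documents, and it is updated as follows: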

Status IndexCatalogImpl::_indexRecords(OperationContext* opCtx,
                                    IndexCatalogEntry* index,
                                    const std::vector<BsonRecord>& bsonRecords,
                                    int64_t* keysInsertedOut) {
 const MatchExpression* filter = index->getFilterExpression();
 if (!filter)
     return _indexFilteredRecords(opCtx, index, bsonRecords, keysInsertedOut);

 std::vector<BsonRecord> filteredBsonRecords;
 for (auto bsonRecord : bsonRecords) {
     if (filter->matchesBSON(*(bsonRecord.docPtr)))
         filteredBsonRecords.push_back(bsonRecord);
 }

 return _indexFilteredRecords(opCtx, index, filteredBsonRecords, keysInsertedOut);
}

Status IndexCatalogImpl::_indexFilteredRecords(OperationContext* opCtx,
                                            IndexCatalogEntry* index,
                                            const std::vector<BsonRecord>& bsonRecords,
                                            int64_t* keysInsertedOut) {
 InsertDeleteOptions options;
 prepareInsertDeleteOptions(opCtx, index->descriptor(), &options);

 for (auto bsonRecord : bsonRecords) {
     int64_t inserted;
     invariant(bsonRecord.id != RecordId());

     if (!bsonRecord.ts.isNull()) {
         Status status = opCtx->recoveryUnit()->setTimestamp(bsonRecord.ts);
         if (!status.isOK())
             return status;
     }

     Status status = index->accessMethod()->insert(
         opCtx, *bsonRecord.docPtr, bsonRecord.id, options, &inserted);
     if (!status.isOK())
         return status;

     if (keysInsertedOut) {
         *keysInsertedOut += inserted;
     }
 }
 return Status::OK();
}

Status IndexAccessMethod::insert(OperationContext* opCtx,
                              const BSONObj& obj,
                              const RecordId& loc,
                              const InsertDeleteOptions& options,
                              int64_t* numInserted) {
 invariant(numInserted);
 *numInserted = 0;
 BSONObjSet keys = SimpleBSONObjComparator::kInstance.makeBSONObjSet();
 // The step that fills keys and declares multikeyPaths is omitted from this excerpt.

 Status ret = Status::OK();
 for (BSONObjSet::const_iterator i = keys.begin(); i != keys.end(); ++i) {
     Status status = _newInterface->insert(opCtx, *i, loc, options.dupsAllowed);

     ...
 }

 if (*numInserted > 1 || isMultikeyFromPaths(multikeyPaths)) {
     _btreeState->setMultikey(opCtx, multikeyPaths);
 }

 return ret;
}

Status WiredTigerIndex::insert(OperationContext* opCtx,
                            const BSONObj& key,
                            const RecordId& id,
                            bool dupsAllowed) {
 dassert(opCtx->lockState()->isWriteLocked());
 invariant(id.isNormal());
 dassert(!hasFieldNames(key));

 Status s = checkKeySize(key);
 if (!s.isOK())
     return s;

 WiredTigerCursor curwrap(_uri, _tableId, false, opCtx);
 curwrap.assertInActiveTxn();
 WT_CURSOR* c = curwrap.get();

 return _insert(opCtx, c, key, id, dupsAllowed);
}

Status WiredTigerIndexStandard::_insert(OperationContext* opCtx,
                                     WT_CURSOR* c,
                                     const BSONObj& keyBson,
                                     const RecordId& id,
                                     bool dupsAllowed) {
 invariant(dupsAllowed);

 TRACE_INDEX << " key: " << keyBson << " id: " << id;

 KeyString key(keyStringVersion(), keyBson, _ordering, id);
 WiredTigerItem keyItem(key.getBuffer(), key.getSize());

 WiredTigerItem valueItem = key.getTypeBits().isAllZeros()
     ? emptyItem
     : WiredTigerItem(key.getTypeBits().getBuffer(), key.getTypeBits().getSize());

 setKey(c, keyItem.Get());
 c->set_value(c, valueItem.Get());
 int ret = WT_OP_CHECK(c->insert(c));

 return Status::OK();
}
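
In the standard (non-unique) index path above, the KeyString is built from both the index key and the RecordId, which is why _insert can require dupsAllowed: two documents with the same indexed value still produce distinct table keys. A minimal sketch of that idea follows. It is an illustration only: ToyStandardIndex is a hypothetical name, a std::map stands in for the index table, and a string plus an integer stand in for the encoded KeyString.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a non-unique index table.
class ToyStandardIndex {
public:
    // The table key is (index key, record id); the value is left empty here,
    // standing in for the type bits stored by the real code.
    void insert(const std::string& key, std::int64_t recordId) {
        _table[std::make_pair(key, recordId)] = "";
    }

    // Return the record ids of all entries whose index key equals key.
    std::vector<std::int64_t> find(const std::string& key) const {
        std::vector<std::int64_t> ids;
        auto it = _table.lower_bound(
            std::make_pair(key, std::numeric_limits<std::int64_t>::min()));
        for (; it != _table.end() && it->first.first == key; ++it) {
            ids.push_back(it->first.second);
        }
        return ids;
    }

    std::size_t size() const {
        return _table.size();
    }

private:
    std::map<std::pair<std::string, std::int64_t>, std::string> _table;
};
```

Because the record id is part of the key, a lookup for one indexed value becomes a short range scan that starts at the smallest possible record id for that value.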

Writing the oplog

This step goes through OpObserverImpl::onInserts.

WT_INSERT in WiredTiger

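Every insert adds a WT_INSERT entry on the in-memory page, in the list headed by WT_INSERT_HEAD. The list keeps track of the different versions of the key: each WT_INSERT holds a WT_UPDATE structure, together with the offset and size of its data within the page.
Each WT_UPDATE carries the ID of the transaction that made the update and the associated timestamps.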
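
The relationship between an insert entry and its updates can be sketched as a newest-first linked list. This is a simplified model, not WiredTiger's structures: ToyInsert and ToyUpdate are hypothetical names, each update carries a single commit timestamp, and the visibility rule (a reader sees its own updates plus anything committed at or before its read timestamp) is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical model of one version of a value.
struct ToyUpdate {
    std::uint64_t txnId;              // transaction that made the update
    std::uint64_t commitTs;           // timestamp at which it became visible
    std::string value;
    std::unique_ptr<ToyUpdate> next;  // next older version, if any
};

// Hypothetical model of a key with its chain of versions.
struct ToyInsert {
    std::string key;
    std::unique_ptr<ToyUpdate> updates;  // newest version first

    // A new version is always linked in at the head of the chain.
    void prepend(std::uint64_t txnId, std::uint64_t commitTs, std::string value) {
        std::unique_ptr<ToyUpdate> upd(
            new ToyUpdate{txnId, commitTs, std::move(value), std::move(updates)});
        updates = std::move(upd);
    }

    // Walk from newest to oldest and return the first version the reader may see,
    // or nullptr if no version is visible.
    const std::string* read(std::uint64_t readTs, std::uint64_t readerTxnId) const {
        for (const ToyUpdate* u = updates.get(); u != nullptr; u = u->next.get()) {
            if (u->txnId == readerTxnId || u->commitTs <= readTs) {
                return &u->value;
            }
        }
        return nullptr;
    }
};
```

Writers never overwrite an existing version; they only add a new head, so a reader holding an older timestamp can keep walking past it to the version that matches its snapshot.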
