MongoDB Source Code Analysis: Inserting Records and Building the B-Tree Index

In a previous article, I introduced the assembleResponse function (instance.cpp, line 224), which dispatches to the corresponding CRUD operation based on the op enum value. The enum is defined as follows:

enum Operations {
    opReply = 1,       /* reply. responseTo is set. */
    dbMsg = 1000,      /* generic msg command followed by a string */
    dbUpdate = 2001,   /* update object */
    dbInsert = 2002,
    // dbGetByOID = 2003,
    dbQuery = 2004,
    dbGetMore = 2005,
    dbDelete = 2006,
    dbKillCursors = 2007
};
As you can see, dbInsert = 2002 is the enum value for the insert operation. Let's look at what assembleResponse calls once it identifies an insert:
assembleResponse( Message & m, DbResponse & dbresponse, const SockAddr & client ) {
    .....
    try {
        if ( op == dbInsert ) {           // insert a record
            receivedInsert(m, currentOp);
        }
        else if ( op == dbUpdate ) {      // update a record
            receivedUpdate(m, currentOp);
        }
        else if ( op == dbDelete ) {      // delete a record
            receivedDelete(m, currentOp);
        }
        else if ( op == dbKillCursors ) { // kill cursor objects
            currentOp.ensureStarted();
            logThreshold = 10;
            ss << "killcursors";
            receivedKillCursors(m);
        }
        else {
            mongo::log() << "operation isn't supported: " << op << endl;
            currentOp.done();
            log = true;
        }
    }
    .....
    }
}
From the code above we can see that, on a dbInsert operation, the system calls the receivedInsert() method (instance.cpp, line 570), defined as follows:
void receivedInsert(Message & m, CurOp & op) {
    DbMessage d(m);                  // wrap the raw message in a database-message view
    const char* ns = d.getns();      // get the namespace the data will be inserted into
    assert(*ns);
    uassert(10058, "not master", isMasterNs(ns));
    op.debug().str << ns;
    writelock lk(ns);                // acquire the write lock
    if (handlePossibleShardedMessage(m, 0))  // if this is a sharded message, handle it and stop
        return;
    Client::Context ctx(ns);
    int n = 0;
    while (d.moreJSObjs()) {         // iterate over the BSONObj records in the message body
        BSONObj js = d.nextJsObj();
        uassert(10059, "object to insert too large", js.objsize() <= BSONObjMaxUserSize);
        {
            // iterate over the object's elements to reject update modifiers
            // such as $set, $inc, $push, $pull
            BSONObjIterator i(js);
            while (i.more()) {
                BSONElement e = i.next();
                uassert(13511, "object to insert can't have $ modifiers", e.fieldName()[0] != '$');
            }
        }
        // insert the record; god = false marks this BSONObj as ordinary user data
        theDataFileMgr.insertWithObjMod(ns, js, false);
        logOp("i", ns, js);          // write the oplog entry (for master state and sharding)
        if (++n % 4 == 0) {
            // after every few inserts, let the durability layer commit if needed;
            // on durability see http://www.cnblogs.com/daizhj/archive/2011/03/21/1990344.html
            getDur().commitIfNeeded();
        }
    }
    globalOpCounters.incInsertInWriteLock(n);  // add the number of inserted records (n); the counter uses InterlockedIncrement for atomicity
}
The method above performs the insert under the write lock; before inserting, it runs simple checks on each object (its size, and the absence of update modifiers) to ensure the data is valid.
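The two checks described above (object size and $-modifier rejection) can be sketched in isolation. The toy types below stand in for BSONObj/BSONElement, and the size limit is an assumed placeholder rather than the exact BSONObjMaxUserSize of this MongoDB version:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real BSON types: each element is reduced to
// its field name, and the "object" to a list of elements plus a byte size.
struct ToyElement { std::string fieldName; };
struct ToyObject {
    std::vector<ToyElement> elements;
    int objsize;
};

const int kMaxUserSize = 16 * 1024 * 1024;  // assumed limit; the real BSONObjMaxUserSize varies by version

// Mirrors the two uasserts in receivedInsert: reject oversized objects and any
// top-level field starting with '$' (update modifiers like $set, $inc).
bool okToInsert(const ToyObject& obj) {
    if (obj.objsize > kMaxUserSize)
        return false;
    for (const ToyElement& e : obj.elements)
        if (!e.fieldName.empty() && e.fieldName[0] == '$')
            return false;
    return true;
}
```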
The code above ultimately calls the insertWithObjMod() method (pdfile.cpp, line 1432), defined as follows:
DiskLoc DataFileMgr::insertWithObjMod( const char * ns, BSONObj & o, bool god) {
    DiskLoc loc = insert(ns, o.objdata(), o.objsize(), god);
    if (!loc.isNull())            // a non-null DiskLoc means the record was inserted successfully
        o = BSONObj(loc.rec());   // if so, rebind o to the record (a Record pointer) at the returned address
    return loc;
}
This method is only a thin wrapper around the insert operation and its result: ns is the namespace of the data object, o is the BSONObj to insert, and god marks whether the object is ordinary user data (false = user data). The god parameter exists because the insert method below handles both adding namespaces (and indexes) and inserting records (each path calls it repeatedly), and god=true when a namespace is being added.
Now let's look at the insert method itself (pdfile.cpp, line 1467). It is long, so see the inline comments for details:
DiskLoc DataFileMgr::insert( const char * ns, const void * obuf, int len, bool god, const BSONElement & writeId, bool mayAddIndex) {
    bool wouldAddIndex = false;
    massert(10093, "cannot insert into reserved $ collection", god || isANormalNSName(ns));
    uassert(10094, str::stream() << "invalid ns: " << ns, isValidNS(ns));
    const char* sys = strstr(ns, "system.");
    if (sys) {  // check whether the target ns falls in the reserved "system" area; if so, stop normal processing
        uassert(10095, "attempt to insert in reserved database name 'system'", sys != ns);
        if (strstr(ns, ".system.")) {
            // later:check for dba-type permissions here if have that at some point separate
            if (strstr(ns, ".system.indexes"))  // an insert into system.indexes means "create an index"
                wouldAddIndex = true;
            else if (legalClientSystemNS(ns, true))
                ;
            else if (!god) {  // a plain client insert into a system namespace (treating system as a data table) is rejected
                out() << "ERROR: attempt to insert in system namespace " << ns << endl;
                return DiskLoc();
            }
        }
        else
            sys = 0;
    }
    bool addIndex = wouldAddIndex && mayAddIndex;  // decide whether an index needs to be added
    NamespaceDetails* d = nsdetails(ns);           // look up the details of this namespace
    if (d == 0) {
        addNewNamespaceToCatalog(ns);  // add the new namespace to the system catalog (this recursively calls insert())
        /* todo: shouldn't be in the namespace catalog until after the allocations here work.
           also if this is an addIndex, those checks should happen before this!
        */
        // allocate the first extent of the data file
        cc().database()->allocExtent(ns, Extent::initialSize(len), false);
        d = nsdetails(ns);
        if (!god)
            ensureIdIndexForNewNs(ns);
    }
    d->paddingFits();
    NamespaceDetails* tableToIndex = 0;
    string tabletoidxns;
    BSONObj fixedIndexObject;
    if (addIndex) {
        assert(obuf);
        BSONObj io((const char*) obuf);
        // prepare to build the index; this does not build it yet, it only
        // validates parameters and checks whether the index already exists
        if (!prepareToBuildIndex(io, god, tabletoidxns, tableToIndex, fixedIndexObject))
            return DiskLoc();
        if (!fixedIndexObject.isEmpty()) {
            obuf = fixedIndexObject.objdata();
            len = fixedIndexObject.objsize();
        }
    }
    const BSONElement* newId = &writeId;
    int addID = 0;
    if (!god) {
        // check whether the object has an _id field; if not, add one.
        // Note that btree buckets which we insert aren't BSONObj's, but in that case god==true.
        BSONObj io((const char*) obuf);
        BSONElement idField = io.getField("_id");
        uassert(10099, "_id cannot be an array", idField.type() != Array);
        if (idField.eoo() /* true when the field is missing */ && !wouldAddIndex && strstr(ns, ".local.") == 0) {
            addID = len;
            if (writeId.eoo()) {
                // generate a fresh _id value (_id may be a 12-byte ObjectId or another type)
                idToInsert_.oid.init();
                newId = &idToInsert;  // bind the freshly generated _id
            }
            len += newId->size();
        }
        // if the object contains timestamp elements, fill them with the current time
        BSONElementManipulator::lookForTimestamps(io);
    }
    // compatibility with older data files
    DiskLoc extentLoc;
    int lenWHdr = len + Record::HeaderSize;
    lenWHdr = (int) (lenWHdr * d->paddingFactor);
    if (lenWHdr == 0) {
        assert(d->paddingFactor == 0);
        *getDur().writing(&d->paddingFactor) = 1.0;
        lenWHdr = len + Record::HeaderSize;
    }
    // before allocating space for the new object, check that it would not
    // violate a unique index.
    // "capped" marks a fixed-size collection, in which the system automatically
    // removes the oldest data; note this has nothing to do with the CAP theorem
    // (consistency, availability, partition tolerance) often mentioned around
    // NoSQL -- see http://blog.nosqlfan.com/html/1112.html and
    // http://blog.nosqlfan.com/html/96.html
    if (d->nIndexes && d->capped && !god) {
        checkNoIndexConflicts(d, BSONObj(reinterpret_cast<const char*>(obuf)));
    }
    DiskLoc loc = d->alloc(ns, lenWHdr, extentLoc);  // allocate space for the record (see __stdAlloc in namespace.cpp)
    if (loc.isNull()) {  // allocation failed
        if (d->capped == 0) {  // not a capped collection, so the collection may grow
            log(1) << "allocating new extent for " << ns << " padding:" << d->paddingFactor << " lenWHdr: " << lenWHdr << endl;
            // allocate a new extent
            cc().database()->allocExtent(ns, Extent::followupSize(lenWHdr, d->lastExtentSize), false);
            // and retry the record allocation
            loc = d->alloc(ns, lenWHdr, extentLoc);
            if (loc.isNull()) {
                log() << "WARNING: alloc() failed after allocating new extent. lenWHdr: " << lenWHdr << " last extent size:" << d->lastExtentSize << "; trying again\n";
                for (int zzz = 0; zzz < 10 && lenWHdr > d->lastExtentSize; zzz++) {  // retry the allocation at most 10 times
                    log() << "try #" << zzz << endl;
                    cc().database()->allocExtent(ns, Extent::followupSize(len, d->lastExtentSize), false);
                    loc = d->alloc(ns, lenWHdr, extentLoc);
                    if (!loc.isNull())
                        break;
                }
            }
        }
        if (loc.isNull()) {  // space could not be allocated at all
            log() << "insert: couldn't alloc space for object ns:" << ns << " capped:" << d->capped << endl;
            assert(d->capped);
            return DiskLoc();
        }
    }
    Record* r = loc.rec();
    {
        assert(r->lengthWithHeaders >= lenWHdr);
        r = (Record*) getDur().writingPtr(r, lenWHdr);  // declare the write intent for the record
        if (addID) {
            /* a little effort was made here to avoid a double copy when we add an ID */
            ((int&)*r->data) = *((int*) obuf) + newId->size();
            memcpy(r->data + 4, newId->rawdata(), newId->size());  // copy the _id field into the record's memory
            memcpy(r->data + 4 + newId->size(), ((char*) obuf) + 4, addID - 4);  // copy the rest of the object after it
        }
        else {
            if (obuf)
                memcpy(r->data, obuf, len);  // copy the object straight into the record r
        }
    }
    {
        Extent* e = r->myExtent(loc);
        if (e->lastRecord.isNull()) {  // no last record yet, i.e. no record has been inserted into this extent before
            Extent::FL* fl = getDur().writing(e->fl());
            fl->firstRecord = fl->lastRecord = loc;
            r->prevOfs = r->nextOfs = DiskLoc::NullOfs;
        }
        else {
            Record* oldlast = e->lastRecord.rec();  // otherwise append the new record after the current last one
            r->prevOfs = e->lastRecord.getOfs();
            r->nextOfs = DiskLoc::NullOfs;
            getDur().writingInt(oldlast->nextOfs) = loc.getOfs();
            getDur().writingDiskLoc(e->lastRecord) = loc;
        }
    }
    /* durably update the collection statistics */
    {
        NamespaceDetails::Stats* s = getDur().writing(&d->stats);
        s->datasize += r->netLength();
        s->nrecords++;
    }
    // under god this insert is a btree bucket (which occupies storage) and the
    // transient stats are left untouched
    if (!god)
        NamespaceDetailsTransient::get_w(ns).notifyOfWriteOp();  // a write invalidates cached query plans
    if (tableToIndex) {
        uassert(13143, "can't create index on system.indexes", tabletoidxns.find(".system.indexes") == string::npos);
        BSONObj info = loc.obj();
        bool background = info["background"].trueValue();
        if (background && cc().isSyncThread()) {
            /* don't do background indexing on slaves. there are nuances. this could be added later but requires more code. */
            log() << "info: indexing in foreground on this replica; was a background index build on the primary" << endl;
            background = false;
        }
        int idxNo = tableToIndex->nIndexes;
        IndexDetails& idx = tableToIndex->addIndex(tabletoidxns.c_str(), !background);  // clears transient cache info and increments the index count
        getDur().writingDiskLoc(idx.info) = loc;
        try {
            buildAnIndex(tabletoidxns, tableToIndex, idx, idxNo, background);  // build the index
        }
        catch (DBException& e) {
            // save the error information, then drop the half-built index
            LastError* le = lastError.get();
            int savecode = 0;
            string saveerrmsg;
            if (le) {
                savecode = le->code;
                saveerrmsg = le->msg;
            }
            else {
                savecode = e.getCode();
                saveerrmsg = e.what();
            }
            // roll back the index build (drop the index)
            string name = idx.indexName();
            BSONObjBuilder b;
            string errmsg;
            bool ok = dropIndexes(tableToIndex, tabletoidxns.c_str(), name.c_str(), errmsg, b, true);
            if (!ok) {
                log() << "failed to drop index after a unique key error building it: " << errmsg << ' ' << tabletoidxns << ' ' << name << endl;
            }
            assert(le && !saveerrmsg.empty());
            raiseError(savecode, saveerrmsg.c_str());
            throw;
        }
    }
    /* add the new record to the existing indexes (btrees) */
    if (d->nIndexes) {
        try {
            BSONObj obj(r->data);
            indexRecord(d, obj, loc);
        }
        catch (AssertionException& e) {
            // e.g. a duplicate key on the _id index
            if (tableToIndex || d->capped) {
                massert(12583, "unexpected index insertion failure on capped collection", !d->capped);
                string s = e.toString();
                s += " : on addIndex/capped - collection and its index will not match";
                uassert_nothrow(s.c_str());
                error() << s << endl;
            }
            else {
                // roll back the record insert above
                _deleteRecord(d, ns, r, loc);
                throw;
            }
        }
    }
    // out() << "inserted at loc:" << hex << loc.getOfs() << " lenwhdr:" << hex << lenWHdr << dec << ' ' << ns << endl;
    return loc;
}
As mentioned, this method handles adding namespaces, adding indexes, and adding data records (the memcpy calls). Adding a namespace via addNewNamespaceToCatalog is straightforward, so below I focus on index creation, which happens in two steps: 1. build the index tree (a B-tree); 2. add the data (mainly record addresses) into that tree.
static void buildAnIndex( string ns, NamespaceDetails * d, IndexDetails & idx, int idxNo, bool background) {
    tlog() << "building new index on " << idx.keyPattern() << " for " << ns << (background ? " background" : "") << endl;
    Timer t;
    unsigned long long n;
    if (background) {
        log(2) << "buildAnIndex: background=true\n";
    }
    assert(!BackgroundOperation::inProgForNs(ns.c_str()));  // should have been checked earlier, better not be...
    assert(d->indexBuildInProgress == 0);
    assertInWriteLock();
    RecoverableIndexState recoverable(d);
    if (inDBRepair || !background) {  // the database is being repaired, or the build runs in the foreground
        n = fastBuildIndex(ns.c_str(), d, idx, idxNo);  // build the index directly
        assert(!idx.head.isNull());
    }
    else {
        BackgroundIndexBuildJob j(ns.c_str());  // build the index on a background job
        n = j.go(ns, d, idx, idxNo);
    }
    tlog() << "done for " << n << " records " << t.millis() / 1000.0 << " secs" << endl;
}
The index build picks a strategy depending on how it was requested (background thread or not). Here I cover the foreground path, the fastBuildIndex method above (pdfile.cpp, line 1101), defined as follows (see the comments for details):
unsigned long long fastBuildIndex( const char * ns, NamespaceDetails * d, IndexDetails & idx, int idxNo) {
    CurOp* op = cc().curop();  // current-operation pointer, used to report progress
    Timer t;
    tlog(1) << "fastBuildIndex " << ns << " idxNo:" << idxNo << ' ' << idx.info.obj().toString() << endl;
    bool dupsAllowed = !idx.unique();
    bool dropDups = idx.dropDups() || inDBRepair;
    BSONObj order = idx.keyPattern();
    getDur().writingDiskLoc(idx.head).Null();
    if (logLevel > 1) printMemInfo("before index start");
    /* phase 1: collect and sort all keys ----- */
    unsigned long long n = 0;
    shared_ptr<Cursor> c = theDataFileMgr.findAll(ns);
    BSONObjExternalSorter sorter(order);
    sorter.hintNumObjects(d->stats.nrecords);
    unsigned long long nkeys = 0;
    ProgressMeterHolder pm(op->setMessage("index: (1/3) external sort", d->stats.nrecords, 10));
    while (c->ok()) {
        BSONObj o = c->current();
        DiskLoc loc = c->currLoc();
        BSONObjSetDefaultOrder keys;
        idx.getKeysFromObject(o, keys);  // extract the index keys from the object
        int k = 0;
        for (BSONObjSetDefaultOrder::iterator i = keys.begin(); i != keys.end(); i++) {
            if (++k == 2) {  // more than one key per document: mark the index multikey
                d->setIndexIsMultikey(idxNo);
            }
            sorter.add(*i, loc);  // feed the key and record location to the external sorter
            nkeys++;
        }
        c->advance();
        n++;
        pm.hit();
        if (logLevel > 1 && n % 10000 == 0) {
            printMemInfo("\t iterating objects");
        }
    };
    pm.finished();
    if (logLevel > 1) printMemInfo("before final sort");
    sorter.sort();
    if (logLevel > 1) printMemInfo("after final sort");
    log(t.seconds() > 5 ? 0 : 1) << "\t external sort used : " << sorter.numFiles() << " files " << " in " << t.seconds() << " secs" << endl;
    list<DiskLoc> dupsToDrop;
    /* phase 2: build the btree */
    {
        BtreeBuilder btBuilder(dupsAllowed, idx);  // the builder object that constructs the btree index
        // BSONObj keyLast;
        auto_ptr<BSONObjExternalSorter::Iterator> i = sorter.iterator();  // iterator over the sorted keys
        assert(pm == op->setMessage("index: (2/3) btree bottom up", nkeys, 10));
        while (i->more()) {
            RARELY killCurrentOp.checkForInterrupt();  // honor shutdown or kill requests
            BSONObjExternalSorter::Data d = i->next();
            try {
                btBuilder.addKey(d.first, d.second);  // add the key and record location to the btree
            }
            catch (AssertionException& e) {
                if (dupsAllowed) {
                    // unknown exception??
                    throw;
                }
                if (e.interrupted())
                    throw;
                if (!dropDups)
                    throw;
                /* we could queue these on disk, but normally there are very few dups, so instead we
                   keep in ram and have a limit.
                */
                dupsToDrop.push_back(d.second);
                uassert(10092, "too may dups on index build with dropDups=true", dupsToDrop.size() < 1000000);
            }
            pm.hit();
        }
        pm.finished();
        op->setMessage("index: (3/3) btree-middle");
        log(t.seconds() > 10 ? 0 : 1) << "\t done building bottom layer, going to commit" << endl;
        btBuilder.commit();  // commit: builds the upper levels and finishes the btree
        wassert(btBuilder.getn() == nkeys || dropDups);
    }
    log(1) << "\t fastBuildIndex dupsToDrop:" << dupsToDrop.size() << endl;
    // delete the duplicate records detected during the build
    for (list<DiskLoc>::iterator i = dupsToDrop.begin(); i != dupsToDrop.end(); i++)
        theDataFileMgr.deleteRecord(ns, i->rec(), *i, false, true);
    return n;
}
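The pipeline fastBuildIndex runs through (scan and extract keys, external sort, feed the sorted keys to the builder) can be sketched with plain integers standing in for BSONObj keys and DiskLoc locations; std::sort stands in for BSONObjExternalSorter, and the fixed leaf capacity is an assumption for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// (key, location) pair; the real code uses BSONObj keys and DiskLoc locations.
using KeyLoc = std::pair<int, std::uint64_t>;

// Sort the extracted pairs, then hand them to the "builder" in order so that
// leaves can be filled left to right without any page splits.
std::vector<std::vector<KeyLoc>> bulkLoadLeaves(std::vector<KeyLoc> keys,
                                                std::size_t leafCapacity) {
    std::sort(keys.begin(), keys.end());       // stands in for the external sort phase
    std::vector<std::vector<KeyLoc>> leaves;
    for (std::size_t i = 0; i < keys.size(); i += leafCapacity) {
        std::size_t end = std::min(i + leafCapacity, keys.size());
        leaves.emplace_back(keys.begin() + i, keys.begin() + end);  // fill one leaf
    }
    return leaves;
}
```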
The commit() call above is implemented in BtreeBuilder (btree.cpp, line 1842); its buildNextLevel method constructs the tree's upper levels from the bottom layer:
void BtreeBuilder::commit() {
    buildNextLevel(first);
    committed = true;
}
void BtreeBuilder::buildNextLevel(DiskLoc loc) {
    int levels = 1;
    while (1) {
        if (loc.btree()->tempNext().isNull()) {
            // only one bucket on this level: it becomes the root
            getDur().writingDiskLoc(idx.head) = loc;
            break;
        }
        levels++;
        DiskLoc upLoc = BtreeBucket::addBucket(idx);  // add a bucket to start the next level up
        DiskLoc upStart = upLoc;
        BtreeBucket* up = upLoc.btreemod();  // modifiable pointer to the upper-level bucket
        DiskLoc xloc = loc;
        while (!xloc.isNull()) {
            RARELY {
                getDur().commitIfNeeded();
                b = cur.btreemod();
                up = upLoc.btreemod();
            }
            BtreeBucket* x = xloc.btreemod();
            BSONObj k;
            DiskLoc r;
            x->popBack(r, k);  // pop the rightmost key of the current bucket
            bool keepX = (x->n != 0);  // does the bucket still hold any keys?
            DiskLoc keepLoc = keepX ? xloc : x->nextChild;
            // push the popped key into the upper bucket; it is the largest key in up so far
            if (!up->_pushBack(r, k, ordering, keepLoc)) {
                // the upper bucket is full, so create a new one with addBucket
                DiskLoc n = BtreeBucket::addBucket(idx);
                up->tempNext() = n;
                upLoc = n;
                up = upLoc.btreemod();
                up->pushBack(r, k, ordering, keepLoc);
            }
            DiskLoc nextLoc = x->tempNext();  // get next in chain at current level
            if (keepX) {  // the bucket survives, so set its parent pointer
                x->parent = upLoc;
            }
            else {
                if (!x->nextChild.isNull())
                    x->nextChild.btreemod()->parent = upLoc;
                x->deallocBucket(xloc, idx);  // the bucket is empty: free it
            }
            xloc = nextLoc;  // move to the next bucket on the current level
        }
        loc = upStart;  // continue one level up
        mayCommitProgressDurably();
    }
    if (levels > 1)
        log(2) << "btree levels: " << levels << endl;
}
The buildNextLevel method above builds a B-tree bottom-up, level by level, from the previously extracted keys. One detail worth noting: MongoDB uses a bucket as the container for an internal or leaf node of the B-tree, and a bucket's maximum size is 8192 bytes.
MongoDB's current documentation on B-tree indexes: http://blog.nosqlfan.com/html/758.html
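buildNextLevel's bottom-up loop can be modeled with plain vectors: each pass pops the last key of every bucket on the current level into parent-level buckets, and the loop repeats until a single root bucket remains. A toy sketch, where the bucket capacity is a parameter (the real limit is the 8192-byte bucket size):

```cpp
#include <cassert>
#include <vector>

using Bucket = std::vector<int>;  // real buckets hold _KeyNode entries

// Build one parent level: one separator (the popped last key) per child bucket,
// starting a new parent bucket whenever the current one is full.
std::vector<Bucket> buildParentLevel(const std::vector<Bucket>& level,
                                     std::size_t bucketCapacity) {
    std::vector<Bucket> parents(1);
    for (const Bucket& child : level) {
        if (parents.back().size() == bucketCapacity)  // parent full: start a new one
            parents.emplace_back();
        parents.back().push_back(child.back());       // promote the child's last key
    }
    return parents;
}

// Repeat until one root bucket remains; returns the number of tree levels.
int treeLevels(std::vector<Bucket> level, std::size_t bucketCapacity) {
    int levels = 1;
    while (level.size() > 1) {
        level = buildParentLevel(level, bucketCapacity);
        ++levels;
    }
    return levels;
}
```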
if ( d -> nIndexes ) {
try {
BSONObj obj(r -> data);
indexRecord(d, obj, loc);
}
......
}
The indexRecord method above adds the keys and data (including the storage location) to the indexes (the parameter d carries the B-tree index information created earlier). It is defined as follows (pdfile.cpp, line 1355):
static void indexRecord(NamespaceDetails * d, BSONObj obj, DiskLoc loc) {
    int n = d->nIndexesBeingBuilt();  // number of indexes already built or being built
    for (int i = 0; i < n; i++) {
        try {
            bool unique = d->idx(i).unique();
            // inline helper: insert the keys of obj into the i-th btree
            _indexRecord(d, i /* index ordinal */, obj, loc, /* dupsAllowed */ !unique);
        }
        catch (DBException&) {
            /* on failure, roll back what was indexed so far.
               note <= i (not < i) is important here as the index we were just attempted
               may be multikey and require some cleanup.
            */
            for (int j = 0; j <= i; j++) {
                try {
                    _unindexRecord(d->idx(j), obj, loc, false);
                }
                catch (...) {
                    log(3) << "unindex fails on rollback after unique failure\n";
                }
            }
            throw;
        }
    }
}
The _indexRecord above is an inline function in pdfile.cpp (see the C++ documentation for the inline keyword), declared as follows:
static inline void _indexRecord(NamespaceDetails * d, int idxNo, BSONObj & obj, DiskLoc recordLoc, bool dupsAllowed) {
    IndexDetails& idx = d->idx(idxNo);
    BSONObjSetDefaultOrder keys;
    idx.getKeysFromObject(obj, keys);  // extract the key values from the object
    BSONObj order = idx.keyPattern();
    Ordering ordering = Ordering::make(order);  // build the Ordering passed down below
    int n = 0;
    for (BSONObjSetDefaultOrder::iterator i = keys.begin(); i != keys.end(); i++) {
        if (++n == 2) {
            d->setIndexIsMultikey(idxNo);  // more than one key per document: mark the index multikey
        }
        assert(!recordLoc.isNull());
        try {
            // insert the key and record location into the btree
            idx.head /* DiskLoc */ .btree() /* BtreeBucket */ ->bt_insert(idx.head, recordLoc,
                    *i, ordering, dupsAllowed, idx);
        }
        catch (AssertionException& e) {
            if (e.getCode() == 10287 && idxNo == d->nIndexes) {
                DEV log() << "info: caught key already in index on bg indexing (ok)" << endl;
                continue;
            }
            if (!dupsAllowed) {
                // duplicate key exception
                throw;
            }
            problem() << "caught assertion _indexRecord " << idx.indexNamespace() << endl;
        }
    }
}
The method above ultimately executes the btree insert method bt_insert (btree.cpp, line 1622), shown below (see the comments for details):
int BtreeBucket::bt_insert( const DiskLoc thisLoc, const DiskLoc recordLoc,
                            const BSONObj & key, const Ordering & order, bool dupsAllowed,
                            IndexDetails & idx, bool toplevel) const {
    if (toplevel) {  // toplevel==true when called from outside the recursion (e.g. during an index build)
        // reject keys that are too large (they are stored in system.indexes);
        // KeyMax = 8192 / 10, a limit the MongoDB team may raise in later versions
        if (key.objsize() > KeyMax) {
            problem() << "Btree::insert: key too large to index, skipping " << idx.indexNamespace() << ' ' << key.objsize() << ' ' << key.toString() << endl;
            return 3;
        }
    }
    // perform the insert
    int x = _insert(thisLoc, recordLoc, key, order, dupsAllowed, DiskLoc(), DiskLoc(), idx);
    assertValid(order);  // assert the bucket is still valid under this ordering
    return x;
}
This in turn calls the internal method _insert in btree.cpp (line 1554):
int BtreeBucket::_insert( const DiskLoc thisLoc, const DiskLoc recordLoc,
                          const BSONObj & key, const Ordering & order, bool dupsAllowed,
                          const DiskLoc lChild, const DiskLoc rChild, IndexDetails & idx) const {
    if (key.objsize() > KeyMax) {
        problem() << "ERROR: key too large len:" << key.objsize() << " max:" << KeyMax << ' ' << key.objsize() << ' ' << idx.indexNamespace() << endl;
        return 2;
    }
    assert(key.objsize() > 0);
    int pos;
    // binary search within this btree bucket: is the key already in the index?
    bool found = find(idx, key, recordLoc, order, pos /* out: where the key is, or should be inserted */, !dupsAllowed);
    if (insert_debug) {
        out() << "    " << thisLoc.toString() << '.' << "_insert " <<
              key.toString() << '/' << recordLoc.toString() <<
              " l:" << lChild.toString() << " r:" << rChild.toString() << endl;
        out() << "    found:" << found << " pos:" << pos << " n:" << n << endl;
    }
    if (found) {
        const _KeyNode& kn = k(pos);  // the existing key node (_KeyNode) at that position
        if (kn.isUnused()) {  // the key exists but is marked unused: reclaim it
            log(4) << "btree _insert: reusing unused key" << endl;
            massert(10285, "_insert: reuse key but lchild is not null", lChild.isNull());
            massert(10286, "_insert: reuse key but rchild is not null", rChild.isNull());
            kn.writing().setUsed();
            return 0;
        }
        DEV {
            log() << "_insert(): key already exists in index (ok for background:true)\n";
            log() << "  " << idx.indexNamespace() << " thisLoc:" << thisLoc.toString() << '\n';
            log() << "  " << key.toString() << '\n';
            log() << "  " << "recordLoc:" << recordLoc.toString() << " pos:" << pos << endl;
            log() << "  old l r: " << childForPos(pos).toString() << ' ' << childForPos(pos + 1).toString() << endl;
            log() << "  new l r: " << lChild.toString() << ' ' << rChild.toString() << endl;
        }
        alreadyInIndex();  // the key is already in the index and need not be created; throws an assertion
    }
    DEBUGGING out() << "TEMP: key: " << key.toString() << endl;
    DiskLoc child = childForPos(pos);  // look up the child at pos to find where to insert
    if (insert_debug)
        out() << "    getChild(" << pos << "): " << child.toString() << endl;
    if (child.isNull() || !rChild.isNull() /* 'internal' insert into the current bucket */) {
        insertHere(thisLoc, pos, recordLoc, key, order, lChild, rChild, idx);  // insert into the current bucket
        return 0;
    }
    // otherwise recurse into the child bucket
    return child.btree()->bt_insert(child, recordLoc, key, order, dupsAllowed, idx, /* toplevel */ false);
}
The _insert method above first runs a binary search to check whether the key is already in the index, getting back an insertion point (pos). If the key is absent, it inspects the child at that position to decide whether to insert into the current bucket or into the (right) child bucket at pos (which recursively calls bt_insert again). Assuming the insert lands in the current bucket, insertHere is executed (btree.cpp, line 1183), defined as follows:
/**
 * insert a key in this bucket, splitting if necessary.
 * @keypos - where to insert the key in range 0..n.  0=make leftmost, n=make rightmost.
 * NOTE this function may free some data, and as a result the value passed for keypos may
 * be invalid after calling insertHere()
 */
void BtreeBucket::insertHere( const DiskLoc thisLoc, int keypos,
                              const DiskLoc recordLoc, const BSONObj & key, const Ordering & order,
                              const DiskLoc lchild, const DiskLoc rchild, IndexDetails & idx) const {
    if (insert_debug)
        out() << "   " << thisLoc.toString() << ".insertHere " << key.toString() << '/' << recordLoc.toString() << ' '
              << lchild.toString() << ' ' << rchild.toString() << " keypos:" << keypos << endl;
    DiskLoc oldLoc = thisLoc;
    // place the key at keypos and memcpy its data into the bucket
    if (!basicInsert(thisLoc, keypos, recordLoc, key, order)) {
        // the insert failed because the bucket is full: split it, distributing
        // the keys over this bucket and a newly created one
        thisLoc.btreemod()->split(thisLoc, keypos, recordLoc, key, order, lchild, rchild, idx);
        return;
    }
    {   // fix up the child pointers around the inserted key, depending on whether
        // it became the last key in the bucket
        const _KeyNode* _kn = &k(keypos);
        _KeyNode* kn = (_KeyNode*) getDur().alreadyDeclared((_KeyNode*) _kn);  // already declared intent in basicInsert()
        if (keypos + 1 == n) {  // n is the number of packed keys; equality means the new key is the last one
            if (nextChild != lchild) {  // for the last key, the bucket's rightmost child must equal the new key's left child
                out() << "ERROR nextChild != lchild" << endl;
                out() << "  thisLoc: " << thisLoc.toString() << ' ' << idx.indexNamespace() << endl;
                out() << "  keyPos: " << keypos << " n:" << n << endl;
                out() << "  nextChild: " << nextChild.toString() << " lchild: " << lchild.toString() << endl;
                out() << "  recordLoc: " << recordLoc.toString() << " rchild: " << rchild.toString() << endl;
                out() << "  key: " << key.toString() << endl;
                dump();
                assert(false);
            }
            kn->prevChildBucket = nextChild;  // the old rightmost child becomes the new key's left child
            assert(kn->prevChildBucket == lchild);
            nextChild.writing() = rchild;  // and the new key's right child becomes the bucket's rightmost child
            if (!rchild.isNull())  // if there is a right child, point its parent at thisLoc
                rchild.btree()->parent.writing() = thisLoc;
        }
        else {
            // the new key is not the last one in the bucket
            kn->prevChildBucket = lchild;  // its left child is lchild
            if (k(keypos + 1).prevChildBucket != lchild) {  // which must equal the left child of the following key
                out() << "ERROR k(keypos+1).prevChildBucket != lchild" << endl;
                out() << "  thisLoc: " << thisLoc.toString() << ' ' << idx.indexNamespace() << endl;
                out() << "  keyPos: " << keypos << " n:" << n << endl;
                out() << "  k(keypos+1).pcb: " << k(keypos + 1).prevChildBucket.toString() << " lchild: " << lchild.toString() << endl;
                out() << "  recordLoc: " << recordLoc.toString() << " rchild: " << rchild.toString() << endl;
                out() << "  key: " << key.toString() << endl;
                dump();
                assert(false);
            }
            const DiskLoc* pc = &k(keypos + 1).prevChildBucket;  // left-child slot of the following key
            *getDur().alreadyDeclared((DiskLoc*) pc) = rchild;  // rebind it to the new key's right child; declared in basicInsert()
            if (!rchild.isNull())  // if there is a right child, point its parent at thisLoc
                rchild.btree()->parent.writing() = thisLoc;
        }
        return;
    }
}
insertHere relies on a method called basicInsert, which places the key at the given position (keypos) in the current bucket and declares the corresponding write intents for durability:
bool BucketBasics::basicInsert( const DiskLoc thisLoc, int & keypos, const DiskLoc recordLoc, const BSONObj & key, const Ordering & order) const {
    assert(keypos >= 0 && keypos <= n);
    // check whether the bucket's free space can hold the new entry
    int bytesNeeded = key.objsize() + sizeof(_KeyNode);
    if (bytesNeeded > emptySize) {
        _pack(thisLoc, order, keypos);  // not enough: pack the bucket to reclaim space
        if (bytesNeeded > emptySize)    // still not enough: give up, the caller will split
            return false;
    }
    BucketBasics* b;  // writable view of this bucket (provides the basic storage operations such as insert and _pack)
    {
        const char* p = (const char*) &k(keypos);
        const char* q = (const char*) &k(n + 1);
        // declare that we will write to [k(keypos),k(n)]
        // todo: this writes a medium amount to the journal.  we may want to add a verb "shift" to the redo log so
        // we can log a very small amount.
        b = (BucketBasics*) getDur().writingAtOffset((void*) this, p - (char*) this, q - p);
        // shift the keys from keypos onwards one slot to the right to make room,
        // e.g. n==3, keypos==2
        //   1 4 9
        // ->
        //   1 4 _ 9
        for (int j = n; j > keypos; j--)  // make room
            b->k(j) = b->k(j - 1);
    }
    getDur().declareWriteIntent(&b->emptySize, 12);  // [b->emptySize..b->n] is 12 bytes and we are going to write those
    b->emptySize -= sizeof(_KeyNode);  // shrink the bucket's remaining free space
    b->n++;                            // one more key in the bucket
    _KeyNode& kn = b->k(keypos);
    kn.prevChildBucket.Null();         // the new key starts with no left child
    kn.recordLoc = recordLoc;          // bind the record location
    kn.setKeyDataOfs((short) b->_alloc(key.objsize()));  // allocate key data space and store its offset
    char* p = b->dataAt(kn.keyDataOfs());  // pointer to the key data area at the stored offset
    getDur().declareWriteIntent(p, key.objsize());  // declare write intent for the key data
    memcpy(p, key.objdata(), key.objsize());  // copy the key bytes into place
    return true;
}
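The make-room loop of basicInsert can be sketched with a vector of ints; a false return models the bucket-full case that makes insertHere fall back to split():

```cpp
#include <cassert>
#include <vector>

// Shift entries [keypos, size) one slot to the right, then write the new entry
// into the freed slot, mirroring the b->k(j) = b->k(j-1) loop above.
bool basicInsertToy(std::vector<int>& bucket, std::size_t capacity,
                    std::size_t keypos, int key) {
    if (bucket.size() >= capacity)      // no space even after packing: caller must split
        return false;
    bucket.push_back(0);                // grow by one slot (n++ in the real code)
    for (std::size_t j = bucket.size() - 1; j > keypos; --j)
        bucket[j] = bucket[j - 1];      // make room
    bucket[keypos] = key;
    return true;
}
```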
If basicInsert returns false, the current bucket has no room left for the new key even after packing. The system then calls the split method (btree.cpp, line 1240) to divide the bucket, creating a new one and moving part of the keys into it:
void BtreeBucket::split( const DiskLoc thisLoc, int keypos, const DiskLoc recordLoc, const BSONObj & key, const Ordering & order, const DiskLoc lchild, const DiskLoc rchild, IndexDetails & idx) {
    assertWritable();
    if (split_debug)
        out() << "    " << thisLoc.toString() << ".split" << endl;
    int split = splitPos(keypos);  // pick the position at which to split
    DiskLoc rLoc = addBucket(idx);  // add a new BtreeBucket (the right half)
    BtreeBucket* r = rLoc.btreemod();
    if (split_debug)
        out() << "     split:" << split << ' ' << keyNode(split).key.toString() << " n:" << n << endl;
    for (int i = split + 1; i < n; i++) {
        KeyNode kn = keyNode(i);
        r->pushBack(kn.recordLoc, kn.key, order, kn.prevChildBucket);  // migrate the upper keys to the new bucket
    }
    r->nextChild = nextChild;  // the new bucket inherits the rightmost child
    r->assertValid(order);
    if (split_debug)
        out() << "     new rLoc:" << rLoc.toString() << endl;
    r = 0;
    rLoc.btree()->fixParentPtrs(rLoc);  // fix the parent pointers of the migrated children
    {
        KeyNode splitkey = keyNode(split);  // the key stored at the split position
        nextChild = splitkey.prevChildBucket;  // the promoted splitkey's children will be thisLoc (l) and rLoc (r)
        if (split_debug) {
            out() << "    splitkey key:" << splitkey.key.toString() << endl;
        }
        // promote splitkey into the parent
        if (parent.isNull()) {
            // there is no parent, so create a new root bucket
            DiskLoc L = addBucket(idx);
            BtreeBucket* p = L.btreemod();
            p->pushBack(splitkey.recordLoc, splitkey.key, order, thisLoc);
            p->nextChild = rLoc;  // the new right bucket becomes the root's rightmost child
            p->assertValid(order);
            parent = idx.head.writing() = L;  // the new bucket holding splitkey becomes the root
            if (split_debug)
                out() << "    we were root, making new root:" << hex << parent.getOfs() << dec << endl;
            rLoc.btree()->parent.writing() = parent;
        }
        else {
            // set this before calling _insert - if it splits it will do fixParent() logic and change the value.
            rLoc.btree()->parent.writing() = parent;
            if (split_debug)
                out() << "    promoting splitkey key " << splitkey.key.toString() << endl;
            // promote splitkey; its left child is thisLoc, its right child rLoc
            parent.btree()->_insert(parent, splitkey.recordLoc, splitkey.key, order, /* dupsallowed */ true, thisLoc, rLoc, idx);
        }
    }
    int newpos = keypos;
    // truncate this bucket to the keys below the split point (this also packs it, reclaiming space)
    truncateTo(split, order, newpos);  // note this may trash splitkey.key.  thus we had to promote it before finishing up here.
    // add our new key, there is room now
    {
        if (keypos <= split) {  // the new key belongs in the left (current) bucket
            if (split_debug)
                out() << "  keypos<split, insertHere() the new key" << endl;
            insertHere(thisLoc, newpos, recordLoc, key, order, lchild, rchild, idx);  // insert into the current bucket again
        }
        else {  // otherwise the new key belongs in the newly created right bucket
            int kp = keypos - split - 1;
            assert(kp >= 0);
            rLoc.btree()->insertHere(rLoc, kp, recordLoc, key, order, lchild, rchild, idx);
        }
    }
    if (split_debug)
        out() << "     split end " << hex << thisLoc.getOfs() << dec << endl;
}
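The core of split() — keep the lower half, promote the median, migrate the upper half to a new right bucket — can be sketched as follows (plain ints stand in for keys; child pointers and durability are omitted):

```cpp
#include <cassert>
#include <vector>

struct SplitResult {
    std::vector<int> left, right;
    int promoted;
};

// Divide a full bucket at its median: the lower half stays in this bucket
// (truncateTo in the real code), the median is promoted to the parent, and the
// upper half migrates to the new right bucket.
SplitResult splitBucket(const std::vector<int>& full) {
    std::size_t mid = full.size() / 2;                  // splitPos() picks roughly the middle
    SplitResult r;
    r.left.assign(full.begin(), full.begin() + mid);    // keys kept in this bucket
    r.promoted = full[mid];                             // splitkey, pushed up to the parent
    r.right.assign(full.begin() + mid + 1, full.end()); // keys migrated to the new bucket
    return r;
}
```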
That wraps up today's walkthrough. The next article will cover MongoDB's execution flow and implementation when a client issues a Delete operation.
Original link: http://www.cnblogs.com/daizhj/archive/2011/03/30/1999699.html
Author: daizhj (代震軍)
Weibo: http://t.sina.com.cn/daizhj
Tags: mongodb, c++, btree