Troubleshooting case: another bug when MongoDB writeConcern is set to majority

Preface

An earlier article mentioned one pitfall of majority, though that one doesn't quite count as a bug; see that article for details.

Symptoms

With w set to majority, the application layer kept reporting errors even though the writes actually succeeded, on both the primary and the secondaries; after switching w to 1, the writes succeeded with no errors.

044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.setWriteConcern({w:"majority",wtimeout:3000})
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.getWriteConcern()
WriteConcern({ "w" : "majority", "wtimeout" : 3000 })
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.things.insert({name:"123"})
WriteResult({
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
})


044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.setWriteConcern({w:1,wtimeout:3000,j:true})
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.getWriteConcern()
WriteConcern({ "w" : 1, "wtimeout" : 3000, "j" : true })
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.things.insert({dingshun:"123"})
WriteResult({ "nInserted" : 1 }) 
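As an aside, the write concern can also be supplied per operation instead of through db.setWriteConcern(); this is a minimal sketch against the same test collection (the inserted document here is arbitrary):

// Per-operation write concern (mongo shell, 2.6+); equivalent to the
// connection-level default set with db.setWriteConcern() above.
db.things.insert(
    { name: "456" },
    { writeConcern: { w: "majority", wtimeout: 3000 } }
)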

Root cause

A look at the primary's mongod log turned up several repeated messages of the form "replset couldn't find a slave with id ..." (the same line quoted in the issue below), which pointed to a bug.

A search of the official tracker confirmed it is indeed a bug in older versions, and my db version happens to be 2.6.1: https://jira.mongodb.org/browse/SERVER-15849

ISSUE SUMMARY
On a replica set that uses chained replication, if a secondary with id M that syncs from a secondary with id N is removed, node N continues to forward replication information about M to the primary.

USER IMPACT
The following message appears repeatedly in the primary's logfile:

replset couldn't find a slave with id M

If the removed node is required to meet a specific write concern, write operations with that write concern will wait indefinitely unless a wtimeout was specified.

On a sharded cluster, during a chunk migration the destination shard will wait for the final writes to be replicated to the majority of the nodes. If the write concern cannot be satisfied and a wtimeout was not specified, the chunk migration times out after 60 minutes.

WORKAROUNDS
There is no workaround for this issue.

AFFECTED VERSIONS
MongoDB 2.6 versions up to 2.6.5 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.6 production release.

RESOLUTION DETAILS
Secondaries no longer forward replication progress for nodes that are no longer part of a replica set.


The fix ships in 2.6.6 and later.
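Before upgrading, you can check whether a replica set matches the trigger condition described above: chained replication plus a removed member that left a gap in the _id sequence. A minimal sketch from the mongo shell; the syncingTo field is how 2.6 reports a member's sync source:

// Print each member's _id, host and sync source as seen by rs.status();
// a secondary whose sync source is another secondary means chaining.
rs.status().members.forEach(function (m) {
    print(m._id + "\t" + m.name + "\t" + (m.syncingTo || "-"));
});

// Print the configured members; a gap in the _id sequence suggests a
// member was removed at some point.
rs.conf().members.forEach(function (m) {
    print(m._id + "\t" + m.host);
});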

Fix

Restart all nodes so that the stale internal replication state gets flushed, or upgrade to a newer version; upgrading is the recommended route.
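For the restart route, a minimal sketch of cleanly bouncing a node from the shell; the mongod process itself then has to be started again from the OS side. Do the secondaries first and the primary last:

// On each secondary: shutdownServer() must be run against admin.
db.getSiblingDB("admin").shutdownServer()

// On the primary: hand off leadership first, then shut down.
rs.stepDown(60)   // step down and stay ineligible for 60 seconds
db.getSiblingDB("admin").shutdownServer()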

Another approach, which I have not tried but suspect might work, is simply to renumber the _id field of the members in rs.config to sequential values (1, 2, 3, ...) so that there are no gaps in the id sequence; a sketch follows below.
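For the record, this is what that untried idea could look like. Treat it as speculative: a reconfig that changes an existing member's _id is effectively a remove-and-re-add of that member, so it may be rejected or trigger a brief election.

// Untested, as noted above: renumber members so the _id sequence has
// no gaps, then push the new config.
cfg = rs.conf()
cfg.members.forEach(function (m, i) { m._id = i; })   // 0, 1, 2, ...
rs.reconfig(cfg)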
