Incident case: another bug when MongoDB writeConcern is set to majority

Preface

A previous article already covered one pitfall with majority, though that one did not really qualify as a bug; see that earlier post for details.

Symptoms

With w set to majority, the application layer kept reporting errors, yet the data was actually written successfully on both the primary and the secondaries; after setting w to 1 there were no errors and the write succeeded.

044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.setWriteConcern({w:"majority",wtimeout:3000})
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.getWriteConcern()
WriteConcern({ "w" : "majority", "wtimeout" : 3000 })
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.things.insert({name:"123"})
WriteResult({
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
})


044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.setWriteConcern({w:1,wtimeout:3000,j:true})
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.getWriteConcern()
WriteConcern({ "w" : 1, "wtimeout" : 3000, "j" : true })
044dd16e-7706-4a49-b4ff-73d86a99d6fd:PRIMARY> db.things.insert({dingshun:"123"})
WriteResult({ "nInserted" : 1 })
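Note that in the first result the insert itself still succeeds (nInserted is 1); only the majority acknowledgement times out. Below is a minimal mongo shell sketch of telling a failed write apart from a write that merely missed its write concern; it assumes the shell's WriteResult helpers hasWriteError()/hasWriteConcernError() are available in your shell version, and the test document is invented for illustration.

// Hypothetical check, run from the mongo shell against the same collection.
var res = db.things.insert({name: "wc-check"});   // made-up test document
if (res.hasWriteError()) {
    // the document was not written at all
    print("write failed: " + res.getWriteError().errmsg);
} else if (res.hasWriteConcernError()) {
    // nInserted is 1 here: the document is on the primary, but the
    // majority acknowledgement timed out (code 64 above)
    print("write concern not satisfied: " + res.getWriteConcernError().errmsg);
} else {
    print("write fully acknowledged");
}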

Root cause

Checking the mongod log on the primary, I saw several repeated messages (the "couldn't find a slave with id" warning quoted in the issue summary below), which made me suspect a bug.


A check of the official site confirmed that this is indeed a bug in older versions, and my database version happened to be 2.6.1. Link: https://jira.mongodb.org/browse/SERVER-15849

ISSUE SUMMARY
On a replica set that uses chained replication, if a secondary with id M that syncs from a secondary with id N is removed, node N continues to forward replication information about M to the primary.

USER IMPACT
The following message appears repeatedly in the primary's logfile:

replset couldn't find a slave with id M

If the removed node is required to meet a specific write concern, write operations with that write concern will wait indefinitely unless a wtimeout was specified.

On a sharded cluster, during a chunk migration the destination shard will wait for the final writes to be replicated to the majority of the nodes. If the write concern cannot be satisfied and a wtimeout was not specified, the chunk migration times out after 60 minutes.

WORKAROUNDS
There is no workaround for this issue.

AFFECTED VERSIONS
MongoDB 2.6 versions up to 2.6.5 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.6 production release.

RESOLUTION DETAILS
Secondaries no longer forward replication progress for nodes that are no longer part of a replica set


Fixed in version 2.6.6 and later.

Solution

Restart all the nodes so that the internal cached replication state is refreshed, or upgrade to a newer version; upgrading is the recommended option.
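A rough mongo shell sketch for deciding between the two options follows; the version threshold simply mirrors the 2.6.6 fix version quoted above, and the stepDown call is only a reminder of the usual rolling-restart order, not part of the bug fix itself.

// Run on each member from the mongo shell.
var parts = db.version().split(".").map(Number);   // e.g. [2, 6, 1]
var affected = (parts[0] == 2 && parts[1] == 6 && parts[2] < 6);
print(db.version() + (affected ? " is affected by SERVER-15849" : " is not affected"));

// If you go the restart route, step the primary down first so a
// secondary takes over before the mongod process is stopped:
// rs.stepDown(60);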

Another approach, which I have not tried but guess might work, is simply to renumber the _id values in rs.config so that they are consecutive (1, 2, 3, ...), leaving no non-contiguous id values.
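Only the inspection half of that idea is sketched below; it lists the current member _id values from the shell without changing anything. Keep in mind that, as far as I know, a member's _id cannot be changed in place through rs.reconfig(), so renumbering would mean removing the member and re-adding it, which is another reason to try this untested idea on a test replica set first.

// Inspection only: list member _id values so gaps are easy to spot.
var cfg = rs.conf();
cfg.members.forEach(function (m) {
    print("_id: " + m._id + "  host: " + m.host);
});
// Renumbering (untested guess): remove the member and re-add it with a
// consecutive _id, e.g. rs.remove("host:port") followed by rs.add(...),
// rather than editing _id in place and calling rs.reconfig(cfg).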
