Failure symptoms
In a replica set with four nodes (one primary, two secondaries, and one arbiter), after one of the secondaries is shut down, changing the password on the primary hangs until the command fails with a timeout.
udb-aqmp5a:PRIMARY> db.changeUserPassword("root","123123")
2016-08-23T17:05:30.879+0800 E QUERY    Error: Updating user failed: timeout
    at Error (<anonymous>)
    at DB.updateUser (src/mongo/shell/db.js:1152:11)
    at DB.changeUserPassword (src/mongo/shell/db.js:1156:10)
    at (shell):1:4 at src/mongo/shell/db.js:1152
Root cause
Check the MongoDB error log:
2016-08-19T12:37:08.897+0800 W NETWORK  [ReplExecNetThread-12] Failed to connect to 10.19.66.62:27017, reason: errno:115 Operation now in progress
2016-08-19T12:37:08.897+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 10.19.66.62:27017; Location18915 Failed attempt to connect to 10.19.66.62:27017; couldn't connect to server 10.19.66.62:27017 (10.19.66.62), connection attempt failed
2016-08-19T12:37:15.524+0800 I COMMAND  [conn601] command admin.$cmd command: getLastError { getLastError: 1, w: "majority", wtimeout: 30000.0 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:270 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 1, W: 2 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } 30001ms
As the log shows, the password change is issued with write concern w: "majority", and in this state it cannot satisfy the "majority" requirement. It seems the majority is calculated over all voting members, arbiter included, yet the arbiter holds no data and can never acknowledge a write, so it permanently counts against the write concern.
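As a rough illustration of that reasoning (my own sketch, not MongoDB's actual implementation; the variable names are made up), the count works out like this:

// Sketch only: assumes "majority" is computed over all voting members,
// arbiter included, which is what the observed behaviour suggests.
var votingMembers = 4;                            // primary + 2 secondaries + arbiter
var majority = Math.floor(votingMembers / 2) + 1; // = 3
// Members that can actually acknowledge the write in this failure state:
// the primary and the single healthy secondary (the arbiter carries no data).
var ackable = 2;
print("majority needed:", majority, "| ackable:", ackable, "| satisfiable:", ackable >= majority);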
Reproduction
Setup: one primary, two secondaries, and one arbiter, with one of the secondaries shut down.
Method 1: issue ordinary writes, e.g. insert one document into a database, while varying the w value of the write concern.
udb-aqmp5a:PRIMARY> rs.status()
{
    "set" : "udb-aqmp5a",
    "date" : ISODate("2016-08-23T09:04:03.018Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "10.9.46.198:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 308,
            "optime" : Timestamp(1471942975, 1),
            "optimeDate" : ISODate("2016-08-23T09:02:55Z"),
            "electionTime" : Timestamp(1471942751, 2),
            "electionDate" : ISODate("2016-08-23T08:59:11Z"),
            "configVersion" : 4,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "10.9.56.132:27017",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2016-08-23T09:03:51.769Z"),
            "lastHeartbeatRecv" : ISODate("2016-08-23T09:03:39.757Z"),
            "pingMs" : 0,
            "lastHeartbeatMessage" : "DBClientBase::findN: transport error: 10.9.56.132:27017 ns: admin.$cmd query: { replSetHeartbeat: \"udb-aqmp5a\", pv: 1, v: 4, from: \"10.9.46.198:27017\", fromId: 0, checkEmpty: false }",
            "configVersion" : -1
        },
        {
            "_id" : 2,
            "name" : "10.9.48.72:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 84,
            "optime" : Timestamp(1471942975, 1),
            "optimeDate" : ISODate("2016-08-23T09:02:55Z"),
            "lastHeartbeat" : ISODate("2016-08-23T09:04:01.766Z"),
            "lastHeartbeatRecv" : ISODate("2016-08-23T09:04:02.802Z"),
            "pingMs" : 0,
            "syncingTo" : "10.9.56.132:27017",
            "configVersion" : 4
        },
        {
            "_id" : 3,
            "name" : "10.9.51.198:27017",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 67,
            "lastHeartbeat" : ISODate("2016-08-23T09:04:01.794Z"),
            "lastHeartbeatRecv" : ISODate("2016-08-23T09:04:01.363Z"),
            "pingMs" : 0,
            "configVersion" : 4
        }
    ],
    "ok" : 1
}
udb-aqmp5a:PRIMARY> db.test.insert({name:"jason.jiang"},{writeConcern:{w:2,wtimeout:5000}})
WriteResult({ "nInserted" : 1 })
This write succeeds because w: 2 only requires two members of the replica set to acknowledge the write (the primary plus the healthy secondary), which is still possible, so success is returned.
udb-aqmp5a:PRIMARY> db.test.insert({name:"jason.jiang1"},{writeConcern:{w:3,wtimeout:5000}})
WriteResult({
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
})
Here the document is still inserted on the primary (nInserted is 1), but the write concern cannot be satisfied: w: 3 requires three members to acknowledge the write, yet the arbiter can never acknowledge writes and one of the secondaries is down, so at most two acknowledgements are possible.
udb-aqmp5a:PRIMARY> db.test.insert({name:"jason.jiang2"},{writeConcern:{w:"majority",wtimeout:5000}})
WriteResult({
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
})
This also fails for the same reason: with four voting members, "majority" works out to three acknowledgements, effectively the same as w: 3, which cannot be satisfied.
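For comparison (an assumed sanity check that was not part of the original test), a write that only asks for the primary's acknowledgement would be expected to succeed in the same state:

// Hypothetical check: w:1 only needs the primary's acknowledgement, so the
// down secondary and the arbiter should not matter.
db.test.insert({name: "jason.jiang3"}, {writeConcern: {w: 1, wtimeout: 5000}})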
Summary
Personally I think this is a questionable part of MongoDB's design, and one that easily leads to confusion. The arbiter plays a positive, active role when the primary goes down and a new primary must be elected; but once the write concern is set to majority, the arbiter, which can never write data itself, plays a permanently negative role. In a topology like this one (1 primary, 2 secondaries, 1 arbiter), if the primary goes down a new primary can still be elected; but if one secondary goes down, writes with w: "majority" can no longer succeed, even though two of the three data-bearing nodes are healthy. (In a production environment the arbiter is completely redundant in this topology anyway, and it is not normally deployed this way.)
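If you hit this state and still need to change the password, one possible workaround (a sketch, relying on the fact that MongoDB's user management commands accept an explicit writeConcern document; the values shown are illustrative) is to call the updateUser command with a weaker write concern, or to remove the arbiter, which is redundant in this topology, so that "majority" drops back to 2 of 3:

// Sketch of a workaround: issue updateUser with an explicit, weaker
// writeConcern instead of relying on the shell helper's default of majority.
db.getSiblingDB("admin").runCommand({
    updateUser: "root",
    pwd: "123123",
    writeConcern: { w: 1, wtimeout: 5000 }
})

// Alternatively, drop the redundant arbiter so that the voting membership
// becomes 3 and "majority" becomes 2, which the two healthy data nodes satisfy.
rs.remove("10.9.51.198:27017")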
Reposted from: http://blog.csdn.net/cug_jiang126com/article/details/52251312