redis重做從庫時報Connection with master lost錯誤

背景:之前同事初始化redis實例的時候把monitor命令給rename了,今天剛好這組實例有問題需要通過monitor命令來幫助分析問題,結果發現monitor命令用不了,因爲redis命令重命名不支持動態修改,於是就想刪除monitor命令後趕快重啓redis從庫。重啓完後,發現主從複製狀態一直處於down的狀態,於是開始排查問題

先看redis從庫的日誌:

43037:S 20 Apr 06:12:38.047 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:12:38.047
Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:12:38.048 Master replied to PING, replication can continue...
43037:S 20 Apr 06:12:38.048
Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:12:39.112 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885061114038
43037:S 20 Apr 06:19:16.258
MASTER <-> SLAVE sync: receiving 20819568576 bytes from master
43037:S 20 Apr 06:20:08.203 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:24:36.036
MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:28:18.538 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:28:19.782
Background append only file rewriting started by pid 173002
43037:S 20 Apr 06:28:19.982 # Connection with master lost.
43037:S 20 Apr 06:28:19.982 Caching the disconnected master state.
43037:S 20 Apr 06:28:20.984
Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:28:20.985 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:28:20.985
Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:28:20.985 Master replied to PING, replication can continue...
43037:S 20 Apr 06:28:20.986
Trying a partial resynchronization (request 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885062375607).
43037:S 20 Apr 06:28:22.073 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885240485270
43037:S 20 Apr 06:28:22.073
Discarding previously cached master state.
43037:S 20 Apr 06:33:30.800 # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.
43037:S 20 Apr 06:33:30.801 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:33:30.802
MASTER <-> SLAVE sync started
43037:S 20 Apr 06:33:30.802 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:33:30.802
Master replied to PING, replication can continue...
43037:S 20 Apr 06:33:30.803 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:34:27.458
AOF rewrite child asks to stop sending diffs.
173002:C 20 Apr 06:34:27.470 Parent agreed to stop sending diffs. Finalizing AOF...
173002:C 20 Apr 06:34:27.470
Concatenating 1.19 MB of AOF diff received from parent.
173002:C 20 Apr 06:34:27.477 SYNC append only file rewrite performed
173002:C 20 Apr 06:34:28.757
AOF rewrite: 119 MB of memory used by copy-on-write
43037:S 20 Apr 06:34:29.961 Background AOF rewrite terminated with success
43037:S 20 Apr 06:34:29.961
Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
43037:S 20 Apr 06:34:29.961 Background AOF rewrite finished successfully
43037:S 20 Apr 06:35:08.471
Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885393456775
43037:S 20 Apr 06:41:49.489 MASTER <-> SLAVE sync: receiving 20821399746 bytes from master
43037:S 20 Apr 06:42:28.911 # I/O error trying to sync with MASTER: connection lost
43037:S 20 Apr 06:42:32.642
Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:42:32.646 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:42:32.646
Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:42:32.647 Master replied to PING, replication can continue...
43037:S 20 Apr 06:42:32.647
Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:42:33.755 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885541422071
43037:S 20 Apr 06:49:15.956
MASTER <-> SLAVE sync: receiving 20821403571 bytes from master
43037:S 20 Apr 06:50:16.781 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:54:23.078
MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:58:10.123 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:58:11.317
Background append only file rewriting started by pid 223387
43037:S 20 Apr 06:58:11.536 # Connection with master lost.
43037:S 20 Apr 06:58:11.536 * Caching the disconnected master state.

再來看一下redis主庫的日誌:

304369:M 20 Apr 05:13:00.197 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:13:00.197
Full resync requested by slave 10.93.157.16:6383
304369:M 20 Apr 05:13:00.197 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:13:00.902
Background saving started by pid 366254
366254:C 20 Apr 05:19:14.460 DB saved on disk
366254:C 20 Apr 05:19:14.961
RDB: 4613 MB of memory used by copy-on-write
304369:M 20 Apr 05:19:15.579 Background saving terminated with success
304369:M 20 Apr 05:20:18.303
Synchronization with slave 10.93.157.16:6383 succeeded
304369:M 20 Apr 05:22:32.216 # Client id=1768461 addr=10.93.157.16:26912 fd=10 name= age=572 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=11450 omem=278945640 events=rw cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:22:32.216 # Connection with slave 10.93.157.16:6383 lost.
304369:M 20 Apr 05:28:27.651 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:28:27.651
Unable to partial resync with slave 10.93.157.16:6383 for lack of backlog (Slave request was: 884116305579).
304369:M 20 Apr 05:28:27.651 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:28:28.356
Background saving started by pid 396084
304369:M 20 Apr 05:32:37.471 # Client id=1768945 addr=10.93.157.16:21854 fd=22 name= age=250 idle=250 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16366 oll=11126 omem=271085384 events=r cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:32:37.538 # Connection with slave 10.93.157.16:6383 lost.

從日誌中我們可以看到在redis主庫在接到從庫要求重新同步數據的時候先生成一個rdb文件,再通過psync來做部分同步,可以看到問題就出在部分同步這一塊,由於client-output-buffer-limit值設置太小,導致導數發送失敗。失敗後,從庫繼續發起數據同步的請求,但是每次都失敗,redis主庫反覆地生成rdb文件,雖然是有fork一個子進程,但是對redis主庫的吞吐量是有一定影響的。

現在我們來解決這個問題,先看一下目前 client-output-buffer-limit的大小:
127.0.0.1:6385> config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
可以看到,目前的限制是最大256M和60s內不超過64M,從我們上面日誌信息可以看到psync的數據明顯是大於256M的。我們進行如下設置把很限制調大:
config set client-output-buffer-limit 'slave 1073741824 268435456 60'
調完後再觀察,發現從庫的的複製狀態很快就變成了up。
我們上面的調整是加大複製輸出緩衝區,還有一個辦法就是關閉複製輸出緩衝區的限制:
config set client-output-buffer-limit 'slave 0 0 0'

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章