redis重做從庫時報Connection with master lost錯誤

背景：之前同事初始化redis實例的時候把monitor命令給rename了，今天剛好這組實例有問題需要通過monitor命令來幫助分析問題，結果發現monitor命令用不了，因爲redis命令重命名不支持動態修改，於是就想刪除monitor命令後趕快重啓redis從庫。重啓完後，發現主從複製狀態一直處於down的狀態，於是開始排查問題

先看redis從庫的日誌：

43037:S 20 Apr 06:12:38.047 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:12:38.047 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:12:38.048 Master replied to PING, replication can continue...
43037:S 20 Apr 06:12:38.048 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:12:39.112 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885061114038
43037:S 20 Apr 06:19:16.258 MASTER <-> SLAVE sync: receiving 20819568576 bytes from master
43037:S 20 Apr 06:20:08.203 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:24:36.036 MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:28:18.538 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:28:19.782 Background append only file rewriting started by pid 173002
43037:S 20 Apr 06:28:19.982 # Connection with master lost.
43037:S 20 Apr 06:28:19.982 Caching the disconnected master state.
43037:S 20 Apr 06:28:20.984 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:28:20.985 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:28:20.985 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:28:20.985 Master replied to PING, replication can continue...
43037:S 20 Apr 06:28:20.986 Trying a partial resynchronization (request 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885062375607).
43037:S 20 Apr 06:28:22.073 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885240485270
43037:S 20 Apr 06:28:22.073 Discarding previously cached master state.
43037:S 20 Apr 06:33:30.800 # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.
43037:S 20 Apr 06:33:30.801 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:33:30.802 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:33:30.802 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:33:30.802 Master replied to PING, replication can continue...
43037:S 20 Apr 06:33:30.803 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:34:27.458 AOF rewrite child asks to stop sending diffs.
173002:C 20 Apr 06:34:27.470 Parent agreed to stop sending diffs. Finalizing AOF...
173002:C 20 Apr 06:34:27.470 Concatenating 1.19 MB of AOF diff received from parent.
173002:C 20 Apr 06:34:27.477 SYNC append only file rewrite performed
173002:C 20 Apr 06:34:28.757 AOF rewrite: 119 MB of memory used by copy-on-write
43037:S 20 Apr 06:34:29.961 Background AOF rewrite terminated with success
43037:S 20 Apr 06:34:29.961 Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
43037:S 20 Apr 06:34:29.961 Background AOF rewrite finished successfully
43037:S 20 Apr 06:35:08.471 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885393456775
43037:S 20 Apr 06:41:49.489 MASTER <-> SLAVE sync: receiving 20821399746 bytes from master
43037:S 20 Apr 06:42:28.911 # I/O error trying to sync with MASTER: connection lost
43037:S 20 Apr 06:42:32.642 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:42:32.646 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:42:32.646 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:42:32.647 Master replied to PING, replication can continue...
43037:S 20 Apr 06:42:32.647 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:42:33.755 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885541422071
43037:S 20 Apr 06:49:15.956 MASTER <-> SLAVE sync: receiving 20821403571 bytes from master
43037:S 20 Apr 06:50:16.781 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:54:23.078 MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:58:10.123 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:58:11.317 Background append only file rewriting started by pid 223387
43037:S 20 Apr 06:58:11.536 # Connection with master lost.
43037:S 20 Apr 06:58:11.536 * Caching the disconnected master state.

再來看一下redis主庫的日誌：

304369:M 20 Apr 05:13:00.197 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:13:00.197 Full resync requested by slave 10.93.157.16:6383
304369:M 20 Apr 05:13:00.197 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:13:00.902 Background saving started by pid 366254
366254:C 20 Apr 05:19:14.460 DB saved on disk
366254:C 20 Apr 05:19:14.961 RDB: 4613 MB of memory used by copy-on-write
304369:M 20 Apr 05:19:15.579 Background saving terminated with success
304369:M 20 Apr 05:20:18.303 Synchronization with slave 10.93.157.16:6383 succeeded
304369:M 20 Apr 05:22:32.216 # Client id=1768461 addr=10.93.157.16:26912 fd=10 name= age=572 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=11450 omem=278945640 events=rw cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:22:32.216 # Connection with slave 10.93.157.16:6383 lost.
304369:M 20 Apr 05:28:27.651 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:28:27.651 Unable to partial resync with slave 10.93.157.16:6383 for lack of backlog (Slave request was: 884116305579).
304369:M 20 Apr 05:28:27.651 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:28:28.356 Background saving started by pid 396084
304369:M 20 Apr 05:32:37.471 # Client id=1768945 addr=10.93.157.16:21854 fd=22 name= age=250 idle=250 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16366 oll=11126 omem=271085384 events=r cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:32:37.538 # Connection with slave 10.93.157.16:6383 lost.

從日誌中我們可以看到在redis主庫在接到從庫要求重新同步數據的時候先生成一個rdb文件，再通過psync來做部分同步，可以看到問題就出在部分同步這一塊，由於client-output-buffer-limit值設置太小，導致導數發送失敗。失敗後，從庫繼續發起數據同步的請求，但是每次都失敗，redis主庫反覆地生成rdb文件，雖然是有fork一個子進程，但是對redis主庫的吞吐量是有一定影響的。

現在我們來解決這個問題，先看一下目前 client-output-buffer-limit的大小：
127.0.0.1:6385> config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
可以看到，目前的限制是最大256M和60s內不超過64M,從我們上面日誌信息可以看到psync的數據明顯是大於256M的。我們進行如下設置把很限制調大：
config set client-output-buffer-limit 'slave 1073741824 268435456 60'
調完後再觀察，發現從庫的的複製狀態很快就變成了up。
我們上面的調整是加大複製輸出緩衝區，還有一個辦法就是關閉複製輸出緩衝區的限制：
config set client-output-buffer-limit 'slave 0 0 0'

redis重做從庫時報Connection with master lost錯誤

程序連不上數據庫的總結

將進程綁定到指定cpu

MongoDB用戶管理

我的友情鏈接

paramiko模塊報異常：No handlers could be found for logge

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結