Redis Production Node Outage: Errors, Recovery, and Troubleshooting

Redis Fault Detection

  1. Subjective offline

If a node cannot successfully complete ping message communication with another node within cluster-node-timeout, it marks that node as subjectively offline (the sketch after this list shows how to read this timeout from a node).

  2. Objective offline

After a node judges another node as subjectively offline, that offline report is spread through Gossip messages. When a receiving node finds a subjectively-offline node in the message body, it tries to mark that node as objectively offline, based on whether the offline reports are still within their validity period (if reports from more than half of the slot-holding masters cannot be collected within cluster-node-timeout*2, the earlier reports expire) and whether their count exceeds half of the slot-holding masters. If so, the node is marked as objectively offline and a fail message about it is broadcast to the cluster.
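
A quick way to see the timeout that drives this detection is to read cluster-node-timeout from any node. A minimal sketch, assuming the host/port of one production node (the value shown is the Redis default and may differ in production):

    redis-cli -h 172.20.20.9 -p 7001 config get cluster-node-timeout
    # 1) "cluster-node-timeout"
    # 2) "15000"    <- milliseconds; 15000 is the Redis default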

Redis Fault Recovery

Once a failed node is marked as objectively offline, if it is a slot-holding master, one of its replicas must be elected to replace it so that the cluster stays highly available. The process is as follows:

  1. Eligibility check

Each replica checks how long it has been disconnected from its master to decide whether it is eligible to replace the failed master. If the disconnection time exceeds cluster-node-timeout * cluster-slave-validity-factor, the replica is not eligible for failover (both parameters can be read from a node, as shown in the sketch after this list).

  2. Prepare the election time

Once a replica qualifies for failover, it sets the time at which the failover election will be triggered; only after that time is reached does the rest of the flow proceed. A delayed-trigger mechanism is used so that different replicas get different election delays, which implements priority: the larger the replication offset, the smaller the replica's lag, so it gets a higher priority (a shorter delay).

  3. Initiate the election

    When the replica reaches its failover election time, it triggers the election flow:

    1. Update the config epoch

      The config epoch is a monotonically increasing integer. Each master maintains its own config epoch, which identifies that master's version; no two masters share the same config epoch, and replicas copy their master's config epoch. The cluster also maintains a global config epoch that records the largest config epoch among all masters. Whenever a major cluster event occurs, such as a new master joining (or a replica being promoted to master) or replicas competing in an election, the global config epoch is incremented and assigned to the relevant master to record that event.

    2. Broadcast the election message

      The replica broadcasts an election message within the cluster and records that the message has been sent, which ensures it initiates at most one election per config epoch.

  4. Election voting

Only slot-holding masters process failover election messages. Each slot-holding master has exactly one vote per config epoch: it replies to the first replica that requests a vote, and ignores election messages from other replicas within the same config epoch. The voting process is essentially a leader election.

Each config epoch represents one election round. If a replica does not collect enough votes within cluster-node-timeout*2 after voting starts, that election is void; the replica increments the config epoch and starts the next round of voting, until the election succeeds.

  5. Replace the master

The winning replica stops replicating and becomes a master, revokes the slots owned by the failed master, assigns those slots to itself, and broadcasts to the cluster that it is now a master.
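
The eligibility window and the config epochs described above can both be inspected from the command line. A minimal sketch, assuming one production node's host/port:

    # eligibility window = cluster-node-timeout * cluster-slave-validity-factor
    redis-cli -h 172.20.20.9 -p 7001 config get cluster-node-timeout
    redis-cli -h 172.20.20.9 -p 7001 config get cluster-slave-validity-factor   # Redis default is 10

    # per-node config epoch: the numeric field just before "connected" in CLUSTER NODES;
    # the cluster-wide maximum is cluster_current_epoch in CLUSTER INFO
    redis-cli -h 172.20.20.9 -p 7001 cluster nodes
    redis-cli -h 172.20.20.9 -p 7001 cluster info | grep current_epoch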

Redis Failover Time (failover-time)

  1. Subjective offline detection time: equal to cluster-node-timeout.

  2. Subjective offline state propagation time: at most cluster-node-timeout/2 (the message-exchange mechanism preferentially contacts nodes in the offline state).

  3. Replica failover time: at most 1000 milliseconds (the replica with the largest offset delays at most 1 second before starting the election, which usually succeeds on the first attempt).

So, failover-time (ms) ≤ cluster-node-timeout + cluster-node-timeout/2 + 1000.
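
For example, assuming the default cluster-node-timeout of 15000 ms (the production value may differ), the worst-case bound works out to:

    failover-time ≤ cluster-node-timeout + cluster-node-timeout/2 + 1000
                  = 15000 + 7500 + 1000
                  = 23500 ms ≈ 23.5 seconds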

Redis Production Node Outage: Errors and Recovery

Problem Background and Analysis

This document records a production Redis incident: user accounts could not log in or register, while some other business operations still worked.

The first step was to check the Java logs, which showed the following errors:

Java Log Errors

{"jv_time":"2020-10-01 08:02:24.222","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-33","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Redis server response timeout (10000 ms) occured for command: (HSET) with params: [redisson_spring_session:0e4db6b7-bb7a-4050-949f-d6b182e8ccd0, PooledUnsafeDirectByteBuf(ridx: 0, widx: 26, cap: 256), PooledUnsafeDirectByteBuf(ridx: 0, widx: 32, cap: 256)] channel: [id: 0xbfc0a52c, L:/172.20.20.9:22454 - R:/172.20.20.9:7002] at org.redisson.command.CommandAsyncService$11.run(CommandAsyncService.java:682) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:12.935","jv_level":"ERROR","jv_thread":"redisson-netty-1-4","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host testsrv1982/172.20.20.9:7002 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for testsrv1982/172.20.20.9:7002 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:50.563","jv_level":"ERROR","jv_thread":"redisson-netty-1-14","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host testsrv1982/172.20.20.9:7002 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for testsrv1982/172.20.20.9:7002 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:08:59.953","jv_level":"ERROR","jv_thread":"redisson-netty-1-1","jv_class":"org.redisson.connection.pool.SlaveConnectionPool","jv_method":"checkForReconnect","jv_message":"host /172.20.20.9:7003 disconnected due to failedAttempts=3 limit reached","jv_throwable":"org.redisson.client.RedisTimeoutException: Command execution timeout for /172.20.20.9:7003 at org.redisson.client.RedisConnection$2.run(RedisConnection.java:212) ~[redisson-3.5.7.jar!/:?] at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-608","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-124","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

{"jv_time":"2020-10-01 08:28:54.570","jv_level":"ERROR","jv_thread":"http-nio-8888-exec-379","jv_class":"org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/member].[dispatcherServlet]","jv_method":"log","jv_message":"Servlet.service() for servlet [dispatcherServlet] in context with path [/member] threw exception","jv_throwable":"org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts at org.redisson.command.CommandAsyncService$8.run(CommandAsyncService.java:554) ~[redisson-3.5.7.jar!/:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483) ~[netty-all-4.1.42.Final.jar!/:4.1.42.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] "}

Analysis of the Java error logs shows two main error messages:

  1. org.redisson.client.RedisTimeoutException: Redis server response timeout (10000 ms) occured for command: (HSET) with params: [redisson_spring_session:0e4db6b7-bb7a-4050-949f-d6b182e8ccd0, PooledUnsafeDirectByteBuf(ridx: 0, widx: 26, cap: 256), PooledUnsafeDirectByteBuf(ridx: 0, widx: 32, cap: 256)] channel: [id: 0xbfc0a52c, L:/172.20.20.9:22454 - R:/172.20.20.9:7002]

    The first error is a Redis server response timeout, which caused the HSET write to fail.

  2. org.redisson.client.RedisTimeoutException: Unable to send command: (HGET) with params: [maintained-flag, PooledUnsafeDirectByteBuf(ridx: 0, widx: 5, cap: 256)] after 3 retry attempts

    The second error is also a timeout: the HGET query failed and still failed after 3 retries (a sketch of the client settings behind these numbers follows this list).
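
The 10000 ms response timeout and the 3 retry attempts in these errors are client-side Redisson settings. Below is a minimal sketch of how such values are typically configured on a Redisson cluster client; the node address, retry interval, and class name are assumptions for illustration, not the project's actual configuration:

    import org.redisson.Redisson;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;

    public class RedissonClusterClientSketch {
        public static void main(String[] args) {
            Config config = new Config();
            config.useClusterServers()
                  // assumed seed node; production would list every cluster node
                  .addNodeAddress("redis://172.20.20.9:7001")
                  .setTimeout(10000)       // "Redis server response timeout (10000 ms)"
                  .setRetryAttempts(3)     // "after 3 retry attempts"
                  .setRetryInterval(1500); // milliseconds between retries
            // the "failedAttempts=3 limit reached" message comes from the connection
            // failure threshold, a separate connection-pool setting
            RedissonClient redisson = Redisson.create(config);
            // ... business code using redisson ...
            redisson.shutdown();
        }
    }

Raising these values only masks the timeouts seen during a failover; the underlying cluster problem still has to be fixed.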

Summary: the Java error logs point to a problem in the Redis cluster, so the Redis logs were examined next. They contain Redis cluster fail events: nodes went abnormal and the cluster's automatic recovery mechanism kicked in, so some Redis client connections could not write or query data normally, which is why user registration and login failed. The Redis cluster's automatic-recovery logs are analyzed below.

Redis Error Log

The production Redis cluster entered the fail state and, while recovering automatically, caused the business to stall. Possible causes of the node problems:

  1. A business API was hammered, piling up large numbers of query commands, producing Redis slow queries and eventually network timeouts.

    The slow-query command SLOWLOG GET can be used to find which queries dragged the cluster down (see the sketch after this list).

    http://redis.cn/commands/slowlog

  2. Network timeouts within cluster-node-timeout, causing Redis nodes to be marked offline again and elections to be triggered.

  3. RDB backup synchronization moved a large amount of data onto the Redis servers, degrading server performance and causing network timeouts.
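
A minimal sketch of inspecting the slow log on a suspect node, as referenced in cause 1 above (host/port assumed; the logging threshold slowlog-log-slower-than is in microseconds and defaults to 10000):

    redis-cli -h 172.20.20.9 -p 7002 slowlog len                         # number of recorded slow commands
    redis-cli -h 172.20.20.9 -p 7002 slowlog get 10                      # the 10 most recent slow commands
    redis-cli -h 172.20.20.9 -p 7002 config get slowlog-log-slower-than  # logging threshold in microseconds
    redis-cli -h 172.20.20.9 -p 7002 slowlog reset                       # clear the log after analysis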

Below are the Redis logs, which also show the cluster recovery process:

Redis cluster nodes

redissrv1:7001> cluster nodes
191e4f15c0de39efe1ab9248bd413c9fdb2dddec 172.20.20.10:7005@17005 master - 0 1601964809000 11 connected 10923-16383
00fced34f12176c1a229211be9a3d579a51091df 172.20.20.9:7003@17003 slave 9a28b0084c1486ceb0bca3c45f2554d79041e57d 0 1601964810090 9 connected
d1fbc46714f89985ce45445b07355869085b2e7e 172.20.20.9:7002@17002 slave 191e4f15c0de39efe1ab9248bd413c9fdb2dddec 0 1601964809090 11 connected
9a28b0084c1486ceb0bca3c45f2554d79041e57d 172.20.20.10:7004@17004 master - 0 1601964807087 9 connected 5461-10922
599f153057aae0b5ef2f1d895a8e4ac7d0474cec 172.20.20.10:7006@17006 master - 0 1601964809000 10 connected 0-5460
219e9a5cb2898bf656ff4bf18eeadc1467ae8dd8 172.20.20.9:7001@17001 myself,slave 599f153057aae0b5ef2f1d895a8e4ac7d0474cec 0 1601964808000 1 connected

The log below only covers the node where the errors appeared; when one node in a Redis cluster has a problem, the other cluster nodes also print related cluster error logs, which are not repeated here.

4233:C 01 Oct 07:58:09.005 * RDB: 8 MB of memory used by copy-on-write
8971:M 01 Oct 07:58:09.072 * Background saving terminated with success
8971:M 01 Oct 07:59:50.030 * 1000 changes in 100 seconds. Saving...
8971:M 01 Oct 07:59:50.182 * Background saving started by pid 4914
4914:C 01 Oct 07:59:53.778 * DB saved on disk
4914:C 01 Oct 07:59:53.796 * RDB: 7 MB of memory used by copy-on-write
8971:M 01 Oct 07:59:53.909 * Background saving terminated with success
8971:M 01 Oct 08:00:54.791 # Cluster state changed: fail 
# Right after an RDB save, the cluster state suddenly changes to fail, which means the cluster is already unstable.
8971:M 01 Oct 08:00:59.888 # Cluster state changed: ok
8971:M 01 Oct 08:01:33.895 * Marking node d1fbc46714f89985ce45445b07355869085b2e7e as failing (quorum reached).
# Node d1fbc46714f89985ce45445b07355869085b2e7e is marked as objectively offline (FAIL, quorum reached).
8971:M 01 Oct 08:01:38.483 * 1000 changes in 100 seconds. Saving...
8971:M 01 Oct 08:02:23.076 * Background saving started by pid 5614
8971:M 01 Oct 08:02:23.852 # Cluster state changed: fail
8971:M 01 Oct 08:02:24.282 * Marking node 00fced34f12176c1a229211be9a3d579a51091df as failing (quorum reached).
# Node 00fced34f12176c1a229211be9a3d579a51091df is marked as objectively offline (FAIL, quorum reached).
8971:M 01 Oct 08:02:24.370 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: is reachable again and nobody is serving its slots after some time.
# d1fbc46714f89985ce45445b07355869085b2e7e is reachable again (and nobody took over its slots), so its FAIL state is cleared.
8971:M 01 Oct 08:02:25.158 # Failover auth denied to 191e4f15c0de39efe1ab9248bd413c9fdb2dddec: its master is up
# The failover vote requested by 191e4f15c0de39efe1ab9248bd413c9fdb2dddec is denied because its master is still up.
8971:M 01 Oct 08:02:25.301 # Failover auth granted to 9a28b0084c1486ceb0bca3c45f2554d79041e57d for epoch 9
# This node votes for 9a28b0084c1486ceb0bca3c45f2554d79041e57d in epoch 9.
8971:M 01 Oct 08:02:27.454 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: master without slots is reachable again.
# 00fced34f12176c1a229211be9a3d579a51091df (a master without slots) is reachable again, so its FAIL state is cleared.
5614:C 01 Oct 08:02:28.802 * DB saved on disk
5614:C 01 Oct 08:02:28.822 * RDB: 8 MB of memory used by copy-on-write
8971:M 01 Oct 08:02:28.914 * Background saving terminated with success
8971:M 01 Oct 08:02:29.903 # Cluster state changed: ok
# After the offline detection and the promotion of a new master, the cluster state goes from fail back to ok.
8971:M 01 Oct 08:03:50.110 * 5000 changes in 80 seconds. Saving...
8971:M 01 Oct 08:07:29.150 * Background saving started by pid 5708
8971:M 01 Oct 08:07:29.663 # Connection with slave client id #4 lost.
8971:M 01 Oct 08:07:29.752 * Slave 172.20.20.10:7006 asks for synchronization
8971:M 01 Oct 08:07:29.782 * Partial resynchronization request from 172.20.20.10:7006 accepted. Sending 40417 bytes of backlog starting from offset 372035256846.
# This node accepts a partial resynchronization request from the 7006 replica and sends it the missing backlog.
5708:C 01 Oct 08:07:34.672 * DB saved on disk
5708:C 01 Oct 08:07:34.692 * RDB: 12 MB of memory used by copy-on-write
8971:M 01 Oct 08:07:34.815 * Background saving terminated with success
8971:M 01 Oct 08:08:55.159 * 5000 changes in 80 seconds. Saving...
8971:M 01 Oct 08:08:55.468 * Background saving started by pid 6485
6485:C 01 Oct 08:09:00.333 * DB saved on disk
6485:C 01 Oct 08:09:00.349 * RDB: 15 MB of memory used by copy-on-write
8971:M 01 Oct 08:09:00.439 * Background saving terminated with success
8971:M 01 Oct 08:09:37.073 * Marking node 00fced34f12176c1a229211be9a3d579a51091df as failing (quorum reached).
# Node 00fced34f12176c1a229211be9a3d579a51091df is marked as objectively offline again (quorum reached).
8971:M 01 Oct 08:10:25.863 * 5000 changes in 80 seconds. Saving...
8971:M 01 Oct 08:12:02.845 * Background saving started by pid 6751
8971:M 01 Oct 08:12:04.741 # Connection with slave client id #8319 lost.
8971:M 01 Oct 08:12:04.777 # Configuration change detected. Reconfiguring myself as a replica of 599f153057aae0b5ef2f1d895a8e4ac7d0474cec
# This node (219e9a5cb2898bf656ff4bf18eeadc1467ae8dd8) reconfigures itself as a replica of 599f153057aae0b5ef2f1d895a8e4ac7d0474cec.
8971:S 01 Oct 08:12:04.787 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
8971:S 01 Oct 08:12:05.007 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:12:05.895 * Connecting to MASTER 172.20.20.10:7006
# This node connects to its master (172.20.20.10:7006).
8971:S 01 Oct 08:12:06.350 * MASTER <-> SLAVE sync started
# Replication sync with the master starts.
8971:S 01 Oct 08:12:06.543 * Non blocking connect for SYNC fired the event.
8971:S 01 Oct 08:12:06.543 * Master replied to PING, replication can continue...
8971:S 01 Oct 08:12:06.544 * Trying a partial resynchronization (request 57b5cf8f66406836b49f26855b77a6f61e7653fc:372038922930).
8971:S 01 Oct 08:12:06.559 * Full resync from master: 1a6c129f8d1de27cb06714b27c46a420eae014e2:372038873893
8971:S 01 Oct 08:12:06.559 * Discarding previously cached master state.
8971:S 01 Oct 08:12:10.182 * MASTER <-> SLAVE sync: receiving 165760616 bytes from master
6751:C 01 Oct 08:12:10.661 * DB saved on disk
6751:C 01 Oct 08:12:14.532 * RDB: 11 MB of memory used by copy-on-write
8971:S 01 Oct 08:12:29.561 * Background saving terminated with success
8971:S 01 Oct 08:12:37.365 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:12:37.764 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:12:49.951 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:12:50.674 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:14:08.346 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:14:13.072 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:14:40.230 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:14:42.439 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:16:35.122 * FAIL message received from 191e4f15c0de39efe1ab9248bd413c9fdb2dddec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:16:38.671 * FAIL message received from 191e4f15c0de39efe1ab9248bd413c9fdb2dddec about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:16:42.318 * 50 changes in 250 seconds. Saving...
8971:S 01 Oct 08:19:59.146 * Background saving started by pid 7514
8971:S 01 Oct 08:20:00.886 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:20:01.319 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
7514:C 01 Oct 08:20:04.454 * DB saved on disk
7514:C 01 Oct 08:20:04.514 * RDB: 11 MB of memory used by copy-on-write
8971:S 01 Oct 08:20:05.327 * Background saving terminated with success
8971:S 01 Oct 08:22:20.042 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:24:08.344 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:25:27.485 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:25:45.951 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:25:54.469 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:26:18.242 # Cluster state changed: fail
8971:S 01 Oct 08:26:20.579 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:26:28.343 # Cluster state changed: ok
8971:S 01 Oct 08:27:11.280 # Cluster state changed: fail
8971:S 01 Oct 08:27:18.984 # Cluster state changed: ok
8971:S 01 Oct 08:27:25.077 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:27:25.191 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:27:28.897 * MASTER <-> SLAVE sync: Flushing old data
8971:S 01 Oct 08:27:34.787 * MASTER <-> SLAVE sync: Loading DB in memory
# The log lines above are all node offline/online transitions and synchronization; at this point the node starts loading the received dataset into memory.
8971:S 01 Oct 08:28:28.832 * FAIL message received from 599f153057aae0b5ef2f1d895a8e4ac7d0474cec about d1fbc46714f89985ce45445b07355869085b2e7e
8971:S 01 Oct 08:28:28.859 * FAIL message received from 9a28b0084c1486ceb0bca3c45f2554d79041e57d about 00fced34f12176c1a229211be9a3d579a51091df
8971:S 01 Oct 08:28:37.881 * MASTER <-> SLAVE sync: Finished with success
# The master-replica data synchronization finishes successfully here.
8971:S 01 Oct 08:28:37.898 * Background append only file rewriting started by pid 10524
8971:S 01 Oct 08:28:38.733 * Clear FAIL state for node d1fbc46714f89985ce45445b07355869085b2e7e: slave is reachable again.
8971:S 01 Oct 08:28:39.330 * Clear FAIL state for node 00fced34f12176c1a229211be9a3d579a51091df: slave is reachable again.
8971:S 01 Oct 08:28:43.029 * AOF rewrite child asks to stop sending diffs.
10524:C 01 Oct 08:28:43.040 * Parent agreed to stop sending diffs. Finalizing AOF...
10524:C 01 Oct 08:28:43.040 * Concatenating 6.43 MB of AOF diff received from parent.
10524:C 01 Oct 08:28:43.087 * SYNC append only file rewrite performed
10524:C 01 Oct 08:28:43.104 * AOF rewrite: 112 MB of memory used by copy-on-write
8971:S 01 Oct 08:28:43.332 * Background AOF rewrite terminated with success
8971:S 01 Oct 08:28:43.333 * Residual parent diff successfully flushed to the rewritten AOF (0.02 MB)
# These lines are the background AOF rewrite finishing, with the remaining parent diff flushed into the rewritten AOF file.
8971:S 01 Oct 08:28:43.333 * Background AOF rewrite finished successfully
8971:S 01 Oct 08:28:43.434 * 10 changes in 300 seconds. Saving...
8971:S 01 Oct 08:28:43.451 * Background saving started by pid 11643
11643:C 01 Oct 08:28:48.797 * DB saved on disk
11643:C 01 Oct 08:28:48.837 * RDB: 14 MB of memory used by copy-on-write
8971:S 01 Oct 08:28:48.916 * Background saving terminated with success
8971:S 01 Oct 08:30:09.007 * 5000 changes in 80 seconds. Saving...
8971:S 01 Oct 08:30:09.024 * Background saving started by pid 12605
12605:C 01 Oct 08:30:16.128 * DB saved on disk
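
After the cluster reports ok again, the recovery can be double-checked from any node before touching the application. A minimal sketch, assuming one node's host/port:

    redis-cli -h 172.20.20.9 -p 7001 cluster info        # cluster_state, cluster_slots_ok, cluster_current_epoch
    redis-cli -h 172.20.20.9 -p 7001 cluster nodes       # which nodes now own which slots
    redis-cli -h 172.20.20.9 -p 7001 info replication    # role, master_link_status, replication offsets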

Summary:

Redis Cluster implements high availability itself: when a small number of nodes fail, automatic failover keeps the cluster serving requests.

The automatic-recovery mechanism seen in the log, marking nodes offline, voting to promote a new master, bringing nodes back online, re-synchronizing slots, and validating data, is all performed by Redis itself.

Error Analysis Summary

Problem summary:

Combining the Java and Redis logs with the affected registration and login flows, the conclusion is: while the Redis cluster was synchronizing and promoting a new master, slot ownership changed, and the synchronization consumed enough system resources that the application timed out when accessing the cluster nodes and failed to reach the re-assigned slots, which caused the data query errors.

Problem resolution:

Even after the Redis cluster nodes recovered, the application still timed out and did not work normally. The suspicion was that it was still using the old slot assignment and kept sending requests to the old slot owners, so Redis kept receiving empty reads and writes, driving system resource usage up.

Restarting the user-facing application made it pick up the new Redis node topology and write data to the correct slots, which resolved the problem.
