Redis Cluster: cluster down caused by AOF

1. Business background

2. Symptoms:

The following appears in the Redis log:

3963:S 28 Jul 12:26:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:37:18.048 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:40:25.080 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 13:11:51.146 * FAIL message received from ae3213272c3bf10556a4798d73b2414cb4e2e78f about 3c1067d381a504dacc86766b349739c8c9e0ae5a
3963:S 28 Jul 13:11:52.306 * Clear FAIL state for node 3c1067d381a504dacc86766b349739c8c9e0ae5a: slave is reachable again.

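The line "3963:S 28 Jul 12:37:18.048 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis." is logged when, with appendfsync everysec, a background fsync has been running for more than about 2 seconds. Redis then calls write() on the AOF file anyway, and that write() can block while the fsync is still in progress, which blocks the Redis main thread. While it is blocked, the node accepts no command requests and does not answer cluster bus messages either. Once the node has been unresponsive for longer than cluster-node-timeout, the other nodes flag it as failing and spread that state through the gossip protocol; when a majority of masters agree, the node is marked FAIL, which is the FAIL message in the log above. In this log the failed node is a replica and it becomes reachable again about a second later. If the blocked node is a master, its replicas vote and fail over (this is a failover, not a rebalance), and if slots are left without a serving node the cluster goes down.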
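To see how often the slow-fsync warning and the FAIL events occur, the log can be scanned with a small script. The following is a minimal sketch, not a finished tool: the function name classify, the event labels, and the assumed_year parameter are illustrative choices (Redis log timestamps carry no year, so one is assumed), and the sample is the log excerpt shown above.

```python
import re
from datetime import datetime

# Prefix of a Redis log line: "pid:role DD Mon HH:MM:SS.mmm level message"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[A-Z]) "
    r"(?P<ts>\d{1,2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<msg>.*)$"
)

# Event labels are hypothetical names chosen for this sketch.
PATTERNS = {
    "slow_fsync": re.compile(r"Asynchronous AOF fsync is taking too long"),
    "fail": re.compile(r"^FAIL message received from (\w+) about (\w+)"),
    "clear_fail": re.compile(r"^Clear FAIL state for node (\w+)"),
}


def classify(lines, assumed_year=2000):
    """Return a list of (timestamp, kind) for the log lines we recognise."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # not a Redis log line
        # The log has no year, so prepend an assumed one before parsing.
        ts = datetime.strptime(
            f"{assumed_year} {m.group('ts')}", "%Y %d %b %H:%M:%S.%f"
        )
        for kind, pattern in PATTERNS.items():
            if pattern.search(m.group("msg")):
                events.append((ts, kind))
                break
    return events


SAMPLE = """\
3963:S 28 Jul 12:26:30.030 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:37:18.048 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 12:40:25.080 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3963:S 28 Jul 13:11:51.146 * FAIL message received from ae3213272c3bf10556a4798d73b2414cb4e2e78f about 3c1067d381a504dacc86766b349739c8c9e0ae5a
3963:S 28 Jul 13:11:52.306 * Clear FAIL state for node 3c1067d381a504dacc86766b349739c8c9e0ae5a: slave is reachable again.
""".splitlines()

events = classify(SAMPLE)
counts = {}
for _, kind in events:
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```

Running the same scan on the logs of every node makes it easier to check whether a FAIL report about a node lines up with slow-fsync warnings in that node's own log.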

4. How it works:

Write-up by Fu Lei: https://carlosfu.iteye.com/blog/2259482

AOF design principles: https://redisbook.readthedocs.io/en/latest/internal/aof.html
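The decision that produces the warning can be modelled in a few lines. This is a simplified sketch of the appendfsync everysec branch of flushAppendOnlyFile in Redis's aof.c as I understand it, not the actual implementation: the class AofState, the function flush_decision, and the returned labels are hypothetical names, and the real function also handles forced flushes, write errors, and the other fsync policies.

```python
from dataclasses import dataclass

POSTPONE_LIMIT_SECONDS = 2  # the write is postponed for at most about 2 seconds


@dataclass
class AofState:
    flush_postponed_start: int = 0  # 0 means "not currently postponing"
    delayed_fsync: int = 0          # plays the role of aof_delayed_fsync in INFO


def flush_decision(state, now, fsync_in_progress):
    """Decide what one flush attempt does under the everysec policy.

    Returns "write" (normal path), "postpone" (wait for the background
    fsync), or "write_anyway" (the case that logs the warning and may
    block the main thread inside write()).
    """
    if not fsync_in_progress:
        state.flush_postponed_start = 0
        return "write"
    if state.flush_postponed_start == 0:
        # First time we see the fsync still running: start the clock.
        state.flush_postponed_start = now
        return "postpone"
    if now - state.flush_postponed_start < POSTPONE_LIMIT_SECONDS:
        return "postpone"
    # Postponed long enough: count it and write without waiting.
    state.delayed_fsync += 1
    state.flush_postponed_start = 0
    return "write_anyway"


# Hypothetical timeline: the background fsync is busy from t=100 to t=103.
state = AofState()
timeline = [(100, True), (101, True), (102, True), (103, True), (104, False)]
decisions = [flush_decision(state, now, busy) for now, busy in timeline]
print(decisions, state.delayed_fsync)
```

The point of the model is the third branch: Redis prefers a possibly blocking write() over letting the AOF buffer grow without bound, so a disk that stays busy for more than a couple of seconds turns into a stall of the main thread.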

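5. Solutions

  1. Set cluster-node-timeout to 15 s (the value is given in milliseconds, so 15000), so that short stalls and network delays between nodes are not treated as node failures.
  2. Disable AOF (if the business system keeps the data in another database), or change the AOF fsync policy through the appendfsync parameter (AOF_FSYNC_ALWAYS in the source, i.e. appendfsync always); see the appendfsync setting for details. Note that always runs fsync synchronously in the main thread on every write, so on a busy disk it may raise latency rather than lower it.
  3. Set the kernel parameter vm.dirty_background_ratio=10. (I do not fully understand this one yet; it needs a deeper look at the Redis source code.)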
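One way to check whether any of these changes helped is to watch the aof_delayed_fsync counter in the output of INFO persistence, which Redis increments each time it takes the "write without waiting for fsync" path. The sketch below only parses INFO text that has already been fetched (for example with redis-cli); the helper names parse_info and delayed_fsync_delta are hypothetical, and the two sample snapshots contain made-up values for illustration.

```python
def parse_info(text):
    """Parse the "key:value" lines of a Redis INFO reply into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def delayed_fsync_delta(before_text, after_text):
    """How many delayed fsyncs happened between two INFO snapshots."""
    before = int(parse_info(before_text).get("aof_delayed_fsync", 0))
    after = int(parse_info(after_text).get("aof_delayed_fsync", 0))
    return after - before


# Illustrative snapshots; the numbers are invented for this example.
SNAPSHOT_BEFORE = """\
# Persistence
aof_enabled:1
aof_rewrite_in_progress:0
aof_pending_bio_fsync:0
aof_delayed_fsync:3
"""

SNAPSHOT_AFTER = """\
# Persistence
aof_enabled:1
aof_rewrite_in_progress:0
aof_pending_bio_fsync:1
aof_delayed_fsync:5
"""

delta = delayed_fsync_delta(SNAPSHOT_BEFORE, SNAPSHOT_AFTER)
print(f"delayed fsyncs since last snapshot: {delta}")
```

If the counter keeps growing after a change, the disk is still too slow for the write load and the main thread is still at risk of stalling.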
