Redis troubleshooting notes: MISCONF Redis is configured to save RDB snapshots

MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

 

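We run Redis in cluster mode. After a long period of use, every write operation started failing with this error, which made the system unusable.

Emergency fix: on the server, connect to the cluster with redis-cli and run the following command (CONFIG SET only affects the node you are connected to, so repeat it on each node):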

config set stop-writes-on-bgsave-error no

Alternatively, change the setting in redis.conf (a restart is required for the change to take effect):

stop-writes-on-bgsave-error no

The default is yes. Changing it to no works around the error temporarily so that the business can recover quickly.
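Because the setting has to reach every node of the cluster, it can help to generate the commands first and review them. The following is a minimal sketch: the helper name `print_config_set_cmds` and the second node address are assumptions for illustration, and the helper only prints the redis-cli commands instead of running them.

```shell
# Hypothetical helper: print one redis-cli CONFIG SET command per node.
# Usage: print_config_set_cmds <setting> <value> <host:port>...
print_config_set_cmds() {
  setting=$1
  value=$2
  shift 2
  for node in "$@"; do
    host=${node%:*}   # text before the last colon
    port=${node##*:}  # text after the last colon
    printf 'redis-cli -h %s -p %s config set %s %s\n' "$host" "$port" "$setting" "$value"
  done
}

# Example node addresses (assumptions; replace them with your own nodes).
print_config_set_cmds stop-writes-on-bgsave-error no 192.168.50.222:6371 192.168.50.222:6372
```

Once the printed commands look right, they can be run by hand or piped into `sh`.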

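Approach:

1. Cause of the error:
Redis failed while persisting data to disk in RDB mode (the cause of that failure is analyzed below).
With stop-writes-on-bgsave-error yes, Redis protects the data it already holds
by refusing further writes, so every write request returns an error and no new data is accepted.

2. Workaround:
Setting stop-writes-on-bgsave-error no turns off this safety mechanism,
which temporarily resolves Redis being unavailable for writes and quickly restores the business services.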
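Whether the most recent background save failed can be read from the `rdb_last_bgsave_status` field of INFO persistence. The following is a minimal sketch: the helper name `last_bgsave_status` is hypothetical, and it reads INFO text from standard input so that it can be tried on captured output.

```shell
# Hypothetical helper: extract rdb_last_bgsave_status from INFO output on stdin.
last_bgsave_status() {
  # INFO lines end with \r\n, so strip the carriage returns before matching.
  tr -d '\r' | awk -F: '$1 == "rdb_last_bgsave_status" { print $2 }'
}

# Example with captured text; against a live server you could pipe in
#   redis-cli -h <host> -p <port> info persistence
printf 'rdb_changes_since_last_save:42\r\nrdb_last_bgsave_status:err\r\n' | last_bgsave_status
```

A value of `err` together with stop-writes-on-bgsave-error yes is the situation in which Redis rejects writes.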

 

The permanent fix:

Since we know the root cause is that RDB persistence fails, how do we fix that failure?


Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.

Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.


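The above is the official explanation. Specifically:

The writes fail because BGSAVE failed (that is, RDB persistence failed). For BGSAVE, Redis forks a child process that saves the data to disk. The specific reason for a BGSAVE failure can be found in the Redis log (by default Redis does not write a log file, and when it runs in the background its console log goes straight to /dev/null; enabling the log file is covered below). In most cases, though, BGSAVE fails because the forked child cannot be allocated enough memory. Often the fork fails to allocate memory because of the operating system's memory overcommit policy, even though the machine has enough RAM available.

Because Redis cannot get enough memory when it persists, it reports the error. Adjusting the following system parameter resolves this:

Edit /etc/sysctl.conf and add: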

vm.overcommit_memory=1

Then run the following command to apply the setting:

sudo sysctl -p /etc/sysctl.conf
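To confirm which overcommit policy is active after the change, the current value can be read from /proc/sys/vm/overcommit_memory. The following is a minimal sketch: the helper name `describe_overcommit` and the wording of the descriptions are my own, summarizing the three values the Linux kernel defines.

```shell
# Hypothetical helper: describe a vm.overcommit_memory value.
describe_overcommit() {
  case "$1" in
    0) echo "heuristic: large forks may be refused" ;;
    1) echo "always overcommit: what the Redis FAQ recommends" ;;
    2) echo "strict: never overcommit beyond the configured limit" ;;
    *) echo "unknown value: $1" ;;
  esac
}

# Read the live value on Linux; fall back to "unknown" where the file is missing.
mode=$(cat /proc/sys/vm/overcommit_memory 2>/dev/null || echo unknown)
describe_overcommit "$mode"
```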

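At this point the problem looks solved, but unfortunately it is only solved for now. We still need to tune the Redis configuration so that Redis stays healthy in the long run.

Optimization 1: configure the log file and log level so that future problems can be analyzed from the logs: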

# notice (moderately verbose, what you want in production probably)
loglevel notice
# Point this at the log location required by your project team
logfile "/data/log/redis-6379.log"
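Once the log file is enabled, failed background saves can be filtered out of it. The following is a minimal sketch: the helper name `bgsave_errors` is hypothetical, and the two message fragments and the sample log lines are assumptions based on typical Redis log output, not lines taken from the log in this incident.

```shell
# Hypothetical helper: keep only log lines that point at a failed background save.
# The two message fragments are assumptions; adjust them to what your log contains.
bgsave_errors() {
  grep -E "Can't save in background|Background saving error"
}

# Illustrative sample lines; on a server you could run
#   bgsave_errors < /data/log/redis-6379.log
printf '%s\n' \
  "1234:M 01 Jan 2019 10:00:00.000 * 10 changes in 300 seconds. Saving..." \
  "1234:M 01 Jan 2019 10:00:00.001 # Can't save in background: fork: Cannot allocate memory" \
  | bgsave_errors
```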

Optimization 2: tune the memory limit and choose a suitable eviction policy.

As shown below, the client and the INFO command report how much memory Redis is using:

[root@localhost redis-5.0.3]# redis-cli -h 192.168.50.222 -p 6371 -c
192.168.50.222:6371> info memory
# Memory
used_memory:1073485064
used_memory_human:1023.76M
used_memory_rss:1127706624
used_memory_rss_human:1.05G
used_memory_peak:1077872280
used_memory_peak_human:1.00G
used_memory_peak_perc:99.59%
used_memory_overhead:2072046
used_memory_startup:1454768
used_memory_dataset:1071413018
used_memory_dataset_perc:99.94%
allocator_allocated:1073897952
allocator_active:1118851072
allocator_resident:1134903296
total_system_memory:8186183680
total_system_memory_human:7.62G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:1073741824
maxmemory_human:1.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:1.04
allocator_frag_bytes:44953120
allocator_rss_ratio:1.01
allocator_rss_bytes:16052224
rss_overhead_ratio:0.99
rss_overhead_bytes:-7196672
mem_fragmentation_ratio:1.05
mem_fragmentation_bytes:54263816
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:49694
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0

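In production you should always configure maxmemory, and it should generally not exceed the machine's physical memory. Because Redis is memory-based, plan memory capacity in advance so that it does not run out of maxmemory. As a rule of thumb, once actual usage reaches 3/4 of the maximum, consider adding memory or splitting the data.

Note: the memory values shown here come from my test environment and should not be taken as reference values.

# maxmemory <bytes>, in bytes; allocate a share of your machine's memory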
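The 3/4 rule of thumb can be checked with a little arithmetic on the `used_memory` and `maxmemory` fields. The following is a minimal sketch: the helper names are hypothetical, and it assumes maxmemory is not 0 (0 means no limit and would cause a division by zero here).

```shell
# Hypothetical helpers for the 3/4 rule of thumb.
# memory_usage_percent <used_bytes> <max_bytes>: integer percentage, rounded down.
memory_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# over_three_quarters <used_bytes> <max_bytes>: succeeds when used >= 3/4 of max.
over_three_quarters() {
  [ $(( $1 * 4 )) -ge $(( $2 * 3 )) ]
}

# Values of used_memory and maxmemory from the test environment's INFO output.
used=1073485064
max=1073741824
echo "usage: $(memory_usage_percent "$used" "$max")%"
if over_three_quarters "$used" "$max"; then
  echo "above 3/4 of maxmemory: consider adding memory or splitting the data"
fi
```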
# 1M = 1024 * 1024 * 1
# 1G = 1024 * 1024 * 1024 * 1
maxmemory 1073741824
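The byte count can be computed instead of typed by hand. The following is a minimal sketch with hypothetical helper names:

```shell
# Hypothetical helpers: convert a size to the byte count that maxmemory expects.
mb_to_bytes() { echo $(( $1 * 1024 * 1024 )); }
gb_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

echo "maxmemory $(gb_to_bytes 1)"   # prints: maxmemory 1073741824
```

redis.conf also accepts unit suffixes, so `maxmemory 1gb` means the same as the byte value above.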

# volatile-lru -> Evict using approximated LRU among the keys with an expire set.
# allkeys-lru -> Evict any key using approximated LRU.
# volatile-lfu -> Evict using approximated LFU among the keys with an expire set.
# allkeys-lfu -> Evict any key using approximated LFU.
# volatile-random -> Remove a random key among the ones with an expire set.
# allkeys-random -> Remove a random key, any key.
# volatile-ttl -> Remove the key with the nearest expire time (minor TTL)
# noeviction -> Don't evict anything, just return an error on write operations.

maxmemory-policy volatile-lru

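Choose the eviction policy that fits how your project actually uses Redis. Redis provides 8 eviction policies, and noeviction is the default.

I use volatile-lru here because our project can tolerate data that has been inactive for a long time being dropped from Redis. Keep in mind that volatile-lru only evicts keys that have an expire set.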
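A typo in the policy name is easy to make, so it can be worth checking the name before applying it. The following is a minimal sketch: the helper name `is_valid_policy` is hypothetical, and the CONFIG SET command is only printed, not executed.

```shell
# Hypothetical helper: succeed only for one of the eight eviction policy names.
is_valid_policy() {
  case "$1" in
    volatile-lru|allkeys-lru|volatile-lfu|allkeys-lfu) return 0 ;;
    volatile-random|allkeys-random|volatile-ttl|noeviction) return 0 ;;
    *) return 1 ;;
  esac
}

policy=volatile-lru
if is_valid_policy "$policy"; then
  # Printed rather than executed so that it can be reviewed first.
  echo "redis-cli config set maxmemory-policy $policy"
else
  echo "unknown policy: $policy" >&2
fi
```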

 

Optimization 3: tune data persistence

################################ SNAPSHOTTING  ################################
#
# Save the DB on disk:
#
#   save <seconds> <changes>
#
#   Will save the DB if both the given number of seconds and the given
#   number of write operations against the DB occurred.
#
#   In the example below the behaviour will be to save:
#   after 900 sec (15 min) if at least 1 key changed
#   after 300 sec (5 min) if at least 10 keys changed
#   after 60 sec if at least 10000 keys changed
#
#   Note: you can disable saving completely by commenting out all "save" lines.
#
#   It is also possible to remove all the previously configured save
#   points by adding a save directive with a single empty string argument
#   like in the following example:
#
#   save ""
# Take an RDB snapshot if at least [1] key changed within [900] seconds
save 900 1
# Take an RDB snapshot if at least [10] keys changed within [300] seconds
save 300 10
save 60 10000

Adjust these save points to your project's actual needs.
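To see how the three save points interact, the rule can be modeled in a few lines. The following is a minimal sketch: the helper name `should_snapshot` is hypothetical, and treating the time window as "strictly more than N seconds" is my assumption about the boundary case.

```shell
# Hypothetical helper: would the save points 900 1 / 300 10 / 60 10000 trigger a snapshot?
# Usage: should_snapshot <seconds_since_last_save> <changes_since_last_save>
should_snapshot() {
  elapsed=$1
  changes=$2
  # Each pair is "<seconds> <changes>", matching one save line.
  for rule in "900 1" "300 10" "60 10000"; do
    set -- $rule   # split the pair into $1 (seconds) and $2 (changes)
    if [ "$elapsed" -gt "$1" ] && [ "$changes" -ge "$2" ]; then
      return 0
    fi
  done
  return 1
}

should_snapshot 301 10 && echo "301s, 10 changes: snapshot"
should_snapshot 120 10 || echo "120s, 10 changes: no snapshot yet"
```

Any single save point is enough to trigger a snapshot, which is why a busy instance is saved more often than a quiet one.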

############################## APPEND ONLY MODE ###############################

# By default Redis asynchronously dumps the dataset on disk. This mode is
# good enough in many applications, but an issue with the Redis process or
# a power outage may result into a few minutes of writes lost (depending on
# the configured save points).
#
# The Append Only File is an alternative persistence mode that provides
# much better durability. For instance using the default data fsync policy
# (see later in the config file) Redis can lose just one second of writes in a
# dramatic event like a server power outage, or a single write if something
# wrong with the Redis process itself happens, but the operating system is
# still running correctly.
#
# AOF and RDB persistence can be enabled at the same time without problems.
# If the AOF is enabled on startup Redis will load the AOF, that is the file
# with the better durability guarantees.
#
# Please check http://redis.io/topics/persistence for more information.

appendonly no

# The name of the append only file (default: "appendonly.aof")

appendfilename "appendonly.aof"

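The settings above are the switch for Redis AOF (append-only file) persistence and the name of the persistence file. Enable or disable AOF depending on how your project uses Redis.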

 

Optimization 4: Warning! Do not restart Redis by killing the process.

[root@localhost redis-5.0.3]# redis-cli -h 192.168.50.222 -p 6371 -c
192.168.50.222:6371> SHUTDOWN

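Always shut Redis down with the SHUTDOWN command to keep the data intact: when persistence is configured, Redis completes it before exiting. A forced kill (kill -9) ends the process immediately and gives Redis no chance to save.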

 

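Finally, once the configuration and system parameters have been tuned, remember to revert the temporary workaround from the beginning. We still want this setting to alert us when persistence fails. If your team already has other Redis monitoring and alerting in place, you can of course leave it turned off.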

stop-writes-on-bgsave-error yes

 
