[redis 源碼走讀] redis 過期策略

redis 可能存在大量過期數據，一次性遍歷檢查不太現實。redis 有豐富的數據結構，key-value， value 數據結構對象(redisObj)可能存儲大量數據，key 過期了，value 也不建議在進程中實時回收。爲了保證系統高性能，每次處理一點點，逐漸完成大任務，“分而治之”這是 redis 處理大任務的一貫作風。

更精彩內容，請關注我的博客：wenfh2020.com

文章目錄

檢查具體策略

訪問檢查

流程

主服務檢查過期/刪除過期邏輯 -> 刪除過期鍵值 -> 異步/同步刪除數據 -> 主從同步。

redis 數據庫，數據內容和過期時間是分開保存。expires 保存了鍵值對應的過期時間。

typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    ...
} redisDb;

策略概述

過期檢查

過期數據檢查有三個策略：

訪問鍵值觸發檢查。訪問包括外部讀寫命令，內部邏輯調用。

不可能每個過期鍵都能實時被訪問觸發，所以要結合其它策略。
事件驅動處理事件前觸發快速檢查。

將過期檢查負載一點點分攤到每個事件處理中。
時鐘定期慢速檢查。

數據回收

數據回收有同步和異步兩種方式，配置文件可以設置，一般默認異步回收數據。

異步數據回收有兩個策略：

小數據實時回收。
大數據放到任務隊列，後臺線程處理任務隊列異步回收內存。

可以看看 bio.c 的實現。

同步

int dbSyncDelete(redisDb *db, robj *key) {
    /* Deleting an entry from the expires dict will not free the sds of
     * the key, because it is shared with the main dictionary. */
    if (dictSize(db->expires) > 0)
        dictDelete(db->expires, key->ptr);
    if (dictDelete(db->dict, key->ptr) == DICT_OK) {
        if (server.cluster_enabled)
            slotToKeyDel(key);
        return 1;
    } else {
        return 0;
    }
}

異步

unlink 邏輯刪除 key，數據放在 bio 線程異步刪除。

#define LAZYFREE_THRESHOLD 64

int dbAsyncDelete(redisDb *db, robj *key) {
    if (dictSize(db->expires) > 0) dictDelete(db->expires,key->ptr);

    dictEntry *de = dictUnlink(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);
        size_t free_effort = lazyfreeGetFreeEffort(val);

        if (free_effort > LAZYFREE_THRESHOLD && val->refcount == 1) {
            atomicIncr(lazyfree_objects,1);
            // 刪除數據對象，要注意對象計數，decrRefCount 刪除。
            bioCreateBackgroundJob(BIO_LAZY_FREE,val,NULL,NULL);
            dictSetVal(db->dict,de,NULL);
        }
    }

    if (de) {
        dictFreeUnlinkedEntry(db->dict,de);
        if (server.cluster_enabled) slotToKeyDel(key);
        return 1;
    } else {
        return 0;
    }
}

檢查具體策略

訪問檢查

expireIfNeeded

外部讀寫命令/內部邏輯調用，基本所有的鍵值讀寫操作都會觸發 expireIfNeeded 過期檢查。

db.c

int expireIfNeeded(redisDb *db, robj *key) {
    if (!keyIsExpired(db,key)) return 0;

    if (server.masterhost != NULL) return 1;

    server.stat_expiredkeys++;
    // 傳播數據更新，傳播到集羣中去，如果數據庫是 `aof` 格式存儲，更新落地 `aof` 文件。
    propagateExpire(db,key,server.lazyfree_lazy_expire);
    notifyKeyspaceEvent(NOTIFY_EXPIRED, "expired",key,db->id);
    return server.lazyfree_lazy_expire ? dbAsyncDelete(db,key) :
                                         dbSyncDelete(db,key);
}

void propagateExpire(redisDb *db, robj *key, int lazy) {
    robj *argv[2];

    argv[0] = lazy ? shared.unlink : shared.del;
    argv[1] = key;
    incrRefCount(argv[0]);
    incrRefCount(argv[1]);

    // aof 存儲，del/unlink 命令入庫
    if (server.aof_state != AOF_OFF)
        feedAppendOnlyFile(server.delCommand, db->id, argv, 2);
    // 同步 del/unlink 命令到從庫
    replicationFeedSlaves(server.slaves, db->id, argv, 2);

    decrRefCount(argv[0]);
    decrRefCount(argv[1]);
}

修改/刪除過期 key

部分命令會修改或刪除過期時間。

命令	描述
del	刪除指定 key 。
unlink	邏輯刪除指定 key，數據在線程異步刪除。
set	設置一個鍵的值，ex 選項可以設置過期時間
persist	移除 key 的過期時間
rename	重命名 key，會刪除原來 key 的過期時間。
flushdb	清空當前數據庫。
flushall	清空所有數據。
expire	設置 key 的過期時間秒數。
expireat	設置一個 UNIX 時間戳的過期時間。
pexpireat	設置key到期 UNIX 時間戳，以毫秒爲單位。

maxmemory 淘汰

超出最大內存 maxmemory，觸發數據淘汰。淘汰合適的數據，可以參考[redis 源碼走讀] maxmemory 數據淘汰策略
。

typedef struct redisObject {
    ...
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    ...
} robj;

int processCommand(client *c) {
    ...
    if (server.maxmemory && !server.lua_timedout) {
        int out_of_memory = freeMemoryIfNeededAndSafe() == C_ERR;
        ...
    }
    ...
}

int freeMemoryIfNeededAndSafe(void) {
    if (server.lua_timedout || server.loading) return C_OK;
    return freeMemoryIfNeeded();
}

事件觸發

在事件模型中，處理事件前，觸發快速檢查。將過期檢查負載分散到各個事件中去。

int main(int argc, char **argv) {
    ...
    aeSetBeforeSleepProc(server.el,beforeSleep);
    ...
    aeMain(server.el);
    ...
}

void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
    }
}

void beforeSleep(struct aeEventLoop *eventLoop) {
    ...
    if (server.active_expire_enabled && server.masterhost == NULL)
        activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST);
    ...
}

定期檢查

通過時鐘實現，定期檢查過期鍵值。

void initServer(void) {
    ...
    // 創建時鐘事件
    if (aeCreateTimeEvent(server.el, 1, serverCron, NULL, NULL) == AE_ERR) {
        serverPanic("Can't create event loop timers.");
        exit(1);
    }
    ...
}

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    ...
    databasesCron();
    ...
}

// 主庫中檢查即可，主庫會同步結果到從庫。
void databasesCron(void) {
    if (server.active_expire_enabled) {
        if (server.masterhost == NULL) {
            // 主庫慢速檢查
            activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW);
        } else {
            // 從庫如果設置了可寫功能。
            expireSlaveKeys();
        }
    }
    ...
}

redis 主邏輯在單進程主線程中實現，要保證不能影響主業務前提下，檢查過期數據，不能太影響系統性能。主要三方面進行限制：

檢查時間限制。
過期數據檢查數量限制。
過期數據是否達到可接受比例。

被檢查的數據到期了，系統會把該鍵值從字典中邏輯刪除，切斷數據與主邏輯聯繫。鍵值對應的數據，放到線程隊列，後臺線程進行異步回收（如果配置設置了異步回收）。

activeExpireCycle 檢查有“快速”和“慢速”兩種，時鐘定期檢查屬於慢速類型。慢速檢查被分配更多的檢查時間。在一個時間範圍內，到期數據最好不要太密集，因爲系統發現到期數據很多，會迫切希望儘快處理掉這些過期數據，所以每次檢查都要耗盡分配的時間片，直到到期數據到達一個可接受的密度比例。

#define CRON_DBS_PER_CALL 16 /* 每次檢查的數據庫個數 */

#define ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP 20 /* Keys for each DB loop. */
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* Microseconds. */
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* Max % of CPU to use. */
#define ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE 10 /* % of stale keys after which
                                                   we do extra efforts. */

void activeExpireCycle(int type) {
    /* Adjust the running parameters according to the configured expire
     * effort. The default effort is 1, and the maximum configurable effort
     * is 10. */
    unsigned long
    // 努力力度，默認 1，也就是遍歷過期字典的力度，力度越大，遍歷數量越多，但是性能損耗更多。
    effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
    // 每次循環遍歷鍵值個數。力度越大，遍歷個數越多。
    config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
                           ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
    // 快速遍歷時間範圍，力度越大，給予遍歷時間越多。
    config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
                                 ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
    // 慢速遍歷檢查時間片
    config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
                                  2*effort,
    // 已經到期數據 / 檢查數據 比例。達到可以接受的比例。
    config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE-
                                    effort;

    static unsigned int current_db = 0; /* Last DB tested. */
    // 檢查是否已經超時。
    static int timelimit_exit = 0;      /* Time limit hit in previous call? */
    // 上一次快速檢查數據起始時間。
    static long long last_fast_cycle = 0; /* When last fast cycle ran. */

    // iteration 迭代檢查個數，每 16 次循環遍歷，確認一下是否檢查超時。
    int j, iteration = 0;
    // 每次週期檢查的數據庫個數。redis 默認有 16 個庫。
    int dbs_per_call = CRON_DBS_PER_CALL;
    long long start = ustime(), timelimit, elapsed;

    /* 如果鏈接已經停止了，那麼要保留現場，不允許修改數據，也不允許到期淘汰數據。
     * 使用命令 ‘pause’ 暫停 redis 工作或者主服務正在進行從服務的故障轉移。*/
    if (clientsArePaused()) return;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
        /* 檢查還沒超時，但是到期數據密集度已經達到了可以接受的範圍，不要快速檢查了，
           畢竟它是快速的，留給其它方式的檢查。*/
        if (!timelimit_exit &&
            server.stat_expired_stale_perc < config_cycle_acceptable_stale)
            return;

        /* 限制快速檢查頻次，在兩個 config_cycle_fast_duration 內，只能執行一次快速檢查。 */
        if (start < last_fast_cycle + (long long)config_cycle_fast_duration*2)
            return;

        last_fast_cycle = start;
    }

    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum;

    /* 檢查過期數據，但是不能太損耗資源，得有個限制。server.hz 默認爲 10
       hz 是執行後臺任務的頻率，越大表明執行的次數越頻繁，一般用默認值 10 */
    timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;
    timelimit_exit = 0;
    if (timelimit <= 0) timelimit = 1;

    // 如果是快速模式，更改檢查週期時間。
    if (type == ACTIVE_EXPIRE_CYCLE_FAST)
        timelimit = config_cycle_fast_duration; /* in microseconds. */

    /* 過期數據一般是異步方式，檢查到過期數據，都是從字典中移除鍵值信息，
     * 避免再次使用，但是數據回收放在後臺回收，不是實時的，有數據有可能還存在數據庫裏。*/

    // 檢查數據個數。
    long total_sampled = 0;
    // 檢查數據，數據已經過期的個數。
    long total_expired = 0;

    for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
        unsigned long expired, sampled;
        redisDb *db = server.db+(current_db % server.dbnum);
        current_db++;

        // 遍歷數據庫檢查過期數據，直到超出檢查週期時間，或者過期數據比例已經很少了。
        do {
            // num 數據量，slots 哈希表大小（字典數據如果正在遷移，雙表大小）
            unsigned long num, slots;
            long long now, ttl_sum;
            int ttl_samples;
            iteration++;

            if ((num = dictSize(db->expires)) == 0) {
                db->avg_ttl = 0;
                break;
            }
            slots = dictSlots(db->expires);
            now = mstime();

            /* 過期存儲數據結構是字典，數據經過處理後，字典存儲的數據可能已經很少，
             * 但是字典還是大字典，這樣遍歷數據有效命中率會很低，處理起來會浪費資源，
             * 後面的訪問會很快觸發字典的縮容，縮容後再進行處理效率更高。*/
            if (num && slots > DICT_HT_INITIAL_SIZE &&
                (num*100/slots < 1)) break;

            // 過期的數據個數。
            expired = 0;
            // 檢查的數據個數。
            sampled = 0;
            // 沒有過期的數據時間差之和。
            ttl_sum = 0;
            // 沒有過期的數據個數。
            ttl_samples = 0;

            // 每次檢查的數據限制。
            if (num > config_keys_per_loop)
                num = config_keys_per_loop;

            /* 哈希表本質上是一個數組，可能有鍵值碰撞的數據，用鏈表將碰撞數據串聯起來，
             * 放在一個數組下標下，也就是放在哈希表的一個桶裏。max_buckets 是最大能檢查的桶個數。
             * 跳過空桶，不處理。*/
            long max_buckets = num*20;
            // 當前已經檢查哈希表桶的個數。
            long checked_buckets = 0;

            // 一個桶上有可能有多個數據。所以檢查從兩方面限制：一個是數據量，一個是桶的數量。
            while (sampled < num && checked_buckets < max_buckets) {
                for (int table = 0; table < 2; table++) {
                    // 如果 dict 沒有正在進行擴容，不需要檢查它的第二張表了。
                    if (table == 1 && !dictIsRehashing(db->expires)) break;

                    unsigned long idx = db->expires_cursor;
                    idx &= db->expires->ht[table].sizemask;
                    dictEntry *de = db->expires->ht[table].table[idx];
                    long long ttl;

                    checked_buckets++;
                    while(de) {
                        dictEntry *e = de;
                        de = de->next;

                        // 檢查數據是否已經超時。
                        ttl = dictGetSignedIntegerVal(e)-now;

                        // 如果數據過期了，進行回收處理。
                        if (activeExpireCycleTryExpire(db,e,now)) expired++;
                        if (ttl > 0) {
                            /* We want the average TTL of keys yet
                             * not expired. */
                            ttl_sum += ttl;
                            ttl_samples++;
                        }
                        sampled++;
                    }
                }
                db->expires_cursor++;
            }
            total_expired += expired;
            total_sampled += sampled;

            if (ttl_samples) {
                long long avg_ttl = ttl_sum/ttl_samples;

                /* Do a simple running average with a few samples.
                 * We just use the current estimate with a weight of 2%
                 * and the previous estimate with a weight of 98%. */
                if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
                // 對沒過期的數據，平均過期時間進行採樣，上一次統計的平均時間佔 98 %，本次佔 2%。
                db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
            }

            /* 避免檢查週期太長，當前數據庫每 16 次循環迭代檢查，檢查是否超時，超時退出。*/
            if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
                elapsed = ustime()-start;
                if (elapsed > timelimit) {
                    timelimit_exit = 1;
                    server.stat_expired_time_cap_reached_count++;
                    break;
                }
            }

            /* 當前數據庫，如果沒有檢查到數據，或者過期數據已經達到可接受比例
             * 就退出該數據庫檢查，進入到下一個數據庫檢查。*/
        } while (sampled == 0 ||
                 (expired*100/sampled) > config_cycle_acceptable_stale);
    }

    // 添加統計信息
    elapsed = ustime()-start;
    server.stat_expire_cycle_time_used += elapsed;
    latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);

    double current_perc;
    if (total_sampled) {
        current_perc = (double)total_expired/total_sampled;
    } else
        current_perc = 0;

    // 通過累加每次檢查的過期概率影響，保存過期數據佔數據比例。
    server.stat_expired_stale_perc = (current_perc*0.05)+
                                     (server.stat_expired_stale_perc*0.95);
}

刪除過期數據

int activeExpireCycleTryExpire(redisDb *db, dictEntry *de, long long now) {
    long long t = dictGetSignedIntegerVal(de);
    if (now > t) {
        sds key = dictGetKey(de);
        robj *keyobj = createStringObject(key,sdslen(key));

        propagateExpire(db,keyobj,server.lazyfree_lazy_expire);
        if (server.lazyfree_lazy_expire)
            dbAsyncDelete(db,keyobj);
        else
            dbSyncDelete(db,keyobj);
        notifyKeyspaceEvent(NOTIFY_EXPIRED, "expired", keyobj, db->id);
        trackingInvalidateKey(keyobj);
        decrRefCount(keyobj);
        server.stat_expiredkeys++;
        return 1;
    } else {
        return 0;
    }
}

總結

要熟悉字典 dict 的實現原理，dict 是 redis 常用的幾個基礎數據結構之一。
看了幾天源碼，大致理解了鍵值過期處理策略。很多細節，感覺理解還是不夠深刻，以後還是要結合實戰多思考。
redis 爲了保證系統的高性能，採取了很多巧妙的“分治策略”，例如鍵值過期檢查。過期數據檢查和處理流程看，它不是一個實時的操作，有一定的延時，這樣系統不能很好地保證數據一致性。有得必有失。
從定期回收策略的慢速檢查中，我們可以看到，redis 處理到期數據，通過採樣，判斷到期數據的密集度。到期數據越密集，處理時間越多。我們使用中，不應該把大量數據設置在同一個時間段到期。
redis.conf 配置裏面有比較詳細的過期鍵處理策略描述。很多細節，可以參考源碼註釋和文檔。文檔極其詳細，redis 作者的耐心，在開源項目中，是比較少見的 👍。例如：

############################# LAZY FREEING ####################################

# Redis has two primitives to delete keys. One is called DEL and is a blocking
# deletion of the object. It means that the server stops processing new commands
# in order to reclaim all the memory associated with an object in a synchronous
# way. If the key deleted is associated with a small object, the time needed
# in order to execute the DEL command is very small and comparable to most other
# O(1) or O(log_N) commands in Redis. However if the key is associated with an
# aggregated value containing millions of elements, the server can block for
# a long time (even seconds) in order to complete the operation.
#
# For the above reasons Redis also offers non blocking deletion primitives
# such as UNLINK (non blocking DEL) and the ASYNC option of FLUSHALL and
# FLUSHDB commands, in order to reclaim memory in background. Those commands
# are executed in constant time. Another thread will incrementally free the
# object in the background as fast as possible.
#
# DEL, UNLINK and ASYNC option of FLUSHALL and FLUSHDB are user-controlled.
# It's up to the design of the application to understand when it is a good
# idea to use one or the other. However the Redis server sometimes has to
# delete keys or flush the whole database as a side effect of other operations.
# Specifically Redis deletes objects independently of a user call in the
# following scenarios:
#
# 1) On eviction, because of the maxmemory and maxmemory policy configurations,
#    in order to make room for new data, without going over the specified
#    memory limit.
# 2) Because of expire: when a key with an associated time to live (see the
#    EXPIRE command) must be deleted from memory.
# 3) Because of a side effect of a command that stores data on a key that may
#    already exist. For example the RENAME command may delete the old key
#    content when it is replaced with another one. Similarly SUNIONSTORE
#    or SORT with STORE option may delete existing keys. The SET command
#    itself removes any old content of the specified key in order to replace
#    it with the specified string.
# 4) During replication, when a replica performs a full resynchronization with
#    its master, the content of the whole database is removed in order to
#    load the RDB file just transferred.
#
# In all the above cases the default is to delete objects in a blocking way,
# like if DEL was called. However you can configure each case specifically
# in order to instead release memory in a non-blocking way like if UNLINK
# was called, using the following configuration directives:

lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
replica-lazy-flush no

參考

更精彩內容，請關注我的博客：wenfh2020.com

[redis 源碼走讀] redis 過期策略

文章目錄

流程

策略概述

過期檢查

數據回收

同步

異步

檢查具體策略

訪問檢查

expireIfNeeded

修改/刪除過期 key

maxmemory 淘汰

事件觸發

定期檢查

總結

參考

【面試準備】又一次失敗的面試經歷，題目離譜～資深軟件測試工程師

[redis 源碼走讀] rdb 持久化 - 應用場景

第一次剪片

fork 進程測試 copy-on-write

變量的內存分佈（Linux）

github + jekyll 建自己的博客

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結