Memcached LRU eviction policy and a data loss problem

0x01 Problem description:

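There are two services. Service A first issues a get to memcached for picture c. If nothing comes back, it fetches picture c from the object storage system, caches it in memcached with an expiry of one week, and then returns the mc_key. Service B takes this mc_key and fetches the stored picture from memcached. The hand-off between the two is asynchronous.
Then a strange problem showed up in production: service A had already found the picture in memcached (the log printed: picture already in memcached), but when service B went to fetch it with the mc_key, the picture was gone. According to the logs, the two events were no more than about 50ms apart.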

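Startup command:
memcached -m 16384M -p 11211 -c 8096 -I 20M
The memcached version is 1.5.4.
Memory is large enough that an item at the head of the LRU should not be evicted within 50ms. We had been running 1.4.x and only saw this problem after the recent upgrade to 1.5.x.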

0x02 memcached memory model

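Now let's look at memcached's LRU.
First, the concepts you need to know:
slab: memcached has many of them, and they are used to grade memory by size. Slabs whose chunks have the same size are grouped into one class, called a slabclass. The struct looks like this: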


typedef struct {
    unsigned int size;      /* sizes of items */
    unsigned int perslab;   /* how many items per slab */
    // Holds the free items
    void *slots;           /* list of item ptrs */
    unsigned int sl_curr;   /* total free items in list */
 
    void *end_page_ptr;         /* pointer to next free item at end of page, or 0 */
    unsigned int end_page_free; /* number of items remaining at end of last alloced page */
 
    unsigned int slabs;     /* how many slabs were allocated for this class */
 
    void **slab_list;       /* array of slab pointers */
    unsigned int list_size; /* size of prev array */
 
    unsigned int killing;  /* index+1 of dying slab, or zero if none */
    size_t requested; /* The number of requested bytes */
} slabclass_t;

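chunk: a block of memory inside a slab. Slab classes are graded by chunk size. The chunk size grows from one class to the next, and the step is set by the growth factor, 1.25 by default.
One way to list the chunk sizes is to start memcached with -vv, which prints each slab class with its chunk size and perslab count.

The relationship between them is as follows.
A slabclass is defined by its chunk size, and every chunk within a slabclass has the same size. Each slabclass manages some memory: initially the system assigns each slabclass one slab, a block of memory whose size is 1M (configurable with -I). The slabclass then cuts the slab into chunks, so one slab yields 1M/chunk_size chunks.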
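
The arithmetic above can be sketched in a few lines of C. This is a simplified model, not memcached's own slab initialisation: the function name, the 8-byte alignment constant and the starting chunk size passed by the caller are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CHUNK_ALIGN 8  /* assumed alignment of chunk sizes */

/* Fill sizes[] and perslab[] for up to max classes and return the count.
 * first_size and page_size are illustrative inputs chosen by the caller. */
static int toy_build_slab_classes(size_t first_size, double factor,
                                  size_t page_size, size_t *sizes,
                                  size_t *perslab, int max) {
    assert(first_size > 0 && factor > 1.0 && max > 0);
    size_t size = first_size;
    int n = 0;
    while (n < max) {
        /* round the chunk size up to the alignment boundary */
        if (size % TOY_CHUNK_ALIGN)
            size += TOY_CHUNK_ALIGN - (size % TOY_CHUNK_ALIGN);
        if (size > page_size)
            break;                        /* a chunk must fit in one slab */
        sizes[n] = size;
        perslab[n] = page_size / size;    /* chunks cut from one slab */
        n++;
        size = (size_t)((double)size * factor);
    }
    return n;
}
```

Starting from a 96-byte chunk with factor 1.25 and a 1M slab, the first classes come out as 96, 120, 152 and 192 bytes, holding 10922, 8738, 6898 and 5461 chunks per slab.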

item: the struct in which memcached stores data, and also the node of the LRU linked list. The data itself lives inside the item. The struct is:

/**
 * Structure for storing items within memcached.
 */
typedef struct _stritem {
    struct _stritem *next;
    struct _stritem *prev;
    struct _stritem *h_next;    /* hash chain next */
    rel_time_t      time;       /* least recent access */
    rel_time_t      exptime;    /* expire time */
    int             nbytes;     /* size of data */
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    uint8_t         slabs_clsid;/* which slab class we're in */
    uint8_t         nkey;       /* key length, w/terminating null and padding */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    union {
        uint64_t cas;
        char end;
    } data[];
    /* if it_flags & ITEM_CAS we have 8 bytes CAS */
    /* then null-terminated key */
    /* then " flags length\r\n" (no terminating null) */
    /* then data with terminating \r\n (no terminating null; it's binary!) */
} item;

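In memory an item is laid out roughly as follows: the fixed header, an optional 8-byte CAS value, the key, the suffix, and then the data.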
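
To make the layout concrete, here is a minimal sketch of how the total footprint of one item could be computed from the fields in the struct above. The 48-byte header size is an assumption for illustration (the real value is sizeof(item) and depends on the build), and toy_item_ntotal is a hypothetical helper, not a memcached function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed header size, for illustration only. */
#define TOY_ITEM_HEADER 48

/* Total bytes one item occupies: header, optional 8-byte CAS,
 * key plus its terminating null, the suffix, then the data. */
static size_t toy_item_ntotal(uint8_t nkey, uint8_t nsuffix,
                              int nbytes, int use_cas) {
    assert(nbytes >= 0);
    size_t total = TOY_ITEM_HEADER + (size_t)nkey + 1
                 + (size_t)nsuffix + (size_t)nbytes;
    if (use_cas)
        total += sizeof(uint64_t);
    return total;
}
```

This total is what decides which slab class an item lands in: the item goes into the smallest class whose chunk size can hold it.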

memcached's LRU is a doubly linked list formed by linking items together.
Adding an item (note that a new node is always added directly at the head of the list):

static void do_item_link_q(item *it) { /* item is the new head */
    item **head, **tail;
    assert((it->it_flags & ITEM_SLABBED) == 0);

    head = &heads[it->slabs_clsid];         
    tail = &tails[it->slabs_clsid];
    assert(it != *head);
    assert((*head && *tail) || (*head == 0 && *tail == 0));
    it->prev = 0;                             
    it->next = *head;
    if (it->next) it->next->prev = it;
    *head = it;
    if (*tail == 0) *tail = it;
    sizes[it->slabs_clsid]++;                 
    return;
}

Removing an item:

static void do_item_unlink_q(item *it) {
    item **head, **tail;
    head = &heads[it->slabs_clsid];          
    tail = &tails[it->slabs_clsid];

    if (*head == it) {                        
        assert(it->prev == 0);
        *head = it->next;
    }
    if (*tail == it) {
        assert(it->next == 0);
        *tail = it->prev;
    }
    assert(it->next != it);
    assert(it->prev != it);

    if (it->next) it->next->prev = it->prev;
    if (it->prev) it->prev->next = it->next;
    sizes[it->slabs_clsid]--;                
    return;
}

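Overall, each slab class has its own LRU list, and heads[] and tails[], indexed by slabs_clsid, point to its two ends.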
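
For comparison with what follows, here is a self-contained sketch of the classic LRU "bump": on a hit, the node is unlinked and relinked at the head. It mirrors the shape of the two functions above but uses hypothetical toy_ types rather than memcached's item, heads[] and tails[].

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_node {
    struct toy_node *next, *prev;
    int key;
} toy_node;

typedef struct { toy_node *head, *tail; } toy_lru;

/* Insert n as the new head of the list. */
static void toy_link_head(toy_lru *l, toy_node *n) {
    n->prev = NULL;
    n->next = l->head;
    if (n->next) n->next->prev = n;
    l->head = n;
    if (l->tail == NULL) l->tail = n;
}

/* Detach n from the list, fixing head and tail if needed. */
static void toy_unlink(toy_lru *l, toy_node *n) {
    if (l->head == n) l->head = n->next;
    if (l->tail == n) l->tail = n->prev;
    if (n->next) n->next->prev = n->prev;
    if (n->prev) n->prev->next = n->next;
    n->next = n->prev = NULL;
}

/* Classic LRU bump: a hit moves the node back to the head. */
static void toy_bump(toy_lru *l, toy_node *n) {
    assert(l != NULL && n != NULL);
    toy_unlink(l, n);
    toy_link_head(l, n);
}
```

With this scheme a node that has just been read sits at the head and is the last candidate for eviction, which is the behaviour the problem description assumed.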

0x03 The segmented LRU in newer versions

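That is about enough background. In older versions memcached's LRU was a textbook implementation: a recently accessed item is bumped to the head. Newer versions changed this and split the LRU into segments (Segmented LRU). That change is what caused the problem described at the start, as the analysis below shows.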

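According to the official memcached documentation, the LRU is split into HOT, WARM, COLD and TEMP.
Why segment it? Mainly to reduce lock contention and improve efficiency.
Each item carries a flag in its metadata that indicates how active it is: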

FETCHED: set once an item has been requested.
ACTIVE: set when an item is requested a second time; cleared when the item is bumped or deleted.
INACTIVE: the item is not active.

Items move between the segments according to these rules:

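1. New items are always added to HOT.
2. An item that is accessed twice is marked active.
3. As new items keep arriving, an item eventually reaches the bottom of its list:
- if it is in the HOT LRU and active, it is moved straight to WARM; otherwise it goes to COLD;
- if it is in WARM and active, it is moved back to the head of the WARM list; otherwise it goes to COLD.
4. Items in COLD are the worst off: they are all inactive, and when memory is full some of them are evicted.
5. A COLD item that becomes active is put on a queue and moved to WARM asynchronously (note: asynchronously).
6. HOT and WARM are limited in size to N% of the memory of the slab class; COLD is unlimited.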

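Pay attention to the rule about COLD items that become active: the move to WARM is asynchronous, not immediate, so an item in COLD can still be evicted after it has become active.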
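
The transition rules can be condensed into one decision function. This is a sketch of the rules as stated in this article only: it ignores the size limits on HOT and WARM, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_HOT, TOY_WARM, TOY_COLD, TOY_EVICTED } toy_seg;

/* Where does an item that has reached the tail of `cur` go next?
 * need_memory: an allocation failed and something must be evicted. */
static toy_seg toy_next_segment(toy_seg cur, bool active, bool need_memory) {
    assert(cur == TOY_HOT || cur == TOY_WARM || cur == TOY_COLD);
    switch (cur) {
    case TOY_HOT:
        return active ? TOY_WARM : TOY_COLD;
    case TOY_WARM:
        /* active items go back to the head of WARM */
        return active ? TOY_WARM : TOY_COLD;
    default: /* TOY_COLD */
        if (need_memory)
            return TOY_EVICTED;  /* the active flag is not consulted */
        /* otherwise an active item is promoted later, asynchronously */
        return active ? TOY_WARM : TOY_COLD;
    }
}
```

The case that matters for the incident is toy_next_segment(TOY_COLD, true, true): the item is active, yet the result is TOY_EVICTED.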

Quoting another write-up, this is the flow when a new item arrives:

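1. do_item_alloc starts the memory allocation for the new item.
2. do_item_alloc_pull runs the allocation logic, making at most 10 attempts.
3. Inside do_item_alloc_pull, the code tries to allocate memory with slabs_alloc; if that fails, it calls lru_pull_tail to free an item from the LRU queue and make it available.
4. lru_pull_tail frees items from the LRU queue, which includes reclaiming expired items.
5. lru_pull_tail calls do_item_unlink_nolock to reclaim the item.
6. do_item_unlink_nolock calls do_item_unlink_q to remove the item from the LRU list, then do_item_remove to return it to the pool of usable items.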
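
The shape of this flow, try to allocate, evict from the tail on failure, retry a bounded number of times, can be shown with a toy pool. This is only a counting model under stated assumptions: toy_pool and toy_alloc are hypothetical, and evicting one COLD item is assumed to free exactly one chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int free_chunks;  /* chunks the allocator can still hand out */
    int cold_items;   /* items currently sitting in the COLD segment */
    int evicted;      /* eviction counter */
} toy_pool;

#define TOY_MAX_TRIES 10

/* Returns true if a chunk was obtained, evicting COLD items if needed. */
static bool toy_alloc(toy_pool *p) {
    assert(p != NULL);
    for (int i = 0; i < TOY_MAX_TRIES; i++) {
        if (p->free_chunks > 0) {   /* plain allocation succeeds */
            p->free_chunks--;
            return true;
        }
        if (p->cold_items == 0)     /* nothing left to reclaim */
            break;
        p->cold_items--;            /* evict the COLD tail */
        p->free_chunks++;           /* its chunk becomes available */
        p->evicted++;
    }
    return false;
}
```

With no free chunks and three COLD items, one call succeeds at the cost of one eviction; with nothing free and nothing to evict, it fails.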

The next two pieces of code show how a new item is handled when it arrives:

item *do_item_alloc(char *key, const size_t nkey, const unsigned int flags,
                    const rel_time_t exptime, const int nbytes) {
    uint8_t nsuffix;
    item *it = NULL;
    char suffix[40];

    size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix);
    unsigned int id = slabs_clsid(ntotal);
    unsigned int hdr_id = 0;

    if (ntotal > settings.slab_chunk_size_max) {
        int htotal = nkey + 1 + nsuffix + sizeof(item) + sizeof(item_chunk);
        if (settings.use_cas) {
            htotal += sizeof(uint64_t);
        }
        hdr_id = slabs_clsid(htotal);
        it = do_item_alloc_pull(htotal, hdr_id);
        if (it != NULL)
            it->it_flags |= ITEM_CHUNKED;
    } else {
        it = do_item_alloc_pull(ntotal, id);
    }

    // (a lot of code omitted here)
    return it;
}


item *do_item_alloc_pull(const size_t ntotal, const unsigned int id) {
    item *it = NULL;
    int i;

    for (i = 0; i < 10; i++) {
        uint64_t total_bytes;
        if (!settings.lru_segmented) {
            lru_pull_tail(id, COLD_LRU, 0, 0, 0, NULL);
        }

        // First try to allocate fresh memory
        it = slabs_alloc(ntotal, id, &total_bytes, 0);

        if (settings.temp_lru)
            total_bytes -= temp_lru_size(id);

        if (it == NULL) {
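            // Pay special attention here; this comes up again later. If memory is
            // full, items are evicted starting from the LRU tail. Note that
            // LRU_PULL_EVICT is passed in: it means evict outright instead of
            // returning an error.
            // Next, try lru_pull_tail to free an item from COLD_LRU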
            if (lru_pull_tail(id, COLD_LRU, total_bytes, LRU_PULL_EVICT, 0, NULL) <= 0) {
                if (settings.lru_segmented) {
                    // Finally, try lru_pull_tail to free an item from HOT_LRU
                    lru_pull_tail(id, HOT_LRU, total_bytes, 0, 0, NULL);
                } else {
                    break;
                }
            }
        } else {
            break;
        }
    }

    return it;
}

As shown, the code first tries to allocate memory. If that fails, it calls lru_pull_tail(id, COLD_LRU, total_bytes, LRU_PULL_EVICT, 0, NULL), which evicts the node at the tail of the COLD LRU:

int lru_pull_tail(const int orig_id, const int cur_lru,
        const uint64_t total_bytes, const uint8_t flags, const rel_time_t max_age,
        struct lru_pull_tail_return *ret_it) {
    item *it = NULL;
    int id = orig_id;
    int removed = 0;

    int tries = 5;
    item *search;
    item *next_it;
    void *hold_lock = NULL;
    unsigned int move_to_lru = 0;
    uint64_t limit = 0;

    id |= cur_lru;
    pthread_mutex_lock(&lru_locks[id]);

    // Start from the tail of the LRU queue for this slab class and segment
    search = tails[id];
    for (; tries > 0 && search != NULL; tries--, search=next_it) {
        
        next_it = search->prev;
        // Skip crawler placeholder items (no key, no data) and keep walking toward the head of the list.
        if (search->nbytes == 0 && search->nkey == 0 && search->it_flags == 1) {
            if (flags & LRU_PULL_CRAWL_BLOCKS) {
                pthread_mutex_unlock(&lru_locks[id]);
                return 0;
            }
            tries++;
            continue;
        }

        // Try to take the item's hash lock; if another worker holds it, skip this item and keep walking toward the head.
        uint32_t hv = hash(ITEM_key(search), search->nkey);
        if ((hold_lock = item_trylock(hv)) == NULL)
            continue;

        if (refcount_incr(search) != 2) {
            itemstats[id].lrutail_reflocked++;
            if (settings.tail_repair_time &&
                    search->time + settings.tail_repair_time < current_time) {
                itemstats[id].tailrepairs++;
                search->refcount = 1;

                do_item_unlink_nolock(search, hv);
                item_trylock_unlock(hold_lock);
                continue;
            }
        }

        if ((search->exptime != 0 && search->exptime < current_time)
            || item_is_flushed(search)) {
            itemstats[id].reclaimed++;
            if ((search->it_flags & ITEM_FETCHED) == 0) {
                itemstats[id].expired_unfetched++;
            }

            do_item_unlink_nolock(search, hv);
            STORAGE_delete(ext_storage, search);
            do_item_remove(search);
            item_trylock_unlock(hold_lock);
            removed++;

            continue;
        }

        /* If we're HOT_LRU or WARM_LRU and over size limit, send to COLD_LRU.
         * If we're COLD_LRU, send to WARM_LRU unless we need to evict
         */
        switch (cur_lru) {
            case HOT_LRU:
                limit = total_bytes * settings.hot_lru_pct / 100;
            case WARM_LRU:
                if (limit == 0)
                    limit = total_bytes * settings.warm_lru_pct / 100;
                if ((search->it_flags & ITEM_ACTIVE) != 0) {
                    search->it_flags &= ~ITEM_ACTIVE;
                    removed++;
                    if (cur_lru == WARM_LRU) {
                        itemstats[id].moves_within_lru++;
                        do_item_update_nolock(search);
                        do_item_remove(search);
                        item_trylock_unlock(hold_lock);
                    } else {
                        itemstats[id].moves_to_warm++;
                        move_to_lru = WARM_LRU;
                        do_item_unlink_q(search);
                        it = search;
                    }
                } else if (sizes_bytes[id] > limit ||
                           current_time - search->time > max_age) {
                    itemstats[id].moves_to_cold++;
                    move_to_lru = COLD_LRU;
                    do_item_unlink_q(search);
                    it = search;
                    removed++;
                    break;
                } else {
                    /* Don't want to move to COLD, not active, bail out */
                    it = search;
                }
                break;
            case COLD_LRU:
                // The key part is right here
                it = search; /* No matter what, we're stopping */
                if (flags & LRU_PULL_EVICT) {
               
                    if (settings.evict_to_free == 0) {
                        /* Don't think we need a counter for this. It'll OOM.  */
                        break;
                    }
                    itemstats[id].evicted++;
                    itemstats[id].evicted_time = current_time - search->time;
                    if (search->exptime != 0)
                        itemstats[id].evicted_nonzero++;
                    if ((search->it_flags & ITEM_FETCHED) == 0) {
                        itemstats[id].evicted_unfetched++;
                    }
                    // With LRU_PULL_EVICT the item is removed even if it is in the active state
                    if ((search->it_flags & ITEM_ACTIVE)) {
                        // Visible with: stats | grep evicted_active
                        itemstats[id].evicted_active++;
                    }
                    LOGGER_LOG(NULL, LOG_EVICTIONS, LOGGER_EVICTION, search);
                    STORAGE_delete(ext_storage, search);
                    // Forced removal
                    do_item_unlink_nolock(search, hv);
                    removed++;
                    if (settings.slab_automove == 2) {
                        slabs_reassign(-1, orig_id);
                    }
                } else if (flags & LRU_PULL_RETURN_ITEM) {
                    /* Keep a reference to this item and return it. */
                    ret_it->it = it;
                    ret_it->hv = hv;
                } else if ((search->it_flags & ITEM_ACTIVE) != 0
                        && settings.lru_segmented) {
                    itemstats[id].moves_to_warm++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    move_to_lru = WARM_LRU;
                    do_item_unlink_q(search);
                    removed++;
                }
                break;
            case TEMP_LRU:
                it = search; /* Kill the loop. Parent only interested in reclaims */
                break;
        }
        if (it != NULL)
            break;
    }

    pthread_mutex_unlock(&lru_locks[id]);

    if (it != NULL) {
        if (move_to_lru) {
            it->slabs_clsid = ITEM_clsid(it);
            it->slabs_clsid |= move_to_lru;
            item_link_q(it);
        }
        if ((flags & LRU_PULL_RETURN_ITEM) == 0) {
            do_item_remove(it);
            item_trylock_unlock(hold_lock);
        }
    }

    return removed;
}


void do_item_unlink_nolock(item *it, const uint32_t hv) {
    MEMCACHED_ITEM_UNLINK(ITEM_key(it), it->nkey, it->nbytes);
    if ((it->it_flags & ITEM_LINKED) != 0) {
        it->it_flags &= ~ITEM_LINKED;
        STATS_LOCK();
        stats_state.curr_bytes -= ITEM_ntotal(it);
        stats_state.curr_items -= 1;
        STATS_UNLOCK();
        item_stats_sizes_remove(it);
        assoc_delete(ITEM_key(it), it->nkey, hv);
        // Remove the item from the LRU linked list
        do_item_unlink_q(it);
        // Release the item so it can return to the list of usable items
        do_item_remove(it);
    }
}

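As the commented lines in the code above show, once an item is in COLD and the pull is an eviction, the item is removed unconditionally, even if it is active.
The evicted_active counter in the stats output shows how often this happens (stats | grep evicted_active).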

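Segmented LRU is not enabled by default in 1.4.x, but it is enabled by default in 1.5. If you run 1.5 in evict mode, be aware that your data may be removed, even items that are set to never expire.
The -M option can be used instead (return an error when memory is exhausted rather than removing items).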
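
The failure from the problem description can be reproduced in a toy model of a full COLD segment. This sketch is a deliberate simplification and not memcached code: the segment is a small array, a hit only sets the active flag (the real promotion to WARM happens later on another thread), and a store into the full segment evicts the tail regardless of that flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_COLD_CAP 4

typedef struct { int key; bool active; } toy_item;

typedef struct {
    toy_item slots[TOY_COLD_CAP]; /* slots[0] is the head, last used slot the tail */
    int count;
    int evicted_active;           /* evictions that hit an active item */
} toy_cold;

/* A hit only marks the item active; it stays where it is. */
static bool toy_get(toy_cold *c, int key) {
    assert(c != NULL);
    for (int i = 0; i < c->count; i++) {
        if (c->slots[i].key == key) {
            c->slots[i].active = true;
            return true;
        }
    }
    return false;
}

/* A store into a full segment evicts the tail, active or not. */
static void toy_set(toy_cold *c, int key) {
    assert(c != NULL);
    if (c->count == TOY_COLD_CAP) {
        if (c->slots[c->count - 1].active)
            c->evicted_active++;
        c->count--;
    }
    for (int i = c->count; i > 0; i--)  /* shift right, insert at head */
        c->slots[i] = c->slots[i - 1];
    c->slots[0].key = key;
    c->slots[0].active = false;
    c->count++;
}
```

Store keys 1 to 4, read key 1 (service A sees "picture already in memcached"), then store key 5: key 1 is evicted while active, and the next read of key 1 (service B) misses.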

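(Some of the figures come from the references below.)
References:
Official blog post explaining the segmented LRU: https://memcached.org/blog/modern-lru/
A source code walkthrough: https://www.jianshu.com/p/bbd24ba0ad62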
https://toutiao.io/posts/5ivota/preview
https://github.com/memcached/memcached/blob/3b11d16b3f92c51bfedbb092147e1c2b225945ff/doc/new_lru.txt
https://www.cnblogs.com/zhoujinyi/p/5554083.html
https://blog.csdn.net/yxnyxnyxnyxnyxn/article/details/7869900
https://www.jianshu.com/p/a99ecc052756
