[Redis source walkthrough] Compressed list (ziplist)

Compressed list

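Kudos to the author: the Redis source is heavily and carefully commented. Before reading the ziplist code, read the comment at the top of ziplist.c; it covers most of the data structure's design.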


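The ziplist is a doubly linked list designed mainly to save memory. It stores two kinds of values, strings and integers. Internally it manages one contiguous block of memory, and it supports pushing and popping entries at both the head and the tail. Because every write involves reallocating memory, the actual complexity depends on how much memory the list currently uses, so storing large amounts of data in it is generally not recommended. A sorted set, for example, is stored as either a ziplist or a skiplist depending on the size of its data.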

The ziplist is a specially encoded dually linked list that is designed to be very memory efficient. It stores both strings and integer values, where integers are encoded as actual integers instead of a series of characters. It allows push and pop operations on either side of the list in O(1) time. However, because every operation requires a reallocation of the memory used by the ziplist, the actual complexity is related to the amount of memory used by the ziplist.


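Principle

Compression principle: take `int a = 0` as an example. `a` is an integer variable and occupies 4 bytes, but the value 0 needs only a single bit. Spending 4 bytes (32 bits) of memory on it is wasteful. This idea gives a rough picture of the compression strategy; see the documentation and source for the details.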
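To make the idea concrete, here is a minimal sketch that picks the smallest payload size for a signed value, using the same size classes as the integer encodings listed later in this article. The function name `min_int_payload_bytes` is hypothetical and does not appear in ziplist.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from ziplist.c): number of payload bytes needed
 * for a signed value. Returns 0 when the value fits inside the encoding
 * byte itself (the 4-bit immediate range 0..12). */
size_t min_int_payload_bytes(int64_t v) {
    if (v >= 0 && v <= 12) return 0;                  /* 4-bit immediate */
    if (v >= INT8_MIN && v <= INT8_MAX) return 1;     /* 8-bit signed    */
    if (v >= INT16_MIN && v <= INT16_MAX) return 2;   /* 16-bit signed   */
    if (v >= -8388608 && v <= 8388607) return 3;      /* 24-bit signed   */
    if (v >= INT32_MIN && v <= INT32_MAX) return 4;   /* 32-bit signed   */
    return 8;                                         /* 64-bit signed   */
}
```

Under these assumptions, the value 0 costs no payload bytes at all, while a fixed `int64_t` would always cost 8.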

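Managing compressed data resembles data serialization, which is used all the time when transmitting data; the protobuf source is worth reading to see how data gets packed. Besides serializing data, a ziplist also has to support operations such as insertion and deletion, so it needs some extra structure for memory management.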


Structure

List structure

header + entries + end marker
<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
Structure definitions

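/* The size of a ziplist header: two 32 bit integers for the total
 * bytes count and last item offset. One 16 bit integer for the number
 * of items field. */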
#define ZIPLIST_HEADER_SIZE     (sizeof(uint32_t)*2+sizeof(uint16_t))

/* Size of the "end of ziplist" entry. Just one byte. */
#define ZIPLIST_END_SIZE        (sizeof(uint8_t))

/* Return total bytes a ziplist is composed of. */
#define ZIPLIST_BYTES(zl)       (*((uint32_t*)(zl)))

/* Return the offset of the last item inside the ziplist. */
#define ZIPLIST_TAIL_OFFSET(zl) (*((uint32_t*)((zl)+sizeof(uint32_t))))

/* Return the length of a ziplist, or UINT16_MAX if the length cannot be
 * determined without scanning the whole ziplist. */
#define ZIPLIST_LENGTH(zl)      (*((uint16_t*)((zl)+sizeof(uint32_t)*2)))

/* Special "end of ziplist" entry. */
#define ZIP_END 255
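As an illustration of the header layout these macros describe, the following sketch writes the image of an empty ziplist into a plain byte buffer. The names `put_le`, `get_le` and `empty_ziplist_image` are hypothetical helpers for this example; the field values (zlbytes 11, zltail 10, zllen 0) follow from the header and end sizes defined above.

```c
#include <stdint.h>

#define HDR_SIZE 10u   /* zlbytes (4) + zltail (4) + zllen (2) */
#define END_BYTE 0xFFu /* zlend */

/* Write the low n bytes of v at p in little-endian order. */
void put_le(unsigned char *p, uint32_t v, int n) {
    for (int i = 0; i < n; i++) p[i] = (unsigned char)(v >> (8 * i));
}

/* Read an n-byte little-endian unsigned value from p (n <= 4). */
uint32_t get_le(const unsigned char *p, int n) {
    uint32_t v = 0;
    for (int i = 0; i < n; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Fill buf (at least 11 bytes) with the image of an empty ziplist. */
void empty_ziplist_image(unsigned char *buf) {
    put_le(buf, HDR_SIZE + 1, 4);  /* zlbytes: header plus the end byte      */
    put_le(buf + 4, HDR_SIZE, 4);  /* zltail: no entries, so it is the zlend */
    put_le(buf + 8, 0, 2);         /* zllen: zero entries                    */
    buf[HDR_SIZE] = END_BYTE;      /* zlend                                  */
}
```

An empty list therefore costs 11 bytes, and every entry pushed later is placed between byte 10 and the end marker.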

entry

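Entry layout: <prevlen> <encoding> <entry-data>. When the value is small enough, <encoding> alone can hold it and <entry-data> is omitted, leaving <prevlen> <encoding>.


Ziplist entries are unusual. This is not a traditional linked list, where every node carries prev or next pointers that link the nodes together. A ziplist entry uses prevlen to locate the previous entry in memory, and because <encoding> stores the current entry's data type and data length, the next entry can be located by moving forward.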

prevlen

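Condition       Size      Format
< 254 bytes     1 byte    <prevlen from 0 to 253> <encoding> <entry-data>
>= 254 bytes    5 bytes   0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry-data>

prevlen is the length of the previous entry, stored at the start of the current entry. It takes either 1 byte or 5 bytes, as described in the table above.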

/* Return the number of bytes used to encode the length of the previous
 * entry. The length is returned by setting the var 'prevlensize'. */
#define ZIP_DECODE_PREVLENSIZE(ptr, prevlensize) do {                          \
    if ((ptr)[0] < ZIP_BIG_PREVLEN) {                                          \
        (prevlensize) = 1;                                                     \
    } else {                                                                   \
        (prevlensize) = 5;                                                     \
    }                                                                          \
} while(0);

prevlen: the length of the previous entry.
prevlensize: how many bytes are used to store prevlen (1 or 5).

/* Return the length of the previous element, and the number of bytes that
 * are used in order to encode the previous element length.
 * 'ptr' must point to the prevlen prefix of an entry (that encodes the
 * length of the previous entry in order to navigate the elements backward).
 * The length of the previous entry is stored in 'prevlen', the number of
 * bytes needed to encode the previous entry length are stored in
 * 'prevlensize'. */
#define ZIP_DECODE_PREVLEN(ptr, prevlensize, prevlen) do {                     \
    ZIP_DECODE_PREVLENSIZE(ptr, prevlensize);                                  \
    if ((prevlensize) == 1) {                                                  \
        (prevlen) = (ptr)[0];                                                  \
    } else if ((prevlensize) == 5) {                                           \
        assert(sizeof((prevlen)) == 4);                                    \
        memcpy(&(prevlen), ((char*)(ptr)) + 1, 4);                             \
        memrev32ifbe(&prevlen);                                                \
    }                                                                          \
} while(0);
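The two macros above only decode. As a sketch of both directions, the functions below encode and decode a prevlen field on a plain byte buffer, following the 1-byte and 5-byte rules from the table. `prevlen_encode` and `prevlen_decode` are hypothetical names for this example, and the bytes are written explicitly in little-endian order so the result does not depend on the host.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode prevlen at p; returns the number of bytes written (1 or 5). */
size_t prevlen_encode(unsigned char *p, uint32_t prevlen) {
    if (prevlen < 254) {
        p[0] = (unsigned char)prevlen;
        return 1;
    }
    p[0] = 0xFE;                                   /* marker for the 5-byte form */
    for (int i = 0; i < 4; i++)
        p[1 + i] = (unsigned char)(prevlen >> (8 * i));
    return 5;
}

/* Decode the prevlen field at p; returns the number of bytes it occupies. */
size_t prevlen_decode(const unsigned char *p, uint32_t *prevlen) {
    if (p[0] < 0xFE) {
        *prevlen = p[0];
        return 1;
    }
    *prevlen = (uint32_t)p[1] | (uint32_t)p[2] << 8 |
               (uint32_t)p[3] << 16 | (uint32_t)p[4] << 24;
    return 5;
}
```

Encoding 1027 this way yields the bytes fe 03 04 00 00, which is the prefix of the last entry in the debugging output later in this article.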

encoding

There are two kinds of encoding: string and integer.

The encoding field of the entry depends on the content of the entry. When the entry is a string, the first 2 bits of the encoding first byte will hold the type of encoding used to store the length of the string, followed by the actual length of the string. When the entry is an integer the first 2 bits are both set to 1. The following 2 bits are used to specify what kind of integer will be stored after this header. An overview of the different types and encodings is as follows. The first byte is always enough to determine the kind of entry.

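Strings

When the entry holds a string, the first two bits of <encoding> store the encoding type and the remaining bits store the string length. Three things can be read from <encoding>:

  1. The encoding type.
  2. The length of the entry's data.
  3. The length of <encoding> itself.

Encoding                                         Encoding size  String length            Description
|00pppppp|                                       1 byte         <= 63 bytes (6 bits)     One byte: the first 2 bits are 00, the remaining 6 bits hold the string length.
|01pppppp|qqqqqqqq|                              2 bytes        <= 16383 bytes (14 bits) The first 2 bits are 01, the following 14 bits hold the string length, stored big endian.
|10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt|   5 bytes        >= 16384 bytes           The first byte is the marker, the next 4 bytes hold the string length, stored big endian.

Check whether the first two bits of the byte are both 1; if they are not, the entry is a string. ZIP_STR_MASK = 1100 0000 (0xc0).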

/* Extract the encoding from the byte pointed by 'ptr' and set it into
 * 'encoding' field of the zlentry structure. */
#define ZIP_ENTRY_ENCODING(ptr, encoding) do {  \
    (encoding) = (ptr[0]); \
    if ((encoding) < ZIP_STR_MASK) (encoding) &= ZIP_STR_MASK; \
} while(0)

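Integers

When the entry holds an integer, the first two bits of <encoding> are both set to 1 and the next two bits store the integer type. Three things can be read from <encoding>:

  1. The encoding type.
  2. The integer type.
  3. The value itself, for 4-bit immediates.

First byte   Total size  Integer size  Description
|11000000|   3 bytes     2 bytes       int16_t
|11010000|   5 bytes     4 bytes       int32_t
|11100000|   9 bytes     8 bytes       int64_t
|11110000|   4 bytes     3 bytes       Integer encoded as 24 bit signed (3 bytes).
|11111110|   2 bytes     1 byte        Integer encoded as 8 bit signed (1 byte).
|1111xxxx|   1 byte      4 bits        4-bit immediate integer holding 0 to 12. Because 0000, 1110 and 1111 are taken, xxxx can only be 1 to 13, so the number is stored plus 1 and decoding subtracts 1.
|11111111|   1 byte      0 bits        End-of-list marker.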
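The first byte alone decides what kind of entry follows. A minimal sketch of that decision, together with the plus 1 / minus 1 rule for 4-bit immediates, might look like this. The enum and the function names are hypothetical; only the bit patterns come from the tables above.

```c
#include <stdint.h>

enum entry_kind { KIND_STR, KIND_INT, KIND_END };

/* Classify an entry by the first byte of its <encoding> field. */
enum entry_kind classify_encoding(unsigned char first) {
    if (first == 0xFF) return KIND_END;           /* end-of-list marker     */
    if ((first & 0xC0) == 0xC0) return KIND_INT;  /* top two bits are 11    */
    return KIND_STR;                              /* 00, 01 or 10 prefix    */
}

/* Encode 0..12 as an immediate (0xF1..0xFD). Returns 1 on success,
 * 0 when the value does not fit the immediate range. */
int imm_encode(int value, unsigned char *enc) {
    if (value < 0 || value > 12) return 0;
    *enc = (unsigned char)(0xF1 + value);         /* stored as value + 1    */
    return 1;
}

/* Decode an immediate encoding byte back to its value. */
int imm_decode(unsigned char enc) {
    return (enc & 0x0F) - 1;                      /* undo the + 1           */
}
```

For instance, the value 5 is stored as the single byte 0xf6, which is the byte that appears at the end of the debugging output later in this article.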

Encoding and decoding implementation


#define ZIP_STR_MASK 0xc0
#define ZIP_INT_MASK 0x30
#define ZIP_STR_06B (0 << 6)
#define ZIP_STR_14B (1 << 6)
#define ZIP_STR_32B (2 << 6)
#define ZIP_INT_16B (0xc0 | 0<<4)
#define ZIP_INT_32B (0xc0 | 1<<4)
#define ZIP_INT_64B (0xc0 | 2<<4)
#define ZIP_INT_24B (0xc0 | 3<<4)
#define ZIP_INT_8B 0xfe

/* Macro to determine if the entry is a string. String entries never start
 * with "11" as most significant bits of the first byte. */
#define ZIP_IS_STR(enc) (((enc) & ZIP_STR_MASK) < ZIP_STR_MASK)

/* Write the encoding header of the entry in 'p'. If p is NULL it just returns
 * the amount of bytes required to encode such a length. Arguments:
 *
 * 'encoding' is the encoding we are using for the entry. It could be
 * ZIP_INT_* or ZIP_STR_* or between ZIP_INT_IMM_MIN and ZIP_INT_IMM_MAX
 * for single-byte small immediate integers.
 *
 * 'rawlen' is only used for ZIP_STR_* encodings and is the length of the
 * string that this entry represents.
 *
 * The function returns the number of bytes used by the encoding/length
 * header stored in 'p'. */
unsigned int zipStoreEntryEncoding(unsigned char *p, unsigned char encoding, unsigned int rawlen) {
    unsigned char len = 1, buf[5];

    if (ZIP_IS_STR(encoding)) {
        /* Although encoding is given it may not be set for strings,
         * so we determine it here using the raw length. */
        if (rawlen <= 0x3f) {
            if (!p) return len;
            buf[0] = ZIP_STR_06B | rawlen;
        } else if (rawlen <= 0x3fff) {
            len += 1;
            if (!p) return len;
            buf[0] = ZIP_STR_14B | ((rawlen >> 8) & 0x3f);
            buf[1] = rawlen & 0xff;
        } else {
            len += 4;
            if (!p) return len;
            buf[0] = ZIP_STR_32B;
            buf[1] = (rawlen >> 24) & 0xff;
            buf[2] = (rawlen >> 16) & 0xff;
            buf[3] = (rawlen >> 8) & 0xff;
            buf[4] = rawlen & 0xff;
        }
    } else {
        /* Implies integer encoding, so length is always 1. */
        if (!p) return len;
        buf[0] = encoding;
    }

    /* Store this length at p. */
    memcpy(p,buf,len);
    return len;
}

/* Extract the encoding from the byte pointed by 'ptr' and set it into
 * 'encoding' field of the zlentry structure.
 * For string entries, keep only the top two bits; the other bits become 0. */
#define ZIP_ENTRY_ENCODING(ptr, encoding) do {  \
    (encoding) = (ptr[0]); \
    if ((encoding) < ZIP_STR_MASK) (encoding) &= ZIP_STR_MASK; \
} while(0)

#define ZIP_INT_IMM_MIN 0xf1    /* 11110001 */
#define ZIP_INT_IMM_MAX 0xfd    /* 11111101 */

/* Return bytes needed to store integer encoded by 'encoding'. */
unsigned int zipIntSize(unsigned char encoding) {
    switch(encoding) {
    case ZIP_INT_8B:  return 1;
    case ZIP_INT_16B: return 2;
    case ZIP_INT_24B: return 3;
    case ZIP_INT_32B: return 4;
    case ZIP_INT_64B: return 8;
    }
    if (encoding >= ZIP_INT_IMM_MIN && encoding <= ZIP_INT_IMM_MAX)
        return 0; /* 4 bit immediate */
    panic("Invalid integer encoding 0x%02X", encoding);
    return 0;
}

/* Store integer 'value' at 'p', encoded as 'encoding' */
void zipSaveInteger(unsigned char *p, int64_t value, unsigned char encoding) {
    int16_t i16;
    int32_t i32;
    int64_t i64;
    if (encoding == ZIP_INT_8B) {
        ((int8_t *)p)[0] = (int8_t)value;
    } else if (encoding == ZIP_INT_16B) {
        i16 = value;
        memcpy(p, &i16, sizeof(i16));
        memrev16ifbe(p);
    } else if (encoding == ZIP_INT_24B) {
        i32 = value << 8;
        memrev32ifbe(&i32);
        memcpy(p, ((uint8_t *)&i32) + 1, sizeof(i32) - sizeof(uint8_t));
    } else if (encoding == ZIP_INT_32B) {
        i32 = value;
        memcpy(p, &i32, sizeof(i32));
        memrev32ifbe(p);
    } else if (encoding == ZIP_INT_64B) {
        i64 = value;
        memcpy(p, &i64, sizeof(i64));
        memrev64ifbe(p);
    } else if (encoding >= ZIP_INT_IMM_MIN && encoding <= ZIP_INT_IMM_MAX) {
        /* Nothing to do, the value is stored in the encoding itself. */
    } else {
        assert(NULL);
    }
}
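The 24-bit branch of zipSaveInteger is the least obvious one: it shifts the value into the upper three bytes of a 32-bit integer and copies those. An equivalent way to look at it, sketched below with hypothetical helpers `save_int24` and `load_int24`, is to write the low three bytes in little-endian order and sign-extend bit 23 when reading back. The final cast assumes the usual two's complement conversion.

```c
#include <stdint.h>

/* Store the low 24 bits of v at p, least significant byte first. */
void save_int24(unsigned char *p, int32_t v) {
    uint32_t u = (uint32_t)v;
    p[0] = (unsigned char)(u & 0xFF);
    p[1] = (unsigned char)((u >> 8) & 0xFF);
    p[2] = (unsigned char)((u >> 16) & 0xFF);
}

/* Load a 24-bit signed value stored by save_int24. */
int32_t load_int24(const unsigned char *p) {
    uint32_t u = (uint32_t)p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16;
    if (u & 0x800000u) u |= 0xFF000000u;  /* sign-extend from bit 23 */
    return (int32_t)u;                    /* assumes two's complement */
}
```

Values outside the range -8388608 to 8388607 do not survive the round trip, which is why the encoder has to check the range before choosing this encoding.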

/* Decode the entry encoding type and data length (string length for strings,
 * number of bytes used for the integer for integer entries) encoded in 'ptr'.
 * The 'encoding' variable will hold the entry encoding, the 'lensize'
 * variable will hold the number of bytes required to encode the entry
 * length, and the 'len' variable will hold the entry length. */
#define ZIP_DECODE_LENGTH(ptr, encoding, lensize, len) do {                    \
    ZIP_ENTRY_ENCODING((ptr), (encoding));                                     \
    if ((encoding) < ZIP_STR_MASK) {                                           \
        if ((encoding) == ZIP_STR_06B) {                                       \
            (lensize) = 1;                                                     \
            (len) = (ptr)[0] & 0x3f;                                           \
        } else if ((encoding) == ZIP_STR_14B) {                                \
            (lensize) = 2;                                                     \
            (len) = (((ptr)[0] & 0x3f) << 8) | (ptr)[1];                       \
        } else if ((encoding) == ZIP_STR_32B) {                                \
            (lensize) = 5;                                                     \
            (len) = ((ptr)[1] << 24) |                                         \
                    ((ptr)[2] << 16) |                                         \
                    ((ptr)[3] <<  8) |                                         \
                    ((ptr)[4]);                                                \
        } else {                                                               \
            panic("Invalid string encoding 0x%02X", (encoding));               \
        }                                                                      \
    } else {                                                                   \
        (lensize) = 1;                                                         \
        (len) = zipIntSize(encoding);                                          \
    }                                                                          \
} while(0);

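Debugging

A good first step is to walk through the program logic in a debugger, watch how the data structure manages its memory, and get a feel for how interfaces such as ziplistNew and ziplistPush work.

To get the test build to compile, a small amount of code was added or removed.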

gcc -g ziplist.c sds.c zmalloc.c util.c sha1.c -o ziplist  -I../deps/lua/src
sudo gdb ziplist

Debugging an insertion in the middle

See the source of the ziplistInsert interface for details.

static unsigned char *createTestlist() {
    unsigned char *zl = ziplistNew();
    zl = ziplistPush(zl, (unsigned char*)"2", 1, ZIPLIST_TAIL);
    zl = ziplistPush(zl, (unsigned char*)"5", 1, ZIPLIST_TAIL);

    unsigned char test[1024];
    memset(test, 'a', sizeof(test));

    // Insert in the middle, before the second entry.
    unsigned char* p = ziplistIndex(zl, 0);
    p = ziplistNext(zl, p);
    zl = ziplistInsert(zl, p, test, sizeof(test));
    return zl;
}

int main() {
    unsigned char *zl = createTestlist();
    ziplistRepr(zl);
    zfree(zl);
}

Output

{total bytes 1046} {num entries 3}
{tail offset 1039}
{
        addr 0x7fb31680060a,
        index  0,
        offset    10,
        hdr+entry len:     2,
        hdr len 2,
        prevrawlen:     0,
        prevrawlensize:  1,
        payload     0
        bytes: 00|f3|
        [int]2
}
{
        addr 0x7fb31680060c,
        index  1,
        offset    12,
        hdr+entry len:  1027,
        hdr len 3,
        prevrawlen:     2,
        prevrawlensize:  1,
        payload  1024
        bytes: 02|44|00|61|61|...|61|
        [str]aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
}
{
        addr 0x7fb316800a0f,
        index  2,
        offset  1039,
        hdr+entry len:     6,
        hdr len 6,
        prevrawlen:  1027,
        prevrawlensize:  5,
        payload     0
        bytes: fe|03|04|00|00|f6|
        [int]5
}
{end}
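The dump can be checked by hand. The sketch below rebuilds the same 1046 bytes in a plain buffer, using only the values printed above, and then walks from the tail back to the head with prevlen alone. `build_dump_image`, `walk_backward` and `le32` are hypothetical helpers written for this article, not Redis functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DUMP_SIZE 1046

/* Read a 32-bit little-endian unsigned integer. */
uint32_t le32(const unsigned char *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Rebuild the byte image shown in the dump; zl must hold DUMP_SIZE bytes. */
void build_dump_image(unsigned char *zl) {
    static const unsigned char last[6] = {0xFE, 0x03, 0x04, 0x00, 0x00, 0xF6};
    memset(zl, 0, DUMP_SIZE);
    zl[0] = 0x16; zl[1] = 0x04;          /* zlbytes = 1046 (0x0416)          */
    zl[4] = 0x0F; zl[5] = 0x04;          /* zltail  = 1039 (0x040F)          */
    zl[8] = 3;                           /* zllen   = 3                      */
    zl[10] = 0x00; zl[11] = 0xF3;        /* entry 0: prevlen 0, immediate 2  */
    zl[12] = 0x02;                       /* entry 1: prevlen 2               */
    zl[13] = 0x44; zl[14] = 0x00;        /* 14-bit string length 1024        */
    memset(zl + 15, 'a', 1024);          /* payload, offsets 15..1038        */
    memcpy(zl + 1039, last, sizeof(last)); /* entry 2: prevlen 1027, imm. 5  */
    zl[1045] = 0xFF;                     /* zlend                            */
}

/* Walk from the tail entry to the head, recording each entry offset.
 * Returns the number of offsets stored (at most max). */
size_t walk_backward(const unsigned char *zl, uint32_t *offsets, size_t max) {
    uint32_t off = le32(zl + 4);         /* start at zltail                  */
    size_t n = 0;
    if (zl[off] == 0xFF) return 0;       /* empty list: tail is the zlend    */
    while (n < max) {
        const unsigned char *p = zl + off;
        uint32_t prevlen = (p[0] < 0xFE) ? p[0] : le32(p + 1);
        offsets[n++] = off;
        if (prevlen == 0) break;         /* the first entry has prevlen 0    */
        off -= prevlen;
    }
    return n;
}
```

Walking backward visits offsets 1039, 12 and 10, the same offsets ziplistRepr prints. The step from 1039 to 12 works only because the last entry stores 1027 in its 5-byte prevlen field.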

The figure mainly illustrates the parts that are hard to follow.
[Figure: insertion flow]


API

The sorted set source (t_zset.c) is a good place to see how the ziplist is used.

Inserting an entry

Inserts data at the position given by p.

/* Insert item at "p". */
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    // Get the current total length of the memory block.
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen;
    unsigned int prevlensize, prevlen = 0;
    size_t offset;
    int nextdiff = 0;
    unsigned char encoding = 0;
    long long value = 123456789; /* initialized to avoid warning. Using a value
                                    that is easy to see if for some reason
                                    we use it uninitialized. */
    zlentry tail;

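    // If p is not the end marker, read the previous entry's length from the entry at p. If p is the end marker, use the length of the tail entry (the tail entry is the last real entry, not the end marker).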
    /* Find out prevlen for the entry that is inserted. */
    if (p[0] != ZIP_END) {
        ZIP_DECODE_PREVLEN(p, prevlensize, prevlen);
    } else {
        // Tail entry
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {
            prevlen = zipRawEntryLength(ptail);
        }
    }

    // Get the content length; a string is first tried as an integer.
    /* See if the entry can be encoded */
    if (zipTryEncoding(s,slen,&value,&encoding)) {
        /* 'encoding' is set to the appropriate integer encoding */
        reqlen = zipIntSize(encoding);
    } else {
        /* 'encoding' is untouched, however zipStoreEntryEncoding will use the
         * string length to figure out how to encode it. */
        reqlen = slen;
    }
    /* We need space for both the length of the previous entry and
     * the length of the payload. */
    reqlen += zipStorePrevEntryLength(NULL,prevlen);
    reqlen += zipStoreEntryEncoding(NULL,encoding,slen);

    // The <prevlen> of the entry after the insertion point changes; nextdiff is the difference in the byte size of that <prevlen> field.
    /* When the insert position is not equal to the tail, we need to
     * make sure that the next entry can hold this entry's length in
     * its prevlen field. */
    int forcelarge = 0;
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;

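    // With nextdiff == -4 and reqlen < 4 the new total size would be smaller than curlen, so ziplistResize could shrink the buffer and lose data.
    // Keep the next entry's 5-byte <prevlen> instead (forcelarge); this also avoids repeated ziplistResize calls from a chain of <prevlen> size changes.
    // For this issue, see: https://segmentfault.com/a/1190000018878466?utm_source=tag-newest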
    if (nextdiff == -4 && reqlen < 4) {
        nextdiff = 0;
        forcelarge = 1;
    }

    /* Store offset because a realloc may change the address of zl. */
    offset = p-zl;
    zl = ziplistResize(zl,curlen+reqlen+nextdiff);
    p = zl+offset;

    /* Apply memory move when necessary and update tail offset. */
    if (p[0] != ZIP_END) {
        /* Subtract one because of the ZIP_END bytes */
        memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);

        /* Encode this entry's raw length in the next entry. */
        if (forcelarge)
            zipStorePrevEntryLengthLarge(p+reqlen,reqlen);
        else
            zipStorePrevEntryLength(p+reqlen,reqlen);

        /* Update offset for tail */
        ZIPLIST_TAIL_OFFSET(zl) =
            intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);

        /* When the tail contains more than one entry, we need to take
         * "nextdiff" in account as well. Otherwise, a change in the
         * size of prevlen doesn't have an effect on the *tail* offset. */
        zipEntry(p+reqlen, &tail);
        if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
        }
    } else {
        /* This element will be the new tail. */
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
    }

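    // Each <entry> stores <prevlen>. When the preceding entry changes, this entry's own length can change too, so the following entries may need their <prevlen> updated in a chain reaction.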
    /* When nextdiff != 0, the raw length of the next entry has changed, so
     * we need to cascade the update throughout the ziplist */
    if (nextdiff != 0) {
        offset = p-zl;
        zl = __ziplistCascadeUpdate(zl,p+reqlen);
        p = zl+offset;
    }

    /* Write the entry */
    p += zipStorePrevEntryLength(p,prevlen);
    p += zipStoreEntryEncoding(p,encoding,slen);
    if (ZIP_IS_STR(encoding)) {
        memcpy(p,s,slen);
    } else {
        zipSaveInteger(p,value,encoding);
    }
    ZIPLIST_INCR_LENGTH(zl,1);
    return zl;
}
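The cascade triggered by `nextdiff != 0` is easier to see on a simplified model. The sketch below tracks only each entry's total size and the width of its prevlen field, and grows fields from 1 to 5 bytes until an entry no longer needs it. It is a hypothetical model of the growing case only, not the logic of `__ziplistCascadeUpdate`, and the name `cascade_grow` is made up for this example.

```c
#include <stddef.h>

/* sizes[i]  : total size of entry i, including its prevlen field.
 * plsize[i] : width of entry i's prevlen field, 1 or 5.
 * prev_total: size of the entry now sitting in front of entry 0.
 * Returns how many entries had to grow by 4 bytes. */
size_t cascade_grow(size_t *sizes, unsigned char *plsize, size_t n,
                    size_t prev_total) {
    size_t grown = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 1-byte field holds lengths up to 253; a 5-byte field always fits. */
        if (prev_total < 254 || plsize[i] == 5) break;
        plsize[i] = 5;
        sizes[i] += 4;             /* this entry is now 4 bytes longer      */
        grown++;
        prev_total = sizes[i];     /* which may push the next entry over too */
    }
    return grown;
}
```

With entries of 250 bytes each, a single large insertion at the head pushes every following entry from 250 to 254 bytes, so the update keeps propagating until it reaches an entry that stays below 254 bytes.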

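Issues

  • Memory allocation
    Inserting into or deleting from a ziplist requires reallocating memory.

  • Coupling
    The ziplist manages data in contiguous memory and compresses it in order to cut memory overhead and reduce fragmentation. But prevlen, as part of every entry, couples each entry tightly to its neighbour: inserting or deleting an entry in the middle of the list may require walking and updating the <prevlen> of every entry after that position.

  • Efficiency
    The point of this list is compression. Being a list, insertion and deletion are inefficient because memory must be reallocated, and looking up an entry is O(n). The sorted set implementation therefore limits when a ziplist is used, through the thresholds below.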

redis.conf

zset-max-ziplist-entries 128
zset-max-ziplist-value 64

t_zset.c

void zaddGenericCommand(client *c, int flags) {
    ...
    zobj = lookupKeyWrite(c->db,key);
    if (zobj == NULL) {
        if (xx) goto reply_to_client; /* No key + XX option: nothing to do. */
        if (server.zset_max_ziplist_entries == 0 ||
            server.zset_max_ziplist_value < sdslen(c->argv[scoreidx+1]->ptr))
        {
            zobj = createZsetObject();
        } else {
            zobj = createZsetZiplistObject();
        }
        dbAdd(c->db,key,zobj);
    } else {
        if (zobj->type != OBJ_ZSET) {
            addReply(c,shared.wrongtypeerr);
            goto cleanup;
        }
    }
}

int zsetAdd(robj *zobj, double score, sds ele, int *flags, double *newscore) {
    ...
    zobj->ptr = zzlInsert(zobj->ptr,ele,score);
    if (zzlLength(zobj->ptr) > server.zset_max_ziplist_entries ||
        sdslen(ele) > server.zset_max_ziplist_value)
        zsetConvert(zobj,OBJ_ENCODING_SKIPLIST);
    ...
}
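  • Complexity
    Pointer-offset arithmetic is a real test of one's fundamentals. The ziplist implementation is fairly complex (for me, at least). A traditional doubly linked list would be much simpler to implement, could still reach the compression goal, and would couple entries less tightly.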
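The switch from ziplist to skiplist in the efficiency item above comes down to two comparisons. A minimal sketch, assuming the redis.conf values quoted earlier (128 entries, 64 bytes) and using a hypothetical function name `should_use_skiplist`:

```c
#include <stddef.h>

/* Return 1 when a sorted set should leave the ziplist encoding: either
 * it holds more elements than max_entries, or the element just added is
 * longer than max_value bytes. Mirrors the condition in zsetAdd. */
int should_use_skiplist(size_t entries, size_t value_len,
                        size_t max_entries, size_t max_value) {
    return entries > max_entries || value_len > max_value;
}
```

With the defaults, a set of 128 short members stays a ziplist, while the 129th member or a single 65-byte member triggers the conversion.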

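References

gdb中看內存(x命令) (Examining memory in gdb: the x command)
Redis的一個歷史bug及其後續改進 (A historical Redis bug and its follow-up improvements)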
Ziplist: insertion bug under particular conditions fixed.
