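1 Redis In-Memory Storage Structures

This article is based on an analysis of Redis v2.2.4.

1.1 Overall In-Memory Storage Layout

Redis supports multiple key-value databases (tables), each represented by a redisDb. The redisServer struct has a member redisDb *db; and, at startup, the server allocates an array of redisDb according to the number of databases set in the configuration file. After connecting, a client picks a redisDb with the SELECT command; if it does not, it defaults to the first redisDb in the array (index 0). Once a client has selected a redisDb, all of its subsequent operations run against that redisDb. The in-memory layout of redisDb is described in detail below.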
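The database selection described above can be sketched in a few lines of C. This is only an illustration: demo_db, demo_server, demo_client and the demo_* functions are hypothetical stand-ins, not the real redisDb, redisServer or client code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, pared-down stand-ins for redisDb, redisServer and a client. */
typedef struct { int id; } demo_db;
typedef struct { demo_db *db; int dbnum; } demo_server;
typedef struct { demo_db *db; } demo_client;

/* Allocate dbnum databases, as the server does at startup. Returns 0 on success. */
int demo_server_init(demo_server *s, int dbnum) {
    assert(dbnum > 0);
    s->db = calloc((size_t)dbnum, sizeof(*s->db));
    if (s->db == NULL) return -1;
    s->dbnum = dbnum;
    for (int i = 0; i < dbnum; i++) s->db[i].id = i;
    return 0;
}

/* A newly connected client starts on database 0. */
void demo_client_init(demo_client *c, demo_server *s) {
    c->db = &s->db[0];
}

/* SELECT: switch databases. An out-of-range index is rejected and the
 * current selection is left untouched. */
int demo_select(demo_client *c, demo_server *s, int id) {
    if (id < 0 || id >= s->dbnum) return -1;
    c->db = &s->db[id];
    return 0;
}
```

Every later command would then operate on c->db, which is why one pointer per client is enough to implement the "current database" notion.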

[Figure: schematic of the Redis in-memory storage structure]

Definition of redisDb:

    typedef struct redisDb
    {
        dict *dict;                 /* The keyspace for this DB */
        dict *expires;              /* Timeout of keys with a timeout set */
        dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP) */
        dict *io_keys;              /* Keys with clients waiting for VM I/O */
        dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
        int id;
    } redisDb;

Within redisDb, the dict member is the one that holds the actual stored data. dict is defined as follows:

    typedef struct dictEntry
    {
        void *key;
        void *val;
        struct dictEntry *next;
    } dictEntry;

    typedef struct dictType
    {
        unsigned int (*hashFunction)(const void *key);
        void *(*keyDup)(void *privdata, const void *key);
        void *(*valDup)(void *privdata, const void *obj);
        int (*keyCompare)(void *privdata, const void *key1, const void *key2);
        void (*keyDestructor)(void *privdata, void *key);
        void (*valDestructor)(void *privdata, void *obj);
    } dictType;

    /* This is our hash table structure. Every dictionary has two of this as we
     * implement incremental rehashing, for the old to the new table. */
    typedef struct dictht
    {
        dictEntry **table;
        unsigned long size;
        unsigned long sizemask;
        unsigned long used;
    } dictht;

    typedef struct dict
    {
        dictType *type;
        void *privdata;
        dictht ht[2];
        int rehashidx; /* rehashing not in progress if rehashidx == -1 */
        int iterators; /* number of iterators currently running */
    } dict;

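A dict is built mainly from struct dictht hash tables. It holds an array of two of them (dictht ht[2]) because Redis rehashes incrementally: while a rehash is pending, every operation such as hset or hget first performs N rehash steps. The work of what would otherwise be a one-shot rehash is thereby spread over many operations, so Redis does not suffer a sharp drop in service capacity while rehashing. Incremental rehashing needs the extra struct dictht to hold the new table while entries migrate out of the old one.

A struct dictht consists mainly of an array of struct dictEntry pointers; hash collisions are resolved by chaining.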
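To make the two-table idea concrete, here is a minimal sketch of a chained hash table with incremental rehashing. It is not the Redis dict: the names (minidict, hashtab, rehash_step), the integer keys, the modulo hash, the "grow when used >= size" rule and the one-bucket-per-operation step are all simplifying assumptions chosen for brevity.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct entry { unsigned long key; int val; struct entry *next; } entry;
typedef struct { entry **table; unsigned long size, used; } hashtab;
typedef struct { hashtab ht[2]; long rehashidx; } minidict; /* rehashidx == -1: idle */

static void table_init(hashtab *t, unsigned long size) {
    t->table = size ? malloc(size * sizeof(entry *)) : NULL;
    assert(size == 0 || t->table != NULL);
    for (unsigned long i = 0; i < size; i++) t->table[i] = NULL;
    t->size = size;
    t->used = 0;
}

void minidict_init(minidict *d) {
    table_init(&d->ht[0], 4);
    table_init(&d->ht[1], 0);
    d->rehashidx = -1;
}

/* Migrate one non-empty bucket from ht[0] to ht[1]; when ht[0] is drained,
 * ht[1] becomes the main table and the rehash ends. */
static void rehash_step(minidict *d) {
    hashtab *src = &d->ht[0], *dst = &d->ht[1];
    if (d->rehashidx < 0) return;
    if (src->used > 0) {
        /* Buckets below rehashidx are already empty, so one is found above. */
        while (src->table[d->rehashidx] == NULL) d->rehashidx++;
        for (entry *e = src->table[d->rehashidx], *next; e; e = next) {
            unsigned long i = e->key % dst->size;
            next = e->next;
            e->next = dst->table[i];      /* push onto the new chain */
            dst->table[i] = e;
            src->used--;
            dst->used++;
        }
        src->table[d->rehashidx++] = NULL;
    }
    if (src->used == 0) {
        free(src->table);
        d->ht[0] = d->ht[1];
        table_init(&d->ht[1], 0);
        d->rehashidx = -1;
    }
}

/* During a rehash a key may live in either table, so both are searched. */
static entry *lookup(minidict *d, unsigned long key) {
    for (int t = 0; t < 2; t++) {
        if (d->ht[t].size == 0) continue;
        for (entry *e = d->ht[t].table[key % d->ht[t].size]; e; e = e->next)
            if (e->key == key) return e;
    }
    return NULL;
}

int minidict_get(minidict *d, unsigned long key, int *val) {
    rehash_step(d);                       /* pay a little rehash work first */
    entry *e = lookup(d, key);
    if (e == NULL) return 0;
    *val = e->val;
    return 1;
}

void minidict_set(minidict *d, unsigned long key, int val) {
    rehash_step(d);
    entry *e = lookup(d, key);
    if (e != NULL) { e->val = val; return; }
    if (d->rehashidx < 0 && d->ht[0].used >= d->ht[0].size) {
        table_init(&d->ht[1], d->ht[0].size * 2);   /* start a rehash */
        d->rehashidx = 0;
    }
    hashtab *t = &d->ht[d->rehashidx >= 0 ? 1 : 0]; /* new keys go to ht[1] */
    e = malloc(sizeof(*e));
    assert(e != NULL);
    e->key = key;
    e->val = val;
    e->next = t->table[key % t->size];
    t->table[key % t->size] = e;
    t->used++;
}
```

Sending new keys to the second table while a rehash is running guarantees that the old table only shrinks, so the migration is certain to finish.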

In struct dictEntry, the key pointer points to the key string, stored as an sds, and the val pointer points to a struct redisObject, defined as follows:

    typedef struct redisObject
    {
        unsigned type:4;
        unsigned storage:2;     /* REDIS_VM_MEMORY or REDIS_VM_SWAPPING */
        unsigned encoding:4;
        unsigned lru:22;        /* lru time (relative to server.lruclock) */
        int refcount;
        void *ptr;
        /* VM fields are only allocated if VM is active, otherwise the
         * object allocation function will just allocate
         * sizeof(redisObjct) minus sizeof(redisObjectVM), so using
         * Redis without VM active will not have any overhead. */
    } robj;


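// type takes 4 bits and identifies the type of the value in a key-value pair. Redis currently supports five value types: string, list, set, zset and hash.

    /* Object types */
    #define REDIS_STRING 0
    #define REDIS_LIST 1
    #define REDIS_SET 2
    #define REDIS_ZSET 3
    #define REDIS_HASH 4
    #define REDIS_VMPOINTER 8

// storage takes 2 bits and indicates whether the value is in memory or swapped out to disk.
// encoding takes 4 bits and indicates how the value is encoded. There are currently 8 encodings: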

    /* Objects encoding. Some kind of objects like Strings and Hashes can be
     * internally represented in multiple ways. The 'encoding' field of the object
     * is set to one of this fields for this object. */
    #define REDIS_ENCODING_RAW 0        /* Raw representation */
    #define REDIS_ENCODING_INT 1        /* Encoded as integer */
    #define REDIS_ENCODING_HT 2         /* Encoded as hash table */
    #define REDIS_ENCODING_ZIPMAP 3     /* Encoded as zipmap */
    #define REDIS_ENCODING_LINKEDLIST 4 /* Encoded as regular linked list */
    #define REDIS_ENCODING_ZIPLIST 5    /* Encoded as ziplist */
    #define REDIS_ENCODING_INTSET 6     /* Encoded as intset */
    #define REDIS_ENCODING_SKIPLIST 7   /* Encoded as skiplist */

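/* If type is REDIS_STRING and the value is a number, it can be encoded as REDIS_ENCODING_INT to save memory.
 *
 * If type is REDIS_HASH, the number of entries is no more than the configured hash-max-zipmap-entries, and every value string is no longer than hash-max-zipmap-value, it can be stored as REDIS_ENCODING_ZIPMAP to save memory; otherwise a dict is used.
 *
 * If type is REDIS_LIST, the number of entries is no more than the configured list-max-ziplist-entries, and every value string is no longer than list-max-ziplist-value, it can be stored as REDIS_ENCODING_ZIPLIST to save memory; otherwise REDIS_ENCODING_LINKEDLIST is used.
 *
 * If type is REDIS_SET, every member can be represented as an integer, and the number of entries is no more than the configured set-max-intset-entries, it can be stored as REDIS_ENCODING_INTSET to save memory; otherwise a dict is used.
 *
 * lru: a timestamp (relative to server.lruclock)
 *
 * refcount: the reference count
 *
 * void *ptr: points to the memory block that holds the actual value. Its type can be string, set, zset, list or hash, and it is encoded in one of the encodings listed above.
 *
 * The type of a key's value is determined entirely by the client command: hset, for example, produces a REDIS_HASH value. The encoding, by contrast, is chosen automatically by Redis according to its configuration.
 */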
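As an illustration of the REDIS_STRING case, the check "is this string a number that can be stored as an integer" can be sketched as below. The function name string_is_long is hypothetical, and the round-trip comparison is one reasonable way to do the test rather than a copy of the Redis code: a string qualifies only if converting it to a long and printing it back yields exactly the same text, so the original string can always be rebuilt from the integer.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 and stores the number in *out when s is the canonical decimal
 * text of a long; returns 0 otherwise. */
int string_is_long(const char *s, long *out) {
    char buf[32];
    char *end;
    long v;

    if (*s == '\0') return 0;
    errno = 0;
    v = strtol(s, &end, 10);
    if (*end != '\0' || errno == ERANGE) return 0;

    /* Round-trip check: rejects " 12", "+12" and "012", which would not
     * print back identically and so cannot be replaced by the integer. */
    snprintf(buf, sizeof(buf), "%ld", v);
    if (strcmp(buf, s) != 0) return 0;

    *out = v;
    return 1;
}
```

With this check, "1024" can be held as an integer, while "012" must stay a raw string because the leading zero would otherwise be lost.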

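1.2 Dict Structure

The dict structure was already described in section 1.1 and is not repeated here.

1.3 zipmap Structure

If the type member of a redisObject is REDIS_HASH, the hash can be stored as REDIS_ENCODING_ZIPMAP to save memory as long as its number of entries is no more than the configured hash-max-zipmap-entries and every value string is no longer than hash-max-zipmap-value. Otherwise a dict is used.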
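The rule can be written as a small decision function. This is a sketch under assumptions: the type and function names are hypothetical, and the two limits are passed in explicitly to stand for the hash-max-zipmap-entries and hash-max-zipmap-value settings.

```c
#include <assert.h>
#include <stddef.h>

enum demo_hash_encoding { DEMO_ENC_ZIPMAP, DEMO_ENC_HT };

typedef struct {
    size_t max_zipmap_entries;  /* stands for hash-max-zipmap-entries */
    size_t max_zipmap_value;    /* stands for hash-max-zipmap-value */
} demo_hash_limits;

/* entries: number of fields in the hash after the write.
 * longest: length of the longest string stored in it.
 * Both limits must hold for the compact encoding to be kept. */
enum demo_hash_encoding demo_hash_encoding_for(const demo_hash_limits *lim,
                                               size_t entries, size_t longest) {
    if (entries > lim->max_zipmap_entries) return DEMO_ENC_HT;
    if (longest > lim->max_zipmap_value) return DEMO_ENC_HT;
    return DEMO_ENC_ZIPMAP;
}
```

Note that either limit being exceeded is enough to force the dict encoding, which is why the two conditions are joined by "and" in the text.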

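A zipmap is essentially a single byte array that stores keys and values one after another. A lookup walks the key-value pairs in order until it finds the key.

[Figure: zipmap layout]

To save memory, a few tricks are used when storing the lengths of keys and values. If the length of a key or value is less than ZIPMAP_BIGLEN (254), it is stored in one byte. If it is ZIPMAP_BIGLEN (254) or more, it is stored in 5 bytes: the first byte holds ZIPMAP_BIGLEN (254) and the following 4 bytes hold the length of the key or value.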
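The length encoding can be sketched as follows. The demo_* names are hypothetical; the sketch assumes a 4-byte unsigned int (which gives the 5-byte form mentioned above) and stores the wide length in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BIGLEN 254   /* same value as ZIPMAP_BIGLEN in the text */

/* Number of bytes needed to store a length field. */
size_t demo_len_bytes(unsigned int len) {
    return len < DEMO_BIGLEN ? 1 : 1 + sizeof(unsigned int);
}

/* Write len at p and return the number of bytes written. */
size_t demo_encode_len(unsigned char *p, unsigned int len) {
    if (len < DEMO_BIGLEN) {
        p[0] = (unsigned char)len;        /* short form: the length itself */
        return 1;
    }
    p[0] = DEMO_BIGLEN;                   /* marker: a wide length follows */
    memcpy(p + 1, &len, sizeof(len));     /* host byte order */
    return 1 + sizeof(len);
}

/* Read back a length written by demo_encode_len. */
unsigned int demo_decode_len(const unsigned char *p) {
    unsigned int len = p[0];
    if (len < DEMO_BIGLEN) return len;
    memcpy(&len, p + 1, sizeof(len));
    return len;
}
```

Since most hash fields are short, nearly every length costs a single byte, and only unusually long strings pay for the wide form.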

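A freshly initialized zipmap is only 2 bytes long. The first byte holds the number of key-value pairs in the zipmap (once there are 254 or more pairs the byte stays at 254, and the real count has to be computed with zipmapLen()); the second byte is the end marker 0xFF.
- After hset(nick,wuzhu):
Byte 1 holds the number of key-value pairs (the zipmap entry count): 1
Byte 2 holds key_len: 4
Bytes 3-6 hold the key "nick"
Byte 7 holds value_len: 5
Byte 8 holds the number of free bytes, empty_len: 0. (When the value of this key is overwritten, the new value may differ in length from the old one. If the new value is longer, realloc grows the block. If it is shorter and the slack reaches 4 bytes, realloc shrinks the block; if the slack is under 4 bytes, the block keeps its size and empty_len records the number of unused bytes.)
Bytes 9-13 hold the value "wuzhu"
Byte 14 is the end marker 0xFF
- After hset(age,30):
The key-value pair ("age", 30) is inserted.
- After hset(nick,tide):
The pair ("nick", "tide") replaces the old value, after which empty_len is 1.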
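The byte layout after the first hset can be reproduced with a short sketch. demo_zipmap_single is a hypothetical helper that only handles a single pair with short strings (lengths below 254); it is meant to make the offsets in the walkthrough checkable, not to mirror the real zipmap code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ZIPMAP_END 0xFF

/* Fill buf with a zipmap that holds one short key-value pair and return its
 * total size. buf must have room for strlen(key) + strlen(val) + 5 bytes. */
size_t demo_zipmap_single(unsigned char *buf, const char *key, const char *val) {
    size_t klen = strlen(key), vlen = strlen(val), n = 0;
    assert(klen < 254 && vlen < 254);    /* one-byte length form only */

    buf[n++] = 1;                        /* number of key-value pairs */
    buf[n++] = (unsigned char)klen;      /* key_len */
    memcpy(buf + n, key, klen);          /* key bytes, no terminator */
    n += klen;
    buf[n++] = (unsigned char)vlen;      /* value_len */
    buf[n++] = 0;                        /* empty_len: no free bytes */
    memcpy(buf + n, val, vlen);          /* value bytes, no terminator */
    n += vlen;
    buf[n++] = DEMO_ZIPMAP_END;          /* end marker */
    return n;
}
```

For ("nick", "wuzhu") this yields 14 bytes, so the whole hash costs 5 bytes of overhead on top of the 9 bytes of string data.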

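1.4 ziplist Structure

If the type member of a redisObject is REDIS_LIST, the list can be stored as REDIS_ENCODING_ZIPLIST to save memory as long as its number of elements is no more than the configured list-max-ziplist-entries and every element value string is no longer than list-max-ziplist-value. Otherwise a regular linked list is used.

A ziplist is essentially a doubly linked list laid out inside a single byte array.

[Figure: ziplist layout]

The ziplist header consists of 3 fields:
- ziplist_bytes: a uint32_t holding the total length of the byte array that makes up the ziplist, including the ziplist header.
- ziplist_tail_offset: a uint32_t recording the offset of the tail of the ziplist.
- ziplist_length: a uint16_t recording the number of elements in the ziplist.
A ziplist node also consists of 3 parts:
- prevrawlen: the number of bytes occupied by the previous ziplist node, including the bytes that hold its prevrawlen and currawlen fields and the bytes of its element value.
- currawlen&encoding: the number of bytes the current element value needs in raw form, together with the encoding used to store it in the ziplist. (For example, a value that can be converted to an integer, such as the "1024" in the original diagram, has a raw_len of 4 bytes but is stored in the ziplist as a 16-bit integer, taking 2 bytes.)
- the (encoded) value

The ziplist can be traversed in either direction using prevrawlen and currawlen&encoding.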
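The header fields can be read and written with a sketch like the one below. The demo_* names are hypothetical, the fields are assumed to sit back to back at the start of the byte array in the order listed above, and values are stored in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t bytes;        /* ziplist_bytes: total size, header included */
    uint32_t tail_offset;  /* ziplist_tail_offset: offset of the last node */
    uint16_t length;       /* ziplist_length: number of elements */
} demo_zl_header;

/* 4 + 4 + 2 = 10 bytes on the wire, regardless of struct padding. */
#define DEMO_ZL_HEADER_SIZE (2 * sizeof(uint32_t) + sizeof(uint16_t))

void demo_zl_write_header(unsigned char *zl, const demo_zl_header *h) {
    memcpy(zl, &h->bytes, sizeof(h->bytes));
    memcpy(zl + 4, &h->tail_offset, sizeof(h->tail_offset));
    memcpy(zl + 8, &h->length, sizeof(h->length));
}

demo_zl_header demo_zl_read_header(const unsigned char *zl) {
    demo_zl_header h;
    memcpy(&h.bytes, zl, sizeof(h.bytes));
    memcpy(&h.tail_offset, zl + 4, sizeof(h.tail_offset));
    memcpy(&h.length, zl + 8, sizeof(h.length));
    return h;
}

/* Jump straight to the last node without walking the list. */
unsigned char *demo_zl_tail(unsigned char *zl) {
    return zl + demo_zl_read_header(zl).tail_offset;
}
```

The tail offset is what makes operations on the end of the list cheap: the last node is reachable in one step, and prevrawlen then allows walking backwards from it.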

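ziplist also uses a few tricks to save memory.

Length storage: if a length is less than ZIP_BIGLEN (254), it is stored in one byte; otherwise it takes 5 bytes, with the first byte holding ZIP_BIGLEN as a marker.
Value storage: if a value is numeric, it is converted according to its range and stored as ZIP_INT_16B, ZIP_INT_32B or ZIP_INT_64B; otherwise it is stored in raw form.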
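Choosing the integer width by range can be sketched as follows. The enum and function names are hypothetical, and giving each enumerator the byte width as its value is just a convenience of this sketch, not the encoding constants used by ziplist.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels; each value is the number of bytes the integer takes. */
enum demo_zip_enc { DEMO_INT_16B = 2, DEMO_INT_32B = 4, DEMO_INT_64B = 8 };

/* Pick the narrowest integer type that can hold v. */
enum demo_zip_enc demo_int_encoding(int64_t v) {
    if (v >= INT16_MIN && v <= INT16_MAX) return DEMO_INT_16B;
    if (v >= INT32_MIN && v <= INT32_MAX) return DEMO_INT_32B;
    return DEMO_INT_64B;
}
```

The string "1024" therefore shrinks from 4 raw bytes to 2, while a value such as 5000000000 still needs the full 8 bytes.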

1.5 adlist Structure

    typedef struct listNode
    {
        struct listNode *prev;
        struct listNode *next;
        void *value;
    } listNode;

    typedef struct listIter
    {
        listNode *next;
        int direction;
    } listIter;

    typedef struct list
    {
        listNode *head;
        listNode *tail;
        void *(*dup)(void *ptr);
        void (*free)(void *ptr);
        int (*match)(void *ptr, void *key);
        unsigned int len;
    } list;

This is a conventional doubly linked list, so it is not analyzed further.
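For completeness, appending to the tail of such a list can be sketched as below. demo_node and demo_list are hypothetical, trimmed-down versions of listNode and list (the dup/free/match callbacks are left out), so this illustrates the structure rather than the adlist API.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct demo_node {
    struct demo_node *prev, *next;
    void *value;
} demo_node;

typedef struct {
    demo_node *head, *tail;
    unsigned int len;
} demo_list;

/* Append value at the tail in O(1). Returns the list, or NULL if the node
 * could not be allocated (the list is left unchanged in that case). */
demo_list *demo_list_add_tail(demo_list *l, void *value) {
    demo_node *n = malloc(sizeof(*n));
    if (n == NULL) return NULL;
    n->value = value;
    n->next = NULL;
    n->prev = l->tail;
    if (l->tail != NULL) l->tail->next = n;   /* link after the old tail */
    else l->head = n;                         /* first node: also the head */
    l->tail = n;
    l->len++;
    return l;
}
```

Keeping both head and tail pointers plus a length counter is what makes pushes at either end and length queries constant time.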

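1.6 intset Structure

An intset implements a set as a sorted array of integers. struct intset is defined as follows:

    typedef struct intset
    {
        uint32_t encoding;
        uint32_t length;
        int8_t contents[];
    } intset;

encoding: identifies whether the array is an array of int16_t, int32_t or int64_t. The array type is chosen according to the range of the values stored. It starts as int16_t, and the type of the whole array is determined dynamically by which of the ranges [INT16_MIN, INT16_MAX], [INT32_MIN, INT32_MAX] and [INT64_MIN, INT64_MAX] is needed to hold every value in the set. For example, if the set starts as int16_t and a value that only fits in [INT32_MIN, INT32_MAX] is added, the array holding the set is upgraded to an array of int32_t.
length: the number of values in the set.
contents: the integer array itself, stored inline directly after the header (a flexible array member, not a separate pointer).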
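The upgrade behaviour can be sketched as follows. The demo_* names are hypothetical, the encoding field simply holds the element width in bytes, and the insert position is found by a linear scan to keep the sketch short (a sorted array would normally be searched with binary search).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint32_t encoding; uint32_t length; int8_t contents[]; } demo_intset;

/* Smallest element width (in bytes) able to hold v. */
static uint32_t enc_for(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return (uint32_t)sizeof(int64_t);
    if (v < INT16_MIN || v > INT16_MAX) return (uint32_t)sizeof(int32_t);
    return (uint32_t)sizeof(int16_t);
}

/* Read element pos, interpreting the array with element width enc. */
static int64_t get_at(const demo_intset *is, uint32_t pos, uint32_t enc) {
    const int8_t *p = is->contents + (size_t)pos * enc;
    int64_t v64; int32_t v32; int16_t v16;
    if (enc == sizeof(int64_t)) { memcpy(&v64, p, sizeof(v64)); return v64; }
    if (enc == sizeof(int32_t)) { memcpy(&v32, p, sizeof(v32)); return v32; }
    memcpy(&v16, p, sizeof(v16));
    return v16;
}

/* Write element pos using the set's current encoding. */
static void set_at(demo_intset *is, uint32_t pos, int64_t v) {
    int8_t *p = is->contents + (size_t)pos * is->encoding;
    if (is->encoding == sizeof(int64_t)) { memcpy(p, &v, sizeof(v)); }
    else if (is->encoding == sizeof(int32_t)) { int32_t x = (int32_t)v; memcpy(p, &x, sizeof(x)); }
    else { int16_t x = (int16_t)v; memcpy(p, &x, sizeof(x)); }
}

demo_intset *demo_intset_new(void) {
    demo_intset *is = malloc(sizeof(*is));
    assert(is != NULL);
    is->encoding = (uint32_t)sizeof(int16_t);   /* start with the narrowest type */
    is->length = 0;
    return is;
}

int64_t demo_intset_get(const demo_intset *is, uint32_t pos) {
    return get_at(is, pos, is->encoding);
}

/* Insert value, keeping the array sorted and widening every element when the
 * new value does not fit. Returns the (possibly moved) set. */
demo_intset *demo_intset_add(demo_intset *is, int64_t value) {
    uint32_t old_enc = is->encoding, need = enc_for(value);
    uint32_t new_enc = need > old_enc ? need : old_enc;
    uint32_t n = is->length, pos = 0;

    while (pos < n && get_at(is, pos, old_enc) < value) pos++;
    if (pos < n && get_at(is, pos, old_enc) == value) return is;  /* present */

    is = realloc(is, sizeof(*is) + (size_t)(n + 1) * new_enc);
    assert(is != NULL);
    is->encoding = new_enc;
    /* Rewrite from the back so a widened or shifted element never
     * overwrites one that has not been read yet. */
    for (uint32_t i = n; i > 0; i--) {
        int64_t v = get_at(is, i - 1, old_enc);
        set_at(is, (i - 1 >= pos) ? i : i - 1, v);
    }
    set_at(is, pos, value);
    is->length = n + 1;
    return is;
}
```

The trade-off is visible in demo_intset_add: one large value makes every element wider, so the structure is at its most compact when all members fall in a small range.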

1.7 zset Structure

This section first introduces the concept of a skip list and then analyzes the zset implementation.

1.7.1 Introduction to Skip Lists

1.7.1.1 Sorted Linked Lists

1) Searching for a key in a sorted linked list

    /* Search for key x. Assumes a head sentinel before the first key and a
     * tail sentinel holding INT_MAX, so the loop always terminates. */
    cell *find(cell *head, int x)
    {
        cell *p = head;
        while (p->next->key < x) p = p->next;
        return p;
    }

Note: we return the element preceding either the element containing x, or the smallest element with a key larger than x (if x does not exist).

2) Inserting a key into a sorted linked list

    /* To insert 35 */
    cell *p = find(head, 35);
    cell *p1 = malloc(sizeof(cell));
    p1->key = 35;
    p1->next = p->next;
    p->next = p1;

3) Deleting a key from a sorted linked list

    /* To delete 37 */
    cell *p = find(head, 37);
    cell *p1 = p->next;
    if (p1->key == 37) {    /* unlink only if 37 is actually present */
        p->next = p1->next;
        free(p1);
    }
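The three operations can be combined into one self-contained sketch. The sl_* names and the use of INT_MIN and INT_MAX as sentinel keys are assumptions of this sketch; ordinary keys must lie strictly between the two sentinels.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

typedef struct cell { int key; struct cell *next; } cell;

/* Create an empty list: a head sentinel (INT_MIN) linked to a tail sentinel
 * (INT_MAX), so sl_find never runs off the end. */
cell *sl_new(void) {
    cell *head = malloc(sizeof(cell)), *tail = malloc(sizeof(cell));
    assert(head != NULL && tail != NULL);
    head->key = INT_MIN;
    head->next = tail;
    tail->key = INT_MAX;
    tail->next = NULL;
    return head;
}

/* Return the cell preceding the first cell whose key is >= x. */
cell *sl_find(cell *head, int x) {
    cell *p = head;
    while (p->next->key < x) p = p->next;
    return p;
}

int sl_contains(cell *head, int x) {
    return sl_find(head, x)->next->key == x;
}

void sl_insert(cell *head, int x) {
    assert(x > INT_MIN && x < INT_MAX);   /* sentinel keys are reserved */
    cell *p = sl_find(head, x);
    cell *p1 = malloc(sizeof(cell));
    assert(p1 != NULL);
    p1->key = x;
    p1->next = p->next;
    p->next = p1;
}

/* Returns 1 if x was found and removed, 0 if it was not in the list. */
int sl_delete(cell *head, int x) {
    assert(x > INT_MIN && x < INT_MAX);
    cell *p = sl_find(head, x);
    cell *p1 = p->next;
    if (p1->key != x) return 0;
    p->next = p1->next;
    free(p1);
    return 1;
}
```

Every operation starts with the same linear search, which is the O(n) cost that the skip list in the next section is designed to reduce.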

1.7.1.2 Skip List Definition

SKIP LIST: a data structure for maintaining a set of keys in sorted order.

Consists of several levels.

All keys appear in level 1.

Each level is a sorted list.

If key x appears in level i, then it also appears in all levels below i.

An element in level i points (via a down pointer) to the element with the same key in the level below.

In each level the keys -∞ and +∞ appear. (In our implementation, INT_MIN and INT_MAX.)

Top points to the smallest element in the highest level.

1.7.1.3 Skip List Operations

1) An empty SkipList

2) Finding an element with key x

    p = top;
    while (1)
    {
        while (p->next->key < x) p = p->next;
        if (p->down == NULL) return p->next;
        p = p->down;
    }

Observe that we return x, if it exists, or succ(x) if x is not in the SkipList.

3) Inserting a new element x

Determine k, the number of levels in which x participates (explained later).

Do find(x), and insert x at the appropriate places in the lowest k levels (after the elements at which the search path turns down or terminates).

Example: inserting 119, k=2

If k is larger than the current number of levels, add new levels (and update top).

Example: insert(119) when k=4

Determining k

k: the number of levels in which an element x participates.

Use a random function OurRnd() that returns 1 or 0 (True/False) with equal probability.

    k = 1;
    while (OurRnd()) k++;
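A short calculation shows why this coin-flipping rule keeps the structure small. Assuming OurRnd() is a fair coin with independent flips, k follows a geometric distribution:

```latex
\Pr[k = j] = \left(\tfrac{1}{2}\right)^{j}, \quad j \ge 1,
\qquad
\mathbb{E}[k] = \sum_{j=1}^{\infty} j \left(\tfrac{1}{2}\right)^{j} = 2,
\qquad
\Pr[k > j] = \left(\tfrac{1}{2}\right)^{j}.
```

Each element therefore occupies 2 cells on average, so n elements use about 2n cells in total. The expected number of elements that rise above level log2(n) is n · 2^(-log2 n) = 1, which is the intuition behind the O(log n) bound on the number of levels.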

Deleting a key x

Find x in all the levels in which it participates, and delete it using the standard "delete from a linked list" method.

If one or more of the upper levels are empty, remove them (and update top).

Facts about SkipList

The expected number of levels is O(log n)

(here n is the number of elements)

The expected time for insert/delete/find is O(log n)

The expected size (number of cells) is O(n)
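The operations above can be tried out with a compact sketch. Note the assumptions: instead of separate per-level lists joined by down pointers, each node here carries an array of forward pointers (a common equivalent layout), NULL plays the role of the +∞ sentinel, the level count is capped at SK_MAXLEVEL, and all sk_* names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define SK_MAXLEVEL 16

typedef struct sknode {
    int key;
    struct sknode *forward[SK_MAXLEVEL];  /* forward[i]: next node on level i */
} sknode;

typedef struct { sknode *header; int level; } skiplist;

/* Plays the role of OurRnd(): 1 or 0 with roughly equal probability. */
static int coin(void) { return rand() & 1; }

static int random_level(void) {
    int k = 1;
    while (k < SK_MAXLEVEL && coin()) k++;
    return k;
}

skiplist *sk_new(void) {
    skiplist *sl = malloc(sizeof(*sl));
    assert(sl != NULL);
    sl->header = malloc(sizeof(sknode));
    assert(sl->header != NULL);
    sl->header->key = INT_MIN;            /* header acts as the -inf key */
    for (int i = 0; i < SK_MAXLEVEL; i++) sl->header->forward[i] = NULL;
    sl->level = 1;
    return sl;
}

/* Walk right while the next key is smaller, then drop one level. */
int sk_contains(const skiplist *sl, int x) {
    const sknode *p = sl->header;
    for (int i = sl->level - 1; i >= 0; i--)
        while (p->forward[i] != NULL && p->forward[i]->key < x)
            p = p->forward[i];
    p = p->forward[0];
    return p != NULL && p->key == x;
}

void sk_insert(skiplist *sl, int x) {
    sknode *update[SK_MAXLEVEL], *p = sl->header;
    for (int i = sl->level - 1; i >= 0; i--) {
        while (p->forward[i] != NULL && p->forward[i]->key < x)
            p = p->forward[i];
        update[i] = p;                    /* where the search turned down */
    }
    if (p->forward[0] != NULL && p->forward[0]->key == x) return; /* no dups */

    int k = random_level();
    for (int i = sl->level; i < k; i++) update[i] = sl->header;   /* new levels */
    if (k > sl->level) sl->level = k;

    sknode *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->key = x;
    for (int i = 0; i < k; i++) {         /* splice into the lowest k levels */
        n->forward[i] = update[i]->forward[i];
        update[i]->forward[i] = n;
    }
}
```

Whatever levels the random draws produce, level 0 always contains every key in sorted order; the upper levels only shorten the search path.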

1.7.2 The Redis Skip List Implementation

    /* ZSETs use a specialized version of Skiplists */
    typedef struct zskiplistNode
    {
        robj *obj;
        double score;
        struct zskiplistNode *backward;
        struct zskiplistLevel
        {
            struct zskiplistNode *forward;
            unsigned int span;
        } level[];
    } zskiplistNode;

    typedef struct zskiplist
    {
        struct zskiplistNode *header, *tail;
        unsigned long length;
        int level;
    } zskiplist;

    typedef struct zset
    {
        dict *dict;
        zskiplist *zsl;
    } zset;

The zset implementation uses two data structures: a hash table and a skip list. The hash table is implemented with the Redis dict and exists mainly to keep lookups at O(1). The skip list keeps the elements ordered while guaranteeing O(log n) complexity for INSERT and REMOVE operations.

1) Initial state of a zset

The createZsetObject function creates and initializes a zset:

    robj *createZsetObject(void)
    {
        zset *zs = zmalloc(sizeof(*zs));
        robj *o;

        zs->dict = dictCreate(&zsetDictType,NULL);
        zs->zsl = zslCreate();
        o = createObject(REDIS_ZSET,zs);
        o->encoding = REDIS_ENCODING_SKIPLIST;
        return o;
    }

The zslCreate() function creates and initializes a skiplist. The maximum level of a skiplist is ZSKIPLIST_MAXLEVEL = 32.

    zskiplist *zslCreate(void)
    {
        int j;
        zskiplist *zsl;

        zsl = zmalloc(sizeof(*zsl));
        zsl->level = 1;
        zsl->length = 0;
        zsl->header = zslCreateNode(ZSKIPLIST_MAXLEVEL,0,NULL);
        for (j = 0; j < ZSKIPLIST_MAXLEVEL; j++) {
            zsl->header->level[j].forward = NULL;
            zsl->header->level[j].span = 0;
        }
        zsl->header->backward = NULL;
        zsl->tail = NULL;
        return zsl;
    }
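zslCreate() relies on a node constructor whose body is not shown in this article. Because level[] is a flexible array member, such a constructor would typically allocate the node and its level slots in one block. The sketch below illustrates that idea together with a capped random level draw; the demo_* names, the use of malloc and rand, and the promotion probability of 0.25 are assumptions of this sketch, not a copy of the Redis source.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAXLEVEL 32   /* ZSKIPLIST_MAXLEVEL in the text */
#define DEMO_P 0.25        /* assumed probability of rising one more level */

typedef struct demo_znode {
    double score;
    struct demo_znode *backward;
    struct demo_zlevel {
        struct demo_znode *forward;
        unsigned int span;
    } level[];              /* flexible array member: sized at allocation */
} demo_znode;

/* One allocation holds the node plus exactly `level` level slots, so a
 * level-1 node does not pay for 32 forward pointers. */
demo_znode *demo_znode_create(int level, double score) {
    assert(level >= 1 && level <= DEMO_MAXLEVEL);
    demo_znode *zn = malloc(sizeof(*zn) + (size_t)level * sizeof(struct demo_zlevel));
    if (zn == NULL) return NULL;
    zn->score = score;
    zn->backward = NULL;
    for (int j = 0; j < level; j++) {
        zn->level[j].forward = NULL;
        zn->level[j].span = 0;
    }
    return zn;
}

/* Geometric level draw, capped at DEMO_MAXLEVEL. */
int demo_random_level(void) {
    int level = 1;
    while (level < DEMO_MAXLEVEL &&
           (double)rand() < DEMO_P * ((double)RAND_MAX + 1.0))
        level++;
    return level;
}
```

Only the header node is created with the maximum number of levels, as in zslCreate() above; ordinary nodes would get a randomly drawn level and a correspondingly small allocation.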