Note: the code discussed in this article is from the master branch as committed on 2018.05.07; the exact commit ID is c22fcba177bad2c755fdb6d4d52f2a799eceaf34.
Bihash overview
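Bihash (bounded-index extensible hash). In my view, its main characteristics can be summarized roughly as follows:
1. bihash supports key sizes such as 8/16/24/40/48 bytes, which reduces the use of _mm_crc32_u32/16 and the like, improving efficiency while avoiding a GCC bug;
2. bihash uses a 64-bit hash value and supports at most a two-level hash lookup: the first level finds the bucket, the second finds the page (the hash structure is analyzed in detail later);
3. the number of buckets and the number of pages are both powers of two, so a hash lookup needs only a bitwise AND;
4. a working_copy is used, with memory barriers placed before and after the bucket's working area is switched; this gives thread safety, so bihash lookups are lock-free and can still proceed while a modification is in progress;
5. freelists are used to reduce heap fragmentation and to speed up allocation;
6. a cache_lru scheme is used to speed up lookups.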
Bihash hash value computation
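When computing a hash value, bihash hashes the key using either the clib_crc32c or the clib_xxhash algorithm; on x86 the default is clib_crc32c. Key lengths are multiples of 8 so that _mm_crc32_u64 can be used as much as possible. This avoids falling back to _mm_crc32_u32 and similar functions, which trigger a GCC bug, and also speeds up the computation. The resulting hash value is 64 bits wide, structured as follows:
The low log2_nbuckets bits give the index of the bucket the key falls into, and the next log2_pages bits give the index of the specific page, i.e. a two-level bucket/page hash. Note that when hash collisions occur at the page level, that level may be flattened into a linear layout.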
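The bit layout described above can be sketched in C as follows. This is a minimal illustration rather than the vpp code: `hash_split_t` and `split_hash` are hypothetical names, and only the mask-and-shift steps mirror what the text describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the two indexes derived from one hash. */
typedef struct
{
  uint32_t bucket_index; /* taken from the low log2_nbuckets bits */
  uint32_t page_index;   /* taken from the next log2_pages bits */
} hash_split_t;

static hash_split_t
split_hash (uint64_t hash, uint32_t log2_nbuckets, uint32_t log2_pages)
{
  hash_split_t s;

  assert (log2_nbuckets < 32 && log2_pages < 32);
  /* Low bits select the bucket; same result as hash % nbuckets. */
  s.bucket_index = (uint32_t) (hash & ((1ULL << log2_nbuckets) - 1));
  /* Drop the bucket bits, then mask to select the page in that bucket. */
  hash >>= log2_nbuckets;
  s.page_index = (uint32_t) (hash & ((1ULL << log2_pages) - 1));
  return s;
}
```

For example, with 256 buckets (log2_nbuckets = 8) and 16 pages (log2_pages = 4), the hash 0xABCD lands in bucket 0xCD and page 0xB.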
Bihash data structures
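This section covers three data structures: BVT (clib_bihash_value), BVT (clib_bihash_bucket), and BVT (clib_bihash). BVT is a macro that maps a name to the corresponding bihash type, such as _8_8 or _16_8, where the first number is the key length and the second is the value length, both in bytes. When reading the code, BVT (clib_bihash_value) can therefore be treated directly as clib_bihash_value_8_8_t.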
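To make the naming scheme concrete, here is a small sketch of how such macros can be built with token pasting. The helper macros `PASTE`/`PASTE_T` and the `demo_` names are hypothetical; the sketch assumes, based on the type name clib_bihash_kv_8_8_t shown later, that BV produces a plain suffixed name and BVT additionally appends `_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Selects the flavor; "_8_8" means 8-byte key, 8-byte value. */
#define BIHASH_TYPE _8_8

/* Two-level expansion so BIHASH_TYPE is expanded before pasting. */
#define PASTE_(a, b) a##b
#define PASTE(a, b) PASTE_ (a, b)
#define PASTE_T_(a, b) a##b##_t
#define PASTE_T(a, b) PASTE_T_ (a, b)

#define BV(a) PASTE (a, BIHASH_TYPE)    /* function or struct-tag name */
#define BVT(a) PASTE_T (a, BIHASH_TYPE) /* type name with _t suffix */

typedef struct
{
  uint64_t key;
  uint64_t value;
} BVT (demo_kv); /* expands to demo_kv_8_8_t */

/* Expands to: static uint64_t demo_get_key_8_8 (const demo_kv_8_8_t *kv) */
static uint64_t BV (demo_get_key) (const BVT (demo_kv) * kv)
{
  assert (kv != 0);
  return kv->key;
}
```

Compiling the same source with a different BIHASH_TYPE would generate a separate set of names, which is how one template can serve several key/value sizes.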
BVT (clib_bihash_value) is the data structure that stores a number of kvps; vpp calls it a page. By default each page holds BIHASH_KVP_PER_PAGE (4 by default) kvps. The data structure is as follows:
typedef struct BV (clib_bihash_value)
{ // which member applies depends on where the page is referenced from: kvp (*buckets) or next_free (**free_list)
union
{
// the kvps actually stored
BVT (clib_bihash_kv) kvp[BIHASH_KVP_PER_PAGE];
// once the bucket has been freed, next_free chains the released memory together
struct BV (clib_bihash_value) * next_free;
};
} BVT (clib_bihash_value);
// kvp structure
typedef struct
{
u64 key; /**< the key */
u64 value; /**< the value */
} clib_bihash_kv_8_8_t;
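The reason a page can double as a free-list node is the union: a page that is in use stores kvps, and a page that has been released reuses the same bytes for a next pointer. A minimal sketch of that idea follows, with hypothetical names (`page_t`, `page_free`, `page_alloc`) and a single list instead of one list per size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KVP_PER_PAGE 4 /* stands in for BIHASH_KVP_PER_PAGE */

typedef struct
{
  uint64_t key;
  uint64_t value;
} kv_t;

typedef union page
{
  kv_t kvp[KVP_PER_PAGE]; /* live page: holds key/value pairs */
  union page *next_free;  /* freed page: link to the next free page */
} page_t;

/* Push a released page onto the head of the free list. */
static void
page_free (page_t **freelist, page_t *p)
{
  assert (p != NULL);
  p->next_free = *freelist;
  *freelist = p;
}

/* Pop a page from the free list, or return NULL if it is empty. */
static page_t *
page_alloc (page_t **freelist)
{
  page_t *p = *freelist;

  if (p != NULL)
    *freelist = p->next_free;
  return p;
}
```

Because the link lives inside the freed page itself, the free list needs no extra memory, and both operations are constant time.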
BVT (clib_bihash_bucket) is the bucket data structure; it records the details of one bucket. The data structure is as follows:
typedef struct
{
union
{
struct
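{ // offset from the start of the bihash's memory (heap)
u32 offset;
// whether the pages in this bucket are searched linearly instead of by hash
u8 linear_search;
// number of pages in this bucket (as log2)
u8 log2_pages;
// number of kvps in the bucket; the bucket is freed when it reaches 0
i16 refcnt;
};
u64 as_u64;
};
// per-bucket cache to speed up lookups; it only helps much when the same entries are looked up repeatedly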
#if BIHASH_KVP_CACHE_SIZE > 0
u16 cache_lru;
BVT (clib_bihash_kv) cache[BIHASH_KVP_CACHE_SIZE];
#endif
} BVT (clib_bihash_bucket);
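Overlaying the four fields with as_u64 lets a writer prepare a new bucket state in a local variable and then publish it with a single 64-bit assignment, which is how `b->as_u64 = tmp_b.as_u64` is used in the add/del code later in this article. A minimal sketch, assuming C11 anonymous members; `bucket_t` and `bucket_publish` are hypothetical names, and plain C does not by itself guarantee that the store is atomic (that depends on the target).

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  union
  {
    struct
    {
      uint32_t offset;
      uint8_t linear_search;
      uint8_t log2_pages;
      int16_t refcnt;
    };
    uint64_t as_u64; /* all four fields viewed as one 64-bit word */
  };
} bucket_t;

_Static_assert (sizeof (bucket_t) == sizeof (uint64_t),
                "bucket fields must fit in one 64-bit word");

/* Build the new state in a local copy, then publish it with one store. */
static void
bucket_publish (bucket_t *live, uint32_t offset, uint8_t log2_pages,
                int16_t refcnt)
{
  bucket_t tmp;

  assert (live != 0);
  tmp.as_u64 = 0; /* also clears linear_search */
  tmp.offset = offset;
  tmp.log2_pages = log2_pages;
  tmp.refcnt = refcnt;
  live->as_u64 = tmp.as_u64;
}
```

A reader that loads the bucket as one word therefore sees either the old state or the new one, never a mix of fields from both.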
BVT (clib_bihash) is the bihash data structure itself; it records all the information for the whole bihash table. The data structure is as follows:
typedef struct
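{ // as far as I can tell this member is unused; corrections welcome
BVT (clib_bihash_value) * values;
// pointer to the buckets
BVT (clib_bihash_bucket) * buckets;
// lock taken during modification
volatile u32 *writer_lock;
// copy area: during a modification, the data of the affected bucket is copied into working_copies
BVT (clib_bihash_value) ** working_copies;
int *working_copy_lengths;
// records the bucket that is currently being modified
BVT (clib_bihash_bucket) saved_bucket;
// number of buckets
u32 nbuckets;
u32 log2_nbuckets;
u8 *name;
// cache hit statistics
u64 cache_hits;
u64 cache_misses;
// collects released memory blocks so the next allocation can be served quickly
BVT (clib_bihash_value) ** freelists;
// heap pointer of the bihash
void *mheap; // heap address
// format function for this bihash, used when printing it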
format_function_t *fmt_fn;
} BVT (clib_bihash);
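The writer_lock member is used as a simple spin lock around modifications. The sketch below shows the pattern with the GCC/Clang `__sync` builtins; the function names are hypothetical, and the release uses `__sync_lock_release` where the add/del code later in this article uses a memory barrier followed by a plain store of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Spin until the previous value was 0, i.e. until we set it ourselves. */
static void
writer_lock_acquire (volatile uint32_t *lock)
{
  assert (lock != 0);
  while (__sync_lock_test_and_set (lock, 1))
    ;
}

/* Non-blocking variant: returns 1 if the lock was taken, 0 if it was held. */
static int
writer_lock_try (volatile uint32_t *lock)
{
  return __sync_lock_test_and_set (lock, 1) == 0;
}

/* Write 0 back with release semantics so earlier stores are visible first. */
static void
writer_lock_release (volatile uint32_t *lock)
{
  __sync_lock_release (lock);
}
```

Only writers take this lock; readers never touch it, which is why lookups stay lock-free.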
Bihash implementation details
Next, we briefly walk through the bihash implementation: the init process, the add/del process, the search process, and the free process.
init process
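In this process the init function mainly does five things:
initialize the parameters of the bihash structure;
create the bihash heap;
create the vec of buckets in the heap;
allocate memory for writer_lock;
for each bucket, initialize the cache index sequence cache_lru (0 1 2 3 4), if the cache is enabled.
After initialization, the memory layout is as shown in the figure below: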
void BV (clib_bihash_init)
(BVT (clib_bihash) * h, char *name, u32 nbuckets, uword memory_size)
{
void *oldheap;
int i;
// initialize the parameters
nbuckets = 1 << (max_log2 (nbuckets));
h->name = (u8 *) name;
h->nbuckets = nbuckets;
h->log2_nbuckets = max_log2 (nbuckets);
h->cache_hits = 0;
h->cache_misses = 0;
// allocate memory with mheap_alloc
if (h->mheap == NULL) /* Allow customized mheap, by Jordy */
h->mheap = mheap_alloc (0 /* use VM */ , memory_size);
// initialize that memory, including the vec of buckets, the lock, and so on
oldheap = clib_mem_set_heap (h->mheap);
vec_validate_aligned (h->buckets, nbuckets - 1, CLIB_CACHE_LINE_BYTES);
h->writer_lock = clib_mem_alloc_aligned (CLIB_CACHE_LINE_BYTES,
CLIB_CACHE_LINE_BYTES);
h->writer_lock[0] = 0;
// initialize the cache of every bucket
for (i = 0; i < nbuckets; i++)
BV (clib_bihash_reset_cache) (h->buckets + i);
clib_mem_set_heap (oldheap);
h->fmt_fn = NULL;
}
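The first step of init rounds nbuckets so that `hash & (h->nbuckets - 1)` is always a valid bucket index. The sketch below illustrates that rounding, assuming max_log2 returns the ceiling of log2 of its argument (which is how it is used here); `ceil_log2` and `round_up_pow2` are hypothetical re-implementations, not the vppinfra functions.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest l such that (1 << l) >= x. */
static uint32_t
ceil_log2 (uint32_t x)
{
  uint32_t l = 0;

  assert (x > 0 && x <= (1u << 31));
  while ((1u << l) < x)
    l++;
  return l;
}

/* Round n up to the next power of two (n itself if already one). */
static uint32_t
round_up_pow2 (uint32_t n)
{
  return 1u << ceil_log2 (n);
}
```

Under that assumption, asking for 1000 buckets yields a table of 1024 buckets with log2_nbuckets equal to 10.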
add/del process
We split the add/del process by operation (add or del) and by bucket state (empty or non-empty), and discuss the resulting four cases.
add to an empty bucket
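1. The bucket is empty, i.e. bucket->offset is 0;
2. Allocate kvp storage (a page) for the bucket: h->freelists is checked first, and if the freelist is found to be empty, the freelists vec is created, i.e. a vec of clib_bihash_value pointers;
3. Create the pages, i.e. a BVT (clib_bihash_value) object, set the bucket's offset to the offset of those pages from h->mheap, place the kvp in the page, and increment the bucket's refcnt by 1.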
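Storing an offset from the heap base instead of a pointer is what allows the bucket to fit its location into 32 bits, and it is also why an offset of 0 can mean "empty bucket". A minimal sketch of the two conversions, with hypothetical helper names modelled on the clib_bihash_get_offset and clib_bihash_get_value calls that appear in the code below:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a pointer inside the heap into a 32-bit offset from its base. */
static uint32_t
get_offset (const void *heap_base, const void *p)
{
  uintptr_t delta = (uintptr_t) p - (uintptr_t) heap_base;

  assert (delta <= UINT32_MAX); /* must fit the bucket's u32 offset field */
  return (uint32_t) delta;
}

/* Convert an offset back into a pointer inside the heap. */
static void *
get_value (void *heap_base, uint32_t offset)
{
  return (uint8_t *) heap_base + offset;
}
```

The round trip `get_value (base, get_offset (base, p)) == p` holds for any p inside the heap.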
The corresponding memory structure is shown in the figure:
del from an empty bucket
1. The bucket is empty, i.e. bucket->offset is 0;
2. Simply return -1.
add to a non-empty bucket
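1. Find the corresponding bucket (b0) and copy its pages into h->working_copies[thread_index];
2. Point b0 (i.e. bucket[0]->offset) at the working_copy, shown as b0 (during add) in the figure. Lookups then go to the working_copy area, while the add itself is performed on the area described by h->saved_bucket, which keeps the operation thread-safe;
a. No expansion needed, i.e. adding the new kvp causes no collision:
1. Once the add is complete, b0 points back at h->saved_bucket.
b. Expansion needed:
Expand to pages×2, taking the memory from the freelist. The area read at this point is the working_copy area (it serves only as the source being copied from, so queries are not affected).
There are two kinds of expansion: hash expansion and linear expansion.
1.1. Hash expansion: the existing kvps are rehashed into new_bucket. If a hashed page collides, hash expansion fails and new_pages is released, i.e. put back on the freelist; expansion is then attempted again (in the code below: one retry with a further doubling, then a fallback to linear expansion);
1.2. Linear expansion: the existing kvps are placed into new_bucket in order, and new_pages is marked with linear_search=1;
2. After expansion, b0's offset refers to new_pages; the kvp is placed in new_pages and the bucket's refcnt is incremented by 1;
3. Release h->saved_bucket, i.e. the original pages, by putting the address of that storage at the head of freelist[log2_pages].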
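The hash-expansion step can be sketched as follows. This is a simplified illustration under several assumptions: only keys are stored, an all-ones key marks an empty slot, `page_bits` is a hypothetical stand-in for the key's hash with the bucket bits already shifted out (here simply the key itself), and the old pages are scanned slot by slot. It is not the vpp split_and_rehash function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KVP_PER_PAGE 4
#define FREE_KEY UINT64_MAX /* all ones marks an empty slot */

typedef struct
{
  uint64_t key[KVP_PER_PAGE];
} page_t;

/* Hypothetical: the already-shifted hash of a key (identity here). */
static uint64_t
page_bits (uint64_t key)
{
  return key;
}

/* Returns 0 on success, -1 if a target page has no free slot left. */
static int
split_and_rehash_sketch (const page_t *old_pages, uint32_t old_log2,
                         page_t *new_pages, uint32_t new_log2)
{
  uint32_t n_old = KVP_PER_PAGE << old_log2;
  uint32_t i, j;

  assert (new_log2 < 28);
  /* Start with every new slot marked free (all bits set). */
  memset (new_pages, 0xff, sizeof (page_t) << new_log2);

  for (i = 0; i < n_old; i++)
    {
      uint64_t k = old_pages[i / KVP_PER_PAGE].key[i % KVP_PER_PAGE];
      page_t *dst;

      if (k == FREE_KEY)
        continue;
      /* The low new_log2 bits pick the destination page. */
      dst = &new_pages[page_bits (k) & ((1u << new_log2) - 1)];
      for (j = 0; j < KVP_PER_PAGE; j++)
        if (dst->key[j] == FREE_KEY)
          {
            dst->key[j] = k;
            break;
          }
      if (j == KVP_PER_PAGE)
        return -1; /* collision: caller retries or goes linear */
    }
  return 0;
}
```

In this sketch, splitting one page holding keys 0, 1, 2, 3 into two pages puts 0 and 2 in page 0 and 1 and 3 in page 1. Five keys that agree in their low two bits cannot be placed into four pages of four slots, so that call returns -1, which corresponds to the case where the caller falls back to linear search.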
In the expansion case, the memory structure for an add to a non-empty bucket is shown in the figure below:
del from a non-empty bucket
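1. Find the corresponding bucket (b0), copy its pages into h->working_copies[thread_index], and point b0 (i.e. bucket[0]->offset) at the working_copy, so that lookups go to the working_copy area while the delete is performed on the area described by h->saved_bucket;
2. Look up the key, by hash or by linear search; once found, delete it (set it to all ones);
3. If h->saved_bucket.refcnt > 1, decrement it by 1 and set b0.as_u64 = h->saved_bucket.as_u64; otherwise free the pages and put them on the freelists.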
int BV (clib_bihash_add_del)
(BVT (clib_bihash) * h, BVT (clib_bihash_kv) * add_v, int is_add)
{
u32 bucket_index;
BVT (clib_bihash_bucket) * b, tmp_b;
BVT (clib_bihash_value) * v, *new_v, *save_new_v, *working_copy;
int rv = 0;
int i, limit;
u64 hash, new_hash;
u32 new_log2_pages, old_log2_pages;
u32 thread_index = os_get_thread_index ();
int mark_bucket_linear;
int resplit_once;
// compute the hash of the kvp
hash = BV (clib_bihash_hash) (add_v);
// locate the bucket
bucket_index = hash & (h->nbuckets - 1);
b = &h->buckets[bucket_index];
hash >>= h->log2_nbuckets;
tmp_b.linear_search = 0;
// take the lock
while (__sync_lock_test_and_set (h->writer_lock, 1))
;
/* First elt in the bucket? */
// empty-bucket case
if (b->offset == 0)
{
if (is_add == 0)
{
rv = -1;
goto unlock;
}
v = BV (value_alloc) (h, 0);
*v->kvp = *add_v;
tmp_b.as_u64 = 0;
tmp_b.offset = BV (clib_bihash_get_offset) (h, v);
tmp_b.refcnt = 1;
b->as_u64 = tmp_b.as_u64;
goto unlock;
}
/* Note: this leaves the cache disabled */
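/* For thread safety, back up the bucket: copy it into working_copy,
* point the original bucket at that working_copy, and let
* h->saved_bucket refer to the original bucket's memory
* PS: I could not see what the per-thread working_copy is for; explanations welcome
*/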
BV (make_working_copy) (h, b);
// find the pages that belong to this bucket
v = BV (clib_bihash_get_value) (h, h->saved_bucket.offset);
// set the loop range: hashed buckets search only the selected page, linear buckets search the whole bucket
limit = BIHASH_KVP_PER_PAGE;
v += (b->linear_search == 0) ? hash & ((1 << b->log2_pages) - 1) : 0;
if (b->linear_search)
limit <<= b->log2_pages;
if (is_add)
{
/*
* For obvious (in hindsight) reasons, see if we're supposed to
* replace an existing key, then look for an empty slot.
*/
for (i = 0; i < limit; i++)
{ // same key: update the existing value and finish
if (!memcmp (&(v->kvp[i]), &add_v->key, sizeof (add_v->key)))
{
clib_memcpy (&(v->kvp[i]), add_v, sizeof (*add_v));
CLIB_MEMORY_BARRIER ();
/* Restore the previous (k,v) pairs */
b->as_u64 = h->saved_bucket.as_u64;
goto unlock;
}
}
for (i = 0; i < limit; i++)
{ // find the first empty slot (in bihash, all ones means empty)
if (BV (clib_bihash_is_free) (&(v->kvp[i])))
{
clib_memcpy (&(v->kvp[i]), add_v, sizeof (*add_v));
CLIB_MEMORY_BARRIER ();
b->as_u64 = h->saved_bucket.as_u64;
b->refcnt++;
goto unlock;
}
}
/* no room at the inn... split case... */
// no empty slot, so the bucket has to be expanded
}
else
// del from a non-empty bucket
{
for (i = 0; i < limit; i++)
{ // find the key, then delete it
if (!memcmp (&(v->kvp[i]), &add_v->key, sizeof (add_v->key)))
{
memset (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
CLIB_MEMORY_BARRIER ();
if (PREDICT_TRUE (h->saved_bucket.refcnt > 1))
{
h->saved_bucket.refcnt -= 1;
b->as_u64 = h->saved_bucket.as_u64;
goto unlock;
}
else
{ // the bucket is now empty: release it, i.e. put it on the freelist
tmp_b.as_u64 = 0;
goto free_old_bucket;
}
}
}
// not found
rv = -3;
b->as_u64 = h->saved_bucket.as_u64;
goto unlock;
}
// start the expansion
old_log2_pages = h->saved_bucket.log2_pages;
new_log2_pages = old_log2_pages + 1;
mark_bucket_linear = 0;
working_copy = h->working_copies[thread_index];
resplit_once = 0;
// try hash expansion first; fall back to linear expansion if it fails
new_v = BV (split_and_rehash) (h, working_copy, old_log2_pages,
new_log2_pages);
if (new_v == 0)
{
try_resplit:
resplit_once = 1;
new_log2_pages++;
/* Try re-splitting. If that fails, fall back to linear search */
new_v = BV (split_and_rehash) (h, working_copy, old_log2_pages,
new_log2_pages);
if (new_v == 0)
{
mark_linear:
new_log2_pages--;
/* pinned collisions, use linear search */
new_v =
BV (split_and_rehash_linear) (h, working_copy, old_log2_pages,
new_log2_pages);
mark_bucket_linear = 1;
}
}
/* Try to add the new entry */
save_new_v = new_v;
new_hash = BV (clib_bihash_hash) (add_v);
limit = BIHASH_KVP_PER_PAGE;
if (mark_bucket_linear)
limit <<= new_log2_pages;
new_hash >>= h->log2_nbuckets;
new_hash &= (1 << new_log2_pages) - 1;
new_v += mark_bucket_linear ? 0 : new_hash;
for (i = 0; i < limit; i++)
{
if (BV (clib_bihash_is_free) (&(new_v->kvp[i])))
{
clib_memcpy (&(new_v->kvp[i]), add_v, sizeof (*add_v));
goto expand_ok;
}
}
/* Crap. Try again */
BV (value_free) (h, save_new_v, new_log2_pages);
/*
* If we've already doubled the size of the bucket once,
* fall back to linear search now.
*/
if (resplit_once)
goto mark_linear;
else
goto try_resplit;
expand_ok:
tmp_b.log2_pages = new_log2_pages;
tmp_b.offset = BV (clib_bihash_get_offset) (h, save_new_v);
tmp_b.linear_search = mark_bucket_linear;
tmp_b.refcnt = h->saved_bucket.refcnt + 1;
free_old_bucket:
CLIB_MEMORY_BARRIER ();
b->as_u64 = tmp_b.as_u64;
v = BV (clib_bihash_get_value) (h, h->saved_bucket.offset);
BV (value_free) (h, v, h->saved_bucket.log2_pages);
unlock:
BV (clib_bihash_reset_cache) (b);
BV (clib_bihash_unlock_bucket) (b);
CLIB_MEMORY_BARRIER ();
h->writer_lock[0] = 0;
return rv;
}
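The del path above does not compact the page; it overwrites the slot with 0xff bytes, and the add path later recognises such a slot as empty. A minimal sketch of that convention for the 8_8 flavor follows. The names `kv_mark_free` and `kv_is_free` are hypothetical, and the check assumes a slot counts as free when both key and value are all ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
  uint64_t key;
  uint64_t value;
} kv_t; /* same shape as clib_bihash_kv_8_8_t */

/* Mark a slot empty by setting every byte to 0xff, as the del path does. */
static void
kv_mark_free (kv_t *kv)
{
  assert (kv != NULL);
  memset (kv, 0xff, sizeof (*kv));
}

/* Assumed convention: a slot is empty when key and value are all ones. */
static int
kv_is_free (const kv_t *kv)
{
  return kv->key == UINT64_MAX && kv->value == UINT64_MAX;
}
```

One consequence of this convention is that an entry whose key and value are both all ones cannot be told apart from an empty slot.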
search process
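1. Hash the key;
2. Use the hash value to find the corresponding bucket; if the cache is supported, look in bucket[k]->cache and return 0 on a hit;
3. Search in the pages of b0 (new_pages), using hash or linear search depending on bucket[0]->linear_search; if found, update bucket[0]->cache and rotate bucket[0]->cache_lru, otherwise return -1.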
int BV (clib_bihash_search)
(BVT (clib_bihash) * h,
BVT (clib_bihash_kv) * search_key, BVT (clib_bihash_kv) * valuep)
{
u64 hash;
u32 bucket_index;
BVT (clib_bihash_value) * v;
#if BIHASH_KVP_CACHE_SIZE > 0
BVT (clib_bihash_kv) * kvp;
#endif
BVT (clib_bihash_bucket) * b;
int i, limit;
ASSERT (valuep);
// compute the hash value
hash = BV (clib_bihash_hash) (search_key);
// locate the bucket
bucket_index = hash & (h->nbuckets - 1);
b = &h->buckets[bucket_index];
if (b->offset == 0)
return -1;
// if the cache is enabled, search the cache first
#if BIHASH_KVP_CACHE_SIZE > 0
/* Check the cache, if currently enabled */
if (PREDICT_TRUE ((b->cache_lru & (1 << 15)) == 0))
{
limit = BIHASH_KVP_CACHE_SIZE;
kvp = b->cache;
for (i = 0; i < limit; i++)
{
if (BV (clib_bihash_key_compare) (kvp[i].key, search_key->key))
{
*valuep = kvp[i];
h->cache_hits++;
return 0;
}
}
}
#endif
hash >>= h->log2_nbuckets;
// find the pages that belong to this bucket
v = BV (clib_bihash_get_value) (h, b->offset);
// set the search range
limit = BIHASH_KVP_PER_PAGE;
// hashed buckets start at the selected page; otherwise start from the beginning
v += (b->linear_search == 0) ? hash & ((1 << b->log2_pages) - 1) : 0;
if (PREDICT_FALSE (b->linear_search))
limit <<= b->log2_pages;
// search loop
for (i = 0; i < limit; i++)
{
if (BV (clib_bihash_key_compare) (v->kvp[i].key, search_key->key))
{
*valuep = v->kvp[i];
// update the cache
#if BIHASH_KVP_CACHE_SIZE > 0
u8 cache_slot;
/* Shut off the cache */
if (BV (clib_bihash_lock_bucket) (b))
{
cache_slot = BV (clib_bihash_get_lru) (b);
b->cache[cache_slot] = v->kvp[i];
BV (clib_bihash_update_lru) (b, cache_slot);
/* Reenable the cache */
BV (clib_bihash_unlock_bucket) (b);
h->cache_misses++;
}
#endif
return 0;
}
}
return -1;
}
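Leaving the cache aside, the core of the search is the choice between scanning one page and scanning the whole bucket. The sketch below isolates that choice; `bucket_lookup` is a hypothetical helper that takes the bucket's slots as one flat array, and `hash` is assumed to have the bucket bits already shifted out.

```c
#include <assert.h>
#include <stdint.h>

#define KVP_PER_PAGE 4 /* stands in for BIHASH_KVP_PER_PAGE */

typedef struct
{
  uint64_t key;
  uint64_t value;
} kv_t;

/*
 * slots holds (KVP_PER_PAGE << log2_pages) entries, page after page.
 * Returns 0 and fills *value_out on a hit, -1 on a miss.
 */
static int
bucket_lookup (const kv_t *slots, uint32_t log2_pages, int linear_search,
               uint64_t hash, uint64_t key, uint64_t *value_out)
{
  uint32_t limit = KVP_PER_PAGE;
  uint32_t i;

  assert (log2_pages < 28);
  if (linear_search)
    limit <<= log2_pages; /* scan every slot of every page */
  else
    slots += (hash & ((1u << log2_pages) - 1)) * KVP_PER_PAGE; /* one page */

  for (i = 0; i < limit; i++)
    if (slots[i].key == key)
      {
        *value_out = slots[i].value;
        return 0;
      }
  return -1;
}
```

A hashed bucket therefore compares at most KVP_PER_PAGE keys per lookup, while a linear bucket may compare every slot it owns.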
free process
Release the mheap: mheap_free (h->mheap);
Clear the structure: memset (h, 0, sizeof (*h)).
void BV (clib_bihash_free) (BVT (clib_bihash) * h)
{ // release the memory
mheap_free (h->mheap);
// clear the bihash structure
memset (h, 0, sizeof (*h));
}
That's all.