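Learning Data Structures -- Hash (3) -- Open Addressing

1. Review

Last time we covered separate chaining (open hashing) as a way to resolve hash collisions. Its strengths are that the idea is simple and the implementation is easy. This time we look at another collision-resolution scheme, known as closed hashing or Open Addressing.

You may feel that "closed hashing" and "Open" contradict each other. They do not: "closed" means that elements never leave the table itself, while "open" means that the slot an element ends up in is not fixed by its hash value. The core idea of Open Addressing below makes this clear.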

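2. The core idea of Open Addressing

The idea behind Open Addressing is very simple. If the slot chosen by the first hash cannot be used, keep shifting to other slots until one that satisfies the condition is found.

That is, we try the positions h0(x), h1(x), h2(x), ... in turn, where hi(x) = (Hash(x) + Function(i)) % tableSize and Function(0) = 0.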
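
To make the formula concrete, here is a minimal sketch of how the i-th probe position could be computed. The names probePosition, linearOffset and quadraticOffset are made up for this illustration and do not appear in the implementation later in this article.

```cpp
#include <cstddef>

// Two candidate offset functions Function(i); both satisfy Function(0) == 0.
inline std::size_t linearOffset(std::size_t i)    { return i; }
inline std::size_t quadraticOffset(std::size_t i) { return i * i; }

// i-th probe position: h_i(x) = (Hash(x) + Function(i)) % tableSize.
// hashValue stands for Hash(x); offset is the chosen Function.
inline std::size_t probePosition(std::size_t hashValue, std::size_t i,
                                 std::size_t tableSize,
                                 std::size_t (*offset)(std::size_t))
{
    // Reduce both terms first so the sum cannot overflow for large inputs.
    return (hashValue % tableSize + offset(i) % tableSize) % tableSize;
}
```

With Hash(x) = 8 and tableSize = 7, the third probe lands on slot 4 with linearOffset and on slot 3 with quadraticOffset.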

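2.1 Linear Open Addressing

As the name suggests, Function(i) = i. In other words, if the first hash position cannot be used, we scan forward one slot at a time until we find one that can.

Advantage: the idea is simple, and as long as the table is not full a usable slot is always found.

Disadvantage: it is prone to primary clustering. Put simply, inserted elements tend to pile up in one region, so every key whose first hash lands in that region has to walk through it sequentially. The standard (and fairly involved) analysis shows that at a load factor (introduced in the previous chapter) of 0.5, an insertion, which costs the same as an unsuccessful search, needs about 2.5 probes on average, and a successful search needs about 1.5 probes on average. Keeping the load factor below 0.5 therefore gives quite acceptable running time.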
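
The 2.5 and 1.5 figures come from the classical estimates for linear probing, 1/2 * (1 + 1/(1 - λ)^2) for an insertion or unsuccessful search and 1/2 * (1 + 1/(1 - λ)) for a successful search, where λ is the load factor. The sketch below simply evaluates them; the function names are my own, and the formulas are approximations that assume a large table and a well-behaved hash function.

```cpp
// Expected number of probes under linear probing at load factor lambda,
// valid for 0 <= lambda < 1. These are the classical approximations,
// not measurements of the implementation in this article.
inline double expectedProbesInsert(double lambda)
{
    double freeFraction = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (freeFraction * freeFraction));
}

inline double expectedProbesHit(double lambda)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
```

At λ = 0.5 these give 2.5 and 1.5. At λ = 0.9 they grow to roughly 50.5 and 5.5, which is why the load factor is kept low.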

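2.2 Quadratic Open Addressing

As the name suggests, Function(i) = i^2. A short calculation gives h(i+1)(x) = (hi(x) + 2i + 1) % tableSize, so each probe can be derived from the previous one without a multiplication. In addition, an insertion is only guaranteed to succeed when the load factor is below 0.5 and the table size is prime (this can be proved; the proof is omitted here).

Advantage: no primary clustering.

Disadvantage: although the quadratic method avoids primary clustering, it suffers from secondary clustering. Keys whose first hash lands on the same slot follow exactly the same probe sequence afterwards, so the same slots are examined again and again.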
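
Because (i)^2 - (i-1)^2 = 2i - 1, the quadratic probe sequence can be generated incrementally. The sketch below does this; quadraticProbes is a hypothetical helper written for this illustration, not part of the implementation that follows.

```cpp
#include <cstddef>
#include <vector>

// First `count` probe positions of quadratic probing, computed step by step:
// h_0 = start % tableSize, h_i = (h_{i-1} + 2i - 1) % tableSize.
std::vector<std::size_t> quadraticProbes(std::size_t start,
                                         std::size_t tableSize,
                                         std::size_t count)
{
    std::vector<std::size_t> probes;
    std::size_t pos = start % tableSize;
    for (std::size_t i = 0; i < count; ++i) {
        if (i > 0)
            pos = (pos + (2 * i - 1) % tableSize) % tableSize;
        probes.push_back(pos);
    }
    return probes;
}
```

For start = 3 and tableSize = 7 the first four probes are 3, 4, 0, 5, all distinct, and the fifth probe revisits slot 5. Only about half of the slots are reachable from a given start, which is where the load factor bound of 0.5 comes from.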

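3. Lazy deletion

When we need to delete a value, we cannot simply clear its slot. A moment's thought shows why: values stored after this slot may have got there by probing past it, and if the slot is emptied, a search for those values stops early and never reaches them.

So we use a technique called lazy deletion. The idea is simple: give every slot a state, one of occupied (legitimate), empty, or deleted. Initially every slot is empty. Inserting a value sets the state to occupied, and deleting it sets the state to deleted. When searching, as long as the slot on the probe path is not empty and its value differs from the one we want, we keep probing, and we stop only at an empty slot or at a slot holding the same value. (If that sounds convoluted, see the code below.)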
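
Before the full implementation, here is a deliberately tiny sketch of lazy deletion on its own. It assumes non-negative int keys, Hash(x) = x % size, linear probing, and a table that always keeps at least one empty slot. LazySet, Slot and SlotState are names invented for this illustration.

```cpp
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Occupied, Deleted };

struct Slot {
    int value = 0;
    SlotState state = SlotState::Empty;
};

// Minimal linear-probing set; no resizing, so keep it well below full.
class LazySet {
public:
    explicit LazySet(std::size_t size) : slots_(size) {}

    bool insert(int x) {
        std::size_t pos = locate(x);
        if (slots_[pos].state == SlotState::Occupied) return false; // already present
        slots_[pos].value = x;
        slots_[pos].state = SlotState::Occupied;
        return true;
    }

    bool contains(int x) const {
        return slots_[locate(x)].state == SlotState::Occupied;
    }

    bool remove(int x) {
        std::size_t pos = locate(x);
        if (slots_[pos].state != SlotState::Occupied) return false;
        slots_[pos].state = SlotState::Deleted; // mark only, keep the probe path intact
        return true;
    }

private:
    // Probe until an Empty slot or a slot holding x (Occupied or Deleted).
    std::size_t locate(int x) const {
        std::size_t pos = static_cast<std::size_t>(x) % slots_.size();
        while (slots_[pos].state != SlotState::Empty && slots_[pos].value != x)
            pos = (pos + 1) % slots_.size();
        return pos;
    }

    std::vector<Slot> slots_;
};
```

With a table of 7 slots, inserting 1, 8 and 15 puts them in slots 1, 2 and 3. After remove(8), contains(15) still succeeds because the search walks over the deleted slot instead of stopping at it.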

4. Implementing Open Addressing

4.1 Basic data structures

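enum Kind  {LEGITIMATE,EMPTY,DELETED}; //slot state: occupied, never used, lazily deleted
struct HashNode{
	ElementType elementValue; //the stored value
	enum Kind kind; //state of this slot
};
struct HashTbl{
	int tableSize; //number of slots in table
	int content; //number of LEGITIMATE slots
	HashNode* table; //array of tableSize slots
};
HashTbl* hashTable;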

4.2 Initialization

template<class ElementType>
void HashTable<ElementType>::initialize(HashTbl*& newHashTable, int minSize)
{
	int tableSize=nextPrime(minSize);//find the next prime larger than minSize

	try{
		newHashTable=new HashTbl;
	}catch(std::bad_alloc&){
		errorDisplay("new memory failed!",__FILE__,__FUNCTION__,__LINE__);//report an error if new fails
	}

	try{
		newHashTable->table=new HashNode[tableSize];
	}catch(std::bad_alloc&){
		errorDisplay("new memory failed!",__FILE__,__FUNCTION__,__LINE__);//report an error if new fails
	}

	for(int i=0;i<tableSize;i++){
		newHashTable->table[i].kind=EMPTY;
	}

	newHashTable->tableSize=tableSize;
	newHashTable->content=0;
}
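
The nextPrime helper is used above but not shown in this article. A minimal sketch of what it might look like follows, assuming it returns the smallest prime strictly greater than its argument (as the comment in initialize suggests) and that table sizes stay well below the int limit. isPrime is a helper I introduce here.

```cpp
// Trial-division primality test; adequate for table sizes that fit in an int.
static bool isPrime(int n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (int d = 3; d <= n / d; d += 2)   // d <= n / d avoids overflow in d * d
        if (n % d == 0) return false;
    return true;
}

// Smallest prime strictly greater than n (assumed behaviour, see above).
int nextPrime(int n)
{
    int candidate = n < 2 ? 2 : n + 1;
    while (!isPrime(candidate))
        ++candidate;
    return candidate;
}
```

For the initial size of 1,000,000 used in the performance test, this version would give a table of 1,000,003 slots.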

4.3 Find

The find function is arguably the key to Open Addressing.

template<class ElementType>
int HashTable<ElementType>::findInner(HashTbl* _hashTable,ElementType& elementValue)
{
	int key=getElementKey(elementValue)%_hashTable->tableSize; //first hash; getElementKey derives an initial key from the input value, see the previous chapter for details
	int hashTimes=0; //number of extra probes made so far
	while(_hashTable->table[key].kind!=EMPTY && _hashTable->table[key].elementValue!=elementValue){
		key=hash2(key,hashTimes)%_hashTable->tableSize; //hash2 applies the Function described above, see below
		hashTimes++; //advance the probe count so the quadratic step keeps growing
	}
	return key;
}

template<class ElementType>
bool HashTable<ElementType>::find(ElementType elementValue)
{
	int pos=findInner(hashTable,elementValue);
	return hashTable->table[pos].kind==LEGITIMATE;
}

template<class ElementType>
int HashTable<ElementType>::hash2(int key,int hashTimes)
{
	switch(OPEN_ADDRESS){ //choose the step according to the Open Addressing method in use
		case LINEAR:
			return key+1; //key is the previous probe position, so move one slot forward
		case QUDRATIC:
			return key+2*(hashTimes+1)-1; //probe number i=hashTimes+1: h(i) = h(i-1) + 2i - 1
		default:
			errorDisplay("OPEN_ADDRESS method error!",__FILE__,__FUNCTION__,__LINE__);
			return -1;
	}
}

4.4 Insertion

template<class ElementType>
bool HashTable<ElementType>::insertInner(HashTbl*& _hashTable, ElementType& elementValue) 
{
	//rehash
	if(loadFactor>MAX_LOAD_FACTOR){ //MAX_LOAD_FACTOR is usually 0.5
		HashTbl* oldTable=_hashTable;
		_hashTable=rehash(oldTable->tableSize); //rehash was introduced in the previous chapter
		delete[] oldTable->table; //release the old table once rehash has copied it
		delete oldTable;
	}
	
	int pos=findInner(_hashTable,elementValue);

	HashNode& hashNode=_hashTable->table[pos];
	if(hashNode.kind==LEGITIMATE) //the value is already present, nothing to insert
		return false;
	else{ //the value is absent, or was deleted earlier
		hashNode.elementValue=elementValue;
		hashNode.kind=LEGITIMATE;
		_hashTable->content++;
		loadFactor=(double)(_hashTable->content)/(double)(_hashTable->tableSize);
		return true;
	}
}

template<class ElementType>
bool HashTable<ElementType>::insert(ElementType elementValue)
{
	return insertInner(hashTable,elementValue);
}

4.5 Remove

template<class ElementType>
bool HashTable<ElementType>::removeInner(HashTbl* _hashTable,ElementType& elementValue)
{
	int pos=findInner(_hashTable,elementValue);
	
	HashNode& hashNode=_hashTable->table[pos];
	if(hashNode.kind==LEGITIMATE){ //the value exists
		hashNode.kind=DELETED;
		_hashTable->content--;
		loadFactor=(double)(_hashTable->content)/(double)(_hashTable->tableSize);
		return true;
	}
	else //the value does not exist, or has already been deleted
		return false;
}

template<class ElementType>
bool HashTable<ElementType>::remove(ElementType elementValue)
{
	return removeInner(hashTable,elementValue);
}

4.6 Growing the hash table: rehash

template<class ElementType>
typename HashTable<ElementType>::HashTbl* HashTable<ElementType>::rehash(int currentSize)
{
	HashTbl* newHashTable;
	initialize(newHashTable,currentSize*10);//grow to a table ten times the old size; I picked this number casually, without any tuning!
	loadFactor=(double)(hashTable->content)/(double)(newHashTable->tableSize);

	for(int i=0;i<hashTable->tableSize;i++){
		if(hashTable->table[i].kind==LEGITIMATE) //copy live values only; EMPTY and DELETED slots hold nothing worth keeping
			insertInner(newHashTable,hashTable->table[i].elementValue);
	}

	return newHashTable;
}
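
One detail of rehashing deserves a closer look: only occupied slots should be carried over, so a rehash is also the moment when lazily deleted slots finally disappear. Here is a stand-alone sketch of that step, assuming non-negative int keys, Hash(x) = x % size and linear probing. Cell, State, place and rehashInto are illustrative names, not part of the class above.

```cpp
#include <cstddef>
#include <vector>

enum class State { Empty, Occupied, Deleted };

struct Cell {
    int value = 0;
    State state = State::Empty;
};

// Put x into cells by linear probing; assumes x >= 0 and a free cell exists.
static void place(std::vector<Cell>& cells, int x)
{
    std::size_t pos = static_cast<std::size_t>(x) % cells.size();
    while (cells[pos].state == State::Occupied)
        pos = (pos + 1) % cells.size();
    cells[pos].value = x;
    cells[pos].state = State::Occupied;
}

// Build a table of newSize cells holding only the live values of old.
std::vector<Cell> rehashInto(const std::vector<Cell>& old, std::size_t newSize)
{
    std::vector<Cell> fresh(newSize);
    for (const Cell& c : old)
        if (c.state == State::Occupied)   // skip Empty and Deleted cells
            place(fresh, c.value);
    return fresh;
}
```

The new table starts with no Deleted cells at all, so probe paths in it are as short as the live data allows.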

5. Performance test

We create a hash table that uses quadratic probing, with an initial size of 1,000,000, then insert 10,000,000 random numbers and measure how long it takes.

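#include <cstdio>  //printf
#include <cstdlib> //rand
#include <ctime>   //clock, CLOCKS_PER_SEC
//the HashTable class template from section 4 and the getElementKey and isEqual callbacks are assumed to be declared above

int main()
{
	HashTable<int> hashTable(1000000, &getElementKey,&isEqual,HashTable<int>::QUDRATIC,0.49);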
	clock_t start=clock();

	for(int i=0;i<10000000;i++){
		int r=rand();
		hashTable.insert(r);
	}

	clock_t finish=clock();
	printf("time is %fs\n",(double)(finish-start)/CLOCKS_PER_SEC);

	return 0;
}

Compiled with clang++ at the -O3 optimization level. Results:

time is 2.239344s
time is 2.059147s
time is 2.318181s

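6. Summary

This time we covered closed hashing (Open Addressing). In my measurements it is faster than open hashing. I believe the main reason is that it avoids memory allocation and deallocation, which are very time-consuming. With this, we have covered the mainstream hashing methods, which is enough for most applications. Collision-resolution schemes and hash function design both have to be analysed case by case. There is no one-size-fits-all solution, and I have no real experience here, so please consult other resources. Finally, the next chapter (probably the last) will cover the hash libraries in the C++ STL and in Python. Stay tuned!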
