Data Structures Study Notes -- Hash (2) -- Separate Chaining

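This series is a set of study notes on Data Structures and Algorithm Analysis in C by Mark Weiss. Whenever I am working through cc150 and need to fill a gap in my knowledge, I pull this book out, study the relevant part, and share what I learned.

If you have any questions or suggestions, please share them with me.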

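1. Recap

Last time we briefly covered:

The concept of a hash table
An array whose elements are of the input data's type.
Pros and cons
Find, insert, and remove take constant time on average, but the table does not keep its elements in any order.
The four key ingredients
The hash table body, the hash table size, the hash function, and the collision resolution scheme.

In practice, the hardest parts of a hash table to design are the hash function and the collision resolution scheme. There is no single recipe for designing a hash function; the common approaches were covered in the first chapter.
In this chapter we look at one widely used collision resolution scheme: separate chaining (also known as open hashing).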

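2. Introduction to Separate Chaining

The idea behind separate chaining is very simple. Originally, each slot of the hash table holds one element of the input data type. Now each slot instead holds a linked list of nodes of that type. When a collision occurs, we simply insert the new value at the head or the tail of the list in that slot.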
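
As a rough illustration of the idea, here is a minimal sketch that is not the implementation developed later in this post. The names ChainedIntSet, slot, contains, and insert are made up for this example, and it leans on std::vector and std::forward_list instead of hand-written nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <vector>

// Hypothetical minimal chained table for int keys (illustration only).
struct ChainedIntSet {
    std::vector<std::forward_list<int>> buckets;  // one chain per slot

    explicit ChainedIntSet(std::size_t bucketCount) : buckets(bucketCount) {
        assert(bucketCount > 0);  // a table needs at least one slot
    }

    // Map a key to a slot with the simple modulo method.
    std::size_t slot(int key) const {
        return static_cast<std::size_t>(key) % buckets.size();
    }

    // Walk the chain in the key's slot.
    bool contains(int key) const {
        for (int stored : buckets[slot(key)])
            if (stored == key) return true;
        return false;
    }

    // Returns false if the key was already present.
    bool insert(int key) {
        if (contains(key)) return false;
        buckets[slot(key)].push_front(key);  // a collision just lengthens the chain
        return true;
    }
};
```

With 7 slots, the keys 3 and 10 land in the same slot and simply share one chain.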

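3. Rehash

Before going into the details of our HashTable implementation, we need one last concept: rehash. As we keep inserting elements, the HashTable gets fuller and fuller, and Find, Insert, and Remove all get slower.
To quantify this, we define the load factor λ as the number of elements stored in the HashTable divided by the table size. An unsuccessful search traverses λ links on average (not counting the final NULL), and a successful search traverses about 1+λ/2 links on average. So the size of the HashTable by itself has little effect on performance; what matters is the load factor. Mark Weiss's general rule is to keep λ as close to 1 as possible.
Therefore, when the load factor becomes large, we enlarge the HashTable to bring the load factor back toward 1. This process is called rehash. On the other hand, rehash is expensive, because we have to walk through every element and insert it again, so we only do it when necessary, that is, when the load factor exceeds some threshold.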
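
The arithmetic above can be written down as a tiny sketch. The function names here are hypothetical helpers for illustration; they are not part of the HashTable class in this post.

```cpp
#include <cassert>

// Load factor: stored elements divided by the number of slots.
double loadFactor(int elementCount, int tableSize) {
    assert(tableSize > 0);
    return static_cast<double>(elementCount) / tableSize;
}

// Average links followed by an unsuccessful search (final NULL not counted).
double expectedLinksUnsuccessful(double lambda) {
    return lambda;
}

// Average links followed by a successful search.
double expectedLinksSuccessful(double lambda) {
    return 1.0 + lambda / 2.0;
}
```

For example, 20 elements in 10 slots give λ = 2, so an unsuccessful search follows about 2 links and a successful one about 1 + 2/2 = 2 links.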

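3. Separate Chaining Implementation

Next, we walk through an implementation of separate chaining in C++. The code was debugged with clang++ on 64-bit Mac OS X; I cannot guarantee it works on other platforms. To download the source files directly, click [here].

3.1 Hash table body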

        struct HashNode{
            ElementType elementValue;
            HashNode* next;
        };
        typedef HashNode* HashList;
        struct HashTbl{
            HashList* table;
            int tableSize;
            int content;
        };
        HashTbl* hashTable;

For some reason, CSDN's Markdown highlighting does not cope well with code that contains long comment blocks, so I put the comment below. It explains what the code above does.

            /* Our hash table is a pointer to a struct called HashTbl.
             * HashTbl consists of an array of HashList (each a linked list of HashNode),
             * a tableSize giving the number of lists in the table,
             * and a content giving the number of elements currently stored.
             * A HashNode is just a linked-list node: the stored value (elementValue) and a pointer to the next node.
             */

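3.2 Initialization

Initialization is simple: we just allocate the memory piece by piece.
Note that I check for NULL after every new. According to a StackOverflow discussion, however, this form of new does not return NULL when allocation fails; it throws a std::bad_alloc exception, so the failure should be handled with try/catch rather than a NULL check. The discussion also recommends using the STL instead of manual memory allocation to manage dynamic data. I am still learning and will improve this later.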
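
As a hedged sketch of the two standard ways to detect a failed allocation (the function names are invented for this example and do not appear in the HashTable code): plain new throws std::bad_alloc, while new (std::nothrow) returns a null pointer that can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Option 1: plain new throws std::bad_alloc on failure, so catch it.
int* allocateWithCatch(std::size_t count) {
    assert(count > 0);
    try {
        return new int[count];
    } catch (const std::bad_alloc&) {
        return nullptr;  // report failure to the caller instead of crashing
    }
}

// Option 2: the nothrow form returns nullptr on failure, so a NULL check is meaningful.
int* allocateNoThrow(std::size_t count) {
    assert(count > 0);
    return new (std::nothrow) int[count];
}
```

Either way, the caller owns the array and must release it with delete[].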

template<class ElementType>
void HashTable<ElementType>::initialize(HashTbl*& newHashTable,int minSize)
{
    newHashTable=new HashTbl;
    if(newHashTable==NULL){
        printf("new operation failed! At file: %s,line: %d\n",__FILE__,__LINE__);
        exit(-1);
    }

    //find the smallest prime greater than minSize, because a prime table size tends to perform best. minSize is the initial table size requested by the user.
    int tableSize=nextPrime(minSize);

    newHashTable->table=new HashList[tableSize];
    if(newHashTable->table==NULL){
        printf("new operation failed! At file: %s,line: %d\n",__FILE__,__LINE__);
        exit(-1);
    }

    for(int i=0;i<tableSize;i++){
        newHashTable->table[i]=new HashNode;
        if(newHashTable->table[i]==NULL){
            printf("new operation failed! At file: %s,line: %d\n",__FILE__,__LINE__);
            exit(-1);
        }
        newHashTable->table[i]->next=NULL;
    }

    newHashTable->tableSize=tableSize;
    newHashTable->content=0;
}

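One point worth noting: the first parameter in the function declaration is HashTbl*&, that is, a reference to a pointer. This way, when we later pass in the actual HashTable pointer, the function can modify that pointer itself. See this article for details.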
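
The difference between passing a pointer by value and by reference can be seen in a small standalone sketch. Table, resetByValue, and resetByReference are hypothetical names used only here.

```cpp
#include <cassert>

struct Table { int size; };

// The pointer is copied: reassigning it only changes the local copy.
void resetByValue(Table* table, Table* replacement) {
    table = replacement;
    (void)table;  // silence unused-value warnings; the caller sees no change
}

// The pointer is passed by reference: the caller's pointer is reassigned.
void resetByReference(Table*& table, Table* replacement) {
    assert(replacement != nullptr);
    table = replacement;
}
```

Calling resetByValue leaves the caller's pointer where it was, while resetByReference makes it point at the replacement. This is the behavior initialize relies on.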

3.3 Hash Function

template<class ElementType>
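int HashTable<ElementType>::hashFunc(HashTbl* _hashTable, ElementType elementValue)
{
    //getElementKey derives an integer key from the input data. For an integer, the key can simply be the value itself; for a string, the key can be, for example, the sum of the ASCII codes of its characters.
    int key=getElementKey(elementValue);
    //the table is a parameter (findInner calls hashFunc(_hashTable,elementValue)) so that a newly created table is indexed by its own size during rehash
    return key%(_hashTable->tableSize);//simple modulo method; assumes a non-negative key
}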
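
To make the string case concrete, here is a small sketch of a key function that sums character codes, plus a modulo step that also copes with negative keys. Both getStringKey and bucketIndex are hypothetical helpers, not functions from the class above.

```cpp
#include <cassert>
#include <string>

// Key for a string: the sum of its character codes.
int getStringKey(const std::string& text) {
    int key = 0;
    for (unsigned char ch : text)
        key += ch;
    return key;
}

// Modulo method that keeps the result in [0, tableSize) even for negative keys.
int bucketIndex(int key, int tableSize) {
    assert(tableSize > 0);
    int index = key % tableSize;
    return index < 0 ? index + tableSize : index;
}
```

Note that summing character codes sends anagrams such as "ab" and "ba" to the same key, so this key function is simple rather than good at spreading keys.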

3.4 Find

template<class ElementType>
typename HashTable<ElementType>::HashNode* HashTable<ElementType>::findInner(HashTbl* _hashTable, ElementType elementValue)
{
    int position=hashFunc(_hashTable,elementValue);
    HashNode* hashNode=_hashTable->table[position];
    while(hashNode->next!=NULL && !isEqual(hashNode->next->elementValue,elementValue)){
        hashNode=hashNode->next;
    }
    return hashNode;
}

template<class ElementType>
bool HashTable<ElementType>::find(ElementType elementValue)
{
    HashNode* hashNode = findInner(hashTable,elementValue);
    return hashNode->next != NULL;
}

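Overall, this function is easy to follow (provided you understand class templates, haha; if not, see: class template basics, and structs inside class templates). Note that findInner returns the node before the match, or the last node of the list if there is no match. This is why each list starts with a dummy header node.
Why do we need both findInner and find? find is for the user, who only has to pass an elementValue to learn whether the search succeeded. findInner is for the HashTable's internal use.
Also, the isEqual function here is supplied by the user, because different data types have different notions of equality (an int can be compared directly with ==, while a C string needs strcmp).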
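
A minimal sketch of two such user-supplied equality functions follows. The names isEqualInt and isEqualCString are illustrative and are not part of the original code.

```cpp
#include <cassert>
#include <cstring>

// Integers compare directly.
bool isEqualInt(int a, int b) {
    return a == b;
}

// C strings must be compared by content; == would only compare the addresses.
bool isEqualCString(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    return std::strcmp(a, b) == 0;
}
```

Two separate buffers holding the same text have different addresses but compare equal under isEqualCString.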

3.5 Insert

template<class ElementType>
bool HashTable<ElementType>::insertInner(HashTbl*& _hashTable, ElementType elementValue)
{
    //rehash
    if(_hashTable->content>_hashTable->tableSize*10)
    {
        _hashTable=rehash();
    }

    HashNode* insertNode=findInner(_hashTable,elementValue);
    if(insertNode->next==NULL){
        insertNode->next=new HashNode;
        if(insertNode->next==NULL){
            printf("new operation failed! At file: %s,line: %d\n",__FILE__,__LINE__);
            exit(-1);
        }

        insertNode->next->elementValue=elementValue;
        insertNode->next->next=NULL;
        _hashTable->content++;
        return true;
    }
    return false;
}

template<class ElementType>
bool HashTable<ElementType>::insert(ElementType elementValue)
{
    return insertInner(hashTable,elementValue);
}

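Again, this code is easy to follow: insertInner returns false if the element is already present. The rehash function is shown further below.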

3.6 Remove

template<class ElementType>
bool HashTable<ElementType>::removeInner(HashTbl* _hashTable,ElementType elementValue)
{
    HashNode* removeNode=findInner(_hashTable,elementValue);
    if(removeNode->next !=NULL){
        HashNode* toBeDelete=removeNode->next;
        removeNode->next=removeNode->next->next;
        delete toBeDelete;
        _hashTable->content--;
        return true;
    }
    return false;
}

template<class ElementType>
bool HashTable<ElementType>::remove(ElementType elementValue)
{
    return removeInner(hashTable,elementValue);
}

This one is also fairly straightforward.

3.7 rehash

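template<class ElementType>
typename HashTable<ElementType>::HashTbl* HashTable<ElementType>::rehash()
{
    HashTbl* newTable;
    initialize(newTable,hashTable->tableSize*10);

    for(int i=0;i<hashTable->tableSize;i++){
        //re-insert every element of the old list, freeing each old node as we go
        HashNode* hashNode=hashTable->table[i]->next;
        while(hashNode){
            insertInner(newTable,hashNode->elementValue);
            HashNode* next=hashNode->next;
            delete hashNode;
            hashNode=next;
        }
        delete hashTable->table[i];//free the dummy header node of this list
    }
    delete[] hashTable->table;//free the old array of lists, otherwise it leaks
    delete hashTable;
    return newTable;
}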

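The rehash function is very simple: it creates a new HashTable roughly ten times the size of the old one (initialize rounds the size up to the next prime) and re-inserts every HashNode from the old table into the new one.
Here I rehash once the load factor exceeds 10 and enlarge the HashTable to about 10 times its previous size. The value 10 is an arbitrary choice of mine and has not been tuned for performance.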
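
For comparison, the standard library's std::unordered_set exposes the same notions through bucket_count, load_factor, and max_load_factor. The sketch below is an illustration rather than part of this post's HashTable, and countRehashes is a made-up helper. It counts how many times the bucket array grows while inserting; the exact growth policy depends on the standard library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Inserts 0..count-1 and returns how many times the bucket array changed size.
std::size_t countRehashes(int count) {
    assert(count >= 0);
    std::unordered_set<int> values;
    std::size_t rehashes = 0;
    std::size_t buckets = values.bucket_count();
    for (int i = 0; i < count; ++i) {
        values.insert(i);
        if (values.bucket_count() != buckets) {  // the container rehashed itself
            buckets = values.bucket_count();
            ++rehashes;
        }
    }
    // The container keeps its load factor at or below max_load_factor().
    assert(values.load_factor() <= values.max_load_factor());
    return rehashes;
}
```

The default max_load_factor is 1.0, which matches the general rule of keeping λ close to 1.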

3.8 nextPrime


template<class ElementType>
bool HashTable<ElementType>::isPrime(int num)
{
    bool result;
    if(num<2)
        result=false;//0, 1 and negative numbers are not prime
    else if(num==2)
        result=true;
    else if(num/2*2 == num)
        result=false;
    else{
        int sqrtNum=sqrt(num);//requires <cmath>
        result=true;
        //the bound must be inclusive, otherwise squares of primes such as 9 or 25 are reported as prime
        for(int i=3;i<=sqrtNum;i+=2){
            if(num/i*i==num){
                result=false;
                break;
            }
        }
    }
    return result;
}

template<class ElementType>
int HashTable<ElementType>::nextPrime(int num)
{
    //returns the smallest prime strictly greater than num
    int result=num+1;
    while(!isPrime(result))
        result++;
    return result;
}

These two functions find the next prime number.
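
As an alternative sketch, the primality test can be written with integer arithmetic only, which avoids sqrt and any floating-point rounding concerns. The names isPrimeSketch and nextPrimeSketch are hypothetical; this is a variant for illustration, not the code used by the HashTable class.

```cpp
#include <cassert>

// Trial division by odd numbers; i <= num / i is the integer form of i*i <= num.
bool isPrimeSketch(int num) {
    if (num < 2) return false;          // 0, 1 and negatives are not prime
    if (num % 2 == 0) return num == 2;  // 2 is the only even prime
    for (int i = 3; i <= num / i; i += 2)
        if (num % i == 0) return false;
    return true;
}

// Smallest prime strictly greater than num.
int nextPrimeSketch(int num) {
    assert(num >= 0);
    int candidate = num + 1;
    while (!isPrimeSketch(candidate))
        ++candidate;
    return candidate;
}
```

For the table size of 100000 used in the tests below, nextPrimeSketch returns 100003.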

4. Testing the HashTable

We first define the input data type, the getElementKey function, and the isEqual function. Here we test with the simplest possible type, int.

typedef int ElementType;

int getElementKey(ElementType elementValue){
    return elementValue;
}

bool isEqual(ElementType elementValue1,ElementType elementValue2){
    return elementValue1==elementValue2;
}

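4.1 Correctness test

To check that our HashTable is correct, we first loop over an index, inserting each index and then searching for it, and make sure it is found every time. Then, in a second loop, we remove each index and search for it, and make sure it is never found.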

bool testHashCorrectness(HashTable<ElementType> &hashTable)
{
    int checkTotal=10000000;
    for(int i=0;i<checkTotal;i++){
        hashTable.insert(i);
        if(!hashTable.find(i))
            return false;
    }
    for(int i=0;i<checkTotal;i++){
        hashTable.remove(i);
        if(hashTable.find(i))
            return false;
    }
    return true;
}

int main()
{
    HashTable<int> hashTable(100000, &getElementKey,&isEqual);
    printf("check correctness: %d\n",testHashCorrectness(hashTable));
    return 0;
}

The result is:

check correctness: 1

4.2 Performance test

Finally, we compare the performance with and without the rehash function.

int main()
{
    HashTable<int> hashTable(100000, &getElementKey,&isEqual);

    clock_t start=clock();
    for(int i=0;i<10000000;i++){
        int r=rand();
        hashTable.insert(r);
    }
    clock_t finish=clock();
    printf("time is %fs\n",(double)(finish-start)/CLOCKS_PER_SEC);

    return 0;
}

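Without the rehash function, the result is:

time is 47.611208s

With the rehash function enabled, the result is:

time is 8.676079s

I expect that fine-tuning the load factor threshold would improve performance further.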

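5. Summary

In this chapter we covered separate chaining as a way to resolve collisions in a HashTable. Its core idea is to replace each element of the simplest possible HashTable with a linked list.
We also introduced the rehash function and the concept of the load factor.
The advantage of separate chaining is that it is very simple to implement and handles collisions very well.
The drawback of separate chaining is that every insertion has to allocate a node with new. Dynamic allocation may involve calls into the operating system, so it is relatively slow.
How can we get around this? Stay tuned for the next post, "Data Structures Study Notes -- Hash (3) -- Open Addressing"!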
