Common String Matching Algorithms

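String matching is a problem that comes up often in practical engineering. The input consists of a text string (String) and a substring to look for (also called the pattern, Pattern), and the output is the position of the first occurrence of the pattern in the text. Common exact string-search algorithms include brute-force search, KMP, BM (Boyer-Moore), Sunday, Rabin-Karp and Bitap. Each is analyzed below together with an implementation. Throughout, N denotes the length of the text and M the length of the pattern. Every implementation returns the 0-based offset of the first match, or -1 if there is none.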

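1. Brute force

This is the brute-force search, the first approach that comes to mind.

Preprocessing time: none

Matching time complexity: O(N*M)

Main idea: start matching at the beginning of the text; on a mismatch, restart the comparison one position to the right of the previous starting position.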

#include <string.h>

/*
 * ===  FUNCTION  ======================================================================
 *         Name:  bf
 *  Description:  brute-force method for string match problem.
 * =====================================================================================
 */
int bf(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    int find_len = (int)strlen(find);
    int text_len = (int)strlen(text);
    if (find_len == 0)
        return 0;               //an empty pattern matches at offset 0
    if (text_len < find_len)
        return -1;
    const char *s = text;       //start of the current attempt
    const char *p = s;          //current position in the text
    const char *q = find;       //current position in the pattern
    while (*p != '\0')
    {
        if (*p == *q)
        {
            p++;
            q++;
        }
        else
        {
            //mismatch: restart one position to the right
            s++;
            p = s;
            q = find;
        }
        if (*q == '\0')
        {
            return (int)((p - text) - (q - find));
        }
    }
    return -1;
}

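2. KMP

KMP (Knuth-Morris-Pratt) is a classic string matching algorithm.

Preprocessing time: O(M)

Matching time complexity: O(N)

Main idea: preprocess the pattern so that, on a mismatch, the text pointer never has to move backwards.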

#include <string.h>

/*
 * ===  FUNCTION  ======================================================================
 *         Name:  kmp
 *  Description:  kmp method for string match.
 * =====================================================================================
 */
/*
 * examples of preprocessing for pattern
 * map[i] is the length of the longest proper prefix of find[0..i-1]
 * that is also a suffix of it (map[0] is never read and is set to 0)
 * pattern_1:
 * a b c a b c a
 * 0 0 0 0 1 2 3
 * pattern_2:
 * a a a a b a a
 * 0 0 1 2 3 0 1
 */
int kmp(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    int find_len = (int)strlen(find);
    int text_len = (int)strlen(text);
    if (find_len == 0)
        return 0;
    if (text_len < find_len)
        return -1;
    int map[find_len];          //C99 variable-length array
    memset(map, 0, find_len*sizeof(int));
    //build the kmp base array: map (map[0] and map[1] stay 0)
    int i = 2;
    int j = 0;
    for (i=2; i<find_len; i++)
    {
        //here j == map[i-1]; fall back until find[i-1] extends the border
        while (j > 0 && find[i-1] != find[j])
            j = map[j];
        if (find[i-1] == find[j])
            j++;
        map[i] = j;
    }
    i = 0;
    j = 0;
    while (i < text_len)
    {
        if (text[i] == find[j])
        {
            i++;
            j++;
            if (j == find_len)
                return i-j;
        }
        else if (j == 0)
        {
            i++;                //nothing matched: move on in the text
        }
        else
        {
            j = map[j];         //reuse the matched border, keep i in place
        }
    }
    return -1;
}

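Note: at first glance the preprocessing looks like it costs O(M^2), because a loop is nested inside the for loop, yet it is linear. In complexity analysis, counting scattered operations whose number per iteration is irregular by tracking how a variable changes is called amortized analysis. Here we track the value of j. Every execution of the fallback step in the preprocessing loop (j = map[j]) decreases j, and j can never become negative. The only other place that changes j is the single j++, which runs at most once per iteration of the for loop, so over the whole run j is incremented at most M-1 times. Consequently j can be decreased at most M-1 times, which means the fallback step runs at most M-1 times in total. Spread over the iterations of the for loop, each iteration costs O(1) amortized, and the whole preprocessing is O(M). For a more detailed analysis of KMP, see Matrix67's article "KMP算法詳解".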
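The amortized bound can be checked empirically. The following is a minimal sketch, assuming the table is built as a border-length table with the fallback step j = map[j]; the function name count_fallbacks is hypothetical and exists only for this illustration. It builds the table for a pattern and returns how often the fallback step ran, which should never exceed M-1.

```c
#include <stdlib.h>
#include <string.h>

/* Builds the KMP table for `find` and returns the number of times the
 * fallback step (j = map[j]) was executed, or -1 if allocation fails.
 * Hypothetical helper, used only to illustrate the amortized argument. */
int count_fallbacks(const char *find)
{
    size_t m = strlen(find);
    if (m < 2)
        return 0;                      /* no loop iterations at all */
    int *map = calloc(m, sizeof *map); /* map[0] and map[1] stay 0 */
    if (map == NULL)
        return -1;
    int fallbacks = 0;
    int j = 0;
    for (size_t i = 2; i < m; i++) {
        while (j > 0 && find[i - 1] != find[j]) {
            j = map[j];                /* j strictly decreases here */
            fallbacks++;
        }
        if (find[i - 1] == find[j])
            j++;                       /* the only place j grows */
        map[i] = j;
    }
    free(map);
    return fallbacks;
}
```

For the pattern "aaaabaa" (M = 7) the fallback step runs 3 times, all of them while processing the 'b', which is well within the bound of M-1 = 6.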

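3. Boyer-Moore

Boyer-Moore is another classic among string matching algorithms; see the paper "A Fast String Searching Algorithm".

Preprocessing time: O(N + M^2)

Matching time complexity: O(N)

Main idea: preprocess the pattern so that, when a match attempt fails, more characters can be skipped.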

#include <limits.h>
#include <string.h>

/*
 * ===  FUNCTION  ======================================================================
 *         Name:  bm
 *  Description:  Boyer-Moore method for string match.
 * =====================================================================================
 */
int bm(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    int i, j, k;
    int text_len = (int)strlen(text);
    int find_len = (int)strlen(find);
    if (find_len == 0)
        return 0;
    if (text_len < find_len)
        return -1;
    //bad character table: distance from the rightmost occurrence of each
    //character to the end of the pattern (find_len if it does not occur)
    int delta_1[UCHAR_MAX + 1];
    for (i=0; i<=UCHAR_MAX; i++)
        delta_1[i] = find_len;
    for (i=0; i<find_len; i++)
        delta_1[(unsigned char)find[i]] = find_len - i - 1;
    //rpr[i]: start of the rightmost plausible reoccurrence of find[i+1..]
    int rpr[find_len];
    rpr[find_len-1] = find_len - 1;
    for (i=find_len-2; i>=0; i--)
    {
        int len = (find_len - 1) - i;
        //find the reoccurrence of the right most (len) chars
        for (j=find_len-2; j>=(len-1); j--)
        {
            if (strncmp(find+i+1, find+j-len+1, len) == 0)
            {
                //accept only if the preceding char differs or does not exist
                if ((j-len) == -1 || find[i] != find[j-len])
                {
                    rpr[i] = j - len + 1;
                    break;
                }
            }
        }
        //if the right most (len) chars do not completely reoccur, look for
        //the longest part of them that is a prefix of the pattern: at every
        //step we try the right most (len-k) chars.
        for (k=1; j<(len-1) && k<len; k++)
        {
            if (strncmp(find+i+k, find, len-k) == 0)
            {
                rpr[i] = 0 - k;
                break;
            }
        }
        if (j<(len-1) && k == len)
        {
            rpr[i] = 0 - len;
        }
    }
    //good suffix table
    int delta_2[find_len];
    for (i=0; i<find_len; i++)
        delta_2[i] = find_len - rpr[i];
    //match from right to left; i indexes the text, j the pattern
    i = find_len - 1;
    j = find_len - 1;
    while (i < text_len)
    {
        if (text[i] == find[j])
        {
            i--;
            j--;
        }
        else
        {
            //take the larger of the two shifts
            if (delta_1[(unsigned char)text[i]] > delta_2[j])
            {
                i += delta_1[(unsigned char)text[i]];
            }
            else
            {
                i += delta_2[j];
            }
            j = find_len - 1;
        }
        if (j == -1)
            return i+1;
    }
    return -1;
}

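Hint: the algorithm shifts the pattern using two rules, the bad character rule and the good suffix rule. The bad character rule takes the text character that caused the mismatch and uses its position in the pattern, counted from the right end. The good suffix rule takes the part of the pattern to the right of the mismatch position j (all of these characters have already matched) and finds the position k at which that suffix occurs again in the pattern, that is, s[k .. k+len-j-1] = s[j+1 .. len], with the additional condition s[k-1] != s[j] or s[k-1] = $. Here positions are 1-based, len is the length of the pattern, and $ is a padding character to the left of the pattern that is considered equal to any character.

For example, for the pattern ABCXXXABC (positions 1 to 9, with $ padding at position 0 and below):

position:  -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9
pattern:                   A  B  C  X  X  X  A  B  C

j    suffix after j    k     reoccurrence / remark
9    (empty)            9    by convention the value is the position itself
8    C                  0    $ ; C also occurs at 3, but [2] = [j], so that occurrence is rejected
7    BC                -1    $$ ; BC also occurs starting at 2, but [1] = [j], so it is rejected
6    ABC                1    ABC
5    XABC               0    $ABC
4    XXABC             -1    $$ABC
3    XXXABC            -2    $$$ABC
2    CXXXABC           -3    $$$$ABC
1    BCXXXABC          -4    $$$$$ABC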
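The table can be reproduced with a direct, unoptimized search for the reoccurrence position. The following is a minimal sketch, assuming 0-based positions (so every value is one less than in the 1-based table) and treating all positions left of the pattern as the wildcard $; the function name compute_rpr is hypothetical and not part of the original code.

```c
#include <string.h>

/* For every position i of `pat`, stores in rpr[i] the largest start k <= i
 * at which the suffix pat[i+1..] occurs again, where positions below 0 match
 * anything and the character before the occurrence (if any) must differ from
 * pat[i]. 0-based; rpr must have room for strlen(pat) ints.
 * Hypothetical helper, a brute-force sketch rather than an efficient method. */
void compute_rpr(const char *pat, int *rpr)
{
    int n = (int)strlen(pat);
    if (n == 0)
        return;
    rpr[n - 1] = n - 1;                 /* empty suffix: position itself */
    for (int i = n - 2; i >= 0; i--) {
        int len = n - 1 - i;            /* length of the suffix pat[i+1..] */
        int k;
        for (k = i; k > -len; k--) {    /* candidate start, may be negative */
            int ok = 1;
            for (int t = 0; t < len; t++) {
                int pos = k + t;
                if (pos >= 0 && pat[pos] != pat[i + 1 + t]) {
                    ok = 0;
                    break;
                }
            }
            if (ok && (k - 1 < 0 || pat[k - 1] != pat[i]))
                break;                  /* plausible reoccurrence found */
        }
        rpr[i] = k;                     /* k == -len: entirely off the left end */
    }
}
```

For ABCXXXABC this yields, from position 0 to 8, the values -5 -4 -3 -2 -1 0 -2 -1 8, which is the table above shifted by one because of the 0-based indexing.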

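4. Sunday

The Sunday algorithm is fairly simple: it essentially uses the bad character rule from Boyer-Moore, applied to the text character immediately after the current window. It is easy to implement and performs reasonably well.

Preprocessing time: O(M)

Matching time complexity: O(N*M) in the worst case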

#include <limits.h>
#include <string.h>

/*
 * ===  FUNCTION  ======================================================================
 *         Name:  sunday
 *  Description:  sunday method for string match.
 * =====================================================================================
 */
int sunday(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    //int (not char) so that shifts larger than CHAR_MAX fit;
    //one entry per possible unsigned char value
    int map[UCHAR_MAX + 1];
    int i;
    int text_len = (int)strlen(text);
    int find_len = (int)strlen(find);
    if (text_len < find_len)
        return -1;
    //preprocess: shift for each character that may follow the window
    for (i=0; i<=UCHAR_MAX; i++)
        map[i] = find_len + 1;
    for (i=0; i<find_len; i++)
        map[(unsigned char)find[i]] = find_len - i;
    //match process
    i = 0;
    while (i <= (text_len - find_len))
    {
        if (strncmp(find, text + i, find_len) == 0)
            return i;
        else
            //text[i + find_len] is the character just past the window
            //(the terminating '\0' for the last window)
            i += map[(unsigned char)text[i + find_len]];
    }
    return -1;
}
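To make the shift rule concrete, the preprocessing step can be pulled out into a helper of its own. This is only an illustrative sketch; the name sunday_shift_table is made up here and is not part of the original code. It fills a table that tells how far the window moves depending on the text character that follows it.

```c
#include <limits.h>
#include <string.h>

/* shift[c] = distance the window moves when the text character right after
 * the window is c: pattern length + 1 if c does not occur in the pattern,
 * otherwise the distance from the rightmost occurrence of c to one past the
 * end of the pattern. Hypothetical helper for illustration. */
void sunday_shift_table(const char *find, int shift[UCHAR_MAX + 1])
{
    int m = (int)strlen(find);
    for (int c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m + 1;
    for (int i = 0; i < m; i++)
        shift[(unsigned char)find[i]] = m - i;  /* later occurrences overwrite earlier ones */
}
```

For the pattern "abcab" the table holds 2 for 'a', 1 for 'b', 3 for 'c' and 6 for every character that does not occur in the pattern.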

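5. Rabin-Karp

Rabin-Karp uses a hash function on the pattern and on each window of the text to carry out the matching.

Preprocessing time: O(M)

Worst-case matching time complexity: O(N*M)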

#include <string.h>

/*
 * ===  FUNCTION  ======================================================================
 *         Name:  robin_karp
 *  Description:  Rabin-Karp method for string match problem.
 * =====================================================================================
 */
// Rabin-Karp needs a hash function.
// Unsigned arithmetic is used so that overflow wraps around instead of
// being undefined behaviour.
unsigned int hash(const char *s, unsigned int len)
{
    unsigned int result = 0;
    unsigned int base = 3;
    unsigned int i;
    for (i=0; i<len; i++)
    {
        result = result * base + (unsigned char)s[i];
    }
    return result;
}

int robin_karp(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    int i, j;
    int text_len = (int)strlen(text);
    int find_len = (int)strlen(find);
    if (text_len < find_len)
        return -1;
    unsigned int h_find = hash(find, find_len);
    unsigned int h_tmp = 0;
    for (i=0; i<=(text_len-find_len); i++)
    {
        //hash of the window text[i .. i+find_len-1]
        h_tmp = hash(text+i, find_len);
        if (h_tmp == h_find)
        {
            //equal hashes may be a collision: verify character by character
            for (j=0; j<find_len; j++)
            {
                if (find[j] != text[i+j])
                {
                    break;
                }
            }
            if (j == find_len)
                return i;
        }
    }
    return -1;
}

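Note: the performance depends mainly on the design of the hash function. The implementation above recomputes the hash of every window from scratch, so each window costs O(M).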
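The usual way to make the hash cheap is a rolling hash, where the hash of the next window is derived from the previous one in O(1). The following is a minimal sketch of that idea, not the original author's code; the function name rabin_karp_rolling and the constants RK_BASE and RK_MOD are assumptions chosen for this illustration.

```c
#include <stdint.h>
#include <string.h>

#define RK_BASE 256u          /* assumed base: one digit per byte */
#define RK_MOD  1000000007u   /* assumed prime modulus */

/* Rabin-Karp with a polynomial rolling hash. Returns the 0-based offset of
 * the first match, or -1. Hypothetical variant for illustration. */
int rabin_karp_rolling(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    size_t n = strlen(text), m = strlen(find);
    if (m == 0)
        return 0;
    if (n < m)
        return -1;

    uint64_t high = 1;        /* RK_BASE^(m-1) mod RK_MOD */
    for (size_t i = 1; i < m; i++)
        high = (high * RK_BASE) % RK_MOD;

    uint64_t h_find = 0, h_win = 0;
    for (size_t i = 0; i < m; i++) {
        h_find = (h_find * RK_BASE + (unsigned char)find[i]) % RK_MOD;
        h_win  = (h_win  * RK_BASE + (unsigned char)text[i]) % RK_MOD;
    }

    for (size_t i = 0; ; i++) {
        /* equal hashes may be a collision, so confirm with memcmp */
        if (h_win == h_find && memcmp(text + i, find, m) == 0)
            return (int)i;
        if (i + m >= n)
            break;            /* that was the last window */
        /* roll: drop text[i] from the front, append text[i+m] at the back */
        uint64_t lead = ((unsigned char)text[i] * high) % RK_MOD;
        h_win = (h_win + RK_MOD - lead) % RK_MOD;
        h_win = (h_win * RK_BASE + (unsigned char)text[i + m]) % RK_MOD;
    }
    return -1;
}
```

With this scheme the expected matching time is O(N + M) when collisions are rare; the worst case remains O(N*M), because every hash hit still has to be verified.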

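6. Bitap

Bitap performs string matching mainly with bit operations. The matching process can be viewed as state transitions in a finite automaton whose states correspond to the successive prefixes of the pattern; once the final state is reached, the match is complete.

Preprocessing time: O(M)

Worst-case matching time complexity: O(N*M)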

#include <string.h>

/*
 * ===  FUNCTION  ======================================================================
 *         Name:  bitap
 *  Description:  bitap method.
 * =====================================================================================
 */
int bitap(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    int text_len = (int)strlen(text);
    int find_len = (int)strlen(find);
    if (find_len == 0)
        return 0;
    if (text_len < find_len)
        return -1;
    int i = 0;
    int j = find_len - 1;
    //map[j+1] == 1 means find[0..j] matches the text ending at position i;
    //map[0] is always 1 (the empty prefix always matches)
    char map[find_len + 1];
    map[0] = 1;
    for (i=1; i<=find_len; i++)
    {
        map[i] = 0;
    }
    for (i=0; i< text_len; i++)
    {
        //update from right to left so that map[j] still holds the old state
        for (j=find_len-1; j>=0; j--)
        {
            map[j+1] = map[j] & (text[i] == find[j]);
        }
        if (map[find_len] == 1)
        {
            return i - find_len + 1;
        }
    }
    return -1;
}

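Note: the inner loop of Bitap can be replaced by shift operations on a bit vector, which lowers the matching complexity from O(N*M) to O(N) as long as the pattern fits in a machine word.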
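A minimal sketch of that bit-parallel variant (often called shift-and) follows, assuming the pattern is at most 64 characters long so that the whole state fits in one uint64_t. The function name bitap_shift_and and the return value -2 for an unsupported pattern length are choices made for this illustration only.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Bit-parallel Bitap: bit j of `state` is set when find[0..j] matches the
 * text ending at the current position. Returns the 0-based offset of the
 * first match, -1 if there is none, or -2 if the pattern is longer than
 * 64 characters (not handled by this sketch). */
int bitap_shift_and(const char *text, const char *find)
{
    if (text == NULL || find == NULL)
        return -1;
    size_t m = strlen(find);
    if (m == 0)
        return 0;
    if (m > 64)
        return -2;

    /* mask[c] has bit j set when find[j] == c */
    uint64_t mask[UCHAR_MAX + 1] = {0};
    for (size_t j = 0; j < m; j++)
        mask[(unsigned char)find[j]] |= (uint64_t)1 << j;

    uint64_t accept = (uint64_t)1 << (m - 1);
    uint64_t state = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        /* extend every partial match by one character, start a new one */
        state = ((state << 1) | 1) & mask[(unsigned char)text[i]];
        if (state & accept)
            return (int)(i - m + 1);
    }
    return -1;
}
```

The single shift, or and and per text character replace the inner loop over the pattern, which is where the O(N) matching time comes from.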

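Summary: among the algorithms above, KMP and BM give the better performance, while BF, Sunday and Bitap are the simplest to implement. Weighing the two aspects against each other, KMP is a good compromise.

Algorithm     Preprocessing time    Matching time complexity
BF            none                  O(N*M)
KMP           O(M)                  O(N)
BM            O(N+M^2)              O(N)
Sunday        O(M)                  O(N*M)
Rabin-Karp    O(M)                  O(N*M)
Bitap         O(M)                  O(N*M) -> O(N)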

The code used to compare the six implementations above is shown below (the text string has length 10000).
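A minimal sketch of how such a comparison could be set up, assuming a text of 10000 pseudo-random letters and using the C library's strstr as the reference answer. The names match_fn, reference_match, make_text and agrees_with_strstr are hypothetical helpers introduced for this sketch; it checks correctness only and is not the original benchmark.

```c
#include <stddef.h>
#include <string.h>

#define TEXT_LEN 10000

/* Common signature of all matchers in this article. */
typedef int (*match_fn)(const char *text, const char *find);

/* Reference answer from the C library: offset of the first match, or -1. */
static int reference_match(const char *text, const char *find)
{
    const char *hit = strstr(text, find);
    return hit ? (int)(hit - text) : -1;
}

/* Fills buf with TEXT_LEN pseudo-random letters from "abcd" (fixed seed,
 * simple linear congruential generator) and terminates it. */
static void make_text(char *buf)
{
    unsigned int seed = 12345u;
    for (int i = 0; i < TEXT_LEN; i++) {
        seed = seed * 1103515245u + 12345u;
        buf[i] = (char)('a' + (seed >> 16) % 4);
    }
    buf[TEXT_LEN] = '\0';
}

/* Returns 1 if fn gives the same answer as strstr on every probe, else 0. */
int agrees_with_strstr(match_fn fn)
{
    static char text[TEXT_LEN + 1];
    char pat[33];
    make_text(text);
    /* patterns of length 1, 2, 4, ..., 32 cut out of the text itself */
    for (int start = 0; start + 32 <= TEXT_LEN; start += 997) {
        for (int len = 1; len <= 32; len *= 2) {
            memcpy(pat, text + start, (size_t)len);
            pat[len] = '\0';
            if (fn(text, pat) != reference_match(text, pat))
                return 0;
        }
    }
    /* a pattern that cannot occur: 'z' is not in the alphabet */
    return fn(text, "abz") == -1;
}
```

Each matcher can then be checked with a call such as agrees_with_strstr(kmp) or agrees_with_strstr(bm); timing the calls, for example with clock() from <time.h>, would turn the check into a rough benchmark.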

Reposted from: http://blog.csdn.net/meixr/article/details/6456896
