KMP String Matching Algorithm

1. The Basic Idea of KMP

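Problem: find the string ABABACA inside the string ABABABACA and return the position of its first occurrence.
Let us walk through the matching process.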

ABABABACA
ABABACA
     | mismatch occurs here

With the naive string matching algorithm, a mismatch shifts the pattern right by one position and matching restarts from the pattern's first character:

ABABABACA
 ABABACA
 | shifted right by one; matching restarts from the first character and fails immediately, so this shift is wasted

ABABABACA
  ABABACA
  | shifted right by one more; matching restarts from the first character, runs to the end of the pattern, and succeeds
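
The naive procedure traced above can be sketched as follows. This is only an illustrative sketch: the class name NaiveSearch and the method name indexOf are made up for this example and do not appear elsewhere in the article.

```java
final class NaiveSearch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // (Hypothetical helper, written only to illustrate the naive approach.)
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int k = pattern.length();
        for (int m = 0; m + k <= n; m++) {   // m: current alignment of the pattern
            int i = 0;
            while (i < k && text.charAt(m + i) == pattern.charAt(i)) {
                i++;                         // extend the current match
            }
            if (i == k) {
                return m;                    // the whole pattern matched
            }
            // mismatch: shift by one and restart from pattern[0]
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ABABABACA", "ABABACA")); // prints 2
    }
}
```

Every mismatch throws away all progress, which is why the worst case costs on the order of text length times pattern length.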

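Can we avoid the wasted shifts, and avoid restarting from the beginning of the pattern every time? That is exactly what the KMP algorithm does.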

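ABABABACA
ABABACA
     | mismatch here; call this position pos

ABABABACA
  ABABACA
     | shift directly by 2
     | the 3 characters ABA just before pos are already known to match, so there is no need to restart from the head of the pattern; matching resumes at pos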

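Question: how are ABABA and ABA related, and how do we know we can shift by 2 directly?
ABA is the longest string that is both a prefix and a suffix of ABABA.
The prefixes of ABABA (not including its last character) are A, AB, ABA, ABAB.
The suffixes of ABABA (not including its first character) are A, BA, ABA, BABA.
So the longest string common to the prefixes and suffixes of ABABA is ABA, of length 3.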

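Shift = number of characters already matched - corresponding partial match value

In the example above, 5 characters are matched and the partial match value is 3, so the shift is 5 - 3 = 2.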
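
To make the rule concrete, here is a small brute-force sketch that computes the partial match value of an already-matched string and the resulting shift. The names PartialMatch, value and shift are hypothetical, and the quadratic search is only for illustration; it is not how the table is built in practice.

```java
final class PartialMatch {
    // Length of the longest string that is both a proper prefix
    // and a proper suffix of s (brute force, for illustration only).
    static int value(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            // does the suffix of length len also start the string?
            if (s.startsWith(s.substring(s.length() - len))) {
                return len;
            }
        }
        return 0;
    }

    // Shift = matched characters - partial match value of the matched part.
    static int shift(String matched) {
        return matched.length() - value(matched);
    }

    public static void main(String[] args) {
        System.out.println(value("ABABA")); // prints 3
        System.out.println(shift("ABABA")); // prints 2
    }
}
```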

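If we compute the partial match value for every position of the pattern in advance, the shift can be read off directly and wasted shifts are avoided. The table of these values is called the Partial Match Table.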

2. How to Compute the Partial Match Table (the next Array)

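Here next[i] is the partial match value of the first i characters of the pattern, P[0..i - 1]. The first two elements of the next array are always -1 and 0; the -1 in next[0] is a sentinel meaning that nothing has been matched yet.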

  A B A B A C A
 -1 0

next[pos] is computed from next[pos - 1].

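1. The character at pos - 1 equals the character at cnd, where cnd = next[pos - 1]
In the figure, the light-blue block is the longest common prefix/suffix of the substring P[0..pos - 2], and the two dark-blue cells, P[cnd] and P[pos - 1], hold the same character. The longest common prefix/suffix of the substring P[0..pos - 1] therefore has length next[pos - 1] + 1, that is, next[pos] = cnd + 1.

2. The character at pos - 1 differs from the character at cnd, where cnd = next[pos - 1]
In the figure, the green block is the longest common prefix/suffix of the substring P[0..cnd - 1]; its length is next[cnd]. This green string is necessarily also a common prefix/suffix of P[0..pos - 2], although not the longest one. If the character at next[cnd] equals the character at pos - 1, then next[pos] = next[cnd] + 1. If not, set cnd = next[cnd] and repeat the step. If cnd is already 0 and P[0] still differs from P[pos - 1], then next[pos] = 0.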

The implementation is as follows:

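// Assumes p is non-empty.
private int[] getNext(String p) {
    // a one-character pattern has only the sentinel entry
    if (p.length() == 1)
        return new int[] {-1};
    int[] next = new int[p.length()];
    next[0] = -1;
    next[1] = 0;
    int pos = 2; // index of the next[] entry currently being computed
    int cnd = 0; // index of the character that follows the longest common prefix/suffix found so far
    while (pos < p.length()) {
        if (p.charAt(pos - 1) == p.charAt(cnd)) {
            // case 1: the common prefix/suffix grows by one character
            next[pos++] = ++cnd;
        } else if (cnd > 0) {
            // case 2: fall back to the next shorter common prefix/suffix
            cnd = next[cnd];
        } else {
            // no common prefix/suffix at all
            next[pos++] = 0;
        }
    }
    return next;
}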

3. Optimizing the next Array

The implementation so far is not perfect. Take the same pattern ABABACA, but this time search in the string ABABCABABACA, and follow the matching process.

ABABCABABACA
ABABACA
    | mismatch here; according to the partial match table next[4] = 2 (the longest common prefix/suffix is AB), so shift right by 2

ABABCABABACA
  ABABACA
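    | mismatch again. Note that the previous comparison was C against A, and this one is C against A as well, so this shift is also wasted. This is what we want to optimize away.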

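The optimization: when computing next[pos], check whether the current character P[pos] equals the character that follows the prefix, P[cnd]. If they are equal, a mismatch at pos would simply be repeated at cnd, so set next[pos] = next[cnd] instead of cnd.
The result:

Original next array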
 A B A B A C A
-1 0 0 1 2 3 0

Improved next array
 A B A B A C A
-1 0 0 0 0 3 0

The optimized code is as follows:

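// Assumes p is non-empty.
private int[] getNext(String p) {
    if (p.length() == 1)
        return new int[] {-1};
    int[] next = new int[p.length()];
    next[0] = -1;
    // next[1] must stay 0: the fallback test below reads next[cnd] == -1
    // as "cnd is at the pattern head", which may only hold for next[0]
    next[1] = 0;
    int cnd = 0;
    int pos = 2;
    while (pos < p.length()) {
        if (p.charAt(pos - 1) == p.charAt(cnd)) {
            // check whether the current character equals the character that follows the prefix;
            // if it does, a mismatch at pos would repeat at cnd, so next[pos] = next[cnd]
            if (p.charAt(pos) != p.charAt(++cnd))
                next[pos++] = cnd;
            else
                next[pos++] = next[cnd];
        } else if (next[cnd] > -1) {
            // fall back to a shorter common prefix/suffix
            cnd = next[cnd];
        } else {
            // cnd is 0 and the characters differ: no common prefix/suffix
            next[pos++] = 0;
        }
    }
    return next;
}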
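
To see the effect of the optimization on the example, the sketch below runs the same search with the two next arrays listed above and counts character comparisons. It is only an illustration: the class CompareTables and its comparisons counter are hypothetical names, the two tables are hard-coded from the listings above, and the loop follows the matching procedure of section 4.

```java
final class CompareTables {
    static int comparisons; // character comparisons made by the last search

    // KMP matching loop that takes the next table as a parameter
    // and counts how many character comparisons it performs.
    static int search(String str, String pattern, int[] next) {
        comparisons = 0;
        int m = 0; // start of the current match attempt in str
        int i = 0; // position in pattern being compared
        while (m + i < str.length()) {
            comparisons++;
            if (str.charAt(m + i) == pattern.charAt(i)) {
                i++;
                if (i == pattern.length()) return m;
            } else if (next[i] == -1) {
                m++;
                i = 0;
            } else {
                m = m + i - next[i];
                i = next[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "ABABCABABACA";
        String pattern = "ABABACA";
        int[] original = {-1, 0, 0, 1, 2, 3, 0};
        int[] improved = {-1, 0, 0, 0, 0, 3, 0};

        int r1 = search(text, pattern, original);
        int c1 = comparisons;
        int r2 = search(text, pattern, improved);
        int c2 = comparisons;

        System.out.println("original: index " + r1 + ", " + c1 + " comparisons");
        System.out.println("improved: index " + r2 + ", " + c2 + " comparisons");
    }
}
```

Both runs should report the same match position; the improved table skips the repeated comparison against the A at pattern position 2.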

4. Linear-Time String Matching with the next Array

The implementation is as follows:

public int strStr(String str, String pattern) {
    if (str == null || pattern == null)
        return -1;
    if (pattern.length() == 0)
        return 0;
    int[] next = getNext(pattern);
    int m = 0; // position in str where the current match attempt starts
    int i = 0; // position in pattern of the character currently being compared
    while (m + i < str.length()) {
        if (str.charAt(m + i) == pattern.charAt(i)) {
            i++;
            if (i == pattern.length())
                return m;
        } else {
            if (next[i] == -1) {
                // mismatch at the first pattern character, nothing to reuse:
                // shift right by one and restart from the head of the pattern
                m++;
                i = 0;
            } else {
                // i is the matched length, next[i] the partial match length, i - next[i] the shift
                m = m + i - next[i]; // shift right
                i = next[i]; // the partial match length becomes the new matched length
            }
        }
    }
    return -1;
}
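
One way to sanity-check the matcher is to compare its results with Java's built-in String.indexOf on a few inputs. The sketch below is self-contained, so it repeats the table construction of section 2 (the unoptimized variant) and the matching loop above as static methods; the class name KmpCheck and the chosen test strings are assumptions made for this example.

```java
final class KmpCheck {
    // Table construction following section 2 (unoptimized variant).
    static int[] getNext(String p) {
        int[] next = new int[p.length()];
        next[0] = -1;
        if (p.length() == 1) return next;
        next[1] = 0;
        int pos = 2, cnd = 0;
        while (pos < p.length()) {
            if (p.charAt(pos - 1) == p.charAt(cnd)) next[pos++] = ++cnd;
            else if (cnd > 0) cnd = next[cnd];
            else next[pos++] = 0;
        }
        return next;
    }

    // Matching loop following section 4.
    static int strStr(String str, String pattern) {
        if (str == null || pattern == null) return -1;
        if (pattern.isEmpty()) return 0;
        int[] next = getNext(pattern);
        int m = 0, i = 0;
        while (m + i < str.length()) {
            if (str.charAt(m + i) == pattern.charAt(i)) {
                if (++i == pattern.length()) return m;
            } else if (next[i] == -1) {
                m++;
                i = 0;
            } else {
                m = m + i - next[i];
                i = next[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"ABABABACA", "ABABACA"},
            {"ABABCABABACA", "ABABACA"},
            {"AAAA", "B"},
            {"AABAACAABAA", "AABAA"},
            {"abc", ""},
        };
        for (String[] c : cases) {
            int got = strStr(c[0], c[1]);
            int want = c[0].indexOf(c[1]); // reference answer from the JDK
            System.out.println("\"" + c[1] + "\" in \"" + c[0] + "\": " + got
                    + (got == want ? " (ok)" : " (expected " + want + ")"));
        }
    }
}
```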

References

Knuth–Morris–Pratt algorithm
