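Overview
Given a text str1[0…n-1] and a pattern str2[0…m-1] (called txt and pat in the examples and code below), write a function search(char str1[], char str2[]) that prints every position at which str2[] occurs in str1[]. Assume n > m.
Examples: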
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
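String matching is an important problem in computer science. When we search for a string in a Notepad or Word file, in a browser, or in a database, a string matching algorithm produces the search results.
We discussed the naive string matching algorithm in the previous post. Its worst-case complexity is O(m(n-m+1)). The KMP algorithm searches in O(n) time in the worst case, after an O(m) preprocessing pass over the pattern.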
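For reference, the naive algorithm mentioned above can be sketched as follows. This is a minimal illustration, not the code from the previous post: the function name naiveSearch and the use of std::string are assumptions made for this example. It tries every one of the n-m+1 alignments and compares up to m characters at each, which is where the O(m(n-m+1)) bound comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive search (illustrative name): try every alignment s of pat against
// txt and compare character by character.
// Returns the starting index of every occurrence of pat in txt.
std::vector<std::size_t> naiveSearch(const std::string& txt, const std::string& pat) {
    assert(!pat.empty());  // assumption: the pattern is non-empty
    std::vector<std::size_t> hits;
    for (std::size_t s = 0; s + pat.size() <= txt.size(); ++s) {
        std::size_t j = 0;
        // Compare pat[0..m-1] with txt[s..s+m-1], stopping at the first mismatch.
        while (j < pat.size() && txt[s + j] == pat[j]) ++j;
        if (j == pat.size()) hits.push_back(s);  // all m characters matched
    }
    return hits;
}
```

Called on the two inputs from the examples above, it should report index 10 for "TEST" and indices 0, 9 and 12 for "AABA".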
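What is the KMP (Knuth-Morris-Pratt) pattern searching algorithm
The naive string matching algorithm performs poorly when many matching characters are followed by a mismatching character.
Examples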
str1[] = "AAAAAAAAAAAAAAAAAB"
str2[] = "AAAAB"
str1[] = "ABABABCABABABCABABABC"
str2[] = "ABABAC" (not a worst case, but a bad case for Naive)
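To make the cost on these inputs concrete, here is a small sketch that counts how many character comparisons the naive algorithm performs. The function name naiveComparisons is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the character comparisons made by the naive algorithm
// (illustrative helper, not part of the original article).
std::size_t naiveComparisons(const std::string& txt, const std::string& pat) {
    assert(!pat.empty());  // assumption: the pattern is non-empty
    std::size_t count = 0;
    for (std::size_t s = 0; s + pat.size() <= txt.size(); ++s) {
        for (std::size_t j = 0; j < pat.size(); ++j) {
            ++count;                           // one comparison of txt[s+j] with pat[j]
            if (txt[s + j] != pat[j]) break;   // mismatch: move to the next alignment
        }
    }
    return count;
}
```

In the first example every alignment matches four characters before the fifth decides the outcome, so the count reaches the full m(n-m+1). In the second example alignments that start on a 'B' fail after one comparison, so the count stays below that bound, which is why it is only a bad case and not the worst case.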
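First we need to understand a few concepts.
Example
Step 1, first match: we find a match of str2 in str1. This is no different from the naive algorithm.
Step 2: as in the naive algorithm, slide str2 one position to the right.
This is where KMP improves on the naive algorithm. In this second comparison we only use the fourth character of str2 to decide whether the current window matches. The first three characters will match anyway, so we skip comparing them.
This raises a question: how do we know which characters can be skipped, and how many? That requires some preprocessing.
Step 0
The partial match table array
- The KMP algorithm preprocesses str2[] and builds an auxiliary array lps[] of size M (the same size as str2), which is used to skip characters while matching.
- The name lps stands for the longest proper prefix that is also a suffix. Two notions are involved here: a proper prefix of a string is any leading part of it that leaves out at least the last character, and a proper suffix is any trailing part of it that leaves out at least the first character. For example, the proper prefixes of "ABC" are "", "A" and "AB".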
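The two notions can be made concrete with a short sketch that enumerates them. The function names properPrefixes and properSuffixes are made up for this illustration, and the empty string is counted as a proper prefix and suffix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Every prefix of s except s itself, shortest first (the empty prefix included).
std::vector<std::string> properPrefixes(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t len = 0; len < s.size(); ++len)
        out.push_back(s.substr(0, len));
    return out;
}

// Every suffix of s except s itself, shortest first (the empty suffix included).
std::vector<std::string> properSuffixes(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t len = 0; len < s.size(); ++len)
        out.push_back(s.substr(s.size() - len));
    return out;
}
```

For "ABC" this gives the proper prefixes "", "A", "AB" and the proper suffixes "", "C", "BC".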
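Example
- For lps[] we look at sub-patterns of str2, focusing on substrings that are both a prefix and a suffix.
- For each sub-pattern str2[0..i], lps[i] stores the length of the longest proper prefix that is also a suffix of that sub-pattern.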
lps[i] = the longest proper prefix of pat[0..i]
which is also a suffix of pat[0..i].
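Note: lps[i] could equally be defined as the longest prefix that is also a proper suffix. "Proper" has to be applied on one side so that the whole substring itself is not counted.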
Examples of lps[] construction:
For the pattern “AAAA”,
lps[] is [0, 1, 2, 3]
For the pattern “ABCDE”,
lps[] is [0, 0, 0, 0, 0]
For the pattern “AABAACAABAA”,
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
For the pattern “AAACAAAAAC”,
lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]
For the pattern “AAABAAA”,
lps[] is [0, 1, 2, 0, 1, 2, 3]
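The tables above can be checked directly against the definition. The sketch below is a deliberately slow brute-force version, not the linear-time preprocessing that KMP uses; the name lpsByDefinition is an assumption for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Brute force: for each i, try proper-prefix lengths of pat[0..i] from the
// longest (i) down to 1 and keep the first one that equals the suffix of
// the same length. Cubic time in the worst case, for checking only.
std::vector<int> lpsByDefinition(const std::string& pat) {
    std::vector<int> lps(pat.size(), 0);
    for (std::size_t i = 0; i < pat.size(); ++i) {
        for (std::size_t len = i; len > 0; --len) {
            // prefix pat[0..len-1] versus suffix pat[i-len+1..i]
            if (pat.compare(0, len, pat, i + 1 - len, len) == 0) {
                lps[i] = static_cast<int>(len);
                break;
            }
        }
    }
    return lps;
}
```

Because the candidate length never exceeds i, the whole sub-pattern pat[0..i] is never matched against itself, which is exactly the "proper" requirement.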
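Matching algorithm
The naive algorithm slides str2 one position at a time and compares all characters at every shift. KMP instead uses the values stored in lps[] to decide which characters to compare next. The idea is to avoid comparing characters that we already know will match.
This raises a question:
How do we use lps[] to decide the next position (or to know how many characters to skip)?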
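- We start by comparing str2[j], with j = 0, against the characters of the current window of str1, beginning at str1[i] with i = 0.
- We keep matching str1[i] and str2[j], incrementing both i and j for as long as they match.
- When we see a mismatch:
  - We know that str2[0…j-1] matches str1[i-j…i-1] (note that j starts at 0 and is incremented only on a match).
  - From the definition above, we also know that lps[j-1] is the number of characters of str2[0…j-1] that form both a proper prefix and a suffix.
  - From these two points we can conclude that the first lps[j-1] characters of str2 do not need to be compared against str1[i-j…i-1], because we know they will match anyway. So we set j = lps[j-1] and leave i unchanged; if j is already 0, we advance i instead.
The example at the start of this section illustrates this: after the first match, only the fourth character of str2 has to be compared.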
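The rules above can be sketched as a search that also counts its comparisons, so the effect of skipping can be observed. This is an illustrative variant under assumptions, not the article's implementation: the names KmpResult, buildLps and kmpSearchCounted are hypothetical, and std::string replaces the char arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct KmpResult {                      // hypothetical helper type
    std::vector<std::size_t> hits;      // starting index of each match
    std::size_t comparisons = 0;        // text/pattern character comparisons
};

// Linear-time lps[] construction (same idea as the preprocessing step).
std::vector<std::size_t> buildLps(const std::string& pat) {
    std::vector<std::size_t> lps(pat.size(), 0);
    std::size_t len = 0;
    for (std::size_t i = 1; i < pat.size();) {
        if (pat[i] == pat[len]) lps[i++] = ++len;
        else if (len != 0) len = lps[len - 1];
        else lps[i++] = 0;
    }
    return lps;
}

KmpResult kmpSearchCounted(const std::string& txt, const std::string& pat) {
    assert(!pat.empty());  // assumption: the pattern is non-empty
    KmpResult r;
    const std::vector<std::size_t> lps = buildLps(pat);
    std::size_t i = 0, j = 0;  // i indexes txt, j indexes pat
    while (i < txt.size()) {
        ++r.comparisons;
        if (txt[i] == pat[j]) {
            ++i;
            ++j;
            if (j == pat.size()) {       // full match ending at i-1
                r.hits.push_back(i - j);
                j = lps[j - 1];          // keep the overlap for the next match
            }
        } else if (j != 0) {
            j = lps[j - 1];              // skip characters known to match; i stays
        } else {
            ++i;                         // nothing matched: advance in the text
        }
    }
    return r;
}
```

Every loop iteration either advances i or shrinks j, and j can only shrink as often as it has grown, so the number of comparisons stays within 2n. That is the reason the search is linear in the length of the text.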
C++ implementation
// C++ program for implementation of the KMP pattern searching
// algorithm
#include <cstdio>
#include <cstring>
#include <vector>

void computeLPSArray(const char* pat, int M, int* lps);

// Prints occurrences of pat[] in txt[]
void KMPSearch(const char* pat, const char* txt)
{
    int M = static_cast<int>(std::strlen(pat));
    int N = static_cast<int>(std::strlen(txt));
    if (M == 0)
        return; // nothing to search for

    // lps[] holds the longest prefix suffix values for the pattern.
    // std::vector is used instead of a non-standard variable-length array.
    std::vector<int> lps(M);

    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps.data());

    int i = 0; // index for txt[]
    int j = 0; // index for pat[]
    while (i < N) {
        if (pat[j] == txt[i]) {
            j++;
            i++;
        }

        if (j == M) {
            std::printf("Found pattern at index %d\n", i - j);
            j = lps[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i]) {
            // Do not re-match pat[0..lps[j-1]-1],
            // those characters will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
}

// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(const char* pat, int M, int* lps)
{
    // length of the previous longest prefix suffix
    int len = 0;

    lps[0] = 0; // lps[0] is always 0

    // the loop calculates lps[i] for i = 1 to M-1
    int i = 1;
    while (i < M) {
        if (pat[i] == pat[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example
            // AAACAAAA and i = 7. The idea is similar
            // to the search step.
            if (len != 0) {
                len = lps[len - 1];
                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}

// Driver program to test above function
int main()
{
    char txt[] = "ABABDABACDABABCABAB";
    char pat[] = "ABABCABAB";
    KMPSearch(pat, txt);
    return 0;
}
Java implementation
// Java program for implementation of the KMP pattern
// searching algorithm
class KMP_String_Matching {
    // Prints occurrences of pat in txt
    void KMPSearch(String pat, String txt)
    {
        int M = pat.length();
        int N = txt.length();
        if (M == 0)
            return; // nothing to search for

        // lps[] holds the longest prefix suffix
        // values for the pattern
        int[] lps = new int[M];
        int j = 0; // index for pat

        // Preprocess the pattern (calculate lps[]
        // array)
        computeLPSArray(pat, M, lps);

        int i = 0; // index for txt
        while (i < N) {
            if (pat.charAt(j) == txt.charAt(i)) {
                j++;
                i++;
            }
            if (j == M) {
                System.out.println("Found pattern "
                                   + "at index " + (i - j));
                j = lps[j - 1];
            }
            // mismatch after j matches
            else if (i < N && pat.charAt(j) != txt.charAt(i)) {
                // Do not re-match pat[0..lps[j-1]-1],
                // those characters will match anyway
                if (j != 0)
                    j = lps[j - 1];
                else
                    i = i + 1;
            }
        }
    }

    // Fills lps[] for given pattern pat[0..M-1]
    void computeLPSArray(String pat, int M, int[] lps)
    {
        // length of the previous longest prefix suffix
        int len = 0;
        int i = 1;
        lps[0] = 0; // lps[0] is always 0

        // the loop calculates lps[i] for i = 1 to M-1
        while (i < M) {
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
                lps[i] = len;
                i++;
            }
            else // (pat[i] != pat[len])
            {
                // This is tricky. Consider the example
                // AAACAAAA and i = 7. The idea is similar
                // to the search step.
                if (len != 0) {
                    len = lps[len - 1];
                    // Also, note that we do not increment
                    // i here
                }
                else // if (len == 0)
                {
                    lps[i] = 0;
                    i++;
                }
            }
        }
    }

    // Driver program to test above function
    public static void main(String[] args)
    {
        String txt = "ABABDABACDABABCABAB";
        String pat = "ABABCABAB";
        new KMP_String_Matching().KMPSearch(pat, txt);
    }
}
// This code has been contributed by Amit Khandelwal.
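Preprocessing algorithm
The preprocessing step computes the values of lps[]. The walk-through below traces it for one pattern.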
pat[] = “AAACAAAA”
len = 0, i = 0.
lps[0] is always 0, we move
to i = 1
len = 0, i = 1.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2
len = 1, i = 2.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3
len = 2, i = 3.
Since pat[len] and pat[i] do not match, and len > 0,
set len = lps[len-1] = lps[1] = 1
len = 1, i = 3.
Since pat[len] and pat[i] do not match and len > 0,
len = lps[len-1] = lps[0] = 0
len = 0, i = 3.
Since pat[len] and pat[i] do not match and len = 0,
Set lps[3] = 0 and i = 4.
len = 0, i = 4.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5
len = 1, i = 5.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6
len = 2, i = 6.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7
len = 3, i = 7.
Since pat[len] and pat[i] do not match and len > 0,
set len = lps[len-1] = lps[2] = 2
len = 2, i = 7.
Since pat[len] and pat[i] match, do len++,
store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8
We stop here as we have constructed the whole lps[].
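The walk-through above can be replayed with a small instrumented version of the preprocessing loop. This is a sketch under assumptions: the names LpsTrace and computeLpsTraced are hypothetical, and the fallback counter is added only to make the "set len = lps[len-1]" steps visible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct LpsTrace {            // hypothetical helper type
    std::vector<int> lps;    // the finished lps[] table
    int fallbacks = 0;       // how often len was reset to lps[len-1]
};

LpsTrace computeLpsTraced(const std::string& pat) {
    LpsTrace t;
    t.lps.assign(pat.size(), 0);  // lps[0] is always 0
    std::size_t len = 0;          // length of the previous longest prefix suffix
    std::size_t i = 1;
    while (i < pat.size()) {
        if (pat[i] == pat[len]) {
            ++len;                                    // extend the current border
            t.lps[i] = static_cast<int>(len);
            ++i;
        } else if (len != 0) {
            len = static_cast<std::size_t>(t.lps[len - 1]);  // fall back, i stays
            ++t.fallbacks;
        } else {
            t.lps[i] = 0;                             // no border at all
            ++i;
        }
    }
    return t;
}
```

For the pattern "AAACAAAA" this yields lps[] = [0, 1, 2, 0, 1, 2, 3, 3] with three fallback steps: two at i = 3 and one at i = 7, matching the trace.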