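I spent the last couple of days studying the KMP algorithm. I won't go over what it is here; if it is new to you, look it up first.
I have uploaded the code to GitHub; you can download it from: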
https://github.com/nemax/KMP-search-algorithm
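KMP is usually compared with the brute-force (BF) approach, so where exactly does KMP win? In its overlay function. The overlay function computes information about the pattern itself: its values describe how the pattern overlaps with itself, hence the name. (I have also heard it called the next function, among other names.)
Convention: a red letter is the mismatched letter; a green letter is the position where this round of comparison starts after the previous round ended; blue means the starting position and the mismatch position are the same.
A quick illustration: take the pattern abaabc and the main string abaabacdad. In the first round everything matches except the final c:
Main string: a b a a b a c d a d
Pattern:     a b a a b c
                       ^ mismatch (a vs c)
BF would now start matching all over again. But think about it: if only the final c fails, everything before it matched. So the two characters of the main string just before the mismatch must be ab, and, notice, the pattern itself begins with ab. Knowing that, we can skip those two comparisons and jump straight from the situation above to the one below, with the pattern's leading ab already lined up and the comparison resuming at its third character, a:
Main string: a b a a b a c d a d
Pattern:           a b a a b c
                       ^ comparison resumes here
BF, on the other hand, continues after the failed round like this:
Main string: a b a a b a c d a d
Pattern:       a b a a b c
               ^ comparison restarts here and mismatches at once (b vs a)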
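See how big the difference is? Look at the starting positions: KMP never moves the main-string pointer backwards, whereas BF must move it back as soon as a round fails. That is the difference between the two: KMP makes full use of the information in the pattern itself, which avoids backtracking and unnecessary comparisons. In KMP every character of the pattern has its own next value. Loosely speaking, when a mismatch occurs at the pattern character with index i, the next comparison should use the pattern character with index next[i]. The function that computes the next values is the overlay function mentioned above.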
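For contrast, here is a minimal sketch of the brute-force search that KMP improves on. It is not part of the repository code; the name bf_search is made up for this illustration. Every failed round throws away what was learned and restarts one position further along the main string.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search (hypothetical helper, for illustration only).
   Returns the index of the first match of pattern in text, or -1. */
int bf_search(const char *text, const char *pattern){
    assert(text != NULL && pattern != NULL);
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    for(int start = 0; start + m <= n; start++){
        int j = 0;
        /* Compare the pattern against text[start..] from scratch. */
        while(j < m && text[start + j] == pattern[j]){
            j++;
        }
        if(j == m){
            return start;
        }
        /* Mismatch: the text position goes back to start + 1,
           which is exactly the backtracking KMP avoids. */
    }
    return -1;
}
```

With the example above, bf_search("abaabacdad", "abaabc") finds nothing, but only after restarting at every position of the main string.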
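I have seen roughly two ways of computing the next values. One yields values that tell you directly which position to compare after a mismatch; the other yields values that need a fixed extra calculation to get that position. Computing the next values is the core of KMP.
Walk through the next function once and you will find the code is actually simple. It is hard only until you know how.
It uses two indices. One, k, points at the character whose next value is being computed. The other, value, is harder to explain in words; it will become clear from the code.
Let's start with how the next values are computed, beginning with the first method (the one whose values need the extra calculation). The code is: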
    void get_next_array_origin(const char *pattern){
        //Get pattern length
        int length = (int)strlen(pattern);
        int k = 1, value = 0;
        //By convention next[0] is always -1
        next[0] = -1;
        while(k < length){
            //Start from the next value of the previous character
            value = next[k-1];
            /*value >= 0 means an overlay has been found so far;
              the second condition means pattern[k] cannot extend
              it, so fall back to a shorter overlay.
            */
            while(value >= 0 && pattern[k] != pattern[value+1]){
                value = next[value];
            }
            /*pattern[k] either starts an overlay (value equals -1)
              or extends the current one (value >= 0).
            */
            if(pattern[k] == pattern[value+1]){
                next[k] = value+1;
            }
            //No overlay ends at pattern[k]
            else{
                next[k] = -1;
            }
            k++;
            //printf("next[%d] = %d\n",k-1,next[k-1]);
        }
    }
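Suppose the pattern is written a[0]a[1]..a[k]...a[j-k]...a[j]. If a[0] through a[k] matches a[j-k] through a[j] exactly, then next[j] = k (taking the largest such k below j); we say an overlay has been found, that is, an overlap within the pattern itself. If no such match exists, next[j] = -1. The one special case is next[0] = -1, which is fixed no matter which overlay function is used.
That makes it easy. By this definition, what are the values for abaabc?
1. next[0] = -1;
2. Looking only at ab: next[1] = -1;
3. Looking only at aba: next[2] = 0, because pattern[0] to pattern[0] equals pattern[2-0] to pattern[2];
4. Looking only at abaa: next[3] = 0, because pattern[0] to pattern[0] equals pattern[3-0] to pattern[3];
5. Looking only at abaab: next[4] = 1, because pattern[0] to pattern[1] equals pattern[4-1] to pattern[4];
6. Likewise, next[5] = -1.
So:
Pattern:  a  b  a  a  b  c
next:    -1 -1  0  0  1 -1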
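The table above can be double-checked by applying the definition literally. The following is a small sketch for that purpose only; next_by_definition is a name invented here, and it is deliberately slow (it tries every candidate k from the largest down), unlike the real overlay function.

```c
#include <assert.h>
#include <string.h>

/* First-method next value for position j, straight from the
   definition: the largest k < j such that pattern[0..k] equals
   pattern[j-k..j], or -1 if there is none.
   (Hypothetical helper, for checking the table by hand.) */
int next_by_definition(const char *pattern, int j){
    assert(pattern != NULL && j >= 0 && j < (int)strlen(pattern));
    for(int k = j - 1; k >= 0; k--){
        /* Compare the prefix of length k+1 with the substring of
           the same length that ends at index j. */
        if(memcmp(pattern, pattern + j - k, (size_t)k + 1) == 0){
            return k;
        }
    }
    return -1;
}
```

Calling it for j = 0 to 5 on abaabc gives -1, -1, 0, 0, 1, -1, the same row as in the table.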
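How is this value used? As I said earlier, next values computed by this rule do not tell you directly which character to compare after a mismatch; you have to calculate it.
For example, suppose c mismatches. We know the two main-string characters before the mismatch are ab, and the pattern also begins with ab, so we simply continue with the a at pattern[2], because the pattern's leading ab certainly coincides with the ab just before the mismatch in the main string. Where does the 2 in pattern[2] come from? Unfortunately not from the next value of c, but from the next value of the character before c, namely the b at pattern[4], plus one. So the formula is:
index of the next character to compare = next value of the last pattern character that matched + 1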
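As a tiny worked instance of this formula, assuming the first-method table for abaabc computed above (the helper name resume_index is made up for this sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Given a first-method next table and the pattern index where the
   mismatch happened (must be > 0), return the pattern index to
   compare next. Hypothetical helper, for illustration only. */
int resume_index(const int *next_table, int mismatch_index){
    assert(next_table != NULL && mismatch_index > 0);
    /* Next value of the last matched character, plus one. */
    return next_table[mismatch_index - 1] + 1;
}
```

With the table -1 -1 0 0 1 -1, a mismatch at index 5 (the c) gives next[4] + 1 = 2, so the comparison continues at pattern[2]. A mismatch at index 0 is not covered by the formula; in that case the main-string pointer simply advances by one.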
However, there is a better way to compute the next values. The code is:
    void get_next_array(const char *pattern){
        int length = (int)strlen(pattern);
        int k = 0, value = -1;
        next[k] = -1;
        while(k < length){
            //pattern[k] cannot extend the overlay: fall back
            while(value >= 0 && pattern[k] != pattern[value]){
                value = next[value];
            }
            k++;
            value++;
            //Note that this also fills next[length]
            next[k] = value;
        }
    }
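next[0] = -1 stays the same and there are still two indices, but the rest of the computation differs slightly. The values computed this way tell you directly which pattern index to compare next after a mismatch. See the code for the details of the computation.
Pattern:  a  b  a  a  b  c
next:    -1  0  0  1  1  2
If the mismatch is at c, the next comparison uses pattern[next value of c], that is, pattern[2].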
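The two tables are closely related: apart from next[0] = -1, each entry of the second table is the first table's entry for the previous character plus one, which is the same formula as before applied once up front. A short sketch of that conversion, with the hypothetical name origin_to_direct:

```c
#include <assert.h>
#include <stddef.h>

/* Derive the second-method table (length + 1 entries) from a
   first-method table (length entries):
   next2[0] = -1, next2[i] = next1[i-1] + 1 for i >= 1.
   Hypothetical helper, for illustration only. */
void origin_to_direct(const int *next1, int length, int *next2){
    assert(next1 != NULL && next2 != NULL && length >= 0);
    next2[0] = -1;
    for(int i = 1; i <= length; i++){
        next2[i] = next1[i - 1] + 1;
    }
}
```

Feeding it -1 -1 0 0 1 -1 for abaabc yields -1 0 0 1 1 2, followed by one extra entry, 0, for the position just past the end of the pattern.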
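Once you understand it thoroughly, you will see that value moves forward as long as an overlay continues; when it does not, value falls back to the previous position where an overlay could begin or continue, and if none can, it ends up at -1.
This is quite similar to the earlier definition written with all those a[...] terms.
There is a way to improve this next computation further. Take the same pattern again: if the a at pattern[2] mismatches, the next comparison would use pattern[next[2]] = pattern[0], but pattern[0] is also an a, so it is bound to mismatch too; in the end the main-string pointer advances by one and the pattern starts from the beginning. So we can refine the next values to skip such blind jumps; the improved algorithm is get_next_array_enhanced(). The resulting values are used much like the second kind, except that when the next value at the mismatch position is -1, we directly advance the main-string pointer by one and restart the pattern from its beginning.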
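Here is one way the refinement can be sketched as a standalone function. The name build_next_enhanced and the caller-supplied table are assumptions of this sketch, not the repository's interface. The idea: value still tracks the plain overlay length, but when the character at the fallback position equals pattern[k], the entry stored for k is that position's own (already refined) next value.

```c
#include <assert.h>
#include <string.h>

/* Refined next table (sketch). next_table must have room for
   strlen(pattern) + 1 entries. Hypothetical helper. */
void build_next_enhanced(const char *pattern, int *next_table){
    assert(pattern != NULL && next_table != NULL);
    int length = (int)strlen(pattern);
    int k = 0, value = -1;
    next_table[0] = -1;
    while(k < length){
        /* pattern[k] cannot extend the overlay: fall back. */
        while(value >= 0 && pattern[k] != pattern[value]){
            value = next_table[value];
        }
        k++;
        value++;
        if(k < length && pattern[k] == pattern[value]){
            /* Jumping to value would compare the same character
               again and fail, so skip one step further. */
            next_table[k] = next_table[value];
        }
        else{
            next_table[k] = value;
        }
    }
}
```

For abaabc this gives -1 0 -1 1 0 2: the entries for pattern[2] and pattern[4] change because the plain table would have sent them to an identical character.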
All of the code:
KMP.h
    #ifndef KMP_H
    #define KMP_H

    #include <string.h>
    #include <stdio.h>

    /*Shared next table. get_next_array() and get_next_array_enhanced()
      also fill next[length], so a pattern may be at most 19 characters.
    */
    static int next[20]={0};

    /*This is the worst one I think: the next value doesn't tell
      you where pattern_index should be put, but the position can
      be worked out from it.
    */
    void get_next_array_origin(const char *pattern){
        //Get pattern length
        int length = (int)strlen(pattern);
        int k = 1, value = 0;
        //By convention next[0] is always -1
        next[0] = -1;
        while(k < length){
            //Start from the next value of the previous character
            value = next[k-1];
            /*value >= 0 means an overlay has been found so far;
              the second condition means pattern[k] cannot extend
              it, so fall back to a shorter overlay.
            */
            while(value >= 0 && pattern[k] != pattern[value+1]){
                value = next[value];
            }
            /*pattern[k] either starts an overlay (value equals -1)
              or extends the current one (value >= 0).
            */
            if(pattern[k] == pattern[value+1]){
                next[k] = value+1;
            }
            //No overlay ends at pattern[k]
            else{
                next[k] = -1;
            }
            k++;
            //printf("next[%d] = %d\n",k-1,next[k-1]);
        }
    }

    /*This is the second way to calculate the next values. It is
      more convenient, because the next value is exactly the next
      value of pattern_index.
    */
    void get_next_array(const char *pattern){
        int length = (int)strlen(pattern);
        int k = 0, value = -1;
        next[k] = -1;
        while(k < length){
            //pattern[k] cannot extend the overlay: fall back
            while(value >= 0 && pattern[k] != pattern[value]){
                value = next[value];
            }
            k++;
            value++;
            next[k] = value;
        }
    }

    /*An improvement on get_next_array(). The former only tells you
      where to put pattern_index, without checking whether the
      character there equals the one that just mismatched; this
      algorithm fixes that.
    */
    void get_next_array_enhanced(const char *pattern){
        int length = (int)strlen(pattern);
        int k = 0, value = -1;
        next[k] = value;
        while(k < length){
            while(value >= 0 && pattern[k] != pattern[value]){
                value = next[value];
            }
            k++;
            value++;
            /*If the character at the fallback position equals
              pattern[k], comparing it after a mismatch at k is bound
              to fail as well. Its own next value has already been
              worked out, so reuse it for pattern[k].
            */
            if(k < length && pattern[k] == pattern[value]){
                next[k] = next[value];
            }
            else{
                next[k] = value;
            }
        }
    }

    void KMP_search_origin(const char *main_str, const char *pattern){
        get_next_array_origin(pattern);
        int main_index = 0, pattern_index = 0;
        int main_length = (int)strlen(main_str);
        int pattern_length = (int)strlen(pattern);
        int flag = -1;
        while(main_index < main_length){
            //printf("main_index:%d\n",main_index);
            if(main_str[main_index] == pattern[pattern_index]){
                //printf("%c = %c\n",main_str[main_index],pattern[pattern_index]);
                main_index++;
                pattern_index++;
                if(pattern_index == pattern_length){
                    printf("find in place %d\n",main_index - pattern_length);
                    flag = 1;
                    /*Continue from the longest overlay of the whole
                      pattern so that overlapping matches are not missed.
                    */
                    pattern_index = next[pattern_length-1]+1;
                }
            }
            else{
                //printf("%c != %c\n",main_str[main_index],pattern[pattern_index]);
                if(pattern_index == 0){
                    main_index++;
                }
                else{
                    /*Calculate the next position to compare from
                      the next value of the last matched character.
                    */
                    pattern_index = next[pattern_index-1]+1;
                }
            }
        }
        if(flag == -1){
            printf("Sorry,we find nothing.\n");
        }
    }

    void KMP_search(const char *main_str, const char *pattern){
        get_next_array(pattern);
        int main_index = 0, pattern_index = 0;
        int main_length = (int)strlen(main_str);
        int pattern_length = (int)strlen(pattern);
        int flag = -1;
        while(main_index < main_length){
            //printf("main_index:%d\n",main_index);
            if(main_str[main_index] == pattern[pattern_index]){
                //printf("%c = %c\n",main_str[main_index],pattern[pattern_index]);
                main_index++;
                pattern_index++;
                if(pattern_index == pattern_length){
                    printf("find in place %d\n",main_index - pattern_length);
                    flag = 1;
                    //Continue from the longest overlay of the whole pattern
                    pattern_index = next[pattern_length];
                }
            }
            else{
                //printf("%c != %c\n",main_str[main_index],pattern[pattern_index]);
                if(pattern_index == 0){
                    main_index++;
                }
                else{
                    /*Easier than before: the next value is just
                      where to put pattern_index next time.
                    */
                    pattern_index = next[pattern_index];
                }
            }
        }
        if(flag == -1){
            printf("Sorry,we find nothing.\n");
        }
    }

    void KMP_search_enhanced(const char *main_str, const char *pattern){
        get_next_array_enhanced(pattern);
        int main_index = 0, pattern_index = 0;
        int main_length = (int)strlen(main_str);
        int pattern_length = (int)strlen(pattern);
        int flag = -1;
        while(main_index < main_length){
            //printf("main_index:%d\n",main_index);
            if(main_str[main_index] == pattern[pattern_index]){
                //printf("%c = %c\n",main_str[main_index],pattern[pattern_index]);
                main_index++;
                pattern_index++;
                if(pattern_index == pattern_length){
                    printf("find in place %d\n",main_index - pattern_length);
                    flag = 1;
                    //Continue from the longest overlay of the whole pattern
                    pattern_index = next[pattern_length];
                }
            }
            else{
                //printf("%c != %c\n",main_str[main_index],pattern[pattern_index]);
                /*A next value of -1 means no pattern position can
                  match the current character of the main string, so
                  match again from the next character of the main string.
                */
                if(next[pattern_index] == -1){
                    main_index++;
                    pattern_index = 0;
                }
                else{
                    pattern_index = next[pattern_index];
                }
            }
        }
        if(flag == -1){
            printf("Sorry,we find nothing.\n");
        }
    }

    #endif
Test cases:
KMP.c
    #include "KMP.h"
    int main(){
        printf("=========origin===========\n");
        KMP_search_origin("abacababa","aba");
        printf("=========origin===========\n");
        printf("=========normal===========\n");
        KMP_search("abacababa","aba");
        printf("=========normal===========\n");
        printf("=========enhanced===========\n");
        KMP_search_enhanced("abacababa","aba");
        printf("=========enhanced===========\n");
        return 0;
    }
next.c
    #include "KMP.h"
    int main(){
        int i = 0;
        char * p = "abaabc";
        int length = strlen(p);
        printf("=========origin===========\n");
        get_next_array_origin(p);
        while(i < length){
            printf("next[%d] = %d\n",i,next[i]);
            i++;
        }
        printf("=========origin===========\n");
        printf("=========normal===========\n");
        get_next_array(p);
        i = 0;
        while(i < length){
            printf("next[%d] = %d\n",i,next[i]);
            i++;
        }
        printf("=========normal===========\n");
        printf("=========enhanced===========\n");
        get_next_array_enhanced(p);
        i = 0;
        while(i < length){
            printf("next[%d] = %d\n",i,next[i]);
            i++;
        }
        printf("=========enhanced===========\n");
        return 0;
    }