AC Automaton Template and Notes on Multi-Pattern Matching

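Before I came across the AC automaton (Aho-Corasick), I only knew single-pattern matching algorithms such as KMP and BMH. With suitable optimizations both can be made to run in linear time in the worst case (plain BMH degrades to O(mn)), and on typical inputs BMH is often sublinear. The AC automaton, which I met yesterday, combines a trie with the idea behind KMP, so that multi-pattern matching takes O(Σmi + n) time, where n is the length of the text and mi is the length of the i-th pattern. Reporting every single occurrence adds a term for the number of matches.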

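Matching works much like KMP: the position in the text never moves backward. What has already been matched is reused through extra links added to the trie, and the trie together with those links forms the AC automaton.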

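The most important step in building the automaton is constructing the fail pointers. I learned this from the article "AC自動機算法詳解" (A Detailed Explanation of the AC Automaton Algorithm). A fail pointer plays the same role as the next array in KMP: from a node it leads to the node for the longest proper suffix of the current string that is also a prefix of some pattern. The AC automaton is therefore essentially a trie augmented with these pointers.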
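To make the analogy with KMP concrete, here is a minimal sketch of the next array and the search loop for a single pattern. The function names buildNext and kmpSearch are my own for illustration and are not part of the template further down. The fall-back step k = next[k - 1] is the single-pattern counterpart of following a fail pointer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// next[i] = length of the longest proper prefix of p[0..i] that is also its suffix.
std::vector<int> buildNext(const std::string &p) {
	std::vector<int> next(p.size(), 0);
	for (size_t i = 1, k = 0; i < p.size(); ++i) {
		while (k > 0 && p[i] != p[k])
			k = next[k - 1];	// fall back, like following a fail pointer
		if (p[i] == p[k])
			++k;
		next[i] = static_cast<int>(k);
	}
	return next;
}

// Returns the start index of every occurrence of p in s (p must be non-empty).
std::vector<int> kmpSearch(const std::string &s, const std::string &p) {
	assert(!p.empty());
	std::vector<int> next = buildNext(p), hits;
	for (size_t i = 0, k = 0; i < s.size(); ++i) {
		while (k > 0 && s[i] != p[k])
			k = next[k - 1];	// the text index i never moves backward
		if (s[i] == p[k])
			++k;
		if (k == p.size()) {	// full match ending at position i
			hits.push_back(static_cast<int>(i + 1 - k));
			k = next[k - 1];	// keep going to find overlapping matches
		}
	}
	return hits;
}
```

For example, kmpSearch("abababa", "aba") reports the start positions 0, 2 and 4. In the AC automaton the same fall-back happens on trie nodes instead of on indices into one pattern.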

Applications:

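-- In many programming contests, multi-pattern matching with the AC automaton is combined with tree-style dynamic programming: the DP runs over the automaton, and its state describes how the string spelled out by the path walked so far relates to the given patterns (minimum edits to a string, avoiding virus strings, and so on);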

-- Exact multi-pattern matching;
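As an illustration of the first application, the following is a rough sketch of the "avoid the virus strings" kind of DP. It counts the strings of a given length that contain none of the forbidden patterns. The function name countSafeStrings, the array-based automaton, and the alpha parameter (use only the first alpha lowercase letters) are all my own assumptions for this example, not part of the template below. The count is kept in a plain long long with no modulus, so it only suits small inputs.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Counts strings of length len over the first `alpha` lowercase letters
// that contain none of the forbidden patterns as a substring.
long long countSafeStrings(const std::vector<std::string> &forbidden, int alpha, int len) {
	assert(alpha >= 1 && alpha <= 26 && len >= 0);
	// Trie in arrays: go[v][c] is the child (later: the transition), bad[v] marks forbidden states.
	std::vector<std::vector<int>> go(1, std::vector<int>(alpha, -1));
	std::vector<int> fail(1, 0);
	std::vector<char> bad(1, 0);
	for (const std::string &p : forbidden) {
		int v = 0;
		for (char ch : p) {
			int c = ch - 'a';
			assert(c >= 0 && c < alpha);
			if (go[v][c] < 0) {
				go[v][c] = static_cast<int>(go.size());
				go.push_back(std::vector<int>(alpha, -1));
				fail.push_back(0);
				bad.push_back(0);
			}
			v = go[v][c];
		}
		bad[v] = 1;
	}
	// BFS: set fail links and fill every missing transition.
	std::queue<int> q;
	for (int c = 0; c < alpha; ++c) {
		if (go[0][c] < 0) go[0][c] = 0;
		else q.push(go[0][c]);
	}
	while (!q.empty()) {
		int v = q.front(); q.pop();
		if (bad[fail[v]]) bad[v] = 1;	// a forbidden suffix makes the state forbidden
		for (int c = 0; c < alpha; ++c) {
			int u = go[v][c];
			if (u < 0) go[v][c] = go[fail[v]][c];
			else { fail[u] = go[fail[v]][c]; q.push(u); }
		}
	}
	// dp[v] = number of safe strings of the current length that end in state v.
	std::vector<long long> dp(go.size(), 0), nxt;
	if (!bad[0]) dp[0] = 1;
	for (int step = 0; step < len; ++step) {
		nxt.assign(go.size(), 0);
		for (size_t v = 0; v < go.size(); ++v) {
			if (dp[v] == 0) continue;
			for (int c = 0; c < alpha; ++c)
				if (!bad[go[v][c]]) nxt[go[v][c]] += dp[v];
		}
		dp.swap(nxt);
	}
	long long total = 0;
	for (long long x : dp) total += x;
	return total;
}
```

For instance, with the alphabet {a, b} and the single forbidden pattern "ab", the safe strings of length 3 are bbb, bba, baa and aaa, so countSafeStrings({"ab"}, 2, 3) returns 4.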

Optimization:

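-- This template does not optimize the automaton. For example, if neither ptr->next[i] nor ptr->fail->next[i] exists, then for letter i ptr->fail could point straight to ptr->fail->fail. The reason is simple: when ptr->next[i] mismatches we set ptr = ptr->fail, which is certain to mismatch again and forces another step along fail. The price of this optimization is extra storage: fail has to become vector<trieTreeNode*> fail, with one fail pointer per letter.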
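A closely related and commonly used variant of this idea stores, for every state and every letter, the state reached after all the necessary fail steps, so matching needs a single table lookup per character. The sketch below uses that variant rather than the per-letter fail pointers described above. The name occursInText and the array layout are assumptions of mine, and it assumes the patterns are distinct, non-empty, and made of 'a'..'z'.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

const int kAlpha = 26;	// lowercase letters only

// found[i] is true when patterns[i] occurs somewhere in text.
std::vector<bool> occursInText(const std::vector<std::string> &patterns, const std::string &text) {
	std::vector<std::vector<int>> go(1, std::vector<int>(kAlpha, -1));
	std::vector<int> fail(1, 0), id(1, -1);	// id[v]: index of the pattern ending at v, or -1
	for (size_t i = 0; i < patterns.size(); ++i) {
		assert(!patterns[i].empty());
		int v = 0;
		for (char ch : patterns[i]) {
			int c = ch - 'a';
			assert(c >= 0 && c < kAlpha);
			if (go[v][c] < 0) {
				go[v][c] = static_cast<int>(go.size());
				go.push_back(std::vector<int>(kAlpha, -1));
				fail.push_back(0);
				id.push_back(-1);
			}
			v = go[v][c];
		}
		id[v] = static_cast<int>(i);
	}
	// BFS: a missing edge go[v][c] is replaced by the target of the fail chain.
	std::queue<int> q;
	for (int c = 0; c < kAlpha; ++c) {
		if (go[0][c] < 0) go[0][c] = 0;
		else q.push(go[0][c]);
	}
	while (!q.empty()) {
		int v = q.front(); q.pop();
		for (int c = 0; c < kAlpha; ++c) {
			int u = go[v][c];
			if (u < 0) go[v][c] = go[fail[v]][c];
			else { fail[u] = go[fail[v]][c]; q.push(u); }
		}
	}
	std::vector<bool> found(patterns.size(), false);
	std::vector<char> seen(go.size(), 0);	// states whose fail chain was already scanned
	int state = 0;
	for (char ch : text) {
		int c = ch - 'a';
		if (c < 0 || c >= kAlpha) { state = 0; continue; }	// foreign character: restart
		state = go[state][c];	// one lookup, no loop over fail pointers
		for (int v = state; v != 0 && !seen[v]; v = fail[v]) {
			seen[v] = 1;
			if (id[v] >= 0) found[id[v]] = true;
		}
	}
	return found;
}
```

With the patterns "abcd", "bc", "d", "xyz" and the text "abcd", the first three are reported as found and the last is not. The table costs one integer per state and letter, which is the same space-for-time trade mentioned above.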

Template:

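Parameters: patterns is the set of pattern strings, s is the text to search, and answer receives the patterns that were matched; the return value is the number of patterns matched.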

////////////////////////////////////////////////////////////////////////////////
/*
//readme//
interfaces: 
multiPatternsMatchingByAcAutomation:
1 const vector<string> &patterns: the pattern strings (lowercase 'a'..'z' only);
2 const string &s: the text to search;
3 vector<string> &answer: receives the patterns that occur in the text;
4 returns the number of patterns that were matched.
*/
//made by HaoyuHu
//Tsinghua University
////////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <vector>
#include <queue>
#include <string>
#include <unordered_set>
#define ALPH_NUM 26
using namespace std;

struct trieTreeNode {
	vector<trieTreeNode*> next;
	bool mark;
	trieTreeNode *fail;
	trieTreeNode(): next(ALPH_NUM, nullptr), mark(false), fail(nullptr) {}
};

trieTreeNode *createAcAutomation(const vector<string> &patterns);
int findPatterns(vector<string> &answer, trieTreeNode *root, const string &s);
void makeFoundPatterns(vector<string> &answer, unordered_set<trieTreeNode*> &save, trieTreeNode *root, string pattern);
inline char turn_char(int index);
int multiPatternsMatchingByAcAutomation(const vector<string> &patterns, const string &s, vector<string> &answer);

trieTreeNode *createAcAutomation(const vector<string> &patterns) {
	trieTreeNode *root = new trieTreeNode(), *ptr, *cur;
	for (int i = 0; i != patterns.size(); ++i) {
		cur = root;
		for (int k = 0; k != patterns[i].size(); ++k) {
			int index = patterns[i][k] - 'a';
			if (!cur->next[index])
				cur->next[index] = new trieTreeNode();
			cur = cur->next[index];
		}
		cur->mark = true;
	}
	queue<trieTreeNode*> makeFail;
	makeFail.push(root);
	while (!makeFail.empty()) {
		cur = makeFail.front(); makeFail.pop();
		for (int i = 0; i != ALPH_NUM; ++i) {
			if (cur->next[i]) {
				for (ptr = cur->fail; ptr && !ptr->next[i]; ptr = ptr->fail);
				cur->next[i]->fail = ptr ? ptr->next[i] : root;
				makeFail.push(cur->next[i]);
			}
		}
	}
	return root;
}

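int findPatterns(vector<string> &answer, trieTreeNode *root, const string &s) {
	string pattern;
	unordered_set<trieTreeNode*> save;	// marked nodes whose pattern occurs in s
	unordered_set<trieTreeNode*> visited;	// nodes whose fail chain was already scanned
	trieTreeNode *cur = root;
	for (size_t i = 0; i != s.size(); ++i) {
		int index = s[i] - 'a';
		if (index < 0 || index >= ALPH_NUM) {	// character outside 'a'..'z': restart at root
			cur = root;
			continue;
		}
		// follow fail pointers until some node has a child for s[i]
		while (cur != root && !cur->next[index])
			cur = cur->fail;
		if (cur->next[index])
			cur = cur->next[index];
		// every marked node on the fail chain is a pattern ending at position i,
		// e.g. "bc" inside "abcd"; a chain that was scanned once is not scanned again
		for (trieTreeNode *ptr = cur; ptr != root && visited.insert(ptr).second; ptr = ptr->fail)
			if (ptr->mark)
				save.insert(ptr);
	}
	makeFoundPatterns(answer, save, root, pattern);
	return static_cast<int>(save.size());	// number of distinct patterns found
}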

void makeFoundPatterns(vector<string> &answer, unordered_set<trieTreeNode*> &save,
	trieTreeNode *root, string pattern) {
	unordered_set<trieTreeNode*>::iterator it = save.find(root);
	if (it != save.end())
		answer.push_back(pattern);
	for (int i = 0; i != ALPH_NUM; ++i) {
		if (root->next[i]) {
			string t(pattern);
			t.push_back(turn_char(i));
			makeFoundPatterns(answer, save, root->next[i], t);
		}
	}
}

inline char turn_char(int index) {
	return 'a' + index;
}

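// release every node of the trie (children first, then the node itself)
static void destroyAcAutomation(trieTreeNode *node) {
	if (!node) return;
	for (int i = 0; i != ALPH_NUM; ++i)
		destroyAcAutomation(node->next[i]);
	delete node;
}

int multiPatternsMatchingByAcAutomation(const vector<string> &patterns, const string &s, vector<string> &answer) {
	trieTreeNode *root = createAcAutomation(patterns);
	int matched = findPatterns(answer, root, s);
	destroyAcAutomation(root);	// free the trie once matching is done
	return matched;
}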

Enjoy it!

If you spot any mistakes, please point them out. Thanks!
