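Multi-Pattern Matching in Python: the AC Automaton
Goal: learn the AC automaton and multi-pattern matching.
Requirement: implement it in pure Python as far as possible, so the code is easier to extend.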
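I. What is an AC automaton?
The AC automaton (Aho-Corasick automaton) is a well-known multi-pattern matching algorithm that originated at Bell Labs in 1975. To learn the AC automaton you first need to know what a Trie is, that is, a dictionary tree. A Trie, also called a word search tree or key tree, is a tree structure and a variant of the hash tree. Its typical use is counting and sorting large numbers of strings (though not only strings), so search engine systems often use it for word-frequency statistics over text.
(Quoted from Baidu Baike)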
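To make the Trie description concrete, here is a minimal sketch of a Trie used for word-frequency counting and sorted listing. The class and method names (TrieNode, Trie, frequency, sorted_words) are my own choices for illustration and do not come from any library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child TrieNode
        self.count = 0      # how many times a word ending here was inserted


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def frequency(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

    def sorted_words(self):
        # Depth-first walk that visits children in sorted key order,
        # so words come out in lexicographic order.
        out = []

        def walk(node, prefix):
            if node.count:
                out.append((prefix, node.count))
            for ch in sorted(node.children):
                walk(node.children[ch], prefix + ch)

        walk(self.root, "")
        return out


trie = Trie()
for w in ["she", "he", "hers", "he"]:
    trie.insert(w)
print(trie.frequency("he"))  # 2
print(trie.sorted_words())   # [('he', 2), ('hers', 1), ('she', 1)]
```

Words that share a prefix share nodes, which is why counting and sorting both fall out of a single walk over the tree.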
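II. What is the AC automaton used for?
A common example: given n words and an article of m characters, find how many of the words appear in the article. To understand the AC automaton you need some background in the pattern tree (dictionary tree, Trie) and the KMP pattern-matching algorithm. The AC automaton algorithm has three steps: building a Trie, building the failure pointers, and running the pattern-matching pass.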
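The three steps can be sketched compactly with dictionaries. This is a minimal illustration under my own naming (build_automaton, count_words_present, and the goto/fail/out lists are assumptions for this example), not the code of any particular package.

```python
from collections import deque


def build_automaton(words):
    # Step 1: build the Trie. State 0 is the root. goto[s] maps a character
    # to the next state; out[s] lists the words recognised at state s.
    goto = [{}]
    out = [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Step 2: breadth-first pass that sets the failure pointers.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def count_words_present(words, text):
    # Step 3: scan the text once, following failure pointers on a mismatch.
    goto, fail, out = build_automaton(words)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return len(found), found


n, found = count_words_present(["say", "she", "shr", "he", "her"], "yasherhs")
print(n, sorted(found))  # 3 ['he', 'her', 'she']
```

The text is read exactly once; the failure pointers are what let the scan continue without backing up in the text.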
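If you know KMP, you know what its next function (also called the shift or fail function) does. In KMP we use two pointers i and j to express that A[i-j+1..i] equals B[1..j] exactly. That is, i keeps increasing and j changes with it, so that the string of length j ending at A[i] matches the first j characters of B. When A[i+1] ≠ B[j+1], KMP adjusts j (decreases it) so that A[i-j+1..i] still matches B[1..j] and the new B[j+1] matches A[i+1]; the next function records exactly where j should move to. The failure pointer of the AC automaton plays the same role: while the text is being matched against the Trie, if the current node cannot continue the match with the next character, matching continues from the node that the current node's failure pointer points to.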
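As an illustration of the next function described above, here is a small KMP sketch. It uses 0-based indexing rather than the 1-based A[..]/B[..] notation of the paragraph, and the function names build_next and kmp_search are my own.

```python
def build_next(pattern):
    # nxt[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    nxt = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = nxt[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def kmp_search(text, pattern):
    # Returns the start index of every occurrence of pattern in text.
    if not pattern:
        return []
    nxt = build_next(pattern)
    hits, j = [], 0
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = nxt[j - 1]  # shrink j; the last j characters read still match pattern[:j]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = nxt[j - 1]
    return hits


print(build_next("ababaca"))               # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("abababaca", "ababaca"))  # [2]
```

The failure pointer of a Trie node is the same idea generalised from one pattern to many: it points to the node spelling the longest proper suffix of the current path that is also a prefix of some pattern.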
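III. Installing an AC automaton package for Python
Anyone who has installed this package has probably run into a variety of pitfalls.
1. Installing with pip
Project page: https://pypi.org/project/pyahocorasick/. Source downloads: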
- GitHub: https://github.com/WojciechMula/pyahocorasick/
- Pypi: https://pypi.python.org/pypi/pyahocorasick/
- Conda-Forge: https://github.com/conda-forge/pyahocorasick-feedstock/
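Install with pip install pyahocorasick (Python 3). Anyone who has tried will find that the package needs a C compiler; without one on your machine, the installation fails. pip install ahocorasick (Python 2) cannot be installed either. The full error output: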
pip install pyahocorasick
Collecting pyahocorasick
Using cached https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz
Building wheels for collected packages: pyahocorasick
Running setup.py bdist_wheel for pyahocorasick ... error
Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-wheel-rbzdosp6 --python-tag cp37:
running bdist_wheel
running build
running build_ext
building 'ahocorasick' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.7
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -DAHOCORASICK_UNICODE= -I/anaconda3/include/python3.7m -c pyahocorasick.c -o build/temp.macosx-10.7-x86_64-3.7/pyahocorasick.o
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
error: command 'gcc' failed with exit status 1
----------------------------------------
Failed building wheel for pyahocorasick
Running setup.py clean for pyahocorasick
Failed to build pyahocorasick
Installing collected packages: pyahocorasick
Running setup.py install for pyahocorasick ... error
Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-record-5oyl9c1l/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_ext
building 'ahocorasick' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.7
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -DAHOCORASICK_UNICODE= -I/anaconda3/include/python3.7m -c pyahocorasick.c -o build/temp.macosx-10.7-x86_64-3.7/pyahocorasick.o
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-record-5oyl9c1l/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/
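If you download the source from GitHub directly, calling ahocorasick.Automaton() raises an error. So what can you do?
I tried installing ahocorasick-python instead. Project page: https://pypi.org/project/ahocorasick-python/. GitHub source: source.
It turned out to work on Mac/Linux and not on Win10, which left me speechless, because the demo environment is Win10.
2. Solutions
I looked online for solutions and found three main options:
(1) Install a C compiler properly;
(2) Use esmre in place of ahocorasick to do AC-automaton multi-pattern matching in Python;
(3) Rewrite ahocorasick yourself: a Python implementation of ahocorasick for fast keyword matching.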
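Whichever option you choose, calling code can be kept independent of it. The sketch below is only an illustration: find_keywords is a hypothetical helper of mine. It uses pyahocorasick (Automaton, add_word, make_automaton, iter) when a module named ahocorasick with an Automaton class can be imported, and otherwise falls back to plain substring checks, which are slower but need no compiler.

```python
import importlib


def find_keywords(words, text):
    """Return the set of words that occur in text."""
    words = [w for w in words if w]  # ignore empty keywords
    if not words:
        return set()

    try:
        backend = importlib.import_module("ahocorasick")
    except ImportError:
        backend = None

    # Other packages also install a module called "ahocorasick",
    # so check for the class before using it.
    if backend is not None and hasattr(backend, "Automaton"):
        automaton = backend.Automaton()
        for word in words:
            automaton.add_word(word, word)
        automaton.make_automaton()
        # iter() yields (end_index, value) pairs; only the value is needed here.
        return {value for _end, value in automaton.iter(text)}

    # Fallback: one substring scan per keyword (pure Python, no C extension).
    return {word for word in words if word in text}


print(sorted(find_keywords(["he", "she", "hers", "his"], "ushers")))
# ['he', 'hers', 'she']
```

Both branches return the same set, so a pure-Python automaton such as the ones in the next section could later replace the fallback without changing callers.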
IV. Python code for ahocorasick
1. Python 2 code:
# python2
# coding=utf-8
KIND = 16  # each node has 16 children, one per 4-bit nibble


class Node():
    static = 0  # running count of nodes created

    def __init__(self):
        self.fail = None
        self.next = [None] * KIND
        self.end = False
        self.word = None
        Node.static += 1


class AcAutomation():
    def __init__(self):
        self.root = Node()
        self.queue = []

    def getIndex(self, char):
        return ord(char)  # - BASE

    def insert(self, string):
        # Walk down the Trie, creating nodes as needed.
        p = self.root
        for char in string:
            index = self.getIndex(char)
            if p.next[index] is None:
                p.next[index] = Node()
            p = p.next[index]
        p.end = True
        p.word = string

    def build_automation(self):
        # Breadth-first pass that sets the failure pointer of every node.
        self.root.fail = None
        self.queue.append(self.root)
        while len(self.queue) != 0:
            parent = self.queue[0]
            self.queue.pop(0)
            for i, child in enumerate(parent.next):
                if child is None:
                    continue
                if parent is self.root:
                    child.fail = self.root
                else:
                    failp = parent.fail
                    while failp is not None:
                        if failp.next[i] is not None:
                            child.fail = failp.next[i]
                            break
                        failp = failp.fail
                    if failp is None:
                        child.fail = self.root
                self.queue.append(child)

    def matchOne(self, string):
        # Return (True, word) for the first keyword found, else (False, None).
        p = self.root
        for char in string:
            index = self.getIndex(char)
            while p.next[index] is None and p is not self.root:
                p = p.fail
            if p.next[index] is None:
                p = self.root
            else:
                p = p.next[index]
            if p.end:
                return True, p.word
        return False, None


class UnicodeAcAutomation():
    def __init__(self, encoding='utf-8'):
        self.ac = AcAutomation()
        self.encoding = encoding

    def getAcString(self, string):
        # Encode to bytes, then split every byte into low and high nibble.
        string = bytearray(string.encode(self.encoding))
        ac_string = ''
        for byte in string:
            ac_string += chr(byte % 16)
            ac_string += chr(byte / 16)
        # print ac_string
        return ac_string

    def insert(self, string):
        if type(string) != unicode:
            raise Exception('UnicodeAcAutomation:: insert type not unicode')
        ac_string = self.getAcString(string)
        self.ac.insert(ac_string)

    def build_automation(self):
        self.ac.build_automation()

    def matchOne(self, string):
        if type(string) != unicode:
            raise Exception('UnicodeAcAutomation:: matchOne type not unicode')
        ac_string = self.getAcString(string)
        retcode, ret = self.ac.matchOne(ac_string)
        if ret is not None:
            # Rebuild the bytes from nibble pairs, then decode.
            s = ''
            for i in range(len(ret) / 2):
                tmp = chr(ord(ret[2 * i]) + ord(ret[2 * i + 1]) * 16)
                s += tmp
            ret = s.decode(self.encoding)
        return retcode, ret


def main():
    ac = UnicodeAcAutomation()
    ac.insert(u'丁亞光')
    ac.insert(u'好喫的')
    ac.insert(u'好玩的')
    ac.build_automation()
    print(ac.matchOne(u'hi,丁亞光在幹啥'))
    print(ac.matchOne(u'ab'))
    print(ac.matchOne(u'不能喫飯啊'))
    print(ac.matchOne(u'飯很好喫,有很多好好的喫的,'))
    print(ac.matchOne(u'有很多好玩的'))


if __name__ == '__main__':
    main()
Output:
(True, u'\u4e01\u4e9e\u5149')
(False, None)
(False, None)
(False, None)
(True, u'\u597d\u73a9\u7684')
Many readers are probably used to Python 3, so here is my modified version of the code (mainly changes to the encoding handling).
2. Python 3 code:
# python3
# coding=utf-8
KIND = 16  # each node has 16 children, one per 4-bit nibble


class Node():
    static = 0  # running count of nodes created

    def __init__(self):
        self.fail = None
        self.next = [None] * KIND
        self.end = False
        self.word = None
        Node.static += 1


class AcAutomation():
    def __init__(self):
        self.root = Node()
        self.queue = []

    def getIndex(self, char):
        return ord(char)  # - BASE

    def insert(self, string):
        # Walk down the Trie, creating nodes as needed.
        p = self.root
        for char in string:
            index = self.getIndex(char)
            if p.next[index] is None:
                p.next[index] = Node()
            p = p.next[index]
        p.end = True
        p.word = string

    def build_automation(self):
        # Breadth-first pass that sets the failure pointer of every node.
        self.root.fail = None
        self.queue.append(self.root)
        while len(self.queue) != 0:
            parent = self.queue[0]
            self.queue.pop(0)
            for i, child in enumerate(parent.next):
                if child is None:
                    continue
                if parent is self.root:
                    child.fail = self.root
                else:
                    failp = parent.fail
                    while failp is not None:
                        if failp.next[i] is not None:
                            child.fail = failp.next[i]
                            break
                        failp = failp.fail
                    if failp is None:
                        child.fail = self.root
                self.queue.append(child)

    def matchOne(self, string):
        # Return (True, word) for the first keyword found, else (False, None).
        p = self.root
        for char in string:
            index = self.getIndex(char)
            while p.next[index] is None and p is not self.root:
                p = p.fail
            if p.next[index] is None:
                p = self.root
            else:
                p = p.next[index]
            if p.end:
                return True, p.word
        return False, None


class UnicodeAcAutomation():
    def __init__(self, encoding='utf-8'):
        self.ac = AcAutomation()
        self.encoding = encoding

    def getAcString(self, string):
        # Encode to bytes, then split every byte into low and high nibble.
        string = bytearray(string.encode(self.encoding))
        ac_string = ''
        for byte in string:
            ac_string += chr(byte % 16)
            ac_string += chr(byte // 16)
        return ac_string

    def insert(self, string):
        if type(string) != str:
            raise Exception('StrAcAutomation:: insert type not str')
        ac_string = self.getAcString(string)
        self.ac.insert(ac_string)

    def build_automation(self):
        self.ac.build_automation()

    def matchOne(self, string):
        if type(string) != str:
            raise Exception('StrAcAutomation:: matchOne type not str')
        ac_string = self.getAcString(string)
        retcode, ret = self.ac.matchOne(ac_string)
        if ret is not None:
            # Rebuild the bytes from nibble pairs, then decode.
            s = ''
            for i in range(len(ret) // 2):
                s += chr(ord(ret[2 * i]) + ord(ret[2 * i + 1]) * 16)
            ret = s.encode("latin1").decode(self.encoding)
        return retcode, ret


def main():
    ac = UnicodeAcAutomation()
    ac.insert('丁亞光')
    ac.insert('好喫的')
    ac.insert('好玩的')
    ac.build_automation()
    print(ac.matchOne('hi,丁亞光在幹啥'))
    print(ac.matchOne('ab'))
    print(ac.matchOne('不能喫飯啊'))
    print(ac.matchOne('飯很好喫,有很多好好的喫的,'))
    print(ac.matchOne('有很多好玩的'))


if __name__ == '__main__':
    main()
Output:
(True, '丁亞光')
(False, None)
(False, None)
(False, None)
(True, '好玩的')
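Both versions above work because getAcString turns every byte into two values below 16 (low nibble first, then high nibble), so a node needs only KIND = 16 children whatever the character set. The round trip can be sketched in isolation; to_nibbles and from_nibbles are names I made up for this illustration.

```python
def to_nibbles(text, encoding="utf-8"):
    # Each byte b becomes two values in range(16): b % 16, then b // 16.
    nibbles = []
    for b in text.encode(encoding):
        nibbles.append(b % 16)
        nibbles.append(b // 16)
    return nibbles


def from_nibbles(nibbles, encoding="utf-8"):
    # Pair the values up again as (low, high) and rebuild the bytes.
    pairs = zip(nibbles[0::2], nibbles[1::2])
    return bytes(lo + hi * 16 for lo, hi in pairs).decode(encoding)


codes = to_nibbles("好")
print(codes)                # [5, 14, 5, 10, 13, 11]
print(from_nibbles(codes))  # 好
```

The UTF-8 bytes of 好 are E5 A5 BD, which gives the six values shown. One caveat that seems worth testing on your own data: the automaton sees a flat stream of nibbles, so nothing in this scheme prevents a match from starting on a high nibble, that is, in the middle of a byte.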
Summary: there are many other ways to rewrite ahocorasick yourself, for example by adapting the source of ahocorasick-python. The core source of ahocorasick-python is as follows.
# coding:utf-8
# write by zhou
# revised by zw


class Node(object):
    """
    Abstraction of a node
    """
    def __init__(self, str='', is_root=False):
        self._next_p = {}
        self.fail = None
        self.is_root = is_root
        self.str = str
        self.parent = None

    def __iter__(self):
        return iter(self._next_p.keys())

    def __getitem__(self, item):
        return self._next_p[item]

    def __setitem__(self, key, value):
        _u = self._next_p.setdefault(key, value)
        _u.parent = self

    def __repr__(self):
        return "<Node object '%s' at %s>" % \
               (self.str, object.__repr__(self)[1:-1].split('at')[-1])

    def __str__(self):
        return self.__repr__()


class AhoCorasick(object):
    """
    AC automaton object
    """
    def __init__(self, *words):
        self.words_set = set(words)
        self.words = list(self.words_set)
        self.words.sort(key=lambda x: len(x))
        self._root = Node(is_root=True)
        self._node_meta = {}
        self._node_all = [(0, self._root)]
        # _a maps each character to the set of words that contain it
        _a = {}
        for word in self.words:
            for w in word:
                _a.setdefault(w, set())
                _a[w].add(word)

        def node_append(keyword):
            assert len(keyword) > 0
            _ = self._root
            for _i, k in enumerate(keyword):
                node = Node(k)
                if k in _:
                    pass
                else:
                    _[k] = node
                    self._node_all.append((_i+1, _[k]))
                self._node_meta.setdefault(id(_[k]), set())
                if _i >= 1:
                    # record shorter words that end at this prefix
                    for _j in _a[k]:
                        if keyword[:_i+1].endswith(_j):
                            self._node_meta[id(_[k])].add((_j, len(_j)))
                _ = _[k]
            else:
                if _ != self._root:
                    self._node_meta[id(_)].add((keyword, len(keyword)))

        for word in self.words:
            node_append(word)
        self._node_all.sort(key=lambda x: x[0])
        self._make()

    def _make(self):
        """
        Build the AC tree
        :return:
        """
        for _level, node in self._node_all:
            if node == self._root or _level <= 1:
                node.fail = self._root
            else:
                _node = node.parent.fail
                while True:
                    if node.str in _node:
                        node.fail = _node[node.str]
                        break
                    else:
                        if _node == self._root:
                            node.fail = self._root
                            break
                        else:
                            _node = _node.fail

    def search(self, content, with_index=False):
        result = set()
        node = self._root
        index = 0
        for i in content:
            while 1:
                if i not in node:
                    if node == self._root:
                        break
                    else:
                        node = node.fail
                else:
                    for keyword, keyword_len in self._node_meta.get(id(node[i]), set()):
                        if not with_index:
                            result.add(keyword)
                        else:
                            result.add((keyword, (index - keyword_len + 1, index + 1)))
                    node = node[i]
                    break
            index += 1
        return result


if __name__ == '__main__':
    ac = AhoCorasick("abc", 'abe', 'acdabd', 'bdf', 'df', 'f', 'ac', 'cd', 'cda')
    print(ac.search('acdabdf', True))
Output:
{('cd', (1, 3)), ('acdabd', (0, 6)), ('df', (5, 7)), ('f', (6, 7)), ('bdf', (4, 7)), ('cda', (1, 4)), ('ac', (0, 2))}
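The (keyword, (start, end)) pairs returned with with_index=True use an exclusive end index, so they can be used directly as slices. As one possible use, the sketch below masks every hit in the text; mask_keywords is a hypothetical helper of mine, and the hits are copied from part of the output above rather than recomputed.

```python
def mask_keywords(text, hits, mask="*"):
    # hits: iterable of (keyword, (start, end)) pairs, end exclusive.
    chars = list(text)
    for _keyword, (start, end) in sorted(hits, key=lambda h: h[1]):
        for pos in range(start, end):
            chars[pos] = mask
    return "".join(chars)


hits = {('cd', (1, 3)), ('df', (5, 7)), ('f', (6, 7))}
print(mask_keywords('acdabdf', hits))  # a**ab**
```

Overlapping hits are harmless here because masking a position twice gives the same result.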
References: