Multi-Pattern Matching in Python: The AC Automaton


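Goal: learn the AC automaton for multi-pattern matching.

Requirement: implement it in pure Python as far as possible, so that the code is easy to extend.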

 

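I. What is the AC automaton?

The AC automaton (Aho-Corasick automaton) is a well-known multi-pattern matching algorithm that came out of Bell Labs in 1975. To learn it, you first need to know what a trie is, also called a dictionary tree. A trie, also known as a word-lookup tree or key tree, is a tree structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though it is not limited to strings), which is why search engines often use it for word-frequency statistics over text.

(Excerpted from Baidu Baike)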
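To make the counting and sorting use case concrete, here is a minimal trie sketch. The names TrieNode, Trie, frequency and sorted_words are my own choices for illustration and do not come from any library; treat it as one possible layout rather than the definitive one.

```python
class TrieNode:
    """One trie node: child links keyed by character plus a word counter."""

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.count = 0       # how many inserted words end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk (and create) one node per character, then count the word."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.count += 1

    def frequency(self, word):
        """Return how many times word was inserted (0 if never)."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

    def sorted_words(self):
        """Return (word, count) pairs in lexicographic order."""
        result = []

        def walk(node, prefix):
            if node.count:
                result.append((prefix, node.count))
            for ch in sorted(node.children):
                walk(node.children[ch], prefix + ch)

        walk(self.root, "")
        return result


trie = Trie()
for w in ["tea", "ten", "to", "tea", "inn"]:
    trie.insert(w)

print(trie.frequency("tea"))   # 2
print(trie.sorted_words())     # [('inn', 1), ('tea', 2), ('ten', 1), ('to', 1)]
```

Words that share a prefix share nodes, so "tea" and "ten" only differ in their last node; that sharing is what the AC automaton builds on.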

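II. What is the AC automaton used for?

A common example: given n words and an article of m characters, find how many of the words appear in the article. To understand the AC automaton you first need the basics of the trie (dictionary tree) and of the KMP pattern-matching algorithm. The algorithm has three steps: build a trie, build the failure pointers, and run the matching pass.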
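The three steps can be sketched in a few dozen lines of pure Python. This is a minimal illustration under my own assumptions: states are integers, transitions are stored in dicts, and the names build_automaton, find_words, goto, fail and out are invented for this sketch.

```python
from collections import deque


def build_automaton(words):
    """Steps 1 and 2: build the trie, then add failure pointers by BFS."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the root
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    out = [set()]    # out[state] -> words recognised on reaching this state

    for word in words:                      # step 1: trie
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    queue = deque(goto[0].values())         # depth-1 states fail to the root
    while queue:                            # step 2: failure pointers
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit words that are suffixes
            queue.append(child)
    return goto, fail, out


def find_words(text, automaton):
    """Step 3: scan the text once and collect every word that occurs."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(["he", "she", "his", "hers"])
found = find_words("ushers", automaton)
print(sorted(found))   # ['he', 'hers', 'she']
print(len(found))      # 3 of the 4 words appear in the text
```

The text is read once from left to right; the failure pointers are what let the scan avoid restarting from the root after each mismatch.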

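If you know the KMP algorithm, you know what its next function (also called the shift or fail function) is for. KMP keeps two indices, i and j, such that A[i-j+1..i] is exactly equal to B[1..j]. In other words, i only ever increases, and j changes along with it so that the string of length j ending at A[i] matches the first j characters of B. When A[i+1] ≠ B[j+1], KMP adjusts j (makes it smaller) so that A[i-j+1..i] still matches B[1..j] and the new B[j+1] matches A[i+1]; the next function records exactly where j should move to. The failure pointer of the AC automaton plays the same role: when the text being scanned cannot continue matching from the current trie node, matching resumes from the node that the current node's failure pointer points to.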
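For readers who want to see the next function at work, here is a small KMP sketch. It uses 0-based indices rather than the 1-based A[..] and B[..] notation of the paragraph above, and the names build_next and kmp_search are my own.

```python
def build_next(pattern):
    """next_[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    next_ = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = next_[j - 1]          # fall back to a shorter border
        if pattern[i] == pattern[j]:
            j += 1
        next_[i] = j
    return next_


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    next_ = build_next(pattern)
    hits = []
    j = 0                              # number of pattern characters matched
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = next_[j - 1]          # shrink j; i never moves backwards
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            hits.append(i - j + 1)
            j = next_[j - 1]          # keep going to find overlapping hits
    return hits


print(build_next("ababc"))              # [0, 0, 1, 2, 0]
print(kmp_search("abababc", "ababc"))   # [2]
```

The failure pointer in an AC automaton does for a whole trie of patterns what next_ does here for a single pattern.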

 

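III. Installing an AC automaton package for Python

Anyone who has installed this package has probably run into a few pitfalls.

1. Installing with pip

Project page: https://pypi.org/project/pyahocorasick/ (the source archive can be downloaded from there).

Install with pip install pyahocorasick (Python 3). Anyone who has tried it will know that pip builds the package's C extension from source here, so the install fails on a machine without a working C compiler. pip install ahocorasick (Python 2) does not install either. The full error output: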

pip install pyahocorasick

Collecting pyahocorasick
  Using cached https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz
Building wheels for collected packages: pyahocorasick
  Running setup.py bdist_wheel for pyahocorasick ... error
  Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-wheel-rbzdosp6 --python-tag cp37:
  running bdist_wheel
  running build
  running build_ext
  building 'ahocorasick' extension
  creating build
  creating build/temp.macosx-10.7-x86_64-3.7
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -DAHOCORASICK_UNICODE= -I/anaconda3/include/python3.7m -c pyahocorasick.c -o build/temp.macosx-10.7-x86_64-3.7/pyahocorasick.o
  xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
  error: command 'gcc' failed with exit status 1
  
  ----------------------------------------
  Failed building wheel for pyahocorasick
  Running setup.py clean for pyahocorasick
Failed to build pyahocorasick
Installing collected packages: pyahocorasick
  Running setup.py install for pyahocorasick ... error
    Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-record-5oyl9c1l/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'ahocorasick' extension
    creating build
    creating build/temp.macosx-10.7-x86_64-3.7
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -DAHOCORASICK_UNICODE= -I/anaconda3/include/python3.7m -c pyahocorasick.c -o build/temp.macosx-10.7-x86_64-3.7/pyahocorasick.o
    xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
    error: command 'gcc' failed with exit status 1
    
    ----------------------------------------
Command "/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-record-5oyl9c1l/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/jd/6t6rh0991m72k_vxp02p7f440000gn/T/pip-install-_tg58exd/pyahocorasick/

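Downloading the source directly from GitHub does not help either: calling ahocorasick.Automaton() raises an error. So what can be done?

I tried installing ahocorasick-python instead. Project page: https://pypi.org/project/ahocorasick-python/. Its source code is hosted on GitHub.

It turned out to work on macOS and Linux but not on Windows 10, which left me speechless, because the demo environment runs Windows 10.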

 

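2. Workarounds

Searching online turns up three main options:

(1) Install a C compiler and retry. In the log above, the xcrun error means the macOS Command Line Tools are missing; xcode-select --install installs them;

(2) Use esmre instead of ahocorasick for AC-automaton multi-pattern matching in Python;

(3) Write your own ahocorasick in pure Python for fast keyword matching.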

 

IV. Python code for ahocorasick

1. Python 2 code:

# python2
# coding=utf-8

KIND = 16

class Node():
    static = 0

    def __init__(self):
        self.fail = None
        self.next = [None] * KIND
        self.end = False
        self.word = None
        Node.static += 1


class AcAutomation():
    def __init__(self):
        self.root = Node()
        self.queue = []

    def getIndex(self, char):
        return ord(char)  # - BASE

    def insert(self, string):
        p = self.root
        for char in string:
            index = self.getIndex(char)
            if p.next[index] == None:
                p.next[index] = Node()
            p = p.next[index]
        p.end = True
        p.word = string

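    def build_automation(self):
        # Breadth-first walk: a node's fail target is always shallower, so it
        # is already finished by the time the node itself is handled.
        self.root.fail = None
        self.queue.append(self.root)
        while len(self.queue) != 0:
            parent = self.queue.pop(0)
            for i, child in enumerate(parent.next):
                if child is None: continue
                if parent is self.root:
                    child.fail = self.root
                else:
                    failp = parent.fail
                    while failp is not None:
                        if failp.next[i] is not None:
                            child.fail = failp.next[i]
                            break
                        failp = failp.fail
                    if failp is None: child.fail = self.root
                # A keyword that is a proper suffix of this path also ends here;
                # inherit it so matchOne reports it without walking fail links.
                if not child.end and child.fail.end:
                    child.end = True
                    child.word = child.fail.word
                self.queue.append(child)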

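    def matchOne(self, string):
        p = self.root
        for pos, char in enumerate(string):
            index = self.getIndex(char)
            while p.next[index] is None and p is not self.root: p = p.fail
            if p.next[index] is None:
                p = self.root
            else:
                p = p.next[index]
            # Every byte is stored as two symbols, so a genuine match ends on
            # an odd position; a hit elsewhere straddles two bytes and is a
            # false positive.
            if p.end and pos % 2 == 1: return True, p.word
        return False, None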


class UnicodeAcAutomation():
    def __init__(self, encoding='utf-8'):
        self.ac = AcAutomation()
        self.encoding = encoding

    def getAcString(self, string):
        string = bytearray(string.encode(self.encoding))
        ac_string = ''
        for byte in string:
            ac_string += chr(byte % 16)
            ac_string += chr(byte / 16)
        # print ac_string
        return ac_string

    def insert(self, string):
        if type(string) != unicode:
            raise Exception('UnicodeAcAutomation:: insert type not unicode')
        ac_string = self.getAcString(string)
        self.ac.insert(ac_string)

    def build_automation(self):
        self.ac.build_automation()

    def matchOne(self, string):
        if type(string) != unicode:
            raise Exception('UnicodeAcAutomation:: matchOne type not unicode')
        ac_string = self.getAcString(string)
        retcode, ret = self.ac.matchOne(ac_string)
        if ret != None:
            # Rebuild the bytes from (low, high) pairs, then decode with the
            # same encoding that getAcString used.
            s = ''
            for i in range(len(ret) / 2):
                tmp = chr(ord(ret[2 * i]) + ord(ret[2 * i + 1]) * 16)
                s += tmp
            ret = s.decode(self.encoding)
        return retcode, ret


def main():
    ac = UnicodeAcAutomation()
    ac.insert(u'丁亞光')
    ac.insert(u'好喫的')
    ac.insert(u'好玩的')
    ac.build_automation()

    print(ac.matchOne(u'hi,丁亞光在幹啥'))
    print(ac.matchOne(u'ab'))
    print(ac.matchOne(u'不能喫飯啊'))
    print(ac.matchOne(u'飯很好喫,有很多好好的喫的,'))
    print(ac.matchOne(u'有很多好玩的'))


if __name__ == '__main__':
    main()

Output:

(True, u'\u4e01\u4e9e\u5149')
(False, None)
(False, None)
(False, None)
(True, u'\u597d\u73a9\u7684')

Many readers are probably used to Python 3 by now, so here is my modified version (the changes are mainly about string encoding).


2. Python 3 code:

# python3
# coding=utf-8

KIND = 16

class Node():
    static = 0

    def __init__(self):
        self.fail = None
        self.next = [None] * KIND
        self.end = False
        self.word = None
        Node.static += 1


class AcAutomation():
    def __init__(self):
        self.root = Node()
        self.queue = []

    def getIndex(self, char):
        return ord(char)  # - BASE

    def insert(self, string):
        p = self.root
        for char in string:
            index = self.getIndex(char)
            if p.next[index] == None:
                p.next[index] = Node()
            p = p.next[index]
        p.end = True
        p.word = string

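    def build_automation(self):
        # Breadth-first walk: a node's fail target is always shallower, so it
        # is already finished by the time the node itself is handled.
        self.root.fail = None
        self.queue.append(self.root)
        while len(self.queue) != 0:
            parent = self.queue.pop(0)
            for i, child in enumerate(parent.next):
                if child is None: continue
                if parent is self.root:
                    child.fail = self.root
                else:
                    failp = parent.fail
                    while failp is not None:
                        if failp.next[i] is not None:
                            child.fail = failp.next[i]
                            break
                        failp = failp.fail
                    if failp is None: child.fail = self.root
                # A keyword that is a proper suffix of this path also ends here;
                # inherit it so matchOne reports it without walking fail links.
                if not child.end and child.fail.end:
                    child.end = True
                    child.word = child.fail.word
                self.queue.append(child)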

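    def matchOne(self, string):
        p = self.root
        for pos, char in enumerate(string):
            index = self.getIndex(char)
            while p.next[index] is None and p is not self.root: p = p.fail
            if p.next[index] is None:
                p = self.root
            else:
                p = p.next[index]
            # Every byte is stored as two symbols, so a genuine match ends on
            # an odd position; a hit elsewhere straddles two bytes and is a
            # false positive.
            if p.end and pos % 2 == 1: return True, p.word
        return False, None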


class UnicodeAcAutomation():
    def __init__(self, encoding='utf-8'):
        self.ac = AcAutomation()
        self.encoding = encoding

    def getAcString(self, string):
        string = bytearray(string.encode(self.encoding))
        ac_string = ''
        for byte in string:
            ac_string += chr(byte % 16)
            ac_string += chr(byte // 16)
        return ac_string

    def insert(self, string):
        if type(string) != str:
            raise Exception('StrAcAutomation:: insert type not str')
        ac_string = self.getAcString(string)
        self.ac.insert(ac_string)

    def build_automation(self):
        self.ac.build_automation()

    def matchOne(self, string):
        if type(string) != str:
            raise Exception('StrAcAutomation:: matchOne type not str')
        ac_string = self.getAcString(string)
        retcode, ret = self.ac.matchOne(ac_string)
        if ret != None:
            # Rebuild the bytes from (low, high) pairs, then decode with the
            # same encoding that getAcString used.
            s = ''
            for i in range(len(ret) // 2):
                s += chr(ord(ret[2 * i]) + ord(ret[2 * i + 1]) * 16)
            ret = s.encode("latin1").decode(self.encoding)
        return retcode, ret


def main():
    ac = UnicodeAcAutomation()
    ac.insert('丁亞光')
    ac.insert('好喫的')
    ac.insert('好玩的')
    ac.build_automation()
    print(ac.matchOne('hi,丁亞光在幹啥'))
    print(ac.matchOne('ab'))
    print(ac.matchOne('不能喫飯啊'))
    print(ac.matchOne('飯很好喫,有很多好好的喫的,'))
    print(ac.matchOne('有很多好玩的'))


if __name__ == '__main__':
    main()

Output:

(True, '丁亞光')
(False, None)
(False, None)
(False, None)
(True, '好玩的')
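Why KIND = 16? The listing keeps every trie node down to 16 children by splitting each encoded byte into two 4-bit halves, low half first, in getAcString. The sketch below isolates just that round trip; to_nibbles and from_nibbles are names I made up for illustration, and they return lists of integers rather than the chr() strings used in the listing.

```python
def to_nibbles(text, encoding="utf-8"):
    """Split every encoded byte into (low 4 bits, high 4 bits)."""
    symbols = []
    for byte in text.encode(encoding):
        symbols.append(byte % 16)    # low half first, as in getAcString
        symbols.append(byte // 16)   # then the high half
    return symbols


def from_nibbles(symbols, encoding="utf-8"):
    """Rebuild the original text from the low/high pairs."""
    data = bytes(symbols[i] + symbols[i + 1] * 16
                 for i in range(0, len(symbols), 2))
    return data.decode(encoding)


symbols = to_nibbles("好")        # UTF-8 bytes E5 A5 BD
print(symbols)                    # [5, 14, 5, 10, 13, 11]
print(from_nibbles(symbols))      # 好
```

Each byte becomes exactly two symbols in the range 0 to 15, so a keyword always starts on an even symbol position and its trie path is twice as long as its encoded byte length. The trade-off is deeper paths in exchange for small fixed-size child arrays.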

 

Summary: there are many other ways to write your own ahocorasick, for example by adapting the source of ahocorasick-python. The core of the ahocorasick-python source is shown below.

# coding:utf-8
# written by zhou
# revised by zw

class Node(object):
    """
    Abstraction of a single trie node.
    """
    def __init__(self, str='', is_root=False):
        self._next_p = {}
        self.fail = None
        self.is_root = is_root
        self.str = str
        self.parent = None

    def __iter__(self):
        return iter(self._next_p.keys())

    def __getitem__(self, item):
        return self._next_p[item]

    def __setitem__(self, key, value):
        _u = self._next_p.setdefault(key, value)
        _u.parent = self

    def __repr__(self):
        return "<Node object '%s' at %s>" % \
               (self.str, object.__repr__(self)[1:-1].split('at')[-1])

    def __str__(self):
        return self.__repr__()


class AhoCorasick(object):
    """
    The AC automaton object.
    """
    def __init__(self, *words):
        self.words_set = set(words)
        self.words = list(self.words_set)
        self.words.sort(key=lambda x: len(x))
        self._root = Node(is_root=True)
        self._node_meta = {}
        self._node_all = [(0, self._root)]
        _a = {}
        for word in self.words:
            for w in word:
                _a.setdefault(w, set())
                _a[w].add(word)

        def node_append(keyword):
            assert len(keyword) > 0
            _ = self._root
            for _i, k in enumerate(keyword):
                node = Node(k)
                if k in _:
                    pass
                else:
                    _[k] = node
                    self._node_all.append((_i+1, _[k]))
                self._node_meta.setdefault(id(_[k]),set())
                if _i >= 1:
                    for _j in _a[k]:
                        if keyword[:_i+1].endswith(_j):
                            self._node_meta[id(_[k])].add((_j, len(_j)))
                _ = _[k]
            else:
                if _ != self._root:
                    self._node_meta[id(_)].add((keyword, len(keyword)))

        for word in self.words:
            node_append(word)
        self._node_all.sort(key=lambda x: x[0])
        self._make()

    def _make(self):
        """
        Build the AC tree (set the fail pointer of every node).
        :return:
        """
        for _level, node in self._node_all:
            if node == self._root or _level <= 1:
                node.fail = self._root
            else:
                _node = node.parent.fail
                while True:
                    if node.str in _node:
                        node.fail = _node[node.str]
                        break
                    else:
                        if _node == self._root:
                            node.fail = self._root
                            break
                        else:
                            _node = _node.fail

    def search(self, content, with_index=False):
        result = set()
        node = self._root
        index = 0
        for i in content:
            while 1:
                if i not in node:
                    if node == self._root:
                        break
                    else:
                        node = node.fail
                else:
                    for keyword, keyword_len in self._node_meta.get(id(node[i]), set()):
                        if not with_index:
                            result.add(keyword)
                        else:
                            result.add((keyword, (index - keyword_len + 1, index + 1)))
                    node = node[i]
                    break
            index += 1
        return result


if __name__ == '__main__':
    ac = AhoCorasick("abc", 'abe', 'acdabd', 'bdf', 'df', 'f', 'ac', 'cd', 'cda')
    print(ac.search('acdabdf', True))

Output (the result is a set, so the order of the elements may differ from run to run):

{('cd', (1, 3)), ('acdabd', (0, 6)), ('df', (5, 7)), ('f', (6, 7)), ('bdf', (4, 7)), ('cda', (1, 4)), ('ac', (0, 2))}
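To read the index pairs, it helps to check them against the text. The snippet below is only a sanity check I added: it copies the result set from the output above instead of calling the class, and the variable names are my own.

```python
text = "acdabdf"

# Result set copied from the output above: (keyword, (start, end)) pairs.
matches = {('cd', (1, 3)), ('acdabd', (0, 6)), ('df', (5, 7)), ('f', (6, 7)),
           ('bdf', (4, 7)), ('cda', (1, 4)), ('ac', (0, 2))}

# Sort by (start, end) to get a stable listing out of the unordered set.
ordered = sorted(matches, key=lambda m: m[1])
for keyword, (start, end) in ordered:
    # start is inclusive and end is exclusive, like a Python slice.
    assert text[start:end] == keyword
    print(start, end, keyword)
```

Every pair slices back to its keyword, which confirms that the reported positions are half-open intervals over the searched text.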

 

