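Python: Keyword Extraction with an AC Automaton
Goal: building on my earlier article [Multi-Pattern Matching in Python: the AC Automaton], install gcc (a C compiler), then install ahocorasick, and use it to extract keywords from text.
PS: the underlying principle was covered earlier, so this article only describes the installation and how to use the library. Readers who want the theory can see [Pattern Matching] Aho-Corasick Automaton and A Brief Analysis of the Aho-Corasick Automaton.
1. Installing ahocorasick (Python 3)
In a terminal, enter: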
conda install pyahocorasick
anaconda search -t conda pyahocorasick
conda install -c https://conda.anaconda.org/conda-forge pyahocorasick
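When prompted, enter y and the installation completes. Reference blog: Installing the ahocorasick library in Python (the original blog post is no longer accessible).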
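To confirm that the package can be imported before running the example in the next section, a check along the following lines may help. This is a minimal sketch and not part of the original post: the helper name `is_installed` and the sample keyword are illustrative assumptions, while `Automaton`, `add_word`, `make_automaton` and `iter` are the pyahocorasick calls this article relies on.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("ahocorasick"):
    import ahocorasick

    # Build a one-keyword automaton and scan a short string with it.
    automaton = ahocorasick.Automaton()
    automaton.add_word("he", (1, "he"))  # hypothetical sample keyword
    automaton.make_automaton()
    print(list(automaton.iter("she")))
else:
    print("pyahocorasick is not installed in this environment")
```

If the second message is printed, the conda environment that received the package is probably not the one running the script.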
2. Python example (adapted and optimized from the post "Problems encountered when using ahocorasick, and simple usage")
import ahocorasick
import time


class AhocorasickNer:
    def __init__(self, user_dict_path):
        self.user_dict_path = user_dict_path
        self.actree = ahocorasick.Automaton()

    def add_keywords(self):
        # Load one keyword per line; the stored value is (sequence number, keyword).
        flag = 0
        with open(self.user_dict_path, "r", encoding="utf-8") as file:
            for line in file:
                word = line.strip()
                if not word:  # skip blank lines
                    continue
                flag += 1
                self.actree.add_word(word, (flag, word))
        self.actree.make_automaton()

    def get_ner_results(self, sentence):
        ner_results = []
        # Each item i has the form (index1, (index2, word)):
        # index1: index in sentence of the last character of the match
        # index2: sequence number the keyword received when added to self.actree
        for i in self.actree.iter(sentence):
            ner_results.append((i[1], i[0] + 1 - len(i[1][1]), i[0] + 1))
        return ner_results


if __name__ == "__main__":
    ahocorasick_ner = AhocorasickNer(user_dict_path="../../funNLP/organization_dict.txt")
    ahocorasick_ner.add_keywords()
    while True:
        sentence = input("\nINPUT : ")
        ss = time.time()
        res = ahocorasick_ner.get_ner_results(sentence)
        print("TIME : {0}ms!".format(round(1000 * (time.time() - ss), 3)))
        print("OUTPUT:{0}".format(res))
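Output:
Notes:
1. The file [funNLP/organization_dict.txt] used in the code comes from the funNLP company-name dictionary; after registering, it can be downloaded quickly (N times faster than from GitHub);
2. The dictionary in this file contains on the order of a million entries, and extracting keywords from a text takes about 0.05 ms (on a personal laptop).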
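A single timing at the 0.05 ms scale can be noisy, so one way to check a figure like this is to average over many calls. The following is a minimal sketch under stated assumptions: `average_ms` and `fake_extractor` are hypothetical names, and `fake_extractor` is only a stand-in for `ahocorasick_ner.get_ner_results` so that the snippet runs without the library or the dictionary file.

```python
import time


def average_ms(func, argument, repeats=1000):
    """Return the mean wall-clock time of func(argument) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(argument)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / repeats


# Stand-in extractor (assumption): replace with ahocorasick_ner.get_ner_results.
def fake_extractor(sentence):
    return [word for word in ("Python", "automaton") if word in sentence]


mean_time = average_ms(fake_extractor, "Python AC automaton")
print("TIME : {0}ms".format(round(mean_time, 5)))
```

`time.perf_counter` is used here because it is intended for measuring short durations; the number it reports will of course depend on the machine and on the sentence length.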
Supplement: a pure-Python version of the AC automaton; the code is below (to be updated):
#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@Author :geekzw
@Contact :[email protected]
@File :AC_ner.py
@Time :2020/3/5 12:56 AM
@Software :Pycharm
@Copyright (c) 2020,All Rights Reserved.
"""
import time
from collections import deque


class node(object):
    def __init__(self):
        self.next = {}  # child pointers: maps a character to the node one level down
        self.fail = None  # failure pointer, the key ingredient of the AC automaton
        self.isWord = False  # marks whether this node ends a keyword
        self.word = ""  # stores the keyword that ends here


class ac_automation(object):
    def __init__(self, user_dict_path):
        self.root = node()
        self.user_dict_path = user_dict_path

    def add(self, word):
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

    # Add the keywords listed in the file
    def add_keyword(self):
        with open(self.user_dict_path, "r", encoding="utf-8") as file:
            for line in file:
                word = line.strip()
                if word:  # skip blank lines
                    self.add(word)

    def make_fail(self):
        # Breadth-first traversal, so a node's fail pointer is set before its children's.
        temp_que = deque([self.root])
        while temp_que:
            temp = temp_que.popleft()
            for key, value in temp.next.items():
                if temp is self.root:
                    value.fail = self.root
                else:
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            # The fail target is the matching child of p, not p.fail.
                            value.fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        value.fail = self.root
                temp_que.append(value)

    def search(self, content):
        # Single pass over content; make_fail() must have been called first.
        p = self.root
        result = set()
        for end_index, char in enumerate(content, start=1):
            # On a mismatch, follow fail pointers until char can be consumed
            # or the root is reached.
            while char not in p.next and p is not self.root:
                p = p.fail
            p = p.next.get(char, self.root)
            # Walk the fail chain to report every keyword ending at this position.
            temp = p
            while temp is not self.root:
                if temp.isWord:
                    result.add((temp.word, end_index - len(temp.word), end_index))
                temp = temp.fail
        return result


if __name__ == "__main__":
    ac = ac_automation(user_dict_path="../../funNLP/organization_dict.txt")
    ac.add_keyword()  # add the keywords to the AC automaton
    ac.make_fail()  # build the failure pointers before searching
    while True:
        query = input("\nINPUT: ")
        ss = time.time()
        res = ac.search(query)
        print("TIME: {0} ms!".format(round(1000 * (time.time() - ss), 3)))
        print("OUTPUT:", res)
Output:
Note: comparing the two, the pure-Python version I rewrote still has room for optimization to reduce its running time!
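The pure-Python `search` returns tuples of the form (word, start, end), and matches can overlap when one keyword contains or straddles another. For keyword extraction it is sometimes convenient to keep only the longest non-overlapping hits. The following is one possible post-processing step, offered as a sketch and not taken from the original post: the function name `longest_non_overlapping` and the sample matches are illustrative assumptions, and the greedy longest-first rule is only one of several reasonable policies.

```python
def longest_non_overlapping(matches):
    """Keep the longest matches, dropping any match that overlaps a kept one."""
    # Longest matches first; ties are broken by the earlier start position.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    for word, start, end in ordered:
        # Half-open spans [start, end) overlap unless one ends before the other begins.
        if all(end <= k_start or start >= k_end for _, k_start, k_end in kept):
            kept.append((word, start, end))
    return sorted(kept, key=lambda m: m[1])


# Hypothetical matches for the text "ushers" with keywords he / she / hers.
matches = {("he", 2, 4), ("she", 1, 4), ("hers", 2, 6)}
print(longest_non_overlapping(matches))  # → [('hers', 2, 6)]
```

The same function can be applied to the output of `get_ner_results` after mapping each ((index, word), start, end) item to (word, start, end).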