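Python: Keyword Extraction with an AC Automaton
Goal: building on my earlier article "Multi-Pattern Matching in Python: the AC Automaton" (Python实现多模匹配——AC自动机), install gcc (the C compiler), then install ahocorasick, and use it to extract keywords from text.
PS: since the theory was covered earlier, this post only walks through the installation and the usage. Readers who still want the theory can see "[Pattern Matching] Aho-Corasick Automaton" (【模式匹配】Aho-Corasick自动机) and "A Brief Analysis of the Aho-Corasick Automaton" (Aho-Corasick自动机浅析).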
1. Install ahocorasick (python3)
In a terminal, enter:
conda install pyahocorasick
anaconda search -t conda pyahocorasick
conda install -c https://conda.anaconda.org/conda-forge pyahocorasick
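Finally, enter y to confirm, and the installation completes. Reference blog post: "Installing the ahocorasick library in Python" (python中安装ahocorasick库; the original post is no longer accessible).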
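Before moving on, it can help to confirm that the module is visible from the active environment. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for illustration. Note that the package is installed under the name pyahocorasick, while the module is imported as `ahocorasick`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Installed as "pyahocorasick", imported as "ahocorasick".
    print("ahocorasick importable:", is_installed("ahocorasick"))
```

If this prints False, the install most likely went into a different conda environment than the one running Python.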
2. Python example (based on, and improved from, the post "Problems encountered when using ahocorasick, and basic usage")
import ahocorasick
import time


class AhocorasickNer:
    def __init__(self, user_dict_path):
        self.user_dict_path = user_dict_path
        self.actree = ahocorasick.Automaton()

    def add_keywords(self):
        # Load one keyword per line and store (id, word) as its value.
        flag = 0
        with open(self.user_dict_path, "r", encoding="utf-8") as file:
            for line in file:
                word = line.strip()
                if not word:  # skip blank lines
                    continue
                flag += 1
                self.actree.add_word(word, (flag, word))
        self.actree.make_automaton()

    def get_ner_results(self, sentence):
        ner_results = []
        # Each item i has the form (index1, (index2, word)):
        #   index1: index in sentence of the last character of the match
        #   index2: id given to the word when it was added to self.actree
        # so the match covers sentence[index1 + 1 - len(word) : index1 + 1].
        for i in self.actree.iter(sentence):
            ner_results.append((i[1], i[0] + 1 - len(i[1][1]), i[0] + 1))
        return ner_results


if __name__ == "__main__":
    ahocorasick_ner = AhocorasickNer(user_dict_path="../../funNLP/organization_dict.txt")
    ahocorasick_ner.add_keywords()
    while True:
        sentence = input("\nINPUT : ")
        ss = time.time()
        res = ahocorasick_ner.get_ner_results(sentence)
        print("TIME : {0}ms!".format(round(1000 * (time.time() - ss), 3)))
        print("OUTPUT:{0}".format(res))
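Output:
Notes:
1. The file funNLP/organization_dict.txt used in the code comes from the funNLP company-name lexicon; it can be downloaded quickly after registering (N times faster than GitHub);
2. The dictionary in the file holds on the order of a million entries, and extracting the keywords from a text takes about 0.05 ms (on a personal laptop).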
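Sub-millisecond figures like the one in note 2 are hard to read off a single call, because timer resolution and system noise are of the same order. A rough sketch of averaging over many calls with `time.perf_counter` follows; `average_ms` and the stand-in `fake_extract` are hypothetical names used only for illustration. In practice one would pass `ahocorasick_ner.get_ner_results` instead of the stand-in.

```python
import time


def average_ms(func, arg, repeat=1000):
    """Average wall-clock time of func(arg), in milliseconds, over repeat calls."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(arg)
    return 1000 * (time.perf_counter() - start) / repeat


def fake_extract(sentence):
    # Stand-in for a real extractor such as get_ner_results (assumption).
    return [w for w in ("apple", "banana") if w in sentence]


if __name__ == "__main__":
    print("avg: {0:.4f} ms".format(average_ms(fake_extract, "I like apple pie")))
```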
Supplement: a pure-Python version of the AC automaton; the code is below (to be updated):
#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@Author :geekzw
@Contact :[email protected]
@File :AC_ner.py
@Time :2020/3/5 12:56 AM
@Software :Pycharm
@Copyright (c) 2020,All Rights Reserved.
"""
import time
from collections import deque


class node(object):
    def __init__(self):
        self.next = {}       # child pointers: character -> node on the next trie level
        self.fail = None     # failure pointer, the key to the AC automaton
        self.isWord = False  # marks whether a keyword ends at this node
        self.word = ""       # the keyword that ends at this node


class ac_automation(object):
    def __init__(self, user_dict_path):
        self.root = node()
        self.user_dict_path = user_dict_path

    def add(self, word):
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

    # Add the keywords from the dictionary file
    def add_keyword(self):
        with open(self.user_dict_path, "r", encoding="utf-8") as file:
            for line in file:
                word = line.strip()
                if word:  # skip blank lines
                    self.add(word)

    def make_fail(self):
        # Breadth-first traversal, so a parent's fail pointer is set before its children's
        temp_que = deque([self.root])
        while temp_que:
            temp = temp_que.popleft()
            for key, child in temp.next.items():
                if temp is self.root:
                    child.fail = self.root
                else:
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            child.fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        child.fail = self.root
                temp_que.append(child)

    def search(self, content):
        result = set()
        index = 0
        while index < len(content):
            p = self.root
            currentposition = index
            while currentposition < len(content):
                word = content[currentposition]
                # Follow fail pointers until the character can be consumed
                while word not in p.next and p is not self.root:
                    p = p.fail
                if word in p.next:
                    p = p.next[word]
                else:
                    p = self.root
                if p.isWord:
                    # Record the first keyword found from this start position
                    end_index = currentposition + 1
                    result.add((p.word, end_index - len(p.word), end_index))
                    break
                currentposition += 1
            index += 1
        return result


if __name__ == "__main__":
    ac = ac_automation(user_dict_path="../../funNLP/organization_dict.txt")
    ac.add_keyword()  # add the keywords to the AC automaton
    ac.make_fail()    # build the failure pointers before searching
    while True:
        query = input("\nINPUT: ")
        ss = time.time()
        res = ac.search(query)
        print("TIME: {0} ms!".format(round(1000 * (time.time() - ss), 3)))
        print("OUTPUT:", res)
Output:
Note: the comparison shows that my own pure-Python rewrite still has room for optimization to bring its running time down.
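One possible direction for that optimization, sketched under my own assumptions rather than as the author's method: scan the text once instead of restarting from every index, and let each node carry every keyword that ends there, including those inherited through its failure pointer. The names `Node`, `build_automaton` and `find_all` are made up for this sketch, and the keywords are passed as a list rather than read from a file.

```python
from collections import deque


class Node:
    def __init__(self):
        self.next = {}    # character -> child node
        self.fail = None  # failure pointer
        self.words = []   # keywords ending here, own and inherited via fail


def build_automaton(keywords):
    """Build the trie and the failure pointers for the given keywords."""
    root = Node()
    for word in keywords:
        if not word:
            continue
        cur = root
        for ch in word:
            cur = cur.next.setdefault(ch, Node())
        if word not in cur.words:
            cur.words.append(word)

    queue = deque()
    for child in root.next.values():
        child.fail = root
        queue.append(child)
    while queue:  # breadth-first, so shallower nodes are finished first
        cur = queue.popleft()
        for ch, child in cur.next.items():
            p = cur.fail
            while p is not root and ch not in p.next:
                p = p.fail
            child.fail = p.next.get(ch, root)
            # Inherit the keywords that end at the failure target
            child.words = child.words + child.fail.words
            queue.append(child)
    return root


def find_all(root, text):
    """Return (word, start, end) for every keyword occurrence, in one pass."""
    results = []
    p = root
    for i, ch in enumerate(text):
        while p is not root and ch not in p.next:
            p = p.fail
        p = p.next.get(ch, root)
        for word in p.words:
            results.append((word, i + 1 - len(word), i + 1))
    return results


if __name__ == "__main__":
    automaton = build_automaton(["a", "ab", "bc", "c"])
    print(find_all(automaton, "abc"))
    # [('a', 0, 1), ('ab', 0, 2), ('bc', 1, 3), ('c', 2, 3)]
```

Storing the inherited keywords on each node trades some memory for not having to walk the failure chain during the search; with a very large dictionary that trade-off would need to be measured.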