英文关键词提取之RAKE算法

转载自https://zhikaizhang.cn/

RAKE算法是由2010年的论文Automatic keyword extraction from individual documents提出的，比TextRank算法效果更好，原repository链接是 https://github.com/aneesha/RAKE，已经很久没有维护了，本文重新整理了代码，做了以下3个工作：

使其支持python 3.0版本
使其更灵活地用命令行调用
代码重构，提高可读性
托管在https://github.com/laserwave/RAKE

RAKE算法思想

RAKE算法用来做关键词(keyword)的提取，实际上提取的是关键的短语(phrase)，并且倾向于较长的短语，在英文中，关键词通常包括多个单词，但很少包含标点符号和停用词，例如and，the，of等，以及其他不包含语义信息的单词。
RAKE算法首先使用标点符号（如半角的句号、问号、感叹号、逗号等）将一篇文档分成若干分句，然后对于每一个分句，使用停用词作为分隔符将分句分为若干短语，这些短语作为最终提取出的关键词的候选词。
那么，如何来衡量每个短语的重要程度呢？
我们注意到，每个短语可以再通过空格分为若干个单词，可以通过给每个单词赋予一个得分，通过累加得到每个短语的得分。一个关键点在于将这个短语中每个单词的共现关系考虑进去。
最终定义的公式是:

即单词w的得分是该单词的度（是一个网络中的概念，每与一个单词共现在一个短语中，度就加1，考虑该单词本身）除以该单词的词频（该单词在该文档中出现的总次数）。
然后对于每个候选的关键短语，将其中每个单词的得分累加，并进行排序，RAKE将候选短语总数的前三分之一的认为是抽取出的关键词。

RAKE的实现

源码中使用maxPhraseLength参数来限定候选短语的长度，用来过滤掉过长的短语。

import re
import operator
import argparse
import codecs

def isNumber(s):
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False

class Rake:
    
    def __init__(self, inputFilePath, stopwordsFilePath, outputFilePath, minPhraseChar, maxPhraseLength):
        self.outputFilePath = outputFilePath
        self.minPhraseChar = minPhraseChar
        self.maxPhraseLength = maxPhraseLength
        # read documents
        self.docs = []
        for document in codecs.open(inputFilePath, 'r', 'utf-8'):
            self.docs.append(document)
        # read stopwords
        stopwords = []
        for word in codecs.open(stopwordsFilePath, 'r', 'utf-8'):
            stopwords.append(word.strip())
        stopwordsRegex = []
        for word in stopwords:
            regex = r'\b' + word + r'(?![\w-])'
            stopwordsRegex.append(regex)
        self.stopwordsPattern = re.compile('|'.join(stopwordsRegex), re.IGNORECASE)

    def separateWords(self, text):
        splitter = re.compile('[^a-zA-Z0-9_\\+\\-/]')
        words = []
        for word in splitter.split(text):
            word = word.strip().lower()
            # leave numbers in phrase, but don't count as words, since they tend to invalidate scores of their phrases
            if len(word) > 0 and word != '' and not isNumber(word):
                words.append(word)
        return words
    
    
    def calculatePhraseScore(self, phrases):
        # calculate wordFrequency and wordDegree
        wordFrequency = {}
        wordDegree = {}
        for phrase in phrases:
            wordList = self.separateWords(phrase)
            wordListLength = len(wordList)
            wordListDegree = wordListLength - 1
            for word in wordList:
                wordFrequency.setdefault(word, 0)
                wordFrequency[word] += 1
                wordDegree.setdefault(word, 0)
                wordDegree[word] += wordListDegree
        for item in wordFrequency:
            wordDegree[item] = wordDegree[item] + wordFrequency[item]
    
        # calculate wordScore = wordDegree(w)/wordFrequency(w)
        wordScore = {}
        for item in wordFrequency:
            wordScore.setdefault(item, 0)
            wordScore[item] = wordDegree[item] * 1.0 / wordFrequency[item]

        # calculate phraseScore
        phraseScore = {}
        for phrase in phrases:
            phraseScore.setdefault(phrase, 0)
            wordList = self.separateWords(phrase)
            candidateScore = 0
            for word in wordList:
                candidateScore += wordScore[word]
            phraseScore[phrase] = candidateScore
        return phraseScore
    
        
    def execute(self):
        file = codecs.open(self.outputFilePath,'w','utf-8')
        for document in self.docs:
            # split a document into sentences
            sentenceDelimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
            sentences = sentenceDelimiters.split(document)
            # generate all valid phrases
            phrases = []
            for s in sentences:
                tmp = re.sub(self.stopwordsPattern, '|', s.strip())
                phrasesOfSentence = tmp.split("|")
                for phrase in phrasesOfSentence:
                    phrase = phrase.strip().lower()
                    if phrase != "" and len(phrase) >= self.minPhraseChar and len(phrase.split()) <= self.maxPhraseLength:
                        phrases.append(phrase)
    
            # calculate phrase score
            phraseScore = self.calculatePhraseScore(phrases)
            keywords = sorted(phraseScore.items(), key = operator.itemgetter(1), reverse=True)
            file.write(str(keywords[0:int(len(keywords)/3)]) + "\n")
        file.close()
        
def readParamsFromCmd():
    parser = argparse.ArgumentParser(description = "This is a python implementation of rake(rapid automatic keyword extraction).")
    parser.add_argument('inputFilePath', help = 'The file path of input document(s). One line represents a document.')
    parser.add_argument('stopwordsFilePath', help = 'The file path of stopwords, each line represents a word.')
    parser.add_argument('-o', '--outputFilePath', help = 'The file path of output. (default output.txt in current dir).', default = 'output.txt')
    parser.add_argument('-m', '--minPhraseChar', type = int, help = 'The minimum number of characters of a phrase.(default 1)', default = 1)
    parser.add_argument('-a', '--maxPhraseLength', type = int, help = 'The maximum length of a phrase.(default 3)', default = 3)
    return parser.parse_args()

params = readParamsFromCmd().__dict__

rake = Rake(params['inputFilePath'], params['stopwordsFilePath'], params['outputFilePath'], params['minPhraseChar'], params['maxPhraseLength'])
rake.execute()

用法

python rake.py [-h] [-o OUTPUTFILEPATH] [-m MINPHRASECHAR] [-a MAXPHRASELENGTH] inputFilePath stopwordsFilePath

optional arguments:positional arguments:
inputFilePath The file path of input document(s). One line represents a document.
stopwordsFilePath The file path of stopwords, each line represents a word.

-h, –help show this help message and exit
-o OUTPUTFILEPATH, –outputFilePath OUTPUTFILEPATH The file path of output. (default output.txt in current dir).
-m MINPHRASECHAR, –minPhraseChar MINPHRASECHAR The minimum number of characters of a phrase.(default 1)
-a MAXPHRASELENGTH, –maxPhraseLength MAXPHRASELENGTH The maximum length of a phrase.(default 3)

实验

源码中的example中包括两个文档，第一个文档是论文中给出的一个样例，第二个文档是从wikipedia的NLP词条中拷贝的一段话。如下所示：

输出是每个文档中抽取的关键短语以及得分，如下：

使用RAKE算法处理中文，会遇到一些问题，中文使用停用词来将一句话划分为若干短语的效果远不及英文，大部分的汉字相互粘连在一起，因此效果不好。

参考资料

1 Automatic keyword extraction from individual documents by Stuart Rose et al.

https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents

2 https://github.com/laserwave/keywords_extraction_rake

3 https://blog.csdn.net/chinwuforwork/article/details/77993277

英文关键词提取之RAKE算法

RAKE算法思想

RAKE的实现

用法

实验

参考资料

诈骗（杀猪盘）网站进行渗透测试

Python 潮流周刊#50：我最喜欢的 Python 3.13 新特性！

外行也能读懂的网络硬件设备功能原理速成

詞向量 — word2vec

Flutter中模擬器無法啓動的解決方法

Pytorch實現一個用於學習正態分佈的GAN網絡

解決ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'的方法

英文關鍵詞提取之RAKE算法

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結