RAKE was proposed in the 2010 paper "Automatic keyword extraction from individual documents" and performs better than TextRank. The original repository, https://github.com/aneesha/RAKE, has not been maintained for a long time, so this post reorganizes the code and makes three changes:
- port it to Python 3
- make it easier to invoke flexibly from the command line
- refactor the code for readability
The updated code is hosted at https://github.com/laserwave/RAKE
The idea behind RAKE
RAKE extracts keywords, or more precisely key phrases, and tends to favor longer phrases. In English, a keyword usually consists of several words, but it rarely contains punctuation or stopwords such as and, the, of, or other words that carry no semantic information.
RAKE first splits a document into sentences using punctuation (periods, question marks, exclamation marks, commas, and so on). It then splits each sentence into phrases, using stopwords as delimiters; these phrases become the candidates for the final keywords.
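The candidate-generation step above can be sketched as follows. This is a minimal illustration, not the repository's code, and the stopword list here is a tiny hypothetical one rather than the stopwords file the implementation reads:

```python
import re

# A tiny hypothetical stopword list for illustration only.
stopwords = {"of", "the", "and", "a", "in", "are", "for"}

def candidate_phrases(text):
    """Split text into sentences by punctuation, then into candidate
    phrases by treating stopwords as delimiters."""
    phrases = []
    for sentence in re.split(r"[.!?,;:]", text.lower()):
        current = []
        for word in sentence.split():
            if word in stopwords:
                # A stopword ends the current phrase.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

print(candidate_phrases("Compatibility of systems of linear constraints"))
# ['compatibility', 'systems', 'linear constraints']
```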
How, then, do we measure the importance of each phrase?
Note that each phrase can be further split into words by whitespace. We can assign a score to every word and obtain a phrase's score by summing the scores of its words. A key point is to take into account the co-occurrence of the words within phrases.
The final formula is:

wordScore(w) = deg(w) / freq(w)

That is, the score of a word w is its degree (a concept from graph theory: the degree increases by one for each co-occurrence with another word in a phrase, and the word itself is also counted) divided by its frequency (the total number of times the word appears in the document).
Then, for each candidate key phrase, the scores of its words are summed and the candidates are ranked. RAKE takes the top third of the candidate phrases as the extracted keywords.
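As a toy illustration of the scoring (the three phrases below are assumed to be candidates that were already extracted; this is not the repository's code):

```python
from collections import defaultdict

phrases = ["linear constraints", "linear diophantine equations", "upper bounds"]

freq = defaultdict(int)    # freq(w): total occurrences of w
degree = defaultdict(int)  # deg(w): co-occurrences within phrases, counting w itself
for phrase in phrases:
    words = phrase.split()
    for w in words:
        freq[w] += 1
        # Adding the phrase length counts every co-occurrence plus the word itself.
        degree[w] += len(words)

word_score = {w: degree[w] / freq[w] for w in freq}
phrase_score = {p: sum(word_score[w] for w in p.split()) for p in phrases}

print(word_score["linear"])                          # 2.5  (degree 5 / frequency 2)
print(phrase_score["linear diophantine equations"])  # 8.5  (2.5 + 3.0 + 3.0)
```

"linear" occurs in two phrases of lengths 2 and 3, so its degree (5) exceeds its frequency (2); words that co-occur with many others in long phrases score higher, which is why RAKE favors longer phrases.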
Implementation
The source uses a maxPhraseLength parameter to bound the length of candidate phrases, filtering out phrases that are too long.
import re
import operator
import argparse
import codecs


def isNumber(s):
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False


class Rake:

    def __init__(self, inputFilePath, stopwordsFilePath, outputFilePath, minPhraseChar, maxPhraseLength):
        self.outputFilePath = outputFilePath
        self.minPhraseChar = minPhraseChar
        self.maxPhraseLength = maxPhraseLength
        # read documents, one document per line
        self.docs = []
        for document in codecs.open(inputFilePath, 'r', 'utf-8'):
            self.docs.append(document)
        # read stopwords, one word per line
        stopwords = []
        for word in codecs.open(stopwordsFilePath, 'r', 'utf-8'):
            stopwords.append(word.strip())
        # build a single case-insensitive pattern matching any stopword
        stopwordsRegex = []
        for word in stopwords:
            regex = r'\b' + word + r'(?![\w-])'
            stopwordsRegex.append(regex)
        self.stopwordsPattern = re.compile('|'.join(stopwordsRegex), re.IGNORECASE)

    def separateWords(self, text):
        splitter = re.compile(r'[^a-zA-Z0-9_+\-/]')
        words = []
        for word in splitter.split(text):
            word = word.strip().lower()
            # leave numbers in the phrase, but don't count them as words,
            # since they tend to invalidate the scores of their phrases
            if word != '' and not isNumber(word):
                words.append(word)
        return words

    def calculatePhraseScore(self, phrases):
        # calculate word frequency and word degree
        wordFrequency = {}
        wordDegree = {}
        for phrase in phrases:
            wordList = self.separateWords(phrase)
            wordListDegree = len(wordList) - 1
            for word in wordList:
                wordFrequency.setdefault(word, 0)
                wordFrequency[word] += 1
                wordDegree.setdefault(word, 0)
                wordDegree[word] += wordListDegree
        # count each word's own occurrences toward its degree
        for item in wordFrequency:
            wordDegree[item] = wordDegree[item] + wordFrequency[item]

        # calculate wordScore = wordDegree(w) / wordFrequency(w)
        wordScore = {}
        for item in wordFrequency:
            wordScore[item] = wordDegree[item] * 1.0 / wordFrequency[item]

        # a phrase's score is the sum of its words' scores
        phraseScore = {}
        for phrase in phrases:
            wordList = self.separateWords(phrase)
            candidateScore = 0
            for word in wordList:
                candidateScore += wordScore[word]
            phraseScore[phrase] = candidateScore
        return phraseScore

    def execute(self):
        file = codecs.open(self.outputFilePath, 'w', 'utf-8')
        for document in self.docs:
            # split a document into sentences
            sentenceDelimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
            sentences = sentenceDelimiters.split(document)
            # generate all valid candidate phrases
            phrases = []
            for s in sentences:
                tmp = re.sub(self.stopwordsPattern, '|', s.strip())
                phrasesOfSentence = tmp.split("|")
                for phrase in phrasesOfSentence:
                    phrase = phrase.strip().lower()
                    if phrase != "" and len(phrase) >= self.minPhraseChar and len(phrase.split()) <= self.maxPhraseLength:
                        phrases.append(phrase)
            # score the phrases and keep the top third as keywords
            phraseScore = self.calculatePhraseScore(phrases)
            keywords = sorted(phraseScore.items(), key=operator.itemgetter(1), reverse=True)
            file.write(str(keywords[0:int(len(keywords) / 3)]) + "\n")
        file.close()


def readParamsFromCmd():
    parser = argparse.ArgumentParser(description="This is a python implementation of rake (rapid automatic keyword extraction).")
    parser.add_argument('inputFilePath', help='The file path of input document(s). One line represents a document.')
    parser.add_argument('stopwordsFilePath', help='The file path of stopwords, each line represents a word.')
    parser.add_argument('-o', '--outputFilePath', help='The file path of output (default output.txt in current dir).', default='output.txt')
    parser.add_argument('-m', '--minPhraseChar', type=int, help='The minimum number of characters of a phrase (default 1).', default=1)
    parser.add_argument('-a', '--maxPhraseLength', type=int, help='The maximum length of a phrase (default 3).', default=3)
    return parser.parse_args()


if __name__ == '__main__':
    params = readParamsFromCmd().__dict__
    rake = Rake(params['inputFilePath'], params['stopwordsFilePath'], params['outputFilePath'], params['minPhraseChar'], params['maxPhraseLength'])
    rake.execute()
Usage
python rake.py [-h] [-o OUTPUTFILEPATH] [-m MINPHRASECHAR] [-a MAXPHRASELENGTH] inputFilePath stopwordsFilePath
positional arguments:
  inputFilePath         The file path of input document(s). One line represents a document.
  stopwordsFilePath     The file path of stopwords, each line represents a word.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUTFILEPATH, --outputFilePath OUTPUTFILEPATH
                        The file path of output (default output.txt in current dir).
  -m MINPHRASECHAR, --minPhraseChar MINPHRASECHAR
                        The minimum number of characters of a phrase (default 1).
  -a MAXPHRASELENGTH, --maxPhraseLength MAXPHRASELENGTH
                        The maximum length of a phrase (default 3).
Experiments
The example directory in the repository contains two documents: the first is the sample given in the paper, and the second is a paragraph copied from the Wikipedia entry on NLP.
The output is, for each document, the extracted key phrases together with their scores.
Applying RAKE to Chinese text runs into problems: using stopwords to split a sentence into phrases works far less well for Chinese than for English, because most Chinese characters run together without delimiters, so the results are poor.
References
1. Stuart Rose et al. Automatic keyword extraction from individual documents. 2010.
2. https://github.com/laserwave/keywords_extraction_rake
3. https://blog.csdn.net/chinwuforwork/article/details/77993277