[NLP Basics] The RAKE Algorithm for English Keyword Extraction

Introduction to RAKE

RAKE stands for Rapid Automatic Keyword Extraction. It is a highly efficient keyword-extraction algorithm that operates on individual documents, which makes it well suited to dynamic collections; it is also easy to apply to new domains and remains effective across many types of documents.


Algorithm Idea

Although RAKE is described as a keyword extractor, what it actually extracts are key phrases, with a preference for longer ones. In English, keywords usually consist of several content words but rarely contain punctuation or stop words such as "and", "the", and "of", or other words that carry little semantic information.

RAKE first splits a document into sentences using punctuation marks (half-width periods, question marks, exclamation marks, commas, and so on). Each sentence is then split into phrases using stop words as delimiters; these phrases become the candidate keywords.
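This two-stage splitting (punctuation into sentences, stop words into candidate phrases) can be sketched in a few lines. The stoplist here is a tiny illustrative one, and the input is a sentence often used to illustrate RAKE; the full implementation appears later in this article.

```python
import re

STOPWORDS = {"of", "the", "over", "and", "a", "is"}  # tiny illustrative stoplist

text = "Compatibility of systems of linear constraints over the set of natural numbers"

# Stage 1: split into sentences on punctuation.
# Stage 2: split each sentence into candidate phrases on stop words.
candidates = []
for sentence in re.split(r"[.!?,;:]", text.lower()):
    phrase = []
    for word in sentence.split():
        if word in STOPWORDS:
            if phrase:
                candidates.append(" ".join(phrase))
            phrase = []
        else:
            phrase.append(word)
    if phrase:
        candidates.append(" ".join(phrase))

print(candidates)
# ['compatibility', 'systems', 'linear constraints', 'set', 'natural numbers']
```

Note that multi-word runs between stop words survive as whole phrases ("linear constraints", "natural numbers"), which is exactly the property RAKE exploits.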

Finally, each phrase is split on whitespace into words. Each word is assigned a score, and a phrase's score is the sum of its words' scores. A key point is that the co-occurrence relations between the words in a phrase are taken into account. The word score is defined as:

wordScore = wordDegree(w) / wordFrequency(w)
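With some hypothetical counts, the formula shows why a word that appears inside longer candidate phrases scores higher than its raw frequency alone would suggest:

```python
# Hypothetical example: "learning" occurs twice, in the candidate phrases
# "deep learning" (length 2) and "machine learning systems" (length 3).
word_degree = 2 + 3       # co-occurrences, counting the word itself once per phrase
word_frequency = 2        # raw occurrence count
word_score = word_degree / word_frequency
print(word_score)         # 2.5
```

A word that only ever appeared alone would score deg/freq = 1, so the ratio rewards membership in long phrases.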

Algorithm Steps

(1) Tokenize each sentence, remove stop words, and use the stop words to split the sentence into candidate phrases;
(2) For each word, count its co-occurrences with other words within phrases, and build a word co-occurrence matrix;
(3) The sum of the values in each word's column of the co-occurrence matrix is its degree deg (a graph concept: the degree increases by 1 for every word it co-occurs with in a phrase, counting the word itself), and the number of times the word appears in the text is its frequency freq;
(4) A word's score is the quotient deg/freq; the larger the score, the more important the word;
(5) Finally, output the phrases containing these words, sorted by phrase score in descending order.

Let us walk through RAKE on a Chinese example: "系統有聲音,但系統托盤的音量小喇叭圖標不見了" ("the system has sound, but the volume icon in the system tray is gone"). After tokenization and stop-word removal we obtain the word set W = {系統, 聲音, 托盤, 音量, 小喇叭, 圖標, 不見} and the candidate phrase set D = {系統, 聲音, 系統托盤, 音量小喇叭圖標不見}. The word co-occurrence matrix (diagonal entries count a word's co-occurrence with itself, once per phrase) is:

        系統  聲音  托盤  音量  小喇叭  圖標  不見
系統     2    0    1    0    0     0    0
聲音     0    1    0    0    0     0    0
托盤     1    0    1    0    0     0    0
音量     0    0    0    1    1     1    1
小喇叭   0    0    0    1    1     1    1
圖標     0    0    0    1    1     1    1
不見     0    0    0    1    1     1    1

Summing each word's row gives the degrees deg = {系統: 3, 聲音: 1, 托盤: 2, 音量: 4, 小喇叭: 4, 圖標: 4, 不見: 4}; the frequencies are freq = {系統: 2, 聲音: 1, 托盤: 1, 音量: 1, 小喇叭: 1, 圖標: 1, 不見: 1}; dividing gives the word scores score = {系統: 1.5, 聲音: 1, 托盤: 2, 音量: 4, 小喇叭: 4, 圖標: 4, 不見: 4}. Summing the word scores within each phrase and sorting in descending order yields the output {音量小喇叭圖標不見, 系統托盤, 系統, 聲音}.
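The degrees, frequencies, and scores for this example can be recomputed with a short standalone sketch (independent of the full implementation below); deg counts the word itself, as stated in step (3), which is equivalent to adding the phrase length for every phrase containing the word:

```python
from collections import defaultdict

# Candidate phrases from the example, already tokenized with stop words removed.
phrases = [["系統"], ["聲音"], ["系統", "托盤"], ["音量", "小喇叭", "圖標", "不見"]]

deg = defaultdict(int)   # co-occurrence degree (the word itself is counted)
freq = defaultdict(int)  # number of occurrences in the text
for phrase in phrases:
    for word in phrase:
        freq[word] += 1
        deg[word] += len(phrase)  # word co-occurs with every word in its phrase

score = {word: deg[word] / freq[word] for word in deg}
phrase_score = {"".join(p): sum(score[w] for w in p) for p in phrases}
for phrase, s in sorted(phrase_score.items(), key=lambda kv: -kv[1]):
    print(phrase, s)
# 音量小喇叭圖標不見 16.0
# 系統托盤 3.5
# 系統 1.5
# 聲音 1.0
```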

Code Implementation


import string

from typing import Dict, List, Set, Tuple

PUNCTUATION = string.punctuation.replace('\'', '')  # Do not use apostrophe as a delimiter

ENGLISH_WORDS_STOPLIST: List[str] = [
    '(', ')', 'and', 'of', 'the', 'amongst', 'with', 'from', 'after', 'its', 'it', 'at', 'is',
    'this', ',', '.', 'be', 'in', 'that', 'an', 'other', 'than', 'also', 'are', 'may', 'suggests',
    'all', 'where', 'most', 'against', 'more', 'have', 'been', 'several', 'as', 'before',
    'although', 'yet', 'likely', 'rather', 'over', 'a', 'for', 'can', 'these', 'considered',
    'used', 'types', 'given', 'precedes',
]


def split_to_tokens(text: str) -> List[str]:
    '''
    Split a text string into tokens.
    Behavior is similar to str.split(),
    but empty strings are omitted and punctuation marks are separated from words.
    Example:
    split_to_tokens('John     said 'Hey!' (and some other words.)') ->
    -> ['John', 'said', ''', 'Hey', '!', ''', '(', 'and', 'some', 'other', 'words', '.', ')']
    '''
    result = []
    for item in text.split():
        # Strip leading punctuation marks into separate tokens.
        while item and item[0] in PUNCTUATION:
            result.append(item[0])
            item = item[1:]
        if not item:  # the whole token was punctuation
            continue
        # Count trailing punctuation marks.
        i = 0
        for i in range(len(item)):
            if item[-i - 1] not in PUNCTUATION:
                break
        if i == 0:
            result.append(item)
        else:
            result.append(item[:-i])
            result.extend(item[-i:])
    return [item for item in result if item]


def split_tokens_to_phrases(tokens: List[str], stoplist: List[str] = None) -> List[str]:
    """
    Merge tokens into phrases, delimited by items from the stoplist.
    A phrase is a sequence of tokens with the following properties:
    - the phrase contains 1 or more tokens
    - the tokens of a phrase are adjacent in the token list
    - the phrase does not contain delimiters from the stoplist
    - the previous token (not in the phrase) belongs to the stoplist, or the phrase starts the token list
    - the next token (not in the phrase) belongs to the stoplist, or the phrase ends the token list
    Example:
    split_tokens_to_phrases(
        tokens=['Mary', 'and', 'John', ',', 'some', 'words', '(', 'and', 'other', 'words', ')'],
        stoplist=['and', ',', '.', '(', ')']) ->
    -> ['Mary', 'John', 'some words', 'other words']
    """
    if stoplist is None:
        stoplist = ENGLISH_WORDS_STOPLIST
    # Copy instead of += so the caller's list (or the module-level stoplist) is not mutated.
    stoplist = stoplist + list(PUNCTUATION)

    current_phrase: List[str] = []
    all_phrases: List[str] = []
    stoplist_set: Set[str] = {stopword.lower() for stopword in stoplist}
    for token in tokens:
        if token.lower() in stoplist_set:
            if current_phrase:
                all_phrases.append(' '.join(current_phrase))
            current_phrase = []
        else:
            current_phrase.append(token)
    if current_phrase:
        all_phrases.append(' '.join(current_phrase))
    return all_phrases


def get_cooccurrence_graph(phrases: List[str]) -> Dict[str, Dict[str, int]]:
    """
    Get a graph that stores co-occurrences of tokens in phrases.
    The matrix is stored as a dict,
    where the key is a token and the value is a dict (key is the second token, value is the co-occurrence count).
    Example:
    get_cooccurrence_graph(['Mary', 'John', 'some words', 'other words']) -> {
        'mary': {'mary': 1},
        'john': {'john': 1},
        'some': {'some': 1, 'words': 1},
        'words': {'some': 1, 'words': 2, 'other': 1},
        'other': {'other': 1, 'words': 1}
    }
    """
    graph: Dict[str, Dict[str, int]] = {}
    for phrase in phrases:
        phrase_tokens = phrase.lower().split()  # tokenize once per phrase
        for first_token in phrase_tokens:
            for second_token in phrase_tokens:
                if first_token not in graph:
                    graph[first_token] = {}
                graph[first_token][second_token] = graph[first_token].get(second_token, 0) + 1
    return graph


def get_degrees(cooccurrence_graph: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """
    Get degrees for all tokens by cooccurrence graph.
    Result is stored as dict,
    where key is token, value is degree (sum of lengths of phrases that contain the token).
    Example:
    get_degrees(
        {
            'mary': {'mary': 1},
            'john': {'john': 1},
            'some': {'some': 1, 'words': 1},
            'words': {'some': 1, 'words': 2, 'other': 1},
            'other': {'other': 1, 'words': 1}
        }
    ) -> {'mary': 1, 'john': 1, 'some': 2, 'words': 4, 'other': 2}
    """
    return {token: sum(cooccurrence_graph[token].values()) for token in cooccurrence_graph}


def get_frequencies(cooccurrence_graph: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """
    Get frequencies for all tokens by cooccurrence graph.
    Result is stored as dict,
    where key is token, value is frequency (number of times the token occurs).
    Example:
    get_frequencies(
        {
            'mary': {'mary': 1},
            'john': {'john': 1},
            'some': {'some': 1, 'words': 1},
            'words': {'some': 1, 'words': 2, 'other': 1},
            'other': {'other': 1, 'words': 1}
        }
    ) -> {'mary': 1, 'john': 1, 'some': 1, 'words': 2, 'other': 1}
    """
    return {token: cooccurrence_graph[token][token] for token in cooccurrence_graph}


def get_ranked_phrases(phrases: List[str], *,
                       degrees: Dict[str, int],
                       frequencies: Dict[str, int]) -> List[Tuple[str, float]]:
    """
    Get RAKE measure for every phrase.
    Result is stored as a list of tuples; each tuple consists of a phrase and its RAKE measure.
    Items are sorted in non-ascending order by RAKE measure, then alphabetically by phrase.
    """
    processed_phrases: Set[str] = set()
    ranked_phrases: List[Tuple[str, float]] = []
    for phrase in phrases:
        lowered_phrase = phrase.lower()
        if lowered_phrase in processed_phrases:
            continue
        score: float = sum(degrees[token] / frequencies[token] for token in lowered_phrase.split())
        ranked_phrases.append((lowered_phrase, round(score, 2)))
        processed_phrases.add(lowered_phrase)
    # Sort by score, then by phrase alphabetically.
    ranked_phrases.sort(key=lambda item: (-item[1], item[0]))
    return ranked_phrases


def rake_text(text: str) -> List[Tuple[str, float]]:
    """
    Get RAKE measure for every phrase in text string.
    Result is stored as a list of tuples; each tuple consists of a phrase and its RAKE measure.
    Items are sorted in non-ascending order by RAKE measure, then alphabetically by phrase.
    """
    tokens: List[str] = split_to_tokens(text)
    phrases: List[str] = split_tokens_to_phrases(tokens)
    cooccurrence: Dict[str, Dict[str, int]] = get_cooccurrence_graph(phrases)
    degrees: Dict[str, int] = get_degrees(cooccurrence)
    frequencies: Dict[str, int] = get_frequencies(cooccurrence)
    ranked_result: List[Tuple[str, float]] = get_ranked_phrases(phrases, degrees=degrees, frequencies=frequencies)
    return ranked_result

Example run:

if __name__ == '__main__':
    text = 'Mercy-class includes USNS Mercy and USNS Comfort hospital ships. Credit: US Navy photo Mass Communication Specialist 1st Class Jason Pastrick. The US Naval Air Warfare Center Aircraft Division (NAWCAD) Lakehurst in New Jersey is using an additive manufacturing process to make face shields.........'
    ranked_result = rake_text(text)
    print(ranked_result)

The extracted key phrases are as follows:

[
  ('additive manufacturing process to make face shields.the 3d printing face shields', 100.4),
  ('us navy photo mass communication specialist 1st class jason pastrick', 98.33),
  ('us navy’s mercy-class hospital ship usns comfort.currently stationed', 53.33),
  ...
]

Code from: https://github.com/eeeeeeeelias/nlp-rake
