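Machine Learning (11): A Small News Summarization Example

Note: this post is based on an existing tutorial.

We implement a relatively simple "keyword extraction" algorithm to gain a first understanding of natural language processing.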

Vocabulary data download:

http://labfile.oss.aliyuncs.com/courses/741/nltk_data.tar.gz

Alternatively, the data can be downloaded with the following:

import nltk
nltk.download('stopwords')
nltk.download('punkt')

Sample test data download:

http://labfile.oss.aliyuncs.com/courses/741/news.txt

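nltk.tokenize is the tokenization package provided by NLTK. Tokenization is the process of splitting a paragraph into sentences and a sentence into individual words. The sent_tokenize() function we import splits a paragraph into sentences, and word_tokenize() splits a sentence into words.

stopwords is a corpus of words that occur very frequently in English, such as am, is, and are; stopwords.words('english') returns them as a list.

defaultdict is a dictionary container that supplies a default value for missing keys.

punctuation is a string containing the ASCII punctuation characters and symbols.

nlargest() quickly finds the n largest elements in a container.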
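As a quick illustration of the three standard-library helpers described above, here is a minimal sketch. The token list is made up for demonstration and is not taken from news.txt.

```python
from collections import defaultdict
from heapq import nlargest
from string import punctuation

# Hypothetical token list, for illustration only
tokens = ["apple", ",", "banana", "apple", ".", "cherry", "apple", "banana"]

# defaultdict(int) returns 0 for a missing key, so no existence check is needed
counts = defaultdict(int)
for tok in tokens:
    # punctuation is a string, so "in" tests whether tok occurs inside it
    if tok not in punctuation:
        counts[tok] += 1

# With key=counts.get, nlargest ranks the dictionary keys by their counts
top_two = nlargest(2, counts, key=counts.get)
print(top_two)  # ['apple', 'banana']
```

The same nlargest(n, d, key=d.get) pattern is what the rank() function in the full code relies on.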

Import these packages:

from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

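Basic idea: the sentences that contain the most keywords are the most important ones. We sort the sentences by how many keywords they contain and take the top n to form the summary.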
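The idea can be sketched on toy data before looking at the full program. The sentences and the keyword set below are hypothetical, and the sketch simply counts keywords per sentence, whereas the full code sums normalized word frequencies.

```python
from heapq import nlargest

# Hypothetical pre-tokenized sentences and keyword set, for illustration only
sentences = [
    ["the", "market", "rose", "today"],
    ["market", "analysts", "expect", "growth", "in", "the", "market"],
    ["it", "rained"],
]
keywords = {"market", "growth", "analysts"}

def keyword_count(tokens):
    """Count how many tokens of a sentence are keywords."""
    return sum(1 for tok in tokens if tok in keywords)

# Rank sentence indices by keyword count and keep the top n
n = 2
top_idx = nlargest(n, range(len(sentences)), key=lambda i: keyword_count(sentences[i]))
print(top_idx)  # [1, 0]
```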

Full code:

from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
# Define the constants we need
stopwords = set(stopwords.words('english')+list(punctuation))
max_cut = 0.9
min_cut = 0.1
def compute_frequencies(word_sent):
    freq = defaultdict(int)
    for s in word_sent:
        for word  in s:
            if word not in stopwords:
                freq[word] += 1
    m = float(max(freq.values()))
    for w in list(freq.keys()):
        freq[w] = freq[w]/m
        if freq[w] >= max_cut or freq[w] <= min_cut:
            # Drop words that are too frequent or too rare from the dictionary
            del freq[w]
    return freq
def summarize(text,n):
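    # Split the text into sentences
    sents = sent_tokenize(text)
    # Make sure the text has at least n sentences
    assert n<=len(sents)
    # Split each sentence into lowercase words
    word_sent = [word_tokenize(s.lower()) for s in sents]
    # Compute normalized word frequencies; freq[w] is the relative frequency of w
    freq = compute_frequencies(word_sent)
    # Sentence scores keyed by sentence index, defaulting to 0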
    ranking = defaultdict(int)
    for i,word in enumerate(word_sent):
        for w in word:
            if w in freq:
                ranking[i] += freq[w]
    sents_idx = rank(ranking,n)
    return [sents[j] for j in sents_idx]
def rank(ranking,n):
    # Return the n keys (sentence indices) with the highest scores
    return nlargest(n,ranking,key=ranking.get)
if __name__ == '__main__':
    with open("news.txt", "r") as myfile:
        # Join lines with a space so words on adjacent lines are not glued together
        text = myfile.read().replace('\n',' ')
    res = summarize(text, 2)
    for sent in res:
        print(sent)

Output:

This method simply adds up word importance, which gives long sentences an advantage.
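The length bias can be seen on a small example. The word weights and sentences below are hypothetical, and dividing by sentence length is only one possible mitigation, not something the original tutorial does.

```python
# Hypothetical word weights and pre-tokenized sentences, for illustration only
freq = {"economy": 0.8, "growth": 0.3, "bank": 0.3, "rates": 0.3}

short_sent = ["economy", "slows"]
long_sent = ["bank", "rates", "and", "growth", "were", "discussed", "at", "length"]

def total_score(tokens):
    """Sum of word weights, as in the summing approach described above."""
    return sum(freq.get(tok, 0.0) for tok in tokens)

def normalized_score(tokens):
    """Sum of word weights divided by sentence length (one possible mitigation)."""
    return total_score(tokens) / len(tokens) if tokens else 0.0

# The long sentence wins on the raw sum even though its words are less important
print(total_score(short_sent) < total_score(long_sent))            # True
# After dividing by length, the short sentence ranks higher
print(normalized_score(short_sent) > normalized_score(long_sent))  # True
```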

Next, we use the TextRank algorithm to extract the news summary. TextRank adapts the PageRank algorithm so that it can compute the importance of each sentence.
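For reference, the two quantities that the TextRank code in this post computes can be written as follows. The notation is introduced here for illustration: $S_i$ is the token list of sentence $i$, $w_{ji}$ is the similarity stored in the graph, $WS(V_i)$ is the score of sentence $i$, and $d$ is the damping factor. The formulas mirror calculate_similarity and calculate_score rather than quoting any external source.

```latex
\[
\mathrm{Sim}(S_i, S_j) =
  \frac{\lvert \{\, w \in S_i : w \in S_j \,\} \rvert}
       {\log \lvert S_i \rvert + \log \lvert S_j \rvert}
\]
\[
WS(V_i) = (1 - d) + d \sum_{j \neq i}
  \frac{w_{ji}}{\sum_{k} w_{jk}} \, WS(V_j),
  \qquad d = 0.85
\]
```

The scores are updated repeatedly until no score changes by 0.0001 or more between two passes.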

The code is as follows:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import math
from itertools import product, count
from string import punctuation
from heapq import nlargest
stopwords = set(stopwords.words('english') + list(punctuation))
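def calculate_similarity(sen1,sen2):
    # Empty sentences (log 0) or two one-word sentences (zero denominator) get similarity 0
    if len(sen1) * len(sen2) <= 1:
        return 0.0
    # Count the words of sen1 that also appear in sen2
    counter =0
    for word in sen1:
        if word in sen2:
            counter +=1
    return counter/(math.log(len(sen1))+math.log(len(sen2)))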
def create_graph(word_sent):
    num = len(word_sent)
    board = [[0.0 for _ in range(num)] for _ in range(num)]
    for i,j in product(range(num),repeat=2):
        if i != j:
            board[i][j] = calculate_similarity(word_sent[i],word_sent[j])
    return board
def weighted_pagerank(weight_graph):
    scores = [0.5 for _ in range(len(weight_graph))]
    old_scores = [0.0 for _ in range(len(weight_graph))]
    while different(scores,old_scores):
        for i in range(len(weight_graph)):
            old_scores[i] = scores[i]
        for  i in range(len(weight_graph)):
            scores[i] = calculate_score(weight_graph,scores,i)
    return scores
def different(scores,old_scores):
    flag = False
    for  i in range(len(scores)):
        if math.fabs(scores[i]-old_scores[i]) >= 0.0001:
            flag = True
            break
    return flag
def calculate_score(weight_graph,scores,i):
    length = len(weight_graph)
    d = 0.85
    added_score = 0.0
    for j in range(length):
        # Total outgoing weight of sentence j
        denominator = sum(weight_graph[j])
        # Skip sentences that share no words with any other sentence
        if denominator == 0:
            continue
        added_score += weight_graph[j][i] * scores[j] / denominator
    weight_score = (1-d)+d*added_score
    return weight_score
def Summarize(text,n):
    sents = sent_tokenize(text)
    # Tokenize each sentence in lowercase and drop stopwords and punctuation
    # (built as a new list, because removing items while iterating skips elements)
    word_sent = [[w for w in word_tokenize(s.lower()) if w not in stopwords] for s in sents]
    similarity_graph = create_graph(word_sent)
    scores = weighted_pagerank(similarity_graph)
    sent_selected = nlargest(n,zip(scores,count()))
    sent_index=[]
    for i in range(n):
        sent_index.append(sent_selected[i][1])
    return [sents[i] for i in sent_index]
if __name__ =='__main__':
    with open("news.txt","r") as  myfile:
        # Join lines with a space so words on adjacent lines are not glued together
        text = myfile.read().replace('\n',' ')
    print(Summarize(text,2))

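The result is as follows:

The sentences selected differ from those of the previous method, although neither method seems to have found the topic sentence.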
