Automatic Summary Extraction

1. Extracting keywords with TF-IDF

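TF-IDF stands for Term Frequency - Inverse Document Frequency. It has two parts, TF and IDF. TF is the term frequency, i.e. the number of times a word occurs in the document.
IDF is the inverse document frequency. In the formulation used by the reference cited at the end of this section, it is

$$\mathrm{IDF}(w) = \log\frac{N}{n_w + 1}$$

where $N$ is the total number of documents in the corpus and $n_w$ is the number of documents that contain the word $w$.

TF-IDF is the product of the two:

$$\text{TF-IDF}(w) = \mathrm{TF}(w) \times \mathrm{IDF}(w)$$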

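Keyword extraction with TF-IDF works as follows: segment the text into words, compute the TF-IDF value of every word, sort the words in descending order of that value, and take the top few.
Reference: Ruan Yifeng, Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction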
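
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the corpus is a hypothetical toy corpus that is already segmented into words, TF is the raw count, and IDF uses the smoothed form log(N / (1 + n_w)). The function names `tf_idf` and `top_keywords` are made up for this sketch and are not part of any library.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {word: TF-IDF} for one tokenized document.

    TF is the raw count of the word in the document.
    IDF = log(N / (1 + n_w)), where N is the number of documents in the
    corpus and n_w is the number of documents that contain the word.
    """
    n_docs = len(corpus)
    scores = {}
    for word, count in Counter(doc_tokens).items():
        n_containing = sum(1 for doc in corpus if word in doc)
        scores[word] = count * math.log(n_docs / (1 + n_containing))
    return scores


def top_keywords(doc_tokens, corpus, k=3):
    """Sort words by descending TF-IDF (ties broken alphabetically), keep k."""
    scores = tf_idf(doc_tokens, corpus)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["claim", "package", "damaged", "claim"],
    ["package", "delivered", "today"],
    ["package", "pickup", "time"],
    ["invoice", "package", "address"],
]
print(top_keywords(corpus[0], corpus, k=2))  # ['claim', 'damaged']
```

Note that with this smoothing a word that occurs in every document ("package" here) gets a slightly negative IDF, so it sinks to the bottom of the ranking.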

2. Summary extraction

2.1 Summary extraction based on keyword matching

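The idea behind the algorithm comes from Ruan Yifeng, Applications of TF-IDF and Cosine Similarity (3): Automatic Summarization.
Our task is to extract a summary from dialogue text.
Sample dialogue text (the full record is given in the appendix):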

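(1) Text preprocessing
During preprocessing we filter out short utterances, using a minimum length of 7 characters. Each remaining utterance in the dialogue is treated as one sentence.
(2) Keyword extraction with TF-IDF
We call the jieba interface directly and extract 6 keywords.
(3) Sentence matching
Each keyword is matched against the sentences, and only the first sentence in which the keyword appears is kept. In this first version at most 3 sentences are extracted.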
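
To make step (3) concrete, here is a minimal sketch of the first-match rule on its own, without jieba. The sentences and keywords are hypothetical, and `first_match_summary` is a name invented for this sketch. Unlike a set, the list used here keeps the sentences in keyword order.

```python
def first_match_summary(sentences, keywords, max_sentences=3):
    """For each keyword in order, keep the first sentence that contains it."""
    summary = []
    for word in keywords:
        # First sentence containing the keyword, or None if there is none.
        match = next((s for s in sentences if word in s), None)
        if match is not None and match not in summary:
            summary.append(match)
        if len(summary) == max_sentences:
            break
    return summary


# Hypothetical utterances and keywords, for illustration only.
sentences = [
    "Hello, I am calling about the claim for the damaged tin.",
    "The outer packaging was contaminated during transport.",
    "The company has offered 2000 yuan for the claim.",
    "I cannot accept this and I want to file a complaint.",
]
keywords = ["claim", "contaminated", "company", "complaint"]
print(first_match_summary(sentences, keywords))
```

With these inputs the loop stops after the third keyword, so the fourth sentence is never reached. A keyword that matches no sentence simply contributes nothing.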

import ast
import jieba
import jieba.analyse

def search_sentences(sentences, word):
    """Return the first sentence that contains word, or None if there is none."""
    for sentence in sentences:
        # Plain substring test; re.search(word, ...) would misread regex metacharacters.
        if word in sentence:
            return sentence
    return None

file = "./data/mendian_class1/司內投訴.txt"
with open(file, "r", encoding="utf-8") as f:
    lines = f.readlines()

for line in lines:
    if "|" not in line:
        continue  # skip blank or malformed lines
    # Each line is "<audio file name>|<dict>". The dict is written with single
    # quotes (Python style), so parse it with ast.literal_eval instead of
    # patching the quotes for json.loads.
    res = ast.literal_eval(line.split("|", 1)[1].strip())
    sentences_list = []
    sentences = ""
    if res and res.get("sentences"):
        for sent in res["sentences"]:
            tex = sent["text"]
            # Drop utterances shorter than 7 characters.
            if len(tex) >= 7:
                sentences_list.append(tex)
                sentences += tex

    # TF-IDF keywords, restricted to nouns, verbal nouns and verbs.
    hotwords = jieba.analyse.extract_tags(sentences, topK=6, allowPOS=('n', 'vn', 'v'))
    print(hotwords)

    # For each keyword keep the first sentence that contains it, at most 3 in total.
    set_summary_sentences = set()
    for word in hotwords:
        first_match_sentence = search_sentences(sentences_list, word)
        if first_match_sentence is not None:
            set_summary_sentences.add(first_match_sentence)
        if len(set_summary_sentences) == 3:
            break

    print("set_summary_sentences:", set_summary_sentences)
    # A set has no defined order, so the sentence order in the summary may vary.
    summary = ""
    for sentence in set_summary_sentences:
        summary = summary + " " + sentence + "/"
    print("summary:", summary)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiang\AppData\Local\Temp\jieba.cache
Loading model cost 1.280 seconds.
Prefix dict has been built succesfully.
['理賠', '污染', '保價', '肯定', '不了', '公司']
set_summary_sentences: {'喂，你好，就是關於之前的一個胡蘿蔔住兩天胡蘿蔔素，那個理賠呃，現在公司逐步給到的一個金額，是理賠，並是2000元。', '他這個是量筒的話有一同學壞的嗎，換的話就是因爲我們這邊理賠的，是按照保價比例啊，還有壞掉一個程度，所以說是理賠的異同的那個成本是2000元。', '如果說是外面污染被污染掉的話，這個清洗完了之後，它裏面的東西，實際上也沒有，會不會受到影響嗎？'}
summary:  喂，你好，就是關於之前的一個胡蘿蔔住兩天胡蘿蔔素，那個理賠呃，現在公司逐步給到的一個金額，是理賠，並是2000元。/ 他這個是量筒的話有一同學壞的嗎，換的話就是因爲我們這邊理賠的，是按照保價比例啊，還有壞掉一個程度，所以說是理賠的異同的那個成本是2000元。/ 如果說是外面污染被污染掉的話，這個清洗完了之後，它裏面的東西，實際上也沒有，會不會受到影響嗎？/

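As the output shows, the extracted summary sentences are long and rather messy, so we refined the approach.
(1) Text preprocessing
During preprocessing we filter out short utterances, using a minimum length of 7 characters. Each remaining utterance is then split into short clauses at commas, and clauses that are too short are filtered out as well (the code drops clauses of 5 characters or fewer). Each remaining clause is treated as one sentence.
(2) Keyword extraction with TF-IDF
We call the jieba interface directly and extract 6 keywords.
(3) Sentence matching
Each keyword is matched against the sentences, and only the first sentence in which the keyword appears is kept. At most 5 sentences are extracted.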
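
The clause-splitting step in (1) can be tried in isolation. The sketch below is an assumption-laden simplification: `split_into_clauses` and its two threshold parameters are names invented here, it splits on both full-width and ASCII commas, and it measures clause length after stripping sentence-final punctuation. The second utterance is taken from the appendix record.

```python
import re


def split_into_clauses(utterances, min_utterance_len=7, min_clause_len=5):
    """Split dialogue utterances into short comma-delimited clauses."""
    clauses = []
    for text in utterances:
        if len(text) < min_utterance_len:
            continue  # skip very short utterances such as "嗯"
        for clause in re.split(r"[，,]", text):
            # Remove sentence-final punctuation before measuring the length.
            clause = re.sub(r"[？?。.]", "", clause).strip()
            if len(clause) >= min_clause_len:
                clauses.append(clause)
    return clauses


utterances = [
    "嗯",
    "如果說是外面污染被污染掉的話，這個清洗完了之後，它裏面的東西，實際上也沒有，會不會受到影響嗎？",
]
for clause in split_into_clauses(utterances):
    print(clause)
```

The one-character utterance is dropped and the long utterance yields five clauses, which is the granularity the matching step then works on.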

import ast
import time
import jieba
import jieba.analyse

# Build the stop word collection (a set, for fast membership tests).
def stop_word_list(path):
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

# Load the stop words once instead of re-reading the file for every clause.
STOPWORDS = stop_word_list("./data/stop_words.txt")

# Preprocess text: segment it and drop stop words; returns space-separated words.
def preprocess(text):
    words = [w for w in jieba.cut(text.strip()) if w not in STOPWORDS and w != '\t']
    return " ".join(words)

def search_sentences(sentences, word):
    """Return the first sentence that contains word, or None if there is none."""
    for sentence in sentences:
        # Plain substring test; re.search(word, ...) would misread regex metacharacters.
        if word in sentence:
            return sentence
    return None

def sentence_preprocess(res):
    # The record is a Python-style dict with single quotes, so use
    # ast.literal_eval instead of patching the quotes for json.loads.
    res = ast.literal_eval(res.strip())
    clauses = []
    if res and res.get("sentences"):
        for sent in res["sentences"]:
            tex = sent["text"]
            # Drop utterances shorter than 7 characters, then split at full-width commas.
            if len(tex) >= 7:
                clauses.extend(tex.split("，"))

    text_old = []       # original clauses, used to build the summary
    text_all_list = []  # segmented clauses with stop words removed
    for tex in clauses:
        # Drop clauses of 5 characters or fewer.
        if len(tex) <= 5:
            continue
        # Strip sentence-final punctuation.
        for mark in ('？', '?', '。', '.'):
            tex = tex.replace(mark, '')
        text_old.append(tex)
        text_all_list.append(preprocess(tex))
    return text_old, text_all_list

def hotwords_extract(text_all_list):
    # Use a custom IDF table, then extract the top 6 nouns, verbal nouns and verbs.
    jieba.analyse.set_idf_path("./data/idf.txt")
    text_preprocess = " ".join(text_all_list)
    hotwords = jieba.analyse.extract_tags(text_preprocess, topK=6, allowPOS=('n', 'vn', 'v'))
    return hotwords

def summarizer(text_dic):
    text_old, text_all_list = sentence_preprocess(text_dic)
    hotwords = hotwords_extract(text_all_list)
    print("hotwords:", hotwords)
    # For each keyword keep the first clause that contains it, at most 5 in total.
    set_summary_sentences = set()
    for word in hotwords:
        first_match_sentence = search_sentences(text_old, word)
        if first_match_sentence is not None:
            set_summary_sentences.add(first_match_sentence)
        if len(set_summary_sentences) == 5:
            break

    # A set has no defined order, so the clause order in the summary may vary.
    summary = ""
    for sentence in set_summary_sentences:
        summary = summary + " " + sentence + "。"
    return summary

if __name__ == "__main__":
    file = "./data/mendian_class1/司內投訴.txt"
    with open(file, "r", encoding="utf-8") as f:
        lines = f.readlines()
    start_time = time.time()
    count = 0
    for line in lines:
        if "|" not in line:
            continue  # skip blank or malformed lines
        count += 1
        # Each line is "<audio file name>|<dict>"; split only at the first "|".
        res = line.split("|", 1)[1]
        summary = summarizer(res)
        print("summary:", summary)
        print("---------------------------------------")
    end_time = time.time()
    test_time = end_time - start_time
    print("count:", count)
    print("test_time:", test_time)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiang\AppData\Local\Temp\jieba.cache
Loading model cost 0.729 seconds.
Prefix dict has been built succesfully.
hotwords: ['理賠', '污染', '公司', '沒有', '肯定', '投訴']
summary:  實際上也沒有。 現在公司逐步給到的一個金額。 如果說是外面污染被污染掉的話。 當然配起來肯定不一樣。 換的話就是因爲我們這邊理賠的。
---------------------------------------
count: 1
test_time: 0.945472240447998

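This result is acceptable, but it still picks up some meaningless clauses. The next section refines the approach further.

2.2 Summary extraction based on keyword scoring

(1) Text preprocessing
During preprocessing we filter out short utterances, using a minimum length of 7 characters. Each remaining utterance is then split into short clauses at commas, and clauses that are too short are filtered out as well (the code drops clauses of 5 characters or fewer). Each remaining clause is treated as one sentence.
(2) Keyword extraction with TF-IDF
We call the jieba interface directly and extract 6 keywords.
(3) Sentence scoring
Each sentence is scored against the keyword list: score = number of keywords contained in the sentence / number of words in the sentence.
(4) Sentence selection
The sentences are sorted by score in descending order and the top-ranked ones are taken, at most 5.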
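
Steps (3) and (4) can be illustrated without jieba or NumPy. In this sketch the sentences are hypothetical and already segmented, and `score_sentences` and `top_sentences` are names invented for the example. An empty sentence is scored 0 so that the division is always defined.

```python
def score_sentences(tokenized_sentences, keywords):
    """Score = number of keyword tokens / number of tokens in the sentence."""
    keyword_set = set(keywords)
    scores = []
    for tokens in tokenized_sentences:
        if not tokens:
            scores.append(0.0)  # avoid dividing by zero for empty sentences
            continue
        hits = sum(1 for token in tokens if token in keyword_set)
        scores.append(hits / len(tokens))
    return scores


def top_sentences(tokenized_sentences, keywords, k=5):
    """Indices of the k best-scoring sentences, skipping zero scores."""
    scores = score_sentences(tokenized_sentences, keywords)
    # sorted() is stable, so equal scores keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [i for i in ranked[:k] if scores[i] > 0]


# Hypothetical sentences, already segmented into words.
sentences = [
    ["outside", "contaminated"],
    ["hello"],
    ["claim", "amount", "company", "offered"],
    [],
    ["file", "complaint"],
]
keywords = ["claim", "contaminated", "company", "complaint"]
print(score_sentences(sentences, keywords))     # [0.5, 0.0, 0.5, 0.0, 0.5]
print(top_sentences(sentences, keywords, k=3))  # [0, 2, 4]
```

Because the score is normalised by sentence length, a short clause with one keyword can rank as high as a longer clause with two, which is why this method tends to favour short clauses.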

import ast
import time
import jieba
import jieba.analyse
import numpy as np

# Build the stop word collection (a set, for fast membership tests).
def stop_word_list(path):
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

# Load the stop words once instead of re-reading the file for every clause.
STOPWORDS = stop_word_list("data/stop_words.txt")

# Preprocess text: segment it and drop stop words; returns space-separated words.
def preprocess(text):
    words = [w for w in jieba.cut(text.strip()) if w not in STOPWORDS and w != '\t']
    return " ".join(words)

def sentence_preprocess(res):
    # The record is a Python-style dict with single quotes, so use
    # ast.literal_eval instead of patching the quotes for json.loads.
    res = ast.literal_eval(res.strip())
    clauses = []
    if res and res.get("sentences"):
        for sent in res["sentences"]:
            tex = sent["text"]
            # Drop utterances shorter than 7 characters, then split at full-width commas.
            if len(tex) >= 7:
                clauses.extend(tex.split("，"))

    text_old = []       # original clauses, used to build the summary
    text_all_list = []  # segmented clauses with stop words removed
    for tex in clauses:
        # Drop clauses of 5 characters or fewer.
        if len(tex) <= 5:
            continue
        # Strip sentence-final punctuation.
        for mark in ('？', '?', '。', '.'):
            tex = tex.replace(mark, '')
        text_old.append(tex)
        text_all_list.append(preprocess(tex))
    return text_old, text_all_list

def hotwords_extract(text_all_list):
    # Use a custom IDF table, then extract the top 6 nouns, verbal nouns and verbs.
    jieba.analyse.set_idf_path("./data/idf.txt")
    text_preprocess = " ".join(text_all_list)
    hotwords = jieba.analyse.extract_tags(text_preprocess, topK=6, allowPOS=('n', 'vn', 'v'))
    return hotwords

def summarizer(text_dic):
    text_old, text_all_list = sentence_preprocess(text_dic)
    hotwords = hotwords_extract(text_all_list)
    print("hotwords:", hotwords)
    # Score of a clause = number of keyword tokens / number of tokens.
    weight_list = []
    for line in text_all_list:
        words = line.split()
        count = sum(1 for word in words if word in hotwords)
        # A clause made up only of stop words has no tokens left; score it 0
        # instead of dividing by zero.
        weight_list.append(count / len(words) if words else 0.0)
    weight_arr = np.array(weight_list)
    # Clause indices sorted by descending score.
    weight_idx = np.argsort(-weight_arr, kind="heapsort")
    # Keep at most 5 clauses, and only those that contain a keyword.
    set_summary_sentences = set()
    for idx in weight_idx[:5]:
        if weight_arr[idx] > 0:
            set_summary_sentences.add(text_old[idx])
            print(text_old[idx])
    # A set has no defined order, so the clause order in the summary may vary.
    summary = ""
    for sentence in set_summary_sentences:
        summary = summary + " " + sentence + "。"
    return summary

if __name__ == "__main__":
    file = "./data/mendian_class1/司內投訴.txt"
    with open(file, "r", encoding="utf-8") as f:
        lines = f.readlines()
    start_time = time.time()
    count = 0
    for line in lines:
        if "|" not in line:
            continue  # skip blank or malformed lines
        count += 1
        # Each line is "<audio file name>|<dict>"; split only at the first "|".
        res = line.split("|", 1)[1]
        summary = summarizer(res)
        print("summary:", summary)
        print("---------------------------------------")
    end_time = time.time()
    test_time = end_time - start_time
    print("count:", count)
    print("test_time:", test_time)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiang\AppData\Local\Temp\jieba.cache
Loading model cost 0.699 seconds.
Prefix dict has been built succesfully.
hotwords: ['理賠', '污染', '公司', '沒有', '肯定', '投訴']
外面污染了裏面沒有關係
沒有破損的話
對損壞成污染的話確實理賠起來理賠
我可以投訴吧
外面被污染了
summary:  沒有破損的話。 我可以投訴吧。 對損壞成污染的話確實理賠起來理賠。 外面被污染了。 外面污染了裏面沒有關係。
---------------------------------------
count: 1
test_time: 0.8766558170318604

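On this example the scoring approach works reasonably well: the selected clauses are short and stay on the topics of claims, contamination and complaints.

Appendix:
司內投訴.txt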

u113202_530505_9915317890623_20190228154110_FFFFB93C.mp3|{'sentences': [{'speech_rate': 455, 'emotion_value': 6.0, 'begin_time': 11800, 'silence_duration': 6, 'text': '喂，你好，哎，你好，我這裏的方式，羊絨女是嗎？', 'channel_id': 0, 'end_time': 14830}, {'speech_rate': 200, 'emotion_value': 6.0, 'begin_time': 15530, 'silence_duration': 0, 'text': '嗯', 'channel_id': 1, 'end_time': 15830}, {'speech_rate': 373, 'emotion_value': 6.0, 'begin_time': 15890, 'silence_duration': 0, 'text': '喂，你好，就是關於之前的一個胡蘿蔔住兩天胡蘿蔔素，那個理賠呃，現在公司逐步給到的一個金額，是理賠，並是2000元。', 'channel_id': 0, 'end_time': 25035}, {'speech_rate': 309, 'emotion_value': 6.0, 'begin_time': 25110, 'silence_duration': 0, 'text': '他這個是量筒的話有一同學壞的嗎，換的話就是因爲我們這邊理賠的，是按照保價比例啊，還有壞掉一個程度，所以說是理賠的異同的那個成本是2000元。', 'channel_id': 0, 'end_time': 38670}, {'speech_rate': 116, 'emotion_value': 6.0, 'begin_time': 39340, 'silence_duration': 0, 'text': '還痛', 'channel_id': 1, 'end_time': 40370}, {'speech_rate': 389, 'emotion_value': 6.0, 'begin_time': 41160, 'silence_duration': 0, 'text': '因爲另一種的話，像這種東西，我可以看到那個照片了，就說他用食品添加劑的話，也是一個金屬的基礎罐子，一個包裝也是有一定的，那個密封信嗎？', 'channel_id': 0, 'end_time': 51475}, {'speech_rate': 360, 'emotion_value': 6.0, 'begin_time': 51550, 'silence_duration': 0, 'text': '如果說是外面污染被污染掉的話，這個清洗完了之後，它裏面的東西，實際上也沒有，會不會受到影響嗎？', 'channel_id': 0, 'end_time': 59380}, {'speech_rate': 500, 'emotion_value': 6.0, 'begin_time': 61220, 'silence_duration': 1, 'text': '那關鍵，你賣不這樣，你怎麼辦？', 'channel_id': 1, 'end_time': 63020}, {'speech_rate': 400, 'emotion_value': 6.0, 'begin_time': 63990, 'silence_duration': 0, 'text': '嗯，中文就說', 'channel_id': 0, 'end_time': 64890}, {'speech_rate': 401, 'emotion_value': 6.0, 'begin_time': 64900, 'silence_duration': 0, 'text': '你這樣說，我那我媽不讓我怎麼辦呀', 'channel_id': 1, 'end_time': 67290}, {'speech_rate': 284, 'emotion_value': 6.0, 'begin_time': 67900, 'silence_duration': 0, 'text': '因爲我們這個理賠啊，理賠是賠付的貨物的一個實際損失就是說呃成本呢，是4501公斤，然後', 'channel_id': 0, 'end_time': 76975}, {'speech_rate': 379, 'emotion_value': 5.0, 'begin_time': 77100, 'silence_duration': 
0, 'text': '呃，5公斤的話，就是22005，呃，我們是會參考這個保障嘛，因爲', 'channel_id': 0, 'end_time': 82165}, {'speech_rate': 368, 'emotion_value': 6.0, 'begin_time': 82250, 'silence_duration': 0, 'text': '壞掉的壞掉的都沒換呢，那確實是也是有一定的差距', 'channel_id': 0, 'end_time': 86000}, {'speech_rate': 466, 'emotion_value': 5.0, 'begin_time': 86640, 'silence_duration': 0, 'text': '所以說目前的話，公司是這樣，給了一個金額。', 'channel_id': 0, 'end_time': 89340}, {'speech_rate': 244, 'emotion_value': 6.0, 'begin_time': 91040, 'silence_duration': 1, 'text': '你像不合理的', 'channel_id': 1, 'end_time': 92515}, {'speech_rate': 403, 'emotion_value': 6.0, 'begin_time': 92600, 'silence_duration': 0, 'text': '因你，你那個我現在等於紙箱和我都不能用了。', 'channel_id': 1, 'end_time': 95720}, {'speech_rate': 413, 'emotion_value': 6.0, 'begin_time': 96720, 'silence_duration': 1, 'text': '那你那個你那個，如果你說你那個能用的話，那我那個我賣給你', 'channel_id': 1, 'end_time': 100785}, {'speech_rate': 266, 'emotion_value': 6.0, 'begin_time': 100910, 'silence_duration': 0, 'text': '沒有人要的呀', 'channel_id': 1, 'end_time': 102260}, {'speech_rate': 397, 'emotion_value': 6.0, 'begin_time': 103370, 'silence_duration': 1, 'text': '那這個因爲我們這個理賠呀，也確實是賠付貨物的一個實際損失，如果說呃，好的跟壞的了，當然配起來肯定不一樣，如果壞的那種被戳破了，這個事情type嗎，被觸碰到裏面東西肯定是也不不能用了，所以說呃公司的話系統是可以那個按照的一個原價來ip錢給你，但是睡另外一頭', 'channel_id': 0, 'end_time': 122535}, {'speech_rate': 363, 'emotion_value': 6.0, 'begin_time': 122730, 'silence_duration': 0, 'text': '如果僅僅10萬，外面被污染了，它裏面的東西，實際上', 'channel_id': 0, 'end_time': 126855}, {'speech_rate': 348, 'emotion_value': 6.0, 'begin_time': 126930, 'silence_duration': 0, 'text': '沒有受損，就說它是有一個生命價值，所以說', 'channel_id': 0, 'end_time': 130370}, {'speech_rate': 250, 'emotion_value': 6.0, 'begin_time': 130380, 'silence_duration': 0, 'text': '那你，你那你，你憑什麼說他沒有被污染呀？', 'channel_id': 1, 'end_time': 135170}, {'speech_rate': 100, 'emotion_value': 6.0, 'begin_time': 135180, 'silence_duration': 0, 'text': '這種一個', 'channel_id': 0, 'end_time': 137570}, {'speech_rate': 407, 'emotion_value': 6.0, 'begin_time': 137580, 'silence_duration': 0, 'text': 
'就像我就像我給客戶說一下，我說我這個東西，我裏面是好的，沒有問題的，但是我憑什麼這樣說呢，那是外地怎麼辦呢？', 'channel_id': 1, 'end_time': 145530}, {'speech_rate': 323, 'emotion_value': 6.0, 'begin_time': 146280, 'silence_duration': 0, 'text': '他這種啊，就是說一個人，另，另一桶啊，類目是', 'channel_id': 0, 'end_time': 150355}, {'speech_rate': 371, 'emotion_value': 6.0, 'begin_time': 150650, 'silence_duration': 0, 'text': '首先是一個銅，是金屬的一個統稱，沒有位置，並被戳破啊，有沒有被刮傷啊，什麼的，就說只是因爲另外一種壞了，他，導致裏面的胡蘿蔔做露出來了，把那個重置打紅了。', 'channel_id': 0, 'end_time': 163080}, {'speech_rate': 327, 'emotion_value': 6.0, 'begin_time': 163090, 'silence_duration': 0, 'text': '我這兩天想了，我們這個東西，他是他是你，你懂，我這個東西賣出去的話我都是整，就是直接原廠包裝，這樣賣的，你這個你這個因爲你這個本身就是你們，你們運輸過程當中人爲操作的', 'channel_id': 1, 'end_time': 178285}, {'speech_rate': 287, 'emotion_value': 6.0, 'begin_time': 178360, 'silence_duration': 0, 'text': '對吧，你先讓我紙箱的貨', 'channel_id': 1, 'end_time': 180655}, {'speech_rate': 318, 'emotion_value': 6.0, 'begin_time': 180730, 'silence_duration': 0, 'text': '你一個是一個是一個是那個綴空了，另外一個也被污染了，你現在就說嗯，只賠出不了那個，然後另外一個，你說', 'channel_id': 1, 'end_time': 190135}, {'speech_rate': 335, 'emotion_value': 6.0, 'begin_time': 190210, 'silence_duration': 0, 'text': '外面污染了裏面沒有關係，這樣這樣子，我，我那個還有一萬的話，那個我也沒辦法用的。', 'channel_id': 1, 'end_time': 197370}, {'speech_rate': 311, 'emotion_value': 6.0, 'begin_time': 198060, 'silence_duration': 0, 'text': '那真那麗雲關，就是說包送這個備過案的，之後呢，呃和可以跟公司申請申請一下，就是說對被對針，對於這個外包裝嗎，再進行一部分補償', 'channel_id': 0, 'end_time': 209990}, {'speech_rate': 404, 'emotion_value': 6.0, 'begin_time': 210850, 'silence_duration': 0, 'text': '但是說，如果是另一桶的話，是按照也是按照那個原價ip長的話，公司這邊也不會認可。', 'channel_id': 0, 'end_time': 216780}, {'speech_rate': 329, 'emotion_value': 6.0, 'begin_time': 217500, 'silence_duration': 0, 'text': '那你這個你這個你這個我不認可的', 'channel_id': 1, 'end_time': 220230}, {'speech_rate': 93, 'emotion_value': 6.0, 'begin_time': 221530, 'silence_duration': 1, 'text': '嗯這個', 'channel_id': 0, 'end_time': 223455}, {'speech_rate': 408, 'emotion_value': 6.0, 'begin_time': 223530, 'silence_duration': 0, 'text': 
'因爲確實這樣也單次，我們這個理賠呀，也不是針對你這樣一個客戶，因爲我們對所有客戶標準，都是一樣的，讓類似的事情，我們之前也都', 'channel_id': 0, 'end_time': 232630}, {'speech_rate': 378, 'emotion_value': 6.0, 'begin_time': 232640, 'silence_duration': 0, 'text': '不用跟我姐你不用跟我講這些，我發你們德邦發太多了，而且你幫我們建損壞，這也不是第一次了，你們這是要講肯定我是接受不了的。', 'channel_id': 1, 'end_time': 242160}, {'speech_rate': 411, 'emotion_value': 7.0, 'begin_time': 243090, 'silence_duration': 0, 'text': '因爲我本身，我自己姓和我是全部損失了，你現在來來跟我說破的那一罐，賠給我，另外一個你教我怎麼弄呢', 'channel_id': 1, 'end_time': 250090}, {'speech_rate': 333, 'emotion_value': 7.0, 'begin_time': 250790, 'silence_duration': 0, 'text': '因爲在包裝', 'channel_id': 0, 'end_time': 251690}, {'speech_rate': 417, 'emotion_value': 6.0, 'begin_time': 251700, 'silence_duration': 0, 'text': '那你那你那你那你是買鞋的話，你另外一個洗壞了，另外一個鞋底做舊，我在我就陪你一直寫的，錢就直接任務的', 'channel_id': 1, 'end_time': 258890}, {'speech_rate': 268, 'emotion_value': 6.0, 'begin_time': 259220, 'silence_duration': 0, 'text': '這個這個還是有她有差別嗎？', 'channel_id': 0, 'end_time': 262125}, {'speech_rate': 305, 'emotion_value': 7.0, 'begin_time': 262200, 'silence_duration': 0, 'text': '他這個', 'channel_id': 0, 'end_time': 262790}, {'speech_rate': 422, 'emotion_value': 7.0, 'begin_time': 262800, 'silence_duration': 0, 'text': '我肯定我，我這個按包裝啊，包裝，它說我出廠包裝，你們就是兩萬了，就是你在一個被污染了，另外一個就是已經過了，我現在就是在相互，我都用不了了', 'channel_id': 1, 'end_time': 272600}, {'speech_rate': 453, 'emotion_value': 7.0, 'begin_time': 273390, 'silence_duration': 0, 'text': '而且，我搞價，我也不是綁在一罐，那我能不能說我只跑了這一罐的價格呢？', 'channel_id': 1, 'end_time': 277890}, {'speech_rate': 384, 'emotion_value': 6.0, 'begin_time': 278530, 'silence_duration': 0, 'text': '這個是保價4000元，是這邊是飽了再一箱一箱也就涼', 'channel_id': 0, 'end_time': 282430}, {'speech_rate': 328, 'emotion_value': 7.0, 'begin_time': 282440, 'silence_duration': 0, 'text': '對我搞了一箱呀，你看我一箱，你看醫生都保意向破了一箱', 'channel_id': 1, 'end_time': 287195}, {'speech_rate': 344, 'emotion_value': 6.0, 'begin_time': 287270, 'silence_duration': 0, 'text': '與相鄰的兩罐一罐破了一罐，損壞了', 'channel_id': 1, 'end_time': 290055}, {'speech_rate': 151, 
'emotion_value': 6.0, 'begin_time': 290240, 'silence_duration': 0, 'text': '對吧', 'channel_id': 1, 'end_time': 291030}, {'speech_rate': 394, 'emotion_value': 6.0, 'begin_time': 291670, 'silence_duration': 0, 'text': '對損壞成污染的話確實理賠起來理賠，這方面的話是有一定的差別的。', 'channel_id': 0, 'end_time': 296385}, {'speech_rate': 299, 'emotion_value': 6.0, 'begin_time': 296460, 'silence_duration': 0, 'text': '喂能用跟不能用，確實差別比較大', 'channel_id': 0, 'end_time': 299470}, {'speech_rate': 401, 'emotion_value': 7.0, 'begin_time': 299480, 'silence_duration': 0, 'text': '對，那你出個官方文件證明我的另外一罐，還能用我就接受', 'channel_id': 1, 'end_time': 303370}, {'speech_rate': 475, 'emotion_value': 6.0, 'begin_time': 303610, 'silence_duration': 0, 'text': '哦，我們這邊是沒有出具不了這個，因爲我們只是一個物流公司。', 'channel_id': 0, 'end_time': 307270}, {'speech_rate': 410, 'emotion_value': 7.0, 'begin_time': 307280, 'silence_duration': 0, 'text': '對，那你憑你嘴巴說，我不禁售呀', 'channel_id': 1, 'end_time': 309470}, {'speech_rate': 318, 'emotion_value': 6.0, 'begin_time': 310230, 'silence_duration': 0, 'text': '說，相應的先生就說，因爲這個外包裝污染污染的話你，這邊可以就說，呃，採取一個清洗或者說', 'channel_id': 0, 'end_time': 318330}, {'speech_rate': 498, 'emotion_value': 6.0, 'begin_time': 318340, 'silence_duration': 0, 'text': '你現在不用給我講這些，我現在你現在給我理賠過程，我不接受，我要怎麼做？', 'channel_id': 1, 'end_time': 322555}, {'speech_rate': 525, 'emotion_value': 6.0, 'begin_time': 322630, 'silence_duration': 0, 'text': '我可以投訴吧。', 'channel_id': 1, 'end_time': 323430}, {'speech_rate': 271, 'emotion_value': 6.0, 'begin_time': 323650, 'silence_duration': 0, 'text': '可以的，沒問題', 'channel_id': 0, 'end_time': 325195}, {'speech_rate': 422, 'emotion_value': 6.0, 'begin_time': 325540, 'silence_duration': 0, 'text': '對你這個你這個理賠，我不行，我我不接受', 'channel_id': 1, 'end_time': 328240}, {'speech_rate': 387, 'emotion_value': 6.0, 'begin_time': 328930, 'silence_duration': 0, 'text': '不接受不接收的話，我就只能建議您考慮一下或者說你如果是呃，這樣', 'channel_id': 0, 'end_time': 333730}, {'speech_rate': 417, 'emotion_value': 7.0, 'begin_time': 333740, 'silence_duration': 0, 'text': 
'考慮我一現貨的，用不了了，還考慮一下，你說怎麼考慮', 'channel_id': 1, 'end_time': 337330}, {'speech_rate': 345, 'emotion_value': 6.0, 'begin_time': 337340, 'silence_duration': 0, 'text': '這當然，那個如果說你這邊覺得其他途徑的話，可以解決這個問題的話，也可以去開取相應的一個呃，途徑來維權嗎，因爲', 'channel_id': 0, 'end_time': 346715}, {'speech_rate': 412, 'emotion_value': 6.0, 'begin_time': 346790, 'silence_duration': 0, 'text': '這個金額不是我定的，這個是公司出具的，這樣一個結果，我誤籤的話，就是因爲有這個結果了所以說要先告訴你', 'channel_id': 0, 'end_time': 354060}, {'speech_rate': 367, 'emotion_value': 6.0, 'begin_time': 356420, 'silence_duration': 0, 'text': '我，我現在就是你公司出去，我也不知道您哪個工，你們現在是哪，你是得到那個北橋點，還是什麼', 'channel_id': 1, 'end_time': 363610}, {'speech_rate': 287, 'emotion_value': 6.0, 'begin_time': 363620, 'silence_duration': 0, 'text': '我是總部理賠部門', 'channel_id': 0, 'end_time': 365290}, {'speech_rate': 433, 'emotion_value': 6.0, 'begin_time': 366500, 'silence_duration': 1, 'text': '那我就找點完讓別人去溝通的，那我的貨肯定是一想，利用', 'channel_id': 1, 'end_time': 370100}, {'speech_rate': 146, 'emotion_value': 5.0, 'begin_time': 370110, 'silence_duration': 0, 'text': '了', 'channel_id': 0, 'end_time': 370520}, {'speech_rate': 370, 'emotion_value': 6.0, 'begin_time': 371160, 'silence_duration': 0, 'text': '嗯也行，您這邊可以先溝通一下，我也會跟公司把你這個情況再反饋一下。', 'channel_id': 0, 'end_time': 376505}, {'speech_rate': 383, 'emotion_value': 6.0, 'begin_time': 376870, 'silence_duration': 0, 'text': '對你們沒有這樣子，我我我想你一個東西我去按沈香保價了，你像我等下，或者用不了了，你現在返過來，是這樣子，說就是說只賠破那個，你這個我肯定是接受不了，如果像你們這樣的話，那你，你們這個報價，這個按照說費用的話，你們這是不合理的，因爲我每次我其它我其它那個經營', 'channel_id': 1, 'end_time': 396895}, {'speech_rate': 383, 'emotion_value': 6.0, 'begin_time': 396880, 'silence_duration': 0, 'text': '建的時候你你保價，然後貨物沒有過，沒有破損的話，你們這個費用也是完全都說過去了嘛，對吧？', 'channel_id': 1, 'end_time': 403770}, {'speech_rate': 416, 'emotion_value': 6.0, 'begin_time': 404630, 'silence_duration': 7, 'text': '這這個就跟那個買保險是一個道理啊，你現在就是買買個車險，呃，買買個微信，您的授權，如果說一年之內稱沒有發什麼，他全問題，您那個寶貝還是正常給的，如果說發生什麼問題呢，那保險公司的話，呃，他管理一下，或者撞人保險公司肯定會按照那個有損傷程度，比一個相應的非上班是吧', 'channel_id': 0, 'end_time': 423510}, {'speech_rate': 300, 
'emotion_value': 6.0, 'begin_time': 424160, 'silence_duration': 0, 'text': '就去信', 'channel_id': 0, 'end_time': 424760}, {'speech_rate': 322, 'emotion_value': 7.0, 'begin_time': 424770, 'silence_duration': 0, 'text': '對呀那是那是因爲有，有第三方干預證明誰是誰了，責任怎麼樣賠現在是沒有人，第三方證明就是你們可以賠一半對吧，只是你們自己這樣說的', 'channel_id': 1, 'end_time': 436480}, {'speech_rate': 200, 'emotion_value': 7.0, 'begin_time': 437320, 'silence_duration': 0, 'text': '對吧', 'channel_id': 1, 'end_time': 437920}, {'speech_rate': 281, 'emotion_value': 7.0, 'begin_time': 437930, 'silence_duration': 0, 'text': '你說的那個車險', 'channel_id': 0, 'end_time': 439420}, {'speech_rate': 414, 'emotion_value': 6.0, 'begin_time': 439430, 'silence_duration': 0, 'text': '是買保險車子出交通事故了，人家定了是多少多少錢買保險一樣不對了，那可以這樣理解，那是有第三方說，這是誰的責任，大家都同意了簽字', 'channel_id': 1, 'end_time': 448545}, {'speech_rate': 427, 'emotion_value': 6.0, 'begin_time': 448620, 'silence_duration': 0, 'text': '對吧，你現你現在，你說我和壞了，你說配偶一半，你說這樣就這樣，那肯定我不接受的', 'channel_id': 1, 'end_time': 454090}, {'speech_rate': 312, 'emotion_value': 6.0, 'begin_time': 455040, 'silence_duration': 0, 'text': '嗯，目前的話，公司給到的結果是這樣，就說呃，如果說', 'channel_id': 0, 'end_time': 459840}, {'speech_rate': 331, 'emotion_value': 6.0, 'begin_time': 459850, 'silence_duration': 0, 'text': '你們公司太老江湖了', 'channel_id': 1, 'end_time': 461480}, {'speech_rate': 349, 'emotion_value': 6.0, 'begin_time': 462380, 'silence_duration': 0, 'text': '嗯，然，然後就說，你這邊可以先跟那個網點那邊先溝通一下，我也會把這個情況再跟公司反饋一下。', 'channel_id': 0, 'end_time': 470100}, {'speech_rate': 200, 'emotion_value': 6.0, 'begin_time': 471010, 'silence_duration': 0, 'text': '好吧', 'channel_id': 0, 'end_time': 471610}, {'speech_rate': 203, 'emotion_value': 6.0, 'begin_time': 471620, 'silence_duration': 0, 'text': '可以', 'channel_id': 1, 'end_time': 472210}, {'speech_rate': 147, 'emotion_value': 5.0, 'begin_time': 472220, 'silence_duration': 0, 'text': '嗯行行', 'channel_id': 0, 'end_time': 473440}]}
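
For reference, a record in this shape can be read as follows. This is a sketch under assumptions: each line is taken to be `<audio file name>|<dict>` with the dict written in Python literal syntax (single quotes), which is why `ast.literal_eval` is used rather than a JSON parser. The shortened record and the function name `parse_record` are made up for the example.

```python
import ast


def parse_record(line):
    """Split 'name|{...}' at the first '|' and return (name, list of texts)."""
    name, payload = line.rstrip("\n").split("|", 1)
    record = ast.literal_eval(payload)
    texts = [sentence["text"] for sentence in record.get("sentences", [])]
    return name, texts


# Hypothetical shortened record in the same shape as the appendix line.
line = (
    "demo.mp3|{'sentences': ["
    "{'speech_rate': 200, 'text': '嗯', 'channel_id': 1}, "
    "{'speech_rate': 525, 'text': '我可以投訴吧。', 'channel_id': 1}]}\n"
)
name, texts = parse_record(line)
print(name, texts)
```

Splitting only at the first `|` keeps the parse intact even if a transcribed utterance happens to contain that character.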

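References:
Ruan Yifeng, Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction
Ruan Yifeng, Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles
Ruan Yifeng, Applications of TF-IDF and Cosine Similarity (3): Automatic Summarization
National Engineering Laboratory for Electronic Commerce and Electronic Payment, Application and Practice of Text Mining in the Analysis of Customer Service Dialogue Data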
