Dialogue Summarization | Extractive and Abstractive | Datasets and Baselines

Original

codebrid

2020-06-25 19:25

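Background: CVAE showed promising first results with the summary_yxu code on the dialogue dataset that iFLYTEK extracted in-house, so this time I am trying it on some official datasets and against some dialogue summarization baselines. Maybe it will work well there too.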

NOTE: reminding myself once more to really put care into this one!

1. Text summarization vs. dialogue summarization datasets

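Text summarization
The full text contains no dialogue information
DUC/TAC: English | small datasets | suited to evaluating traditional summarization methods
Gigaword: English | built with heuristic rules | suited to deep neural network methods
CNN/DailyMail: multi-sentence summaries | commonly used for abstractive summarization
NYTAC: relatively short | usable for text summarization, information retrieval, and information extraction
ASNAPR: product review data | usable for review and sentiment summarization
LCSTS: Chinese | Sina Weibo data | short texts | deep-network summarization methods
NLPCC: text summarization, sentiment classification, question answering, and other tasks

Dialogue summarization (comparatively few datasets)
Two or more dialogue participants
AMI: English | small | multimodal meeting data | usable for extractive and abstractive summarization
SAMSum: English | large | chit-chat data | human-annotated | suited to abstractive summarization
ICSI: English | fairly small | meeting corpus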

2. Format of the dialogue summarization dataset used this time

Dataset used this time: AMI Meeting Dataset (Carletta et al., 2005)

Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.

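AMI: English | meetings | multimodal dataset containing 100 hours of meeting recordings. About two-thirds of the data was elicited using a scenario in which participants play different roles in a design team, taking a design project from start to finish. The rest consists of naturally occurring meetings in a range of domains. During the meetings, participants also had unsynchronized pens available that record what is written. The meetings were recorded in English in three rooms with different acoustic properties, and include mostly non-native speakers.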

Official site: http://groups.inf.ed.ac.uk/ami/corpus/

Chinese reference: http://sykv.cn/m/view.php?aid=19912

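Source: although the AMI Meeting Corpus was created for the use of a consortium developing meeting browsing technology, it was designed to be useful across a wide range of research areas. The downloads on the site include videos suitable for most purposes; all of the signals and transcriptions, and some of the annotations, have been released publicly under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).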

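Papers describing the data:

Jean Carletta (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus.
Steve Renals, Thomas Hain, and Hervé Bourlard (2007). Recognition and interpretation of meetings: The AMI and AMIDA projects.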

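The dataset includes: video, annotations, and handwriting data.

Size: (the copy I am downloading here is the 22 MB one in the figure below) train 97 / val 20 / test 20 meetings. [Figure: dataset size]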

We preprocess and divide the dataset into training (97 meetings), development (20 meetings) and test (20 meetings) sets, as done by Shang et al. (2018).

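Format: NXT (NITE XML Toolkit) format; NXT version 1.4.4 is required. The raw data format is shown in the figure below: it records which word is spoken at each point in time.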
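To make the word-level format concrete, here is a minimal sketch of reading such a file with Python's standard library. The XML fragment is a made-up sample modelled on my understanding of the AMI words layout (one `<w>` element per token with `starttime` and `endtime` attributes); the ids, times, tokens and the helper name `read_words` are assumptions for illustration, not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of an AMI words file (ids, times, tokens are invented).
SAMPLE = """<nite:root nite:id="demo.A.words" xmlns:nite="http://nite.sourceforge.net/">
  <w nite:id="demo.A.words0" starttime="1.20" endtime="1.45">Okay</w>
  <w nite:id="demo.A.words1" starttime="1.45" endtime="1.50" punc="true">,</w>
  <w nite:id="demo.A.words2" starttime="1.90" endtime="2.10">let's</w>
  <w nite:id="demo.A.words3" starttime="2.10" endtime="2.40">start</w>
</nite:root>
"""


def read_words(xml_text):
    """Return (start, end, token) tuples for every <w> element, ordered by start time."""
    root = ET.fromstring(xml_text)
    words = []
    for w in root.iter("w"):
        start, end = w.get("starttime"), w.get("endtime")
        # Skip elements without timing or text (assumed possible in real files).
        if start is None or end is None or not w.text:
            continue
        words.append((float(start), float(end), w.text.strip()))
    return sorted(words)


if __name__ == "__main__":
    for start, end, token in read_words(SAMPLE):
        print(f"{start:6.2f} {end:6.2f} {token}")
```

Joining the tokens per speaker in time order is one way to get plain utterances out of this representation; the already-processed copy mentioned below makes that step unnecessary here.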

Usage: to use the data, it has to be processed into whatever format you need. (Brother C already has a processed copy, so I will use that directly.)

3. Existing dialogue summarization papers and their results: which can serve as baselines

Baselines in Brother C's paper:

TextRank (Mihalcea and Tarau, 2004) is a graph-based extractive method that selects important sentences from the input document.
C&L (Cheng and Lapata, 2016) is an extractive method based on the sequence-to-sequence framework. Its decoder receives sentence embeddings and outputs sentence labels.
SummaRunner (Nallapati et al., 2017) is an extractive method based on a hierarchical RNN which iteratively constructs a summary representation to predict sentence labels.
CoreRank (Shang et al., 2018) is an unsupervised abstractive method which generates summaries by combining several approaches.
Pointer-Generator (See et al., 2017) is an abstractive method equipped with a copy mechanism; its decoder can either generate from the vocabulary or copy from the input.
HRED (Serban et al., 2016) is a hierarchical sequence-to-sequence model composed of a word-level LSTM and a sentence-level LSTM.
Sentence-Gated (Goo and Chen, 2018) is an abstractive method that incorporates dialogue acts through a sentence-gated mechanism.
TopicSeg (Li et al., 2019a) is an abstractive method using a hierarchical attention mechanism at three levels (topic, utterance, word).
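As a reference point for the simplest baseline in the list, here is a minimal TextRank-style sketch in plain Python. It builds a sentence graph weighted by word overlap (normalised by log sentence length, as in Mihalcea and Tarau, 2004) and runs a PageRank-style iteration. The tokenizer, the function names `textrank` and `summarize`, and the default parameters are my own assumptions; this is not the implementation used in the paper's experiments.

```python
import math
import re


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over a word-overlap similarity graph."""
    tokens = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    def similarity(a, b):
        # Shared words, normalised by the log lengths of both sentences.
        if len(a) < 2 or len(b) < 2:
            return 0.0
        return len(a & b) / (math.log(len(a)) + math.log(len(b)))

    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, proportional to edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

For meeting transcripts the "sentences" would be utterances; a sentence that shares no words with the rest gets only the baseline score and is never selected first.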

And their results

4. Related work on meeting summarization

7 Related Work
Meeting Summarization
Graph-to-Sequence Generation

5. Running the baselines

Baselines I will use this time:

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks.
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents.
Chih-Wen Goo and Yun-Nung Chen. 2018. Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts.

5.1 Pointer-Generator

Paper: https://arxiv.org/abs/1704.04368

Code: https://github.com/abisee/pointer-generator (TensorFlow version)

Used here: https://github.com/OpenNMT/OpenNMT-py (PyTorch version)
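As a reminder of what "generate from the vocabulary or copy from the input" means in See et al. (2017), the final word distribution mixes the two sources with a gate p_gen: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention on source positions holding w). The sketch below shows that mixing step on toy numbers in plain Python. The function name `final_distribution` and the inputs are hypothetical; this is not OpenNMT-py code.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copy probabilities into one distribution over words.

    p_gen         -- probability of generating from the vocabulary (0..1)
    vocab_probs   -- dict word -> P_vocab(word), summing to 1
    attention     -- attention weights over source positions, summing to 1
    source_tokens -- the source words at those positions
    """
    mixed = {word: p_gen * p for word, p in vocab_probs.items()}
    for weight, token in zip(attention, source_tokens):
        # Copy mass goes to the word at each source position; words outside
        # the vocabulary get a new entry here, which is how OOVs can be produced.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


if __name__ == "__main__":
    dist = final_distribution(
        p_gen=0.6,
        vocab_probs={"the": 0.5, "remote": 0.3, "<unk>": 0.2},
        attention=[0.1, 0.9],
        source_tokens=["the", "scrollwheel"],
    )
    for word, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(f"{word:12s} {prob:.2f}")
```

In the toy call, "scrollwheel" is not in the vocabulary yet ends up as the most likely word, because most of the attention sits on it and 40% of the probability mass is routed through copying.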

source activate onmt_diasum

pip install OpenNMT-py

Base model structure (for the record):
NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(5809, 300, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(300, 200, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(5809, 300, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(500, 200)
        (1): LSTMCell(200, 200)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=200, out_features=200, bias=False)
      (linear_out): Linear(in_features=400, out_features=200, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=200, out_features=5809, bias=True)
    (1): Cast()
    (2): LogSoftmax()
  )
)
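A few of the sizes in this printout are easier to read with the arithmetic spelled out. The sketch below is bookkeeping only, in plain Python, and rests on my understanding of input feeding and global attention: the decoder's first LSTMCell takes the word embedding concatenated with the previous attentional output, and `linear_out` takes the context vector concatenated with the decoder state. The helper `lstm_params` assumes the usual PyTorch LSTM layout with two bias vectors per layer.

```python
emb_dim, hidden, vocab = 300, 200, 5809  # read off the printout above

# Input feeding: embedding (300) + previous attentional output (200) -> LSTMCell(500, 200)
decoder_input = emb_dim + hidden

# Attention output layer: [context; decoder state] -> Linear(in_features=400, out_features=200)
attn_out_input = 2 * hidden


def lstm_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates, input and recurrent weights, two biases."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


encoder_rnn = lstm_params(emb_dim, hidden) + lstm_params(hidden, hidden)
decoder_rnn = lstm_params(decoder_input, hidden) + lstm_params(hidden, hidden)
generator = hidden * vocab + vocab  # Linear(200, 5809) with bias

if __name__ == "__main__":
    print("decoder input size:", decoder_input)
    print("attention linear_out input size:", attn_out_input)
    print("encoder / decoder RNN params:", encoder_rnn, decoder_rnn)
    print("generator params:", generator)
```

With a 5809-word vocabulary, the two embedding tables and the generator hold more parameters than the recurrent layers, which is typical for a model this small.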

Abstractive - reproduced

5.2 SummaRunner

Paper: https://arxiv.org/abs/1611.04230

Code: no official code

Used here: https://github.com/kedz/nnsum (PyTorch version)

conda activate nnsum (or: source activate nnsum)

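Extractive -

5.3 Sentence-Gated

Paper: https://arxiv.org/abs/1809.05715

Code: https://github.com/MiuLab/DialSum (TensorFlow version)

Used here: Brother C's own reimplementation, adapted from OpenNMT (PyTorch version), so the 5.1 environment is reused directly

Abstractive - reproduced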
