[Paper Reading] Generating Coherent Summaries of Scientific Articles Using Coherence Patterns

Authors: Daraksha Parveen, Mohsen Mesgar, Michael Strube

1 Abstract

The authors use a graph-based approach to summarize scientific articles and employ coherence patterns to ensure that the generated summaries are coherent. They also propose a method that combines importance, coherence, and non-redundancy. These factors are optimized using Mixed Integer Programming.

2 Background Concepts

2.1 Three Properties a Summary Should Have

  • Importance
    • The summary should contain the important information of the input document.
  • Non-Redundancy
    • The summary should not repeat itself; the information it contains should be diverse.
  • Coherence
    • Although the summary should comprise diverse and important information from the input document, its sentences should be connected to one another so that it is coherent and easy to read.

2.2 Evaluation Metric: ROUGE Score

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
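
To make the comparison against a reference concrete, here is a minimal sketch of ROUGE-N recall. It is a simplification and not the official ROUGE package: it assumes lowercase whitespace tokenization and a single reference, and the function names `ngrams` and `rouge_n_recall` are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # unigram recall
print(rouge_n_recall(candidate, reference, n=2))  # bigram recall
```

In this toy pair, five of the six reference unigrams and three of the five reference bigrams also appear in the candidate, so the two recall values are 5/6 and 3/5.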

2.3 Bipartite Graph (Bigraph)

  • In graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V. The vertex sets U and V are usually called the parts of the graph. Equivalently, a bipartite graph is a graph that does not contain any odd-length cycles.
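
The odd-cycle characterization can be checked with a breadth-first 2-coloring. The sketch below is a generic illustration and is not taken from the paper; it assumes the graph is given as an undirected adjacency dictionary in which every vertex appears as a key, and `is_bipartite` is a made-up helper name.

```python
from collections import deque


def is_bipartite(adjacency):
    """Return True if the undirected graph can be colored with two colors."""
    color = {}
    for start in adjacency:
        if start in color:
            continue  # already reached from an earlier component
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in color:
                    # Put the neighbor in the opposite part.
                    color[neighbor] = 1 - color[node]
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    # An edge inside one part implies an odd-length cycle.
                    return False
    return True


square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(is_bipartite(square))    # 4-cycle (even length)
print(is_bipartite(triangle))  # 3-cycle (odd length)
```

The 4-cycle is bipartite while the triangle is not, which matches the statement that a bipartite graph contains no odd-length cycle.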

3 Method

3.0 Preprocessing

  1. Use the Stanford parser to determine sentence boundaries.

  2. Use the Brown coherence toolkit to convert the articles into entity grids.

  3. Use gSpan to extract all subgraphs from the projection graphs of the abstracts in the PubMed corpus.

3.1 Document Representation

  • Scientific articles are represented by an entity graph. The entity graph is a bipartite graph that consists of entities and sentences as two disjoint sets of nodes. Entity nodes are connected only with sentence nodes and not with each other. An entity node is connected with a sentence node only if the entity is present in the sentence.
  • A one-mode projection is then performed on the sentence nodes to create a directed one-mode projection graph. Two sentence nodes in the projection graph are connected if they share at least one entity in the entity graph, and the direction of each edge follows the order in which the sentences appear in the text.
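
The two steps above can be sketched on a toy document. This is a minimal illustration under assumptions: the entities per sentence are given by hand (the notes obtain them from entity grids), edges are unweighted, and `build_entity_graph` and `one_mode_projection` are hypothetical helper names.

```python
from itertools import combinations


def build_entity_graph(sentence_entities):
    """Map each entity to the sorted indices of the sentences that mention it."""
    entity_to_sentences = {}
    for index, entities in enumerate(sentence_entities):
        for entity in entities:
            entity_to_sentences.setdefault(entity, set()).add(index)
    return {e: sorted(s) for e, s in entity_to_sentences.items()}


def one_mode_projection(entity_graph):
    """Directed edges (i, j) with i < j between sentences sharing an entity."""
    edges = set()
    for sentences in entity_graph.values():
        # The sentence indices are sorted, so every pair points from the
        # earlier sentence to the later one (text order).
        for i, j in combinations(sentences, 2):
            edges.add((i, j))
    return sorted(edges)


# Hypothetical toy document: the set of entities mentioned in each sentence.
sentence_entities = [
    {"patient", "drug"},
    {"drug", "dose"},
    {"trial"},
    {"patient", "trial"},
]
graph = build_entity_graph(sentence_entities)
print(one_mode_projection(graph))
```

Here sentences 0 and 1 share "drug", sentences 0 and 3 share "patient", and sentences 2 and 3 share "trial", so the projection contains the edges (0, 1), (0, 3), and (2, 3).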

3.2 Mining Coherence Patterns

  • Coherence patterns are mined from the one-mode projection graphs of the abstracts in the PubMed corpus.

  • The weight of a coherence pattern is computed as follows, where $q$ is the number of graphs associated with abstracts in the corpus and $g_k$ is the graph of the $k^{th}$ abstract in the PubMed corpus.

    $$weight(pat_u)=\frac{\sum_{k=1}^{q} freq(pat_u,g_k)}{\max_{k=1}^{q} freq(pat_u,g_k)}$$

  • The weights of the coherence patterns are not on the same scale, so a sigmoid function is used to normalize them to the interval [0,1].
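
A small numeric sketch of the weighting and scaling steps follows. It reads the weight as the sum of the pattern's per-graph frequencies divided by its largest per-graph frequency, as the formula in these notes states, and it assumes the standard logistic function as the sigmoid, since the notes do not say which sigmoid is used. The counts and function names are made up for illustration.

```python
import math


def pattern_weight(frequencies):
    """Sum of per-graph frequencies divided by the largest per-graph frequency."""
    peak = max(frequencies)
    if peak == 0:
        return 0.0  # the pattern never occurs in the corpus
    return sum(frequencies) / peak


def sigmoid(x):
    """Standard logistic function (an assumed choice of sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical counts of one pattern in q = 4 abstract graphs.
frequencies = [2, 0, 1, 1]
raw = pattern_weight(frequencies)
print(raw, sigmoid(raw))
```

With these counts the raw weight is 4 / 2 = 2.0, and the logistic function maps it to a value strictly between 0 and 1.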

3.3 Summary Generation

3.3.1 Importance

  • Importance is calculated from the ranks of the sentences selected for the summary.
  • The rank function is derived from Kleinberg's HITS algorithm.
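
For readers unfamiliar with HITS, the sketch below shows the generic hub/authority power iteration. It is not the paper's rank function, which these notes do not spell out: the edge direction (sentence to entity), the iteration count, and the toy graph are all assumptions made for illustration.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes pointing in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update: sum the authority scores of nodes pointed to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both score vectors to unit Euclidean length.
        for scores in (hub, auth):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical entity graph: each sentence points to the entities it mentions.
edges = [("s1", "drug"), ("s2", "drug"), ("s2", "dose"), ("s3", "trial")]
hub, auth = hits(edges)
for sentence in sorted(("s1", "s2", "s3"), key=hub.get, reverse=True):
    print(sentence, round(hub[sentence], 3))
```

In this toy graph the sentence that mentions the most shared entities ("s2") receives the highest hub score, which is the intuition behind using such scores to rank sentences.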

3.3.2 Non-Redundancy

$$f_R(E)=\sum_{j=1}^{m} e_j$$

  • Here $m$ is the number of entities and $e_j$ is a binary variable for each entity.

3.3.3 Coherence

$$f_C(P)=\sum_{u=1}^{U} weight(pat_u)\times p_u$$

  • Here $p_u$ is a boolean variable associated with the coherence pattern $pat_u$.
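
The non-redundancy and coherence terms can be evaluated directly once the binary variables are fixed. In the paper the values of these variables are chosen by the Mixed Integer Programming solver; the sketch below instead fixes them by hand for a hypothetical candidate summary, purely to show how the two sums are computed. The numbers and function names are illustrative assumptions.

```python
def non_redundancy(entity_selected):
    """f_R(E): number of entities whose binary variable e_j equals 1."""
    return sum(entity_selected)


def coherence(pattern_weights, pattern_selected):
    """f_C(P): sum of weight(pat_u) * p_u over all U patterns."""
    return sum(w * p for w, p in zip(pattern_weights, pattern_selected))


# Hypothetical candidate summary with m = 4 entities and U = 3 patterns.
e = [1, 0, 1, 1]               # e_j values, fixed by hand
p = [1, 0, 1]                  # p_u values, fixed by hand
weights = [0.5, 0.25, 0.125]   # made-up scaled pattern weights
print(non_redundancy(e))       # 1 + 0 + 1 + 1
print(coherence(weights, p))   # 0.5 + 0.125
```

For this candidate the non-redundancy term is 3 and the coherence term is 0.625.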

Dataset

1 PubMed corpus (obtaining coherence patterns)

  • Coherence patterns are obtained by analyzing a corpus of abstracts of biomedical articles (the PubMed corpus, from the U.S. National Library of Medicine).

2 PLOS Medicine dataset (testing the system)

  • Articles from this dataset are given to the system as input, and the generated summaries are evaluated by comparing them with summaries written by a PLOS Medicine editor.

3 DUC 2002 (Document Understanding Conference)

  • Used to evaluate the generated summaries.

Related Work

Conclusion
