Reading Notes: "Joint Extraction of Entities and Overlapping Relations Using Position-Attentive Sequence Labeling" (AAAI 2019)


The paper was published at AAAI-19: https://www.aaai.org/ojs/index.php/AAAI/article/view/4591

Abstract

Joint entity and relation extraction detects entities and relations simultaneously with a single model. Most existing extraction methods instead adopt a pipeline: first identify the entities, then find the relations between them. Such a framework is easy to build, but it suffers from error propagation and ignores the inherent connection between the two subtasks. This paper proposes a novel unified joint extraction model that tags entity and relation labels according to a query word position p: given a position p, the model detects the entity at p and identifies the entities at other positions that have a relationship with it. To realize this, the paper also proposes a tagging scheme that generates n tag sequences for an n-word sentence, and introduces a position-attention mechanism that produces a different sentence representation for each query position in order to model these n tag sequences. The proposed model can simultaneously extract entities with their types as well as all overlapping relations. Experiments show that the framework performs very well at extracting overlapping relations and at detecting long-range relations.

Related Work

The paper proposes a new unified framework for joint extraction. Given a sentence and a query position p, the model answers two pseudo-questions: "What is the entity at p and what is its type?" and "Which entities have a relationship with the entity at p?" By answering these two questions, joint extraction is converted into a sequence labeling problem: for a sentence of n words, n different tag sequences are annotated according to the n query positions (each word is one query position). To model these n tag sequences in a single unified model, a novel position-attention mechanism is introduced into the sequence labeling model to generate n different position-aware sentence representations. In addition, the proposed attention mechanism builds direct connections between words (entities), which helps extract long-range relations (where the distance between the two entities is large).

The main contributions of the paper are:

  1. A tagging scheme that can represent entity types and overlapping relations at the same time.
  2. A position-attention mechanism that generates different position-aware sentence representations according to the query position p, which can be used to decode the different tag sequences and extract overlapping relations.
  3. Experiments demonstrating the effectiveness of the proposed method.

Method and Key Innovations

The paper has two main innovations. The first is a novel tagging scheme that annotates n tag sequences (n being the sentence length), each as long as the sentence itself, which gives the model a large improvement over other models at extracting overlapping relations. The second is a position-attention mechanism, which makes the model remarkably effective at extracting long-range relations.

1. Extracting Overlapping Relations with Tag Sequences

Definition of overlapping relations

Figure 1 in the paper gives an example of overlapping relations. In the sentence, several relations share the same entity; these relations are said to overlap. For example, the first three relations in the table overlap because they share the entity "Trump". Similarly, the last two relations overlap through the shared entity "New York City". Such overlapping relations are very common in relation extraction datasets, yet traditional models can generally extract only one of the overlapping relations rather than all of them.

Tagging scheme

To solve the problem of extracting overlapping relations, the paper proposes a tag-sequence method. The annotation process is as follows:

For a sentence of n words, create and label one length-n sequence per word, giving n sequences of length n in total. The n tag sequences are annotated according to the different query positions p (i.e., the position of the target word in the sentence). In each tag sequence, if the query position p is the start of an entity, that entity's type is labeled at p; other entities that have a relationship with the entity at p are labeled with the relation types; and all remaining positions are assigned the "O" (Outside) tag, indicating that they have no relation to the attended entity. Relations represented as triplets (Entity1, RelationType, Entity2) can therefore be extracted from the tag sequences. Note that the entity at query position p can be used multiple times, forming overlapping relations.

Figure 2 in the paper shows an example of the tagging scheme, where n denotes the sentence length and p ∈ [1, n] is the position of the query word. For a query word at p, a length-n tag sequence is built to represent all possible overlapping relations that involve the entity at p. If p is the start of an entity, the entity type is labeled at p; any other position belonging to an entity related to the entity at p is labeled with the relation type. In this way, all entities and overlapping relations can be annotated with the scheme. In this example, "LOC" in "B-LOC" is short for the entity type LOCATION, "PER" in "S-PER" is short for PERSON, "PO" in "B-PO" is short for the relation type President of, "BI" in "B-BI" is short for Born in, and "LI" in "B-LI" is short for Located in. When p = 5, the target word is Trump, and one tag sequence is built: since there is an entity, Trump, at position 5, it is tagged S-PER, where S stands for Single (a single-word entity) and PER is the entity type. Trump forms the relation triplet {Trump, Born_in, New York City} with New York City at positions 14-16, so New York City is tagged B-BI, I-BI, E-BI. This example illustrates the scheme: the first entity of a relation triplet is labeled with its entity type, and the second entity is labeled with the relation type.
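To make the scheme concrete, here is a minimal Python sketch (my own illustration, not the paper's code; the helper name and the 1-based span convention are assumptions) that builds the tag sequence for query position p = 5 of the example sentence:

```python
def build_tag_sequence(n, p, entity=None, relations=()):
    """Build one length-n tag sequence for query position p (1-based).

    entity:    (start, end, entity_type) of the entity starting at p, or None.
    relations: iterable of (start, end, relation_type) for entities that
               have a relationship with the entity at p.
    """
    tags = ["O"] * n

    def label_span(start, end, label):
        # BIES positional signs: Single for one-word spans, else Begin/Inside/End.
        if start == end:
            tags[start - 1] = "S-" + label
        else:
            tags[start - 1] = "B-" + label
            for i in range(start + 1, end):
                tags[i - 1] = "I-" + label
            tags[end - 1] = "E-" + label

    if entity is not None:
        label_span(entity[0], entity[1], entity[2])   # entity type at p
        for start, end, rel in relations:             # relation types elsewhere
            label_span(start, end, rel)
    return tags

# p = 5 ("Trump"): a single-word PERSON entity, related to "New York City"
# (words 14-16) through Born_in ("BI").
tags = build_tag_sequence(16, 5, entity=(5, 5, "PER"), relations=[(14, 16, "BI")])
print(tags[4], tags[13:16])   # -> S-PER ['B-BI', 'I-BI', 'E-BI']
```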

2. Position-Attention for Relation Extraction

Before the position-attention vectors are computed, some preparation is required: the word-level and character-level vectors of each word are concatenated, [w_t; w_t^c], and fed into a Bi-LSTM encoder to obtain the hidden states, denoted H = {h_t}, t ∈ [1, n]. Position attention is then computed over these states.
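As a sketch of this encoding step (a PyTorch illustration, not the authors' code; the dimensions follow the implementation details reported later in the paper, while the exact char-CNN wiring is an assumption):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Concatenate word- and char-level representations, then encode with a Bi-LSTM."""

    def __init__(self, vocab_size, n_chars, d_word=100, d_char=50, d_hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word)
        self.char_emb = nn.Embedding(n_chars, d_char)
        # Char-CNN per word: window size 3, 50 filters (the paper's settings).
        self.char_cnn = nn.Conv1d(d_char, 50, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(d_word + 50, d_hidden // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, n)   chars: (batch, n, max_word_len)
        b, n, L = chars.shape
        w = self.word_emb(words)                                  # (b, n, d_word)
        c = self.char_emb(chars).view(b * n, L, -1).transpose(1, 2)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values        # max-pool over characters
        c = c.view(b, n, -1)                                      # (b, n, 50)
        H, _ = self.lstm(torch.cat([w, c], dim=-1))               # H = {h_t}: (b, n, d_hidden)
        return H
```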

Figure 3 in the paper illustrates the proposed position-attention mechanism. The main idea is to compute an attention vector c_t for every word in the sentence: c_t = att(H, h_p, h_t), t ∈ [1, n], where H contains the hidden states of the whole sentence, h_p is the state of the target (query) word, and h_t is the hidden state of the current word. c_t is obtained as follows:
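The formula image is not reproduced in the post; based on the description below and the paper's notation, the computation is presumably:

```latex
s_t^j = v^\top \tanh\left( W_H h_j + W_p h_p + W_h h_t \right), \quad j \in [1, n] \\
a_t^j = \frac{\exp(s_t^j)}{\sum_{k=1}^{n} \exp(s_t^k)}, \qquad
c_t = \sum_{j=1}^{n} a_t^j \, h_j
```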
In the equations, W_H, W_p, W_h, and v are parameters to be learned; h_j, h_p, and h_t are the hidden states at positions j, p, and t respectively; s_t^j is the score obtained by comparing h_p and h_t with each sentence state h_j; and a_t^j is the attention weight produced by normalizing s_t^j. The index j ranges over all words in the sentence, j ∈ [1, n]. For a word w_t, computing its attention weights therefore uses the hidden states of every word in the sentence (h_j), the hidden state h_p of the word at position p, and w_t's own hidden state h_t.

The position-attention vector c_t computed above is concatenated with the Bi-LSTM hidden state h_t to form the vector u_t, which is fed into a CRF decoder to produce the entity and relation triplets.
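A minimal PyTorch sketch of this step (my own illustration of the equations above; the batched broadcasting layout is an assumption):

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """c_t = att(H, h_p, h_t) for every t, followed by u_t = [h_t; c_t]."""

    def __init__(self, d_hidden=200):
        super().__init__()
        self.W_H = nn.Linear(d_hidden, d_hidden, bias=False)
        self.W_p = nn.Linear(d_hidden, d_hidden, bias=False)
        self.W_h = nn.Linear(d_hidden, d_hidden, bias=False)
        self.v = nn.Linear(d_hidden, 1, bias=False)

    def forward(self, H, p):
        # H: (batch, n, d) sentence states; p: query position (0-based index).
        h_p = H[:, p:p + 1, :]                       # (b, 1, d)
        # s[t, j] = v^T tanh(W_H h_j + W_p h_p + W_h h_t), via broadcasting.
        s = self.v(torch.tanh(
            self.W_H(H).unsqueeze(1)                 # j axis  -> (b, 1, n, d)
            + self.W_p(h_p).unsqueeze(1)             # broadcast -> (b, 1, 1, d)
            + self.W_h(H).unsqueeze(2)               # t axis  -> (b, n, 1, d)
        )).squeeze(-1)                               # (b, n, n)
        a = torch.softmax(s, dim=-1)                 # a[t, j]: normalize over j
        c = torch.bmm(a, H)                          # c_t = sum_j a[t, j] h_j: (b, n, d)
        return torch.cat([H, c], dim=-1)             # u_t = [h_t; c_t]: (b, n, 2d)
```

The softmax normalizes over j, so each c_t is an attention-pooled summary of the whole sentence conditioned on both the query position p and the current position t.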

Experiments

The experiments use the NYT (New York Times) and Wiki-KBP datasets to evaluate the method; statistics of the two datasets are given in the paper (Table 1). The authors also run a comparison against the LSTM-LSTM-Bias model, showing that this approach is markedly better at extracting long-range entity relations.

Full Paper Text

The remainder of this post reproduces the paper's original text for reference.

Abstract
Joint entity and relation extraction is to detect entity and relation using a single model. In this paper, we present a novel unified joint extraction model which directly tags entity and relation labels according to a query word position p, i.e., detecting entity at p, and identifying entities at other positions that have relationship with the former. To this end, we first design a tagging scheme to generate n tag sequences for an n-word sentence. Then a position-attention mechanism is introduced to produce different sentence representations for every query position to model these n tag sequences. In this way, our method can simultaneously extract all entities and their type, as well as all overlapping relations. Experiment results show that our framework performs significantly better on extracting overlapping relations as well as detecting long-range relations, and thus we achieve state-of-the-art performance on two public datasets.
Introduction
Traditional RE systems divide the task into pipelined subtasks: first detect entities, then classify the relation types between candidate entity pairs. Such a framework makes the task easy to carry out, but it ignores the underlying interdependency between the two subtasks and suffers from error propagation (Li and Ji 2014; Gupta, Schutze, and Andrassy 2016).
Unlike pipeline approaches, joint extraction detects entities and their relations using a joint model. Recent studies show that joint learning methods can effectively integrate entity and relation information and therefore perform better on both subtasks. Early joint models are based on feature-based structured learning (Kate and Mooney 2010; Li and Ji 2014; Miwa and Sasaki 2014; Ren et al. 2017), which heavily relies on handcrafted features and other NLP toolkits. Neural architectures have also been applied, most of which use parameter sharing for joint modeling but still require explicit separate components for entity recognition and relation classification (Miwa and Bansal 2016; Gupta, Schutze, and Andrassy 2016). In contrast, Zheng et al. (2017b) proposed a special tagging scheme that converts joint extraction into a sequence labeling problem, solving the task in a unified manner. Another unified method, proposed by Zeng et al. (2018), adopts a sequence-to-sequence learning approach with a copy mechanism, but their model cannot recognize multi-word entities when extracting overlapping relations. Overall, jointly extracting entities and overlapping relations with a single unified model remains challenging.

In this paper, we propose a novel unified method that tackles joint extraction by tagging entity and relation labels simultaneously according to a query word position p. Given a sentence and a query position p, our model answers two pseudo-questions: "What is the entity at p and what is its type?" and "Which entities have a relationship with the entity at p?" To this end, we design a special tagging scheme that annotates entity labels at the query position p and relation labels at the other positions. It thus converts the joint relation extraction problem into a list of sequence labeling problems: for an n-word sentence, we annotate n different tag sequences according to the n query positions. To model these n tag sequences in a single unified model, a novel position-attention mechanism is introduced into the sequence labeling model (see Figure 3) to produce n different position-aware sentence representations, which are used to decode the different tagging results, from which we can extract all entities, their types, and all overlapping relations. Moreover, the proposed attention mechanism builds direct connections between words (entities), which may help extract long-range relations (where the distance between the two entities is large).

The main contribution of this paper is a newly proposed unified model for extracting entities and overlapping relations:
1. We design a tagging scheme that can represent entity types and overlapping relations at the same time;
2. We propose a position-attention mechanism that produces different position-aware sentence representations according to the query position p, which can be used to decode different tag sequences;
3. We demonstrate the effectiveness of the method on two public datasets and achieve state-of-the-art results; furthermore, analysis shows that our model performs better on extracting long-range relations, which are usually harder.


Figure 1: An example sentence that contains overlapping relations which share the same entity in the sentence. For example, the first three relations in the table are overlapping because they share the same entity “Trump”. Similarly, the last two relations are also overlapping due to the shared entity “New York City”. Such overlapping relations are very common in datasets of relation extraction (see Table 1).
Method
In this section, we first introduce the tagging scheme that converts overlapping relation extraction into a list of sequence labeling problems, and then describe in detail the position-attentive sequence labeling model built on this tagging scheme.

Figure 2: An example of our tagging scheme, where n denotes the sentence length, p ∈ [1, n] is the query word position. For a query p, we build an n-tag sequence to represent all possible overlapping relations that correspond to the entity at p. Thus, entity type is labeled at p if it is the start of an entity, and relation types are labeled at the rest of words if they have relationship with the entity at p. In this way, all entities and overlapping relations can be annotated using this tagging scheme. In this example, "LOC" in "B-LOC" is short for entity type LOCATION, "PER" in "S-PER" is short for PERSON, "PO" in "B-PO" is short for relation type President of, "BI" in "B-BI" is short for Born in and "LI" in "B-LI" is short for Located in.


As shown in Figure 2, for an n-word sentence, n different tag sequences are annotated based on our tagging scheme according to different query position p. In each tag sequence, entity type is labeled at the current query position p if this position is the start of an entity, and other entities, which have relationship to the entity at p, are labeled with relation types. The rest of tokens are assigned label “O” (Outside), indicating that they do not correspond to the entity that is attended to. Thus, relations, which are represented by a triplet (Entity1, RelationType, Entity2), can be extracted based on a tag sequence. Here, Entity1 is the first argument of the relation and can be obtained from the detected entity at the query position. Entity2, the second argument, and RelationType can be extracted from other detected entities and their labels in the same tag sequence. Obviously, the first entity can be used multiple times to form overlapping relations.


For example, when the query position p is 5 and the token at this position is "Trump", the label of "Trump" is PERSON. Other entities, such as "United States", "Queens" and "New York City", which correspond to "Trump", are labeled as President of, Born in and Born in. The first entity "Trump" will be used three times to form three different triplets in this situation. If the query p is 12 and the token is "Queens", its tag is LOCATION, and the corresponding entity "New York City" is labeled as Located in. When p is 2, there is no corresponding entity and thus only the entity type of "United States" is labeled (notice that the relation is unidirectional). All of the tokens are labeled as "O" when p is 1 because there is no entity at the position p which is attended to. For both entity and relation type annotation, we use "BIES" (Begin, Inside, End, Single) signs to indicate the position information of tokens in the entity, and therefore we can extract multi-word entities. Through our tagging scheme, all of the overlapping relations in an n-word sentence, together with all entity mentions and their entity types, can be represented in n tag sequences.
Note that our tagging scheme is quite different from the table filling method (Miwa and Sasaki 2014). It uses only half of the table and hence cannot represent reverse relations, which are also a kind of overlapping form, e.g., if the relation of entity pair (Entity1, Entity2) is Capital of, the reverse pair (Entity2, Entity1) may have the relation Contains. In addition, the best search order "close-first" actually equals first detecting entities and then classifying relations. Most joint neural methods following table filling (Gupta, Schutze, and Andrassy 2016; Zhang, Zhang, and Fu 2017) also use this search order and usually require explicit RC components in the network.


End-to-End Sequence Labeling Model with Position-Attention
With our tagging scheme, we build an end-to-end sequence labeling neural architecture (Figure 3) to jointly extract entities and overlapping relations. Our architecture first encodes the n-word sentence using an RNN encoder. Then, we use a position-attention mechanism to produce different position-aware sentence representations for every query position p. Based on these position-aware representations, we finally use a Conditional Random Field (CRF) to decode the n tag sequences to extract entities and overlapping relations.
Bi-LSTM Encoder
RNNs have been shown to be powerful for capturing the dependencies of an input word sequence. In this work, we choose the bidirectional Long Short-Term Memory (Bi-LSTM) as the encoding RNN. Consider a sentence that consists of n words, S = {w_t}, t ∈ [1, n]. We first convert the words to their word-level representations, where w_t ∈ R^d is the d-dimensional word vector corresponding to the t-th word in the sentence. As out-of-vocabulary (OOV) words are common for entities, we also augment the word representations with character-level information. The character-level representations w_t^c are extracted by a convolutional neural network (CNN), where w_t^c ∈ R^k is the k-dimensional output of the CNN with k filters. This CNN, similar to the one applied on words, receives character embeddings as input and generates representations which effectively capture the morphological information of the word. The final representation of a word is the concatenation of its word-level and character-level representations, [w_t; w_t^c]. Then, the Bi-LSTM produces a forward state →h_t and a backward state ←h_t for each time step:


These two separate hidden states capture both past (forward) and future (backward) information of the word sequence. Finally, we concatenate →h_t and ←h_t as the encoding output of the t-th word, denoted h_t = [→h_t; ←h_t], to obtain the final sentence representation H = {h_t}, t ∈ [1, n]. However, such representations are not enough for decoding the n tag sequences produced by our tagging scheme, because they lack the position information indicating where to detect Entity1 and the other components of overlapping triplets.
Position-Attention Mechanism
The key information for detecting an entity and its relationship with another entity includes: (1) the words inside the entity itself; (2) the entity it depends on; (3) the context that indicates the relationship. Under these considerations, we propose position-attention, which can encode the entity information at the query position as well as the context information of the whole sentence, to generate the position-aware and context-aware representations u_t.



Figure 3: The architecture of our position-attentive sequence labeling model. It receives the same sentence input and a different query position p to extract all overlapping relations. Here, the red "Queens" is the token at query position p, h_p is the hidden state of time-step p, h_t is the hidden vector of time-step t, a_t is the attention weights, and c_t is the attention-pooling vector.

where W_H, W_p, W_h, and v are parameters to be learned, h_j, h_p, h_t are the hidden states at positions j, p and t respectively, s_t^j is the score computed by comparing h_p and h_t with each of the sentence states h_j, and a_t^j is the attention weight produced by normalization of s_t^j. It means that h_p, the state at the position that we attend to, is used for comparing with the sentence representations to encode position information, and h_t is used for matching the sentence representations against themselves (self-matching) to collect information from the context (Wang et al. 2017). The position-attention mechanism produces different sentence representations according to the query position p, and thus solves the problem of modeling different tag sequences of a sentence. The following tag decoder can predict completely distinct labels given the same sentence and different query positions.
CRF Decoder
It has been shown beneficial for sequence labeling models to consider the correlations between labels in neighborhoods and jointly decode the best chain of labels. Thus, we use CRF for joint decoding, instead of independently decoding each label.

We consider Z = {z_t}, t ∈ [1, n], to be the input sequence scores, which are generated from the position-aware sentence representations u_t; the equation images are missing from this part of the post, and a reconstruction of the three formulas is given after this passage.
where z_t ∈ R^{N_t} is the vector of tag scores for the t-th word, and N_t is the number of distinct tags. Consider Z_{t,j} as the score of the j-th tag at position t. For a sequence of labels y = {y_t}, t ∈ [1, n], we define the decoding score.
where A is the transition matrix such that A_{i,j} represents the transition score from tag i to tag j. Then we get the conditional probability over all possible label sequences y.
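Based on the surrounding definitions, the linear-chain CRF formulation being described is presumably the following (my reconstruction; the projection parameters W and b are assumptions):

```latex
z_t = W u_t + b \in \mathbb{R}^{N_t} \\
\mathrm{score}(y) = \sum_{t=1}^{n} \left( A_{y_{t-1},\,y_t} + Z_{t,\,y_t} \right) \\
p(y \mid S) = \frac{\exp\left(\mathrm{score}(y)\right)}{\sum_{y'} \exp\left(\mathrm{score}(y')\right)}
```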

Extracting Entities and Overlapping Relations from Tag Sequences
From our tagging scheme, we know that the first entity of the triplet and its entity type can be extracted from the labels at the query position, and the second corresponding entity, if it exists, as well as the relation type, can be obtained from the labels at other positions (see Figure 2). The overlapping relation extracting problem is solved because the first entity is allowed to belong to multiple triplets in a tag sequence. Through n tag sequence results considering different query positions, we can extract all overlapping relations in a sentence, as well as all entity mentions and their entity types. Moreover, the extracted entity types can be used to validate the triplets, for example, if the relation type is Born in, the entity type of the first argument must be PERSON.
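As an illustration of this decoding step, here is a hypothetical Python helper (not the paper's code; it assumes a well-formed BIES tag sequence and 0-based positions):

```python
def extract_triplets(words, tags, p):
    """Recover (Entity1, RelationType, Entity2) triplets from one tag sequence.

    words: list of n tokens; tags: the length-n tag sequence annotated for
    query position p (0-based). Assumes well-formed BIES spans.
    """
    # Collect contiguous tagged spans as (start, end, label).
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):
        sign, _, label = tag.partition("-")
        if sign in ("B", "S"):
            start = i
        if sign in ("S", "E"):
            spans.append((start, i, label))
    # The span starting at p carries the entity type -> Entity1.
    entity1 = next((" ".join(words[s:e + 1]) for s, e, _ in spans if s == p), None)
    if entity1 is None:
        return []
    # Every other span carries a relation type -> (Entity1, relation, Entity2).
    return [(entity1, label, " ".join(words[s:e + 1]))
            for s, e, label in spans if s != p]
```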

Experiments
Experiment Settings
Datasets
We use two public datasets to demonstrate the effectiveness of our method: (1) NYT (Riedel, Yao, and McCallum 2010) is a news corpus sampled from 294k 1989-2007 New York Times news articles. We use the same dataset published by (Ren et al. 2017). The training data are automatically labeled using distant supervision, while 395 sentences are annotated by the author of (Hoffmann et al. 2011) and used as test data. (2) Wiki-KBP (Xiao and Weld 2012) utilizes 1.5M sentences sampled from 780k Wikipedia articles as training corpus, while the test set consists of 289 sentences selected by the author of (Ren et al. 2017) from the manual annotations in the 2013 KBP slot filling assessment results (Ellis et al. 2012). We use the public training data which are automatically labeled using distant supervision and handcrafted patterns by the author of (Liu et al. 2017). Statistics of the datasets are shown in Table 1.
Evaluation
We mainly focus on overlapping relation extraction in this paper. Because our model directly extracts relations without detecting entities and their entity types first, we only evaluate the results of extracted triplets. We use the F1 metric computed from Precision (Prec.) and Recall (Rec.) for evaluation. A triplet is marked correct when its relation type and two corresponding entities are all correct, where an entity is considered correct if the head and tail offsets are both correct. We exclude all triplets with relation type "None" (because we do not require them as negative samples) and create a validation set by randomly sampling 10% of sentences from the test set, as previous studies (Ren et al. 2017; Zheng et al. 2017b) did.

Implementation Details
For both datasets, the word embeddings are randomly initialized with 100 dimensions and the character embeddings are randomly initialized with 50 dimensions. The window size of CNN is set to 3 and the number of filters is 50. For Bi-LSTM encoder, the hidden vector length is set to 200. We use l2 regularization with a parameter of 0.001 to avoid overfitting. Parameter optimization is performed using Adam (Kingma and Ba 2014) with learning rate 0.001 and batch size 16. In addition, we randomly sample 10% negative tag sequences in which all words are labeled as “O” to reduce the training samples.
Baselines
We compare our method on NYT and Wiki-KBP datasets with the following baselines: (1) MultiR (Hoffmann et al. 2011) models training label noise based on multi-instance multi-label learning; (2) DS-Joint (Li and Ji 2014) jointly extracts entities and relations using a structured perceptron; (3) CoType (Ren et al. 2017) jointly learns the representations of entity mentions, relation mentions and type labels; (4) ReHession (Liu et al. 2017) employs heterogeneous supervision from both a knowledge base and heuristic patterns; (5) LSTM-CRF and LSTM-LSTM-Bias (Zheng et al. 2017b), the work most related to our method, converts the joint extraction task to a sequence labeling problem based on a novel tagging scheme, but cannot detect overlapping relations. Note that we do not compare our method with (Zeng et al. 2018) for two reasons. First, their model can decode only the first word of a multi-word entity, while ours can detect the whole entity. In this paper, we conduct a stricter evaluation in which an entity is considered correct only if the head and tail offsets are both correct, which makes the task more challenging. Second, they do not report results on the manually labeled NYT test set. Instead, they use a test set split from the training data, which is generated by distant supervision.

Experimental Results and Analyses
Main Results
We report our experimental results on the NYT and Wiki-KBP datasets in Table 2. It is shown that our model, position-attentive LSTM-CRF (PA-LSTM-CRF), outperforms all the baselines and achieves state-of-the-art F1 scores on both datasets. Specially, compared to LSTM-LSTM-Bias (Zheng et al. 2017b), our method achieves a significant improvement of 5.7% in F1 on the Wiki-KBP dataset, which is mainly because our model can extract overlapping relations. For the NYT dataset, although no overlapping relation is manually labeled, our model also outperforms LSTM-LSTM-Bias by 4.3% in F1 due to a large improvement in Recall. We consider that this is because our model is capable of identifying more long-range relations, which will be further discussed in the attention analysis section. We also notice that the Precision of our model drops compared with LSTM-LSTM-Bias. This is mainly because many overlapping relations are not annotated in the manually labeled test data. We will discuss it in the following paragraph.
Effect on Overlapping Relation Extraction
As Table 1 shows, about a third of the sentences in the training data of both datasets contain overlapping relations, but much fewer in the test sets. In fact, we find that many overlapping relations are omitted from the manually labeled test data in both datasets, especially for the relations of reverse pairs of entities, e.g., "per:parents" and "per:children" in Wiki-KBP, or "/location/country/administrative_divisions" and "/location/administrative_division/country" in NYT. This may significantly affect the performance of our model on overlapping relation detection, especially the Precision. Thus, to discover the capability of our model to identify overlapping triplets, we simply add some gold triplets into the test set of Wiki-KBP. For example, we add a ground-truth reverse triplet with type "per:parents" if "per:children" is originally labeled, and vice versa. This increases the number of sentences with overlapping relations in the test set from 23 to about 50, but still much less in proportion than the training data. We report the evaluation results compared with LSTM-LSTM-Bias in Table 3. It can be seen that our model achieves a large improvement of 6.9% in F1 and 11.2% in Precision compared with the results in Table 2, while the performance of LSTM-LSTM-Bias basically remains the same in F1. Moreover, for overlapping relations, our model significantly outperforms LSTM-LSTM-Bias by about 30%, which demonstrates the effectiveness of our method on extraction of overlapping relations.

Ablation Study
We also conduct ablation experiments to study the effect of the components of our model. As shown in Table 4, all the components play important roles in our model. Consistent with previous work, the character-level representations are helpful for capturing morphological information and dealing with OOV words in the sequence labeling task (Ma and Hovy 2016). Introducing the position-attention mechanism for generating the position-aware representations seems an efficient way to incorporate the information of the query position, compared with directly concatenating the hidden vector at the query position to each state of the Bi-LSTM encoder. In addition, the self-matching in our position-attention mechanism also contributes to the final results, as it extracts more information from the context.
Comparison of Running Time
While LSTM-LSTM-Bias or LSTM-CRF runs sequence tagging only once to extract non-overlapping relations, our model tags the same sentence another n − 1 times in order to recognize all overlapping relations. This means our model is more time-consuming (O(n^2) vs. O(n)). For instance, LSTM-CRF only predicts about 300 samples and consumes 2s on the Wiki-KBP test set, while our model decodes about 7000 tag sequences and takes about 50s. However, testing can be sped up by sharing the sentence representation before position attention, because it is identical across the other n − 1 decoding passes. In this way, the running time of our model reduces to 16s. Moreover, we may also prune some query positions to further accelerate in real applications, because it is impossible for some words to be the head of an entity.


Further Analysis for Attention
Detecting Long-Range Relations
Previous work has shown that attention mechanisms are helpful for capturing long-range dependencies between arbitrary tokens, which is very important for detecting triplets composed of entities that are far apart. To further prove the effectiveness of position attention, we analyze the F1 score on triplets with different distances between entities on the Wiki-KBP dataset. As shown in Figure 4, we find that the performance of our model remains stable as the distance between entities increases, while that of LSTM-LSTM-Bias drops significantly. It means that our model can effectively model the dependencies between entities despite a long distance between them.
Case Study for Attention Weights
We select two sentences from the test set of Wiki-KBP to show the alignment of our position-attention as a case study. As shown in Figure 5, the information of the first entity at the query position, together with the context evidence for recognizing relations, is encoded into the position-aware representations regardless of the distance between entities. For instance, in Figure 5(a), the second entity "Albert" pays more attention to the possible corresponding entities "Anderson" and "Carol", as well as the key context word "sons" that carries relational information. The model also produces reasonable alignments at other query positions, as Figure 5(b) shows. In Figure 5(c), "Thousand Oaks" can still attend to the first entity "Sparky Anderson" despite the long distance between them.

Figure 5: Part of the attention matrices for position-attention. Each row is the attention weights over the whole sentence for the current token. The query position is marked by an arrow, the red tokens indicate the first entity extracted at the query position, and the blue tokens indicate the second corresponding entities. (a) and (b) are the attention matrices of different query positions for the same sentence, and (c) is the attention matrix for a sentence with a long-range entity pair.
