Entity Relation Extraction in Complex Contexts

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Entity relation extraction is an important step in knowledge graph construction and a core task in information extraction. In recent years it has attracted wide attention from researchers in both academia and industry. It covers several subtasks, such as relation classification and distantly supervised relation extraction. This talk focuses on joint entity and relation extraction and on document-level relation extraction, which can be grouped together as entity relation extraction in complex contexts."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Introduction to the Entity Relation Extraction Task"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/16\/16d57ff12b6ab6551e634f37dd0aa5f5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In general, a relation is defined as some connection between two or more entities, and entity relation extraction aims to discover the semantic relations between entities automatically. In the figure above, we can tell that a founder relation holds between “Steve Jobs” (a person) and “Apple” (a company)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1c\/1c56f0a92bec52692827e2fe2539a6f4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We refer to traditional entity relation extraction as the simple-context setting: it targets the semantic relation between two entities in a single sentence and ignores the influence of other entities and relations. Complex contexts usually take one of the two forms listed in the figure above. (1) Multiple triples in the same sentence influence one another. In the figure above (left), Donald J. Trump and Queens have a Born_in relation, and Queens and New York City have a Located_in relation, so one can also infer a Located_in relation between Donald J. Trump and New York City. (2) A large share of relations between entities are expressed across several sentences, which calls for cross-sentence relation extraction among multiple entities. The figure above (right) is a screenshot from the DocRED dataset released by Prof. Zhiyuan Liu's group at Tsinghua University. Statistics reported at the dataset's release show that, in the human-annotated Wikipedia data, at least 40% of relational facts can only be extracted by reading multiple sentences together."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Joint Entity and Relation Extraction"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/408728aa451712468a8eba6ffdd52e1a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Joint entity and relation extraction has been approached with many families of frameworks and methods, including sequence labeling, table filling, sequence-to-sequence generation and multi-task learning. Below we look at three of these families in detail."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Sequence Labeling Methods"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6a\/6ac8b1b9f824a7dd4f40aa667dd92d65.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Any discussion of sequence labeling methods has to mention the ACL 2017 paper “Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme”, which received an Outstanding Paper award. In this NovelTagging method, the authors propose a new relation tagging scheme. Each relation type is combined with a position marker (Begin, Inside, End, Single) and a role index for the head or tail entity (1, 2) to form a tag; the predicted tag sequence is then decoded into relation triples. The scheme also includes an extra Other tag for tokens that do not belong to any relation. With |R| relation types there are therefore 2*4*|R|+1 tags in total."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/da\/dab89f3845cf4f734a54de2dd1cc5788.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The authors also tried a range of classic sequence labeling architectures such as LSTM+CRF, where the CRF is good at capturing dependencies between nearby tags and the LSTM can capture long-range ones. Building on this, because an LSTM accumulates information from the hidden states of all previous time steps, the decoding layer of the encoder-decoder framework also uses an LSTM, as shown in the figure above (left). In addition, because the Other tag accounts for a large share of all tags, the authors add a loss factor (the bias weight) that controls how much the Other tag counts in the objective."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c7\/c76a8c5df60e205d1f0bd0d7d2f1fed0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the experiments, the authors use the NYT dataset and treat it as supervised (that is, labeled) data covering 24 relation types. The resulting F1 score is not especially high, but the results show that the approach is feasible for relation extraction and leave room for improved models built on it."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d0\/d04b98b446a7a2f03d061c25eb9948ca.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","at
trs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In entity relation extraction, relation triples can be divided into three types: Normal, EPO and SEO, where the latter two are known as overlapping triples. EPO (Entity Pair Overlap) means that more than one relation type holds between the same two entities; SEO (Single Entity Overlap) means that one entity takes part in two relation triples at once. NovelTagging handles Normal triples well but cannot cope with EPO or SEO triples. The reason is that a sequence labeling model assigns exactly one tag to each token (N inputs, N outputs), so a token cannot be marked as part of more than one triple."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3e\/3edf1447935470333b50840685bb7fe7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building on NovelTagging, Wei et al. designed a model called Hierarchical Binary Tagging, which casts triple extraction as a three-level problem and thereby handles overlapping triples better. Instead of treating relation extraction as predicting a discrete label for an entity pair, the method treats each relation as a mapping from the head entity to the tail entity."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. Table Filling Methods"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/41\/41e86a2225923f5231dbd44c34928628.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second family of joint extraction methods is table filling, first proposed in 2014 (see “Modeling Joint Entity and Relation Extraction with Table Representation”, EMNLP 2014). In the figure above, the diagonal cells hold entity labels (the output of named entity recognition) and the lower-triangular cells hold relations (each cell can be viewed as a softmax classification). An obvious limitation is that each cell can hold only one label, so EPO triples cannot be extracted."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/37\/3750d28ec01600451ff75dd50ebff835.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A notable improvement on table filling is the multi-head selection model (see “Joint entity recognition and relation extraction as a multi-head selection problem”). As the figure above shows, the architecture does not look much like table filling at first; it is in fact a multi-task framework. The model first runs a CRF on top of a BiLSTM to label the sequence and find entities. It then enumerates pairs of words in the sentence and feeds the LSTM features together with the label features of each pair into a sigmoid classifier that decides which relations hold between them. The work counts as table filling because pairing up words is equivalent to using the two words as the row and column indices of a table and filling the cell at that position with relation labels. Notably, to support the EPO and SEO cases described above, relation classification (that is, filling the table) uses a sigmoid function rather than a softmax, so several relation labels can hold between the same two words, which resolves the overlap problem."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. Sequence-to-Sequence Methods"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/87\/87c9b319048526ee6d6674fd9c035032.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next comes our own work, CopyRE. Here we treat triple extraction as a sequence-to-sequence generation problem: borrowing the idea of machine translation, the model reads the input sentence and generates the triples as the target sequence. With a plain encoder-decoder, some generated entities may not appear in the input sentence at all, because the decoder predicts over a vocabulary built from the whole corpus. We therefore use a copy mechanism that selects entities directly from the source sentence. Compared with NovelTagging, our model improves performance."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f8\/f89fd0e77b0c7794babb3ef8ec7e7dc0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With CopyRE we found that copying essentially captures only the last token of an entity, so when the head or tail entity consists of several tokens the extracted triple is incomplete. To address this we proposed CopyMTL, which uses sequence labeling to recognize multi-token entities and adds a non-linear transformation to the copy function, fixing the incomplete-copy problem."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/45\/4558bbecdc78846fdeafc5c3f5840560.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The work shown above is also our own. It starts from two observations:"}]},
{"type":"
paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) Sequence-to-sequence (Seq2Seq) methods have an inherent problem: the exposure bias caused by autoregressive decoding. Decoding the first entity depends on the start token, decoding the second entity depends on what the first entity was, and so on. An error at any step therefore makes the following steps unreliable. During training the model conditions on the correct labels (the ground truth), but at test time that guidance is gone and the generated labels inevitably contain errors. Exposure bias is thus a fairly serious problem."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) The model is forced to learn the order in which triples appear in the training data. In the figure above, for example, once the model has extracted the “graduate_from” relation it notices that this relation co-occurs with another one, goes on to extract that other relation, and loses track of “graduate_from”."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cb\/cba265f2002e84cf5a851fb7136789fe.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To address both problems we proposed the Seq2UMTree model, whose overall design is shown in the figure above. It has two motivations: (1) to see whether shortening the decoding length can mitigate exposure bias; and (2) to decode with a tree structure that does not depend on any ordering among triples. The first level of the tree predicts which relation types are present; the second level finds the start and end positions of the head entity; the third level finds the start and end positions of the tail entity. The method performs well. However many triples a sentence contains, the decoder has at most three levels (at most three decoding time steps), which matters because exposure bias tends to get worse as the number of decoding time steps grows."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Document-Level Relation Extraction"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bf\/bf1d8c985ee7fa428eeff8c727f7eb55.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second line of work in this talk is document-level relation extraction. The figure above shows an example from the DocRED dataset."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/33\/334010b78cb7c72b4edaf549af22bbf3.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The key challenges in document-level relation extraction are:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1) How to model multi-granularity information about entities effectively. This has two aspects: (a) as in the figure above, the same entity may appear in several sentences (that is, it has mentions in multiple sentences); (b) coreference arises naturally in written text."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2) How to model the complex semantics within a document, which involves several kinds of reasoning, such as logical reasoning, coreference reasoning and commonsense reasoning."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f0\/f03f2d3732b284965625c473826e49c7.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Early document-level relation extraction mostly extracted relations sentence by sentence and then aggregated the results. Here I mainly cover several works that tackle the task with graph neural networks. The first was published at ACL 2019 (see “Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network”). The authors call their model GCNN, and its modeling process has three steps:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① Model the document as a graph. Each word of the input document becomes a node, and edges of different types represent the structure of the whole document. As the figure above shows, the authors use syntactic dependency edges to capture dependencies between distant words within a sentence; use a coreference resolution tool to connect words that corefer; connect the syntactic roots of adjacent sentences; connect each word to the words immediately before and after it in the sentence; and finally connect each word to itself. Together these edges form the graph."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② Next, run the GCNN over the document graph to learn how information is exchanged and propagated between nodes, and compute a representation for every node. Every mention in every sentence thereby receives a representation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ Finally, aggregate all mentions of the target entities through multi-instance learning and classify the relation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9a\/9a2d831ecdf430bd9245dbd5635455c0.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Experiments on the CDR and CHR datasets gave good results. The ablation study in the table above is worth highlighting: most edge types improve the model, while coreference edges do not affect intra-sentence extraction but play an important role in inter-sentence extraction."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/58\/5821efb83e450c3fb8528555ac5d6978.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin"
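:null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second work comes from the same group of authors, and the proposed model is called EoG (see “Connecting the Dots: Document-level Neural Relation Extraction with Edge-oriented Graphs”, EMNLP 2019). Existing graph-based models mostly take entities as nodes and determine the relation from the representations of the two target nodes. A relation, however, is fundamentally about what lies between the two nodes, and it can be expressed better by a unique edge representation formed from the paths between them. EoG builds different types of edges between different types of nodes to control how much information flows into each node, which fits the heterogeneous interactions within a document more closely."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6d\/6de2762a055fb762325fbded291e2d7c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In EoG the authors build the graph with heuristics. The nodes are mentions, entities and sentences, and the edges are mention-mention (MM), mention-entity (ME), mention-sentence (MS), entity-sentence (ES) and sentence-sentence (SS). Once the graph is built, the next step is inference. Instead of reasoning with a GCN, EoG learns representations for the paths connecting two entities and iteratively re-weights and combines the information along those paths, as shown in the figure above (right), so that information propagates along the paths and produces the effect of reasoning."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/31\/31e481b0808866ceec0d939eb3dd0e29.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the experiments EoG outperforms the baselines it is compared with. Among the variants, EoG (Full) uses a fully connected graph, EoG (NoInf) drops the walk-based inference over the graph, and EoG (Sent) builds the graph at sentence level. The results show that cross-sentence inference is crucial when building document-level graphs. Graph structure also matters a great deal: the fully connected graph falls well short of the graph built deliberately with heuristic rules (the EoG approach)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/aa\/aaee247595fbc6ee62b6e2c559db3a1b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The works above show that graph structure is very important. This brings us to an ACL 2020 work, the LSR model (see “Reasoning with Latent Structure Refinement for Document-Level Relation Extraction”). Its motivating question is whether the graph structure itself can be learned from large amounts of text. The authors propose treating the graph structure as a latent variable and inducing it end to end: the initial graph can be built by random initialization or some other means, and each subsequent iteration refines it until the final structure is reached."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a7\/a74ed73bf2c8d9ddaf6d6179d231781b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the experiments, compared with methods such as GCNN, GAT and AGGCN, LSR also achieves very good results by learning the graph structure automatically."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/42\/4231b060946dce5fade1cfb98bc5730c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In my view, one interesting piece of work on document-level relation extraction comes from Peking University (see “Double Graph Based Reasoning for Document-level Relation Extraction”, EMNLP 2020) and is known as Double Graph. As stressed throughout, the two most important questions in document-level relation extraction are how to model the document and how to reason about relations. Double Graph separates the two. The first part runs a GCN or random walks over a mention-level graph to propagate information and reason within the graph. In the figure above (left), differently colored regions represent different sentences; because the entity shown as circle 2 appears in several sentences, its mentions are connected to each other, and all entity mentions that occur in the same sentence are connected as well. The second part, which is the more novel one, is the entity-level graph, obtained by taking a weighted combination of the mention-level nodes of each entity. When considering the relation between two entities, the paths connecting them are also taken into account and compressed into a vector. Finally, that compressed vector is fed to an MLP to classify the relation between the two entities."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5e\/5e002251338865b8f8dae5fbefeab6b6.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Double Graph was compared with the models mentioned earlier (such as GAT, GCNN, EoG and LSR) and achieved strong results."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary and Outlook"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To summarize briefly, this talk covered two areas: joint extraction and document-level extraction. For joint extraction the discussion centered on the sequence-to-sequence view, and we believe there is still a good deal of work to be done in that direction. Exposure bias came up repeatedly: Seq2UMTree shortens the decoding length to some extent, but it still inevitably suffers from the bias. Whether a sequence-to-set formulation can both avoid modeling the order of triples and remove the bias deserves further study. For document-level extraction, GCNs have become the mainstream approach, and GCN-based methods probably account for the largest share of document-level extraction papers at NLP venues. Overall, though, these methods mainly aggregate and propagate information at the mention, entity and sentence levels, whereas relation extraction should pay more attention to information passing at the entity-pair level, and there is room for more work here. Another open question is how to deal effectively with the “over-smoothing” problem that GNNs may face. Finally, how information actually propagates between different nodes in a heterogeneous graph built from a document is also worth studying."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from Datafuntalk (ID: datafuntalk)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original article: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/NAyuYMLDyx9Fut2blpvbRA","title":"xxx","type":null},"content":[{"type":"text","text":"Entity Relation Extraction in Complex Contexts"}]}]}]}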