ACE05 Natural Language Information Extraction Dataset
Introduction
-
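Dataset Overview
The dataset provides annotated entities, relations, and events of many types; at present it is used mainly for event extraction.
It contains Chinese, English, and Arabic data.
Annotation Notes
- The annotation process is as follows
- Two annotation passes, 1P and DUAL, are carried out first, and their results are stored in the fp1 and fp2 directories of the corresponding corpus, respectively
- The results of these two passes are adjudicated, and the adjudicated annotations are stored in the adj directory of the corresponding corpus
- For the English corpus, the annotations under adj go through one more processing step (TIMEX2 normalization), and the results are stored in the timex2norm directory
The annotation passes and the content each one covers are shown below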
1P:  entities            DUAL: entities
     values                    values
     events                    events
     relations                 relations
         |                         |
         |                         |
         |_________________________|
                      |
                      |
                      |
                      V
            ADJ: entities
                 values
                 events
                 relations
                      |
                      |
                      |
                      V
            NORM: TIMEX2 normalization
                  (English only)
Directory Structure
-
The directory structure is as follows
├─Arabic            # Arabic corpus
│  ├─bn
│  │  ├─adj
│  │  ├─altAdj
│  │  ├─fp1
│  │  └─fp2
│  ├─nw
│  │  ├─adj
│  │  ├─altAdj
│  │  ├─fp1
│  │  └─fp2
│  └─wl
│     ├─adj
│     ├─fp1
│     └─fp2
├─Chinese           # Chinese corpus
│  ├─bn
│  │  ├─adj
│  │  ├─fp1
│  │  └─fp2
│  ├─nw
│  │  ├─adj
│  │  ├─fp1
│  │  └─fp2
│  └─wl
│     ├─adj
│     ├─fp1
│     └─fp2
├─dtd               # data description files
└─English           # English corpus
   ├─bc
   │  ├─adj
   │  ├─fp1
   │  ├─fp2
   │  └─timex2norm
   ├─bn
   │  ├─adj
   │  ├─fp1
   │  ├─fp2
   │  └─timex2norm
   ├─cts
   │  ├─adj
   │  ├─fp1
   │  ├─fp2
   │  └─timex2norm
   ├─nw
   │  ├─adj
   │  ├─fp1
   │  ├─fp2
   │  └─timex2norm
   ├─un
   │  ├─adj
   │  ├─fp1
   │  ├─fp2
   │  └─timex2norm
   └─wl
      ├─adj
      ├─fp1
      ├─fp2
      └─timex2norm
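The layout can also be navigated from code. The following is a minimal sketch, assuming the corpus is unpacked under a local root directory; the function name `annotation_dir`, the `STAGE_DIRS` table, and the root name `ace05` are illustrative and not part of the dataset.

```python
from pathlib import PurePosixPath

# Annotation stage -> directory that stores its output (taken from the tree).
STAGE_DIRS = {"1P": "fp1", "DUAL": "fp2", "ADJ": "adj", "NORM": "timex2norm"}


def annotation_dir(root, language, source, stage):
    """Build the directory path for one language / source type / stage."""
    if stage == "NORM" and language != "English":
        # timex2norm appears only under the English corpus in the tree.
        raise ValueError("timex2norm exists only for the English corpus")
    return PurePosixPath(root) / language / source / STAGE_DIRS[stage]


# "ace05" is a hypothetical root directory name.
print(annotation_dir("ace05", "English", "bn", "NORM"))
print(annotation_dir("ace05", "Chinese", "wl", "ADJ"))
```

`PurePosixPath` is used only so that the printed paths look the same on every platform; `pathlib.Path` would be the natural choice when actually opening files.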
File Walkthrough
-
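Each document is represented by the five kinds of files listed below
- Source Text (.sgm) Files - the source text files in SGM format; the .sgm files are UTF-8 encoded
- ACE Program Format (APF) (.apf.xml) Files - files in the ACE annotation file format
- AG (.ag.xml) Files - annotation files created with LDC's annotation tools; they are converted into the corresponding .apf.xml files
- ID table (.tab) Files - mapping tables between the IDs used in the ag.xml files and those in the corresponding apf.xml files
- AIF (.aif.xml) Files - annotation files created with MITRE's Callisto; they apply only to the Arabic data produced by Valorem
The rest of this section walks through /English/bn/CNN_ENG_20030630_085848.18 as a concrete example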
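To collect the files that belong to one document, the names can be grouped by their suffix. This is a rough sketch that assumes all files of a document share the same base name (the .sgm and .apf.xml names in this walkthrough do; the .ag.xml and .tab names in the example list are assumed to follow the same pattern). The helper `group_by_document` is hypothetical.

```python
from collections import defaultdict

# File suffixes of the five file kinds.
SUFFIXES = (".apf.xml", ".ag.xml", ".aif.xml", ".sgm", ".tab")


def group_by_document(filenames):
    """Map each document ID to the files found for it, keyed by suffix."""
    documents = defaultdict(dict)
    for name in filenames:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                documents[name[: -len(suffix)]][suffix] = name
                break
    return dict(documents)


files = [
    "CNN_ENG_20030630_085848.18.sgm",
    "CNN_ENG_20030630_085848.18.apf.xml",
    "CNN_ENG_20030630_085848.18.ag.xml",  # assumed name
    "CNN_ENG_20030630_085848.18.tab",     # assumed name
]
for doc_id, found in group_by_document(files).items():
    print(doc_id, sorted(found))
```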
-
Contents of CNN_ENG_20030630_085848.18.sgm (for the meaning of tags such as <DOC>, see dtd/ace_source_sgml.v1.0.2.dtd)
<DOC>
<DOCID> CNN_ENG_20030630_085848.18 </DOCID>  # document name
<DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>  # document source
<DATETIME> 2003-06-30 09:23:30 </DATETIME>  # time
<BODY>
<TEXT>
<TURN>  # body text
a wildfire in california forced hundreds of people from their homes. the fire, near the historic state park started yesterday when a trailer, hauled by a pickup, ignited on the golden state freeway. the fire consumed more than 500 acres is only about 35% contained. no injuries have been reported thankfully hat this time.
</TURN>
</TEXT>
</BODY>
<ENDTIME> 2003-06-30 09:23:54 </ENDTIME>
</DOC>
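The START and END values in the .apf.xml annotations for this document appear to index into the .sgm text with the tags removed, with END inclusive. The sketch below checks that reading against a few of the annotated spans. It assumes that every tag sits on its own line and that the body text is separated by single whitespace characters; the helper names `strip_tags` and `charseq` are made up for illustration.

```python
import re

# The sample document; the line layout is an assumption, the text is from the file.
raw = (
    "<DOC>\n"
    "<DOCID> CNN_ENG_20030630_085848.18 </DOCID>\n"
    '<DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>\n'
    "<DATETIME> 2003-06-30 09:23:30 </DATETIME>\n"
    "<BODY>\n<TEXT>\n<TURN>\n"
    "a wildfire in california forced hundreds of people from their homes. "
    "the fire, near the historic state park started yesterday when a trailer, "
    "hauled by a pickup, ignited on the golden state freeway. "
    "the fire consumed more than 500 acres is only about 35% contained. "
    "no injuries have been reported thankfully hat this time.\n"
    "</TURN>\n</TEXT>\n</BODY>\n"
    "<ENDTIME> 2003-06-30 09:23:54 </ENDTIME>\n"
    "</DOC>\n"
)


def strip_tags(sgm):
    """Drop every <...> tag but keep all other characters, including newlines."""
    return re.sub(r"<[^>]+>", "", sgm)


def charseq(text, start, end):
    """Return the span START..END, treating END as inclusive."""
    return text[start:end + 1]


text = strip_tags(raw)
for start, end in [(44, 62), (93, 98), (184, 192), (319, 320), (380, 388)]:
    print(start, end, repr(charseq(text, start, end)))
```

Under these assumptions the spans line up with the strings recorded in the annotation file, which is also why the unusual word "hat" in the last sentence has to be kept as it is: correcting it would shift every later offset.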
-
CNN_ENG_20030630_085848.18.apf.xml
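A .apf.xml file holds the ACE annotations (entities, relations, events, and other elements) for the text, presented in XML; the document that describes the .apf.xml format is dtd/ace_source_sgml.apf.v5.1.1.dtd.
A short note on how to read dtd/ace_source_sgml.apf.v5.1.1.dtd, using the attribute list of relation as the example
<!ATTLIST relation                          # the relation tag has the following attributes
    ID        ID                            #REQUIRED   # REQUIRED means the attribute is mandatory
    TYPE      (PHYS|PART-WHOLE|PER-SOC|ORG-AFF|
               ART|GEN-AFF|METONYMY)        #REQUIRED
    SUBTYPE   (Located|Near|Geographical|   # second-level category
               Subsidiary|Artifact|Business|
               Family|Lasting-Personal|Employment|
               Ownership|Founder|Student-Alum|
               Sports-Affiliation|
               Investor-Shareholder|
               Membership|
               User-Owner-Inventor-Manufacturer|
               Citizen-Resident-Religion-Ethnicity|
               Org-Location)                #IMPLIED    # IMPLIED means the attribute is optional
    MODALITY  (Asserted|Other)              #IMPLIED
    TENSE     (Past|Present|Future|         # tense
               Unspecified)                 #IMPLIED
>
A relation tag that follows this declaration: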
<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
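The declaration can be turned into a small consistency check. The sketch below hand-copies the required attributes and part of the enumerations from the ATTLIST above into Python tables and tests a tag against them; it is an illustration of how to read the DTD, not a DTD validator, and the names `ALLOWED`, `REQUIRED`, and `check_relation` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copied by hand from the ATTLIST for relation (SUBTYPE omitted for brevity).
REQUIRED = ("ID", "TYPE")
ALLOWED = {
    "TYPE": {"PHYS", "PART-WHOLE", "PER-SOC", "ORG-AFF", "ART", "GEN-AFF", "METONYMY"},
    "MODALITY": {"Asserted", "Other"},
    "TENSE": {"Past", "Present", "Future", "Unspecified"},
}


def check_relation(elem):
    """Return a list of problems found in one relation element."""
    problems = []
    for name in REQUIRED:
        if name not in elem.attrib:
            problems.append(f"missing required attribute {name}")
    for name, allowed in ALLOWED.items():
        value = elem.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not an allowed value")
    return problems


# The tag is self-closed here only so that it parses on its own.
tag = ('<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" '
       'SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" '
       'MODALITY="Asserted"/>')
print(check_relation(ET.fromstring(tag)))
```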
-
Returning to CNN_ENG_20030630_085848.18.apf.xml, the annotated elements include
-
ENTITY
<entity ID="CNN_ENG_20030630_085848.18-E2" TYPE="PER" SUBTYPE="Group" CLASS="USP">
  <entity_mention ID="CNN_ENG_20030630_085848.18-E2-2" TYPE="NOM" LDCTYPE="NOM">
    <extent>
      <charseq START="100" END="117">hundreds of people</charseq>
    </extent>
    <head>
      <charseq START="112" END="117">people</charseq>
    </head>
  </entity_mention>
  <entity_mention ID="CNN_ENG_20030630_085848.18-E2-3" TYPE="PRO" LDCTYPE="PRO">
    <extent>
      <charseq START="124" END="128">their</charseq>
    </extent>
    <head>
      <charseq START="124" END="128">their</charseq>
    </head>
  </entity_mention>
</entity>
<entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
  <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
    <extent>
      <charseq START="124" END="134">their homes</charseq>
    </extent>
    <head>
      <charseq START="130" END="134">homes</charseq>
    </head>
  </entity_mention>
</entity>
-
entity has four required attributes: ID, TYPE, SUBTYPE, and CLASS
-
The TYPE attribute of entity takes one of seven values: PER, ORG, LOC, GPE, FAC, VEH, and WEA. Each type has several subtypes, which are listed in dtd/ace_source_sgml.apf.v5.1.1.dtd:
TYPE="PER" SUBTYPE="Individual"
TYPE="PER" SUBTYPE="Group"
TYPE="PER" SUBTYPE="Indeterminate"
TYPE="ORG" SUBTYPE="Government"
...
-
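entity_mention records one individual mention of the entity. It has two child tags, extent and head: extent is the full span of the mention, and head is the most important word within it. It also carries a set of attributes such as ID, TYPE, LDCTYPE, and ROLE.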
-
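entity also has two further items, external_link and entity_attributes: external_link records any external links associated with the entity, and entity_attributes holds new terms that may be added to the lexicon in the future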
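As a rough illustration of this structure, the sketch below reads one entity with Python's standard `xml.etree.ElementTree` and lists its mentions. The XML is the E3 entity from this document, embedded as a string so that the example runs on its own; the function name `mentions` and the dictionary keys are illustrative choices.

```python
import xml.etree.ElementTree as ET

ENTITY_XML = """
<entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
  <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
    <extent><charseq START="124" END="134">their homes</charseq></extent>
    <head><charseq START="130" END="134">homes</charseq></head>
  </entity_mention>
</entity>
""".strip()


def mentions(entity):
    """Collect ID, mention type, extent text, head text and head span per mention."""
    rows = []
    for mention in entity.findall("entity_mention"):
        extent = mention.find("extent/charseq")
        head = mention.find("head/charseq")
        rows.append({
            "id": mention.get("ID"),
            "mention_type": mention.get("TYPE"),
            "extent": extent.text,
            "head": head.text,
            "head_span": (int(head.get("START")), int(head.get("END"))),
        })
    return rows


entity = ET.fromstring(ENTITY_XML)
print(entity.get("TYPE"), entity.get("SUBTYPE"), entity.get("CLASS"))
for row in mentions(entity):
    print(row)
```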
-
-
VALUE
<value ID="CNN_ENG_20030630_085848.18-V1" TYPE="Numeric" SUBTYPE="Percent">
  <value_mention ID="CNN_ENG_20030630_085848.18-V1-1">
    <extent>
      <charseq START="319" END="320">35</charseq>
    </extent>
  </value_mention>
</value>
-
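value has three required attributes: ID, TYPE, and SUBTYPE (note, however, that Crime, Job-Title, and Sentence appear without a SUBTYPE in the list of pairs for value)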
-
The TYPE attribute of value takes one of five values: Numeric, Contact-Info, Crime, Job-Title, and Sentence. The subtypes are listed in dtd/ace_source_sgml.apf.v5.1.1.dtd:
TYPE="Numeric" SUBTYPE="Money"
TYPE="Numeric" SUBTYPE="Percent"
TYPE="Contact-Info" SUBTYPE="Phone-Number"
TYPE="Contact-Info" SUBTYPE="E-Mail"
TYPE="Contact-Info" SUBTYPE="URL"
TYPE="Crime"
TYPE="Job-Title"
TYPE="Sentence"
-
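The value_mention tag is similar to entity_mention and is described as having extent and head child tags (the V1 example in this document contains only extent)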
-
-
timex2
<timex2 ID="CNN_ENG_20030630_085848.18-T1" VAL="2003-06-30T09:23:30">
  <timex2_mention ID="CNN_ENG_20030630_085848.18-T1-1">
    <extent>
      <charseq START="44" END="62">2003-06-30 09:23:30</charseq>
    </extent>
  </timex2_mention>
</timex2>
<timex2 ID="CNN_ENG_20030630_085848.18-T2" VAL="2003-06-29">
  <timex2_mention ID="CNN_ENG_20030630_085848.18-T2-1">
    <extent>
      <charseq START="184" END="192">yesterday</charseq>
    </extent>
  </timex2_mention>
</timex2>
<timex2 ID="CNN_ENG_20030630_085848.18-T3" VAL="2003-06-30TMO">
  <timex2_mention ID="CNN_ENG_20030630_085848.18-T3-1">
    <extent>
      <charseq START="380" END="388">this time</charseq>
    </extent>
  </timex2_mention>
</timex2>
-
The optional attributes of timex2 include VAL (the time in normalized form)
-
timex2_mention works in the same way as the mention tags described earlier
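The three VAL strings in this document show that not every normalized value is a plain ISO 8601 timestamp: 2003-06-30T09:23:30 and 2003-06-29 are, while 2003-06-30TMO is not. A cautious way to consume VAL, sketched below with a hypothetical helper `parse_val`, is to try the standard parser and fall back to the raw string.

```python
from datetime import datetime


def parse_val(val):
    """Return a datetime for plain ISO 8601 values, otherwise the raw string."""
    try:
        return datetime.fromisoformat(val)
    except ValueError:
        # TIMEX2 values such as "2003-06-30TMO" are kept untouched.
        return val


for val in ["2003-06-30T09:23:30", "2003-06-29", "2003-06-30TMO"]:
    print(val, "->", repr(parse_val(val)))
```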
-
-
RELATION
<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
  <relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>
  <relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>
  <relation_mention ID="CNN_ENG_20030630_085848.18-R1-1" LEXICALCONDITION="Possessive">
    <extent>
      <charseq START="124" END="134">their homes</charseq>
    </extent>
    <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-3" ROLE="Arg-1">
      <extent>
        <charseq START="124" END="128">their</charseq>
      </extent>
    </relation_mention_argument>
    <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Arg-2">
      <extent>
        <charseq START="124" END="134">their homes</charseq>
      </extent>
    </relation_mention_argument>
  </relation_mention>
</relation>
-
The TYPE attribute of relation gives the relation that holds between the two arguments that follow it, ROLE="Arg-1" and ROLE="Arg-2". The main relation types are
<!-- List of TYPE/SUBTYPE pairs (as of May 7, 2005)
TYPE="PHYS" SUBTYPE="Located"
TYPE="PHYS" SUBTYPE="Near"
TYPE="PART-WHOLE" SUBTYPE="Geographical"
TYPE="PART-WHOLE" SUBTYPE="Subsidiary"
TYPE="PART-WHOLE" SUBTYPE="Artifact"
...
TYPE="METONYMY" (no SUBTYPE)
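For relation extraction it is common to flatten each relation into an (Arg-1, label, Arg-2) triple. The sketch below does this for the R1 relation from this document, keeping only the entity-level arguments; the function name `relation_triple` and the "TYPE/SUBTYPE" label format are illustrative choices, not something defined by the dataset.

```python
import xml.etree.ElementTree as ET

RELATION_XML = """
<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
  <relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>
  <relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>
</relation>
""".strip()


def relation_triple(relation):
    """Return (Arg-1 entity ID, TYPE/SUBTYPE label, Arg-2 entity ID)."""
    args = {a.get("ROLE"): a.get("REFID")
            for a in relation.findall("relation_argument")}
    label = relation.get("TYPE")
    if relation.get("SUBTYPE"):  # METONYMY has no SUBTYPE
        label += "/" + relation.get("SUBTYPE")
    return args.get("Arg-1"), label, args.get("Arg-2")


print(relation_triple(ET.fromstring(RELATION_XML)))
```

The REFID values point at entity IDs, so resolving them to text requires a lookup in the entity section of the same file.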
-
-
EVENT
<event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">
  <event_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Artifact"/>
  <event_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Origin"/>
  <event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
    <extent>
      <charseq START="93" END="134">forced hundreds of people from their homes</charseq>
    </extent>
    <ldc_scope>
      <charseq START="68" END="134">a wildfire in california forced hundreds of people from their homes</charseq>
    </ldc_scope>
    <anchor>
      <charseq START="93" END="98">forced</charseq>
    </anchor>
    <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
      <extent>
        <charseq START="100" END="117">hundreds of people</charseq>
      </extent>
    </event_mention_argument>
    <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
      <extent>
        <charseq START="124" END="134">their homes</charseq>
      </extent>
    </event_mention_argument>
  </event_mention>
</event>
-
The TYPE and SUBTYPE values of event are as follows
TYPE="Life" SUBTYPE="Be-Born"
TYPE="Life" SUBTYPE="Die"
TYPE="Life" SUBTYPE="Marry"
TYPE="Life" SUBTYPE="Divorce"
TYPE="Life" SUBTYPE="Injure"
TYPE="Transaction" SUBTYPE="Transfer-Ownership"
TYPE="Transaction" SUBTYPE="Transfer-Money"
TYPE="Movement" SUBTYPE="Transport"
TYPE="Business" SUBTYPE="Start-Org"
TYPE="Business" SUBTYPE="End-Org"
...
TYPE="Justice" SUBTYPE="Pardon"
TYPE="Justice" SUBTYPE="Appeal"
-
event has six required attributes: TYPE, SUBTYPE, MODALITY, POLARITY, GENERICITY, and TENSE
-
Its child tags are event_argument and event_mention
-
event_mention contains the child tags extent, ldc_scope, anchor, and event_mention_argument; ldc_scope covers the whole sentence, and anchor is the event trigger
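For event extraction the pieces of interest are usually the trigger (anchor) and the role-labelled arguments. The sketch below pulls both out of the EV1-1 mention from this document and, along the way, checks that the START/END offsets are inclusive and consistent with the mention's own extent. The helper names `span` and `summarize` are hypothetical, and ldc_scope is left out of the embedded XML for brevity.

```python
import xml.etree.ElementTree as ET

EVENT_MENTION_XML = """
<event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
  <extent><charseq START="93" END="134">forced hundreds of people from their homes</charseq></extent>
  <anchor><charseq START="93" END="98">forced</charseq></anchor>
  <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
    <extent><charseq START="100" END="117">hundreds of people</charseq></extent>
  </event_mention_argument>
  <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
    <extent><charseq START="124" END="134">their homes</charseq></extent>
  </event_mention_argument>
</event_mention>
""".strip()


def span(elem, child):
    """Return (start, end, text) of the charseq directly under the child tag."""
    cs = elem.find(f"{child}/charseq")
    return int(cs.get("START")), int(cs.get("END")), cs.text


def summarize(mention):
    """Return (trigger, [(role, argument text), ...]) for one event mention."""
    ext_start, _, ext_text = span(mention, "extent")
    trig_start, trig_end, trigger = span(mention, "anchor")
    # Offsets are document-level and END is inclusive, so shift by the
    # extent's START and slice one character past END.
    assert ext_text[trig_start - ext_start:trig_end - ext_start + 1] == trigger
    arguments = []
    for arg in mention.findall("event_mention_argument"):
        start, end, text = span(arg, "extent")
        assert ext_text[start - ext_start:end - ext_start + 1] == text
        arguments.append((arg.get("ROLE"), text))
    return trigger, arguments


print(summarize(ET.fromstring(EVENT_MENTION_XML)))
```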
-
-
Reference: the article at https://blog.csdn.net/carrie_0307/article/details/91417203