ACE05 Relation Extraction Dataset

ACE05 Natural-Language Information Extraction Dataset

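Introduction

  • Dataset overview

    The dataset provides annotated entities, relations, and events of many types; at present it is used mainly for event extraction.

    Data is available in Chinese, English, and Arabic.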

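Annotation notes

  • The annotation process is as follows
  1. Two independent annotation passes, 1P and DUAL, are carried out first; their results are stored in the fp1 and fp2 directories of the corresponding corpus
  2. The results of these two passes are adjudicated, and the adjudicated annotations are stored in the adj directory of the corresponding corpus
  3. For the English corpus, the annotations in the adj directory go through one further processing step, and the results are stored in the timex2norm directory

The annotation stages and what each one annotates are shown below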

    1P: entities        DUAL: entities
        values                values
        events                events
        relations             relations
            |                    |
            |                    |
            |_________?__________|
                      |
                      |
                      |
                      V
                 ADJ: entities
                      values
                      events
                      relations
                      |
                      |
                      |
                      V
                 NORM: TIMEX2 normalization 
                       (English only)

Directory structure

  • The directory structure is as follows

    ─Arabic              # Arabic corpus
    │  ├─bn
    │  │  ├─adj
    │  │  ├─altAdj
    │  │  ├─fp1
    │  │  └─fp2
    │  ├─nw
    │  │  ├─adj
    │  │  ├─altAdj
    │  │  ├─fp1
    │  │  └─fp2
    │  └─wl
    │      ├─adj
    │      ├─fp1
    │      └─fp2
    ├─Chinese             # Chinese corpus
    │  ├─bn
    │  │  ├─adj
    │  │  ├─fp1
    │  │  └─fp2
    │  ├─nw
    │  │  ├─adj
    │  │  ├─fp1
    │  │  └─fp2
    │  └─wl
    │      ├─adj
    │      ├─fp1
    │      └─fp2
    ├─dtd               # DTD files describing the data formats
    └─English           # English corpus
        ├─bc
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─bn
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─cts
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─nw
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        ├─un
        │  ├─adj
        │  ├─fp1
        │  ├─fp2
        │  └─timex2norm
        └─wl
            ├─adj
            ├─fp1
            ├─fp2
            └─timex2norm
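
As a rough illustration of the layout above, the helper below picks the directory that holds the last annotation stage for a language/source pair: timex2norm for English, adj for Chinese and Arabic. The function name `final_annotation_dir` and the root path `ace05` are made up for this sketch, and the altAdj directories are ignored.

```python
from pathlib import PurePosixPath

# Source directories per language, copied from the tree above.
SOURCES = {
    "Arabic": ("bn", "nw", "wl"),
    "Chinese": ("bn", "nw", "wl"),
    "English": ("bc", "bn", "cts", "nw", "un", "wl"),
}


def final_annotation_dir(root, language, source):
    """Return the directory with the last annotation stage (hypothetical helper)."""
    if language not in SOURCES:
        raise ValueError(f"unknown language: {language}")
    if source not in SOURCES[language]:
        raise ValueError(f"{language} has no source directory {source!r}")
    # Only the English corpus has the extra TIMEX2 normalization stage.
    stage = "timex2norm" if language == "English" else "adj"
    return PurePosixPath(root) / language / source / stage


print(final_annotation_dir("ace05", "English", "bn"))
print(final_annotation_dir("ace05", "Chinese", "nw"))
```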
    

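File walkthrough

  • Each document in the corpus comes with the five kinds of files listed below

    Source Text (.sgm) Files
        - Source text files in SGM format; the .sgm files are UTF-8 encoded.
    ACE Program Format (APF) (.apf.xml) Files
        - These files use the ACE annotation file format.
    AG (.ag.xml) Files
        - Annotation files created with LDC's annotation tool; they were converted into the corresponding .apf.xml files.
    ID table (.tab) Files
        - These files store the mapping between the IDs in the ag.xml files and the IDs in the corresponding apf.xml files.
    AIF (.aif.xml) Files
        - Annotation files created with MITRE's Callisto; available only for the Arabic data produced by Valorem.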
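
To make the file naming concrete, here is a minimal sketch that splits a file name into a document ID and a file kind using the five suffixes listed above, then groups files by document. The helper names `split_ace_filename` and `group_by_document` are hypothetical and not part of any ACE tooling.

```python
# Suffix -> short label, taken from the five file kinds listed above.
SUFFIXES = {
    ".apf.xml": "apf",
    ".ag.xml": "ag",
    ".aif.xml": "aif",
    ".sgm": "sgm",
    ".tab": "tab",
}


def split_ace_filename(name):
    """Return (document_id, kind) for one file name."""
    for suffix, kind in SUFFIXES.items():
        if name.endswith(suffix):
            return name[: -len(suffix)], kind
    raise ValueError(f"not a recognised ACE file name: {name}")


def group_by_document(names):
    """Map each document ID to a {kind: file name} dictionary."""
    groups = {}
    for name in names:
        doc_id, kind = split_ace_filename(name)
        groups.setdefault(doc_id, {})[kind] = name
    return groups


files = [
    "CNN_ENG_20030630_085848.18.sgm",
    "CNN_ENG_20030630_085848.18.apf.xml",
    "CNN_ENG_20030630_085848.18.ag.xml",
    "CNN_ENG_20030630_085848.18.tab",
]
print(group_by_document(files))
```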
    

The walkthrough below uses /English/bn/CNN_ENG_20030630_085848.18 as a concrete example

  • Contents of CNN_ENG_20030630_085848.18.sgm (for the meaning of tags such as <DOC>, see dtd/ace_source_sgml.v1.0.2.dtd)

    <DOC>
    <DOCID> CNN_ENG_20030630_085848.18 </DOCID>  # document name (ID)
    <DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>  # document source
    <DATETIME> 2003-06-30 09:23:30 </DATETIME>  # timestamp
    <BODY>
    <TEXT>
    <TURN>  # body text
    a wildfire in california forced hundreds of people from their homes.
    the fire, near the historic state park started yesterday when a
    trailer, hauled by a pickup, ignited on the golden state freeway. the
    fire consumed more than 500 acres is only about 35% contained. no
    injuries have been reported thankfully hat this time.
    </TURN>
    </TEXT>
    </BODY>
    <ENDTIME> 2003-06-30 09:23:54 </ENDTIME>
    </DOC>
    
  • CNN_ENG_20030630_085848.18.apf.xml

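    The .apf.xml file is the text rendered as XML after ACE annotation of entities, relations, events, and other elements (the DTD documenting .apf.xml files is dtd/ace_source_sgml.apf.v5.1.1.dtd).

    A note on how to read dtd/ace_source_sgml.apf.v5.1.1.dtd: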

    <!ATTLIST relation           # the relation tag has the following attributes
                                 ID       ID                        #REQUIRED
                                                                    # #REQUIRED: the attribute is mandatory; #IMPLIED: it is optional
                                 TYPE     (PHYS|PART-WHOLE|PER-SOC|ORG-AFF|
                                           ART|GEN-AFF|METONYMY)    #REQUIRED
                                 SUBTYPE  (Located|Near|Geographical| # second-level category
                                           Subsidiary|Artifact|Business|
                                           Family|Lasting-Personal|Employment|
                                           Ownership|Founder|Student-Alum|
                                           Sports-Affiliation|
                                           Investor-Shareholder|
                                           Membership|
                                           User-Owner-Inventor-Manufacturer|
                                           Citizen-Resident-Religion-Ethnicity|
                                           Org-Location)            #IMPLIED
                                 MODALITY (Asserted|Other)          #IMPLIED
                                 TENSE    (Past|Present|Future|  # tense
                                           Unspecified)             #IMPLIED
    >
    

    The relation tag:

    <relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
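
Read this way, the ATTLIST excerpt is a small checklist: which attributes must be present and which values TYPE may take. The sketch below applies that checklist to a relation tag with Python's standard `xml.etree.ElementTree`. It is only an illustration: `check_relation` is a hypothetical helper, it covers just the #REQUIRED attributes and the TYPE enumeration from the excerpt, and it is not a substitute for validating against the DTD.

```python
import xml.etree.ElementTree as ET

# Enumerated TYPE values copied from the ATTLIST excerpt above.
RELATION_TYPES = {
    "PHYS", "PART-WHOLE", "PER-SOC", "ORG-AFF", "ART", "GEN-AFF", "METONYMY",
}
# Attributes marked #REQUIRED in the excerpt.
REQUIRED_ATTRS = ("ID", "TYPE")


def check_relation(elem):
    """Return a list of problems found on one <relation> element (empty if none)."""
    problems = [
        f"missing required attribute {name}"
        for name in REQUIRED_ATTRS
        if name not in elem.attrib
    ]
    rel_type = elem.get("TYPE")
    if rel_type is not None and rel_type not in RELATION_TYPES:
        problems.append(f"unknown TYPE {rel_type!r}")
    return problems


# The tag from the post, self-closed so that it parses on its own.
ok = ET.fromstring(
    '<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" '
    'SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" '
    'MODALITY="Asserted"/>'
)
bad = ET.fromstring('<relation TYPE="FOO"/>')  # made-up counterexample
print(check_relation(ok))
print(check_relation(bad))
```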
    
  • Back to CNN_ENG_20030630_085848.18.apf.xml: the annotated elements it contains include

    1. ENTITY

      <entity ID="CNN_ENG_20030630_085848.18-E2" TYPE="PER" SUBTYPE="Group" CLASS="USP">
        <entity_mention ID="CNN_ENG_20030630_085848.18-E2-2" TYPE="NOM" LDCTYPE="NOM">
          <extent>
            <charseq START="100" END="117">hundreds of people</charseq>
          </extent>
          <head>
            <charseq START="112" END="117">people</charseq>
          </head>
        </entity_mention>
        <entity_mention ID="CNN_ENG_20030630_085848.18-E2-3" TYPE="PRO" LDCTYPE="PRO">
          <extent>
            <charseq START="124" END="128">their</charseq>
          </extent>
          <head>
            <charseq START="124" END="128">their</charseq>
          </head>
        </entity_mention>
      </entity>
      <entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
        <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
          <extent>
            <charseq START="124" END="134">their homes</charseq>
          </extent>
          <head>
            <charseq START="130" END="134">homes</charseq>
          </head>
        </entity_mention>
      </entity>
      
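      • entity has four required attributes: ID, TYPE, SUBTYPE, and CLASS

      • The TYPE attribute of entity takes one of seven values: PER, ORG, LOC, GPE, FAC, VEH, and WEA. Each type has several subtypes; see dtd/ace_source_sgml.apf.v5.1.1.dtd for the full list.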

        TYPE="PER" SUBTYPE="Individual"
        TYPE="PER" SUBTYPE="Group"
        TYPE="PER" SUBTYPE="Indeterminate"
        
        TYPE="ORG" SUBTYPE="Government"
        ...
        
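      • entity_mention breaks an entity down into its individual mentions. It has two child tags, extent and head: extent is the full span of the mention, and head is the key word within that span. It also has a set of attributes such as ID, TYPE, LDCTYPE, and ROLE.

      • entity can also carry external_link and entity_attributes: external_link records external links associated with a term, and entity_attributes marks new terms that may be added to the lexicon in the future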
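
The extent/head structure can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses the E3 entity from the excerpt above (whitespace condensed) and returns one record per mention. The helper names `read_span` and `entity_mentions` and the record layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The E3 entity from the excerpt above, with whitespace condensed.
APF_FRAGMENT = """
<entity ID="CNN_ENG_20030630_085848.18-E3" TYPE="FAC" SUBTYPE="Building-Grounds" CLASS="SPC">
  <entity_mention ID="CNN_ENG_20030630_085848.18-E3-4" TYPE="NOM" LDCTYPE="NOM">
    <extent><charseq START="124" END="134">their homes</charseq></extent>
    <head><charseq START="130" END="134">homes</charseq></head>
  </entity_mention>
</entity>
"""


def read_span(parent, tag):
    """Read the <charseq> under the given child tag (extent or head)."""
    charseq = parent.find(f"{tag}/charseq")
    return {
        "text": charseq.text,
        "start": int(charseq.get("START")),
        "end": int(charseq.get("END")),
    }


def entity_mentions(entity):
    """Return one record per <entity_mention> of an <entity> element."""
    records = []
    for mention in entity.findall("entity_mention"):
        records.append({
            "entity_id": entity.get("ID"),
            "entity_type": entity.get("TYPE"),
            "mention_id": mention.get("ID"),
            "mention_type": mention.get("TYPE"),
            "extent": read_span(mention, "extent"),
            "head": read_span(mention, "head"),
        })
    return records


entity = ET.fromstring(APF_FRAGMENT.strip())
for record in entity_mentions(entity):
    print(record["mention_type"], record["extent"]["text"], "->", record["head"]["text"])
```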

    2. VALUE

      <value ID="CNN_ENG_20030630_085848.18-V1" TYPE="Numeric" SUBTYPE="Percent">
        <value_mention ID="CNN_ENG_20030630_085848.18-V1-1">
          <extent>
            <charseq START="319" END="320">35</charseq>
          </extent>
        </value_mention>
      </value>
      
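      • value carries the attributes ID, TYPE, and SUBTYPE; note that the Crime, Job-Title, and Sentence types listed below have no SUBTYPE

      • The TYPE attribute of value takes one of five values: Numeric, Contact-Info, Crime, Job-Title, and Sentence. The subtypes are listed in dtd/ace_source_sgml.apf.v5.1.1.dtd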

        TYPE="Numeric" SUBTYPE="Money"
        TYPE="Numeric" SUBTYPE="Percent"
        TYPE="Contact-Info" SUBTYPE="Phone-Number"
        TYPE="Contact-Info" SUBTYPE="E-Mail"
        TYPE="Contact-Info" SUBTYPE="URL"
        
        TYPE="Crime"
        TYPE="Job-Title"
        TYPE="Sentence"
        
      • The value_mention tag is analogous to the entity_mention tag above, although in this example it carries only an extent child tag and no head

    3. timex2

      <timex2 ID="CNN_ENG_20030630_085848.18-T1" VAL="2003-06-30T09:23:30">
        <timex2_mention ID="CNN_ENG_20030630_085848.18-T1-1">
          <extent>
            <charseq START="44" END="62">2003-06-30 09:23:30</charseq>
          </extent>
        </timex2_mention>
      </timex2>
      <timex2 ID="CNN_ENG_20030630_085848.18-T2" VAL="2003-06-29">
        <timex2_mention ID="CNN_ENG_20030630_085848.18-T2-1">
          <extent>
            <charseq START="184" END="192">yesterday</charseq>
          </extent>
        </timex2_mention>
      </timex2>
      <timex2 ID="CNN_ENG_20030630_085848.18-T3" VAL="2003-06-30TMO">
        <timex2_mention ID="CNN_ENG_20030630_085848.18-T3-1">
          <extent>
            <charseq START="380" END="388">this time</charseq>
          </extent>
        </timex2_mention>
      </timex2>
      
      • The optional attributes of timex2 include VAL (the time in normalized form)

      • timex2_mention works the same way as the mention tags above
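
The charseq START/END values in these excerpts appear to be character offsets into the .sgm file with the tags stripped out, with END inclusive. For instance, T1's span 44..62 lands on the DATETIME string rather than on anything inside <TURN>. The sketch below checks this against the sample document. The sample string is shortened to the first sentence and leaves out the explanatory # comments from the listing; `strip_tags` and `charseq` are hypothetical helper names.

```python
import re

# First part of the sample .sgm file, without the explanatory "#" comments.
SGM = (
    "<DOC>\n"
    "<DOCID> CNN_ENG_20030630_085848.18 </DOCID>\n"
    '<DOCTYPE SOURCE="broadcast news"> NEWS STORY </DOCTYPE>\n'
    "<DATETIME> 2003-06-30 09:23:30 </DATETIME>\n"
    "<BODY>\n<TEXT>\n<TURN>\n"
    "a wildfire in california forced hundreds of people from their homes.\n"
    "</TURN>\n</TEXT>\n</BODY>\n</DOC>\n"
)


def strip_tags(sgm):
    """Remove the SGML tags but keep every other character, newlines included."""
    return re.sub(r"<[^>]+>", "", sgm)


def charseq(text, start, end):
    """Slice with an inclusive END, as the APF offsets seem to use."""
    return text[start:end + 1]


text = strip_tags(SGM)
print(repr(charseq(text, 44, 62)))    # extent of timex2 T1
print(repr(charseq(text, 93, 98)))    # anchor of event EV1
print(repr(charseq(text, 124, 134)))  # extent of entity mention E3-4
```

If the slices match the strings stored inside the charseq tags, the offsets can be used to align annotations with tokenized text.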

    4. RELATION

      <relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
        <relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>
        <relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>
        <relation_mention ID="CNN_ENG_20030630_085848.18-R1-1" LEXICALCONDITION="Possessive">
          <extent>
            <charseq START="124" END="134">their homes</charseq>
          </extent>
          <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-3" ROLE="Arg-1">
            <extent>
              <charseq START="124" END="128">their</charseq>
            </extent>
          </relation_mention_argument>
          <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Arg-2">
            <extent>
              <charseq START="124" END="134">their homes</charseq>
            </extent>
          </relation_mention_argument>
        </relation_mention>
      </relation>
      
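      • The TYPE attribute of relation describes the relation between the two arguments that follow, ROLE="Arg-1" and ROLE="Arg-2"; the main TYPE/SUBTYPE pairs include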

        <!-- List of TYPE/SUBTYPE pairs (as of May 7, 2005)
        
        TYPE="PHYS" SUBTYPE="Located"
        TYPE="PHYS" SUBTYPE="Near"
        
        TYPE="PART-WHOLE" SUBTYPE="Geographical"
        TYPE="PART-WHOLE" SUBTYPE="Subsidiary"
        TYPE="PART-WHOLE" SUBTYPE="Artifact"
        ...
        TYPE="METONYMY" (no SUBTYPE)
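
For relation extraction, each relation_mention can be flattened into a (TYPE, SUBTYPE, Arg-1 text, Arg-2 text) tuple. The sketch below does this for the R1 relation shown above, with the XML condensed. `relation_tuples` is a hypothetical helper, and it only looks at the Arg-1 and Arg-2 roles.

```python
import xml.etree.ElementTree as ET

# The R1 relation from the excerpt above, with whitespace condensed.
RELATION_XML = """
<relation ID="CNN_ENG_20030630_085848.18-R1" TYPE="ART" SUBTYPE="User-Owner-Inventor-Manufacturer" TENSE="Unspecified" MODALITY="Asserted">
  <relation_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Arg-1"/>
  <relation_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Arg-2"/>
  <relation_mention ID="CNN_ENG_20030630_085848.18-R1-1" LEXICALCONDITION="Possessive">
    <extent><charseq START="124" END="134">their homes</charseq></extent>
    <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-3" ROLE="Arg-1">
      <extent><charseq START="124" END="128">their</charseq></extent>
    </relation_mention_argument>
    <relation_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Arg-2">
      <extent><charseq START="124" END="134">their homes</charseq></extent>
    </relation_mention_argument>
  </relation_mention>
</relation>
"""


def relation_tuples(relation):
    """Return (TYPE, SUBTYPE, arg1_text, arg2_text) for each relation_mention."""
    tuples = []
    for mention in relation.findall("relation_mention"):
        # Map each argument role to the text of its extent.
        args = {
            arg.get("ROLE"): arg.find("extent/charseq").text
            for arg in mention.findall("relation_mention_argument")
        }
        tuples.append((
            relation.get("TYPE"),
            relation.get("SUBTYPE"),  # None when the relation has no SUBTYPE
            args.get("Arg-1"),
            args.get("Arg-2"),
        ))
    return tuples


relation = ET.fromstring(RELATION_XML.strip())
print(relation_tuples(relation))
```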
        
    5. EVENT

      <event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">
        <event_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Artifact"/>
        <event_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Origin"/>
        <event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
          <extent>
            <charseq START="93" END="134">forced hundreds of people from their homes</charseq>
          </extent>
          <ldc_scope>
            <charseq START="68" END="134">a wildfire in california forced hundreds of people from their homes</charseq>
          </ldc_scope>
          <anchor>
            <charseq START="93" END="98">forced</charseq>
          </anchor>
          <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
            <extent>
              <charseq START="100" END="117">hundreds of people</charseq>
            </extent>
          </event_mention_argument>
          <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
            <extent>
              <charseq START="124" END="134">their homes</charseq>
            </extent>
          </event_mention_argument>
        </event_mention>
      </event>
      
      • The TYPE/SUBTYPE values of event are as follows

        TYPE="Life" SUBTYPE="Be-Born"
        TYPE="Life" SUBTYPE="Die"
        TYPE="Life" SUBTYPE="Marry"
        TYPE="Life" SUBTYPE="Divorce"
        TYPE="Life" SUBTYPE="Injure"
        TYPE="Transaction" SUBTYPE="Transfer-Ownership"
        TYPE="Transaction" SUBTYPE="Transfer-Money"
        TYPE="Movement" SUBTYPE="Transport"
        TYPE="Business" SUBTYPE="Start-Org"
        TYPE="Business" SUBTYPE="End-Org"
        ...
        TYPE="Justice" SUBTYPE="Pardon"
        TYPE="Justice" SUBTYPE="Appeal"
        
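      • event has six required attributes: TYPE, SUBTYPE, MODALITY, POLARITY, GENERICITY, and TENSE

      • Its child tags are event_argument and event_mention

      • event_mention contains the child tags extent, ldc_scope, anchor, and event_mention_argument; ldc_scope covers the whole sentence, and anchor is the event trigger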
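
For event extraction, the pieces that usually matter are the trigger (anchor) and the role-labelled arguments of each event_mention. The sketch below pulls them out of the EV1 event shown above, with the XML condensed. `event_records` and the "Type.Subtype" label format are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The EV1 event from the excerpt above, with whitespace condensed.
EVENT_XML = """
<event ID="CNN_ENG_20030630_085848.18-EV1" TYPE="Movement" SUBTYPE="Transport" MODALITY="Asserted" POLARITY="Positive" GENERICITY="Specific" TENSE="Past">
  <event_argument REFID="CNN_ENG_20030630_085848.18-E2" ROLE="Artifact"/>
  <event_argument REFID="CNN_ENG_20030630_085848.18-E3" ROLE="Origin"/>
  <event_mention ID="CNN_ENG_20030630_085848.18-EV1-1">
    <extent><charseq START="93" END="134">forced hundreds of people from their homes</charseq></extent>
    <ldc_scope><charseq START="68" END="134">a wildfire in california forced hundreds of people from their homes</charseq></ldc_scope>
    <anchor><charseq START="93" END="98">forced</charseq></anchor>
    <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E2-2" ROLE="Artifact">
      <extent><charseq START="100" END="117">hundreds of people</charseq></extent>
    </event_mention_argument>
    <event_mention_argument REFID="CNN_ENG_20030630_085848.18-E3-4" ROLE="Origin">
      <extent><charseq START="124" END="134">their homes</charseq></extent>
    </event_mention_argument>
  </event_mention>
</event>
"""


def event_records(event):
    """Return the label, trigger, sentence, and arguments of each event_mention."""
    records = []
    for mention in event.findall("event_mention"):
        records.append({
            "label": f'{event.get("TYPE")}.{event.get("SUBTYPE")}',
            "trigger": mention.find("anchor/charseq").text,
            "sentence": mention.find("ldc_scope/charseq").text,
            "arguments": [
                (arg.get("ROLE"), arg.find("extent/charseq").text)
                for arg in mention.findall("event_mention_argument")
            ],
        })
    return records


event = ET.fromstring(EVENT_XML.strip())
for record in event_records(event):
    print(record["label"], record["trigger"], record["arguments"])
```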

This post draws on the article at https://blog.csdn.net/carrie_0307/article/details/91417203
