[paper] DuReader

DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

Paper: https://arxiv.org/abs/1711.05073

Page: http://ai.baidu.com/broad/subordinate?dataset=dureader

Code: https://github.com/baidu/DuReader/


This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC.

DuReader has three advantages over previous MRC datasets:

  1. data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated.

  2. question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunity for the research community.

  3. scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far.

Introduction

Dataset      Lang  Que.   Docs   Source of Que.    Source of Docs     Answer Type
CNN/DM       EN    1.4M   300K   Synthetic cloze   News               Fill in entity
HFL-RC       ZH    100K   28K    Synthetic cloze   Fairy/News         Fill in word
CBT          EN    688K   108    Synthetic cloze   Children's books   Multi. choices
RACE         EN    870K   50K    English exam      English exam       Multi. choices
MCTest       EN    2K     500    Crowd-sourced     Fictional stories  Multi. choices
NewsQA       EN    100K   10K    Crowd-sourced     CNN                Span of words
SQuAD        EN    100K   536    Crowd-sourced     Wiki.              Span of words
SearchQA     EN    140K   6.9M   QA site           Web doc.           Span of words
TriviaQA     EN    40K    660K   Trivia websites   Wiki./Web doc.     Span/substring of words
NarrativeQA  EN    46K    1.5K   Crowd-sourced     Book&movie         Manual summary
MS-MARCO     EN    100K   200K   User logs         Web doc.           Manual summary
DuReader     ZH    200K   1M     User logs         Web doc./CQA       Manual summary

Table 1: Comparison of machine reading comprehension datasets

Pilot Study

Examples     Fact                                           Opinion
Entity       iphone哪天發佈                                  2017最好看的十部電影
             (On which day will the iPhone be released)     (Top 10 movies of 2017)
Description  消防車爲什麼是紅的                              豐田卡羅拉怎麼樣
             (Why are firetrucks red)                       (How is the Toyota Corolla)
YesNo        39.5度算高燒嗎                                  學圍棋能開發智力嗎
             (Is 39.5 degrees a high fever)                 (Does learning to play Go improve intelligence)

Table 2: Examples of the six question types in Chinese

Scaling up from the Pilot to DuReader

Data Collection and Annotation

Data Collection

Each sample in DuReader is represented as a 4-tuple {q, t, D, A}, where q is a question, t is its question type, D is a set of relevant documents, and A is a set of answers produced by human annotators.
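
The 4-tuple above maps naturally onto a small record type. A minimal sketch — the field names are my own, not the official DuReader JSON schema, and the example contents are hypothetical placeholders around a question from Table 2:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DuReaderSample:
    """One DuReader sample {q, t, D, A}."""
    question: str                  # q
    question_type: str             # t, e.g. Entity / Description / YesNo
    documents: List[str] = field(default_factory=list)  # D: relevant documents
    answers: List[str] = field(default_factory=list)    # A: human-written answers

# Hypothetical sample built around a question from Table 2.
sample = DuReaderSample(
    question="39.5度算高燒嗎",
    question_type="YesNo",
    documents=["(retrieved web documents would go here)"],
    answers=["(a human-written answer would go here)"],
)
```

In the paper, t additionally carries the fact/opinion dimension, so a full type label would combine both (e.g. YesNo + Opinion).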

Question Type Annotation

Answer Annotation


Crowd-sourcing

Quality Control

Training, Development and Test Sets

Count      Train   Dev    Test
Questions  181K    10K    10K
Documents  855K    45K    46K
Answers    376K    20K    21K

The training, development and test sets contain 181K/10K/10K questions, 855K/45K/46K documents, and 376K/20K/21K answers, respectively.

DuReader is (Relatively) Challenging

challenges:

  1. The number of answers.

    Figure 1: Distribution of the number of answers per question

  2. The edit distance.

    The difference between the human-generated answers and the source documents is large.

  3. The document length.

    In DuReader, questions are short (4.8 words on average) compared to answers (69.6 words on average), and answers are short compared to documents (396 words on average).
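
The edit-distance gap in challenge 2 can be measured with plain Levenshtein distance. A minimal character-level sketch — a simplification for illustration, not necessarily the paper's exact measurement:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

A large distance between an answer and its closest document passage indicates the answer was abstractively rewritten rather than extracted.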

Experiments

Baseline Systems

Our baseline systems have two steps:

  1. select the most relevant paragraph from each document

  2. apply state-of-the-art MRC models to the selected paragraphs

Paragraph Selection

In the training stage, we select, from each document, the paragraph with the largest overlap with the human-generated answer as the most relevant paragraph.

In the testing stage, since no human-generated answers are available, we select the paragraph with the largest overlap with the question as the most relevant one.
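
Both stages use the same overlap heuristic with a different target string. A minimal sketch — the tokenization (characters) and scoring function are assumptions, not the paper's exact recipe:

```python
from collections import Counter

def overlap_score(paragraph: str, target: str) -> float:
    """Fraction of target tokens (characters here) also present in the paragraph."""
    p, t = Counter(paragraph), Counter(target)
    return sum((p & t).values()) / max(sum(t.values()), 1)

def select_paragraph(paragraphs: list, target: str) -> str:
    """Pick the paragraph overlapping most with the target: the human answer
    at training time, the question itself at test time."""
    return max(paragraphs, key=lambda para: overlap_score(para, target))
```

At training time one would call `select_paragraph(doc_paragraphs, answer)`; at test time, `select_paragraph(doc_paragraphs, question)`.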

Answer Span Selection

  • Match-LSTM

    To find an answer in a paragraph, it goes through the paragraph sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the paragraph.

    Finally, an answer pointer layer is used to find an answer span in the paragraph.

  • BiDAF

    It uses both context-to-question attention and question-to-context attention in order to highlight the important parts in both question and context.

    After that, the so-called attention flow layer is used to fuse all useful information in order to get a vector representation for each position.
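
Both attention directions in BiDAF are derived from one similarity matrix between context and question tokens. A toy numpy sketch of context-to-question attention — real BiDAF uses a trainable similarity function over LSTM-encoded vectors, not a plain dot product:

```python
import numpy as np

def c2q_attention(context: np.ndarray, question: np.ndarray) -> np.ndarray:
    """context: (T, d), question: (J, d). Returns (T, d): for each context
    token, an attention-weighted average of the question token vectors."""
    sim = context @ question.T                 # (T, J) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)      # stabilize the softmax
    att = np.exp(sim)
    att /= att.sum(axis=1, keepdims=True)      # row-wise softmax over question tokens
    return att @ question                      # (T, d)
```

Question-to-context attention reuses the same matrix, taking a softmax over the column maxima instead of the rows.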

Results and Analysis


We evaluate the reading comprehension task via character-level BLEU-4 and Rouge-L.
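
Character-level Rouge-L reduces to the longest common subsequence between candidate and reference. A minimal sketch of the precision/recall/F form — the shared DuReader evaluation script is the authoritative implementation, and the beta value here is an assumption:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """Character-level Rouge-L F-measure, recall-weighted by beta."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    prec = lcs / len(candidate)
    rec = lcs / len(reference)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```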

Systems             Baidu Search        Baidu Zhidao        All
                    BLEU-4%  Rouge-L%   BLEU-4%  Rouge-L%   BLEU-4%  Rouge-L%
Selected Paragraph  15.8     22.6       16.5     38.3       16.4     30.2
Match-LSTM          23.1     31.2       42.5     48.0       31.9     39.2
BiDAF               23.1     31.1       42.2     47.5       31.8     39.0
Human               55.1     54.4       57.1     60.7       56.1     57.4

Table 6: Performance of typical MRC systems on the DuReader.

Question type  Description         Entity              YesNo
               BLEU-4%  Rouge-L%   BLEU-4%  Rouge-L%   BLEU-4%  Rouge-L%
Match-LSTM     32.8     40.0       29.5     38.5       5.9      7.2
BiDAF          32.6     39.7       29.8     38.4       5.5      7.5
Human          58.1     58.0       44.6     52.0       56.2     57.4

Table 8: Performance on various question types.

Opinion-aware Evaluation

Question type    Fact                Opinion
                 BLEU-4%  Rouge-L%   BLEU-4%  Rouge-L%
Opinion-unaware  6.3      8.3        5.0      7.1
Opinion-aware    12.0     13.9       8.0      8.9

Table 9: Performance of opinion-aware model on YesNo questions.

Discussion

Conclusion

The paper proposes the DuReader dataset and provides several baseline systems.
