DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications
Paper: https://arxiv.org/abs/1711.05073
Page: http://ai.baidu.com/broad/subordinate?dataset=dureader
Code: https://github.com/baidu/DuReader/
This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC.
DuReader has three advantages over previous MRC datasets:
data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated.
question types: it provides rich annotations for more question types, especially yes-no and opinion questions, leaving more opportunities for the research community.
scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far.
Introduction
Dataset | Lang | Que. | Docs | Source of Que. | Source of Docs | Answer Type |
---|---|---|---|---|---|---|
CNN/DM | EN | 1.4M | 300K | Synthetic cloze | News | Fill in entity |
HLF-RC | ZH | 100K | 28K | Synthetic cloze | Fairy/News | Fill in word |
CBT | EN | 688K | 108 | Synthetic cloze | Children’s books | Multi. choices |
RACE | EN | 870K | 50K | English exam | English exam | Multi. choices |
MCTest | EN | 2K | 500 | Crowd-sourced | Fictional stories | Multi. choices |
NewsQA | EN | 100K | 10K | Crowd-sourced | CNN | Span of words |
SQuAD | EN | 100K | 536 | Crowd-sourced | Wiki. | Span of words |
SearchQA | EN | 140K | 6.9M | QA site | Web doc. | Span of words |
TriviaQA | EN | 40K | 660K | Trivia websites | Wiki./Web doc. | Span/substring of words |
NarrativeQA | EN | 46K | 1.5K | Crowd-sourced | Book&movie | Manual summary |
MS-MARCO | EN | 100K | 200K | User logs | Web doc. | Manual summary |
DuReader | ZH | 200K | 1M | User logs | Web doc./CQA | Manual summary |
Table 1: Comparison of machine reading comprehension datasets.
Pilot Study
Examples | Fact | Opinion |
---|---|---|
Entity | iphone哪天發佈 | 2017最好看的十部電影 |
- | On which day will the iPhone be released | Top 10 movies of 2017 |
Description | 消防車爲什麼是紅的 | 豐田卡羅拉怎麼樣 |
- | Why are firetrucks red | How is the Toyota Corolla |
YesNo | 39.5度算高燒嗎 | 學圍棋能開發智力嗎 |
- | Is 39.5 degrees a high fever | Does learning to play Go improve intelligence |
Table 2: Examples of the six question types in Chinese.
Scaling up from the Pilot to DuReader
Data Collection and Annotation
Data Collection
Each sample in DuReader is a 4-tuple ⟨q, t, D, A⟩, where q is a question, t is its question type, D is a set of relevant documents, and A is a set of answers produced by human annotators.
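The 4-tuple above can be sketched as a small data class. This is an illustrative structure only; the field names are not the dataset's actual JSON schema.

```python
from dataclasses import dataclass

@dataclass
class DuReaderSample:
    """One DuReader sample as the 4-tuple <q, t, D, A> (illustrative fields)."""
    question: str        # q: the user-issued question
    question_type: str   # t: Entity / Description / YesNo, crossed with Fact / Opinion
    documents: list      # D: relevant documents retrieved for q
    answers: list        # A: answers written by human annotators

sample = DuReaderSample(
    question="Why are firetrucks red",
    question_type="Description",
    documents=["<full text of retrieved document 1>", "<document 2>"],
    answers=["<annotator-written answer>"],
)
```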
Question Type Annotation
Answer Annotation
Crowd-sourcing
Quality Control
Training, Development and Test Sets
Count | Train | Dev | Test |
---|---|---|---|
Questions | 181K | 10K | 10K |
Documents | 855K | 45K | 46K |
Answers | 376K | 20K | 21K |
The training, development and test sets consist of 181K, 10K and 10K questions, 855K, 45K and 46K documents, and 376K, 20K and 21K answers, respectively.
DuReader is (Relatively) Challenging
challenges:
The number of answers.
Figure 1: Distribution of the number of answers.
The edit distance.
The difference between the human-generated answers and the source documents is large.
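This gap can be quantified with the standard Levenshtein edit distance. A minimal dynamic-programming sketch (not code from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]
```

A large distance between an answer and its source document indicates that annotators summarized rather than copied spans.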
The document length.
In DuReader, questions tend to be short (4.8 words on average) compared to answers (69.6 words), and answers tend to be short compared to documents (396 words on average).
Experiments
Baseline Systems
Our baseline systems have two steps:
select the most relevant paragraph from each document
apply state-of-the-art MRC models to the selected paragraphs
Paragraph Selection
In the training stage, we select the paragraph with the largest overlap with the human-generated answer as the most relevant one in each document.
In the testing stage, since no human-generated answers are available, we select the paragraph with the largest overlap with the question as the most relevant one.
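The selection rule above can be sketched as a simple token-overlap heuristic. The paper does not specify the exact overlap measure; token-level recall is one plausible choice, used here purely for illustration.

```python
def overlap_score(paragraph: str, target: str) -> float:
    """Fraction of target tokens that also appear in the paragraph (recall)."""
    para_tokens, target_tokens = set(paragraph.split()), set(target.split())
    if not target_tokens:
        return 0.0
    return len(para_tokens & target_tokens) / len(target_tokens)

def select_paragraph(paragraphs: list, target: str) -> str:
    """Pick the paragraph with the largest overlap with `target`
    (the human answer at training time, the question at test time)."""
    return max(paragraphs, key=lambda p: overlap_score(p, target))
```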
Answer Span Selection
Match-LSTM
To find an answer in a paragraph, it goes through the paragraph sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the paragraph.
Finally, an answer pointer layer is used to find an answer span in the paragraph.
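The attention-weighted matching step can be illustrated with a NumPy sketch. The hidden states below are random stand-ins for LSTM outputs, and the dot-product scoring is a simplification of Match-LSTM's learned attention; the real model also feeds the matched sequence through an LSTM and a pointer network.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q_len, p_len = 8, 5, 12            # hidden size, question/paragraph lengths
H_q = rng.normal(size=(q_len, d))     # stand-in question LSTM states
H_p = rng.normal(size=(p_len, d))     # stand-in paragraph LSTM states

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# For each paragraph token, attend over the question and build the
# matched representation [h_p ; attention-weighted question summary].
matched = []
for h_p in H_p:
    scores = H_q @ h_p                # simplified dot-product attention scores
    alpha = softmax(scores)           # attention weights over question tokens
    q_summary = alpha @ H_q           # attention-weighted question representation
    matched.append(np.concatenate([h_p, q_summary]))
matched = np.stack(matched)           # (p_len, 2d): input to the match-LSTM
```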
BiDAF
It uses both context-to-question attention and question-to-context attention in order to highlight the important parts in both question and context.
After that, the so-called attention flow layer is used to fuse all useful information in order to get a vector representation for each position.
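The bidirectional attention can also be sketched in NumPy. The encodings are random stand-ins, and the similarity function is a plain dot product rather than BiDAF's trainable trilinear form; the fused matrix G below follows the paper's [h; ũ; h∘ũ; h∘h̃] layout in simplified form.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q_len, c_len = 8, 5, 12
Q = rng.normal(size=(q_len, d))        # stand-in question encodings
C = rng.normal(size=(c_len, d))        # stand-in context encodings

S = C @ Q.T                            # similarity matrix, shape (c_len, q_len)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Context-to-question: each context token attends over all question tokens.
a = softmax(S, axis=1)                 # (c_len, q_len)
U_tilde = a @ Q                        # attended question vector per context token

# Question-to-context: attend over context tokens most relevant to any question word.
b = softmax(S.max(axis=1))             # (c_len,)
h_tilde = b @ C                        # single attended context vector
H_tilde = np.tile(h_tilde, (c_len, 1)) # broadcast to every context position

# Fused per-position representation (simplified version of BiDAF's G).
G = np.concatenate([C, U_tilde, C * U_tilde, C * H_tilde], axis=1)
```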
Results and Analysis
We evaluate the reading comprehension task via character-level BLEU-4 and Rouge-L.
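Character-level Rouge-L is based on the longest common subsequence (LCS) between a candidate answer and a reference. A minimal implementation (the beta weighting follows the common Rouge-L F-score convention; the paper's exact evaluation script may differ):

```python
def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """Character-level Rouge-L F-score via longest common subsequence."""
    m, n = len(candidate), len(reference)
    if m == 0 or n == 0:
        return 0.0
    # LCS length by dynamic programming.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    prec, rec = lcs / m, lcs / n
    if prec == 0 or rec == 0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

Character-level scoring sidesteps Chinese word segmentation, which would otherwise make the metric depend on the segmenter.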
Systems | Baidu Search | - | Baidu Zhidao | - | All | - |
---|---|---|---|---|---|---|
- | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% |
Selected Paragraph | 15.8 | 22.6 | 16.5 | 38.3 | 16.4 | 30.2 |
Match-LSTM | 23.1 | 31.2 | 42.5 | 48.0 | 31.9 | 39.2 |
BiDAF | 23.1 | 31.1 | 42.2 | 47.5 | 31.8 | 39.0 |
Human | 55.1 | 54.4 | 57.1 | 60.7 | 56.1 | 57.4 |
Table 6: Performance of typical MRC systems on the DuReader.
Question type | Description | - | Entity | - | YesNo | - |
---|---|---|---|---|---|---|
- | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% |
Match-LSTM | 32.8 | 40.0 | 29.5 | 38.5 | 5.9 | 7.2 |
BiDAF | 32.6 | 39.7 | 29.8 | 38.4 | 5.5 | 7.5 |
Human | 58.1 | 58.0 | 44.6 | 52.0 | 56.2 | 57.4 |
Table 8: Performance on various question types.
Opinion-aware Evaluation
Question type | Fact | - | Opinion | - |
---|---|---|---|---|
- | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% |
Opinion-unaware | 6.3 | 8.3 | 5.0 | 7.1 |
Opinion-aware | 12.0 | 13.9 | 8.0 | 8.9 |
Table 9: Performance of opinion-aware model on YesNo questions.
Discussion
Conclusion
The paper proposed the DuReader dataset and provided several baseline systems.