DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications
Paper: https://arxiv.org/abs/1711.05073
Page: http://ai.baidu.com/broad/subordinate?dataset=dureader
Code: https://github.com/baidu/DuReader/
This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC.
DuReader has three advantages over previous MRC datasets:
data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated.
question types: it provides rich annotations for more question types, especially yes-no and opinion questions, leaving more opportunities for the research community.
scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far.
Introduction
Dataset | Lang | Que. | Docs | Source of Que. | Source of Docs | Answer Type |
---|---|---|---|---|---|---|
CNN/DM | EN | 1.4M | 300K | Synthetic cloze | News | Fill in entity |
HLF-RC | ZH | 100K | 28K | Synthetic cloze | Fairy/News | Fill in word |
CBT | EN | 688K | 108 | Synthetic cloze | Children’s books | Multi. choices |
RACE | EN | 870K | 50K | English exam | English exam | Multi. choices |
MCTest | EN | 2K | 500 | Crowd-sourced | Fictional stories | Multi. choices |
NewsQA | EN | 100K | 10K | Crowd-sourced | CNN | Span of words |
SQuAD | EN | 100K | 536 | Crowd-sourced | Wiki. | Span of words |
SearchQA | EN | 140K | 6.9M | QA site | Web doc. | Span of words |
TriviaQA | EN | 40K | 660K | Trivia websites | Wiki./Web doc. | Span/substring of words |
NarrativeQA | EN | 46K | 1.5K | Crowd-sourced | Book&movie | Manual summary |
MS-MARCO | EN | 100K | 200K | User logs | Web doc. | Manual summary |
DuReader | ZH | 200K | 1M | User logs | Web doc./CQA | Manual summary |
Table 1: Comparison of machine reading comprehension datasets.
Pilot Study
Examples | Fact | Opinion |
---|---|---|
Entity | iphone哪天發佈 | 2017最好看的十部電影 |
- | On which day will the iPhone be released | Top 10 movies of 2017 |
Description | 消防車爲什麼是紅的 | 豐田卡羅拉怎麼樣 |
- | Why are firetrucks red | How is the Toyota Corolla |
YesNo | 39.5度算高燒嗎 | 學圍棋能開發智力嗎 |
- | Is 39.5 degrees a high fever | Does learning to play Go improve intelligence |
Table 2: Examples of the six question types in Chinese.
Scaling up from the Pilot to DuReader
Data Collection and Annotation
Data Collection
Each sample in DuReader is a 4-tuple ⟨q, t, D, A⟩, where q is a question, t is its question type, D is a set of relevant documents, and A is a set of answers produced by human annotators.
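The 4-tuple above can be sketched as a small data class. This is an illustrative structure only; the field names are not the dataset's actual JSON schema.

```python
from dataclasses import dataclass

@dataclass
class DuReaderSample:
    """One DuReader sample as the 4-tuple <q, t, D, A> (illustrative fields)."""
    question: str        # q: the user-issued question
    question_type: str   # t: Entity / Description / YesNo, crossed with Fact / Opinion
    documents: list      # D: relevant documents retrieved for q
    answers: list        # A: answers written by human annotators

sample = DuReaderSample(
    question="Why are firetrucks red",
    question_type="Description",
    documents=["<full text of retrieved document 1>", "<document 2>"],
    answers=["<annotator-written answer>"],
)
```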
Question Type Annotation
Answer Annotation
Crowd-sourcing
Quality Control
Training, Development and Test Sets
Count | Train | Dev | Test |
---|---|---|---|
Questions | 181K | 10K | 10K |
Documents | 855K | 45K | 46K |
Answers | 376K | 20K | 21K |
The training, development and test sets consist of 181K, 10K and 10K questions, 855K, 45K and 46K documents, and 376K, 20K and 21K answers, respectively.
DuReader is (Relatively) Challenging
challenges:
The number of answers.
Figure 1: Distribution of the number of answers.
The edit distance.
The difference between the human-generated answers and the source documents is large.
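This gap can be quantified with the standard Levenshtein edit distance. A minimal dynamic-programming sketch (not code from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]
```

A large distance between an answer and its source document indicates that annotators summarized rather than copied spans.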
The document length.
In DuReader, questions tend to be short (4.8 words on average) compared to answers (69.6 words), and answers tend to be short compared to documents (396 words on average).
Experiments
Baseline Systems
Our baseline systems have two steps:
select the most relevant paragraph from each document
apply state-of-the-art MRC models to the selected paragraphs
Paragraph Selection
In the training stage, we select the paragraph with the largest overlap with the human-generated answer as the most relevant one in each document.
In the testing stage, since no human-generated answers are available, we select the paragraph with the largest overlap with the question as the most relevant one.
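The selection rule above can be sketched as a simple token-overlap heuristic. The paper does not specify the exact overlap measure; token-level recall is one plausible choice, used here purely for illustration.

```python
def overlap_score(paragraph: str, target: str) -> float:
    """Fraction of target tokens that also appear in the paragraph (recall)."""
    para_tokens, target_tokens = set(paragraph.split()), set(target.split())
    if not target_tokens:
        return 0.0
    return len(para_tokens & target_tokens) / len(target_tokens)

def select_paragraph(paragraphs: list, target: str) -> str:
    """Pick the paragraph with the largest overlap with `target`
    (the human answer at training time, the question at test time)."""
    return max(paragraphs, key=lambda p: overlap_score(p, target))
```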
Answer Span Selection
Match-LSTM
To find an answer in a paragraph, it goes through the paragraph sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the paragraph.
Finally, an answer pointer layer is used to find an answer span in the paragraph.
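The attention-weighted matching step can be illustrated with a NumPy sketch. The hidden states below are random stand-ins for LSTM outputs, and the dot-product scoring is a simplification of Match-LSTM's learned attention; the real model also feeds the matched sequence through an LSTM and a pointer network.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q_len, p_len = 8, 5, 12            # hidden size, question/paragraph lengths
H_q = rng.normal(size=(q_len, d))     # stand-in question LSTM states
H_p = rng.normal(size=(p_len, d))     # stand-in paragraph LSTM states

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# For each paragraph token, attend over the question and build the
# matched representation [h_p ; attention-weighted question summary].
matched = []
for h_p in H_p:
    scores = H_q @ h_p                # simplified dot-product attention scores
    alpha = softmax(scores)           # attention weights over question tokens
    q_summary = alpha @ H_q           # attention-weighted question representation
    matched.append(np.concatenate([h_p, q_summary]))
matched = np.stack(matched)           # (p_len, 2d): input to the match-LSTM
```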
BiDAF
It uses both context-to-question attention and question-to-context attention in order to highlight the important parts in both question and context.
After that, the so-called attention flow layer is used to fuse all useful information in order to get a vector representation for each position.
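The bidirectional attention can also be sketched in NumPy. The encodings are random stand-ins, and the similarity function is a plain dot product rather than BiDAF's trainable trilinear form; the fused matrix G below follows the paper's [h; ũ; h∘ũ; h∘h̃] layout in simplified form.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q_len, c_len = 8, 5, 12
Q = rng.normal(size=(q_len, d))        # stand-in question encodings
C = rng.normal(size=(c_len, d))        # stand-in context encodings

S = C @ Q.T                            # similarity matrix, shape (c_len, q_len)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Context-to-question: each context token attends over all question tokens.
a = softmax(S, axis=1)                 # (c_len, q_len)
U_tilde = a @ Q                        # attended question vector per context token

# Question-to-context: attend over context tokens most relevant to any question word.
b = softmax(S.max(axis=1))             # (c_len,)
h_tilde = b @ C                        # single attended context vector
H_tilde = np.tile(h_tilde, (c_len, 1)) # broadcast to every context position

# Fused per-position representation (simplified version of BiDAF's G).
G = np.concatenate([C, U_tilde, C * U_tilde, C * H_tilde], axis=1)
```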
Results and Analysis
We evaluate the reading comprehension task via character-level BLEU-4 and Rouge-L.
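Character-level Rouge-L is based on the longest common subsequence (LCS) between a candidate answer and a reference. A minimal implementation (the beta weighting follows the common Rouge-L F-score convention; the paper's exact evaluation script may differ):

```python
def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """Character-level Rouge-L F-score via longest common subsequence."""
    m, n = len(candidate), len(reference)
    if m == 0 or n == 0:
        return 0.0
    # LCS length by dynamic programming.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    prec, rec = lcs / m, lcs / n
    if prec == 0 or rec == 0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```

Character-level scoring sidesteps Chinese word segmentation, which would otherwise make the metric depend on the segmenter.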
Systems | Baidu Search | - | Baidu Zhidao | - | All | - |
---|---|---|---|---|---|---|
- | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% |
Selected Paragraph | 15.8 | 22.6 | 16.5 | 38.3 | 16.4 | 30.2 |
Match-LSTM | 23.1 | 31.2 | 42.5 | 48.0 | 31.9 | 39.2 |
BiDAF | 23.1 | 31.1 | 42.2 | 47.5 | 31.8 | 39.0 |
Human | 55.1 | 54.4 | 57.1 | 60.7 | 56.1 | 57.4 |
Table 6: Performance of typical MRC systems on the DuReader.
Question type | Description | - | Entity | - | YesNo | - |
---|---|---|---|---|---|---|
- | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% |
Match-LSTM | 32.8 | 40.0 | 29.5 | 38.5 | 5.9 | 7.2 |
BiDAF | 32.6 | 39.7 | 29.8 | 38.4 | 5.5 | 7.5 |
Human | 58.1 | 58.0 | 44.6 | 52.0 | 56.2 | 57.4 |
Table 8: Performance on various question types.
Opinion-aware Evaluation
Question type | Fact | - | Opinion | - |
---|---|---|---|---|
- | BLEU-4% | Rouge-L% | BLEU-4% | Rouge-L% |
Opinion-unaware | 6.3 | 8.3 | 5.0 | 7.1 |
Opinion-aware | 12.0 | 13.9 | 8.0 | 8.9 |
Table 9: Performance of opinion-aware model on YesNo questions.
Discussion
Conclusion
The paper proposed the DuReader dataset and provided several baseline systems.