Introduction to the FUNSD Dataset

Original

2020-06-20 09:18

Dataset overview

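FUNSD (Form Understanding in Noisy Scanned Documents) is a dataset for form understanding on noisy scanned documents.

Form understanding here means extracting the text content of a form and turning it into structured data.

The dataset contains 199 real, fully annotated, scanned forms.

The documents are noisy and the forms vary widely in appearance, which makes form understanding a challenging task.

The dataset can be used for a range of tasks, including text detection, optical character recognition (OCR), spatial layout analysis, and entity labeling/linking.

According to its authors, it is the first publicly available dataset with comprehensive annotations for the form understanding (FoUn) task.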

Dataset composition

The dataset consists of the original images (images) and their annotations (annotations).

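The original images are a subset of the RVL-CDIP dataset, a collection of 400,000 grayscale images of various document types at a low resolution of roughly 100 dpi. Because image quality is poor and noise is heavy, the authors screened the 25,000 form images, discarded the unreadable and near-duplicate ones to obtain 3,200 eligible images, and then randomly selected 199 of them for annotation.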

The annotations are stored in JSON format, as shown in the figure below:

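Notes:

box: the position is given by the top-left and bottom-right corners, so the four values are [x0, y0, x1, y1].
label: one of [question, answer, header, other].
linking: the list of other entities that this entity links to.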
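
The notes above can be made concrete with a small sketch. The sample record is hypothetical (invented for illustration, not taken from the dataset). The `form`, `id`, and `text` fields, and the `[from_id, to_id]` layout of each `linking` pair, are assumptions about the file layout and should be checked against the downloaded annotation files.

```python
import json
from collections import Counter

# Hypothetical annotation snippet, invented for illustration only.
# Field names other than box/label/linking are assumptions.
SAMPLE = """
{
  "form": [
    {"id": 0, "text": "Date:", "box": [10, 20, 60, 35],
     "label": "question", "linking": [[0, 1]]},
    {"id": 1, "text": "06/20/2020", "box": [70, 20, 150, 35],
     "label": "answer", "linking": [[0, 1]]}
  ]
}
"""


def box_size(box):
    """Return (width, height) of a [x0, y0, x1, y1] box."""
    x0, y0, x1, y1 = box
    return x1 - x0, y1 - y0


def summarize(annotation):
    """Count entities per label and collect the distinct links."""
    entities = annotation["form"]
    label_counts = Counter(e["label"] for e in entities)
    # The same pair may be listed on both linked entities, so deduplicate.
    links = {tuple(pair) for e in entities for pair in e["linking"]}
    return label_counts, sorted(links)


annotation = json.loads(SAMPLE)
label_counts, links = summarize(annotation)
sizes = {e["id"]: box_size(e["box"]) for e in annotation["form"]}
print(dict(label_counts), links, sizes)
```

To use the sketch on a real file, replace `json.loads(SAMPLE)` with `json.load` on an opened annotation file, after confirming the field names.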

Data distribution of the training and test sets

Overall statistics

Split	Forms	Words	Entities	Relations
Training	149	22,512	7,411	4,236
Testing	50	8,973	2,332	1,076
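
As a rough way to read the table, the figures can be turned into per-form averages. This is a minimal sketch; the dictionary and function names are illustrative only, and the numbers are copied from the table above.

```python
# Figures copied from the statistics table above.
STATS = {
    "Training": {"forms": 149, "words": 22_512, "entities": 7_411, "relations": 4_236},
    "Testing": {"forms": 50, "words": 8_973, "entities": 2_332, "relations": 1_076},
}


def per_form(split):
    """Average number of words, entities, and relations per form."""
    row = STATS[split]
    return {k: row[k] / row["forms"] for k in ("words", "entities", "relations")}


total_forms = sum(row["forms"] for row in STATS.values())
for split in STATS:
    averages = per_form(split)
    print(split, {k: round(v, 1) for k, v in averages.items()})
print("total forms:", total_forms)
```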

Entity class distribution

Split	Header	Question	Answer	Other	Total
Training	441	3,266	2,802	902	7,411
Testing	122	1,077	821	312	2,332
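
The class counts can be checked against the totals and converted into shares. This is a small sketch using only the numbers in the table above; the names `COUNTS`, `TOTALS`, and `class_shares` are illustrative.

```python
# Per-class entity counts copied from the table above.
COUNTS = {
    "Training": {"header": 441, "question": 3_266, "answer": 2_802, "other": 902},
    "Testing": {"header": 122, "question": 1_077, "answer": 821, "other": 312},
}
TOTALS = {"Training": 7_411, "Testing": 2_332}


def class_shares(split):
    """Fraction of entities belonging to each class in a split."""
    total = sum(COUNTS[split].values())
    return {label: n / total for label, n in COUNTS[split].items()}


for split, counts in COUNTS.items():
    # The four class counts should add up to the Total column.
    assert sum(counts.values()) == TOTALS[split]
    shares = class_shares(split)
    print(split, {label: f"{share:.1%}" for label, share in shares.items()})
```

In both splits the question and answer classes account for most entities, so any per-class metric on header or other rests on comparatively few examples.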

Paper: https://arxiv.org/pdf/1905.13538.pdf
Dataset download: https://guillaumejaume.github.io/FUNSD/

(Note: if you spot any mistakes, please point them out!)
