WenetSpeech, the World's Largest Multi-Domain Mandarin Speech Recognition Dataset, Is Officially Released and Open for Download

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Recently, the Audio, Speech and Language Processing Group (ASLP Lab) at Northwestern Polytechnical University, Mobvoi, and AISHELL jointly released WenetSpeech, a 10,000-hour multi-domain Mandarin speech recognition dataset. With strong support from Tencent Meeting's Tianlai Lab, Huawei MindSpore, the Xi'an Future Artificial Intelligence Computing Center, and other organizations, the dataset is now open for download. To request access: "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/wenet-e2e.github.io\/WenetSpeech\/ "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The work has been submitted to ICASSP 2022, a top conference in speech research. For details, see:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/arxiv.org\/pdf\/2110.03370.pdf"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9a\/9afa84ea8e19f0012f88163ba3b82e8c.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"WenetSpeech 
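Overview"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over the past decade, deep learning has driven rapid progress in speech recognition technology and its applications. Products and services built on speech recognition, such as voice search, voice input methods, smart speakers, smart TVs, smart wearables, intelligent customer service, and robots, are now part of everyday life."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Open-source Mandarin speech datasets, however, are small, cover a single scenario, and pose little challenge, so they cannot show how well a model generalizes to large data volumes and complex scenarios. For example, AIShell-2, currently the largest open-source Mandarin dataset, contains 1,000 hours of read-style recordings, and mainstream recognition systems reach error rates as low as about 5.3% on its test set."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Industry typically works with much larger in-house datasets that academia cannot access, which has opened a serious gap between academic and industrial research on Mandarin speech recognition. In addition, unsupervised learning and self-learning, both active research topics, also lack a public, standard, large-scale dataset for Mandarin speech recognition."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This year, Facebook released Multilingual LibriSpeech, a 50,000-hour English audiobook dataset for supervised learning, and SpeechColab released GigaSpeech, a 10,000-hour multi-domain English dataset. Inspired by this work, and because Mandarin speech recognition research urgently needs a standard, large-scale, multi-domain dataset, we designed and built WenetSpeech."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition to 10,000+ hours of high-quality labeled data, WenetSpeech includes 2,400+ hours of weakly labeled data and 22,400+ 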
hours of audio in total. It covers a wide range of internet audio and video, background noise conditions, and speaking styles, and its sources span 10 domains: audiobooks, commentary, documentaries, TV dramas, interviews, news, read speech, talks, variety shows, and others. The figure below gives detailed per-domain statistics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/05\/05a4ed6d2c46a35e1785f4e9e4c44083.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Collection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"All WenetSpeech data comes from the internet: two thirds from YouTube and one third from Podcast."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the YouTube data, we manually selected videos with embedded hard subtitles (burned into the video stream, not external subtitle files) and mined them with an OCR 
system, shown in the figure below, which works as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Text detection: detect text in the current video frame."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Subtitle position validation: decide whether a detected text region is a valid subtitle region."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Subtitle change detection: with the subtitle position and region known, check that region across consecutive frames until the subtitle changes, which yields the subtitle's start and end times."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Text recognition: run OCR on the subtitle region to obtain the text."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Extract the audio for the time span found in step 3 and pair it with the text from step 4. The result is the subtitle text and its corresponding audio, that is, a candidate parallel text-audio pair for speech recognition training."}]},{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7a\/7a6cce3ab0ce15e83d48fc53603aab2e.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The figure below shows the OCR 
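system on several typical examples from different scenarios. Green boxes mark all detected text regions, red boxes mark the regions judged to be subtitles, and the text above each red box is the OCR result. The system identifies the subtitle regions correctly and recognizes the subtitle text accurately, and our tests show that it also determines subtitle start and end times accurately."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6a\/6a5ec439169b2995d905e38f5a073c07.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the Podcast data, we used one of the best commercial speech recognition systems in China to segment the audio and to generate, for each segment, the corresponding text as candidate parallel data."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Data Validation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The candidate parallel data produced by OCR subtitle recognition and ASR transcription inevitably contains errors: mistakes in the human-made subtitles themselves, inaccurate subtitle timing, OCR errors, transcription errors, and so on. To detect them, WenetSpeech proposes an end-to-end automatic label error detection algorithm, shown in the figure below. The algorithm first builds a forced-alignment graph from the text (ref) of a candidate pair; the graph allows deletions, insertions, and substitutions at any position."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The audio of the candidate pair is then decoded against this graph to obtain a recognition result (hyp). Finally, the edit distance between ref and hyp is computed and normalized to give the confidence of the candidate pair. When the candidate audio and text agree closely, ref and hyp 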
also agree closely and the confidence is high; conversely, when the candidate audio and text agree poorly, the confidence is low."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3b\/3bb96ebae567e2990e9e9d5bf04b4fd6.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"WenetSpeech takes data with a confidence of at least 0.95 (95%) as high-quality labeled data and data with a confidence between 0.6 and 0.95 as weakly supervised data. For details of the algorithm, see our paper."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Leaderboard"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition to the Dev set used for validation during training, we designed two manually annotated test sets, the internet test set Test_Net and the meeting test set Test_Meeting, which serve as “matched” and “mismatched” tests. We also provide baseline systems built on three mainstream speech recognition toolkits (Kaldi, ESPNet, and WeNet) so that the results are easy to reproduce. Trained on the 10,000+ hours of high-quality labeled data, the three systems currently achieve the error rates shown in the table below (results are MER%, counting character errors for Chinese and word errors for English)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6c\/6c8d6a33fb6569976e24fd9f1548677e.webp","alt":"Image","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"WenetSpeech 2.0"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although WenetSpeech 
raises the scale of open-source Mandarin speech recognition training data to a new level, we want to extend and improve it further:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"By domain: the current dataset still has insufficient coverage of accents, Chinese-English code-switching, meetings, far-field speech, education, telephony, voice assistants, and similar scenarios."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"By data volume: the current total of 20,000+ hours is still far from enough for unsupervised learning."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For this reason, WenetSpeech was designed from the start with future extension in mind. Work on WenetSpeech 2.0 is already under way, and for 2.0 we hope more industry organizations and developers will take part, so that together we can build a larger and more general dataset better and faster, and in turn benefit the whole industry."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"About the Author"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Binbin Zhang (張彬彬) leads the WeNet project at Mobvoi. Zhang joined Mobvoi in 2018 to work on the development and deployment of end-to-end speech recognition systems, including the open-source WeNet effort, in-vehicle products, and B2B projects. Zhang received a master's degree in 2017 from the Audio, Speech and Language Processing Group at Northwestern Polytechnical University and has worked at Microsoft, Baidu, and Horizon Robotics."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
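The data validation step described in the article (edit distance between ref and hyp, normalized into a confidence, then thresholded at 0.95 and 0.6) can be illustrated with a minimal sketch. It is not the authors' implementation: the forced-alignment decoding that produces hyp is assumed to have already run, the normalization by the longer of the two strings is an assumption (the article only says the distance is normalized), and the function names and the "other" bucket for data below 0.6 are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def confidence(ref: str, hyp: str) -> float:
    """Confidence in [0, 1]; 1.0 means ref and hyp are identical.

    Assumption: normalize by the longer string so the value never goes negative.
    """
    if not ref and not hyp:
        return 1.0
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), len(hyp))


def bucket(conf: float) -> str:
    """Thresholds from the article: >= 0.95 high-quality, 0.6 to 0.95 weak."""
    if conf >= 0.95:
        return "strong"
    if conf >= 0.6:
        return "weak"
    return "other"  # hypothetical name for everything below 0.6


if __name__ == "__main__":
    ref = "今天天氣很好我們去公園"  # candidate text from subtitles or ASR
    hyp = "今天天氣很好我們去花園"  # assumed output of decoding against the graph
    c = confidence(ref, hyp)
    print(round(c, 3), bucket(c))
```

Working at the character level matches how Mandarin errors are counted in the article's MER metric; mixed Chinese-English text would need a tokenizer that treats English words as single units.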