New from Facebook: HuBERT, Self-Supervised Representation Learning for Speech Recognition, Generation, and Compression

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"關於本研究"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"許多人工智能研究項目的北極星一直在不斷學習,通過簡單的聆聽和與他人互動來更好地識別和理解語言,就像嬰兒學習他們的第一語言一樣。這樣做不但要分析別人所說的話,而且要從它們的表達上,比如說話者的身份、情緒、優柔寡斷等,都有很多其他的線索。而且,要像人類一樣全面理解一個場景,人工智能系統就必須能夠區分和解釋與語音信號重疊的噪聲,如笑聲、咳嗽聲、咂嘴聲、背景車輛或鳥鳴。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了在音頻中對這些類型的豐富詞彙和非詞彙信息建模打開大門,我們推出了 HuBERT,這是一種學習自監督語音表徵的新方法。HuBERT 與 SOTA 方法在語音識別、語音生成、語音壓縮的語音表徵學習方面相匹配,甚至超過了 SOTA。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了做到這一點,我們的模型採用了一種離線 k- 均值聚類方法,通過預測掩蔽的音頻片段的正確聚類,學習了口語輸入的結構。HuBERT 通過在聚類和預測步驟之間交替進行,逐步提高其學習的離散表徵。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HuBERT 的簡單性和穩定性將有助於自然語言處理和演講研究人員,在其工作中更廣泛地採用學到的離散表徵。另外,HuBERT 的學習報告質量可以幫助"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/2105.01051?fileGuid=k3fem0q4h6wKJaVt","title":"","type":null},"content":[{"type":"text","text":"輕鬆地部署到多種下游語音應用程序"}]},{"type":"text","text":"中。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"工作原理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HuBERT 的靈感來自於 Facebook AI 的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/deepcluster?fileGuid=k3fem0q4h6wKJaVt","title":"","type":null},"content":[{"type":"text","text":"DeepCluster"}]},{"type":"text","text":"方法,它是一種自監督的視覺表徵方法。谷歌的"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/1810.04805?fileGuid=k3fem0q4h6wKJaVt","title":"","type":null},"content":[{"type":"text","text":"Bidirectional Encoder Representations from Transformers"}]},{"type":"text","text":"(BERT,即雙向 Transformer 的 Encoder)方法等序列掩蔽預測損失的方法被用來表示語音的順序結構。HuBERT 採用離線聚類的方法,爲掩蔽語言模型的預訓練產生噪聲標籤。具體地說,HuBERT 使用掩蔽的連續語音特徵來預測預定的聚類分配。預測損失只應用於掩蔽區域,強迫模型學習未掩蔽的輸入的良好的高層表徵,以便正確地推斷掩蔽目標。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HuBERT 可以從連續輸入中學習聲學和語言模型。首先,該模型需要將未掩蔽的音頻輸入編碼爲有意義的連續潛在表徵,這就相當於經典的聲學建模問題。其次,爲了減少預測誤差,該模型需要捕捉所學表徵之間的長程時間關係(long-range temporal relations)。激勵這項工作的一個關鍵見解是,從音頻輸入到離散目標的 k- 均值映射的一致性的重要性,而不僅僅是它們的正確性,這使得模型能夠專注於對輸入數據的順序結構進行建模。舉例來說,如果早期的聚類迭代不能區分 \/k\/ 和 \/g\/ 的聲音,那麼就會產生一種包含這兩種聲音的超聚類,預測損失將學習模型其他輔音和元音如何與這個超簇一起構成單詞。因此,接下來的聚類迭代會使用新學習的表徵來創建更好的聚類。實驗表明,通過交替進行聚類和預測步驟,可使表徵得到逐步改善。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/e3\/33\/e3b472950d95f69acdcfcd2c4e086c33.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HuBERT 在標準的 LibriSpeech 960 小時或 Libri-Light 60000 
When pre-trained on the standard 960 hours of LibriSpeech or the 60,000 hours of Libri-Light, HuBERT matches or surpasses state-of-the-art wav2vec 2.0 performance on every fine-tuning subset: 10 minutes, 1 hour, 10 hours, 100 hours, and 960 hours.

[Image: https://static001.geekbang.org/resource/image/6c/3d/6c480a73252fd5402ec5ac6622b8733d.jpg]

The figure shows results for HuBERT pre-trained at the LARGE (300M) and X-LARGE (1B) model scales. Pre-trained on 60,000 hours of Libri-Light, the X-LARGE model shows relative WER improvements of 19% and 13% on the dev-other and test-other evaluation subsets.

The remarkable success of speech representation learning makes it possible to model language directly from the speech signal, without relying on any lexical resources: no supervised labels, text corpora, or lexicons. This in turn opens the door to modeling non-lexical information, such as dramatic pauses or urgent interruptions, as well as background noise.

With Generative Spoken Language Modeling (GSLM, https://arxiv.org/abs/2102.01192), we took a first step in this direction, using speech representations learned by CPC, wav2vec 2.0, and HuBERT to synthesize speech. Unit language models trained on the discrete latent representations can generate speech both conditionally and unconditionally. In both automatic and human evaluations, samples generated with HuBERT are competitive in quality with those from a supervised character-based topline LM and generation pipeline. You can listen to conditional and unconditional samples generated by all of the systems here: https://speechbot.github.io/.

[Image: https://static001.geekbang.org/resource/image/c8/cc/c8bc4107e98a5b34242e1fa59f8811cc.jpg]

The chart above shows HuBERT's language generation performance.

For speech compression, our recent paper "Speech Resynthesis from Discrete Disentangled Self-Supervised Representations" (https://arxiv.org/pdf/2104.00355.pdf) achieves a bitrate of 365 bps with HuBERT units without sacrificing quality.
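For a rough sense of where a bitrate of this order comes from, the arithmetic below multiplies an assumed discrete-unit rate by the bits needed to index an assumed codebook. The unit rate, codebook size, and prosody budget are illustrative assumptions, not numbers taken from the paper.

```python
# Back-of-the-envelope bitrate of a discrete-unit speech codec.
# Every value below is an illustrative assumption, not a figure from the paper.
import math

unit_rate_hz = 50        # assumed: one discrete unit per 20 ms acoustic frame
codebook_size = 100      # assumed: size of the k-means unit vocabulary
content_bps = unit_rate_hz * math.log2(codebook_size)   # ~332 bits per second

prosody_bps = 30         # assumed: small side-stream for quantized pitch

print(f"content ~ {content_bps:.0f} bps, total ~ {content_bps + prosody_bps:.0f} bps")
# A stream of coarse discrete units lands in the same few-hundred-bps range as
# the 365 bps reported above, versus 256,000 bps for the uncompressed audio.
```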
You can listen to samples of HuBERT-compressed audio here: https://resynthesis-ssl.github.io/.

[Image: https://static001.geekbang.org/resource/image/e0/ce/e0034541b4a5232096fc810f61ef6fce.jpg]

In a Multi-Stimulus Test with Hidden Reference and Anchor (MUSHRA), HuBERT ranked second only to the uncompressed audio (256 kbps).

Why it matters

HuBERT can help the AI research community develop NLP systems that are trained entirely on audio rather than on text samples. That would let existing NLP applications be enriched with the full expressiveness of spontaneous spoken language, so that AI voice assistants can speak with the same nuance and affect as a real person. Learning speech representations without relying on large amounts of labeled data is also critical for industrial applications and products, which keep expanding into new languages and domains. And it will help the AI community build more inclusive applications that cover dialects and languages that exist only in spoken form.

About the authors:

Abdelrahman Mohamed and Wei-Ning Hsu are research scientists at Facebook. Kushal Lakhotia is a software engineer at Facebook.

Original article:

https://ai.facebook.com/blog/HuBERT-self-supervised-representation-learning-for-speech-recognition-generation-and-compression