SoundStream Explained: An End-to-End Neural Audio Codec

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Audio codecs are used to compress audio efficiently, reducing storage or network bandwidth requirements. Ideally, an audio codec should be transparent to the end user: the decoded audio should be perceptually indistinguishable from the original, and the encoding\/decoding process should not introduce perceivable latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Over the past few years, the industry has successfully developed a variety of audio codecs to meet these requirements, including "},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Opus_(audio_format)","title":"","type":null},"content":[{"type":"text","text":"Opus"}]},{"type":"text","text":" and Enhanced Voice Services ("},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Enhanced_Voice_Services","title":"","type":null},"content":[{"type":"text","text":"EVS"}]},{"type":"text","text":")."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Opus is a versatile speech and audio codec that supports bitrates from 6 kbps (kilobits per second) to 510 kbps. It has been widely deployed in applications ranging from video conferencing platforms such as Google Meet to streaming services such as YouTube."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EVS is the latest-generation codec developed by the "},{"type":"link","attrs":{"href":"https:\/\/www.3gpp.org\/about-3gpp","title":"","type":null},"content":[{"type":"text","text":"3GPP standardization body"}]},{"type":"text","text":" for mobile "},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Telephony","title":"","type":null},"content":[{"type":"text","text":"telephony"}]},{"type":"text","text":". Like Opus, it supports multiple bitrates (5.9 kbps to 128 kbps). The quality of audio reconstructed with either codec is excellent at medium-to-low bitrates (12–20 kbps), but it degrades sharply at very low bitrates (⪅3 kbps)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These codecs draw on expert knowledge of human perception and carefully engineered signal processing pipelines to maximize the efficiency of their compression algorithms. Recently, however, interest has shifted toward replacing these handcrafted pipelines with machine learning approaches that learn to encode audio in a data-driven manner."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Earlier this year, we released "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/9EwKIaS9C1J2FZ1Cg2eS","title":"","type":null},"content":[{"type":"text","text":"Lyra"}]},{"type":"text","text":", a neural audio codec for low-bitrate speech. In the “SoundStream: An End-to-End Neural Audio Codec” "},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/2107.03312","title":"","type":null},"content":[{"type":"text","text":"paper"}]},{"type":"text","text":", we introduce a novel neural audio codec."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This codec extends that earlier work, providing higher-quality audio and encoding a wider range of sound types, including clean speech, noisy and reverberant speech, music, and environmental sounds."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SoundStream is the first neural network codec that handles both speech and music while running in real time on a smartphone CPU. It delivers state-of-the-art quality over a broad range of bitrates with a single trained model, which marks a significant advance in learnable codecs."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Learning an Audio Codec from Data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The main technical component of SoundStream is a neural network consisting of an encoder, a decoder, and a quantizer, all trained end to end. The encoder converts the input audio stream into a coded signal, the quantizer compresses that signal, and the decoder converts it back into audio."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SoundStream leverages state-of-the-art solutions from neural audio synthesis to deliver audio of high perceptual quality: it trains a discriminator that computes a combination of adversarial and reconstruction loss functions, which push the reconstructed audio to sound like the uncompressed original. Once trained, the encoder and decoder can run on separate clients to transmit high-quality audio efficiently over a network."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/c4\/28\/c4619b7991ea26yy4b5517f550695728.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SoundStream training and inference. During training, the encoder, quantizer, and decoder parameters are optimized using a combination of reconstruction and adversarial losses computed by a discriminator, which is itself trained to distinguish the original input audio from the reconstructed audio."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During inference, the encoder and quantizer on the transmitter client send the compressed bitstream to the receiver client, which then decodes the audio signal."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Learning a Scalable Codec with Residual Vector Quantization"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The vectors produced by SoundStream's encoder can take an unbounded number of values. To transmit them to the receiver using a limited number of bits, they must be replaced by close vectors from a finite set (called a codebook), a process known as "},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Vector_quantization","title":"","type":null},"content":[{"type":"text","text":"vector quantization"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This approach works well at bitrates of around 1 kbps or lower, but quickly reaches its limits at higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook of more than 1 billion vectors (2^30, since each vector gets 30 bits), which is infeasible in practice."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In SoundStream, we address this problem with a new residual vector quantizer (RVQ). It consists of several layers (up to 80 in our experiments). The first layer quantizes the code vectors at moderate resolution, and each subsequent layer processes the residual error left by the previous one."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Splitting the quantization process into several layers drastically reduces the codebook size. For example, with 100 vectors per second at 3 kbps and 5 quantizer layers, the codebook size drops from 1 billion to 320 (5 codebooks of 2^6 = 64 vectors each). Moreover, the bitrate can easily be increased or decreased by adding or removing quantizer layers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because network conditions can vary while audio is being transmitted, a codec should ideally be “scalable” so that it can change its bitrate according to the state of the network. While most traditional codecs are scalable, previous learnable codecs had to be trained and deployed separately for each target bitrate."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To get around this limitation, we exploit the fact that the number of quantization layers in SoundStream controls the bitrate, and propose a new method called “quantizer dropout”."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During training, we randomly drop some quantization layers to simulate varying bitrates. This teaches the decoder to perform well on an incoming audio stream at any bitrate, and so makes SoundStream “scalable”: a single trained model can operate at any bitrate and perform as well as models trained specifically for those bitrates."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/31\/c9\/31e743144acd9d60130513e5f2caa9c9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Comparison of SoundStream models (higher is better): trained at 18 kbps with quantizer dropout (Bitrate scalable); trained without quantizer dropout (No bitrate scalable) and evaluated with a variable number of quantizers; or trained and evaluated at a fixed bitrate (Bitrate specific). Thanks to quantizer dropout, the bitrate-scalable model (one model for all bitrates) loses no quality compared with the bitrate-specific models (a separate model for each bitrate)."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A State-of-the-Art Audio Codec"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At 3 kbps, SoundStream outperforms Opus at 12 kbps and approaches the quality of EVS at 9.6 kbps, while using 3.2x to 4x fewer bits. This means that audio encoded with SoundStream can deliver similar quality at a much lower bandwidth."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Moreover, at the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network. Unlike Lyra, which is already deployed and optimized for production use, SoundStream is still at an experimental stage. In the future, Lyra will incorporate SoundStream components to provide higher audio quality and reduced complexity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/ec\/fe\/ec25954e15611115d261c8e2571e8bfe.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Quality comparison between SoundStream at 3 kbps and state-of-the-art codecs. The MUSHRA "},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/MUSHRA","title":"","type":null},"content":[{"type":"text","text":"score"}]},{"type":"text","text":" is an indicator of subjective quality (higher is better)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These audio "},{"type":"link","attrs":{"href":"https:\/\/google-research.github.io\/seanet\/soundstream\/examples","title":"","type":null},"content":[{"type":"text","text":"examples"}]},{"type":"text","text":" demonstrate how SoundStream compares with Opus, EVS, and the original Lyra codec."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Joint Audio Compression and Enhancement"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In traditional audio processing pipelines, compression and enhancement (removal of background noise) are typically performed by different modules. For example, an audio enhancement algorithm can be applied at the transmitter side (before the audio is compressed) or at the receiver side (after the audio is decoded). In such a setup, each processing step adds to the end-to-end latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In contrast, SoundStream is designed so that compression and enhancement can be carried out jointly by the same model without increasing the overall latency. In the following example, we show that compression can be combined with background noise suppression by activating and deactivating denoising dynamically (no denoising for 5 seconds, denoising for 5 seconds, no denoising for 5 seconds, and so on). (See the original post for the example.)"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Conclusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Efficient compression is needed wherever audio has to be transmitted, whether when streaming a video or during a conference call. SoundStream is an important step toward improving machine-learning-driven audio codecs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It outperforms previous state-of-the-art codecs such as Opus and EVS, can enhance audio on demand, and requires deploying only a single scalable model, rather than many, to handle multiple bitrates."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SoundStream will be released as part of the next, improved version of Lyra. With SoundStream integrated into Lyra, developers can use the existing Lyra APIs and tools in their work, gaining both flexibility and better sound quality. We will also release a separate TensorFlow model for experimentation."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Acknowledgments"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The work described here was done by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. We are grateful to our colleagues at Google for all the discussions and feedback on this work."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Original post: "},{"type":"link","attrs":{"href":"https:\/\/ai.googleblog.com\/2021\/08\/soundstream-end-to-end-neural-audio.html","title":"","type":null},"content":[{"type":"text","text":"https:\/\/ai.googleblog.com\/2021\/08\/soundstream-end-to-end-neural-audio.html"}]}]}]}
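The codebook arithmetic and the layered quantizer described in the section on residual vector quantization can be illustrated with a small sketch. This is a toy under stated assumptions, not the SoundStream implementation: the codebooks here are random rather than learned, every function name (`codebook_sizes`, `make_toy_codebooks`, `rvq_encode`, `rvq_decode`) is hypothetical, and each toy codebook includes a zero vector so that adding a layer can never increase the error.

```python
import random


def codebook_sizes(bitrate_bps, vectors_per_second, num_layers):
    """Return (size of one flat codebook, total size of the layered codebooks)."""
    bits_per_vector = bitrate_bps // vectors_per_second
    flat_size = 2 ** bits_per_vector
    bits_per_layer = bits_per_vector // num_layers
    layered_size = num_layers * 2 ** bits_per_layer
    return flat_size, layered_size


def make_toy_codebooks(num_layers, size, dim, seed=0):
    """Random stand-ins for learned codebooks (an assumption of this sketch).

    Each layer uses a smaller scale than the previous one, and entry 0 is the
    zero vector so a layer can always choose to leave the residual unchanged.
    """
    rng = random.Random(seed)
    codebooks = []
    for layer in range(num_layers):
        scale = 0.5 ** layer
        entries = [[0.0] * dim]
        for _ in range(size - 1):
            entries.append([rng.uniform(-scale, scale) for _ in range(dim)])
        codebooks.append(entries)
    return codebooks


def nearest(codebook, vector):
    """Index of the entry with the smallest squared Euclidean distance."""
    def dist(i):
        return sum((c - v) ** 2 for c, v in zip(codebook[i], vector))
    return min(range(len(codebook)), key=dist)


def rvq_encode(codebooks, vector):
    """Quantize layer by layer; each layer sees the previous layer's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Sum the selected entries; passing fewer indices means a lower bitrate."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    # 3 kbps, 100 vectors per second, 5 layers: the numbers used in the article.
    print(codebook_sizes(3000, 100, 5))  # (1073741824, 320)

    codebooks = make_toy_codebooks(num_layers=5, size=64, dim=4)
    vector = [0.3, -0.7, 0.1, 0.5]
    indices = rvq_encode(codebooks, vector)
    for n in range(1, len(indices) + 1):
        approx = rvq_decode(codebooks, indices[:n])
        print(n, "layers, squared error:", squared_error(vector, approx))
```

Decoding with only the first few indices mimics what happens when quantizer layers are removed to lower the bitrate: the reconstruction becomes coarser but remains valid. In the real system the codebooks are trained jointly with the encoder and decoder, which this sketch does not attempt to reproduce.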