Kuaishou Becomes the First Company in the Industry to Launch Deep-Learning-Based Real-Time Voice Conversion for Livestreaming

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Recently, Kuaishou became the first company in the industry to bring deep-learning-based real-time voice conversion to livestreaming on the PC client. The technology stably converts any user's voice to a target timbre, and the converted speech offers high naturalness, high similarity to the target, and clear audio quality, while end-to-end system latency can be as low as 200 ms."}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The industry's first use of deep-learning-based real-time voice conversion in livestreaming"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"On livestreaming platforms such as Kuaishou and AcFun, streamers often want playful features to liven up interaction with viewers, and voice changing is one of them. The voice changers that existing platforms support are all based on digital signal processing, which manually manipulates the fundamental frequency (F0) and formants of the speech signal. Although this can alter timbre, the effect varies from person to person, the resulting timbre is hard to control, and aliasing noise can even appear, drawing frequent complaints from users."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As artificial intelligence has advanced, research on deep-learning-based voice conversion has deepened. The approach controls the target timbre well and produces fairly natural converted speech, but it demands substantial computing resources, usually has to be deployed on servers, and struggles to guarantee real-time conversion."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Livestreaming is a special scenario that places very high demands on both the naturalness of the converted speech and the real-time performance of the system, and to avoid interference from network jitter the conversion system generally needs to run on the user's client (PC, phone, and so on), so none of the solutions on the market could deliver high-quality livestream voice conversion. Livestreaming is also a large share of Kuaishou's business, and streamers have a strong demand for high-quality voice changing in their rooms. Building a system that keeps the converted timbre natural while running in real time therefore became a hard problem facing Kuaishou. In the end, Kuaishou's Audio & Video Technology department and its Multimedia Understanding (MMU) department joined forces and, through repeated experiments and analysis, further optimized existing deep-learning voice conversion, investing heavily in model-size compression, streaming processing, and multi-core parallel computation on client devices. The result is a conversion system that keeps the converted timbre natural and stable while offering high real-time performance and low complexity, meeting the requirements of livestream voice conversion."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The technology has completed algorithm development, engineering quality testing, and gray-release user testing, and is fully live in the AcFun livestreaming scenario (Windows client, machines with a 4-core i7 or better). Through the voice-changing feature in the AcFun streaming companion, streamers can select the deep-learning-based “Hanhan” (goofy) or “Ruanmei” (soft girl) voice to switch timbre. The two voices won streamers' affection and wide acclaim as soon as they launched."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reportedly, this voice-changing application launched in the AcFun livestreaming business is the industry's first livestream voice-conversion application that runs in real time on a PC client using a deep-learning framework. It is a major technical breakthrough for Kuaishou in voice interaction for livestreaming and may well set a new trend for livestream voice changing. Beyond that, Kuaishou plans to take livestream voice changing further, for example two-way conversion between Mandarin and various dialects, and even user-customized conversion timbres, to better empower livestreaming platforms with artificial intelligence."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The industry status of voice-changing technology"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Voice-changing technology has been around for a long time. The industry's common approach is based on digital signal processing, as in Huya Live, Jianying (CapCut), and voice-changing software such as MorphVOX Pro. Among the companies that have proposed deep-learning-based conversion, such as iFLYTEK and Sogou, some offer no trial interface for ordinary users, while others deploy in the cloud and do not support real-time conversion."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By implementation, voice-changing techniques fall roughly into the following three categories:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Methods based on digital signal processing"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Principle: modify two features of speech, the fundamental frequency (F0) and the formants. F0 is the vibration frequency of the vocal folds during voiced sounds, and formants are the resonant frequencies that the vocal tract imposes on the glottal wave. Female voices generally have a higher F0, and higher formant frequencies, than male voices; both features are closely tied to a speaker's vocal-tract structure and phonation characteristics. Changing the speaker timbre of the original speech therefore means manually manipulating its F0 and formants with signal-processing algorithms."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Advantages: fast to compute, with good pitch accuracy;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disadvantages: cross-gender conversion works poorly, the converted speech sounds distinctly synthetic, and the output timbre cannot be kept stable."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timbre naturalness: ★☆☆☆☆"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Real-time performance: ★★★★★"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. Methods based on generative adversarial networks"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Principle: models the mapping of acoustic features from the source speaker to the target speaker. The method has two parts, a generator and a discriminator: the generator takes the source speech's acoustic features as input and predicts the corresponding target-speech acoustic features, while the discriminator judges whether an input sample is generated or a real sample of the target speech, pushing the generator's predictions closer to real samples of the target speaker."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Advantages: high naturalness of the converted timbre, a very strong sense of realism, and strong timbre stability;"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disadvantages: a GAN's training data is finite, so it can only model timbre mappings between speakers inside the training set and does not suit scenarios where the source speaker is unknown; the method is also very computationally expensive and cannot support real-time scenarios. It remains at the academic-research stage and lacks a basis for industrial application."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timbre naturalness: ★★★★★"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Real-time performance: ★☆☆☆☆"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. Methods based on phonetic posteriorgrams"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Principle: a speech recognition system first converts the speech into phonetic posteriorgrams (PPGs) or a phoneme sequence, and a conversion model then maps those features to the target speaker's speech; most deep-learning voice changers on the market take this approach."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Advantages: supports converting speakers outside the training set to the target speaker, with high naturalness, good realism, and good timbre stability."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Disadvantages: the speech recognition system and vocoder it uses are large in both parameter count and computation, so it can only run in the cloud, cannot support real-time scenarios, and certainly cannot run on client devices."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Timbre naturalness: ★★★★☆"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Real-time performance: ★★☆☆☆"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In summary: DSP-based conversion runs fast enough for real time, but its naturalness, and hence user satisfaction, is poor; GAN-based conversion guarantees highly natural output, but cannot serve speakers outside the training set and is too computationally expensive to be real time; PPG-based conversion can turn any user into a designated timbre, but its heavy computation likewise confined it to cloud deployment, which cannot meet livestreaming's real-time requirement."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How Kuaishou achieved the technical breakthrough"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The analysis in the previous section shows that the PPG-based method can satisfy the livestream feature's needs for stable, natural converted timbre and clear audio, but could not be deployed on the client or run in real time. Starting from that method, keeping the converted timbre natural and stable while making it run in real time on the client became the key task for Kuaishou's R&D team. Ultimately, building on the PPG-based method, the team carried out targeted optimization and development of the feature-extraction model, the neural vocoder, and other key modules, introduced a low-power deep-learning denoising model, and completed a deep-learning-based real-time voice-conversion livestreaming system. The system runs on the PC client with latency as low as 200 ms and meets the requirements of the livestream voice-changing feature."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The system consists of the following four major modules:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Denoising model: based on a deep neural network, it denoises the streamer's input speech, strengthening the system's robustness to stationary environmental noise and reducing the noise's interference with the conversion system's pitch accuracy;"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":2,"normalizeStart":2},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Pronunciation-unit representation model: extracts deep bottleneck features from the original speech to describe its content, with model size and computation optimized, while preserving accuracy, to meet the requirements of running on client devices;"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":3,"normalizeStart":3},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Conversion model: maps the speaker-independent deep bottleneck features to a specific speaker's acoustic features; for the real-time, on-device application scenario it adopts a non-autoregressive conversion model built on an Encoder-Decoder framework, and a single model can output multiple timbres;"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"numberedlist","attrs":{"start":4,"normalizeStart":4},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Vocoder: a high-performance deep-learning vocoder driven by high-dimensional input features, achieving high-quality, high-sampling-rate, low-complexity conversion from speech features to the speech signal;"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, tailored to the characteristics of client devices, Kuaishou developed a multi-core parallel computing architecture for the deep-learning conversion system and a low-latency anti-jitter buffer module, further accelerating on-device computation and improving system stability."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/f5\/f50f91c8d255619853a17abb932f0ddb.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Within the project, the Audio & Video Technology department developed the high-quality, high-sampling-rate, low-complexity deep-learning vocoder, the deep-learning denoising module, the multi-core parallel conversion architecture, and the low-latency anti-jitter buffer module, while MMU developed the deep-learning pronunciation-unit representation module, the Encoder-Decoder-based conversion module, and the multi-speaker-data conversion pre-training platform."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Going forward, the R&D team will keep iterating on further improving audio quality, reducing complexity, and personalized timbre customization, aiming to land the technology in more products, device models, and scenarios."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"References:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Conference papers"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Ying Zhang, Hao Che, Chenxing Li, Xiaorui Wang, “One-shot Voice Conversion Based on Speaker Aware Module,” in ICASSP 2021, June 6-11, 2021, Toronto, Canada."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Ying Zhang, Hao Che, Xiaorui Wang, “Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers,” in ISCSLP 2021, January 24-26, 2021, Hong Kong, China."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Patents"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Livestream voice conversion, 2021KI0494CN"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Voice conversion based on a single utterance from an arbitrary speaker, 2020KI1910CN"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] Speech data processing method and apparatus, 2020KI1304CN"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] A network design and data augmentation method for denoising and dereverberation, 2020KI1326CN"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[5] A deep-learning denoising state-control method for recurrent neural networks, 2020KI0921CN"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[6] A deep-learning audio denoising method based on SNR and audio phase, 2020KI0029CN"}]}]}
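The DSP-based approach the article criticizes (manually manipulating F0 and formants) can be illustrated with a minimal sketch. The example below is purely hypothetical and is not any platform's implementation: it shifts pitch by naive resampling, which scales the formants along with F0 (and changes duration), which is precisely why pure DSP voice changers struggle with convincing cross-gender conversion.

```python
import numpy as np

def pitch_shift_resample(signal, semitones):
    """Toy DSP pitch shift: resample the waveform so that, played back at
    the original rate, every frequency (F0 and formants alike) is scaled
    by 2**(semitones/12). Duration changes too; real DSP voice changers
    use PSOLA or a phase vocoder to decouple pitch from duration."""
    factor = 2.0 ** (semitones / 12.0)
    n_out = int(len(signal) / factor)
    idx = np.arange(n_out) * factor            # fractional read positions
    return np.interp(idx, np.arange(len(signal)), signal)

def dominant_freq(signal, sr):
    """Frequency (Hz) of the strongest spectral peak."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    return np.fft.rfftfreq(len(signal), d=1.0 / sr)[np.argmax(spec)]

sr = 16000
t = np.arange(sr) / sr                         # 1 second of audio
tone = np.sin(2 * np.pi * 220.0 * t)           # stand-in for a voiced sound
up = pitch_shift_resample(tone, 12)            # +12 semitones = one octave
# the 220 Hz peak moves to 440 Hz after the shift
```

Because the entire spectrum is scaled uniformly, a male voice shifted upward acquires unnaturally high formants, the "chipmunk" artifact behind the aliasing and timbre complaints the article describes.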
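The low-latency anti-jitter buffer listed among the system's supporting modules can be sketched as a minimal sequence-number reordering buffer. This is an illustrative toy under assumed semantics, not Kuaishou's module: frames are held until a small fixed depth is reached, trading a few frames of extra delay for tolerance of out-of-order and late arrivals.

```python
class JitterBuffer:
    """Toy jitter buffer: reorders frames by sequence number and starts
    releasing them only once `depth` frames are queued, so a frame that
    arrives slightly late or out of order can still be played in order."""

    def __init__(self, depth=3):
        self.depth = depth        # frames of delay traded for robustness
        self.frames = {}          # seq -> audio frame payload
        self.next_seq = 0         # next sequence number to release
        self.primed = False       # becomes True once pre-buffering is done

    def push(self, seq, frame):
        if seq >= self.next_seq:  # frames older than the playhead are dropped
            self.frames[seq] = frame

    def pop(self):
        """Return the next in-order frame, or None while pre-buffering or
        when the frame is missing (a real system would conceal the gap,
        e.g. by repeating the previous frame)."""
        if not self.primed:
            if len(self.frames) < self.depth:
                return None       # still filling the initial cushion
            self.primed = True
        if not self.frames:
            return None           # underrun: nothing buffered at all
        frame = self.frames.pop(self.next_seq, None)  # None = frame lost
        self.next_seq += 1
        return frame
```

A production buffer (for example WebRTC's NetEq) additionally adapts the depth to measured network jitter and time-stretches audio rather than inserting silence; the fixed depth here only illustrates the latency-versus-robustness trade the article refers to.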