(深入篇)漫遊語音識別技術—帶你走進語音識別技術的世界

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 前有古人,後有小王,大家好,我是你們愛思考的小王學長,今天咱們繼續漫遊語音識別技術哈,今天內容稍微專業一些,大家可以結合","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/626e64a43102e1309b02500a7","title":"","type":null},"content":[{"type":"text","text":"上一篇","attrs":{}}]},{"type":"text","text":"漫遊語音識別技術一起學習。","attrs":{}}]},{"type":"horizontalrule","attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 上篇我們簡單瞭解了語音識別技術的概念、前世今生以及基本識別原理,一會學長帶着大家漫遊到語音識別技術更深(更專業)的世界裏。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"文章目錄:(大家先預覽下)","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"一、語音識別基礎\n二、信號處理過程\n 1、降噪處理 \n ①小波變換降噪法\n ②譜減法\n ③自適應噪聲抵消法\n ④聲音濾波器\n 2、預加重\n 3、分幀加窗\n 4、端點檢測\n三、特徵提取\n四、語音識別方法\n 1、聲學模型\n 2、語言模型\n 3、解碼器\n 4、基於端到端的學習方法\n五、深度學習-CNN實戰舉例\n六、聲網 Agora 一站式智能語音識別方案\n七、語音識別開發平臺\n 深度學習平臺\n 語音識別開發平臺\n八、語音識別相關開源學習資料\n 開源數據集\n 開源語音識別項目\n作者介紹","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"一、語音識別基礎 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 說到語音識別,我們應該先思考一下聲音是什麼呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 通常我們可以認爲聲音是在空氣中傳播的波,但是它不像水波那樣傳播波的高低變化,它傳播的是空氣的密度變化。比如,我們拍手時手掌的振動將空氣擠出,對比周圍的大氣壓,空氣被擠入的地方壓力增高,而空氣被擠出的地方則相對壓力降低;壓力高的部分向手掌四周移動,而壓力低的部分則緊隨其後。這種由手掌振動所引發空氣密度發生週期性變化的波稱爲壓縮波,空氣中的壓縮波一旦碰到鼓膜那樣的薄膜,就會使其產生振動。麥克風的作用就是將這種振動以電信號的形式提取出來。下面的圖大家可以參考一下","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0c/0c5afaa9c20f03265270e56873b14bb7.png","alt":null,"title":"幾種波形或疊加形式(點擊圖片可見來源)","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"https://blog.csdn.net/qq_36767053/article/details/107081913?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%A3%B0%E9%9F%B3%E6%98%AF%E4%BB%80%E4%B9%88&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-3-107081913.pc_search_result_no_baidu_js","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 以振動的幅度爲縱軸,以時間爲橫軸,就能夠將聲音可視化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 換句話說,聲音以波的形式傳播,即聲波。當我們以波的視角來理解聲音時,幅度(Magnitude)、頻率(Frequency)、相位(Phase)便構成了聲波及其所有的疊加聲波,聲音的不同音高(Pitch)、音量(Loudness)、音色(Timbre) 也由這些基本單位組合而來。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 世界上各種各樣的聲波都可以“降解”到基本波身上,傅里葉變換(Fourier Transform)的基本思想也是這樣的。不同的聲波有不同的頻率和幅度(決定音量),人耳也有自己的接受範圍。人耳對頻率的接受範圍大致爲20 Hz至20 kHz,於是以人爲本地將更高頻率的聲波定義爲超聲波(Ultrasound Wave)、更低頻率的聲波定義爲次聲波(Infrasound Wave), 雖然其它動物可以聽到不同範圍的聲音。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/626e64a43102e1309b02500a7","title":"","type":null},"content":[{"type":"text","text":"上一篇","attrs":{}}]},{"type":"text","text":"大家應該對ASR有了個初步的瞭解,語音識別說白了最終是統計優化問題,給定輸入序列O={O1,...,On},尋找最可能的詞序列W={W1,...,Wm},其實就是尋找使得概率P(W|O)最大的詞序列。用貝葉斯公式表示爲:","attrs":{}}]},{"type":"katexblock","attrs":{"mathString":"P(W|O)=\\frac{P(O|W)P(W)}{P(O)}"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 其中P(O|W)叫做聲學模型,描述的是給定詞W時聲學觀察爲O的概率;P(W)叫做語言模型,負責計算某個詞序列的概率;P(O)是觀察序列的概率,是固定的,所以只看分母部分即可。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 語音選擇的基本單位是幀(Frame),一幀數據是由一小段語音經過ASR前端的聲學特徵提取模塊產生的,整段語音就可以整理爲以幀爲單位的向量組。每幀的維度固定不變,但跨度可調,以適應不同的文本單位,比如音素、字、詞、句子。","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 大多數語音識別的研究都是分別求取聲學和語言模型,並把很多精力放在聲學模型的改進上。但後來,基於深度學習和大數據的端到端(End-to-End)方法發展起來,能將聲學和語言模型融爲一體,直接計算P(W|O)。","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"二、信號處理過程","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、降噪處理 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 在降噪之前,我先跟大家講講爲什麼要進行降噪處理?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 我們在錄製音頻數據的同時,大量噪聲都會摻雜進來,不同環境和情境下產生的噪聲也不盡相同,噪聲信號中的無規則波紋信息影響了聲學信號所固有的聲學特性,使得待分析的聲音信號質量下降,並且噪聲對聲音識別系統的識別結果會產生重要影響。所以說,我們在對聲音信號分析和處理之前,是一定要進行降噪處理的。(語音的具體噪聲分類:看學長","attrs":{}},{"type":"link","attrs":{"href":"https://xie.infoq.cn/article/50d0d252b1d44bb26725f1146","title":"","type":null},"content":[{"type":"text","text":"這篇文章","attrs":{}}]},{"type":"text","text":")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面我們來看幾個降噪的常用方法:","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"①小波變換降噪法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 小波變換降噪法簡稱小波降噪,一般在聲音降噪中使用最多的是小波閾值降噪法,它主要是說在帶噪聲音信號中,有效聲音信號與噪聲在不同頻率上有着不同的小波係數,其中有效信號能量譜表現會比較集中,在能量譜集中的區域小波係數的絕對值會比較大;而噪聲的能量譜比較分散,所以其係數的絕對值比較小。接下來,根據此特點,利用小波變換法將帶噪聲音信號分解到不同頻率上,然後設置閾值進行差分調整,保留有效聲音信號的小波係數,最後根據小波重構算法還原帶噪信號中的有效信號,從而可以達到降噪的效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 這是其基本原理,其中閾值的設定也可以分爲硬閾值和軟閾值法。具體涉及的相關公式和計算方法大家感興趣的可以百度或者跟我留言。以下是利用小波降噪法得到的前後對比圖(在MATLAB環境下得到):","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a0/a02a773c3200e5db2a20600d844bf199.png","alt":null,"title":"含噪聲信號的波形","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30384ae8558a2384ac9ef650b7bbeacf.png","alt":null,"title":"小波降噪後的波形","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"boxShadow"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"②譜減法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 譜減法又稱頻譜減法降噪,是根據噪聲的可加性、局部平穩性以及噪聲和有效聲音信號不相關性的一種聲音信號降噪方法。這種降噪方法不會涉及到參考信號,其","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":"主要思想","attrs":{}},{"type":"text","text":"就是帶噪聲音信號是有效信號與噪聲的疊加,那麼帶噪信號的功率也是相當於有效聲音信號的功率和噪聲功率的疊加,利用計算得到“","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"靜音”片段(信號中不含有有效信號,只含有系統噪聲或者環境噪聲)","attrs":{}},{"type":"text","text":"中噪聲的頻譜估計值來等值替換有效聲音信號存在期間所含噪聲的頻譜,最後帶噪聲音信號的頻譜與噪聲的頻譜估計值相減,就可以得到有效聲音信號頻譜的估計值。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"③自適應噪聲抵消法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 自適應噪聲抵消法的核心組成部分是","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"自適應算法","attrs":{}},{"type":"text","text":"和","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"自適應濾波器","attrs":{}},{"type":"text","text":"。自適應算法可以自動調節輸入濾波器的加權係數使濾波器達到最優濾波效果,所以自適應噪聲抵消法的關鍵是在於找到某種算法,可以實現自動調節加權係數。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 自適應噪聲抵消法的主要思想是:除了帶噪聲音信號x(t)=s(t)+n(t),假設還可以得到另外一個參考信號 r(t),而這個參考信號與噪聲 n(t) 相關,但是與有效聲音信號s(t)不相關,那麼就可以根據Widrow算法(一種近似最速下降的神經網絡算法)抵消帶噪聲信號中的噪聲,從而達到降噪的效果。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"④聲音濾波器","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 數字濾波器作爲數字信號處理中的重要組成部分,可以通過數值之間的運算來實現濾波的效果,去除噪聲成分。數字濾波器有很多種類,根據衝激響應函數的時域特性數字濾波器可分爲兩種,即無限衝激響應(Infinite Impulse Response,IIR)濾波器和有限衝激響應(Finite Impulse Response,FIR)濾波器。這兩種濾波器可分別實現低通、高通、帶通和帶阻 4 種功能。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、預加重","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 預加重是一種在發送端對輸入信號高頻分量進行補償的信號處理方式。隨着信號速率的增加,信號在傳輸過程中受損很大,爲了在接收終端能得到比較好的信號波形,就需要對受損的信號進行補償,預加重技術的思想就是在傳輸線的始端增強信號的高頻成分,以補償高頻分量在傳輸過程中的過大衰減。而預加重對噪聲並沒有影響,因此有效地提高了輸出信噪比。(百科官方解釋)","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":" 預加重原理","attrs":{}},{"type":"text","text":":語音信號高頻段能量大,低頻段能量小。而鑑頻器輸出噪聲的功率譜密度隨頻率的平方而增加(低頻噪聲大,高頻噪聲小),造成信號的低頻信噪比很大,而高頻信噪比明顯不足,從而導致高頻傳輸衰弱,使高頻傳輸困難。因此,在傳輸之前把信號的高頻部分進行加重,然後接收端再去重,提高信號傳輸質量。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、分幀加窗","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" “分幀”是把一段聲音信號分成一些時間長度相等的音頻信號。它可以在預加重後的聲音信號上平滑地移動規定長度的窗函數得到,窗函數的窗口大小由聲音信號的採樣頻率來確定。採用可以隨着時間移動的窗函數對豬聲音信號進行“交疊分幀”,可以防止在分幀時出現遺漏有效的聲音信號,也可以保證每段聲音信號在滑動時保持平穩性和連續性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 常用的幾種窗函數:冪窗、矩形窗、三角窗、漢寧窗、漢明窗、高斯窗(不懂的自行百度下或評論問我哈)","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/51/512795c69ee740ec1129a0347d877270.jpeg","alt":null,"title":"漢明窗舉例","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、端點檢測","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 端點檢測是指確定一段聲音信號中有效信號的起始點和終止點。採集到的聲音信號中含有無效的聲音片段,進行端點檢測確定出豬聲音信號的起始點與終止點,可以排除大量的干擾信號,剪除靜音片段,爲後續的特徵參數提取減小了運算量,縮短了提取時間。","attrs":{}}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/43/43fde80761073d7a6cceeb279054860c.png","alt":null,"title":"matlab端點檢測對比","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"常用方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":" 短時過零率","attrs":{}},{"type":"text","text":"是指每幀聲音信號通過零點的次數,其算法就是計算每幀聲音信號幅值符號改變的總次數,如果相鄰採樣點的幅值符號是相同的,則沒有發生過零點的情況,相反,如果相鄰採樣點幅值的符號發生了改變,那麼表示聲音信號發生了過零的情況。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":"短時能量","attrs":{}},{"type":"text","text":"一定程度上反應了聲音信號的幅度變化,應用在區分聲音信號中的清音和濁音,因爲聲音信號中清音的能量比濁音的能量小很多;區分無聲片段和有聲片段,因爲無聲片段的短時能量基本等於零,而有聲片段是有能量存在的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":"雙門限端點檢測法","attrs":{}},{"type":"text","text":"是常用的端點檢測方法之一,其通過聲音信號的短時能量和短時平均過零率確定聲音信號的端點位置,","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"短時過零率","attrs":{}},{"type":"text","text":"檢測到聲音信號的起始點和終止點可能過於寬泛,這樣就降低了聲音信號處理系統的速度;而","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"短時能量檢測","attrs":{}},{"type":"text","text":"到聲音信號的起始點和終止點可能包含噪聲信號,這樣會導致提取的聲音信號不太準確。所以將二者","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"結合","attrs":{}},{"type":"text","text":"起來來檢測豬聲音信號的起始點和終止點,即雙門限檢測法提取聲音信號的端點。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"三、特徵提取","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 接下來帶大家詳細學習下MFCC特徵提取知識:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 先說下MFCC,人的耳朵在接收信號的時候,不同的頻率會引起耳蝸不同部位的震動。耳蝸就像一個頻譜儀,自動在做特徵提取並進行語音信號的處理。在語音識別領域中MFCC(Mel Frequency Cepstral Coefficents)特徵提取是最常用的方法,具體來說,MFCC特徵提取的步驟如下:","attrs":{}}]},{"type":"blockquote","content":[{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對語音信號進行分幀處理","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用週期圖(periodogram)法來進行功率譜(power spectrum)估計","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對功率譜用Mel濾波器組進行濾波,計算每個濾波器裏的能量","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對每個濾波器的能量取log","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"進行離散餘弦變換(DCT)變換","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"保留DCT的第2-13個係數,去掉其它","attrs":{}}]}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 其中,前面兩步是短時傅里葉變換,後面幾步主要涉及到梅爾頻譜。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3b8108b87b9bde4674abdd5386a882c9.png","alt":null,"title":"基本流程圖","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":12}},{"type":"color","attrs":{"color":"#FF827B","name":"pink"}},{"type":"strong","attrs":{}}],"text":"大家需要重要掌握的特徵提取知識點:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 過零率(zero crossing rate)","attrs":{}},{"type":"text","text":"是一個信號符號變化的比率,即在每幀中語音信號從正變爲負或從負變爲正的次數。 這個特徵已在語音識別和音樂信息檢索領域得到廣泛使用,通常對類似金屬、搖滾等高衝擊性的聲音的具有更高的價值。一般情況下,過零率越大,頻率近似越高。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 頻譜質心(Spectral Centroid)","attrs":{}},{"type":"text","text":"是描述音色屬性的重要物理參數之一,是頻率成分的重心,是在一定頻率範圍內通過能量加權平均的頻率,其單位是Hz。它是聲音信號的頻率分佈和能量分佈的重要信息。在主觀感知領域,譜質心描述了聲音的明亮度,具有陰暗、低沉品質的聲音傾向有較多低頻內容,譜質心相對較低,具有明亮、歡快品質的多數集中在高頻,譜質心相對較高。該參數常用於對樂器聲色的分析研究。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"  聲譜衰減(Spectral Roll-off)","attrs":{}},{"type":"text","text":"是對聲音信號形狀(波形圖)的一種衡量,表示低於總頻譜能量的指定百分比的頻率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"     ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"梅爾頻率倒譜系數(Mel-frequency cepstral coefficients,MFCC)","attrs":{}},{"type":"text","text":"是在Mel標度頻率域提取出來的倒譜參數,Mel標度描述了人耳頻率的非線性特性。其中梅爾尺度(Mel Scale)是建立從人類的聽覺感知的頻率;比如若把音調頻率從1000Hz提高到2000Hz,我們的耳朵只能覺察到頻率似乎提高了一些而不是一倍。但是通過把頻率轉換成梅爾尺度,我們的特徵就能夠更好的匹配人類的聽覺感知效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 色度頻率(Chroma Frequencies)","attrs":{}},{"type":"text","text":"色度頻率是音樂音頻有趣且強大的表示,其中整個頻譜被投影到12個區間,代表音樂八度音的12個不同的半音。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"四、語音識別方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 在今天的主流語音識別系統中,聲學模型是一個混合(hybrid)模型,它包括用於序列跳轉的隱馬爾可夫模型(HMM)和根據當前幀來預測狀態的深度神經網絡。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、聲學模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" ","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":" 隱馬爾可夫模型(Hidden Markov Model,HMM)","attrs":{}},{"type":"text","text":"是用於建模離散時間序列的常見模型,它在語音識別中已經使用了幾十年了,算是很典型的聲學模型。","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" HMM 涉及的主要內容有:兩組序列(隱含狀態和觀測值),三種概率(初始狀態概率,狀態轉移概率,發射概率),和三個基本問題(產生觀測序列的概率計算,最佳隱含狀態序列的解碼,模型本身的訓練),以及這三個問題的常用算法(前向或後向算法,Viterbi 算法,EM 算法)。語音識別的最終應用對應的是解碼問題,而對語音識別系統的評估、使用也叫做解碼(Decoding)。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 在研究HMM之前,先帶大家簡單的回顧一下馬爾科夫鏈。馬爾科夫鏈是建模隨機過程的一種方法,用天氣來舉個簡單點的例子就是,今天是否下雨和前一天是否下雨有關,有一種關聯的特點。放在語音識別裏就是,我們能知道語音的頻譜,但不知道之前的頻譜代表什麼意思的,就可以通過歷史的頻譜,來","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":"推導","attrs":{}},{"type":"text","text":"新的頻譜的對應結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":" 高斯混合模型(GMM,Gaussian Mixed Model)","attrs":{}},{"type":"text","text":",主要就是通過GMM來求得某一音素的概率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 在語音識別中,HMM用於建模subword級別(比如音素)的聲學建模。通常我們使用3個狀態的HMM來建模一個音素,它們分別表示音素的開始、中間和結束。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 現在流行的語音系統不再使用GMM而是使用一個神經網絡模型模型,它的輸入是當前幀的特徵向量(可能還要加上前後一些幀的特徵),輸出是每個音素的概率。比如我們有50個音素,每個音素有3個狀態,那麼神經網絡的輸出是50x3=150。這種聲學模型叫做”混合”系統或者成爲HMM-DNN系統,這有別於之前的HMM-GMM模型,但是HMM模型目前還在被使用。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、語言模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 語言模型要解決的問題是如何計算 P(W),常用的方法基於 n 元語法(n-gram Grammar)或RNN。目前主要有n-gram語言模型和RNN語言模型。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" n-gram語言模型是典型的的自迴歸模型,而RNN語言模型因爲當前的結果依賴於之前的信息,所以可以使用單向循環神經網絡進行建模,在這裏感興趣的自己再去學習下哈,內容實在太多了,學長挑重要的跟大家講。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、解碼器","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 根據前面說的P(W|O),我們的最終目的是選擇使得 P(W|O) = P(O|W)P(W) 最大的 W ,所以解碼本質上是一個搜索問題,並可藉助加權有限狀態轉換器(Weighted Finite State Transducer,WFST) 統一進行最優路徑搜索(先了解下)","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、基於端到端的學習方法","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CTC (連接時序分類,Connectionist temporal classification),","attrs":{}},{"type":"text","text":" CTC 方法早在2006年就已提出並應用於語音識別,但真正大放異彩卻是在2012年之後,隨之各種CTC研究鋪展開來。CTC僅僅只是一種損失函數,簡而言之,輸入是一個序列,輸出也是一個序列,該損失函數欲使得模型輸出的序列儘可能擬合目標序列。之前需要語音對齊到幀,用這個就可以不需要對齊,它只會關心預測輸出的序列是否和真實的序列是否接近(相同)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" Attention模型 ","attrs":{}},{"type":"text","text":"看了很多概念還是覺得引用之前的例子最容易理解了:","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當我們人在看一樣東西的時候,我們當前時刻關注的一定是我們當前正在看的這樣東西的某一地方,換句話說,當我們目光移到別處時,注意力隨着目光的移動也在轉移。 Attention機制的實現是通過保留LSTM編碼器對輸入序列的中間輸出結果,然後訓練一個模型來對這些輸入進行選擇性的學習並且在模型輸出時將輸出序列與之進行關聯。","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"五、深度學習-CNN實戰舉例","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 前面說了這麼多理論知識,現在用Python代碼把CNN網絡模型進行簡單講解(讓大家以實例來更加了解語音分類過程),同時推薦大家看下此","attrs":{}},{"type":"link","attrs":{"href":"http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/CNN.pdf","title":"","type":null},"content":[{"type":"text","text":"PPT","attrs":{}}]},{"type":"text","text":"(點擊可查看下載)實屬學習乾貨!!!","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"#搭建CNN模型\nmodel = Sequential()\n\n# 輸入的大小\ninput_dim = (16, 8, 1)\n\nmodel.add(Conv2D(64, (3, 3), padding = \"same\", activation = \"tanh\", input_shape = input_dim))# 卷積層\nmodel.add(MaxPool2D(pool_size=(2, 2)))# 最大池化\nmodel.add(Conv2D(128, (3, 3), padding = \"same\", activation = \"tanh\")) #卷積層\nmodel.add(MaxPool2D(pool_size=(2, 2))) # 最大池化層\nmodel.add(Dropout(0.1))\nmodel.add(Flatten()) # 展開\nmodel.add(Dense(1024, activation = \"tanh\"))\nmodel.add(Dense(20, activation = \"softmax\")) # 輸出層:20個units輸出20個類的概率\n\n# 編譯模型,設置損失函數,優化方法以及評價標準\nmodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])\n\nmodel.summary()\n\n# 訓練模型\nmodel.fit(X_train, Y_train, epochs = 20, batch_size = 15, validation_data = (X_test, Y_test))\n\n\n# 預測測試集\ndef extract_features(test_dir, file_ext=\"*.wav\"):\n feature = []\n for fn in tqdm(glob.glob(os.path.join(test_dir, file_ext))[:]): # 遍歷數據集的所有文件\n X, sample_rate = librosa.load(fn,res_type='kaiser_fast')\n mels = np.mean(librosa.feature.melspectrogram(y=X,sr=sample_rate).T,axis=0) # 計算梅爾頻譜(mel spectrogram),並把它作爲特徵\n feature.extend([mels])\n return feature\n X_test = extract_features('./test_a/')\n X_test = np.vstack(X_test)\npredictions = model.predict(X_test.reshape(-1, 16, 8, 1))\npreds = np.argmax(predictions, axis = 1)\npreds = [label_dict_inv[x] for x in preds]\n\npath = glob.glob('./test_a/*.wav')\nresult = pd.DataFrame({'name':path, 'label': preds})\n\nresult['name'] = result['name'].apply(lambda x: x.split('/')[-1])\nresult.to_csv('submit.csv',index=None)\n!ls ./test_a/*.wav | wc -l\n!wc -l submit.csv","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"六、聲網 Agora 一站式智能語音識別方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 講完前面的語音識別必備知識,接下來咱們思考一下語音識別在語音聊天、音樂社交、視頻直播,這些與“聲音”有關的社交場景中應用越來越廣的背景下,還會出現哪些問題呢,其中最突出的問題就是現有的語音內容審覈+實時音視頻服務,部署、調試、運維的成本高,而且很多方案對有背景音樂、噪聲的音頻識別效果差。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 小王學長也是查看了許多應用解決方案,覺得聲網 Agora 一站式智能語音識別方案是比較不錯的,推薦給大家。肯定會有人問爲什麼你覺得好,好在哪裏?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 先說說現有","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":"傳統方案","attrs":{}},{"type":"text","text":",簡單分爲三步:","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 內容經過轉碼或直接推流至 CDN;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 內容審覈廠商從 CDN 拉流,然後進行 AI 、人工內容審覈;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. 完成審覈後,傳回給服務器端。","attrs":{}}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f8/f890f63063fc5d80aa3b77968de871c2.webp","alt":null,"title":"傳統的實時音視頻內容審覈流程\n\n\n\n(點擊圖片可見來源)","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"https://mp.weixin.qq.com/s/ynnQ6MR-75OsHV-iCOXvCA","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":" ","attrs":{}},{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":" 存在問題:","attrs":{}},{"type":"text","text":"首先,開發者需要對接三個廠商,要進行很多次的部署與調試,其中有很多調試會產生成本與風險,並且,當CDN出現故障的時候,會耗費較長時間來排查問題,也需要支付額外的拉流成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 另一方面,目前的方案還需解決噪聲的問題,例如語音社交、語音FM這些場景常常伴有背景音樂和環境噪聲,會影響現有內容審覈方案的識別率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}}],"text":" ","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}},{"type":"strong","attrs":{}}],"text":"聲網","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}}],"text":"現已提供業界","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}},{"type":"color","attrs":{"color":"#40A9FF","name":"blue"}},{"type":"strong","attrs":{}}],"text":"獨有","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}}],"text":"的一站式智能語音識別方案:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d6/d6853f757618791b7c102a8055bac2ab.webp","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 開發者只需要在應用中集成聲網 Agora SDK,即可讓音頻在 Agora SD-RTN™ 網絡中實時傳輸的過程中完成語音內容識別與審覈。並整合了業界 Top 3 語音識別服務,通過聲網獨家研發的 AI 音頻降噪引擎消除背景音,優化音頻質量,讓語音更加清晰。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":12}}],"text":"聲網語音識別方案的","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}},{"type":"color","attrs":{"color":"#F5222D","name":"red"}},{"type":"strong","attrs":{}}],"text":"優勢","attrs":{}},{"type":"text","marks":[{"type":"size","attrs":{"size":12}}],"text":":","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"1、調用 RESTful API,一站式接入:","attrs":{}},{"type":"text","text":"在應用中集成 Agora SDK 後,開發者可以通過調用 RESTful API,即可爲自己的應用增加語音內容審覈服務。","attrs":{}},{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"相比傳統內容審覈方案,聲網方案可以節省開發時間、服務器等接入成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"2、AI 降噪,識別率更高","attrs":{}},{"type":"text","text":":利用聲網 AI 音頻降噪引擎對音頻進行優化,以提升語音的識別率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"strong","attrs":{}}],"text":"3、語音交互低延時:","attrs":{}},{"type":"text","text":"聲網 SDK 實現了全球端到端76ms 的實時音視頻低延時傳輸。聲網Agora SD-RTN™ 實時通信網絡採用私有 UDP 協議進行傳輸,基於軟件定義優化路由選擇最優傳輸路徑,自動規避網絡擁塞和骨幹網絡故障帶來的影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 所以說,看完聲網與傳統解決方案的優缺點對比是不是覺得聲網的一站式解決方案很香!!!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 除此之外,再跟大家推薦一個好用的工具-","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}},{"type":"color","attrs":{"color":"#9254DE","name":"purple"}},{"type":"strong","attrs":{}}],"text":"聲網Agora的工具水晶球","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 簡單說,水晶球是聲網Agora推出的RTC行業首個質量監控與數據分析工具,主要解決終端用戶的問題反饋鏈太長等問題。想深入瞭解的可以點擊","attrs":{}},{"type":"link","attrs":{"href":"https://rtcdeveloper.com/t/topic/21603","title":"","type":null},"content":[{"type":"text","text":"這裏","attrs":{}}]}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特點:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1.自建監控","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 2.集成多種 RTC 監控工具","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 3.使用同一RTC服務商所提供的質量調查工具","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"七、語音識別開發工具","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"深度學習平臺","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/27/273782a6603fcd154c9b88bf1ec27b71.png","alt":null,"title":"小王學長用心總結(抓緊收藏)","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"語音識別開發工具","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/de79c1db1e3a0df6081f0d32355ccd66.png","alt":null,"title":"小王學長用心總結(抓緊收藏哈)","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"八、語音識別相關開源學習資料","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"開源數據集","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/google-research/sound-separation/blob/master/datasets/fuss/FUSS_license_doc/README.md","title":null,"type":null},"content":[{"type":"text","text":"Google發佈的語音分離數據集","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/bytedance/GiantMIDI-Piano","title":null,"type":null},"content":[{"type":"text","text":"字節跳動發佈全球最大鋼琴MIDI數據集","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/CLUEbenchmark/CLUEDatasetSearch","title":null,"type":null},"content":[{"type":"text","text":"中英文NLP數據集搜索庫:CLUEDatasetSearch","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"http://www.openslr.org/18/","title":null,"type":null},"content":[{"type":"text","text":"清華中文語音數據集THCHS-30","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s/pvXzhROkEMUBplrGCatwlQ","title":null,"type":null},"content":[{"type":"text","text":"中國明星聲紋數據集 CN-Celeb2","attrs":{}}],"marks":[{"type":"underline"}]}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#F5222D","name":"red"}}],"text":"開源語音識別項目","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/AgoraIO-Community/Solo","title":"","type":null},"content":[{"type":"text","text":"聲網Agora SOLO 開源音頻編碼器","attrs":{}}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/Uberi/speech_recognition","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/Uberi/speech_recognition","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/xxbb1234021/speech_recognition","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/xxbb1234021/speech_recognition","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/SeanNaren/deepspeech.pytorch","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/SeanNaren/deepspeech.pytorch","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/srvk/eesen","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/srvk/eesen","attrs":{}}],"marks":[{"type":"underline"}]}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://github.com/kaldi-asr/kaldi","title":null,"type":null},"content":[{"type":"text","text":"https://github.com/kaldi-asr/kaldi","attrs":{}}],"marks":[{"type":"underline"}]}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(小夥伴們看完記得點贊收藏下哈,小王學長希望能幫助到大家~)","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"作者介紹","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"王凱,計算機在讀碩士,兩年音視頻學習開發經驗,主攻音頻語音識別方向,對 NLP、深度學習、神經網絡、數學建模、音視頻編解碼技術有一定研究和實踐經驗。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章