作業幫基於 WeNet + ONNX 的端到端語音識別方案

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"WeNet是出門問問和西北工業大學聯合開源的端到端語音識別⼯具,WeNet基於PyTorch生態提供了開發、訓練和部署服務等一條龍服務方。自上線以來,在GitHub已經獲取近千star,受到業界的強烈關注。本文介紹了作業幫的WeNet + ONNX端到端語音識別推理方案,實驗表明,相比LibTorch,ONNX的方案可獲得20%至30%的速度提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"一、Why ONNX?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ONNX(Open Neural Network Exchange)格式,是一種針對機器學習所設計的開放式的文件格式,用於存儲訓練好的模型。它使得不同的人工智能框架(如Pytorch, MXNet)可以採用相同格式存儲模型數據並交互。將深度學習模型轉爲ONNX格式,可使模型在不同平臺上進行再訓練和推理。除了框架之間的互操作性之外,ONNX還提供了一些優化,可以加速推理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"二、"},{"type":"text","marks":[{"type":"strong"}],"text":"PyTorch轉ONNX"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將PyTorch模型轉爲ONNX格式在⼀定程度上是⽐較簡單的,PyTorch官⽹有較爲詳細的說明。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值得注意的是,PyTorch轉ONNX格式的torch.onnx.export()⽅法需要torch.jit.ScriptModule而不是torch.nn.Module,若傳⼊的模型不是SriptModule形式,該函數會利用tracing的方式,通過追蹤輸⼊tensor的流向,來記錄模型運算時的所有操作並轉爲ScriptModule。當然這種方式進行轉換,會導致模型無法對動態的操作流進行捕獲,比如對torch.tensor的動態切片操作會被當做固定的長度切片,一旦切片的長度發生變化便會引發錯誤。爲了對這些動態操作流程進行保存,可以使用scripting的方式,直接將動態操作流改寫爲ScriptModule。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"三、具體困難和我們的解決方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於目前的ONNX主要還是應用在CV領域,在處理這種非序列模型時,轉寫和應用都比較方便,然而,其對NLP、ASR領域的序列模型,特別是涉及到流式解碼的應用場景支持比較有限,將PyTorch訓練的U2模型轉爲ONNX格式並在推理時調用,相對而言是個比較麻煩的事情。主要困難有兩個:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1、不支持torch.tensor轉index的切片操作"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這點上面有提到,若使用tracing方式進行轉寫,對torch.tensor的切片,只可能是靜態切片如:data[:3] = new_data,這裏的3只能是固定值3,不能是傳入的torch.tensor;或者依靠傳入的torch.tensor作爲index,來對張量進行切片,如data[torch.tensor([1, 2])] = new_data。除此之外是不支持其他動態切片方式的,如data[:data.shape[0]]。WeNet流式解碼時,需要encoder對輸⼊的cache tensor進行切片操作,這裏當然可以通過一次次地傳⼊需要切片的index tensor來進行切片,但這樣做明顯將模型變得複雜了很多,利用scripting的方式將需要切片的的操作直接改寫爲ScriptModule是更可取的方式,如EncoderLayer模塊中,我們添加了"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"@torch.jit.script\n\ndef slice_helper(x, offset):\n\nreturn x[:, -offset: , : ]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"chunk = x.size(1) - output_cache.size(1)\n\nx_q = x[:, -chunk:, :]\n\nresidual = residual[:, -chunk:, :]\n\nmask = mask[:, -chunk:, :]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"改寫爲"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"chunk = x.size(1) - output_cache.size(1)\n\nx_q = slice_helper(x, chunk)\n\nresidual = slice_helper(residual, chunk)\n\nmask = slice_helper(mask, chunk)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是值得注意的是,若將torch.nn.Module轉爲torch.jit.ScriptModule,模型在PyTorch上是無法進行計算的,即無法進行訓練。按照通用做法,可以將訓練代碼和轉寫代碼分爲兩部分,一個專門用來訓練,一個專門讀取模型並轉寫。實際上,也可以簡單地在使用到scripting的模塊中,添加bool屬性onnx_mode,在訓練時設置爲False,轉寫時設置爲True即可:"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"def set_onnx_mode(self, onnx_mode=False):\n\nself.onnx_mode = onnx_mode\n\nchunk = x.size(1) - output_cache.size(1)\n\nif onnx_mode:\n\nx_q = slice_helper(x, chunk)\n\nresidual = slice_helper(residual, chunk)\n\nmask = slice_helper(mask, chunk)\n\nelse:\n\nx_q = x[:, -chunk:, :]\n\nresidual = residual[:, -chunk:, :]\n\nmask = mask[:, -chunk:, :]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2、不支持傳入NoneType類型參數"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對WeNet流式解碼,encoder部分在第一個chunk輸入時,輸入的cache都爲NoneType,而在後續chunk特徵輸⼊時,各cache會儲存不同大小的值進行輸入,這樣做主要是爲了避免重複地對每一幀特徵進行計算。然而因爲ONNX轉寫的模型不支持NoneType輸入,無法簡單地導出一個模型進行推理,最原始的想法是在導出ONNX模型的時候,通過調整輸入不同值(不輸入cache、輸入cache),導出兩個模型,在第一個chunk輸入時使用前者,後續chunk輸入時使用後者。這種方法減輕了代碼量,但是明顯不太適合,畢竟encoder部分參數佔了整個模型一半以上,無論是線上還是本地化實現,兩個encoder導致的體積增加是難以容忍的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的方案是正常導出傳入非NoneType參數的模型,但是在runtime調用時,第一個chunk不再輸入None,而是一個dummy的張量。subsampling_cache及elayers_output_cache輸入音頻長度爲1、值爲0的張量"},{"type":"text","marks":[{"type":"italic"}],"text":","},{"type":"text","text":"conformer_cnn_cache直接輸入長度爲cnn_kernel_size - 1、值爲0的張量(對應causal CNN前置的padding)"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"batch_size = 1\n\naudio_len = 131\n\nx = torch.randn(batch_size, audio_len, 80, requires_grad=False)\n\nsubsampling_cache = torch.randn(batch_size, 1, 256, requires_grad=False)\n\nelayers_output_cache = torch.randn(12, batch_size, 1, 256, requires_grad=False)\n\nconformer_cnn_cache = torch.randn(12, batch_size, 256, 14, requires_grad=False)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對應的,訓練完模型後,在導出模型時的encoder的實現代碼中,需要將每次輸⼊的第一幀音頻特徵捨去,它不參與實際運算。利用前文提到的onnx_mode屬性,我們可以實現訓練時正常使用所有特徵,轉ONNX模型時忽略掉第⼀幀,如在attention計算時,提取x_q的chunk需要改爲"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"if onnx_mode:\n\nchunk = x.size(1) - output_cache.size(1) + 1\n\nelse:\n\nchunk = x.size(1) - output_cache.size(1)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了上述兩個較爲明顯的問題,轉ONNX模型還有⼀些坑需要注意,比如前文提到tracing是通過追蹤輸入tensor的流向來定位參與的運算,而不能通過其他類型如List[tensor],encoder模塊中的forward_chunk函數各個層的cache tensor不能使用使用list來保存,必須要通過torch.cat函數合併成tensor,否則在調用ONNX模型時,對模型輸出的索引將會出錯。(如下面的代碼不修改,輸出output對應索引位置的值,不是r_conformer_cnn_cache,⽽是r_conformer_cnn_cache[0])"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"r_conformer_cnn_cache.append(new_cnn_cache)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"改爲"}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"r_conformer_cnn_cache = torch.cat((r_conformer_cnn_cache, new_cnn_cache.unsqueeze(0)), 0)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外需要注意的是,通過tracing來追蹤模型的opts,如果模型傳入的tensor沒有被使用,導出的模型就會認爲不會輸入該參數,若輸入該參數會導致報錯。最後,ONNX不支持tensor轉bool變量操作,訓練的python腳本中大量的assert將無法使用,不過具體使用時這個可以不用考慮。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"四、具體的實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"說完了困難和解決方案,具體實現就非常簡單了。首先U2模型是分爲三個大塊,encoder、CTC 以及 decoder,我們需要分別對三個塊進行導出,最簡單的CTC不必多說,decoder由於不涉及到cache,也較爲簡單,不過爲了方便decoder的輸出能直接被使用,我們在導出decoder時去掉了不需要的輸出,並且將輸出的值進行softmax變換"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"if self.onnx_mode:\n\nreturn torch.nn.functional.log_softmax(x, dim=-1)\n\nelse:\n\nreturn x, olens"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對encoder,按照第二部分將動態切片部分和cache的dummy進行處理後,按照如下操作將encoder的forward函數替換爲forward_chunk即可進行導出。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"model.eval()\n\nencoder = model.encoder\n\nencoder.set_onnx_mode(True)\n\nencoder.forward = encoder.forward_chunk\n\nbatch_size = 1\n\naudio_len = 131\n\nx = torch.randn(batch_size, audio_len, 80, requires_grad=False)\n\ni1 = torch.randn(batch_size, 1, 256, requires_grad=False)\n\ni2 = torch.randn(12, batch_size, 1, 256, requires_grad=False)\n\ni3 = torch.randn(12, batch_size, 256, 14, requires_grad=False)\n\nonnx_path = os.path.join(args.output_onnx_file, 'encoder.onnx')\n\ntorch.onnx.export(encoder,\n\n(x, i1, i2, i3),\n\nonnx_path,\n\nexport_params=True,\n\nopset_version=12,\n\ndo_constant_folding=True,input_names=['input', 'i1', 'i2', 'i3'],\n\noutput_names=['output', 'o1', 'o2', 'o3'],\n\ndynamic_axes={'input': [1], 'i1':[1], 'i2':[2],\n\n'output': [1], 'o1':[1], 'o2':[2]},\n\nverbose=True\n\n)\n\nonnx_model = onnx.load(onnx_path)\n\nonnx.checker.check_model(onnx_model)\n\nprint(\"encoder onnx_model check pass!\")\n\n# compare ONNX Runtime and PyTorch results\n\nencoder.set_onnx_mode(False)\n\ny, o1, o2, o3 = encoder(x, None, None, i3)\n\nort_session = onnxruntime.InferenceSession(onnx_path)\n\nort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x),\n\nort_session.get_inputs()[1].name: to_numpy(i1),\n\nort_session.get_inputs()[2].name: to_numpy(i2),\n\nort_session.get_inputs()[3].name: to_numpy(i3)}\n\nort_outs = ort_session.run(None, ort_inputs)\n\nnp.testing.assert_allclose(to_numpy(y), ort_outs[0][:, 1:, :], rtol=1e-05, atol=1e-05)\n\nnp.testing.assert_allclose(to_numpy(o1), ort_outs[1][:, 1:, :], rtol=1e-05, atol=1e-05)\n\nnp.testing.assert_allclose(to_numpy(o2), ort_outs[2][:, :, 1:, :], rtol=1e-05, atol=1e-05)\n\nnp.testing.assert_allclose(to_numpy(o3), ort_outs[3], rtol=1e-05, atol=1e-05)\n\nprint(\"Exported encoder model has been tested with ONNXRuntime, and the result looks good!\")"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"導出模型後,WeNet的runtime也需要根據導出的模型進行修改,最主要是對dummy的張量的處理,如原本的TorchAsrDecoder中,初始化的subsampling_cache_、elayers_output_cache_、conformer_cnn_cache_應按照對應大小設置爲全爲0的張量(其他數字也可以,反正不會參與運算),對應的,offset_初始值應該設置爲1,每次Reset的時候也應重新設置爲上述值。其他方面按照onnxruntime給定的API以及demo就可以順利完成後續集成的工作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"五、ONNX效果實測"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前我們測試的結果是onnxruntime運行速度要相對libtorch提升20%~30%左右,而且ONNX的解碼器完成之後,也能依葫蘆畫瓢比較順利的完成集成MNN的工作,便於後續可能的本地化加速需求,在centos 服務器上,onnxruntime、libtorch實時率對比見下表(2000條音頻測試結果)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/2c\/4f\/2cc070bd80cea9051f8027331e93864f.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"六、題外話:WeNet訓練相關調參經驗分享"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"WeNet自發布以來以其易用性以及模型優秀的落地效果獲取了大量關注,從去年起我們就一直在跟WeNet的相關工作,同時也在WeNet的基礎上做了大量相關實驗,有一些相關經驗可以和大家分享一下,需要說明的是,下述經驗只在我們場景下的數據集得到了驗證,不代表適應所有應用場景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,對於數據,Spec Aug數據增強部分,我們將num_t_mask * max_t = 2 * 50 改爲 4 * 25,對最終效果有能觀察到的正向影響,猜想是短小而密集的mask更貼近白噪聲效果;feature_dither在訓練和推理時都設爲true,效果也會更好。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次關於模型訓練的速度,爲了最大化GPU的使用效率,可以在GPU memory足夠的情況下儘可能把batch設的大一些。一般來說我們都會把數據按長度進行排序後,再分爲不同的batch進行訓練,因此可能存在的數據長度不均衡的情況,會導致靜態batch大小往往受限於最長音頻所在的批次,只能取較小值。爲了避免這種情況可以採用espnet的經驗,將batch設爲動態的,每當音頻長度增長到某些瓶頸就減小batch值,另外也可以直接在WeNet訓練時,將batch_type設置爲dynamic,使用data bucket的方式限制每個batch音頻的總長度,而不是每個batch的音頻條數。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,對於模型大小,在我們的場景下(中文識別),線性層units個數從2048調整爲1024對最終結果影響較小,可以爲了更快的訓練、識別速度進行適當調整。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"七、關於作業幫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作業幫教育科技(北京)有限公司成立於2015年,一直致力於用科技手段助力教育普惠,運用人工智能、大數據等前沿技術,爲全國中小學生提供更高效的學習解決方案。公司旗下有作業幫APP、作業幫直播課、作業幫口算、喵喵機等多款教育科技產品。作業幫用戶遍佈全國各地,其中70%以上來自於三線及以下城市。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章