Zuoyebang's End-to-End Speech Recognition Solution Based on WeNet + ONNX

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"WeNet是出门问问和西北工业大学联合开源的端到端语音识别⼯具,WeNet基于PyTorch生态提供了开发、训练和部署服务等一条龙服务方。自上线以来,在GitHub已经获取近千star,受到业界的强烈关注。本文介绍了作业帮的WeNet + ONNX端到端语音识别推理方案,实验表明,相比LibTorch,ONNX的方案可获得20%至30%的速度提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"一、Why ONNX?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ONNX(Open Neural Network Exchange)格式,是一种针对机器学习所设计的开放式的文件格式,用于存储训练好的模型。它使得不同的人工智能框架(如Pytorch, MXNet)可以采用相同格式存储模型数据并交互。将深度学习模型转为ONNX格式,可使模型在不同平台上进行再训练和推理。除了框架之间的互操作性之外,ONNX还提供了一些优化,可以加速推理。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"二、"},{"type":"text","marks":[{"type":"strong"}],"text":"PyTorch转ONNX"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"将PyTorch模型转为ONNX格式在⼀定程度上是⽐较简单的,PyTorch官⽹有较为详细的说明。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值得注意的是,PyTorch转ONNX格式的torch.onnx.export()⽅法需要torch.jit.ScriptModule而不是torch.nn.Module,若传⼊的模型不是SriptModule形式,该函数会利用tracing的方式,通过追踪输⼊tensor的流向,来记录模型运算时的所有操作并转为ScriptModule。当然这种方式进行转换,会导致模型无法对动态的操作流进行捕获,比如对torch.tensor的动态切片操作会被当做固定的长度切片,一旦切片的长度发生变化便会引发错误。为了对这些动态操作流程进行保存,可以使用scripting的方式,直接将动态操作流改写为ScriptModule。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"三、具体困难和我们的解决方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由于目前的ONNX主要还是应用在CV领域,在处理这种非序列模型时,转写和应用都比较方便,然而,其对NLP、ASR领域的序列模型,特别是涉及到流式解码的应用场景支持比较有限,将PyTorch训练的U2模型转为ONNX格式并在推理时调用,相对而言是个比较麻烦的事情。主要困难有两个:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1、不支持torch.tensor转index的切片操作"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这点上面有提到,若使用tracing方式进行转写,对torch.tensor的切片,只可能是静态切片如:data[:3] = new_data,这里的3只能是固定值3,不能是传入的torch.tensor;或者依靠传入的torch.tensor作为index,来对张量进行切片,如data[torch.tensor([1, 2])] = new_data。除此之外是不支持其他动态切片方式的,如data[:data.shape[0]]。WeNet流式解码时,需要encoder对输⼊的cache 
3. Specific difficulties and our solutions

Because ONNX is currently applied mostly in the computer vision field, converting and deploying such non-sequential models is quite convenient. Its support for sequence models in NLP and ASR, however, and especially for streaming decoding, is still limited, so converting a U2 model trained with PyTorch to ONNX and calling it at inference time is a comparatively troublesome job. There are two main difficulties.

1) Slicing with a torch.Tensor-derived index is not supported

As mentioned above, when a model is converted via tracing, slicing a torch.Tensor can only take two forms: a static slice such as data[:3] = new_data, where 3 must be the literal constant 3 and cannot come from a torch.Tensor passed in at runtime; or indexing with an explicit index tensor, such as data[torch.tensor([1, 2])] = new_data. Other dynamic slicing patterns, such as data[:data.shape[0]], are not supported. During WeNet streaming decoding, the encoder has to slice the input cache tensors. We could of course pass in an index tensor for every slice we need, but that would clearly make the model much more complex; rewriting the slicing operations as ScriptModules via scripting is preferable. In the EncoderLayer module, for example, we added

```python
@torch.jit.script
def slice_helper(x: torch.Tensor, offset: int):
    return x[:, -offset:, :]
```

and rewrote

```python
chunk = x.size(1) - output_cache.size(1)
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
```

as

```python
chunk = x.size(1) - output_cache.size(1)
x_q = slice_helper(x, chunk)
residual = slice_helper(residual, chunk)
mask = slice_helper(mask, chunk)
```

Note, however, that once a torch.nn.Module is rewritten as a torch.jit.ScriptModule this way, the model can no longer be run as usual in PyTorch, i.e., it can no longer be trained. The common practice is to split the code into two parts: one dedicated to training, and one that loads the trained model and converts it. In practice, you can also simply add a boolean attribute onnx_mode to every module that uses scripting, set it to False during training, and set it to True when converting:

```python
def set_onnx_mode(self, onnx_mode: bool = False):
    self.onnx_mode = onnx_mode

# inside EncoderLayer.forward
chunk = x.size(1) - output_cache.size(1)
if self.onnx_mode:
    x_q = slice_helper(x, chunk)
    residual = slice_helper(residual, chunk)
    mask = slice_helper(mask, chunk)
else:
    x_q = x[:, -chunk:, :]
    residual = residual[:, -chunk:, :]
    mask = mask[:, -chunk:, :]
```
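To make the failure mode concrete, here is a small, self-contained toy (names and shapes invented for the example) showing how tracing freezes a slice length computed from the example inputs:

```python
import torch

class DynSlice(torch.nn.Module):
    def forward(self, x, cache):
        # keep only the frames not already covered by the cache
        chunk = x.size(1) - cache.size(1)
        return x[:, -chunk:, :]

m = DynSlice()
x1, c1 = torch.randn(1, 8, 4), torch.randn(1, 3, 4)
traced = torch.jit.trace(m, (x1, c1))   # chunk = 5 is baked into the trace

x2, c2 = torch.randn(1, 8, 4), torch.randn(1, 6, 4)
print(m(x2, c2).shape)       # torch.Size([1, 2, 4]): eager mode follows the new cache length
print(traced(x2, c2).shape)  # torch.Size([1, 5, 4]): the traced graph still cuts 5 frames
```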
2) NoneType arguments are not supported

In WeNet streaming decoding, all the caches fed to the encoder are NoneType for the first chunk; for each subsequent chunk, the caches hold previously computed values of growing size. This is done mainly to avoid recomputing every frame of features. However, because a model exported to ONNX does not accept NoneType inputs, we cannot simply export one model and use it for inference. The most naive idea is to export two models by varying the inputs at export time (one without the caches, one with them), using the former for the first chunk and the latter for all later chunks. This keeps the code changes small, but it is clearly a poor fit: the encoder accounts for more than half of the model's parameters, and whether the model is served online or deployed on-device, the size increase from carrying two encoders is hard to accept.

Our solution is to export the model normally with non-None arguments, and at runtime no longer pass None for the first chunk but dummy tensors instead: subsampling_cache and elayers_output_cache receive zero-valued tensors with an audio length of 1, and conformer_cnn_cache receives a zero-valued tensor of length cnn_kernel_size - 1 (corresponding to the leading padding of the causal CNN):

```python
batch_size = 1
audio_len = 131
x = torch.randn(batch_size, audio_len, 80, requires_grad=False)
# dummy first-chunk caches: length-1 subsampling/encoder caches,
# length (cnn_kernel_size - 1) = 14 conformer CNN cache
subsampling_cache = torch.randn(batch_size, 1, 256, requires_grad=False)
elayers_output_cache = torch.randn(12, batch_size, 1, 256, requires_grad=False)
conformer_cnn_cache = torch.randn(12, batch_size, 256, 14, requires_grad=False)
```

Correspondingly, after training, the encoder code used at export time has to discard the first frame of every input cache so that it never takes part in the real computation. With the onnx_mode attribute introduced above, we can use all features normally during training and drop that first frame when exporting to ONNX; for example, the chunk used to extract x_q in the attention computation becomes

```python
if self.onnx_mode:
    chunk = x.size(1) - output_cache.size(1) + 1
else:
    chunk = x.size(1) - output_cache.size(1)
```

Beyond these two obvious problems, converting to ONNX has a few more pitfalls. As mentioned earlier, tracing locates the participating operations by following the flow of input tensors; it cannot follow other types such as List[Tensor]. In the encoder's forward_chunk function, the per-layer cache tensors therefore must not be kept in a Python list; they have to be merged into a single tensor with torch.cat, otherwise indexing the outputs of the ONNX model will go wrong. (If the code below is left unchanged, the value at the corresponding output index is not r_conformer_cnn_cache but r_conformer_cnn_cache[0].)

```python
r_conformer_cnn_cache.append(new_cnn_cache)
```

has to become

```python
r_conformer_cnn_cache = torch.cat((r_conformer_cnn_cache, new_cnn_cache.unsqueeze(0)), 0)
```

Also note that because the model's ops are collected by tracing, any input tensor that is never used is dropped from the exported model, and feeding that argument at inference time will then raise an error. Finally, ONNX does not support converting a tensor to a bool, so the many asserts in the training scripts cannot be exported; in practice this can simply be ignored.
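The dropped-input behavior is easy to reproduce with a toy module (the file name is hypothetical, and the exact behavior can vary with the PyTorch exporter version):

```python
import torch
import onnxruntime

class Toy(torch.nn.Module):
    def forward(self, x, unused):
        # 'unused' never flows into any op, so the exporter prunes it from the graph
        return x * 2.0

torch.onnx.export(Toy(), (torch.randn(1, 4), torch.randn(1, 4)),
                  "toy.onnx", opset_version=12)

sess = onnxruntime.InferenceSession("toy.onnx")
print([i.name for i in sess.get_inputs()])  # only the used input remains
# Feeding a value under the pruned input's name would raise an invalid-input-name error.
```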
4. Implementation

With the difficulties and their solutions covered, the implementation itself is simple. The U2 model consists of three major parts, the encoder, the CTC head, and the decoder, and each is exported separately. The CTC head is the simplest and needs no further comment. The decoder involves no cache, so it is also fairly simple; to make its output directly usable, though, we removed the unneeded outputs at export time and applied a log-softmax to the output values:

```python
if self.onnx_mode:
    return torch.nn.functional.log_softmax(x, dim=-1)
else:
    return x, olens
```

For the encoder, once the dynamic slicing and the dummy caches have been handled as described in the previous section, exporting only requires replacing the encoder's forward function with forward_chunk, as follows:

```python
import os

import numpy as np
import onnx
import onnxruntime
import torch

# 'model' and 'args' come from the surrounding export script (the trained model and CLI options)
def to_numpy(t):
    return t.detach().cpu().numpy()

model.eval()
encoder = model.encoder
encoder.set_onnx_mode(True)
encoder.forward = encoder.forward_chunk

batch_size = 1
audio_len = 131
x = torch.randn(batch_size, audio_len, 80, requires_grad=False)
i1 = torch.randn(batch_size, 1, 256, requires_grad=False)
i2 = torch.randn(12, batch_size, 1, 256, requires_grad=False)
i3 = torch.randn(12, batch_size, 256, 14, requires_grad=False)

onnx_path = os.path.join(args.output_onnx_file, 'encoder.onnx')
torch.onnx.export(encoder,
                  (x, i1, i2, i3),
                  onnx_path,
                  export_params=True,
                  opset_version=12,
                  do_constant_folding=True,
                  input_names=['input', 'i1', 'i2', 'i3'],
                  output_names=['output', 'o1', 'o2', 'o3'],
                  dynamic_axes={'input': [1], 'i1': [1], 'i2': [2],
                                'output': [1], 'o1': [1], 'o2': [2]},
                  verbose=True)

onnx_model = onnx.load(onnx_path)
onnx.checker.check_model(onnx_model)
print("encoder onnx_model check pass!")

# compare ONNX Runtime and PyTorch results
encoder.set_onnx_mode(False)
y, o1, o2, o3 = encoder(x, None, None, i3)

ort_session = onnxruntime.InferenceSession(onnx_path)
ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x),
              ort_session.get_inputs()[1].name: to_numpy(i1),
              ort_session.get_inputs()[2].name: to_numpy(i2),
              ort_session.get_inputs()[3].name: to_numpy(i3)}
ort_outs = ort_session.run(None, ort_inputs)

# the leading frame of the ONNX outputs is the dummy cache frame, hence the [:, 1:, :] slices
np.testing.assert_allclose(to_numpy(y), ort_outs[0][:, 1:, :], rtol=1e-05, atol=1e-05)
np.testing.assert_allclose(to_numpy(o1), ort_outs[1][:, 1:, :], rtol=1e-05, atol=1e-05)
np.testing.assert_allclose(to_numpy(o2), ort_outs[2][:, :, 1:, :], rtol=1e-05, atol=1e-05)
np.testing.assert_allclose(to_numpy(o3), ort_outs[3], rtol=1e-05, atol=1e-05)
print("Exported encoder model has been tested with ONNXRuntime, and the result looks good!")
```

After the models are exported, WeNet's runtime also has to be adapted to them. The main change is the handling of the dummy tensors: in the original TorchAsrDecoder, the initial subsampling_cache_, elayers_output_cache_, and conformer_cnn_cache_ should be set to all-zero tensors of the corresponding sizes (any other value would also work, since they never participate in the computation), and the initial value of offset_ should accordingly be set to 1; the same values should be restored on every Reset. Everything else can be integrated smoothly by following the onnxruntime API and its demos.
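For reference, here is a rough Python sketch of what the runtime does chunk by chunk. It is only an illustration under assumptions: the path "encoder.onnx", the random feature chunks, and the frame bookkeeping (dropping the leading dummy frame of each output, as the assert_allclose checks above suggest) are stand-ins, and the actual beam search and CTC/decoder calls are omitted:

```python
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession("encoder.onnx")  # hypothetical path

# first-chunk dummy caches, matching the shapes used at export time
i1 = np.zeros((1, 1, 256), dtype=np.float32)       # subsampling_cache
i2 = np.zeros((12, 1, 1, 256), dtype=np.float32)   # elayers_output_cache
i3 = np.zeros((12, 1, 256, 14), dtype=np.float32)  # conformer_cnn_cache

encoder_out = []
for chunk in [np.random.randn(1, 67, 80).astype(np.float32) for _ in range(3)]:
    # outputs come back in the order declared at export: output, o1, o2, o3,
    # and the returned caches are fed straight into the next chunk
    output, i1, i2, i3 = session.run(
        None, {"input": chunk, "i1": i1, "i2": i2, "i3": i3})
    # the first frame corresponds to the dummy/cache entry and is discarded
    encoder_out.append(output[:, 1:, :])

encoder_out = np.concatenate(encoder_out, axis=1)  # then passed to CTC / attention rescoring
```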
5. Measured results with ONNX

In our tests so far, onnxruntime runs roughly 20%-30% faster than LibTorch. Moreover, once the ONNX-based decoder was in place, integrating MNN in the same fashion also went fairly smoothly, which helps with possible on-device acceleration needs later on. The real-time factor comparison of onnxruntime and LibTorch on a CentOS server is shown in the table below (results over 2,000 test audio clips).

[Table image: real-time factor comparison of onnxruntime vs. LibTorch, https://static001.infoq.cn/resource/image/2c/4f/2cc070bd80cea9051f8027331e93864f.jpg]

6. Aside: tuning experience from WeNet training

Since its release, WeNet has drawn a great deal of attention for its ease of use and the strong real-world performance of its models. We have been following WeNet since last year and have run a large number of experiments on top of it, so here is some experience worth sharing. Note that these observations have only been validated on the datasets in our own scenario and will not necessarily transfer to every application.

First, on the data side: in the SpecAugment part of data augmentation, changing num_t_mask * max_t from 2 * 50 to 4 * 25 had an observable positive effect on the final results; our guess is that shorter, denser masks behave more like white noise. Setting feature_dither to true for both training and inference also improved results.

Second, on training speed: to maximize GPU utilization, the batch size should be made as large as GPU memory allows. We usually sort the data by length and then split it into batches, so uneven utterance lengths mean a static batch size is limited by the batch containing the longest audio and has to stay small. To avoid this, you can follow ESPnet's practice and make the batch size dynamic, shrinking it whenever the audio length passes certain thresholds; alternatively, set batch_type to dynamic in WeNet training and use data buckets to cap the total audio length per batch instead of the number of utterances per batch.

Finally, on model size: in our scenario (Chinese recognition), reducing the number of linear-layer units from 2048 to 1024 had little impact on the final results, so it can be adjusted for faster training and recognition.

7. About Zuoyebang

Zuoyebang Education Technology (Beijing) Co., Ltd., founded in 2015, is committed to making education more accessible through technology, using artificial intelligence, big data, and other cutting-edge technologies to provide more efficient learning solutions for primary and secondary school students across China. Its products include the Zuoyebang app, Zuoyebang live courses, Zuoyebang Kousuan (oral arithmetic), and the Miaomiaoji printer. Zuoyebang's users come from all over the country, with more than 70% from third-tier and smaller cities.