BERT Compression in Practice at China Merchants Securities (Part 2): How to Build a 3-Layer 8-bit Model?

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BERT,全称 Bidirectional Encoder Representation from Transformers,是一款于 2018 年发布,在包括问答和语言理解等多个任务中达到顶尖性能的语言模型。它不仅击败了之前最先进的计算模型,而且在答题方面也有超过人类的表现。招商证券希望借助BERT提升自研NLP平台的能力,为旗下智能产品家族赋能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在前一篇"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/fyWR8cOmI7xtfEY3rqA3","title":"","type":null},"content":[{"type":"text","text":"蒸馏模型"}]},{"type":"text","text":"中,招商证券信息技术中心 NLP 开发组已经初步实践了BERT模型压缩方法,成功将12层BERT模型缩减为3层。在本次分享中,研发人员们将介绍更简洁的模块替换方法,以及削减参数比特位的量化方法,并将这几种方法有机结合实现了将BERT体积压缩至1\/10的目标。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. BERT-of-Theseus模块替换"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1.1 概述"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"codeinline","content":[{"type":"text","text":"BERT-of-Theseus"}]},{"type":"text","text":"[1]主要通过模块替换的方法进行模型压缩。不同于模型蒸馏方法需要根据模型和任务制定复杂的损失函数以及引入大量额外超参,Theseus压缩方法显得简洁许多:该方法同样需要一个大模型作为“先驱”,而规模较小的目标模型作为“后辈”(类似蒸馏方法中的教师模型和学生模型),对于“先驱”BERT模型来说,主体部分是由多个结构相同的Transformer Encoder组成,“后辈”模型将“先驱”中的每N个Transformer Encoder模块替换为1个Transformer Encoder模块,从而实现模型的压缩。具体实现过程如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在BERT模型中,第i个Encoder的输出为:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/5b\/6a\/5b13ba5736bf81aee20d773513c3906a.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“先驱”和“后辈”中第i个模块的输出分别为:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/06\/4f\/06d725effd292238d25f7fc1fce8b94f.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由于“后辈”模型的1个Encoder模块将会替换N个“先驱”Encoder模块,因此可以将每N个“先驱”Encoder模块分为一个逻辑组,从而与“后辈”模型对应。Theseus方法就是用“后辈”Encoder模块替换对应的“先驱”逻辑组,具体的替换的过程比较直观:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,设置一个概率p,通过伯努利分布函数获得模块替换概率:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/c3\/fb\/c337080b7d90fdd0d1a30cdde7cc2afb.png","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste
":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}