DAMO Academy's Large Model M6 Announces Latest Progress: Parameters Exceed 10 Trillion, Making It the World's Largest AI Pre-trained Model

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"当前,大规模预训练模型已成为学术界和工业界都非常关注的一大研究领域。随着达摩院大模型M6突破10万亿参数,中国成功实现了全球最大AI预训练模型。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"M6成全球最大AI预训练模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"11月8日,阿里巴巴达摩院公布"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/xIX9lekuuLcXewc5iphF","title":"xxx","type":null},"content":[{"type":"text","text":"多模态大模型M6"}]},{"type":"text","text":"最新进展,其参数已从万亿跃迁至10万亿,规模远超谷歌、微软此前发布的万亿级模型,成为全球最大的AI预训练模型。据了解,M6使用512 GPU在10天内即训练出具有可用水平的10万亿模型。相比去年发布的大模型GPT-3,M6实现同等参数规模,能耗仅为其1%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"M6是达摩院研发的通用性人工智能大模型,拥有多模态、多任务能力,其认知和创造能力超越传统AI,尤其擅长设计、写作、问答,在电商、制造业、文学艺术、科学研究等领域有广泛应用前景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自2020年中GPT-3提出以来,一系列国内外大企业都在大模型的研发上开展探索,专注各个领域任务的大模型相继提出,在各大下游任务都展现出优越的表现。无疑,超大规模预训练模型蕴含着巨大的学术研究价值和商业落地价值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"与传统AI相比,大模型拥有成百上千倍“神经元”数量,且预先学习过海量知识,表现出像人类一样“举一反三”的学习能力。因此,大模型被普遍认为是未来的“基础模型”,将成下一代AI基础设施。然而,其算力成本相当高昂,训练1750亿参数语言大模型GPT-3所需能耗,相当于汽车行驶地月往返距离。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此前达摩院陆续发布了多个版本的M6模型,从大规模稠密模型到超大规模的混合专家模型的探索,逐步从百亿参数升级到万亿参数规模。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/76\/85\/7628a20d6fa0161f0yy154yy84c47985.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"今年5月,通过专家并行策略及优化技术,达摩院M6团队将万亿模型能耗降低超八成、效率提升近11倍。10月,M6再次突破业界极限,通过更细粒度的CPU offload、共享-解除算法等创新技术,让收敛效率进一步提升7倍,这使得模型规模扩大10倍的情况下,能耗未显著增加。这一系列突破极大降低了大模型研究门槛,让一台机器训练出一个千亿模型成为可能。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同时,达摩院联合阿里云推出了M6服务化平台,为大模型训练及应用提供完备工具,首次让大模型实现“开箱即用”,算法人员及普通用户均可方便地使用平台。达摩院还推出了当前最大规模的中文多模态评测数据集MUGE,覆盖图文描述、文本生成图像、跨模态检索任务,填补了缺少中文多模态权威评测基准的空白。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作为国内首个商业化落地的多模态大模型,M6已在超40个场景中应用,日调用量上亿。今年,大模型首次支持双11。M6在犀牛智造为品牌设计的服饰已在淘宝上线;凭借流畅的写作能力,M6正为天猫虚拟主播创作剧本;依靠多模态理解能力,M6正在增进淘宝、支付宝等平台的搜索及内容认知精度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d7\/4f\/d79c51ea6a095e164825ec180437e34f.jpg","alt":null,"title":"M6生成的未来感汽车图","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"10万亿M6技术实现"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前文提到,M6使用512 GPU在10天内即训练出具有可用水平的10万亿模型。而之前业界最好水平是微软最新发布的DeepSpeed,其使用了512张A100才完成3.5万亿参数基于MoE的GPT。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在研发的过程中,研究人员发现MoE模型结合高效的分组机制能够用有限资源快速训练完成一个效果优越的大模型。同时一系列大模型的工作都在说明,参数规模的扩展带来的便是模型能力边界的扩展,更多的数据+更大的模型=更强的能力。那么,如果要训练的是极限规模的十万亿参数模型,是不是就需要成倍地增加机器呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"M6团队提出的命题是,如何在有限资源的条件下高效地训练极限规模模型?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"M6团队提出了一种简单的方法解决此类极限规模模型训练的问题,不仅关注如何用有限的资源训练极限规模模型,还关注如何将其训练至真实可用。团队使用512张GPU将十万亿参数的模型训练至可用的水平,而如果训练此前的万亿参数模型也只需要64张GPU即可实现。相比此前的M6模型,M6-10T具有如下优势:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"相比此前的万亿参数M6,M6-10T的参数量是原先的10倍没有显著的资源增加(480 vs 512 
GPU);"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"相比万亿参数M6,M6-10T在样本量的维度上具有更快的收敛速度;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"提出的共享解除机制将十万亿参数模型的训练速度提升7倍以上,并可广泛应用于其他同类大模型的训练。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/damo.alibaba.com\/","title":"xxx","type":null},"content":[{"type":"text","text":"达摩院"}]},{"type":"text","text":"智能计算实验室联合阿里云PAI团队,在"},{"type":"link","attrs":{"href":"https:\/\/baijiahao.baidu.com\/s?id=1708472807237741573&wfr=spider&for=pc","title":"xxx","type":null},"content":[{"type":"text","text":"Whale框架"}]},{"type":"text","text":"下实现M6模型。此前发布的千亿和万亿参数M6模型,均在Whale上实现,利用其强大的数据并行、模型并行以及专家并行的能力实现超大规模模型的训练和推理。Whale通过一系列优化,为M6模型的训练节约资源,提升效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"显存优化方面,Whale的自动Gradient Checkpointing、Group-wise Apply、CPU Offload技术和通信池化等技术均有效节约显存的使用,而在计算和通信方面,Whale支持了MoE所需的DP+EP的机制,并在EFLOPS集群高速通信能力的基础上,采用分组融合通信、半精度通信、拓扑感知的All2All通信算子等技术来提高通信效率,以及结合混合精度、编译优化等技术提高训练效率等。同时,EFLOPS团队联合PAI团队对attention进行优化,将访存密集型算子融合成一个cuda kernel实现,将multihead attention性能提升30%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而在十万亿M6模型的训练上,团队首先解决有限资源(512 GPU)“放下”10万亿参数的极限规模模型,而模型结构则采用此前万亿参数M6-T使用的结合expert prototyping的MoE模型。团队在分布式框架Whale中利用CPU offload的方法成功将十万亿参数的M6-10T模型在512张GPU的机器中放下并实现训练。相比其他的CPU offload方案,M6的CPU offload粒度可控,可以灵活地选择offload的模型层,可以不用将所有的权重offload到CPU memory中,而选择保留部分权重在GPU memory上进行计算,这样的做法可以进一步地提高GPU利用率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d7\/04\/d7697731f726571cd6c1eb374e369604.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解决了放入模型的问题后,团队针对训练效率的问题设计了Pseudo-to-Real(共享解除)机制,其核心思想为利用训练好的小模型初始化大模型。该算法首先利用参数共享的机制构建并快速训练小模型,此阶段无需使用CPU内存存放模型同时可以使用更大的批次。配合上专家拆分和合并的机制,算法团队只需要使用256张GPU即可快速训练一个Pseudo Giant。随后,训练好的模型层的参数用于为Real Giant的每一层提供初始化,大模型即可在训练好的小模型的基础上继续优化。尽管大模型的训练速度较慢,但无需经历漫长的收敛过程,只需从一个低点开始优化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"团队也通过实验证明该方案在收敛和下游迁移的有效性,同时在十万亿参数规模的M6-10T模型上做出成功实践,仅用10天左右的时间即得到非常突出的收敛效果。样本维度上收敛效果显著优于此前千亿参数M6和万亿参数模型M6-T。如上图所示,在经过了10M样本的训练后,同等实验设置下M6-10T的log 
Experiments confirmed the effectiveness of this approach both for convergence and for downstream transfer, and it was applied successfully to the 10-trillion-parameter M6-10T, which reached a very strong level of convergence in only about 10 days. Measured against the number of samples processed, convergence is significantly better than for the earlier 100-billion-parameter M6 and the trillion-parameter M6-T. As the figure above shows, after training on 10M samples under the same experimental settings, the log PPL of M6-10T is markedly lower than that of M6-MoE and M6-T, by 34.7% and 10.1% respectively.

In the experiments, compared with training the 10-trillion-parameter model directly without the Pseudo-to-Real mechanism, Pseudo-to-Real reached the same pre-training loss in only 6% of the time. Compared with the trillion-parameter M6, the Pseudo-to-Real 10-trillion-parameter model needed only about 40% of the samples to reach the same pre-training loss, clearly demonstrating the mechanism's advantage for training ultra-large models.

Zhou Jingren, head of the DAMO Academy Intelligent Computing Lab, said: "Next, we will dig deeper into the cognitive mechanisms of the brain and work to bring M6's cognitive abilities close to the human level, for example by emulating how humans extract and understand knowledge across modalities to build a general underlying framework for AI algorithms. At the same time, we will keep strengthening M6's creativity in different scenarios to deliver outstanding application value."