What Is the Open-Source Training Acceleration Framework That Breaks Through the Parallelism Bottlenecks of PyTorch and TensorFlow?

With the failure of Moore's law, a single compute unit has long been unable to keep pace with the exponential growth of data. For example, more than ten million new videos are uploaded to Kuaishou every day; even training a simple classification model (such as ResNet) on one day's worth of new videos would take over a hundred days on a single machine with a single GPU. In an industry where data grows explosively, multi-machine multi-GPU parallel training has therefore become a necessity of the big-data era. And as deep learning models grow ever more capable, the communication cost and compute required by distributed training jobs are rising sharply as well.

[Video: replay of the September 22 大咖说 live interview]

However, because multi-machine multi-GPU parallelism adds extra communication cost, the speedup it delivers is often disappointing, leading to a situation where large companies "pile on resources" while everyone else can only look on. Google's Downpour framework [1], for instance, used 80 GPUs to train ImageNet but achieved a speedup of only 12 — a scaling efficiency of 12/80 = 15%. Improving communication efficiency in multi-machine multi-GPU training has thus become one of the core problems of parallel training, and of coping with explosive data growth in general.

Project GitHub repository: https://github.com/BaguaSys/bagua

Existing open-source deep learning frameworks (PyTorch, TensorFlow) optimize mainly at the system level, extending existing single-machine single-GPU algorithms to the multi-machine multi-GPU setting. Although system-level optimization keeps raising parallel efficiency, its diminishing marginal returns are becoming more and more apparent. To address this, Kuaishou and ETH Zürich jointly developed a training acceleration framework named "Bagua" (八卦).

Recently, Lian Xiangru, Senior Staff Research Scientist at Kuaishou, appeared on InfoQ's 大咖说 (https://www.infoq.cn/video/Mv2PX7Xhb1ENJj4DIYyQ?utm_source=home_video&utm_medium=article) to share the core technical ideas behind Bagua (https://www.infoq.cn/article/BQwk3Vdvm3Tlcz7BLCrq).

InfoQ: Could you start by briefly introducing the background behind Kuaishou's development of this framework?
Lian Xiangru: Bagua (八卦) is the training acceleration framework Kuaishou recently open-sourced, and it grew directly out of real business needs inside the company. With Moore's law failing, a single compute unit falls far short of exponentially growing data: over ten million new videos are uploaded to Kuaishou every day, and training even a simple classification model such as ResNet on that data would take more than a hundred days on a single machine with a single GPU — far too slow for business iteration. Replacing CPUs with higher-throughput hardware such as GPUs for training is already industry consensus, but a single GPU is still nowhere near enough for large-scale data, so multi-machine multi-GPU parallel training is the inevitable trend.

However, multi-machine multi-GPU parallelism requires coordination and communication between GPUs, which adds extra cost, and the overall speedup is often underwhelming. Large companies can brute-force the problem by piling on resources; smaller ones can only look on. Google's Downpour framework, for example, used 80 GPUs to train ImageNet yet achieved a speedup of only 12 — a scaling efficiency of 12/80 = 15% — wasting a great deal of hardware. Improving the efficiency of multi-machine multi-GPU training has therefore become one of the important topics of the data-explosion era.

In recent years some workable solutions have appeared, such as PyTorch-DDP and Horovod. These frameworks generally optimize only at the system level — for example, wrapping NVIDIA's NCCL GPU communication library and adding simple optimizations such as overlapping communication with computation. Such implementations are relatively simple to build and use, but they leave considerable performance on the table. Kuaishou built Bagua around its internal demands on training speed, with general-purpose training acceleration as the goal: it not only accelerates multi-machine multi-GPU communication, exceeding the efficiency of existing distributed training solutions, but also improves single-GPU performance, data-pipeline throughput, and model-parallel efficiency — one-stop training acceleration with high returns.
InfoQ: Did you decide to open-source it from the very start of development?

Lian Xiangru: At the start we focused on Kuaishou's internal pain points — we mainly wanted to solve our own problems. Later we realized these problems are extremely common: whether you are an internet company like Kuaishou or a small startup, the demand for training acceleration is very high, yet good solutions are relatively scarce in the community, and many of the newest training algorithms have never had solid industrial implementations. We implemented those algorithms in Bagua so that companies can see the effect in real applications. On one hand this helps industry adopt the latest academic algorithms more conveniently; on the other, it helps companies and research institutions that need training to train their models better.

InfoQ: Since open-sourcing in July, have you received feedback? What do deployment and usage costs look like?
Lian Xiangru: We have received a lot of feedback since open-sourcing, some from inside the company and some from the open-source community. The question I hear most often is whether the many algorithms and optimizations implemented inside Bagua create too many system dependencies. We fully understand the concern — excessive dependencies hurt usability — so we have put real effort into this: we provide container images with Bagua pre-installed that users can run directly, and we will gradually offer related services on the major public clouds to make it easy to run.

We also provide a one-click installation script that runs on today's mainstream Linux distributions. Of course, this is far from enough; usability is vital for an open-source project, and we will keep improving here. Suggestions from the community are very welcome.

InfoQ: Where are the main bottlenecks of existing distributed training frameworks, and without a new framework, how are these problems solved today?
Lian Xiangru: That is a good question. Different businesses and machine learning scenarios differ, but the most common bottleneck is the multi-machine multi-GPU communication mentioned above: after each GPU finishes its computation, it must wait for communication to complete before starting the next compute cycle, leaving compute resources underutilized. Faced with that, you either abandon the original model in favor of a structure with less communication, or you throw more resources at the problem.

InfoQ: Are frameworks of this kind suitable for companies of different sizes and data volumes?

Lian Xiangru: Absolutely. The appetite for speed and performance is endless, regardless of company size. Deep learning training usually carries a large cost: whatever the model's size, any model of practical use consumes GPUs and other resources, so cutting that cost gives a clear boost to the overall return on investment.
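The waiting cost just described can be made concrete with a toy step-time model. This is an illustration of the general reasoning only, not Bagua's actual scheduler; the function names and timing numbers are hypothetical:

```python
# Toy cost model of a synchronous data-parallel step. Illustrative only:
# it sketches why idling on communication wastes GPU time, and what
# overlapping communication with computation buys in the ideal case.

def step_time_serial(t_compute, t_comm):
    """Each GPU computes, then idles until gradient synchronization finishes."""
    return t_compute + t_comm

def step_time_overlapped(t_compute, t_comm):
    """Gradient buckets are exchanged while later layers still compute;
    ideally only the longer of the two phases sits on the critical path."""
    return max(t_compute, t_comm)

def scaling_efficiency(speedup, n_gpus):
    """E.g. Downpour's 80-GPU run with a speedup of 12: 12/80 = 15%."""
    return speedup / n_gpus

if __name__ == "__main__":
    t_compute, t_comm = 80.0, 60.0  # hypothetical per-step milliseconds
    print(step_time_serial(t_compute, t_comm))      # 140.0
    print(step_time_overlapped(t_compute, t_comm))  # 80.0
    print(scaling_efficiency(12, 80))               # 0.15
```

With perfect overlap, communication disappears from the critical path whenever it is shorter than the compute phase, which is why communication-computation overlap is the standard first-line optimization in DDP-style frameworks.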
InfoQ: Could you say more about the relationship between training complexity, model quality, and model scale?

Lian Xiangru: This one is fairly easy to answer. The basic consensus in deep learning is that the more data you have, the better the final model — but too much data makes training take too long, so the two must be balanced within an acceptable range.

Beyond that, more complex model structures, incorporating more of the latest research, also train to better final quality; for the same amount of data, such models usually take longer to train and consume more compute, which puts greater demands on system efficiency.

InfoQ: According to earlier public information, the name "Bagua" (八卦, "gossip") echoes the communication algorithms implemented in it. Could you talk about the optimizations Bagua makes on the communication-algorithm side?
Lian Xiangru: The name has an amusing story behind it. At first this was just a framework used inside Kuaishou, without a formal name; only when we decided to open-source it did we sit down to name it properly. The latest algorithms implemented inside Bagua — decentralized algorithms, asynchronous algorithms, gradient compression — behave a lot like gossip. Why do I say that?

Going from single-machine single-GPU training to multi-machine multi-GPU, every card accumulates its computed results and propagates them. The process is like each person passing on what they know and picking up information from others, until the whole group is globally in sync. If you liken information synchronization between GPUs to information synchronization between people, everyday experience tells us that gossip — word of mouth — is the most efficient channel, because it spreads extraordinarily fast. That maps one-to-one onto the algorithms in the Bagua framework.

Take decentralized communication as an example. Gossip generally does not spread through a central hub; people mostly exchange information with those they know, which is a decentralized pattern. Correspondingly, Bagua supports decentralized distributed training algorithms in which each GPU communicates only with its neighboring GPUs. Compared with centralized communication, this pattern sharply reduces communication cost while achieving the same convergence efficiency.
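The decentralized pattern just described can be sketched in a few lines. Below is a minimal pure-Python simulation of gossip-style model averaging on a ring topology, in the spirit of the decentralized algorithms mentioned; the topology, weights, and function names are illustrative assumptions, not Bagua's API:

```python
# Minimal sketch of decentralized (gossip-style) model averaging on a ring.
# Each worker averages only with its two neighbours, instead of a global
# (centralized) average over all workers — yet repeated steps still drive
# every replica toward the global mean, with no central coordinator.

def ring_gossip_step(params):
    """One gossip round: worker i averages with workers i-1 and i+1 on a ring."""
    n = len(params)
    new_params = []
    for i in range(n):
        left, right = params[(i - 1) % n], params[(i + 1) % n]
        new_params.append([
            (l + c + r) / 3.0
            for l, c, r in zip(left, params[i], right)
        ])
    return new_params

if __name__ == "__main__":
    # Four workers holding divergent 1-D "models"; their global mean is 4.0.
    workers = [[0.0], [4.0], [8.0], [4.0]]
    for _ in range(20):
        workers = ring_gossip_step(workers)
    print(workers)  # every worker's value approaches the global mean 4.0
```

Because the mixing weights are doubly stochastic, the global average is preserved at every step while the per-worker deviations shrink geometrically — the formal reason decentralized SGD can match centralized convergence at lower communication cost.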
Next, asynchronous communication. Gossip usually spreads asynchronously — not everyone talks at the same time. In Bagua this corresponds to asynchronous communication between GPUs: computation and communication are not required to be synchronized, and a GPU does not wait until every other GPU has something to exchange. As soon as its computation finishes it can communicate with the other GPUs, and once communication finishes it resumes computing. This removes the overhead of synchronizing all GPUs.

Finally, information compression. Gossip tends to boil down to a few trending keywords that convey the most important information concisely, without showing much detail — nobody would remember it all anyway. Total traffic is small, yet communication is highly efficient. Correspondingly, Bagua supports compressing the gradients computed on the GPUs, keeping only the most important part and drastically reducing communication overhead.

InfoQ: What has Bagua mainly done to achieve extreme performance, and where were the technical difficulties?

Lian Xiangru: Beyond adding the many recent communication algorithms described above, we also heavily optimized the system itself. Those system optimizations let Bagua achieve solid performance gains even with the communication algorithms switched off, because many of them are purely at the implementation level, down to the underlying communication protocol. The GPUs used for deep learning training today are mostly made by NVIDIA, whose GPUs come with the low-level NCCL communication library; we optimized communication performance on top of NCCL itself and obtained good gains.

Initially we only optimized multi-machine distributed communication — data loading, for example, was out of scope. But while landing Bagua in Kuaishou's businesses, we found that in many scenarios the bottleneck was not multi-machine communication at all but data loading: data could not be read in fast enough, or heavy preprocessing after reading consumed most of the training time. So Bagua optimized this part too. It can take over data loading automatically and cache preprocessing results to speed up subsequent reads and preprocessing, with striking effect in many internal scenarios. This also inspired us to turn Bagua into a one-stop training acceleration tool that covers every stage of the training pipeline.

For instance, we also perform gradient fusion for model updates: typical frameworks update model parameters one group at a time, whereas Bagua combines many groups of parameters and updates them together, greatly improving overall efficiency.

On the technical side — take the communication algorithms we have explored since the project began — the difficulty is that academia has long studied communication-algorithm optimization, but mostly at the theoretical level, and industry has never had good implementations. We had to identify which of these algorithms are genuinely usable and practical, implement them at high performance against the characteristics of the hardware, and truly unlock the algorithms' potential. That requires a deep understanding of the new communication algorithms combined with rich experience in systems implementation and industrial deployment. In the low-level communication implementation, for example, we invested considerable effort; on NVIDIA GPUs, our optimized communication outperforms NVIDIA's own NCCL library.

One of the optimizations starts from a property of common distributed training algorithms: every GPU sends and receives the same amount of data, and the operation completes only when all GPUs finish communicating. If GPUs send and receive at different speeds, the "wooden bucket" effect kicks in — overall completion time is set by the slowest participant. We therefore implemented a relatively balanced operation that controls each GPU's send and receive rates to keep them roughly in step, which raises overall communication efficiency. This, too, requires a thorough understanding of the algorithm itself.

InfoQ: Does the preprocessing cache use SSDs, or GPU memory?

Lian Xiangru: The open-source version currently supports a distributed key-value store held mainly in host memory — each machine stores part of the cache. We are also planning to open-source a composite caching scheme that adds local SSDs. It addresses the case where the dataset is so large that the combined memory of all machines cannot hold it: part is cached in memory and the rest on SSD, which supports larger datasets.

InfoQ: For the memory-plus-SSD acceleration scheme, are these ordinary SSDs, something like Intel Optane SSDs, or are there other considerations?
Lian Xiangru: Intel Optane SSDs behave just like memory in use, and we naturally support them — at the hardware level they can simply be treated as memory, so no special support from Bagua is needed. Ordinary SSDs, by contrast, do require dedicated support in Bagua, which we will open-source soon. Of course we will not rely on SSDs alone; we combine them with memory. Outside a few very large-scale scenarios, the aggregate memory of multiple machines is usually sufficient for distributed training.

InfoQ: Could you talk concretely about the similarities and differences in implementation approach between Bagua and PyTorch-DDP, Horovod, and BytePS?

Lian Xiangru: Relative to currently popular frameworks such as PyTorch-DDP, Horovod, and BytePS, we first cover their main existing functionality (synchronized distributed training), and on that basis use joint algorithm-and-system optimization to push one step further toward one-stop training speedup — not limited to distributed training.
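To make the gradient-compression idea from a few answers back concrete, here is a hedged sketch of one common scheme, top-k sparsification: keep only the k largest-magnitude gradient entries as (index, value) pairs and transmit those instead of the full dense gradient. This illustrates the general principle only; Bagua's built-in compressed algorithms differ in their details:

```python
# Sketch of top-k gradient compression: a dense gradient is reduced to the
# k entries with the largest magnitude, cutting the bytes on the wire; the
# receiver zero-fills the dropped entries when reconstructing.

def compress_topk(grad, k):
    """Return the k (index, value) pairs with the largest |value|."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return [(i, grad[i]) for i in sorted(order[:k])]

def decompress(pairs, length):
    """Rebuild a dense gradient, zero-filling the dropped entries."""
    dense = [0.0] * length
    for i, v in pairs:
        dense[i] = v
    return dense

if __name__ == "__main__":
    grad = [0.1, -2.0, 0.05, 3.0, -0.2, 0.0]
    pairs = compress_topk(grad, k=2)        # [(1, -2.0), (3, 3.0)]
    print(decompress(pairs, len(grad)))     # [0.0, -2.0, 0.0, 3.0, 0.0, 0.0]
```

Practical systems usually pair this with error feedback (accumulating the dropped residual into the next step's gradient) so that the sparsification does not bias convergence.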
InfoQ: Could you give a concrete example of Bagua in a business scenario, including deployment barriers and before-and-after comparisons?

Lian Xiangru: Inside Kuaishou we began landing it in business scenarios at the beginning of last year, and by the end of this year we should reach a much larger scale of deployment. When bringing Bagua into an industrial scenario, we first analyze the task itself to understand whether the current bottleneck is data-loading I/O or distributed training; different tasks call for different optimizations, with completely different results. Since open-sourcing we have also accumulated a good deal of experience, much of it written up internally as practical guides; for a new scenario we usually recommend reading the documentation first and toggling individual optimizations on and off to find the training setup that fits best.

Kuaishou has now tried Bagua to varying degrees in natural language processing, image recognition, recommender systems, and other scenarios. In NLP, on a model around the size of GPT-2, we saw roughly a 65% efficiency gain; image recognition gained about 20-30%; large-scale speech recognition 20-30%; and recommender systems over 100% — their communication is especially frequent, which makes it easier for Bagua's strengths to show. All of these are data-parallel workloads where the entire model fits on one GPU; next we will target models that exceed what a single GPU can hold.
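The fused parameter update mentioned earlier — combining many small tensors into one operation instead of launching one operation per tensor — can be sketched as a flatten/operate/unflatten round trip. This is a pure-Python illustration with made-up data; real implementations fuse contiguous GPU buffers:

```python
# Sketch of tensor fusion: many small gradient tensors are packed into one
# contiguous buffer so that a single communication/update call replaces many
# small ones, amortizing per-call overhead.

def flatten(tensors):
    """Concatenate tensors into one buffer, remembering their lengths."""
    buf, shapes = [], []
    for t in tensors:
        shapes.append(len(t))
        buf.extend(t)
    return buf, shapes

def unflatten(buf, shapes):
    """Split the fused buffer back into per-tensor lists."""
    out, pos = [], 0
    for n in shapes:
        out.append(buf[pos:pos + n])
        pos += n
    return out

if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
    fused, shapes = flatten(grads)
    fused = [g * 0.5 for g in fused]   # one fused operation, e.g. averaging
    print(unflatten(fused, shapes))    # [[0.5, 1.0], [1.5], [2.0, 2.5, 3.0]]
```

The payoff is largest for models with many small parameter tensors, where per-operation launch and communication latency would otherwise dominate.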
InfoQ: In Kuaishou's cases, roughly how many cards does the "multi-GPU" in multi-machine multi-GPU involve?

Lian Xiangru: Typically on the order of a hundred cards; we will validate further on more scenarios and larger scales.

InfoQ: Do the acceleration algorithms affect model accuracy? How are the trade-offs made in real business?

Lian Xiangru: As tool builders, we dread shipping something that users then discover they must tune heavily themselves — that makes a project very hard to drive forward. We are experimenting on both fronts: users with a high tolerance for tuning can freely adjust Bagua's various options to obtain a higher speedup, while users who want the performance benefit without spending time on options can use Bagua's default configuration, which is numerically identical to the traditional training scheme — it spares them the trouble of tuning while still delivering a measurable speedup.

InfoQ: Does Kuaishou currently virtualize GPU resources through an AI platform?

Lian Xiangru: Yes. This involves GPU scheduling and many related problems; internally we schedule the different resources on top of a Kubernetes platform.

InfoQ: During a single training run, is each card responsible for its own model?
Lian Xiangru: That is not entirely accurate. Data parallelism means that every card holds a copy of the same model and computes a gradient update for that model, and the cards communicate with one another — that is what Bagua currently does. If we later parallelize very large models, each card may hold either the entire model or only a part of it.

InfoQ: Does Bagua support data parallelism or model parallelism?

Lian Xiangru: The open-source part currently supports data parallelism; we will add model-parallel support in the near future.
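The data-parallel pattern described in this answer can be sketched as a single-process simulation: every "worker" holds the same weight, computes a gradient on its own data shard, the gradients are averaged (the role an allreduce plays), and all replicas apply the identical update. The model, data, and learning rate below are made up for illustration:

```python
# Single-process sketch of one data-parallel training step: identical model
# replicas, per-worker gradients on disjoint data shards, gradient averaging
# in place of an allreduce, then the same update applied everywhere.

def local_gradient(weight, shard):
    """Gradient of mean squared error 0.5*(weight*x - y)^2 over one shard."""
    return sum((weight * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(weight, shards, lr=0.1):
    grads = [local_gradient(weight, s) for s in shards]  # per-worker compute
    avg = sum(grads) / len(grads)                        # "allreduce" (average)
    return weight - lr * avg                             # identical update on all replicas

if __name__ == "__main__":
    # Two workers, each holding a shard of points on the line y = 2x.
    shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
    w = 0.0
    for _ in range(100):
        w = data_parallel_step(w, shards)
    print(round(w, 3))  # converges toward 2.0
```

Because every replica sees the same averaged gradient, the replicas stay bit-identical after each step — the property that makes synchronous data parallelism numerically equivalent to large-batch single-GPU training.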
InfoQ: Why did you choose Rust for the implementation? What advantages does it offer?

Lian Xiangru: Rust has developed very quickly in recent years, with strong momentum. A framework like Bagua has extremely high performance requirements, so the choice of programming languages is narrow: apart from the two veterans, C and C++, Rust is about all that is left. Our team likes to try new approaches, and the results show that Rust works very well indeed.

Rust has a very mature asynchronous mechanism — for example, multiplexing heavily across threads, with one thread driving many asynchronous communications and flexibly controlling the data being sent and received. Moreover, the framework's design involves a great deal of cooperative concurrency, where memory bugs such as leaks are easy to introduce; that situation is common in C++, but Rust effectively prevents these errors, which is why we ultimately chose it. After all, if you cannot guarantee safety, speed is useless.
InfoQ: At the open-source framework level, developers now have plenty of choices. What advice do you have on framework selection?

Lian Xiangru: Developers do indeed have many options among open-source deep learning frameworks — for example, defining models directly in PyTorch and using PyTorch's native distributed training. At this point, PyTorch's user API is becoming the de facto standard: TensorFlow 2.0 and domestic frameworks such as PaddlePaddle, OneFlow, and MegEngine are all gradually converging on PyTorch-style usage, mainly for reasons of usability. Each framework tends to hold unique advantages in some niche, so if time and energy permit, developers should give each framework a trial and get to know it, in order to find the choice best suited to their own deployment scenario.

Inspired by this, we as a training acceleration tool also keep optimizing for usability — for example, letting users apply Bagua to an existing PyTorch training script with as few changes as possible, so they enjoy the dividend of training acceleration at very low cost.
In addition, we will add new features, optimizing for large models and for non-distributed scenarios, and we hope our friends in the open-source community will give us more feedback, so that we know where to improve and make Bagua more convenient for everyone.

InfoQ: Beyond the framework itself, where will the team put its energy in the future?

Lian Xiangru: First, the optimizations to the Bagua framework described above remain the whole team's priority. Along the way we have also accumulated a great deal of experience in optimizing GPU communication and in putting those optimizations to efficient use, and many scenarios need this. Reinforcement learning systems and recommender systems, for example, inevitably run into communication between components once they reach large scale; that is where many systems bottleneck, and it is the biggest pain point of parallel and distributed systems. We hope the optimization experience we have accumulated can carry over across application scenarios.

As for the deep learning field as a whole, a major direction right now is the growth of domestic Chinese deep learning frameworks, which are working hard to build their own ecosystems and communities; some of their features match the established frameworks, and on many points they go beyond them, gradually attracting their own user bases. At the hardware level, NVIDIA GPUs are still the most widely used domestically, and some AI accelerator chip companies are attempting innovations of their own, which is well worth doing.

Furthermore, as deep learning matures, industrial applications keep multiplying, spawning a great deal of related work — deploying, training, and serving deep learning models in industrial settings, resource management, and so on. Academia has contributed many good ideas and directions that light the way for industrial adoption, but in the end, not that much research actually settles into industry and delivers real value; industry plays the role of the proving ground here, which is also very important.

The project is still in the early days of open source; everyone is welcome to join the Bagua technical discussion group to learn and exchange ideas.

[Image: Bagua technical discussion group]

Project GitHub repository: https://github.com/BaguaSys/bagua