Facebook推出数据并行训练算法FSDP:采用更少的GPU,更高效地训练更大数量级的模型

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大规模训练AI模型并非易事。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了需要大量算力和资源外,训练非常大的模型背后也有着相当大的工程复杂性。在Facebook AI Research(FAIR)Engineering,我们一直在努力构建各种工具和基础设施,让大型AI模型训练起来更加轻松。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们最近的一部分成果包括了"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq\/blob\/master\/examples\/megatron_11b\/README.md","title":"","type":null},"content":[{"type":"text","text":"层内模型并行"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/fairscale.readthedocs.io\/en\/latest\/deep_dive\/pipeline_parallelism.html","title":"","type":null},"content":[{"type":"text","text":"流水线模型并行"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale#optimizer-state-sharding-zero","title":"","type":null},"content":[{"type":"text","text":"优化器状态+梯度分片"}]},{"type":"text","text":"和"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/moe\/moe_layer.py","title":"","type":null},"content":[{"type":"text","text":"多专家模型"}]},{"type":"text","text":"等领域的工作,旨在提升为任意数量的任务训练高级AI模型的效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"完全分片数据并行(Fully Sharded Data Parallel,FSDP)是我们推出的最新工具。它将AI模型的参数分片到数据并行worker中,并且可以选择将部分训练计算卸载到CPU。顾名思义,FSDP是一种数据并行训练算法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"尽管参数被分片到了不同的GPU,但每个微批次数据的计算仍然是每个GPU worker上的本地计算。这种简洁的概念让FSDP更容易理解,适用于更广泛的使用场景(与层内并行和流水线并行相比)。与优化器状态+梯度分片数据并行方法相比,FSDP分片参数更均匀,并且能够通过训练期间的通信和计算重叠来获得更好的性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了FSDP后,我们现在可以使用更少的GPU更高效地训练更大数量级的模型。FSDP已在"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale","title":"","type":null},"content":[{"type":"text","text":"FairScale库"}]},{"type":"text","text":"中实现,允许工程师和开发人员使用简单的API扩展和优化他们的模型训练。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Facebook,FSDP已被集成和测试,用于训练我们的一些"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq","title":"","type":null},"content":[{"type":"text","text":"NLP"}]},{"type":"text","text":"和"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/vissl","title":"","type":null},"content":[{"type":"text","text":"视觉"}]},{"type":"text","text":"模型。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"大规模训练的高计算成本"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NLP研究是一个特殊领域,其中我们可以看到有效利用算力来训练AI的重要性。去年,OpenAI宣布他们训练出了GPT-3,这是有史以来最大的神经语言模型,有1750亿个参数。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"据估计,训练GPT-3需要大约355个GPU年,或者相当于1,000个GPU连续工作四个多月。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了需要大量计算和工程资源外,大多数扩展到如此程度的方法都会带来额外的通信成本,并要求工程师仔细评估内存使用和计算效率之间的权衡。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,典型的数据并行训练需要在每个GPU上都维护模型的冗余副本,而模型并行训练需要在worker(GPU)之间移动激活,从而引入额外的通信成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比之下,FSDP牺牲的东西相对较少。它通过跨GPU分片模型参数、梯度和优化器状态来提高内存效率,并通过分解通信并将其与前向和后向传递重叠来提高计算效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FSDP产生的训练结果与标准分布式数据并行(DDP)训练相同,并且可在一个易用的界面中使用,该界面可直接替代PyTorch的DistributedDataParallel模块。我们的早期测试表明,FSDP可以扩展到数万亿个参数。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"FSDP的工作原理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在标准的DDP训练中,每个worker处理一个单独的批次,并且使用all-reduce操作对各个worker的梯度求和。虽然DDP已经变得非常流行,但它需要的GPU内存过多了,因为模型权重和优化器状态需要在所有DDP worker之间复制。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"减少重复的一种方法是应用一个称为全参数分片的过程,其中只有局部计算所需的模型参数、梯度和优化器的一个子集是可用的。这种方法的一个实现,ZeRO-3,已被微软推而广之。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解锁全参数分片的关键在于,我们可以将DDP中的all-reduce操作分解为单独的reduce-scatter和all-gather操作:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/e8\/b6\/e8a741723a2d1a937c7c7ef9fb83c2b6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"All-reduce是reduce-scatter和all-gather的组合。聚合梯度的标准all-reduce操作可以分解为两个独立的阶段:reduce-scatter和all-gather。在reduce-scatter阶段,梯度在每个GPU上的各个rank之间的相等块中基于它们的rank索引相加。在all-gather阶段,每个GPU上可用的聚合梯度的分片部分可供所有GPU使用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然后,我们可以重新排列reduce-scatter和all-gather,让每个DDP worker只需存储参数和优化器状态的单个分片。下图分别展示了标准DDP训练(上)和FSDP训练(下):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/3a\/cd\/3a16e6bedbf0e541944c5c61f40b96cd.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"标准数据并行训练和完全分片数据并行训练的对比。在标准数据并行训练方法中,模型的副本存在于每个GPU上,并且仅在数据的一个分片上评估一系列前向和后向传递。在这些本地计算之后,每个本地进程的参数和优化器与其他GPU共享,以计算全局权重更新。在FSDP中,GPU上仅存在模型的一个分片。然后在本地,所有权重都通过all-gather步骤从其他GPU收集,来计算前向传递。然后在向后传递之前再次执行这种权重收集。在向后传递之后,局部梯度被平均,并通过reduce-scatter步骤在GPU上分片,这样每个GPU都能更新其局部权重分片。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了最大限度地提高内存效率,我们可以在每层前向传递后丢弃全部权重,从而为后续层节省内存。要实现这一点,可以将FSDP包装器应用于网络中的每一层(使用reshard_after_forward=True)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在伪代码中:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"FSDP forward pass:\n for layer_i in layers:\n all-gather full weights for layer_i\n forward pass for layer_i\n discard full weights for layer_i\nFSDP backward pass:\n for layer_i in layers:\n all-gather full weights for layer_i\n backward pass for layer_i\n discard full weights for layer_i\n reduce-scatter gradients for layer_\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"如何使用FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在大规模AI研究中使用FSDP有几种方法。当下我们提供四种解决方案,以适应不同的需求。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"一、在语言模型中使用FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对于语言模型,"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq","title":"","type":null},"content":[{"type":"text","text":"fairseq框架"}]},{"type":"text","text":"通过以下新参数支持FSDP:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"–ddp-backend=fully_sharded:通过FSDP启用完整分片"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"–cpu-offload:将优化器状态和FP32模型副本卸载到CPU(搭配–optimizer=cpu_adam)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"–no-reshard-after-forward:提高大型模型的训练速度(1B+参数),类似于ZeRO阶段2"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其他流行的选项(–fp16、–update-freq、–checkpoint-activations、–offload-activations等)可以继续正常使用"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"关于使用FSDP在八个GPU上,或在单个GPU上使用FSDP+CPU卸载训练13B参数模型的说明,请参阅fairseq"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq\/tree\/master\/examples\/fully_sharded_data_parallel","title":"","type":null},"content":[{"type":"text","text":"教程"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"二、在计算机视觉模型中使用FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对于计算机视觉模型,"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/vissl","title":"","type":null},"content":[{"type":"text","text":"VISSL"}]},{"type":"text","text":"已支持FSDP,并在RegNets架构上做了测试。像BatchNorm和ReLU这样的层被无缝处理,并测试了收敛。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用以下选项启用FSDP:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"config.MODEL.FSDP_CONFIG.AUTO_SETUP_FSDP=True"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"config.MODEL.AMP_PARAMS.AMP_TYPE=pytorch"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"关于在VISSL中配置FSDP的其他选项,请参阅yaml配置的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/vissl\/blob\/40441123a6f7098500676ca8800025c1f02e28b3\/vissl\/config\/defaults.yaml#L498-L513","title":"","type":null},"content":[{"type":"text","text":"这一部分"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"三、使用来自PyTorch Lightning的FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为了更轻松地与更一般的用例集成,PyTorch Lightning已经以测试特性的形式支持了FSDP。这份"},{"type":"link","attrs":{"href":"https:\/\/pytorch-lightning.readthedocs.io\/en\/latest\/advanced\/advanced_gpu.html#fully-sharded-training","title":"","type":null},"content":[{"type":"text","text":"教程"}]},{"type":"text","text":"包含了如何将FSDP插件与PyTorch Lightning搭配使用的详细示例。在高层次上,在下面添加plugins='fsdp'可以激活它。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"model = MyModel()\ntrainer = Trainer(gpus=4, plugins='fsdp', precision=16)\ntrainer.fit(model)\ntrainer.test()\ntrainer.predict()\n"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"四、直接从FairScale使用FSDP库"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/fairscale.readthedocs.io\/en\/latest\/deep_dive\/oss_sdp_fsdp.html","title":"","type":null},"content":[{"type":"text","text":"FairScale"}]},{"type":"text","text":"是开发FSDP,以及你可以找到最新更新的主库。你可以在下面的示例中简单地替换DDP(my_module),直接使用FairScale的FSDP:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP\n...\nsharded_module = DDP(my_module)FSDP(my_module)\noptim = torch.optim.Adam(sharded_module.parameters(), lr=0.0001)\nfor sample, label in dataload.next_batch:\n out = sharded_module(x=sample, y=3, z=torch.Tensor([1]))\n loss = criterion(out, label)\n loss.backward()\n optim.step()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FairScale中的FSDP库为大规模训练的许多重要方面提供了底层选项。当你充分利用FSDP时,需要考虑以下几个重要方面。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"模型包装"},{"type":"text","text":":为了最小化瞬时GPU内存需求,用户需要以嵌套方式包装模型。这引入了额外的复杂性。"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/wrap\/auto_wrap.py","title":"","type":null},"content":[{"type":"text","text":"auto_wrap"}]},{"type":"text","text":"实用程序可用于注释现有PyTorch模型代码,用于嵌套包装目的。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"模型初始化"},{"type":"text","text":":与DDP不同,FSDP不会在GPU worker之间自动同步模型权重。这意味着我们必须小心地完成模型初始化,以便所有GPU worker具有相同的初始权重。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"优化器设置"},{"type":"text","text":":由于分片和包装,FSDP仅支持某些类型的优化器和优化器设置。特别是,如果一个模块被FSDP包装,并且它的参数被扁平化为单个张量,那么用户就不能在这样的模块中为不同的参数组使用不同的超参数。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"混合精度"},{"type":"text","text":":FSDP支持具有FP16主权重的高级混合精度训练,以及梯度上的FP16减少和分散。有些情况下,只有在使用全精度时,模型的某些部分才能收敛。在这些情况下,需要额外的包装来选择以全精度运行模型的某些部分。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"状态检查点和推理"},{"type":"text","text":":当模型规模很大时,保存和加载模型状态会变得颇具挑战性。FSDP支持多种方式来完成该任务,但这绝非易事。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"最后,FSDP经常与FairScale的checkpoint_wrapper等"},{"type":"text","marks":[{"type":"strong"}],"text":"激活检查点"},{"type":"text","text":"函数搭配使用。用户可能需要仔细调整激活检查点策略,以在有限的GPU内存空间内适配大型模型。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"下一步计划"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FSDP是开源的,早期用户已经尝试过它并为之做出了贡献。我们认为它可以让整个研究社区受益,我们期待与所有人合作,让它变得更好。下面是一些尤为重要的领域。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"让FSDP更通用"},{"type":"text","text":"。到目前为止,FSDP已用于具有SGD和Adam优化器的NLP和视觉模型。随着更新的模型和优化器不断涌现,FSDP需要继续支持它们。作为纯粹的数据并行训练方案,FSDP在支持广泛的AI算法方面拥有极大的潜力。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"使FSDP"},{"type":"text","marks":[{"type":"strong"},{"type":"strong"},{"type":"strong"}],"text":"支持"},{"type":"text","marks":[{"type":"strong"}],"text":"自动调优"},{"type":"text","text":"。今天,用户在使用FSDP时可以调整许多旋钮,以实现扩展和性能提升。我们期待能开发出自动调优GPU内存使用和训练性能的算法。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"除了训练之外,"},{"type":"text","marks":[{"type":"strong"}],"text":"更具扩展性的推理"},{"type":"text","text":"和模型服务是FSDP可能需要支持的一个重要用例。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"最后一点也很重要,重构和继续"},{"type":"text","marks":[{"type":"strong"}],"text":"模块化FSDP"},{"type":"text","text":"及其核心组件对于更新更好的组件来说也是很关键的基础。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"试用并做出贡献!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FSDP目前可直接从FairScale库中获得。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文链接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章