Facebook推出數據並行訓練算法FSDP:採用更少的GPU,更高效地訓練更大數量級的模型

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模訓練AI模型並非易事。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了需要大量算力和資源外,訓練非常大的模型背後也有着相當大的工程複雜性。在Facebook AI Research(FAIR)Engineering,我們一直在努力構建各種工具和基礎設施,讓大型AI模型訓練起來更加輕鬆。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們最近的一部分成果包括了"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq\/blob\/master\/examples\/megatron_11b\/README.md","title":"","type":null},"content":[{"type":"text","text":"層內模型並行"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/fairscale.readthedocs.io\/en\/latest\/deep_dive\/pipeline_parallelism.html","title":"","type":null},"content":[{"type":"text","text":"流水線模型並行"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale#optimizer-state-sharding-zero","title":"","type":null},"content":[{"type":"text","text":"優化器狀態+梯度分片"}]},{"type":"text","text":"和"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/moe\/moe_layer.py","title":"","type":null},"content":[{"type":"text","text":"多專家模型"}]},{"type":"text","text":"等領域的工作,旨在提升爲任意數量的任務訓練高級AI模型的效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"完全分片數據並行(Fully Sharded Data Parallel,FSDP)是我們推出的最新工具。它將AI模型的參數分片到數據並行worker中,並且可以選擇將部分訓練計算卸載到CPU。顧名思義,FSDP是一種數據並行訓練算法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管參數被分片到了不同的GPU,但每個微批次數據的計算仍然是每個GPU worker上的本地計算。這種簡潔的概念讓FSDP更容易理解,適用於更廣泛的使用場景(與層內並行和流水線並行相比)。與優化器狀態+梯度分片數據並行方法相比,FSDP分片參數更均勻,並且能夠通過訓練期間的通信和計算重疊來獲得更好的性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了FSDP後,我們現在可以使用更少的GPU更高效地訓練更大數量級的模型。FSDP已在"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale","title":"","type":null},"content":[{"type":"text","text":"FairScale庫"}]},{"type":"text","text":"中實現,允許工程師和開發人員使用簡單的API擴展和優化他們的模型訓練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Facebook,FSDP已被集成和測試,用於訓練我們的一些"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq","title":"","type":null},"content":[{"type":"text","text":"NLP"}]},{"type":"text","text":"和"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/vissl","title":"","type":null},"content":[{"type":"text","text":"視覺"}]},{"type":"text","text":"模型。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"大規模訓練的高計算成本"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"NLP研究是一個特殊領域,其中我們可以看到有效利用算力來訓練AI的重要性。去年,OpenAI宣佈他們訓練出了GPT-3,這是有史以來最大的神經語言模型,有1750億個參數。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"據估計,訓練GPT-3需要大約355個GPU年,或者相當於1,000個GPU連續工作四個多月。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了需要大量計算和工程資源外,大多數擴展到如此程度的方法都會帶來額外的通信成本,並要求工程師仔細評估內存使用和計算效率之間的權衡。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,典型的數據並行訓練需要在每個GPU上都維護模型的冗餘副本,而模型並行訓練需要在worker(GPU)之間移動激活,從而引入額外的通信成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比之下,FSDP犧牲的東西相對較少。它通過跨GPU分片模型參數、梯度和優化器狀態來提高內存效率,並通過分解通信並將其與前向和後向傳遞重疊來提高計算效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FSDP產生的訓練結果與標準分佈式數據並行(DDP)訓練相同,並且可在一個易用的界面中使用,該界面可直接替代PyTorch的DistributedDataParallel模塊。我們的早期測試表明,FSDP可以擴展到數萬億個參數。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"FSDP的工作原理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在標準的DDP訓練中,每個worker處理一個單獨的批次,並且使用all-reduce操作對各個worker的梯度求和。雖然DDP已經變得非常流行,但它需要的GPU內存過多了,因爲模型權重和優化器狀態需要在所有DDP worker之間複製。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"減少重複的一種方法是應用一個稱爲全參數分片的過程,其中只有局部計算所需的模型參數、梯度和優化器的一個子集是可用的。這種方法的一個實現,ZeRO-3,已被微軟推而廣之。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解鎖全參數分片的關鍵在於,我們可以將DDP中的all-reduce操作分解爲單獨的reduce-scatter和all-gather操作:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/e8\/b6\/e8a741723a2d1a937c7c7ef9fb83c2b6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"All-reduce是reduce-scatter和all-gather的組合。聚合梯度的標準all-reduce操作可以分解爲兩個獨立的階段:reduce-scatter和all-gather。在reduce-scatter階段,梯度在每個GPU上的各個rank之間的相等塊中基於它們的rank索引相加。在all-gather階段,每個GPU上可用的聚合梯度的分片部分可供所有GPU使用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後,我們可以重新排列reduce-scatter和all-gather,讓每個DDP worker只需存儲參數和優化器狀態的單個分片。下圖分別展示了標準DDP訓練(上)和FSDP訓練(下):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/3a\/cd\/3a16e6bedbf0e541944c5c61f40b96cd.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":10}}],"text":"標準數據並行訓練和完全分片數據並行訓練的對比。在標準數據並行訓練方法中,模型的副本存在於每個GPU上,並且僅在數據的一個分片上評估一系列前向和後向傳遞。在這些本地計算之後,每個本地進程的參數和優化器與其他GPU共享,以計算全局權重更新。在FSDP中,GPU上僅存在模型的一個分片。然後在本地,所有權重都通過all-gather步驟從其他GPU收集,來計算前向傳遞。然後在向後傳遞之前再次執行這種權重收集。在向後傳遞之後,局部梯度被平均,並通過reduce-scatter步驟在GPU上分片,這樣每個GPU都能更新其局部權重分片。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了最大限度地提高內存效率,我們可以在每層前向傳遞後丟棄全部權重,從而爲後續層節省內存。要實現這一點,可以將FSDP包裝器應用於網絡中的每一層(使用reshard_after_forward=True)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在僞代碼中:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"FSDP forward pass:\n for layer_i in layers:\n all-gather full weights for layer_i\n forward pass for layer_i\n discard full weights for layer_i\nFSDP backward pass:\n for layer_i in layers:\n all-gather full weights for layer_i\n backward pass for layer_i\n discard full weights for layer_i\n reduce-scatter gradients for layer_\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"如何使用FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在大規模AI研究中使用FSDP有幾種方法。當下我們提供四種解決方案,以適應不同的需求。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"一、在語言模型中使用FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於語言模型,"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq","title":"","type":null},"content":[{"type":"text","text":"fairseq框架"}]},{"type":"text","text":"通過以下新參數支持FSDP:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"–ddp-backend=fully_sharded:通過FSDP啓用完整分片"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"–cpu-offload:將優化器狀態和FP32模型副本卸載到CPU(搭配–optimizer=cpu_adam)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"–no-reshard-after-forward:提高大型模型的訓練速度(1B+參數),類似於ZeRO階段2"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其他流行的選項(–fp16、–update-freq、–checkpoint-activations、–offload-activations等)可以繼續正常使用"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於使用FSDP在八個GPU上,或在單個GPU上使用FSDP+CPU卸載訓練13B參數模型的說明,請參閱fairseq"},{"type":"link","attrs":{"href":"https:\/\/github.com\/pytorch\/fairseq\/tree\/master\/examples\/fully_sharded_data_parallel","title":"","type":null},"content":[{"type":"text","text":"教程"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"二、在計算機視覺模型中使用FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於計算機視覺模型,"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/vissl","title":"","type":null},"content":[{"type":"text","text":"VISSL"}]},{"type":"text","text":"已支持FSDP,並在RegNets架構上做了測試。像BatchNorm和ReLU這樣的層被無縫處理,並測試了收斂。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用以下選項啓用FSDP:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"config.MODEL.FSDP_CONFIG.AUTO_SETUP_FSDP=True"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"config.MODEL.SYNC_BN_CONFIG.SYNC_BN_TYPE=pytorch"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"config.MODEL.AMP_PARAMS.AMP_TYPE=pytorch"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於在VISSL中配置FSDP的其他選項,請參閱yaml配置的"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/vissl\/blob\/40441123a6f7098500676ca8800025c1f02e28b3\/vissl\/config\/defaults.yaml#L498-L513","title":"","type":null},"content":[{"type":"text","text":"這一部分"}]},{"type":"text","text":"。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"三、使用來自PyTorch Lightning的FSDP"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了更輕鬆地與更一般的用例集成,PyTorch Lightning已經以測試特性的形式支持了FSDP。這份"},{"type":"link","attrs":{"href":"https:\/\/pytorch-lightning.readthedocs.io\/en\/latest\/advanced\/advanced_gpu.html#fully-sharded-training","title":"","type":null},"content":[{"type":"text","text":"教程"}]},{"type":"text","text":"包含了如何將FSDP插件與PyTorch Lightning搭配使用的詳細示例。在高層次上,在下面添加plugins='fsdp'可以激活它。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"model = MyModel()\ntrainer = Trainer(gpus=4, plugins='fsdp', precision=16)\ntrainer.fit(model)\ntrainer.test()\ntrainer.predict()\n"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"四、直接從FairScale使用FSDP庫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/fairscale.readthedocs.io\/en\/latest\/deep_dive\/oss_sdp_fsdp.html","title":"","type":null},"content":[{"type":"text","text":"FairScale"}]},{"type":"text","text":"是開發FSDP,以及你可以找到最新更新的主庫。你可以在下面的示例中簡單地替換DDP(my_module),直接使用FairScale的FSDP:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP\n...\nsharded_module = DDP(my_module)FSDP(my_module)\noptim = torch.optim.Adam(sharded_module.parameters(), lr=0.0001)\nfor sample, label in dataload.next_batch:\n out = sharded_module(x=sample, y=3, z=torch.Tensor([1]))\n loss = criterion(out, label)\n loss.backward()\n optim.step()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FairScale中的FSDP庫爲大規模訓練的許多重要方面提供了底層選項。當你充分利用FSDP時,需要考慮以下幾個重要方面。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"模型包裝"},{"type":"text","text":":爲了最小化瞬時GPU內存需求,用戶需要以嵌套方式包裝模型。這引入了額外的複雜性。"},{"type":"link","attrs":{"href":"https:\/\/github.com\/facebookresearch\/fairscale\/blob\/master\/fairscale\/nn\/wrap\/auto_wrap.py","title":"","type":null},"content":[{"type":"text","text":"auto_wrap"}]},{"type":"text","text":"實用程序可用於註釋現有PyTorch模型代碼,用於嵌套包裝目的。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"模型初始化"},{"type":"text","text":":與DDP不同,FSDP不會在GPU worker之間自動同步模型權重。這意味着我們必須小心地完成模型初始化,以便所有GPU worker具有相同的初始權重。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化器設置"},{"type":"text","text":":由於分片和包裝,FSDP僅支持某些類型的優化器和優化器設置。特別是,如果一個模塊被FSDP包裝,並且它的參數被扁平化爲單個張量,那麼用戶就不能在這樣的模塊中爲不同的參數組使用不同的超參數。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"混合精度"},{"type":"text","text":":FSDP支持具有FP16主權重的高級混合精度訓練,以及梯度上的FP16減少和分散。有些情況下,只有在使用全精度時,模型的某些部分才能收斂。在這些情況下,需要額外的包裝來選擇以全精度運行模型的某些部分。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"狀態檢查點和推理"},{"type":"text","text":":當模型規模很大時,保存和加載模型狀態會變得頗具挑戰性。FSDP支持多種方式來完成該任務,但這絕非易事。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"最後,FSDP經常與FairScale的checkpoint_wrapper等"},{"type":"text","marks":[{"type":"strong"}],"text":"激活檢查點"},{"type":"text","text":"函數搭配使用。用戶可能需要仔細調整激活檢查點策略,以在有限的GPU內存空間內適配大型模型。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"下一步計劃"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FSDP是開源的,早期用戶已經嘗試過它併爲之做出了貢獻。我們認爲它可以讓整個研究社區受益,我們期待與所有人合作,讓它變得更好。下面是一些尤爲重要的領域。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"讓FSDP更通用"},{"type":"text","text":"。到目前爲止,FSDP已用於具有SGD和Adam優化器的NLP和視覺模型。隨着更新的模型和優化器不斷湧現,FSDP需要繼續支持它們。作爲純粹的數據並行訓練方案,FSDP在支持廣泛的AI算法方面擁有極大的潛力。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"使FSDP"},{"type":"text","marks":[{"type":"strong"},{"type":"strong"},{"type":"strong"}],"text":"支持"},{"type":"text","marks":[{"type":"strong"}],"text":"自動調優"},{"type":"text","text":"。今天,用戶在使用FSDP時可以調整許多旋鈕,以實現擴展和性能提升。我們期待能開發出自動調優GPU內存使用和訓練性能的算法。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"除了訓練之外,"},{"type":"text","marks":[{"type":"strong"}],"text":"更具擴展性的推理"},{"type":"text","text":"和模型服務是FSDP可能需要支持的一個重要用例。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"最後一點也很重要,重構和繼續"},{"type":"text","marks":[{"type":"strong"}],"text":"模塊化FSDP"},{"type":"text","text":"及其核心組件對於更新更好的組件來說也是很關鍵的基礎。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"試用並做出貢獻!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"FSDP目前可直接從FairScale庫中獲得。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/engineering.fb.com\/2021\/07\/15\/open-source\/fsdp\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章