Training Multi-Billion-Parameter Models on a Single GPU: ZeRO-Offload, a Heterogeneous Deep Learning Training Technology, Makes It Possible

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"近日,微軟和加州大學默塞德分校聯合推出了一種新穎的異構深度學習訓練技術ZeRO-Offload,這是基於Zero Redundancy Optimizer (ZeRO是微軟在 2020 年 2 月提出的一種萬億級模型參數訓練方法) 構建的。該技術可在單個GPU上訓練數十億個參數模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8f\/8f7c649b5bbac255080fbd91c1e935fa.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖片來源:"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/pdf\/2101.06840.pdf","title":"","type":null},"content":[{"type":"text","text":"https:\/\/arxiv.org\/pdf\/2101.06840.pdf"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術發展至今,我們正在邁入一個高度依賴深度學習(DL)模型的技術時代。隨着這些模型規模的成倍增加,訓練這些模型的成本也變得非常昂貴。 由於訓練這些大規模模型需要最先進的系統技術,這就使得這類大規模模型的訓練受到了一定的限制。僅有爲數不多的AI研究人員和機構擁有資源​​來訓練這些包含十億多個參數的、規模龐大的深度學習模型。例如,要訓練100億個參數模型,就需要一個DGX-2等效節點,該節點需要具有19張NVIDIA V100卡,成本超過10萬美元,這超出了許多數據科學家甚至許多學術機構的承受範圍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了增加訓練大規模模型的可能性,加利福尼亞大學、默塞德大學和微軟的一組研究人員聯合開發了 ZeRO-Offload。這項新的異構深度學習技術可幫助數據科學家在單個GPU上訓練數十億個參數模型,而無需進行模型重構。它是一款具有高計算效率和近似線性擴展性的GPU-CPU混合深度學習訓練技術。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在訓練大規模模型時面臨的挑戰包括模型狀態,即參數、梯度、優化器狀態,​​以及缺乏有關利用CPU計算的研究。許多研究人員已經嘗試使用異構深度學習訓練來解決這些問題,以減少GPU內存需求,但這些辦法都是針對基於小型CNN模型的內存激活問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統的數據並行性通常是用於將深度學習訓練擴展到多個GPU的社區標準。儘管如此,它仍然需要數據和計算再現,這就導致了傳統數據並行不適用於深度學習模型的異構訓練。另一方面,ZeRO-Offload可以同時利用CPU和GPU內存,從而高效地進行訓練。ZeRO-Offload還可以在CPU內存上維護優化器狀態的單個副本,而與數據並行度無關,這可以實現多達128個GPU的可伸縮性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ZeRO-offload是基於三個原則設計的:效率、可伸縮性和可用性。研究人員已經確定了CPU和GPU設備之間獨特的數據分區和最佳計算策略。該方法涉及到的流程包括將梯度、優化器狀態和優化器計算分散到CPU,保留參數以及在GPU上保持向前和向後計算。研究人員觀察到,在計算條件有限的情況下,可訓練的模型大小增加了十倍,從而使單個NVIDIA V100GPU能夠以40 
Image: https://static001.geekbang.org/infoq/01/013923c66b67a62fe1c24a14a1287f8f.png
Image source: https://arxiv.org/pdf/2101.06840.pdf

ZeRO-Offload is available on GitHub as part of DeepSpeed, an open-source PyTorch library. It can be added to an existing training pipeline by changing only a few lines of code (see the sketch below). ZeRO-Offload improves compute and memory efficiency and is easy to use; these qualities put large-scale model training within reach of researchers and data scientists who have only a single GPU.

Paper: https://arxiv.org/pdf/2101.06840.pdf
DeepSpeed project: https://github.com/microsoft/DeepSpeed
Original article: https://www.marktechpost.com/2021/02/01/microsoft-and-the-university-of-california-merced-introduces-zero-offload-a-novel-heterogeneous-deeplearning-training-technology-to-train-multi-billion-parameter-models-on-a-single-gpu/
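As a rough illustration of the "few lines of code" claim, here is a hedged sketch of wiring a PyTorch model into DeepSpeed using the configuration sketched earlier. The tiny linear model and random inputs are placeholders; the deepspeed.initialize signature and the engine's backward/step calls follow DeepSpeed's documented interface, though older releases may name the config argument differently.

```python
import torch
import deepspeed

# Hypothetical stand-in model; a real pipeline would use its own nn.Module.
model = torch.nn.Linear(1024, 1024)

# deepspeed.initialize wraps the model in a training engine that applies the
# ZeRO-Offload settings from ds_config (the configuration sketched earlier).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    # Random half-precision inputs on the engine's device, for illustration.
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)   # gradients stream to the CPU-resident optimizer
    engine.step()           # the Adam update runs against the CPU copy
```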