DAMO Academy announces latest progress on its large model M6: parameters surpass 10 trillion, making it the world's largest AI pre-trained model

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前,大規模預訓練模型已成爲學術界和工業界都非常關注的一大研究領域。隨着達摩院大模型M6突破10萬億參數,中國成功實現了全球最大AI預訓練模型。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"M6成全球最大AI預訓練模型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"11月8日,阿里巴巴達摩院公佈"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/xIX9lekuuLcXewc5iphF","title":"xxx","type":null},"content":[{"type":"text","text":"多模態大模型M6"}]},{"type":"text","text":"最新進展,其參數已從萬億躍遷至10萬億,規模遠超谷歌、微軟此前發佈的萬億級模型,成爲全球最大的AI預訓練模型。據瞭解,M6使用512 GPU在10天內即訓練出具有可用水平的10萬億模型。相比去年發佈的大模型GPT-3,M6實現同等參數規模,能耗僅爲其1%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"M6是達摩院研發的通用性人工智能大模型,擁有多模態、多任務能力,其認知和創造能力超越傳統AI,尤其擅長設計、寫作、問答,在電商、製造業、文學藝術、科學研究等領域有廣泛應用前景。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自2020年中GPT-3提出以來,一系列國內外大企業都在大模型的研發上開展探索,專注各個領域任務的大模型相繼提出,在各大下游任務都展現出優越的表現。無疑,超大規模預訓練模型蘊含着巨大的學術研究價值和商業落地價值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與傳統AI相比,大模型擁有成百上千倍“神經元”數量,且預先學習過海量知識,表現出像人類一樣“舉一反三”的學習能力。因此,大模型被普遍認爲是未來的“基礎模型”,將成下一代AI基礎設施。然而,其算力成本相當高昂,訓練1750億參數語言大模型GPT-3所需能耗,相當於汽車行駛地月往返距離。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此前達摩院陸續發佈了多個版本的M6模型,從大規模稠密模型到超大規模的混合專家模型的探索,逐步從百億參數升級到萬億參數規模。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/76\/85\/7628a20d6fa0161f0yy154yy84c47985.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"今年5月,通過專家並行策略及優化技術,達摩院M6團隊將萬億模型能耗降低超八成、效率提升近11倍。10月,M6再次突破業界極限,通過更細粒度的CPU offload、共享-解除算法等創新技術,讓收斂效率進一步提升7倍,這使得模型規模擴大10倍的情況下,能耗未顯著增加。這一系列突破極大降低了大模型研究門檻,讓一臺機器訓練出一個千億模型成爲可能。 "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時,達摩院聯合阿里雲推出了M6服務化平臺,爲大模型訓練及應用提供完備工具,首次讓大模型實現“開箱即用”,算法人員及普通用戶均可方便地使用平臺。達摩院還推出了當前最大規模的中文多模態評測數據集MUGE,覆蓋圖文描述、文本生成圖像、跨模態檢索任務,填補了缺少中文多模態權威評測基準的空白。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作爲國內首個商業化落地的多模態大模型,M6已在超40個場景中應用,日調用量上億。今年,大模型首次支持雙11。M6在犀牛智造爲品牌設計的服飾已在淘寶上線;憑藉流暢的寫作能力,M6正爲天貓虛擬主播創作劇本;依靠多模態理解能力,M6正在增進淘寶、支付寶等平臺的搜索及內容認知精度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d7\/4f\/d79c51ea6a095e164825ec180437e34f.jpg","alt":null,"title":"M6生成的未來感汽車圖","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"10萬億M6技術實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前文提到,M6使用512 GPU在10天內即訓練出具有可用水平的10萬億模型。而之前業界最好水平是微軟最新發布的DeepSpeed,其使用了512張A100才完成3.5萬億參數基於MoE的GPT。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在研發的過程中,研究人員發現MoE模型結合高效的分組機制能夠用有限資源快速訓練完成一個效果優越的大模型。同時一系列大模型的工作都在說明,參數規模的擴展帶來的便是模型能力邊界的擴展,更多的數據+更大的模型=更強的能力。那麼,如果要訓練的是極限規模的十萬億參數模型,是不是就需要成倍地增加機器呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"M6團隊提出的命題是,如何在有限資源的條件下高效地訓練極限規模模型?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"M6團隊提出了一種簡單的方法解決此類極限規模模型訓練的問題,不僅關注如何用有限的資源訓練極限規模模型,還關注如何將其訓練至真實可用。團隊使用512張GPU將十萬億參數的模型訓練至可用的水平,而如果訓練此前的萬億參數模型也只需要64張GPU即可實現。相比此前的M6模型,M6-10T具有如下優勢:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"相比此前的萬億參數M6,M6-10T的參數量是原先的10倍沒有顯著的資源增加(480 vs 512 
GPU);"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"相比萬億參數M6,M6-10T在樣本量的維度上具有更快的收斂速度;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"提出的共享解除機制將十萬億參數模型的訓練速度提升7倍以上,並可廣泛應用於其他同類大模型的訓練。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/damo.alibaba.com\/","title":"xxx","type":null},"content":[{"type":"text","text":"達摩院"}]},{"type":"text","text":"智能計算實驗室聯合阿里雲PAI團隊,在"},{"type":"link","attrs":{"href":"https:\/\/baijiahao.baidu.com\/s?id=1708472807237741573&wfr=spider&for=pc","title":"xxx","type":null},"content":[{"type":"text","text":"Whale框架"}]},{"type":"text","text":"下實現M6模型。此前發佈的千億和萬億參數M6模型,均在Whale上實現,利用其強大的數據並行、模型並行以及專家並行的能力實現超大規模模型的訓練和推理。Whale通過一系列優化,爲M6模型的訓練節約資源,提升效率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"顯存優化方面,Whale的自動Gradient Checkpointing、Group-wise Apply、CPU Offload技術和通信池化等技術均有效節約顯存的使用,而在計算和通信方面,Whale支持了MoE所需的DP+EP的機制,並在EFLOPS集羣高速通信能力的基礎上,採用分組融合通信、半精度通信、拓撲感知的All2All通信算子等技術來提高通信效率,以及結合混合精度、編譯優化等技術提高訓練效率等。同時,EFLOPS團隊聯合PAI團隊對attention進行優化,將訪存密集型算子融合成一個cuda kernel實現,將multihead attention性能提升30%。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而在十萬億M6模型的訓練上,團隊首先解決有限資源(512 GPU)“放下”10萬億參數的極限規模模型,而模型結構則採用此前萬億參數M6-T使用的結合expert prototyping的MoE模型。團隊在分佈式框架Whale中利用CPU offload的方法成功將十萬億參數的M6-10T模型在512張GPU的機器中放下並實現訓練。相比其他的CPU offload方案,M6的CPU offload粒度可控,可以靈活地選擇offload的模型層,可以不用將所有的權重offload到CPU memory中,而選擇保留部分權重在GPU memory上進行計算,這樣的做法可以進一步地提高GPU利用率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d7\/04\/d7697731f726571cd6c1eb374e369604.jpg","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解決了放入模型的問題後,團隊針對訓練效率的問題設計了Pseudo-to-Real(共享解除)機制,其核心思想爲利用訓練好的小模型初始化大模型。該算法首先利用參數共享的機制構建並快速訓練小模型,此階段無需使用CPU內存存放模型同時可以使用更大的批次。配合上專家拆分和合並的機制,算法團隊只需要使用256張GPU即可快速訓練一個Pseudo Giant。隨後,訓練好的模型層的參數用於爲Real Giant的每一層提供初始化,大模型即可在訓練好的小模型的基礎上繼續優化。儘管大模型的訓練速度較慢,但無需經歷漫長的收斂過程,只需從一個低點開始優化。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"團隊也通過實驗證明該方案在收斂和下游遷移的有效性,同時在十萬億參數規模的M6-10T模型上做出成功實踐,僅用10天左右的時間即得到非常突出的收斂效果。樣本維度上收斂效果顯著優於此前千億參數M6和萬億參數模型M6-T。如上圖所示,在經過了10M樣本的訓練後,同等實驗設置下M6-10T的log 
The team also verified experimentally that this approach is effective for convergence and downstream transfer, and put it into practice on the 10-trillion-parameter M6-10T, reaching a strong convergence result in about 10 days. Convergence in terms of samples is significantly better than both the earlier 100-billion-parameter M6 and the trillion-parameter M6-T. As the figure above shows, after training on 10M samples under the same experimental setup, M6-10T's log PPL is significantly lower than that of M6-MoE and M6-T, by 34.7% and 10.1% respectively.

In the experiments, compared with a 10-trillion-parameter model trained directly without the Pseudo-to-Real mechanism, the Pseudo-to-Real model needed only 6% of the time to reach the same pre-training loss. Compared with the trillion-parameter M6, the Pseudo-to-Real 10-trillion-parameter model needed only about 40% of the samples to reach the same pre-training loss, clearly showing the mechanism's advantage for training super-large models.

Zhou Jingren, head of DAMO Academy's Intelligent Computing Lab, said: "Next, we will study the cognitive mechanisms of the brain in depth and work to bring M6's cognition close to the human level, for example by building a general AI algorithmic framework that simulates how humans extract and understand knowledge across modalities; at the same time, we will keep strengthening M6's creativity in different scenarios to deliver outstanding application value."