單機訓練6000萬類視覺分類模型,飛槳大規模分類庫PLSC做到了

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 大規模分類任務 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在介紹大規模分類任務之前,我們先簡短回顧一下通常的分類任務。大家熟知的視覺分類任務中,網絡模型由特徵提取器(Backbone)和分類器(Classifier)組成。分類的類別數有2類(如,前景/背景分類)、10類(如,MNIST 數據分類)、80類(如,COCO 數據分類)和1000類(如,ImageNet 數據分類)等等。比較主流的特徵提取器有 ResNet、MobileNet 等網絡結構,分類器則通常採用線性分類層(全連接層,FC)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大規模分類任務的『大規模』指模型參數規模非常大,包括以下3種情況:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"特徵提取器參數規模大、分類器參數規模大,以及兩者參數規模都大。","attrs":{}},{"type":"text","text":"萬物互聯時代,隨着人工智能、5G 和 IoT 等技術的發展,分類模型分類的類別數不斷增加,類別數可以達到上千萬甚至更多。在這種背景下,分類網絡模型的 FC 層的類別數增加,參數規模爆炸式增長。基於度量學習的分類模型,通常在訓練階段使用閉集數據集學習特徵提取器和分類器,在推理階段,僅使用特徵提取器提取輸入圖像的特徵,並與預提取的特徵進行相似性對比得出是否屬於同一類,如圖1所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c5/c53a460e8b6f1d58221540188aba62b9.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 圖1:大規模分類網絡模型訓練和部署示意圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 大規模分類模型訓練難點 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文聚焦於解決大規模分類模型訓練問題。有小夥伴可能會問:大規模分類不就是一個普通的圖像分類嗎,除了分類類別數較多導致的 FC 層參數量大以外,還有什麼難題?圖像分類領域每年有大量的論文和工作在 ImageNet 數據集上 取得新的 SOTA,隨便從 Github 上找個圖像分類庫來訓練是不是就可以了?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,FC 層參數規模的急劇增長在訓練時會帶來以下兩方面挑戰:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是存儲問題。假設分類類別數爲4000萬,在訓練階段特徵向量的維度爲512,並且以32比特浮點數存儲模型參數,那麼僅 FC 層的參數就可達512*40000000*4(bytes)/1024/1024=76.29GB,遠遠超出主流顯卡的存儲容量。這是普通圖像分類庫無法解決的存儲問題。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次是速度問題。普通分類模型也面臨同樣的問題。隨着訓練數據、模型規模和分類類別數的增加,模型訓練的複雜度顯著增長,導致模型訓練所需要的時間不斷增長。速度是人類永無止境的追求,如何在更短的時間內訓練大規模分類模型也是工程實踐中迫切需要解決的問題。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決以上兩方面難題,學術界和工業界不斷圍繞着訓練的顯存消耗和速度進行優化;飛槳團隊也持續不斷地打磨升級大規模分類庫 PLSC(Paddle Large Scale Classification),提供數據並行&模型並行混合訓練、類別中心採樣、稀疏梯度參數更新和 FP16 訓練等解決方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 解決方案詳解 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來,小編將和大家一起來分享 PLSC 提供的數據並行&模型並行混合訓練、模型並行 Loss 計算、類別中心採樣、稀疏梯度參數更新和 FP16 訓練解決方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"● 混合並行訓練","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了提高訓練效率,通常使用多張 GPU 卡做數據並行訓練。但是對於大規模分類問題,類別數非常多,導致單卡無法訓練。例如,4000萬類的分類網絡模型僅 FC 層的參數量就高達76.29GB,遠遠超出主流顯卡的存儲容量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,數據並行下 FC 層的梯度通信量也巨大,導致訓練速度難以接受。針對 FC 層參數存儲和梯度通信問題,我們自然會想到是否可以將參數存放到多張 GPU 卡上?答案是肯定的。我們可以採用模型並行策略,將 FC 層參數切分到多張卡上。如圖2所示,Backbone 網絡採用數據並行,分類 FC 層採用模型並行,兼顧了數據並行的訓練效率和分類 FC 層參數的存儲和梯度通信需求。在分類類別數爲4000萬類時,假如使用單機8卡,那麼每張卡上的 FC 層僅需存放500萬類,參數的存儲大小爲76.29GB/8=9.54GB。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"採用數據並行和模型並行結合的方式,單機8卡情形下前向計算過程如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.每張卡接受一個 Batch 的數據,假設 batch size 爲64;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.每張卡使用輸入數據做數據並行計算,經過 Backbone 得到512維特徵向量,維度爲64x512;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.對每張卡上的特徵和標籤做 allgather 操作,從其他卡上收集特徵和標籤,此時每張卡上擁有全量特徵維度爲512x512,全量標籤維度爲512x1;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.全量特徵(512x512)和部分 FC 參數(512x5000000)做矩陣乘操作得到 logits,維度爲512x5000000;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5.使用模型並行的 SoftmaxWithCrossEntropy Loss 函數計算 loss。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f9/f9ae3e5b820e633e457e8629d6d20f0b.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 圖2:Backbone 數據並行&Classifer 模型並行","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"●","attrs":{}},{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"模型並行 Loss 計算——API 級 MarginLoss 函數","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在度量學習領域中,ArcFace[2]論文將 ArcFace,CosFace[3]和 SphereFace[4] Loss 函數用如下統一的公式表示,我們稱爲 MarginLoss:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/40/40d55a654ca4356918d0f5649fa30c48.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MarinLoss 函數是在 logits 上增加了 margin,最終基於的仍然是  SoftmaxWithCrossEntropy Loss 函數。模型並行下最容易想到的計算 Loss 的方法是用通信操作從其他卡上獲取全量的 logits。但是這種方法不僅需要巨大的通信量,同時需要臨時存儲其他卡上的 logits,帶來巨大的顯存開銷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"飛槳框架在模型並行策略下提供了對 MarginLoss--paddle.nn.functional.margin_cross_entropy 的原生支持。該接口通信量少且顯存開銷較小。圖3給出模型並行下 SoftmaxwithCrossEntropy 的計算過程。首先,在每張卡上逐行計算 logits 的最大值,然後通過 allreduce 操作獲取全局每行最大值。爲了保持數值計算的穩定性,每行減去其對應的全局最大值。接着,逐行計算分母和,然後通過 allreduce 操作獲取全局和值。最後,逐行計算 loss, 並通過 allreduce 操作獲取全局 loss。圖中,我們對 Loss 和 Softmax Probability 計算做了共同表達式提取。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/59/5912968bd72a831cd0ff274e978dd20c.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲圖3:模型並行 SoftmaxwithCrossEntropy 計算過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"●","attrs":{}},{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"類別中心採樣——API 級支持 PartialFC","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"採用數據並行&模型並行解決了 FC 參數存儲的問題。但是,可以發現前向計算的 logits 存儲需求也非常大,在混合並行訓練小節的假設條件下爲512*5000000*4(bytes)/1024/1024=9.54 GB。考慮前向計算、反向計算和參數更新相關變量,當優化方法使用 Momentum 時,可以得到 FC 層需要的存儲大小:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/13/13810b58e8d7161ff822c4f73b38efd8.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中 d 表示特徵的維度,c 表示總類別數,k 表示 GPU 卡數,N 表示 Batch 大小, Memw 表示參數存儲大小, Memlogits 表示 logits 存儲大小,MemFc 表示 FC 層總的存儲大小。當類別數增大時,我們可以將 FC 層參數切分到不同卡上,以保持每張卡上存儲的參數大小不變。然而, logits 的維度卻是隨卡數線性增長的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,卡數增大 k 倍, Memlogits 也增大 k 倍。訓練過程中,FC 層總的存儲大小 MemFc等於3倍 Memw (weight,gradient 和 velocity)加2倍Memlogits (activation 和 gradient)。爲了解決 logits 和對應的梯度存儲問題,PartialFC [5] 提出基於採樣的 FC 層,即從全量 FC 中採樣一部分類別中心參與本次迭代的學習。如圖4所示,對於正樣本對應的類別中心全部保留,對於負類別中心按比例隨機採樣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設採樣比例爲1/10,則 logits 的維度爲512x500000,存儲大小爲0.1*9.54 GB = 0.954GB。優化前存儲大小爲2* 9.54 GB = 19.08 GB,採用 PartialFC 時需要的顯存開銷爲2*0.954GB=1.908GB,可見使用 PartialFC 可以大幅減小顯存開銷。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/90/90f786a7ce3a5062958d05ed6575bf81.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 圖4:PatialFC 採樣過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"飛槳提供了原生支持上述採樣過程的 API——","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"paddle.nn.functional.class_center_sample。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"●","attrs":{}},{"type":"text","text":" ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"稀疏梯度參數更新——SparseMomentum","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"稀疏梯度參數更新是 PLSC 一大亮點。雖然 PartialFC 通過採樣的方法緩解了logits 和對應梯度的顯存開銷,但是通過上述分析我們發現 FC 層仍然需要3倍的 Mem_w,分別對應 FC 層參數、梯度和優化器狀態。我們是否可以進一步優化顯存呢?答案是肯定的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如圖5左所示,在前向計算過程中,通過對參數 W 的採樣,得到採樣類別中心 Wsub;反向計算梯度時,首先得到稀疏梯度 Wsub@grad,進而得到參數梯度 W@grad;在參數更新階段,momentum 算子使用傳入的參數 Wsub,參數梯度 W@grad 和優化器狀態 W@velocity 更新參數 W 和優化器狀態 W@velocity。我們通過分析發現,參數梯度 W@grad 是冗餘的,其可以通過稀疏梯度 Wsub@grad 得到。爲此,我們設計和開發了 sparse_momentum 接口。相比於 momentum,該接口需要額外傳入參數 index,表示 FC 參數採樣的索引值,計算過程如圖5右所示。使用該接口,可以大幅減少梯度的存儲空間,從而可以訓練更大規模參數的模型。相比 momentum 需要3*9.54Gb+2* 0.954Gb=30.528GB 的存儲空間,使用 sparse_momentum ,FC 層僅需要2* 9.54Gb+2*0.954Gb=20.988GB 的存儲空間,顯存空間降低31.25%。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/55/55f93863463e41c5393af02f54cc3a8f.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 圖5:Momentum 和 SparseMomentum 的更新過程對比","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"● FP16 訓練——節省50%顯存","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PLSC 的另外一個亮點是使用 FP16 訓練,即整個訓練過程中參數、Activation、梯度和優化器狀態均使用 FP16,相比於 FP32 顯存空間節省50%。相比 FP32 和 AMP[6] ,FP16 可以大幅節省顯存,並大幅提升訓練速度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖6分別給出是 FP32,AMP 和 FP16 的計算過程。FP32 計算過程中,所有模型參數、梯度、Activation 和優化器狀態的數據類型均爲 FP32。AMP 計算過程中,模型參數和優化器狀態爲 FP32;計算過程中,將參數 cast 成 FP16,因此Activation 和梯度也是 FP16;優化階段參數梯度需要重新 cast 爲 FP32;所以,相比於 FP32,AMP 通過將 Activation 和對應的梯度存儲爲 FP16,節省顯存開銷。PLSC 使用的是真正意義的 FP16,即模型參數、Activation、梯度和優化器狀態均使用 FP16;相比 FP32 顯存開銷減少50%;此外,由於消除了 cast 操作,FP16 相比於 AMP 可以進一步提升訓練速度。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b8/b8b4611abd19ed518f59d0ae08c6064a.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 圖6:FP32、AMP 和 FP16 計算過程","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 實驗結果 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上一節我們介紹了大規模分類模型訓練的一些解決方案,這些解決方案都已經在 PLSC 中實現且開源。此外,我們也已經將 PLSC 開源到人臉識別社區 InsighFace[1]。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PLSC 庫地址⬇️","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://github.com/PaddlePaddle/PLSC","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PLSC 具有以下亮點:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"高吞吐,低顯存,簡單易用;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持單機和多機分佈式訓練,API 級支持模型並行、PartialFC 和 MarginLoss;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"支持 FP16 訓練;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時支持靜態圖和動態圖。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來我們將從訓練精度、顯存開銷和訓練速度3個維度來評測 PLSC。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"● MS1MV3 精度","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下表給出 MS1MV3 數據上的精度對比。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表1不同框架實現的 Repo 在 MS1MV3 數據集上的精度對比","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/43/4314a17ead15725ef801569fc373c430.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從表1中,我們可以看到雖然 PLSC 使用了 FP16,但在主要數據集上 PLSC 的精度仍然可以打平其它框架實現。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"● 最大支持類別數","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實驗配置:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GPUs:8 NVIDIA Tesla V100 32G;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BatchSize:64/512(每張卡的 batch size 是64,全局 batch size 是512個樣本);","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SampleRatio:0.1(PartialFC 採樣率爲0.1)。","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表2不同框架實現支持的最大類別數對比","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/50/504404adc607c9417b921c9d1a10e09f.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表中數據說明,相比其他的框架實現,PLSC 在顯存優化方面具有顯著優勢:靜態圖最多支持6000萬類別分類,動態圖最多可支持6700萬類別分類。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"● 吞吐量對比","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"吞吐量是每秒訓練的樣本數。我們採用公開數據集 MS1MV3 來測試。爲了取得穩定和公平的結果,我們對每個實驗配置都運行了5組實驗,每組運行200個 steps,然後計算50到150個 steps 間的100個 steps 吞吐量的平均值,最後得到這5組實驗的平均吞吐量後再取中位數作爲最終的吞吐量結果。以下是實驗配置:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tesla V100 (32G):Driver Version:450.80.02, CUDA Version:11.0;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tesla A100 (40G):Driver Version:460.32.03, CUDA Version:11.2;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Datasets:MS1MV3(93431l 類);","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SampleRatio:0.1(使用了 PartialFC,採樣率爲0.1);","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BatchSize:128(每張卡128個樣本)。","attrs":{}}]}]}],"attrs":{}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e2/e29486c035d2e11d270358d24195bef2.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 圖7:不同框架實現的 Repo 吞吐量對比","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從圖7可以看出 PLSC 靜態圖模式下,優於所有框架實現,尤其在 A100,ResNet50,FP16,8卡的配置下,PLSC 吞吐量高達9500imgs/s。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"項目地址:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GitHub:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://github.com/PaddlePaddle/PLSC","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GitHub:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://github.com/deepinsight/insightface","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"參考引用:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"[1] https://github.com/deepinsight/insightface.git","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"[2] Deng, J., Guo, J., Xue, N. and Zafeiriou, S., 2019. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4690-4699).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"[3] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z. and Liu, W., 2018. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5265-5274).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"[4] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B. and Song, L., 2017. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 212-220).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"[5] An, X., Zhu, X., Gao, Y., Xiao, Y., Zhao, Y., Feng, Z., Wu, L., Qin, B., Zhang, M., Zhang, D. and Fu, Y., 2021. Partial fc: Training 10 million identities on a single machine. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1445-1449).","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"[6] Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G. and Wu, H., 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://developer.baidu.com/?from=111201","title":"","type":null},"content":[{"type":"text","text":"點擊進入獲得更多技術信息~~","attrs":{}}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章