怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着人工智能技術的發展,愛奇藝內部越來越多的服務使用深度學習模型和技術來驅動,爲我們的用戶提供更加智能和便捷的在線視頻觀看體驗。其中在線類的服務,通常單個容器實例需要獨佔一個 GPU,以實現在毫秒/秒級延時內完成例如視頻、圖片、語音、文本的深度學習模型推理請求;爲了保證響應延時,請求通常單獨進行,無法對請求做batch以提升計算效率,且不同請求間隔隨機,會導致這些服務的 GPU 計算資源的利用率通常較低(如圖1所示)。且在線類服務請求量在一天或者一定時間週期內存在波峯波谷的現象,進一步降低了 GPU 的利用率。鑑於GPU本身高昂的價格,較低的 GPU 利用率浪費了大量計算資源,增加了 AI 服務的成本。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5c/5ca627f83858d1ecd506595ed23ec1ae.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖1:在線推理服務 GPU 利用率統計","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在此背景下,最直接的解決方案是將多個服務部署在同一張 GPU 卡上,在保證服務質量的前提下通過 GPU 共享來提升 GPU 的利用率。目前英偉達官方的 GPU 共享技術主要有兩種方案:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"vGPU","attrs":{}},{"type":"text","text":" ;(2)","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"MPS","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來我們將簡單對比下兩種方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Nvidia vGPU方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Nvidia的vGPU方案採用虛擬化的技術,基於 SR-IOV 進行 GPU 設備虛擬化管理,在驅動層提供了時間分片執行的邏輯,並做了一定的顯存隔離,這樣在對顯卡進行初始化設置的時候就可以根據需求將顯卡進行劃分。其中時間分片調度的邏輯可以是按實例均分,或者是自定義比例,顯卡的顯存需要按照預設的比例進行劃分。Nvdia的vGPU方案在實施中有下面","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"兩點限制","attrs":{}},{"type":"text","text":":","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)vGPU劃分完成之後,如果要改變這種預定義的劃分,需要重啓顯卡才能生效,無法做到不重啓更改配置。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2)其方案基於虛機,需要先對 GPU 物理機進行虛擬化之後,再在虛擬機內部署容器,無法直接基於物理機進行容器化的調度,另外 vGPU 方案需要收取 license 費用,增加了使用成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Nvidia MPS方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Nvidia的MPS方案是一種算力分割的軟件虛擬化方案。該方案和vGPU方案相比,配置很靈活,並且和docker適配良好。MPS 基於C/S架構,配置成MPS模式的GPU上運行的所有進程,會動態的將其啓動的內核發送給MPS server,MPS Server藉助CUDA stream,實現多個內核同時啓動執行。除此之外,MPS 還可配置各個進程對 GPU 的使用佔比。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該方案的一個","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"問題在於","attrs":{}},{"type":"text","text":",各個服務進程依賴 MPS,一旦 MPS 進程出現問題,所有在該GPU上的進程直接受影響,需要使用 Nvidia-smi 重置GPU 的方式才能恢復。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"01愛奇藝的vGPU方案","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"調研以上方案後,爲了更好地適用於愛奇藝內部 AI 容器化應用場景,我們重新開發了容器場景下的 GPU 虛擬共享方案,基於CUDA API 截獲方式實現顯存及算力隔離和分配,並基於開源項目aliyun-gpushare scheduler[1]實現 K8S 上對虛擬 GPU 的調度和分配,實現了多應用容器部署在一張 GPU 卡的目標。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們方案的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"主要特點","attrs":{}},{"type":"text","text":"是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"配置靈活","attrs":{}},{"type":"text","text":",和K8S能夠有機的進行結合,按需實時分配用戶所需要的vGPU實例,同時儘可能的讓物理GPU實例能夠充分的被共享,實現資源的最大化利用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在完成方案的設計之後,我們對整體進行了效果的評估,測試這種隔離和共享對應用性能的影響。即對於單一進程來說,需要保證:首先它不會使用超過其被分配的算力大小,其次隔離本身不應該對於 GPU 算力有過多損耗,第三是多個進程同時共享的時候,與其單獨運行時相比,不應有太大的性能偏差,即共享可以有效避免進程之間的干擾。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對以上標準,我們對 GPU 虛擬共享方案進行了性能測試,結果如圖2所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"第一個測試是單進程算力隔離後性能的評估","attrs":{}},{"type":"text","text":"。物理 GPU上只運行單一進程,但配置了三次,分別爲100%,50%和10% 算力時,其性能和該程序獨立運行時的比例關係。縱軸爲達到無虛擬化運行時性能的百分比,橫軸爲進行的單元測試用例,區域相同的顏色表示該組測試用例爲同一CUDA kernel,但是不同的運行參數。其中圖內的綠點,藍點,和紅點分配是500多個測試用例在各自算力分配的情況下得到的性能,和完全沒有算力分割且獨佔GPU時運行的性能的相對比值。另外曲線是這些獨立點的數值在全體維度上做了一個平滑,以更好的進行可視化的對比。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"第二個和第三個測試分別用不同算力配比對兩個GPU進程進行相互干擾實驗","attrs":{}},{"type":"text","text":"。如第二個兩個進程分別配置爲50% 算力,綠點爲兩個GPU進程性能平均值,而紅色曲線爲這些綠點的平滑曲線。該曲線和第一個測試中50%算力的曲線對比相差無幾,這就說明了我們方案中配置50%算力時同時運行相互干擾是幾乎可以忽略的。第三個爲一個配置爲70%,另外一個配置爲30%算力,同樣可以和第一個測試中的獨立分配70%/30%時各自的曲線進行對比。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"測試結果表明了方案可以將GPU相互干擾控制在合理的範圍之內。服務上線後內部統計顯示,平均 100+ 深度學習容器服務可以共享的部署在 35 張物理 GPU 之上,並且做到應用相互之間無影響;對於單張GPU物理卡,平均承載的服務數量從 1 變爲了3;同時GPU的平均利用率也提升了2 倍以上。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/37516e31ede15a4f36d9d7233597cef5.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖2:隔離性測試結果","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"02愛奇藝GPU虛擬共享的底層原理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先我們來看看 GPU 虛擬共享的底層原理。GPU作爲一個強大的計算外設,提供的兩個主要資源是顯存和算力。要實現單個 GPU 的共享,我們要實現對顯存和算力資源的隔離,並驗證隔離方案的效率及性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.1顯存隔離","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於深度學習應用來說,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"對於顯存的需求","attrs":{}},{"type":"text","text":"來自於三個方面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)第一是模型的CUDA kernel context,可類比於CPU程序中的text段,提供給CUDA kernel執行的環境,這是一項剛需,沒有充足的顯存,kernel將無法啓動,且context的大小隨着kernel的複雜程度有增長,但在整體模型顯存需求中是最小的一部分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)第二部分來自於模型訓練得出的一些參數,如卷積中的weight和bias。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)第三部分來自於模型在推理過程中的臨時存儲,用於儲存中間的計算結果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於一般的模型來說,基本都不需要佔用整個GPU的顯存。但是這裏有","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"一個例外","attrs":{}},{"type":"text","text":",Tensorflow框架默認分配所有GPU的顯存來進行自己的顯存管理。當然Tensorflow框架有相應的選項可以屏蔽該行爲,但是對於平臺來說,要讓每個用戶修改 TF 的配置爲屏蔽該行爲,就不太可行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"應對這一問題","attrs":{}},{"type":"text","text":",一個巧妙的方法可以在不需要應用開發者參與的情況下,讓Tensorflow的部署應用只分配它所需的顯存大小而不出現問題。該方法即API動態攔截。Tensorflow之所以可以知道當前GPU的剩餘顯存,是通過cuDeviceTotalMem/cuMemGetInfo這兩個CUDA library API。通過LD_PRELOAD的方式,在的鉤子so中實現這兩個API,那麼Tensorflow執行的時候,link首先會調用的是的API實現,而不是CUDA的,這樣就可以動態的修改這兩個API的返回結果,如這裏想做的,將特定Tensorflow應用的顯存配額限制在其申請數值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在系統實現的過程中,還對cuMemAlloc/cuMemFree做了同樣的攔截,目的是爲了能夠對同容器中的多個GPU進程進程統一管理。當多個GPU進程分配顯存之和超過其配額時,可以通過cuMalloc來返回顯存不足的錯誤。容器內顯存配額管理是通過share mem來做的。圖3展示了顯存隔離和分配的整個流程。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/60/60487ebd2fd6131db5356ba01ae63943.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖3:顯存分割中隔離和分配流程","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2.2算力隔離","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了顯存,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"另外一個重要的GPU資源是算力","attrs":{}},{"type":"text","text":"。對於Nvidia volta顯卡架構來說,它的算力來源於三個方面,浮點計算單元、整形計算單元、tensor core加速計算單元。其中浮點計算單元和整形計算單元是流處理器SP的內部結構,而SM中包含着多個流處理器SP。對於V100來說,它有80個SM,每個SM中有64個SP,合5120個流處理器,其中tensor core是位於SM的內部,與SP共享寄存器/share mem/L1 cache。圖4給出了Nvidia GPU的硬件架構組織關係框圖。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/85/855073f42101fa024019d78ba4aba589.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖4:Nvidia GPU硬件架構組織關係圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於Nvidia的GPU的編程語言CUDA來說,它的語言設計邏輯上和上圖的硬件層次是對應的。CUDA有","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"三層邏輯層次","attrs":{}},{"type":"text","text":",分別爲","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"grid","attrs":{}},{"type":"text","text":",","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"block","attrs":{}},{"type":"text","text":",和","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"thread","attrs":{}},{"type":"text","text":"。Grid可以認爲是對整個顯卡的邏輯抽象,block可以認爲是對SM單元的邏輯抽象,而thread是SP的邏輯抽象。爲了達到最高的併發程度,SM之間可以認爲是沒有交互的,當然這也不是絕對的,有一些程序爲了自己的特殊邏輯,也可以設計出SM之間依賴的程序,但這個代價是極大的性能浪費。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在知道了GPU的底層結構,以及CUDA的設計原理之後,可以就如何算力虛擬化來做一下初步設想。既然一些模型無法完全利用GPU的全部算力,那麼何不削減其佔用的SM個數,使得空閒下來的SM可以爲其他GPU程序所用?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣的想法是好的,但是一些限制阻止了這種優化的實現。GPU程序的執行,是通過kernel的片段來具體實施,在CPU側launch了 kernel之後,具體的kernel及其調用參數隨即交由GPU的硬件調度器來在某個未來的時間點真正運行起來。在默認的情況下,kernel是被派發給GPU上所有的SM,且執行過程中不能被中斷。如圖5所示,軟件系統在發送完畢啓動命令之後,隨即命令及參數由PCIe轉交給GPU硬件,並插入其隊列中,由GPU硬件中固化的邏輯去具體處理在何時真正啓動。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9a/9aafb70429702eb73f265e8a92ecdc09.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖5:GPU軟件和硬件調度系統的交互圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但世事無絕對,默認情況下不行,不代表沒有別的辦法。讓我們再來回顧一下CUDA的設計。CUDA作爲一個用於操控GPU來完成高效並行計算的語言,它的代碼編寫邏輯是以thread爲基本單元的。SM上所有SP都運行着一份kernel的代碼,且在一定程度上來說連運行節奏都完全一致。CUDA中用來區分thread,來判斷代碼應該處理數據的偏移量的方法,是通過CUDA中的blockIdx/threadIdx這兩個內嵌變量。這兩個變量在機器碼上是隻讀的,在thread由硬件調度器派發的時候所指定。通過硬件調度器,就完成了抽象的blockIdx/threadIdx和具體的SM/SP的綁定。圖6大概的描述了這一映射關係。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6b/6b7c68a3ea11676a0f7c70aecefde355.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖6:CUDA邏輯塊和硬件計算單元的映射關係","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了能夠精確的控制算力,我們就不能再依賴硬件調度器來控制內核啓動。在這裏用了一個取巧的方法,就是讓內核啓動之後被“困”在固定數目的SM上面,這個數目的值和GPU整體SM個數的比例就是給這個內核算力配比。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了形象化來闡述思路,這裏我們對GPU做了一個抽象化的改動,SM的個數被定義爲10個。然後有一個啓動參數爲<<<15,1>>>的內核,即CUDA block size爲15,thread size爲1。它正常啓動的時候,硬件調度器會給每一個SM上分配一個內核的副本。這樣在第一時間就消耗了10個block的副本,隨後每個SM上內核執行完畢之後會退出,硬件調度器會進一步分配剩下的5個block副本,在這個也執行完畢之後就完成了整個內核的執行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"算力切分之後,我們會在內核啓動時,動態的修改其啓動參數,將其CUDA block size從15變爲5。這樣硬件調度器就會將內核副本分配到GPU上一半數目的SM上,空閒的一半可以爲其他內核所使用,如圖7所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ec/eccd18298274965139edfa4f0f7f1f8d.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖7:動態修改啓動參數來進行算力分割","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們雖然通過動態修改啓動參數的方法,避免了內核佔滿全部SM資源,但此時還沒完成“困”這一動作。所以此時的內核行爲是其完成預定邏輯之後,會退出,導致此時內核不能覆蓋block size爲15時的數據空間。爲了將其“困“住,我們在內核的彙編EXIT處,替換成了BRANCH操作。這樣內核完成本身的邏輯之後,會跳轉到我們預設的一段邏輯中。這個邏輯完成虛擬blockIdx/threadIdx的自增操作,隨後再跳轉到內核開始位置,來基於更新的blockIdx/threadIdx來進行新一輪計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這次需要指出的是blockIdx/threadIdx爲只讀寄存器,所以沒辦法直接更改它的值。作爲一個替代的解決方案時,將內核中的blockIdx/threadIdx進行整體替換爲可寫的寄存器,這樣我們就可以在預設的跳轉邏輯中做更改操作,如圖8所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8d510961b1193f1778e87bb0365bd24.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖8:彙編修改更改內核運行邏輯","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"03愛奇藝GPU虛擬共享的調度設計","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"完成了 GPU 底層的資源隔離之後,我們還需要基於 K8S 平臺實現對隔離的 GPU ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"資源的分配和調度管理","attrs":{}},{"type":"text","text":",以方便業務的深度學習服務能夠","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"快速部署","attrs":{}},{"type":"text","text":"到共享的 GPU。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"K8S 容器中使用 GPU 的方案一般採用 Nvidia device plugin(英偉達官方插件),它可以爲 Pod 分配一卡或多卡,分配的最小單元是1張卡,無法支持底層隔離的 GPU 資源調度。調研之後,我們選擇阿里雲容器服務開源的aliyun-gpushare作爲調度方案,實現對 GPU 隔離資源的調度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以顯存爲例,使用aliyun-gpushare,爲 Pod 分配的是一張卡中的部分顯存,這樣從邏輯上說,單卡的資源就可以再進一步被切分。假設有一張 V100 32GB 卡,可以給 Pod1 分配 4GB 顯存,也可以同時給 Pod2 分配 8GB 卡,直到 32GB 顯存分配完畢。整個調度過程如圖9所示","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/09/09b25056a6fa0423815af310e0d79a80.png","alt":"怎樣節省 2/3 的 GPU?愛奇藝 vGPU 的探索與實踐","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖9:阿里公開的整體調用方案圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中,Share GPU Device Plugin 和 Share GPU Schd Extender 是主要的新增組件,下文簡寫爲 SGDP和SGSE。其餘的組件均爲 k8s 官方組件。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"主要流程","attrs":{}},{"type":"text","text":"如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"1","normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"用戶創建一個 Share GPU Pod 時,必須帶 aliyun.com/gpu-mem 這種 K8S 自定義資源,表明其需要多少顯存。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"SGSE 根據用戶的 Share GPU 顯存請求和集羣整體資源情況,給該 Pod 分配一個 Node,並通過 patch Pod annotation 來指定使用某張卡。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Kubelet 調用 SGDP 的 Allocate 方法,將指定的 GPU 卡分配給 Pod 使用,同時設置環境變量 ALIYUN_COM_GPU_MEM_CONTAINER(容器可用顯存)、LD_PRELOAD(其值爲限制顯存的動態鏈接庫路徑)。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Pod 啓動後,因爲設置了 LD_PRELOAD,所有 AI 框架的 GPU 顯存申請都被動態鏈接庫的鉤子劫持,當使用的總資源超過 ALIYUN_COM_GPU_MEM_CONTAINER 的值,則拒絕請求。從而達到限制用戶使用顯存的效果。","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"算力資源的調度策略類似以上顯存調度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實踐中,對於 1 張物理 GPU 卡,我們進行 1/4 和 1/2 的顯存和算力的分割,業務可根據實際需要選擇對應的比例,單張 GPU 最多可部署 4 個不同應用,並可實現有效隔離,互不影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"04結語和展望","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過底層LD_PRELOAD動態劫持的方式,我們實現了容器中輕量級 GPU 顯存和算力的隔離,從而支持多個容器應用部署在同一個 GPU 上。該方案從一個動態的維度實現單張 GPU資源的劃分,針對在線推理服務場景,很好的提升了GPU硬件的使用效率。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後續工作中,我們還計劃開發和實現跨主機 GPU 遠程調用的方案,來解決 GPU 虛擬共享之後,產生的單機多卡機器上 CPU/GPU 比例失衡,導致的部分虛擬 GPU 資源無 CPU可分配的問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"參考文獻","attrs":{}},{"type":"text","text":":","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.aliyun-gpushare: https://github.com/AliyunContainerService/GPUshare-scheduler-extender","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.Nvidia vGPU:https://docs.Nvidia.com/grid/latest/grid-vGPU-user-guide/index.html","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Nvidia MPS:https://docs.Nvidia.com/deploy/mps/index.html","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章