DAY80: Reading Compute Capability 3.x

H.3. Compute Capability 3.x

H.3.1. Architecture

A multiprocessor consists of:

  • 192 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
  • 32 special function units for single-precision floating-point transcendental functions,
  • 4 warp schedulers.

When a multiprocessor is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues two independent instructions for one of its assigned warps that is ready to execute, if any.

A multiprocessor has a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory.

There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors. The L1 cache is used to cache accesses to local memory, including temporary register spills. The L2 cache is used to cache accesses to local and global memory. The cache behavior (e.g., whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory in both L1 and L2 via compiler options.

The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory and 16 KB of L1 cache, or as 16 KB of shared memory and 48 KB of L1 cache, or as 32 KB of shared memory and 32 KB of L1 cache, using cudaFuncSetCacheConfig()/cuFuncSetCacheConfig().

The default cache configuration is "prefer none," meaning "no preference." If a kernel is configured to have no preference, then it will default to the preference of the current thread/context, which is set using cudaDeviceSetCacheConfig()/cuCtxSetCacheConfig() (see the reference manual for details). If the current thread/context also has no preference (which is again the default setting), then whichever cache configuration was most recently used for any kernel will be the one that is used, unless a different cache configuration is required to launch the kernel (e.g., due to shared memory requirements). The initial configuration is 48 KB of shared memory and 16 KB of L1 cache.
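As a minimal sketch of the configuration calls mentioned above (the kernel and buffer here are hypothetical placeholders), the per-kernel preference is set with cudaFuncSetCacheConfig() and the per-device fallback with cudaDeviceSetCacheConfig():

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data)          // hypothetical kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;                           // trivial work
    }

    int main()
    {
        // Per-kernel preference: ask for 48 KB L1 / 16 KB shared memory for myKernel.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        // Per-device preference, used when a kernel is left at "prefer none".
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        float *d;
        cudaMalloc(&d, 256 * sizeof(float));
        myKernel<<<1, 256>>>(d);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }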

Note: Devices of compute capability 3.7 add an additional 64 KB of shared memory to each of the above configurations, yielding 112 KB, 96 KB, and 80 KB shared memory per multiprocessor, respectively. However, the maximum shared memory per thread block remains 48 KB.

Applications may query the L2 cache size by checking the l2CacheSize device property (see Device Enumeration). The maximum L2 cache size is 1.5 MB.
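A minimal sketch of that query (device 0 assumed, no error checking):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                     // device 0
        printf("L2 cache size: %d bytes\n", prop.l2CacheSize); // up to 1.5 MB on 3.x
        return 0;
    }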

Each multiprocessor has a read-only data cache of 48 KB to speed up reads from device memory. It accesses this cache either directly (for devices of compute capability 3.5 or 3.7), or via a texture unit that implements the various addressing modes and data filtering mentioned in Texture and Surface Memory. When accessed via the texture unit, the read-only data cache is also referred to as texture cache.

H.3.2. Global Memory

Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5 or 3.7, may also be cached in the read-only data cache described in the previous section; they are normally not cached in L1. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory accesses in L1 via the -Xptxas -dlcm=ca option to nvcc.

A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently:

  • Two memory requests, one for each half-warp, if the size is 8 bytes,
  • Four memory requests, one for each quarter-warp, if the size is 16 bytes.

Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.

Note that threads can access any words in any order, including the same words.

If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.
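To make the difference concrete, here is a small hedged sketch (kernel names are mine) contrasting that undefined "one writer wins" behavior with an atomic alternative for cases where every contribution must be observed:

    __global__ void lastWriterWins(int *slot)
    {
        // Every thread of the warp writes the same address: only one write
        // is performed, and which thread's value survives is undefined.
        *slot = threadIdx.x;
    }

    __global__ void countWriters(int *counter)
    {
        // If each thread's contribution must be kept, use an atomic instead.
        atomicAdd(counter, 1);
    }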

Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
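A minimal sketch of both hints mentioned above (kernel and parameter names are my own): the const __restrict__ qualifiers help the compiler prove the data is read-only, and __ldg() routes the load through the read-only data cache explicitly:

    __global__ void scale(const float * __restrict__ in,   // read-only hint for the compiler
                          float       * __restrict__ out,
                          float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __ldg(&in[i]) * factor;                // explicit read-only cache load
    }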

Figure 18 shows some examples of global memory accesses and corresponding memory transactions.

Figure 18. Examples of Global Memory Accesses by a Warp, 4-Byte Word per Thread, and Associated Memory Transactions for Compute Capabilities 3.x and Beyond

H.3.3. Shared Memory

Shared memory has 32 banks with two addressing modes that are described below.

The addressing mode can be queried using cudaDeviceGetSharedMemConfig() and set using cudaDeviceSetSharedMemConfig() (see reference manual for more details). Each bank has a bandwidth of 64 bits per clock cycle.
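A minimal sketch of those two calls (default device, no error checking):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaSharedMemConfig cfg;
        cudaDeviceGetSharedMemConfig(&cfg);
        printf("current bank width mode: %d\n", (int)cfg);

        // Switch to eight-byte banks (64-bit mode); four-byte banks are
        // requested with cudaSharedMemBankSizeFourByte.
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        return 0;
    }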

Figure 19 shows some examples of strided access.

Figure 20 shows some examples of memory read accesses that involve the broadcast mechanism.

64-Bit Mode

Successive 64-bit words map to successive banks.

A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 64-bit word (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 64-bit word is broadcast to the requesting threads and for write accesses, each sub-word is written by only one of the threads (which thread performs the write is undefined).
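As a hedged illustration (array size and kernel name are mine), in this 64-bit mode a warp reading consecutive double elements touches 32 different banks, so the stride-1 access below is conflict-free:

    __global__ void sumDoubles(const double *in, double *out)
    {
        __shared__ double tile[256];                 // hypothetical tile size

        int t = threadIdx.x;
        tile[t] = in[blockIdx.x * blockDim.x + t];   // successive doubles -> successive banks
        __syncthreads();

        if (t == 0) {                                // trivial reduction, just to use the data
            double s = 0.0;
            for (int i = 0; i < blockDim.x; ++i)
                s += tile[i];
            out[blockIdx.x] = s;
        }
    }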

32-Bit Mode

Successive 32-bit words map to successive banks.

A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j = i + 32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads and for write accesses, each sub-word is written by only one of the threads (which thread performs the write is undefined).

Author's notes / experience sharing:

Today we start with Kepler and go through, in some detail, the main differences between the GPU generations that are still in common use and how they evolved. First of all, from this generation onward single-precision floating-point performance has not fundamentally changed: the old Kepler Titan (compute capability 3.x) already reached roughly 8 TFLOPS of theoretical single-precision performance, and today's RTX 2080 (compute capability 7.5) still sits between 8 and 10 TFLOPS (the latter with Boost). In other words, if all you care about is single-precision throughput, then leaving aside changes in process node and power consumption, not much has really changed from Kepler 3.x to today; users should keep this in mind first. Across the four generations 3.x/5.x/6.x/7.x, NVIDIA has instead focused on adding new features (e.g., half precision starting with 6.0), improving efficiency (e.g., better performance per watt on the same process node starting with 5.0), and features that follow current trends (e.g., __dp4a starting with 6.x, and two generations of Tensor Cores starting with 7.x). Still, it cannot be denied that Kepler was a pioneering, experimental generation: many of the features it introduced still play an important role today and have gradually become basic building blocks of CUDA. Let's look briefly, at the whole-GPU level and inside the SM, at what Kepler brought us.

First and most important, compute capability 3.x brought a process improvement (28 nm) and introduced software scheduling, cutting power consumption dramatically compared with the preceding Fermi generation. It may be hard to imagine now, but in the Fermi era a card with only about 400 SPs (e.g., the GTX 480) could draw well over 200 W; people joked that you could fry an egg on a GTX 480, much as with Intel's failed Pentium 4 generation. By compute capability 3.x (Fermi was 2.x), a GTX 780 with more than 2300 SPs drew only a little over 200 W, a very visible improvement. This generation attacked power consumption from several angles, and the results were significant; the reason it still left a poor impression is that some things were left unbalanced, which cost performance. Above all, each SM packs 192 SPs while the register file is only 256 KB (64K 4-byte registers). To put that in perspective: the next generation, 5.x, puts only 128 SPs on the same resources, compute capability 6.0 has just 64 SPs per SM, and even the RTX 2080 (7.5) selling so well on JD.com today also has 64. So the number of execution units (SPs) looks impressively large, but the resources available to feed them are comparatively small. In many cases the 192 SPs of a 3.x SM cannot be kept fed and deliver only the performance of roughly 128 SPs (about 66.7%). This is the main thing this generation was criticized for; almost everything else was quite good. If people were less greedy and compared it with the previous Fermi generation instead, then even at its worst the SP count achieved at the same power level was a huge step forward and still worth respecting. That is the most distinctive trait of this generation.

Second, this generation made many foundational contributions to CUDA. For example, dynamic parallelism was introduced here (compute capability 3.5), so you are no longer limited to scheduling work from the CPU: the GPU can schedule work for itself, which makes writing code much more convenient and makes applications that used to be hard or awkward to parallelize feasible. (See the earlier chapter on dynamic parallelism; this is an important feature, and we have Kepler to thank for it.)

Next, this generation introduced Hyper-Q, a concept that, like dynamic parallelism, has become part of the foundation of using CUDA today. Before this generation (more precisely, before 3.5, the second Kepler generation) there was only one hardware queue, so in many scenarios using multiple streams had no effect (see the earlier chapter on streams). As a result, many CUDA reference books introduced various workarounds, such as depth-first versus breadth-first issue order for multiple streams (i.e., issue all the work in one stream before moving on to the next, versus issuing one task per stream and immediately switching to the next). Back then, different issue orders, all nominally using multiple streams, often made the benefit of multiple streams disappear, and users had to experiment repeatedly with how they issued commands. Kepler introduced multiple hardware task queues, the thing called Hyper-Q, so that almost any issue order tends to achieve good multi-stream concurrency, which greatly eased the burden on CUDA programmers. This was a big improvement (see our earlier Hyper-Q chapter for details), and today the feature sits quietly inside every NVIDIA card you use; it has become a standard part of CUDA.

Beyond these features, this generation also introduced software scheduling: the instructions themselves carry scheduling information that the hardware used to be responsible for, which saves a great many transistors and lowers power consumption. (It is not that prominent in this generation, but it laid the groundwork that Maxwell (5.x) built on, and it has been carried forward through 6.x and 7.0/7.5; the foundation was laid by Kepler, and it is why NVIDIA cards are so power-efficient today.) Around the same time as Kepler, the competitor AMD abandoned software scheduling (its VLIW design) in an attempt to raise performance and has ended up today with cards that run like furnaces, while NVIDIA, starting in this period, gave up its original hardware scheduling in favor of software scheduling, lowered its power consumption, and became the refrigerator (just kidding). The two companies roughly parted ways at this point, AMD becoming more NVIDIA-like and NVIDIA more AMD-like, an irony that nevertheless has to be acknowledged. It should also be noted that, to support this change, transistors were allocated as much as possible to execution units rather than scheduling units (only execution units deliver performance; schedulers only schedule). In terms of employees and managers, NVIDIA's Kepler is like a company that hired a huge number of employees and only a few managers.

In practice, the combination discussed in this chapter, 192 SPs vs. 4 schedulers vs. 256 KB of registers, is not enough to keep all those SPs (the "employees") fully busy. Many comprehensive benchmarks show that the 192 SPs often deliver only the performance of about 128. In other words, Kepler can in principle reach peak, but it is hard; users should assume they get roughly 128/192 of the performance (in large part because of register bank conflicts). Alibaba once showed at GTC a Kepler assembler they wrote themselves to help users squeeze out as much performance as possible; they have also offered that assembler, with detailed documentation, to users of their Alibaba Cloud GPU instances. From today's vantage point, the biggest difference between Kepler and the later generations is that Kepler's schedulers need cooperation from the register file and must dual-issue as much as possible (at least about 50% of issues need to be dual-issued) to have any chance of reaching peak, whereas on later generations single issue alone is enough to saturate the SPs and dual issue is merely optional. That is a big difference.

Beyond the differences in instruction scheduling (the first part of this chapter) and the explosion in SP count, Kepler also made some improvements that ordinary users may find handy. For example, shared memory banks were widened (later in this chapter) from 4 B per bank to 8 B, which can double read throughput in some special cases, such as operations on double, and can also speed up things like indexed color in image processing or AES encryption/decryption. Perhaps because it saw little use, this was dropped in later generations, but this chapter still describes it, because if you have such a card (Kepler, 3.x), using the 8-byte width appropriately can often speed up your code.

This generation also reworked the L1 cache, shared memory, and texture cache, starting a long process of splitting them apart and merging them back together. (Yes, NVIDIA changed it one way and then changed it back; it is a bit dizzying.) First, from this generation on, L1 no longer serves as a general read/write cache the way it did on earlier GPUs. When Kepler 3.0 first came out, everyone sighed that we were back in the days of compute capability 1.x. Why? Because on 3.0, if you want to use the data cache inside the SM (the read-only cache here), you must go through texture accesses. This sparked endless arguments on the forums, whether textures or direct reads are better and which is faster, and it is also why the treatment of textures in many CUDA books on the market is such a mess. Users should know that on a compute capability 3.0 card you should consider using textures; otherwise your ordinary reads and writes cannot use the cache inside the SM and only get the global, second-level L2. That is why this chapter emphasizes it. NVIDIA soon noticed that users did not take to this, so later in this chapter it explains how, on second-generation Kepler, you can opt back in to using L1. More importantly for ordinary users, the __ldg() function is provided for read-only data; on compute capability 3.5+/5.x/6.x it gives many applications a noticeable speedup (I do not yet know the situation on 7.x). Because from 3.5 until 7.0 this Kepler behavior was largely carried over, in many cases, unless you do something special (as introduced in this chapter), L1 (or the unified cache of later generations) is effectively disabled for global loads and performance suffers. This is especially painful for small lookup tables: if you do not place them in shared memory by hand, a direct table lookup is essentially going straight to the global L2, and performance can drop sharply. Users need to watch out for this. You can think of __ldg() as an invisible texture that automatically covers all of device memory: on compute capability 3.5+ it lets you use the texture (or equivalent) cache without hand-writing texture access code, which is very convenient. (Of course, you only get the texture caching benefit; none of the other texture features come along. See our earlier texture chapter.) It should also be said that this generation's L1 (merged with shared memory) is the last one whose transaction size differs from L2's: L1 uses 128-byte transactions while L2's minimum transaction is 32 bytes, so using L1 inappropriately (e.g., opting in to general read/write caching on 3.5+) can backfire through over-fetch and lost performance. For details, skim the description in this chapter or search for more (there is plenty online, especially in GTC slide decks over the years).

This generation also introduced support for rotate operations, so that in the right situations the old three-step sequence for a rotation, shift + reverse shift + logical OR, becomes a single instruction (see the sketch after these notes). Note that the instruction is not full rate; rotation only became full rate with today's consumer 2080. Rotations are widely used in many applications, especially hashing and cryptography. Concretely, NVIDIA implements this through the funnel shift (so named because the inputs and output are related like the wide and narrow mouths of a funnel); rotation is only one use of the funnel shift, which has others as well (e.g., splicing two 4-byte words together).

Finally, this generation's shared memory is the last with the traditional NVIDIA style of atomic operations: atomics here go through a read / lock / compute-on-the-SP / write-back-and-unlock sequence, whereas from Maxwell onward NVIDIA adopted the AMD-style approach in which shared memory performs the computation itself (yes, on later generations the shared memory storage has compute capability of its own). NVIDIA once called this "remote atomics" in a GTC talk, so if you see that term, do not be surprised: it is simply today's ordinary shared memory atomic. This approach has many benefits, which we will explain tomorrow when we reach the chapter on modern NVIDIA cards (Maxwell). But there is no denying that the foundations of many of the features we use today come from Kepler.
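As a small hedged illustration of the funnel-shift point in the notes above (the helper name is mine), on compute capability 3.2 and later a 32-bit left rotation collapses into a single __funnelshift_l() intrinsic:

    __device__ unsigned int rotl32(unsigned int x, unsigned int n)
    {
        // Classic three-step form: (x << n) | (x >> (32 - n)).
        // Funnel-shift form: feed x as both the low and high input word.
        return __funnelshift_l(x, x, n);
    }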

If anything is unclear, please leave a comment below this article,

or post on our technical forum at bbs.gpuworld.cn.
