What CUDA Compute Capability Means

When learning GPU programming we constantly run into the term compute capability. So what exactly is it?

Compute Capability

Compute capability is not an absolute measure of how powerful a GPU is. Strictly speaking, it is the version number of a GPU architecture. It is also not the version of the CUDA software platform (e.g., CUDA 7.0 or CUDA 8.0).

Take the TX1, whose compute capability is 5.3. This actually means:

5 is the SM major version number, indicating the Maxwell architecture.

3 is the SM minor version number, indicating incremental optimizations and features on top of that architecture.
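An application can query the compute capability of each device at runtime through the CUDA runtime API; `cudaGetDeviceProperties()` fills in the major and minor numbers. A minimal host-side sketch (compile with nvcc; requires a CUDA-capable device to run):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor are the X and Y of compute capability X.Y
        printf("Device %d (%s): compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

On a TX1 this would report compute capability 5.3.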

As the official documentation puts it (Section 2.5):

Compute Capability

The compute capability of a device is represented by a version number, also sometimes called its "SM version". This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.

The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.

Devices with the same major revision number are of the same core architecture. The major revision number is 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, 3 for devices based on the Kepler architecture, 2 for devices based on the Fermi architecture, and 1 for devices based on the Tesla architecture.

The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.

CUDA-Enabled GPUs lists of all CUDA-enabled devices along with their compute capability. Compute Capabilities gives the technical specifications of each compute capability.

Note: The compute capability version of a particular GPU should not be confused with the CUDA version (e.g., CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation. The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.


Features of Different SM Versions

With the above, the meaning of compute capability should be clear. So what do these version numbers actually represent?

Each compute capability has its own characteristics. What exactly do the major and minor version numbers change at the hardware level?

The tables below, from the programming guide, show how the versions differ. First, the differences in floating-point capability:

[Table: floating-point capability per compute capability; image not preserved]

Next, the differences in technical specifications (the bottom of the table is cut off in the screenshot; the specifications below it are identical across versions):

[Table: technical specifications per compute capability; image not preserved]

The differences between compute capabilities are clear at a glance.

Examples of Typical Versions

Compute Capability 3.x

Architecture

A multiprocessor consists of:
192 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
32 special function units for single-precision floating-point transcendental functions,
4 warp schedulers.

When a multiprocessor is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues two independent instructions for one of its assigned warps that is ready to execute, if any.

A multiprocessor has a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory.

There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors. The L1 cache is used to cache accesses to local memory, including temporary register spills. The L2 cache is used to cache accesses to local and global memory. The cache behavior (e.g., whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory in both L1 and L2 via compiler options.

The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory and 16 KB of L1 cache or as 16 KB of shared memory and 48 KB of L1 cache or as 32 KB of shared memory and 32 KB of L1 cache, using cudaFuncSetCacheConfig()/cuFuncSetCacheConfig():

// Device code
__global__ void MyKernel()
{
    ...
}

// Host code
// Runtime API
// cudaFuncCachePreferShared: shared memory is 48 KB
// cudaFuncCachePreferEqual:  shared memory is 32 KB
// cudaFuncCachePreferL1:     shared memory is 16 KB
// cudaFuncCachePreferNone:   no preference
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);


Each multiprocessor has its own L1 cache, which shares on-chip memory with shared memory, while the L2 cache is shared by all multiprocessors and caches accesses to local and global memory. The split between L1 cache and shared memory is configurable, as shown above.

I have not yet fully understood the later material on memory and caching.

Compute Capability 5.x

Architecture

A multiprocessor consists of:
128 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
32 special function units for single-precision floating-point transcendental functions,
4 warp schedulers.
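Besides the runtime query, device code can also be specialized per compute capability at compile time: when compiling device code, nvcc defines the macro `__CUDA_ARCH__` as XY0 for compute capability X.Y (e.g., 530 for 5.3). A minimal sketch:

```cuda
__global__ void kernel(float *out)
{
#if __CUDA_ARCH__ >= 530
    // Path compiled only for compute capability 5.3 and above
    out[0] = 1.0f;
#else
    // Fallback for older architectures
    out[0] = 0.0f;
#endif
}
```

The target architecture is selected at compile time, e.g., `nvcc -arch=sm_53`.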



The latest table of compute capabilities can be found here:

https://developer.nvidia.com/cuda-gpus

