An Introduction to Basic CUDA Concepts

CUDA Concept

In November 2006, NVIDIA introduced CUDA®, a general purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

Main Concepts of CUDA

Host

    The CPU together with the system's memory (RAM) is referred to as the host.

Device

    The GPU together with its own video memory is referred to as the device.

Thread

    Generally processed by a single GPU core.

Thread Block

    1. Composed of multiple threads (arranged in one, two, or three dimensions); a collection of threads is called a thread block.
    2. Blocks execute in parallel; blocks cannot communicate with one another and have no guaranteed execution order.
    3. Note that the number of thread blocks is limited to 65535 (a hardware limit, per grid dimension on older devices).

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

Grid

    Composed of multiple thread blocks (arranged in one, two, or three dimensions); a collection of thread blocks is called a grid.

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.

Warp

In the CUDA architecture, a warp is a group of 32 threads that are "woven together" and execute in lockstep. At each line of the program, every thread in the warp executes the same instruction, but on different data.

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.

Kernel

1. A function executed on the GPU is usually called a kernel.
2. A kernel is normally declared with the __global__ qualifier and launched with <<<param1, param2>>>, which describes how many threads the kernel uses and how they are organized. The arguments inside the angle brackets are passed to the runtime system to configure how threads and blocks are organized on the GPU and to tell the runtime how to launch the device code; they are not arguments passed to the device code itself. The arguments the kernel needs are passed inside the parentheses, exactly like a normal function call. param1 specifies the number of parallel thread blocks the GPU uses when executing the kernel; param2 specifies how many threads each block contains.
3. Kernels are organized as a grid; each grid consists of a number of thread blocks, and each block consists of a number of threads.
4. Execution is carried out in units of blocks.
5. A kernel can only be called from host code.
6. When a kernel is called, its execution configuration must be specified.
7. Before calling a kernel, enough space must be allocated for every array or variable it uses; otherwise the GPU computation will fail with errors such as out-of-bounds accesses, and may even cause a blue screen or system freeze.
8. The CUDA compiler and runtime take care of invoking device code from host code.

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel is defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax (see C Language Extensions). Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

As an illustration, the following sample code adds two vectors A and B of size N and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

The dim3 Type

1. dim3 is a vector type defined on top of uint3, equivalent to a struct of three unsigned ints. The uint3 type has three data members: unsigned int x; unsigned int y; unsigned int z;
2. It can be used with one-, two-, or three-dimensional indices to identify threads, forming one-, two-, or three-dimensional thread blocks.
3. Variables of type dim3 are used inside the <<<,>>> of a kernel call.
4. For a one-dimensional block, the thread ID is `threadID = threadIdx.x`.
5. For a two-dimensional block of size (blockDim.x, blockDim.y), the thread ID is `threadID = threadIdx.x + threadIdx.y * blockDim.x`.
6. For a three-dimensional block of size (blockDim.x, blockDim.y, blockDim.z), the thread ID is `threadID = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y`.
7. When stepping through data, the index increment is the total number of launched threads, e.g. `stride = blockDim.x * gridDim.x; threadId += stride;`.

This type is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
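A minimal sketch of the formulas above (the kernel name, buffer, and sizes are illustrative, not from the original post): dim3 variables set up a two-dimensional block in the execution configuration, and each thread applies the item-5 formula to compute its ID within its block.

__global__ void blockThreadId(int* out)
{
    // Item 5: thread ID within a 2D block.
    int threadID = threadIdx.x + threadIdx.y * blockDim.x;
    // One output slot per thread across the whole grid.
    out[blockIdx.x * blockDim.x * blockDim.y + threadID] = threadID;
}

int main()
{
    dim3 block(16, 16);   // blockDim = (16, 16, 1); unspecified components default to 1
    dim3 grid(4);         // gridDim  = (4, 1, 1)
    int* d_out = NULL;
    cudaMalloc((void**)&d_out, 4 * 16 * 16 * sizeof(int));
    blockThreadId<<<grid, block>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}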

Thread Divergence

When some threads need to execute an instruction while other threads do not, the phenomenon is called thread divergence.

__syncthreads()

Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed.

The CUDA architecture guarantees that no thread in a block executes any instruction past __syncthreads() until every thread in that block has executed __syncthreads(). Unfortunately, if __syncthreads() sits inside a divergent branch, some threads will never reach it. Because the hardware must make sure every thread has executed __syncthreads() before any thread may continue, it keeps the waiting threads waiting, and waiting, forever.
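A minimal sketch of this hazard (the kernels and data are illustrative): the first kernel places __syncthreads() inside a divergent branch, so the threads that reach the barrier wait forever for those that never will; the second keeps the barrier outside the branch.

// WRONG: only even-numbered threads reach the barrier, so the block hangs.
__global__ void badSync(float* data)
{
    int i = threadIdx.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;
        __syncthreads();   // never reached by odd-numbered threads
    }
}

// CORRECT: do the divergent work first, then synchronize unconditionally.
__global__ void goodSync(float* data)
{
    int i = threadIdx.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;
    }
    __syncthreads();       // reached by every thread in the block
}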

threadIdx (thread index)

The index of a thread within its block. For a one-dimensional block use threadIdx.x; a two-dimensional block additionally provides threadIdx.y, and a three-dimensional block threadIdx.z as well.
This variable is of type uint3 (see char, short, int, long, longlong, float, double ) and contains the thread index within the block.

blockIdx (block index)

The index of a thread block within the grid; likewise blockIdx.x, blockIdx.y, and blockIdx.z.
This variable is of type uint3 (see char, short, int, long, longlong, float, double) and contains the block index within the grid.

gridDim (grid dimensions)

The dimensions of the grid; likewise gridDim.x, gridDim.y, and gridDim.z.
This variable is of type dim3 (see dim3) and contains the dimensions of the grid.

blockDim (block dimensions)

This variable is of type dim3 (see dim3) and contains the dimensions of the block.

warpSize (warp size)

This variable is of type int and contains the warp size in threads (see SIMT Architecture for the definition of a warp).
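Putting the built-in variables together, a common pattern (a sketch; the vector-add body is only for illustration) computes a global thread index and a grid-wide stride, as in item 7 of the dim3 list above:

__global__ void vecAddStride(const float* A, const float* B, float* C, int n)
{
    // Global index of this thread across the entire grid.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Total number of launched threads = the stride between elements
    // handled by the same thread.
    int stride = blockDim.x * gridDim.x;

    for (int i = tid; i < n; i += stride) {
        C[i] = A[i] + B[i];
    }
}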

Function Type Qualifiers(函數修飾符)

Function type qualifiers specify whether a function executes on the host or on the device and whether it is callable from the host or from the device.


__global__: indicates that the decorated function executes on the device but is called from the host.
The __global__ qualifier declares a function as being a kernel. Such a function is:

  • Executed on the device,
  • Callable from the host,
  • Callable from the device for devices of compute capability 3.x (see CUDA Dynamic Parallelism for more details).

NOTE:
__global__ functions must have void return type.

Any call to a __global__ function must specify its execution configuration as described in Execution Configuration.
A call to a __global__ function is asynchronous, meaning it returns before the device has completed its execution.


__device__: indicates that the decorated function executes on the device and can only be called from other __device__ functions or from __global__ functions.
The __device__ qualifier declares a function that is:

  • Executed on the device,
  • Callable from the device only.

__host__
The __host__ qualifier declares a function that is:

  • Executed on the host,
  • Callable from the host only.

It is equivalent to declare a function with only the __host__ qualifier or to declare it without any of the __host__, __device__, or __global__ qualifier; in either case the function is compiled for the host only.


NOTE:
The __global__ and __host__ qualifiers cannot be used together.

The __device__ and __host__ qualifiers can be used together however, in which case the function is compiled for both the host and the device. The __CUDA_ARCH__ macro introduced in Application Compatibility can be used to differentiate code paths between host and device:

__host__ __device__ void func()
{
#if __CUDA_ARCH__ >= 500
    // Device code path for compute capability 5.x
#elif __CUDA_ARCH__ >= 300
    // Device code path for compute capability 3.x
#elif __CUDA_ARCH__ >= 200
    // Device code path for compute capability 2.x
#elif !defined(__CUDA_ARCH__)
    // Host code path
#endif
}

__noinline__ and __forceinline__

The compiler inlines any device function when deemed appropriate.

The __noinline__ function qualifier can be used as a hint for the compiler not to inline the function if possible. The function body must still be in the same file where it is called.

The __forceinline__ function qualifier can be used to force the compiler to inline the function.

Variable Type Qualifiers(變量類型修飾符)

Variable type qualifiers specify the memory location on the device of a variable.

An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory, which can have adverse performance consequences as detailed in Device Memory Accesses.


__device__

The __device__ qualifier declares a variable that resides on the device.

At most one of the other type qualifiers defined in the next two sections may be used together with __device__ to further specify which memory space the variable belongs to. If none of them is present, the variable:

  • Resides in global memory space.
  • Has the lifetime of an application.
  • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

__constant__

The __constant__ qualifier, optionally used together with __device__, declares a variable that:

  • Resides in constant memory space,
  • Has the lifetime of an application,
  • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

__shared__

The __shared__ qualifier, optionally used together with __device__, declares a variable that:

  • Resides in the shared memory space of a thread block,
  • Has the lifetime of the block,
  • Is only accessible from all the threads within the block.

When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];

the size of the array is determined at launch time (see Execution Configuration)
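As an illustration (the kernel and sizes are hypothetical, not taken from the Programming Guide), the size in bytes of such an extern __shared__ array is passed as the third execution-configuration parameter:

// Reverses an array of n floats, assuming the kernel is launched with
// exactly n threads in one block.
__global__ void reverse(float* d_out, const float* d_in, int n)
{
    extern __shared__ float shared[];   // size chosen at launch time
    int i = threadIdx.x;
    shared[i] = d_in[i];
    __syncthreads();
    d_out[i] = shared[n - 1 - i];
}

// Third execution-configuration parameter = dynamic shared memory in bytes:
// reverse<<<1, n, n * sizeof(float)>>>(d_out, d_in, n);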


__managed__

The __managed__ qualifier, optionally used together with __device__, declares a variable that:

  • Can be referenced from both device and host code, e.g., its address can be taken or it can be read or written directly from a device or host function.
  • Has the lifetime of an application.

__restrict__

nvcc supports restricted pointers via the __restrict__ keyword.

Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, and which inhibits all kind of optimization from code re-ordering to common sub-expression elimination.

Here is an example subject to the aliasing issue, where the use of restricted pointers can help the compiler to reduce the number of instructions:
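The example itself is not reproduced in these notes; the following sketch is modeled on the Programming Guide's illustration. Without __restrict__ the compiler must assume a, b, and c may alias, so it cannot keep a[0] * b[0] in a register across the assignments; the restricted version allows that common sub-expression to be computed once.

__device__ void foo(const float* __restrict__ a,
                    const float* __restrict__ b,
                    float* __restrict__ c)
{
    // Because a, b, and c are promised never to overlap, the compiler may
    // load a[0], a[1], and b[0] once and reuse the products below.
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
}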

Memory Model

Parent and child grids share the same global and constant memory storage, but have distinct local and shared memory.

Global Memory

In everyday terms, this is simply device memory.
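A minimal sketch of the usual global-memory workflow (buffer names and sizes are illustrative): allocate with cudaMalloc(), move data with cudaMemcpy(), and release with cudaFree().

int n = 1024;
size_t bytes = n * sizeof(float);
float* h_data = (float*)malloc(bytes);   // pageable host buffer
float* d_data = NULL;                    // global (device) memory buffer

cudaMalloc((void**)&d_data, bytes);                          // allocate global memory
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that read and write d_data ...
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_data);
free(h_data);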

Zero Copy Memory

Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy passed in from the host program.

This type of host memory can be accessed directly from CUDA C kernels. Because the data never has to be copied to the GPU, it is also called zero-copy memory. Zero-copy memory is built on pinned (page-locked) memory, a kind of host memory that is guaranteed never to be swapped out of physical memory. It is allocated by calling cudaHostAlloc().

To use the zero-copy mechanism, first check whether the device supports mapping host memory.

After host memory has been allocated with the cudaHostAllocMapped flag, it can be accessed from the GPU. However, the GPU's virtual address space is different from the CPU's, so the memory has a different address on the GPU than on the CPU. cudaHostAlloc() returns the CPU-side pointer, so cudaHostGetDevicePointer() must be called to obtain a valid GPU-side pointer to the same memory.

Call cudaThreadSynchronize() (superseded by cudaDeviceSynchronize() in current CUDA releases) to synchronize the CPU and GPU and guarantee the consistency of zero-copy memory.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    int deviceID;
    cudaGetDevice(&deviceID);
    cudaGetDeviceProperties(&prop, deviceID);

    // Zero-copy memory requires a device that can map host memory.
    if (prop.canMapHostMemory != 1) {
        printf("Device cannot map memory...\n");
        return 0;
    }
    // Overlapping copies with kernel execution requires device overlap support.
    if (!prop.deviceOverlap) {
        printf("Device will not handle overlaps, so no speed up from streams\n");
        return 0;
    }
    else {
        printf("Device can handle overlaps, so there are speed ups from streams\n");
    }
    system("pause");
    return 0;
}
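Once the device passes the checks above, a zero-copy buffer can be allocated and mapped roughly as follows (a sketch; the kernel launch and size are hypothetical):

const int N = 1024;
float *h_a = NULL, *d_a = NULL;

// Must be set before any allocations if mapped host memory will be used.
cudaSetDeviceFlags(cudaDeviceMapHost);

// Pinned, mapped (zero-copy) host memory.
cudaHostAlloc((void**)&h_a, N * sizeof(float), cudaHostAllocMapped);

// Device-side pointer to the same physical memory.
cudaHostGetDevicePointer((void**)&d_a, h_a, 0);

// Kernels can use d_a directly; no cudaMemcpy() is required.
// someKernel<<<blocks, threads>>>(d_a, N);

// Synchronize so the host sees the GPU's writes (zero-copy consistency).
cudaDeviceSynchronize();

cudaFreeHost(h_a);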

For an integrated GPU, using zero-copy memory usually improves performance, because the memory is physically shared with the host; declaring a buffer as zero-copy memory simply avoids an unnecessary copy.

Zero-copy memory is no exception to the cost of pinning: every pinned allocation consumes available physical memory, which eventually degrades system performance.

When the input and output buffers are each used exactly once, zero-copy memory can also deliver a performance gain on a discrete GPU.

However, because the GPU does not cache zero-copy memory, reading the same memory many times will ultimately cost more than it saves.

Each GPU has its own host thread.

Constant Memory

Constants are immutable(不可變的) and may not be modified from the device, even between parent and child launches. That is to say, the value of all __constant__ variables must be set from the host prior to launch. Constant memory is inherited automatically by all child kernels from their respective parents.

In fact, it was precisely this computational power that prompted people to investigate how to run general-purpose computations on graphics processors. Because a GPU contains hundreds of arithmetic units, the performance bottleneck is usually not the chip's arithmetic throughput but its memory bandwidth. Besides global memory and shared memory, CUDA C supports another kind of memory: constant memory. As its name suggests, constant memory holds data that does not change during kernel execution. Constant memory is handled differently from standard global memory, and in some situations replacing global memory with constant memory can effectively reduce the required memory bandwidth.

Compared with standard global memory, constant memory comes with some restrictions, but in certain situations it improves application performance. In particular, an extra performance gain is obtained when every thread in a warp accesses the same read-only data. This access pattern saves memory bandwidth both because a single read can be broadcast to a half-warp and because the chip provides an on-chip constant cache.

In many algorithms memory bandwidth is the bottleneck, so any mechanism that improves this situation is very useful.

Taking the address of a constant memory object from within a kernel thread has the same semantics as for all CUDA programs, and passing that pointer from parent to child or from a child to parent is naturally supported.
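A sketch of how constant memory is declared and filled from the host (the table and kernel are illustrative): the __constant__ variable lives at file scope and is written with cudaMemcpyToSymbol() before the kernel launches.

#define TABLE_SIZE 256

// Resides in constant memory: readable by every thread, written only from the host.
__constant__ float c_table[TABLE_SIZE];

__global__ void scaleByTable(float* data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] *= c_table[i % TABLE_SIZE];   // read-only lookup served by the constant cache
}

void setup(const float* h_table)
{
    // Copy host data into the __constant__ symbol before any launch that uses it.
    cudaMemcpyToSymbol(c_table, h_table, TABLE_SIZE * sizeof(float));
}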

Shared and Local Memory

Shared and Local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. Behavior is undefined when an object in one of these locations is referenced outside of the scope within which it belongs, and may cause an error.

The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or shared memory is being passed as an argument to a kernel launch. At runtime, the programmer may use the __isGlobal() intrinsic to determine whether a pointer references global memory and so may safely be passed to a child launch.

Note that calls to cudaMemcpy*Async() or cudaMemset*Async() may invoke new child kernels on the device in order to preserve stream semantics. As such, passing shared or local memory pointers to these APIs is illegal and will return an error.

Local Memory

Local memory is private storage for an executing thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined.

For example the following is illegal, with undefined behavior if x_array is accessed by child_launch:

int x_array[10];                    // Creates x_array in parent's local memory
child_launch<<< 1, 1 >>>(x_array);

It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with cudaMalloc(), new() or by declaring __device__ storage at global scope. For example:

// Correct - "value" is global storage 
__device__ int value; 
__device__ void x() 
{ 
    value = 5; 
    child<<< 1, 1 >>>(&value); 
}
// Invalid - "value" is local storage 
__device__ void y() { 
    int value = 5; 
    child<<< 1, 1 >>>(&value); 
}

Texture Memory

Writes to the global memory region over which a texture is mapped are incoherent with respect to texture accesses. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Similarly, writes to memory by a child will be reflected in the texture memory accesses by a parent, but only after the parent synchronizes on the child’s completion. Concurrent accesses by parent and child may result in inconsistent data.

The texture cache is designed for graphics applications whose memory access patterns exhibit a great deal of spatial locality. In a compute application this means that a thread is likely to read from addresses "very close" to the addresses read by its neighboring threads.

  1. Location: device memory.
  2. Purpose: to reduce memory requests and provide efficient memory bandwidth. Texture memory is designed for graphics applications with substantial spatial locality in their access patterns, meaning that a thread is likely to read from addresses "very close" to those read by neighboring threads.


3. Texture variables (references) must be declared as global variables at file scope.
4. Forms: one-dimensional texture memory and two-dimensional texture memory (a sketch of the one-dimensional flow follows this list).
4.1. One-dimensional texture memory
4.1.1. Declared with texture<Type>, e.g. texture<float> texIn.
4.1.2. Bound to texture memory with cudaBindTexture().
4.1.3. Read inside kernels with tex1Dfetch().
4.1.4. Unbound with cudaUnbindTexture().

4.2. Two-dimensional texture memory
4.2.1. Declared with texture<Type, Number>, e.g. texture<float, 2> texIn.
4.2.2. Bound to texture memory with cudaBindTexture2D().
4.2.3. Read with tex2D().
4.2.4. Unbound with cudaUnbindTexture().
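A sketch of the one-dimensional flow above, written with the legacy texture-reference API that these steps describe (buffer names are illustrative; recent CUDA releases have deprecated this API in favor of texture objects):

// Step 4.1.1: texture reference declared at file scope.
texture<float> texIn;

__global__ void copyThroughTexture(float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, i);    // step 4.1.3: read via the texture cache
}

void run(const float* d_in, float* d_out, int n)
{
    cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));    // step 4.1.2
    copyThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaUnbindTexture(texIn);                                 // step 4.1.4
}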

Pinned Memory

The CUDA runtime provides its own mechanism for allocating host memory: cudaHostAlloc(). There is in fact an important difference between memory allocated by malloc() and memory allocated by cudaHostAlloc(). The C library function malloc() allocates standard, pageable host memory, whereas cudaHostAlloc() allocates page-locked host memory. Page-locked memory, also called pinned memory or non-pageable memory, has one important property: the operating system will never page it out to disk, so it is guaranteed to stay resident in physical memory. The operating system can therefore safely let an application use the memory's physical address, because the memory will never be invalidated or relocated.

In fact, when a copy is performed from pageable memory, the CUDA driver still transfers the data to the GPU by DMA. The copy therefore happens twice: first the data is copied from pageable memory into a "staging" page-locked buffer, and then from that page-locked buffer into GPU memory.

Pinned memory is a double-edged sword. When you use it, you give up the benefits of virtual memory for those allocations: every page-locked buffer must be backed by physical memory that can never be swapped to disk. This means the system runs out of physical memory much faster than with standard malloc() calls, so the application may fail on machines with little physical memory, and it can hurt the performance of other applications running on the same system.
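A minimal sketch of a pinned allocation (the size is illustrative): cudaHostAlloc() takes the place of malloc(), and the buffer must later be released with cudaFreeHost().

const int N = 1 << 20;
float* h_buf = NULL;

// Page-locked (pinned) host memory; cudaHostAllocDefault requests no extra behavior.
cudaHostAlloc((void**)&h_buf, N * sizeof(float), cudaHostAllocDefault);

// ... use h_buf as the source or destination of (possibly asynchronous) copies ...

cudaFreeHost(h_buf);   // pinned memory is released with cudaFreeHost(), not free()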

Portable Pinned Memory

Pinned memory is "pinned" only with respect to a single CPU thread. That is, if one thread allocates pinned memory, the memory is page-locked only from the point of view of the thread that allocated it. If the pointer to that pinned memory is shared among CPU threads, the other threads will treat the buffer as standard, pageable memory.

Portable pinned memory means the buffer can be shared among multiple host threads, and every one of those threads treats it as pinned memory. This requires an additional flag: cudaHostAllocPortable.

Pinned memory → zero-copy memory → portable pinned memory (each builds on the previous)

Pinned memory: solves the problem of data being moved (paged out) by the operating system.
Zero-copy memory: solves the problem of copying data back and forth between the CPU and the GPU.
Portable pinned memory: solves the problem of sharing data across multiple GPUs (host threads).

Streams

Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).

CUDA streams play an important role in accelerating applications. A CUDA stream represents a queue of GPU operations, and the operations in the queue execute in a specified order: kernel launches, memory copies, event starts and stops, and so on. The order in which operations are added to a stream is the order in which they execute. Each stream can be viewed as a task on the GPU, and these tasks can execute in parallel (the execution of one stream is not affected by the other streams).

Any host memory pointer passed to cudaMemcpyAsync() must have been allocated with cudaHostAlloc(). In other words, only page-locked memory can be copied asynchronously.

To make sure the GPU has finished its computation and memory copies, the GPU must be synchronized with the host: the host waits for the GPU to finish before continuing. Call cudaStreamSynchronize() and specify the stream to wait on. Before relying on asynchronous operations, check whether the CUDA device supports overlap; if the device supports overlapping computation with memory copies, multiple streams can be used to run tasks on the device in parallel.

The hardware uses separate engines for memory copies and for kernel execution (a copy engine and a kernel execution engine), so the order in which operations are queued into streams affects how the CUDA driver schedules and executes them. Arranging that order well greatly improves how much the memory copies overlap with kernel execution.

By using multiple CUDA streams, the GPU can execute a kernel while also performing copies between the host and the GPU. Two factors need attention when doing this. First, the host memory must be allocated with cudaHostAlloc(), because the copies will be queued with cudaMemcpyAsync(), and asynchronous copies require pinned buffers. Second, the order in which the operations are added to the streams affects how well the memory copies and kernel executions overlap.

In general, work should be assigned to the streams in a breadth-first or round-robin manner.

  1. An aside: concurrency emphasizes running several different tasks within a very short span of time, whereas parallelism emphasizes running tasks at literally the same time.
  2. Task parallelism means executing two or more different tasks in parallel, rather than executing the same task over a large amount of data.
  3. Concept: a CUDA stream represents a queue of GPU operations that execute in a specified order. Operations such as kernel launches, memory copies, and event starts and stops can be added to a stream, and the order in which they are added is the order in which they execute. Each stream can be viewed as a task on the GPU, and these tasks can execute in parallel.
  4. Hardware prerequisite: the GPU must support device overlap, i.e., it can execute a kernel while performing a copy between the device and the host.
  5. Declaration and creation: declare with cudaStream_t stream;, create with cudaStreamCreate(&stream);.
  6. cudaMemcpyAsync(): as mentioned earlier in connection with cudaMemcpy(), this is a function that executes asynchronously. Calling cudaMemcpyAsync() merely places a request to perform a memory copy into the stream specified by the stream argument. When the call returns, there is no guarantee that the copy has even started, let alone finished. The only guarantee is that the copy will be performed before the next operation placed into the same stream. The host pointer passed to this function must have been allocated with cudaHostAlloc() (streams require pinned memory).
  7. Stream synchronization: coordinated with cudaStreamSynchronize().
  8. Stream destruction: before exiting the application, destroy the streams that were used to queue GPU operations by calling cudaStreamDestroy().
  9. For multiple streams (a combined sketch follows this list):

    • Remember to synchronize the streams.
    • When queueing operations into the streams, use a breadth-first rather than a depth-first order. In other words, do not add all of stream 0's operations first and then all of stream 1's, stream 2's, and so on; instead, interleave them: add the copy of buffer a to stream 0, then the copy of a to stream 1, and continue alternating in this way.
    • Keep firmly in mind that the order in which operations are queued into the streams affects how the CUDA driver schedules and executes those operations and streams.
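A combined sketch of the points above (buffer names, sizes, and someKernel are hypothetical): two streams are created, their pinned buffers are queued breadth-first with cudaMemcpyAsync(), and both streams are synchronized and destroyed at the end.

const int N = 1 << 20;
const size_t bytes = N * sizeof(float);

cudaStream_t stream[2];
float *h_in[2], *h_out[2], *d_in[2], *d_out[2];

for (int s = 0; s < 2; ++s) {
    cudaStreamCreate(&stream[s]);                                   // item 5
    cudaHostAlloc((void**)&h_in[s],  bytes, cudaHostAllocDefault);  // pinned buffers (item 6)
    cudaHostAlloc((void**)&h_out[s], bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_in[s],  bytes);
    cudaMalloc((void**)&d_out[s], bytes);
}

// Breadth-first queueing: interleave the two streams' operations.
for (int s = 0; s < 2; ++s)
    cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
for (int s = 0; s < 2; ++s)
    someKernel<<<N / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], N);   // hypothetical kernel
for (int s = 0; s < 2; ++s)
    cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);

for (int s = 0; s < 2; ++s) {
    cudaStreamSynchronize(stream[s]);   // item 7
    cudaStreamDestroy(stream[s]);       // item 8
}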

Restrictions on using device pointers:

  1. Pointers allocated with cudaMalloc() can be passed to functions that execute on the device.
  2. Pointers allocated with cudaMalloc() can be used to read and write memory from device code.
  3. Pointers allocated with cudaMalloc() can be passed to functions that execute on the host.
  4. Pointers allocated with cudaMalloc() cannot be used to read or write memory from host code.
  5. cudaMemcpyToSymbol() copies data into constant memory, whereas cudaMemcpy() copies into global memory.
  6. cudaMemset() operates on GPU memory, whereas memset() operates on host memory.

In short:
host pointers can only access memory from host code, and device pointers can only access memory from device code.


To a large extent, the prospects of GPU computing depend on whether massive parallelism can be uncovered in a wide range of problems.


NOTE:

  • Optimal performance is typically reached when the number of thread blocks is twice the number of multiprocessors in the GPU.
  • The first computation a kernel performs is working out the offset into its input data. Each thread's starting offset is some value between 0 and the number of threads minus 1; the offset is then advanced by the total number of launched threads.

Reference
1. CUDA C Programming Guide
2. CUDA introductory blog posts
3. CUDA U
4. CUDA from Beginner to Expert (CUDA從入門到精通)
5. Flynn's taxonomy of computer architectures: SISD, SIMD, MISD, MIMD
