GPU programming notes

cuda tutorial

These are some notes from my own learning process; I am still a rookie.
Sorry that all the images are broken; I will restore them bit by bit when I have time.
cuda programming guide
NVIDIA best practice guide
CS344 udacity website

  • got an overview of how to use a stereo camera based on Yingcai’s code

communication patterns

  • map
    One-to-one: every pixel runs the same function, e.g. multiply by 2.
  • gather
    Many-to-one, e.g. image blur.
  • scatter
    One thread tries to write to many memory locations; one-to-many, so writes may conflict.
  • stencil
    Seems to be a combination of the two above: gather from neighboring points.
  • transpose


memory model

Using global memory is much slower than using local and shared memory.

  • Variables defined in a kernel function are local variables.
  • Use shared memory.

Put frequently used data into shared memory!

synchronize

__shared__ int array[128];
int idx = threadIdx.x;
array[idx] = threadIdx.x;
// wait for every write op to complete
__syncthreads();
// read from shared mem into a local variable
int temp = array[idx + 1];
// wait for every read op to complete before overwriting
__syncthreads();
array[idx] = temp;
__syncthreads();

atomic ops

  • An atomic op looks like an ordinary variable update, but the read-modify-write happens as one indivisible operation (typically on global memory), so concurrent updates are not lost.
  • It will slow the app down, because conflicting updates to the same address are serialized.

g[i]=g[i]+1    ------>  atomicAdd(&g[i],1);

  • Can be used to realize push_back in a kernel function; see the sketch below.
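
A minimal sketch of that device-side push_back (the kernel and variable names are my own, not from the course): atomicAdd returns the old counter value, which becomes the calling thread's unique output slot.

__global__ void filter(const int *d_in, int n, int *d_out, int *d_len) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && d_in[i] > 0) {        // keep, say, only the positive elements
    int slot = atomicAdd(d_len, 1);  // reserve one output slot atomically
    d_out[slot] = d_in[i];           // the actual "push_back"
  }
}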

thrust

  • vector
thrust::device_vector<int> dv(100);
thrust::host_vector<int> hv(100, 25);
// cudaMemcpy() runs in the background
dv = hv;
  • Note: device_vector::push_back() cannot be used in a kernel function (example).

reduce

  1. Add the back half of the elements in each block to the front half.
  2. Then add the back quarter to the front quarter.
  3. Repeat until one element is left in every block; see the sketch below.
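
A minimal per-block sketch of this halving scheme (assuming blockDim.x is a power of two; the names are mine). It expects blockDim.x * sizeof(float) bytes of shared memory at launch and leaves one partial sum per block:

__global__ void reduce_sum(const float *d_in, float *d_out, int n) {
  extern __shared__ float sdata[];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  sdata[tid] = (i < n) ? d_in[i] : 0.f;
  __syncthreads();
  // each step folds the back half of the active range onto the front half
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0) d_out[blockIdx.x] = sdata[0]; // one partial sum per block
}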

scan

  • Applications of Scan
  1. Stream Compaction
  2. Summed-Area Tables:
    A summed-area table (SAT) is a two-dimensional table generated from an input image, in which each entry stores the sum of all pixels between the entry location and the lower-left corner of the input image. SATs are often used when running a box filter over an image.

serial implementation

  • inclusive
    int acc = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        acc = acc + elements[i];
        out[i] = acc;
    }

  • exclusive
    int acc = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
        out[i] = acc;
        acc = acc + elements[i];
    }

parallel implementation

Hillis and Steele


  • In practice, it should be executed within a single block, which must have at least as many threads as there are array elements.
  • Use double buffering, or the results of one warp will be overwritten by threads in another warp.
__global__ void scan(float *g_odata, float *g_idata, int n)
{
  extern __shared__ float temp[]; // allocated on invocation
  int thid = threadIdx.x;
  int pout = 0, pin = 1;
  // Load input into shared memory.
  // This is exclusive scan, so shift right by one
  // and set first element to 0
  temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
  __syncthreads();
  for (int offset = 1; offset < n; offset *= 2)
  {
    pout = 1 - pout; // swap double buffer indices
    pin = 1 - pout;
    if (thid >= offset)
      temp[pout*n+thid] += temp[pin*n+thid - offset];
    else
      temp[pout*n+thid] = temp[pin*n+thid];
    __syncthreads();
  }
  g_odata[thid] = temp[pout*n+thid]; // write output
}
  • in/out: the 2 buffers (double buffering)
  • d: the step index
  • offset = 2^d

Blelloch

  • To do this we will use an algorithmic pattern that arises often in parallel computing: balanced trees. The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum. A binary tree with n leaves has d = log2(n) levels, and each level d has 2^d nodes. If we perform one add per node, then we will perform O(n) adds on a single traversal of the tree.
  • The algorithm consists of two phases: the reduce phase (also known as the up-sweep phase) and the down-sweep phase.
upsweep


__global__ void Bscan(unsigned int *g_odata, int *g_idata, int n) {
  extern __shared__ int temp[]; // allocated on invocation
  int thid = threadIdx.x;
  int offset = 1;
  temp[2 * thid] = g_idata[2 * thid]; // load input into shared memory
  temp[2 * thid + 1] = g_idata[2 * thid + 1];
  for (int d = n >> 1; d > 0; d >>= 1) // build sum in place up the tree
  {
    __syncthreads();
    if (thid < d) {
      int ai = offset * (2 * thid + 1) - 1;
      int bi = offset * (2 * thid + 2) - 1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }
  if (thid == 0) {
    temp[n - 1] = 0; // clear the last element
  }
  // (the kernel continues with the down-sweep phase below)
  • size of temp: 2^d + 2^(d−1) + … + 1 = 2^(d+1) − 1 ≈ 2n (for n = 2^d leaves)
  • it stores all the intermediate values (the binary tree)
down sweep


  for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
  {
    offset >>= 1;
    __syncthreads();
    if (thid < d) {
      int ai = offset * (2 * thid + 1) - 1;
      int bi = offset * (2 * thid + 2) - 1;
      int t = temp[ai]; // temp holds ints, so t is an int as well
      temp[ai] = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();
  g_odata[2 * thid] = temp[2 * thid]; // write results to device memory
  g_odata[2 * thid + 1] = temp[2 * thid + 1];
}

segment scan

A long array is split into many small arrays, and a scan is run inside each one.
This is used for sparse-matrix × dense-vector multiplication.
PageRank: an n×n matrix in which an entry is nonzero only where a link exists.
In that case, segmented scan can be applied with the sparse matrix stored in CSR format.

sort

odd-even sort

A parallel version of bubble sort.

  • work: O(n^2)
  • step: O(n)

merge sort

Merge sort has 3 main stages.
Stage 2 uses 1 thread block per merge.
Note that in stage 3, only 1 thread works per merge and lots of SMs will be idle. So we break the 2 lists into sub-lists to achieve parallelism.

  • work: O(n*logn)
  • step: O(logn)

sorting networks

Bitonic sort: the amount of computation is oblivious to the input content. Random, sorted, and reversed arrays all take exactly the same work.

radix sort

Start from the LSB (least significant bit): move elements with a 0 bit to the front and those with a 1 bit to the back, and repeat up to the most significant bit.
Each bit's pass is really a compact, i.e. a scan.

quick sort

Essentially a recursive algorithm: pick a pivot, split the array into <, =, and > parts, then pick pivots inside the three parts and continue. A non-recursive implementation is also possible, and dynamic parallelism can be used to implement the recursion.

optimization

Several aspects determine speed:

  1. the algorithm, measured by arithmetic complexity;
  2. basic principles, e.g. a cache-aware implementation;
  3. optimizations for the platform architecture;
  4. micro-optimizations, e.g. the fast inverse square root.

analysis

First, analyze whether your code makes full use of the memory bandwidth.
Analyze with device query: from the clock rate and bus width you can compute the GPU's theoretical bandwidth, and then check whether your kernel fully utilizes it.
A shortfall is very likely caused by non-coalesced access. Coalescing means that threads with adjacent threadIdx.x should access adjacent elements; otherwise, with a large stride, a large part of every memory transaction is wasted.

  • LITTLE'S LAW
    the amount of data in flight equals latency × throughput, so saturating the bandwidth needs enough memory transactions in flight.
  • With shared memory, ways to lower latency (the original list was in a lost image) include:
    using smaller thread blocks, because with 32×32 threads per block many threads end up waiting for the others; 16×16 is much better;
    keeping several blocks per SM, because while one block waits at __syncthreads(), the other blocks can keep working.

thread divergence

Threads within one warp take different paths at an if/else, and the branches are executed one after another. How much slower this is mainly depends on how many ways the warp gets split.

some ninja methods

  1. Numeric literals default to double, so appending an f, e.g. 2.5f, is faster.
  2. Use intrinsics; see the sketch below.
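
For example (a small sketch; __sinf and __expf are the fast but less accurate intrinsic versions of sinf and expf):

__global__ void ninja(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float a = x[i] * 2.5f;          // 2.5f, not 2.5: avoids a double-precision multiply
    x[i] = __sinf(a) * __expf(-a);  // intrinsics instead of sinf()/expf()
  }
}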

device query

5.14. Take a look at your hardware; a sketch follows below.
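
A minimal deviceQuery-style sketch using cudaGetDeviceProperties (the bandwidth formula follows the best practices guide: memory clock × bus width × 2 for double data rate):

#include <cstdio>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int i = 0; i < n; ++i) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, i);
    // memoryClockRate is in kHz, memoryBusWidth in bits
    double gbps = 2.0 * p.memoryClockRate * 1e3 * (p.memoryBusWidth / 8.0) / 1e9;
    printf("%s: SM %d.%d, %d SMs, ~%.0f GB/s peak bandwidth\n",
           p.name, p.major, p.minor, p.multiProcessorCount, gbps);
  }
  return 0;
}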

pinned memory

Using cudaMallocHost makes CPU-to-GPU copies faster. It can be used for streaming a hash table; see the sketch below.
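
A minimal sketch (cudaMallocHost and cudaFreeHost allocate and free page-locked host memory; pinned buffers are also what cudaMemcpyAsync needs to be truly asynchronous):

int main() {
  const int N = 1 << 20;
  float *h_buf, *d_buf;
  cudaMallocHost((void **)&h_buf, N * sizeof(float)); // pinned (page-locked) host memory
  cudaMalloc((void **)&d_buf, N * sizeof(float));
  // this copy is faster from pinned memory than from pageable memory
  cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  return 0;
}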

stream

Lets different kernels run at the same time. Two kernels can only overlap if neither depends on the other.
Two streams s1 and s2 that do not interact also run concurrently.
Take care not to create conflicts, e.g. two streams touching the same data.
The main use of streams: when there is a big pile of data that cannot all be processed in one kernel launch, copy it over chunk by chunk, so that for example one half is being copied while the other half is being processed; data transfer and processing then overlap. See the sketch below.
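
A minimal two-stream sketch of that copy/process overlap (the process kernel, the chunk size, and the pointer names are my assumptions; h_data must be pinned for the async copies to overlap):

__global__ void process(float *p, int n) { /* ... */ }

void pipeline(float *h_data, float *d_data, int nChunks) {
  const int CHUNK = 1 << 20; // elements per chunk
  cudaStream_t s[2];
  cudaStreamCreate(&s[0]);
  cudaStreamCreate(&s[1]);
  for (int c = 0; c < nChunks; ++c) {
    cudaStream_t st = s[c % 2]; // alternate streams between chunks
    // this chunk's copy overlaps the previous chunk's kernel in the other stream
    cudaMemcpyAsync(d_data + (size_t)c * CHUNK, h_data + (size_t)c * CHUNK,
                    CHUNK * sizeof(float), cudaMemcpyHostToDevice, st);
    process<<<CHUNK / 256, 256, 0, st>>>(d_data + (size_t)c * CHUNK, CHUNK);
  }
  cudaStreamSynchronize(s[0]);
  cudaStreamSynchronize(s[1]);
  cudaStreamDestroy(s[0]);
  cudaStreamDestroy(s[1]);
}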

list ranking

Turns a linked list into an array by giving every element its index.
Trades more work, O(n*logn), for fewer steps, O(logn).
The core idea comes from finding the last element of a linked list: every element chases the last element, and then nodes are woken level by level starting from 0; e.g. we first wake element 5.

cuckoo hashing

Chaining is bad for parallelism.
Kick out things that are already in the hash table.
With some probability, an item cannot be placed into any table even after every hash function has been tried; after a set number of iterations, the only option left is to switch to new hash functions.
At lookup time, you may have to try every hash function.

  • Note:
    the write-in and kick-out operations need an atomic operation (atomicExch); see the sketch below.
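
A minimal sketch of one insert attempt with atomicExch (one table probed with two hash functions; TABLE_SIZE, EMPTY, MAX_ITER, h1, and h2 are all my own assumptions):

#define TABLE_SIZE 1024u
#define EMPTY 0xffffffffu
#define MAX_ITER 32

__device__ unsigned h1(unsigned k) { return (k * 2654435761u) % TABLE_SIZE; }
__device__ unsigned h2(unsigned k) { return (k * 40503u + 13u) % TABLE_SIZE; }

__device__ bool cuckoo_insert(unsigned *table, unsigned key) {
  unsigned slot = h1(key);
  for (int i = 0; i < MAX_ITER; ++i) {
    key = atomicExch(&table[slot], key); // write in, kicking out the old occupant
    if (key == EMPTY) return true;       // the slot was free: done
    // the evicted key retries at its other possible location
    slot = (slot == h1(key)) ? h2(key) : h1(key);
  }
  return false; // give up: rebuild the table with new hash functions
}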

dynamic parallelism

Makes recursion and nesting possible!

notes

  1. Every thread in a block would launch its own child kernel, so restrict the launch with threadIdx.x (see the sketch after this list).
  2. Streams and events belong to a single block; they cannot be passed to other blocks or to child blocks. I don't understand this yet; need to rewatch lesson 5.
  3. Shared memory is private too and cannot be passed to a child block; the child block lives in another grid!
    First time I learned that malloc() can even be called inside a kernel…
  • quicksort's pain points are the same as BFS's!
  1. after each kernel finishes, GPU state (output_len, is_change) has to be transferred back to the CPU;
  2. it proceeds in waves, and short waves have to wait for the long ones.
    This can be combined with cuda streams.
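
A minimal sketch for point 1 (compile with nvcc -rdc=true on sm_35 or newer; the kernel is illustrative):

__global__ void recurse(int depth) {
  // ... work for this level ...
  if (threadIdx.x == 0 && depth < 4) { // guard: one launch per block, bounded depth
    recurse<<<1, 32>>>(depth + 1);     // child grid launched from the device
  }
}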

matrix multiply

Sparse matrices use the CSR format. Here x is a dense column vector; see the sketch below.
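
A minimal one-thread-per-row sketch of CSR y = A*x (val/colIdx/rowPtr are the usual CSR arrays; the names are mine):

__global__ void spmv_csr(int nRows, const float *val, const int *colIdx,
                         const int *rowPtr, const float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < nRows) {
    float dot = 0.f;
    // rowPtr[row]..rowPtr[row + 1] delimits the nonzeros of this row
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
      dot += val[j] * x[colIdx[j]];
    y[row] = dot;
  }
}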

cudaMallocPitch and cudaMemcpy2D

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned…

Assuming that we want to allocate a 2D padded array of floating point (single precision) elements:

cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

where

  • devPtr is an output pointer to float (float *devPtr);
  • devPitch is a size_t output variable denoting the length, in bytes, of the padded row;
  • Nrows and Ncols are size_t input variables representing the matrix size.

cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data.
Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid in more efficient memory accesses.

CUDA also provides the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch.

cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice);

where

  • devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
  • devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
  • Nrows and Ncols are size_t input variables representing the matrix size.
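
Inside a kernel, a row of the pitched allocation is then addressed by offsetting the base pointer in bytes (a minimal sketch; the kernel name is mine):

__global__ void scale(float *devPtr, size_t devPitch, int Nrows, int Ncols) {
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  if (r < Nrows && c < Ncols) {
    // the pitch is in bytes, hence the cast to char * before the row offset
    float *row = (float *)((char *)devPtr + r * devPitch);
    row[c] *= 2.0f;
  }
}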

other operators

__ldg

Optimizes loads by routing them through the read-only data cache; see the sketch below.

Reference: cuda sheet
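
A minimal sketch (__ldg needs compute capability 3.5+; marking the pointer const __restrict__ lets the compiler apply the same optimization on its own):

__global__ void copy_ldg(const float *__restrict__ in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __ldg(&in[i]); // load routed through the read-only cache
}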

cudaMemcpyAsync

Makes use of streams. The copy engine and the kernel engine can work concurrently.

cudaMallocManaged

Data on the host and the device can share the same pointer. It may be slower than cudaMalloc; see the sketch below.
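
A minimal sketch (the init kernel is illustrative; note the cudaDeviceSynchronize before the host touches the data):

#include <cstdio>

__global__ void init(int *p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] = i;
}

int main() {
  const int N = 256;
  int *data;
  cudaMallocManaged((void **)&data, N * sizeof(int)); // same pointer on host and device
  init<<<2, 128>>>(data, N);
  cudaDeviceSynchronize(); // wait before reading on the host
  printf("%d\n", data[N - 1]);
  cudaFree(data);
  return 0;
}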

ballot, bfind

Translated from the cuda programming guide:

  • create a bit mask in a 32-bit register using the GPU ballot instruction.
  • use the bfind PTX intrinsic to get the location of the first nonzero bit.
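
A small sketch of the two steps (using the modern __ballot_sync; bfind is reached through inline PTX and returns the position of the highest set bit, or 0xffffffff when the mask is empty):

__device__ int highest_active_lane(int pred) {
  unsigned mask = __ballot_sync(0xffffffffu, pred); // bit i set iff lane i's pred != 0
  unsigned pos;
  asm("bfind.u32 %0, %1;" : "=r"(pos) : "r"(mask)); // index of the highest set bit
  return (int)pos;
}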

how to optimize

warp

32 threads form a warp and do their computation physically in parallel.

  • How do multiple warps parallelize?
    Computation (green in the lost figure) is used to hide latency (white).

memory access pattern

So voxel hashing is not a good access pattern:

  1. a voxel is not a native word length;
  2. accesses are not aligned and not coalesced (i.e. random).

share mem

bank conflict

Threads are split into warps by their linear id: 0–31 form the first warp, 32–63 the second.
Shared memory is split into banks by its 2D id; there are 32 banks in total, matching the 32 threads of a warp. For linear ids below 32, each +1 in id moves to the next bank. This layout matters because threads of the same warp accessing different addresses in the same bank conflict.
What to do when a conflict occurs?
Pad one extra column on the right edge of shared memory; this column is a pure placeholder and takes no part in I/O. The threads of a warp are then staggered across different banks. For details see 共享內存csdn. A sketch follows below.
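
The classic padded transpose tile as a sketch (assumes a square w×w matrix, w a multiple of 32, and 32×32 thread blocks):

__global__ void transpose(const float *in, float *out, int w) {
  // 33 instead of 32: the padding column shifts each row by one bank, so a
  // warp reading a *column* of the tile hits 32 different banks, not one
  __shared__ float tile[32][33];
  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * w + x];
  __syncthreads();
  x = blockIdx.y * 32 + threadIdx.x; // swapped block coordinates
  y = blockIdx.x * 32 + threadIdx.y;
  out[y * w + x] = tile[threadIdx.x][threadIdx.y];
}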

GDB, CUDA-GDB, CUDA MEMCHECK

enter gdb / cuda-gdb

  • to debug an ordinary cpp project:
    if the executable binary after compiling is called "detect", just type
gdb ./detect

in terminal.

  • to debug a ros workspace:
    refer to the ROS WIKI

to run a node in cuda-gdb:

rosrun --prefix "cuda-gdb --args" edt edt_node 

to run a launch file under gdb:

roslaunch --prefix "gdb --args" edt/launch edt.launch

or you can write in launch file:

launch-prefix= "xterm -e gdb --args"
  • Caution:
  • to use gdb, please add
set(ROS_BUILD_TYPE Debug)
set(CMAKE_BUILD_TYPE Debug)

in CMakeLists.txt.

  • when debugging cuda with cuda-gdb, you should pass debug info to the nvcc compiler as well.
SET(CUDA_NVCC_FLAGS "-g ;-G ;-arch=sm_60" CACHE STRING "nvcc flags" FORCE)

enter cuda memcheck

  • standalone:
rosrun --prefix "cuda-memcheck " edt edt_node

Somehow, I cannot set params like --continue and --leak-check.

  • integrate with cuda-gdb
(cuda-gdb) set cuda memcheck on

cmds

gdb quick start

set breakpoints

  • break main
  • b main.cpp:14
  • b kernel.cu:58 if threadIdx.x==8
    It is a conditional breakpoint.

disable breakpoint

  • disable
  • delete breakpoints

watch variables

  • p (var): print value
  • p a=1 : set a=1
  • info locals : print all local vars
  • p *array@len : print len elements starting at array

other

  • n: next
  • l: list code
  • [Enter]: repeat the last cmd
  • q: quit
  • r:run or restart
  • c: continue
  • s: step one execution; if it is a function call, step into it.