CUDA kernel functions and shared memory

Original

wenxuegeng

2020-02-25 23:01

Tags: CUDAExample

CUDA kernel execution parameters

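A kernel function is the program that runs on every GPU thread. It must be defined with the __global__ function type qualifier, in the following form: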

            __global__ void kernel(param list){  }

A kernel is launched from host code, and the launch must specify an execution configuration. The call has the following form:

            kernel<<<Dg, Db, Ns, S>>>(param list);

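Dg and Db set the grid and block dimensions. Ns is an optional parameter that sets the size, in bytes, of the shared memory allocated dynamically for the kernel. It must not exceed the maximum shared memory allowed per block. When no dynamic allocation is needed, set it to 0 or omit it.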
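
As a minimal sketch of how Ns is used (the kernel name reverseInBlock, the buffer name buf, and the sizes are made up for illustration and do not come from the original post), a kernel can declare an unsized extern __shared__ array and receive its size from the third launch parameter:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one block reverses n elements through
// dynamically sized shared memory.
__global__ void reverseInBlock(const int *in, int *out, int n)
{
    extern __shared__ int buf[];   // size is supplied by Ns at launch
    int t = threadIdx.x;
    if (t < n) buf[t] = in[t];
    __syncthreads();               // every element is staged before any is read back
    if (t < n) out[t] = buf[n - 1 - t];
}

int main()
{
    const int n = 128;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // Dg = 1 block, Db = n threads, Ns = n * sizeof(int) bytes of dynamic shared memory
    reverseInBlock<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d, h_out[%d] = %d\n", h_out[0], n - 1, h_out[n - 1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If Ns were omitted here, buf would have zero bytes behind it and the accesses would be out of bounds, so the launch parameter and the kernel's indexing have to agree.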

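Shared memory can be allocated statically or dynamically. Static allocation declares a fixed-size shared memory array inside the kernel; dynamic allocation sets the size at launch time. The two can be used separately or together in the same kernel, but their combined size must not exceed the shared memory available to each block. The third parameter inside <<<>>> at the call site sets the size of the dynamic allocation, so the dynamic shared memory the kernel actually uses must stay within that value. The fourth parameter, S, is an optional cudaStream_t that defaults to 0 and specifies the stream the kernel runs in.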
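
The fourth parameter can be sketched as follows. This is only an illustration under assumed names (the kernel scale and the 64-element buffer are hypothetical); it passes Ns = 0 explicitly so that a stream can be given in the fourth position:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scale(float *data, float factor)
{
    data[threadIdx.x] *= factor;
}

int main()
{
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, 64 * sizeof(float));
    cudaMemset(d_data, 0, 64 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Ns = 0 (no dynamic shared memory), S = stream
    scale<<<1, 64, 0, stream>>>(d_data, 2.0f);

    // Wait for the work queued on this stream and report its status.
    cudaError_t err = cudaStreamSynchronize(stream);
    printf("stream status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```
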
The following example illustrates the third parameter:

Test hardware and toolchain

CPU: Intel Xeon E3-1231 v3, quad-core @ 3.40GHz
GPU: NVIDIA GeForce GTX 980
Visual Studio 2013, x64

Test code

// This example shows how to use shared memory

// System includes
#include <stdio.h>
#include <assert.h>

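/*
* On the GTX 980 each block may use at most 48 KB of shared memory.
* 12288 = 48 * 1024 / 4;
* 11776 = 46 * 1024 / 4;
* Change the static shared memory size below to test the relationship between
* static, dynamic, and maximum shared memory.
*/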
__global__ static void timedReduction(const float *input, float *output)
{
    const int staticShareMem = 11776; // test values: 11775, 11776, 11778, 12288, 12289
    __shared__ float staticShared[staticShareMem];
    extern __shared__ float shared[];

    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    if (tid == 1) // write to the static array so that it is not optimized away
    {
        for (int i = 0; i < staticShareMem ; i++)
        {
            staticShared[i] = i * 2.0f;
        }
    }

    // Simple computation in dynamic shared memory
    shared[tid] = input[tid]*3.0f + input[tid + blockDim.x]*2.0f;
    shared[tid + blockDim.x] = input[tid + blockDim.x] + input[tid]*2;

    // Write result: each block outputs one value
    if (tid == 0) output[bid] = shared[0];
}



#define NUM_BLOCKS    64
#define NUM_THREADS   256

// Start the main CUDA Sample here
int main(int argc, char **argv)
{

    float *dinput = NULL;
    float *doutput = NULL;

    float input[NUM_THREADS * 2];
    float output[NUM_BLOCKS];

    // The input that fills the dynamic shared memory is 2 KB
    for (int i = 0; i < NUM_THREADS * 2; i++)
    {
        input[i] = (float)i;
    }

    cudaMalloc((void **)&dinput, sizeof(float) * NUM_THREADS * 2);
    cudaMalloc((void **)&doutput, sizeof(float) * NUM_BLOCKS); 

    cudaMemcpy(dinput, input, sizeof(float) * NUM_THREADS * 2, cudaMemcpyHostToDevice);

    timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(dinput, doutput);

    cudaMemcpy(output, doutput, sizeof(float) * NUM_BLOCKS, cudaMemcpyDeviceToHost);

    cudaFree(dinput);
    cudaFree(doutput);

    // Compute the expected result on the CPU
    float temp, sumall = 0;
    temp = input[0]*3.0 + input[256]*2.0;
    printf("the set result = %f\n", temp);

    for (int i = 1; i < NUM_BLOCKS; i++)
    {
        sumall += temp - output[i];
        //printf("out = %f\n", output[i]);
    }
    // Print the comparison: 0 means the kernel ran correctly, any other value means an error
    printf("result = %f\n", sumall);

    cudaDeviceReset();

    return 0;
}

Results

const int staticShareMem = 11775;

the set result = 512.000000
result = 0.000000

const int staticShareMem = 11776;

the set result = 512.000000
result = 0.000000

const int staticShareMem = 11778;

the set result = 512.000000
result = 30240.000000

const int staticShareMem = 12288;

the set result = 512.000000
result = 30240.000000

const int staticShareMem = 12289;

Error: exceeds the total shared memory size
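
The test program never checks whether the launch succeeded, so the over-budget cases at 11778 and 12288 only show up as a wrong sum. A minimal sketch of how the launch status could be inspected is given below. The kernel overBudget and its body are hypothetical, and the sketch assumes a device with the default 48 KB per-block limit; the exact error text depends on the device and CUDA version, so none is shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define STATIC_FLOATS 11778   // one of the failing values from the results

// Hypothetical kernel that combines a static and a dynamic shared array.
__global__ void overBudget(float *out)
{
    __shared__ float staticShared[STATIC_FLOATS];
    extern __shared__ float dynamicShared[];

    staticShared[threadIdx.x] = 1.0f;
    dynamicShared[threadIdx.x] = 2.0f;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = staticShared[0] + dynamicShared[0];
}

int main()
{
    float *d_out = NULL;
    cudaMalloc((void **)&d_out, sizeof(float));

    // Same dynamic size as the test program: 2 * 256 floats = 2 KB
    overBudget<<<1, 256, 2 * 256 * sizeof(float)>>>(d_out);

    cudaError_t launchErr = cudaGetLastError();      // errors detected at launch
    cudaError_t syncErr = cudaDeviceSynchronize();   // errors raised while the kernel runs
    printf("launch: %s\n", cudaGetErrorString(launchErr));
    printf("sync:   %s\n", cudaGetErrorString(syncErr));

    cudaFree(d_out);
    return 0;
}
```

Checking cudaGetLastError() right after the launch separates configuration problems from errors that occur during execution, which are reported by the later synchronization call.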

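Conclusion

Static and dynamic shared memory allocations can be used separately or together in a kernel, but their combined size must not exceed the total shared memory available to each block. In the test, 11776 floats of static memory (46 KB) plus 2 KB of dynamic memory exactly fill the 48 KB limit and the result is correct; at 11778 floats the total goes over the limit and the result is wrong.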
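
The budget can also be checked on the host before launching. The sketch below is one possible way to do it, not the method used in the test program: it reads the per-block limit from cudaGetDeviceProperties and the kernel's static usage from cudaFuncGetAttributes, then adds the planned dynamic size. The kernel sampleKernel and the choice of device 0 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with 46 KB of static shared memory, as in the test.
__global__ void sampleKernel(float *out)
{
    __shared__ float staticShared[11776];
    extern __shared__ float dynamicShared[];

    staticShared[threadIdx.x] = 1.0f;
    dynamicShared[threadIdx.x] = 2.0f;
    __syncthreads();
    if (threadIdx.x == 0) out[0] = staticShared[0] + dynamicShared[0];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // device 0 is an assumption

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, sampleKernel);   // sharedSizeBytes = static usage

    size_t dynamicBytes = 2 * 256 * sizeof(float);   // planned Ns, 2 KB
    size_t total = attr.sharedSizeBytes + dynamicBytes;

    printf("per-block limit: %llu bytes\n", (unsigned long long)prop.sharedMemPerBlock);
    printf("static: %llu, dynamic: %llu, total: %llu\n",
           (unsigned long long)attr.sharedSizeBytes,
           (unsigned long long)dynamicBytes,
           (unsigned long long)total);
    printf("%s\n", total <= prop.sharedMemPerBlock ? "fits" : "exceeds the limit");
    return 0;
}
```

Reading the static size from the compiled kernel avoids repeating the array length on the host, so the check stays valid when the static array is resized.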
