Understanding GPU memory through one piece of code

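GPU memory comes in three kinds. Ordered from fastest to slowest I/O, they are:

local memory
shared memory
global memory

Shared memory is on-chip, so accessing it is far faster than accessing global memory.

Here "local memory" means a thread's private variables, which the compiler normally keeps in registers. (In NVIDIA's own terminology, "local memory" is the per-thread spill space that physically resides in device memory and is about as slow as global memory; registers are the fast tier.)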

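The three kinds of memory differ in who can access them:

local memory: private to a thread; only that thread can access it.
shared memory: shared by a thread block; every thread in the same block can access it.
global memory: accessible to all threads.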
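The visibility rules above can be sketched with a tiny multi-block kernel. This is only an illustration under assumptions: the names scope_demo, block_total and d_out are hypothetical, and the launch shape of 2 blocks with 4 threads each is picked arbitrarily. Each block accumulates into its own copy of a __shared__ variable, and only the values written to global memory are visible to the host.

```cuda
#include <stdio.h>

// Hypothetical demo kernel: every block owns a separate copy of block_total.
__global__ void scope_demo(int* out)
{
    // Local variable: private to this thread.
    int contribution = blockIdx.x + 1;

    // Shared variable: one copy per thread block.
    __shared__ int block_total;
    if (threadIdx.x == 0) block_total = 0;
    __syncthreads();

    // All threads of this block add into the block's own copy.
    atomicAdd(&block_total, contribution);
    __syncthreads();

    // Global memory: visible to every thread, and to the host after a copy.
    if (threadIdx.x == 0) out[blockIdx.x] = block_total;
}

int main()
{
    int h_out[2] = {0, 0};
    int* d_out;
    cudaMalloc((void**)&d_out, sizeof(h_out));

    scope_demo<<<2, 4>>>(d_out);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int b = 0; b < 2; ++b)
        printf("block %d total = %d\n", b, h_out[b]);

    cudaFree(d_out);
    return 0;
}
```

With a working CUDA device, block 0 should report 4 (four threads adding 1) and block 1 should report 8 (four threads adding 2), which shows that the two blocks never see each other's shared variable.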

So where do these three kinds of memory actually show up when you write a program?

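#include <stdio.h>

__global__ void memory_demo(float* array)
{
    // The pointer array is itself private to each thread (like a local variable),
    // but the memory it points to is global memory.
    // i and index are local variables, private to each thread.
    int i, index = threadIdx.x;

    // A __shared__ variable is visible to every thread in the block
    // and has the same lifetime as the thread block.
    __shared__ float sh_arr[128];

    // Copy the value from global memory into shared memory.
    sh_arr[index] = array[index];

    // Barrier: wait until every thread in the block has finished its copy.
    __syncthreads();
    // From here on, I/O on shared memory is much faster than on global memory.
    // do something
}

int main()
{
    float h_arr[128];
    float *d_arr;

    // Fill the host array so the copy below does not read uninitialized values.
    for (int i = 0; i < 128; ++i) h_arr[i] = (float)i;

    // Memory allocated by cudaMalloc lives in global memory.
    cudaMalloc((void **)&d_arr, sizeof(float)*128);
    cudaMemcpy((void*) d_arr, (void*) h_arr, sizeof(float)*128, cudaMemcpyHostToDevice);

    // Launch the kernel: 1 block of 128 threads, matching the size of sh_arr.
    memory_demo<<<1, 128>>>(d_arr);

    // Wait for the kernel to finish.
    cudaDeviceSynchronize();

    // .. do other stuff

    // Release the global memory.
    cudaFree(d_arr);
    return 0;
}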
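The "do something" step in the kernel is left open. As one possible way to fill it in, the sketch below reverses each 128-element tile of an array through shared memory. The names reverse_tiles, TILE, NUM_BLOCKS and the CHECK macro are assumptions made for this illustration and are not part of the original code. It also shows why the barrier matters: each thread reads a shared-memory slot that a different thread wrote.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define TILE 128        // threads per block and elements per tile (assumed)
#define NUM_BLOCKS 2    // number of tiles (assumed)

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(1);                                                   \
        }                                                              \
    } while (0)

__global__ void reverse_tiles(float* array)
{
    // One tile per block, held in shared memory.
    __shared__ float tile[TILE];

    int lidx = threadIdx.x;                    // index within the block
    int gidx = blockIdx.x * TILE + lidx;       // index within the whole array

    // Global memory -> shared memory.
    tile[lidx] = array[gidx];

    // Without this barrier a thread could read a slot
    // that its neighbour has not written yet.
    __syncthreads();

    // Shared memory -> global memory, in reversed order within the tile.
    array[gidx] = tile[TILE - 1 - lidx];
}

int main()
{
    const int n = TILE * NUM_BLOCKS;
    float h_arr[TILE * NUM_BLOCKS];
    for (int i = 0; i < n; ++i) h_arr[i] = (float)i;

    float* d_arr;
    CHECK(cudaMalloc((void**)&d_arr, n * sizeof(float)));
    CHECK(cudaMemcpy(d_arr, h_arr, n * sizeof(float), cudaMemcpyHostToDevice));

    reverse_tiles<<<NUM_BLOCKS, TILE>>>(d_arr);
    CHECK(cudaGetLastError());

    CHECK(cudaMemcpy(h_arr, d_arr, n * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_arr));

    // Verify on the host: element j of tile b should now hold b*TILE + (TILE-1-j).
    for (int b = 0; b < NUM_BLOCKS; ++b)
        for (int j = 0; j < TILE; ++j)
            if (h_arr[b * TILE + j] != (float)(b * TILE + TILE - 1 - j)) {
                printf("mismatch at tile %d, element %d\n", b, j);
                return 1;
            }
    printf("all tiles reversed correctly\n");
    return 0;
}
```

Each block reverses only its own tile because each block has its own copy of tile. The array size is assumed to be an exact multiple of TILE; a bounds check on gidx would be needed otherwise.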

ke1th
