CUDA Advanced Topics (1): Pinned Memory and Unified Memory

A First Look at Memory Allocation

On coordinating memory creation, allocation, and transfer between the CPU and the GPU.

Allocating Memory on the CPU

There are two main ways to allocate memory on the CPU:

  • with the malloc function from the C standard library
  • by calling CUDA's cudaMallocHost function

Because cudaMallocHost returns page-locked (pinned) memory, it enables higher CPU-GPU transfer rates: the measured throughput gain is about 2.4x on Barracuda10, 2.0x on Barracuda04, and 1.5x on Barracuda01. The drawback is that allocating with cudaMallocHost is much slower than with malloc: each cudaMallocHost call for 1 MB takes roughly 2300 microseconds, and for 512 MB the time rises to about 61 ms, i.e. allocation is 3-5 orders of magnitude slower than malloc.
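A minimal sketch contrasting the two host allocation paths (the buffer size and variable names are illustrative, not from the original benchmark):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20; // 1 MB, illustrative

        // Pageable host memory: fast to allocate, slower to transfer
        float *pageable = (float *)malloc(bytes);

        // Page-locked (pinned) host memory: slow to allocate, but the GPU's
        // DMA engine can read it directly, so transfers are faster
        float *pinned;
        cudaMallocHost(&pinned, bytes);

        float *d_buf;
        cudaMalloc(&d_buf, bytes);

        // Both host pointers work with cudaMemcpy; only the pinned one
        // avoids an internal staging copy through a pinned buffer
        cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(pinned); // pinned memory is freed with cudaFreeHost
        free(pageable);
        return 0;
    }

The allocation-speed gap exists because cudaMallocHost must ask the OS to pin physical pages, while malloc typically only reserves virtual address space.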

Allocating Memory on the GPU

GPU memory can be allocated with:

  • CUDA's cudaMalloc function
  • the cudaMallocPitch function
  • the cudaMallocArray function

Neither cudaMallocPitch nor cudaMallocArray is as fast as cudaMalloc. cudaMalloc's allocation time is shown in Figure 1: a 256-byte allocation takes about 1 microsecond; the cost rises noticeably from 2 KB to 4 KB; a 512 KB allocation takes roughly 50 microseconds; beyond 512 KB the cost climbs steeply, and a 512 MB allocation takes about 12.5 ms. In short, below 4 MB cudaMalloc is about 1.5 orders of magnitude slower than malloc, and above 4 MB it is 2-4 orders of magnitude slower. The three calls are sketched below.
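A minimal sketch of the three allocation calls (dimensions are illustrative):

    #include <cuda_runtime.h>

    int main() {
        // Linear (1D) allocation: the common, fastest path
        float *d_vec;
        cudaMalloc(&d_vec, 1024 * sizeof(float));

        // 2D allocation: each row is padded to 'pitch' bytes so that
        // row starts satisfy the alignment needed for coalesced access
        float *d_mat;
        size_t pitch;
        cudaMallocPitch(&d_mat, &pitch, 640 * sizeof(float), 480);

        // CUDA array: opaque layout, intended for texture/surface access
        cudaArray_t d_arr;
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaMallocArray(&d_arr, &desc, 640, 480);

        cudaFreeArray(d_arr);
        cudaFree(d_mat);
        cudaFree(d_vec);
        return 0;
    }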

 

The example code comes from https://github.com/CoffeeBeforeArch/cuda_programming (walked through in https://www.youtube.com/watch?v=LGhduZNudDY&list=PLxNPSjHT5qvu4Q2UElj3HUCh2lpSooQWo&index=4&frags=wn), comparing vectorAdd_pinned.cu against vectorAdd_um_prefetch.cu. The two patterns are condensed in the sketch below.
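Condensed, the two files boil down to the following patterns (simplified here: initialization and error checking are trimmed relative to the repo versions, and the launch configuration is illustrative):

    #include <cuda_runtime.h>

    __global__ void vectorAdd(const int *a, const int *b, int *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // Pinned-memory pattern (vectorAdd_pinned.cu): explicit copies
    void pinned_version(int n) {
        size_t bytes = n * sizeof(int);
        int *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
        cudaMallocHost(&h_a, bytes); // pinned host buffers
        cudaMallocHost(&h_b, bytes);
        cudaMallocHost(&h_c, bytes);
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);
        // ... initialize h_a / h_b on the CPU ...
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
        vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
    }

    // Unified-memory pattern (vectorAdd_um_prefetch.cu): one pointer,
    // migration driven by cudaMemPrefetchAsync instead of cudaMemcpy
    void um_prefetch_version(int n) {
        size_t bytes = n * sizeof(int);
        int *a, *b, *c;
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        int dev = 0;
        cudaGetDevice(&dev);
        // ... initialize a / b on the CPU ...
        cudaMemPrefetchAsync(a, bytes, dev); // migrate to GPU up front
        cudaMemPrefetchAsync(b, bytes, dev);
        vectorAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId); // result back to CPU
        cudaDeviceSynchronize();
        cudaFree(a); cudaFree(b); cudaFree(c);
    }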

 

The authoritative explanation comes from the CUDA C++ Programming Guide:

Data Prefetching
Data prefetching means migrating data to a processor’s memory and mapping it in
that processor’s page tables before the processor begins accessing that data. The intent
of data prefetching is to avoid faults while also establishing data locality. This is most
valuable for applications that access data primarily from a single processor at any given
time. As the accessing processor changes during the lifetime of the application, the data
can be prefetched accordingly to follow the execution flow of the application. Since work
is launched in streams in CUDA, it is expected of data prefetching to also be a streamed
operation as shown in the following API:
    cudaError_t cudaMemPrefetchAsync(const void *devPtr,
                                     size_t count,
                                     int dstDevice,
                                     cudaStream_t stream);
where the memory region specified by devPtr pointer and count number of bytes, with
ptr rounded down to the nearest page boundary and count rounded up to the nearest
page boundary, is migrated to the dstDevice by enqueueing a migration operation in
stream. Passing in cudaCpuDeviceId for dstDevice will cause data to be migrated to
CPU memory.
Consider a simple code example below:
    void foo(cudaStream_t s) {
        char *data;
        cudaMallocManaged(&data, N);
        init_data(data, N);                        // execute on CPU
        cudaMemPrefetchAsync(data, N, myGpuId, s); // prefetch to GPU
        mykernel<<<..., s>>>(data, N, 1, compare); // execute on GPU
        cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s); // prefetch to CPU
        cudaStreamSynchronize(s);
        use_data(data, N);
        cudaFree(data);
    }
Without performance hints the kernel mykernel will fault on first access to data,
which creates additional overhead of the fault processing and generally slows down
the application. By prefetching data in advance it is possible to avoid page faults and
achieve better performance.

This API follows stream ordering semantics, i.e. the migration does not begin until all
prior operations in the stream have completed, and any subsequent operation in the
stream does not begin until the migration has completed.

Data Usage Hints
Data prefetching alone is not sufficient when multiple processors need to access the same data simultaneously. In such cases, it is useful for the application to provide hints about how the data will actually be used. The following advisory API can be used to specify data usage:
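That advisory API is cudaMemAdvise:

    cudaError_t cudaMemAdvise(const void *devPtr,
                              size_t count,
                              enum cudaMemoryAdvise advice,
                              int device);

For example, reusing the data, N, and myGpuId names from the guide's snippet above, a region that many processors read but rarely write can be marked read-mostly, so each processor keeps its own read-only copy instead of migrating pages back and forth:

    // Advice values include cudaMemAdviseSetReadMostly,
    // cudaMemAdviseSetPreferredLocation and cudaMemAdviseSetAccessedBy
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);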


References:

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

https://zhuanlan.zhihu.com/p/82651065

https://satisfie.github.io/2016/06/06/Caffe%E8%A7%A3%E8%AF%BB1-Pinned-Memory-Vs-Non-Pinned-Memory/

 
