A First Look at Memory Allocation
On coordinating the creation, allocation, and transfer of memory between the CPU and GPU
CPU Memory Allocation
There are two main ways to allocate memory on the CPU:
- through the malloc function in the C standard library
- by calling CUDA's cudaMallocHost function

The cudaMallocHost function page-locks (pins) the memory, which enables higher CPU-GPU transfer rates: throughput increased 2.4x on Barracuda10, 2.0x on Barracuda04, and 1.5x on Barracuda01. The drawback is that cudaMallocHost allocates memory much more slowly than malloc: each cudaMallocHost call allocating 1 MB takes around 2300 microseconds, and for 512 MB the time rises to 61 ms, which is 3-5 orders of magnitude slower than malloc.
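The trade-off above can be observed directly. Below is a minimal sketch that times a host-to-device copy from pageable (malloc) memory versus pinned (cudaMallocHost) memory; the 64 MB size is illustrative, not taken from the measurements above, and error checking is omitted for brevity.

```cuda
// Compare H2D copy time from pageable vs. pinned host memory.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;                  // 64 MB, illustrative
    char *pageable = (char *)malloc(bytes);         // pageable host memory
    char *pinned = nullptr;
    cudaMallocHost((void **)&pinned, bytes);        // page-locked host memory
    char *device = nullptr;
    cudaMalloc((void **)&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    cudaEventRecord(start);
    cudaMemcpy(device, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable H2D: %.2f ms\n", ms);

    cudaEventRecord(start);
    cudaMemcpy(device, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned   H2D: %.2f ms\n", ms);

    cudaFree(device);
    cudaFreeHost(pinned);                           // pinned memory has its own free
    free(pageable);
    return 0;
}
```

Note that memory obtained from cudaMallocHost must be released with cudaFreeHost, not free.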
GPU Memory Allocation
The ways to allocate memory on the GPU are:
- CUDA's cudaMalloc function
- the cudaMallocPitch function
- the cudaMallocArray function

Neither cudaMallocPitch nor cudaMallocArray is as fast as cudaMalloc. The allocation time of cudaMalloc is shown in Figure 1: allocating 256 bytes takes about 1 microsecond; the cost rises noticeably between 2 KB and 4 KB; allocations up to 512 KB take roughly 50 microseconds; beyond 512 KB the cost climbs sharply, reaching about 12.5 ms for a 512 MB allocation. In short, below 4 MB cudaMalloc is about 1.5 orders of magnitude slower than malloc, and above 4 MB it is 2-4 orders of magnitude slower.
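For reference, a minimal sketch of the three allocators listed above; the 2-D dimensions are illustrative and error checking is omitted.

```cuda
// The three device-side allocators: linear, pitched, and CUDA array.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // cudaMalloc: plain linear device memory, the fastest of the three.
    float *linear = nullptr;
    cudaMalloc((void **)&linear, 512 * sizeof(float));

    // cudaMallocPitch: each row is padded to a pitch chosen by the driver
    // so that row starts satisfy alignment requirements for coalescing.
    float *pitched = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void **)&pitched, &pitch, 512 * sizeof(float), 256);
    printf("pitch = %zu bytes per row\n", pitch);

    // cudaMallocArray: an opaque layout intended for texture/surface access.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &desc, 512, 256);

    cudaFreeArray(arr);
    cudaFree(pitched);
    cudaFree(linear);
    return 0;
}
```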
The examples come from https://github.com/CoffeeBeforeArch/cuda_programming
https://www.youtube.com/watch?v=LGhduZNudDY&list=PLxNPSjHT5qvu4Q2UElj3HUCh2lpSooQWo&index=4&frags=wn
comparing vectorAdd_pinned.cu with vectorAdd_um_prefetch.cu.
The authoritative explanation comes from the《CUDA C++ Programming Guide》:
Data Prefetching
Data prefetching means migrating data to a processor’s memory and mapping it in
that processor’s page tables before the processor begins accessing that data. The intent
of data prefetching is to avoid faults while also establishing data locality. This is most
valuable for applications that access data primarily from a single processor at any given
time. As the accessing processor changes during the lifetime of the application, the data
can be prefetched accordingly to follow the execution flow of the application. Since work
is launched in streams in CUDA, it is expected of data prefetching to also be a streamed
operation as shown in the following API:
cudaError_t cudaMemPrefetchAsync(const void *devPtr,
size_t count,
int dstDevice,
cudaStream_t stream);
where the memory region specified by devPtr pointer and count number of bytes, with
ptr rounded down to the nearest page boundary and count rounded up to the nearest
page boundary, is migrated to the dstDevice by enqueueing a migration operation in
stream. Passing in cudaCpuDeviceId for dstDevice will cause data to be migrated to
CPU memory.
Consider a simple code example below:
void foo(cudaStream_t s) {
    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);                                // execute on CPU
    cudaMemPrefetchAsync(data, N, myGpuId, s);         // prefetch to GPU
    mykernel<<<..., s>>>(data, N, 1, compare);         // execute on GPU
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s); // prefetch to CPU
    cudaStreamSynchronize(s);
    use_data(data, N);
    cudaFree(data);
}
Without performance hints the kernel mykernel will fault on first access to data,
which creates additional overhead of fault processing and generally slows down
the application. By prefetching data in advance it is possible to avoid page faults
and achieve better performance.
This API follows stream ordering semantics, i.e. the migration does not begin until all
prior operations in the stream have completed, and any subsequent operation in the
stream does not begin until the migration has completed.
Data Usage Hints:
Data prefetching alone is not enough when multiple processors need to access the same data simultaneously. In such scenarios it is useful for the application to provide hints on how the data is actually used. The following advisory API can be used to specify data usage:
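The advisory API in question is cudaMemAdvise from the CUDA runtime. A minimal sketch of its common hints follows; the device id 0 and the particular combination of hints are illustrative assumptions, not prescriptions from the guide.

```cuda
// Data-usage hints on a managed-memory range via cudaMemAdvise.
#include <cuda_runtime.h>

void advise_example(char *data, size_t N) {
    // The range is mostly read: the driver may keep read-only copies on
    // several processors at once. (The device argument is ignored here.)
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, 0);

    // Prefer to keep the physical pages on device 0 even when other
    // processors touch the range.
    cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, 0);

    // The CPU will also access this range; keep it mapped in the CPU's
    // page tables to avoid faults there.
    cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
}
```

Each Set* hint has a corresponding Unset* counterpart for removing the advice later.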
References:
https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
https://zhuanlan.zhihu.com/p/82651065
https://satisfie.github.io/2016/06/06/Caffe%E8%A7%A3%E8%AF%BB1-Pinned-Memory-Vs-Non-Pinned-Memory/