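I don't want to write my own device kernels for every function. OpenCV ships CUDA versions of functions such as filtering, and CUDA's NPP library also provides filtering functions. I don't know which of the two performs better (both should certainly beat a pure CPU version, but I haven't tested that).
1. cv::cuda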
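#include <chrono>
#include <fstream>
#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/imgproc.hpp>      // cv::resize, cv::Canny
#include <opencv2/cudaimgproc.hpp>  // cv::cuda::createCannyEdgeDetector
#include <opencv2/cudawarping.hpp>  // cv::cuda::resize
#include <opencv2/opencv.hpp>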
#define SIZE 25
int main()
{
    // Load the test image as single-channel grayscale, which Canny expects.
    cv::Mat ImageHost = cv::imread("E:\\CUDA\\imgs\\cv_cuda_testimg.png", cv::IMREAD_GRAYSCALE);
    if (ImageHost.empty()) {
        std::cerr << "Could not read the test image" << std::endl;
        return 1;
    }
    cv::Mat ImageHostArr[SIZE];
    cv::cuda::GpuMat ImageDev;
    cv::cuda::GpuMat ImageDevArr[SIZE];
    ImageDev.upload(ImageHost);  // host -> device copy, outside any timed region

    // Build 24 test images scaled by 0.5x, 1.0x, ..., 12.0x (index 0 is unused).
    for (int n = 1; n < SIZE; n++)
        cv::resize(ImageHost, ImageHostArr[n], cv::Size(), 0.5*n, 0.5*n, cv::INTER_LINEAR);
    for (int n = 1; n < SIZE; n++)
        cv::cuda::resize(ImageDev, ImageDevArr[n], cv::Size(), 0.5*n, 0.5*n, cv::INTER_LINEAR);

    cv::Mat Detected_EdgesHost[SIZE];
    cv::cuda::GpuMat Detected_EdgesDev[SIZE];
    std::ofstream File1, File2;
    File1.open("E:\\CUDA\\imgs\\canny_cpu.txt");
    File2.open("E:\\CUDA\\imgs\\canny_gpu.txt");
    std::cout << "Process started... \n" << std::endl;

    // CPU: time cv::Canny alone for each image size.
    for (int n = 1; n < SIZE; n++) {
        auto start = std::chrono::high_resolution_clock::now();
        cv::Canny(ImageHostArr[n], Detected_EdgesHost[n], 2.0, 100.0, 3, false);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
        File1 << "Image Size: " << ImageHostArr[n].rows* ImageHostArr[n].cols << " " << "CPU Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
    }

    // GPU: same thresholds and aperture; the images are already on the device.
    cv::Ptr<cv::cuda::CannyEdgeDetector> canny_edg = cv::cuda::createCannyEdgeDetector(2.0, 100.0, 3, false);
    for (int n = 1; n < SIZE; n++) {
        auto start = std::chrono::high_resolution_clock::now();
        canny_edg->detect(ImageDevArr[n], Detected_EdgesDev[n]);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
        File2 << "Image Size: " << ImageDevArr[n].rows* ImageDevArr[n].cols << " " << "GPU Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
    }
    std::cout << "Process ended... \n" << std::endl;
    return 0;
}
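The start/finish/elapsed pattern repeated in the loops above could be factored into a small helper. This is only a sketch using the standard library: `time_ms` and `best_of_ms` are hypothetical names of mine, not OpenCV functions, and I use `std::chrono::steady_clock` because it is guaranteed to be monotonic, which `high_resolution_clock` is not.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto finish = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(finish - start).count();
}

// Hypothetical helper: runs `work` `runs` times and returns the fastest
// run, which reduces the influence of one-off noise on the measurement.
template <typename F>
double best_of_ms(F&& work, int runs) {
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i)
        best = std::min(best, time_ms(work));
    return best;
}
```

With something like this, a timed loop body could shrink to a single call such as `time_ms([&] { cv::Canny(ImageHostArr[n], Detected_EdgesHost[n], 2.0, 100.0, 3, false); })`.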
Here is what the benchmark printed on my machine:
Image Size: 49476 CPU Elapsed Time: 13.9905 msecs
Image Size: 198170 CPU Elapsed Time: 38.4235 msecs
Image Size: 446082 CPU Elapsed Time: 71.059 msecs
Image Size: 792680 CPU Elapsed Time: 103.162 msecs
Image Size: 1238230 CPU Elapsed Time: 141.263 msecs
Image Size: 1783530 CPU Elapsed Time: 165.636 msecs
Image Size: 2428048 CPU Elapsed Time: 195.356 msecs
Image Size: 3170720 CPU Elapsed Time: 246.407 msecs
Image Size: 4012344 CPU Elapsed Time: 300.643 msecs
Image Size: 4954250 CPU Elapsed Time: 334.725 msecs
Image Size: 5995374 CPU Elapsed Time: 367.368 msecs
Image Size: 7134120 CPU Elapsed Time: 422.822 msecs
Image Size: 8371818 CPU Elapsed Time: 468.351 msecs
Image Size: 9710330 CPU Elapsed Time: 546.653 msecs
Image Size: 11148060 CPU Elapsed Time: 589.476 msecs
Image Size: 12682880 CPU Elapsed Time: 617.778 msecs
Image Size: 14316652 CPU Elapsed Time: 682.61 msecs
Image Size: 16051770 CPU Elapsed Time: 784.524 msecs
Image Size: 17886106 CPU Elapsed Time: 802.988 msecs
Image Size: 19817000 CPU Elapsed Time: 829.102 msecs
Image Size: 21846846 CPU Elapsed Time: 912.721 msecs
Image Size: 23978570 CPU Elapsed Time: 954.053 msecs
Image Size: 26209512 CPU Elapsed Time: 978.438 msecs
Image Size: 28536480 CPU Elapsed Time: 1045.46 msecs
Image Size: 49476 GPU Elapsed Time: 1.8581 msecs
Image Size: 198170 GPU Elapsed Time: 2.1446 msecs
Image Size: 446082 GPU Elapsed Time: 3.8053 msecs
Image Size: 792680 GPU Elapsed Time: 4.8882 msecs
Image Size: 1238230 GPU Elapsed Time: 5.9607 msecs
Image Size: 1783530 GPU Elapsed Time: 6.7705 msecs
Image Size: 2428048 GPU Elapsed Time: 7.3428 msecs
Image Size: 3170720 GPU Elapsed Time: 8.3768 msecs
Image Size: 4012344 GPU Elapsed Time: 9.8166 msecs
Image Size: 4954250 GPU Elapsed Time: 12.5099 msecs
Image Size: 5995374 GPU Elapsed Time: 14.9313 msecs
Image Size: 7134120 GPU Elapsed Time: 17.6367 msecs
Image Size: 8371818 GPU Elapsed Time: 20.3713 msecs
Image Size: 9710330 GPU Elapsed Time: 23.8835 msecs
Image Size: 11148060 GPU Elapsed Time: 25.3751 msecs
Image Size: 12682880 GPU Elapsed Time: 28.7937 msecs
Image Size: 14316652 GPU Elapsed Time: 31.7389 msecs
Image Size: 16051770 GPU Elapsed Time: 35.7431 msecs
Image Size: 17886106 GPU Elapsed Time: 38.3026 msecs
Image Size: 19817000 GPU Elapsed Time: 39.8344 msecs
Image Size: 21846846 GPU Elapsed Time: 43.0583 msecs
Image Size: 23978570 GPU Elapsed Time: 45.6539 msecs
Image Size: 26209512 GPU Elapsed Time: 54.4576 msecs
Image Size: 28536480 GPU Elapsed Time: 49.9312 msecs
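To put a number on the gap, the per-image speedup is just the CPU time divided by the GPU time. A minimal sketch, where `speedups`, `kCpuMs` and `kGpuMs` are names I made up for illustration and the two data rows are the smallest and largest image sizes from the log above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-image speedup of the GPU over the CPU: cpu_ms[i] / gpu_ms[i].
std::vector<double> speedups(const std::vector<double>& cpu_ms,
                             const std::vector<double>& gpu_ms) {
    assert(cpu_ms.size() == gpu_ms.size());
    std::vector<double> out;
    out.reserve(cpu_ms.size());
    for (std::size_t i = 0; i < cpu_ms.size(); ++i)
        out.push_back(cpu_ms[i] / gpu_ms[i]);
    return out;
}

// Canny-only timings copied from the log above:
// first row (49476 pixels) and last row (28536480 pixels).
const std::vector<double> kCpuMs = {13.9905, 1045.46};
const std::vector<double> kGpuMs = {1.8581, 49.9312};
```

`speedups(kCpuMs, kGpuMs)` comes out at about 7.5x for the smallest image and about 20.9x for the largest, so on this machine the advantage grows with image size.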
cv::cuda is really that much faster than cv::?! Now with both functions (resize and Canny) inside the timed region:
std::cout << "Process started... \n" << std::endl;
// CPU: time resize + Canny together for each scale factor.
for (int n = 1; n < SIZE; n++) {
    auto start = std::chrono::high_resolution_clock::now();
    cv::resize(ImageHost, ImageHostArr[n], cv::Size(), 0.5*n, 0.5*n, cv::INTER_LINEAR);
    cv::Canny(ImageHostArr[n], Detected_EdgesHost[n], 2.0, 100.0, 3, false);
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_time = finish - start;
    File1 << "Image Size: " << ImageHostArr[n].rows* ImageHostArr[n].cols << " " << "CPU Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
}
// GPU: time resize + Canny together; the source image is already on the device.
cv::Ptr<cv::cuda::CannyEdgeDetector> canny_edg = cv::cuda::createCannyEdgeDetector(2.0, 100.0, 3, false);
for (int n = 1; n < SIZE; n++) {
    auto start = std::chrono::high_resolution_clock::now();
    cv::cuda::resize(ImageDev, ImageDevArr[n], cv::Size(), 0.5*n, 0.5*n, cv::INTER_LINEAR);
    canny_edg->detect(ImageDevArr[n], Detected_EdgesDev[n]);
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_time = finish - start;
    File2 << "Image Size: " << ImageDevArr[n].rows* ImageDevArr[n].cols << " " << "GPU Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
}
std::cout << "Process ended... \n" << std::endl;
Image Size: 49476 GPU Elapsed Time: 1.5971 msecs
Image Size: 198170 GPU Elapsed Time: 2.1869 msecs
Image Size: 446082 GPU Elapsed Time: 3.9316 msecs
Image Size: 792680 GPU Elapsed Time: 5.8947 msecs
Image Size: 1238230 GPU Elapsed Time: 6.8415 msecs
Image Size: 1783530 GPU Elapsed Time: 7.8679 msecs
Image Size: 2428048 GPU Elapsed Time: 8.4011 msecs
Image Size: 3170720 GPU Elapsed Time: 9.5377 msecs
Image Size: 4012344 GPU Elapsed Time: 11.3635 msecs
Image Size: 4954250 GPU Elapsed Time: 13.3181 msecs
Image Size: 5995374 GPU Elapsed Time: 16.5964 msecs
Image Size: 7134120 GPU Elapsed Time: 19.9122 msecs
Image Size: 8371818 GPU Elapsed Time: 22.7916 msecs
Image Size: 9710330 GPU Elapsed Time: 25.1661 msecs
Image Size: 11148060 GPU Elapsed Time: 28.3689 msecs
Image Size: 12682880 GPU Elapsed Time: 31.6261 msecs
Image Size: 14316652 GPU Elapsed Time: 34.7694 msecs
Image Size: 16051770 GPU Elapsed Time: 37.7313 msecs
Image Size: 17886106 GPU Elapsed Time: 39.5111 msecs
Image Size: 19817000 GPU Elapsed Time: 43.407 msecs
Image Size: 21846846 GPU Elapsed Time: 46.8648 msecs
Image Size: 23978570 GPU Elapsed Time: 47.9306 msecs
Image Size: 26209512 GPU Elapsed Time: 50.2719 msecs
Image Size: 28536480 GPU Elapsed Time: 53.922 msecs
Image Size: 49476 CPU Elapsed Time: 16.4558 msecs
Image Size: 198170 CPU Elapsed Time: 40.3942 msecs
Image Size: 446082 CPU Elapsed Time: 77.8448 msecs
Image Size: 792680 CPU Elapsed Time: 110.313 msecs
Image Size: 1238230 CPU Elapsed Time: 143.571 msecs
Image Size: 1783530 CPU Elapsed Time: 183.128 msecs
Image Size: 2428048 CPU Elapsed Time: 218.107 msecs
Image Size: 3170720 CPU Elapsed Time: 256.128 msecs
Image Size: 4012344 CPU Elapsed Time: 305.7 msecs
Image Size: 4954250 CPU Elapsed Time: 370.511 msecs
Image Size: 5995374 CPU Elapsed Time: 410.728 msecs
Image Size: 7134120 CPU Elapsed Time: 458.635 msecs
Image Size: 8371818 CPU Elapsed Time: 511.283 msecs
Image Size: 9710330 CPU Elapsed Time: 619.209 msecs
Image Size: 11148060 CPU Elapsed Time: 652.386 msecs
Image Size: 12682880 CPU Elapsed Time: 691.799 msecs
Image Size: 14316652 CPU Elapsed Time: 768.322 msecs
Image Size: 16051770 CPU Elapsed Time: 880.751 msecs
Image Size: 17886106 CPU Elapsed Time: 900.914 msecs
Image Size: 19817000 CPU Elapsed Time: 980.022 msecs
Image Size: 21846846 CPU Elapsed Time: 1037.32 msecs
Image Size: 23978570 CPU Elapsed Time: 1115.81 msecs
Image Size: 26209512 CPU Elapsed Time: 1123.15 msecs
Image Size: 28536480 CPU Elapsed Time: 1226.08 msecs
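Still much faster. Unfortunately, once the uploads and downloads between cv::Mat and cv::cuda::GpuMat are counted, cv::cuda turns out to be slower overall if you only process a few images: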
int main()
{
    cv::Mat ImageHost = cv::imread("E:\\CUDA\\imgs\\cv_cuda_testimg.png", cv::IMREAD_GRAYSCALE);
    if (ImageHost.empty()) {
        std::cerr << "Could not read the test image" << std::endl;
        return 1;
    }
    cv::Mat ImageHostArr[SIZE];
    cv::Mat Detected_EdgesHost[SIZE];
    std::ofstream File1, File2;
    File1.open("E:\\CUDA\\imgs\\canny_cpu.txt");
    File2.open("E:\\CUDA\\imgs\\canny_gpu.txt");
    std::cout << "Process started... \n" << std::endl;

    // CPU: resize + Canny per image.
    for (int n = 1; n <SIZE; n++) {
        auto start = std::chrono::high_resolution_clock::now();
        cv::resize(ImageHost, ImageHostArr[n], cv::Size(), 0.5*n, 0.5*n, cv::INTER_LINEAR);
        cv::Canny(ImageHostArr[n], Detected_EdgesHost[n], 2.0, 100.0, 3, false);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
        File1 << "Image Size: " << ImageHostArr[n].rows* ImageHostArr[n].cols << " " << "CPU Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
    }

    // GPU setup, timed once: container creation plus the host -> device upload.
    // The upload is the first CUDA call in the program, so this figure likely
    // also includes the one-time CUDA context initialization.
    auto start2 = std::chrono::high_resolution_clock::now();
    cv::cuda::GpuMat ImageDev;
    cv::cuda::GpuMat ImageDevArr[SIZE];
    ImageDev.upload(ImageHost);
    cv::cuda::GpuMat Detected_EdgesDev[SIZE];
    cv::Mat gpuresult[SIZE];
    auto finish2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_time2 = finish2 - start2;
    File2 << "GPU Elapsed Time: " << elapsed_time2.count() * 1000 << " msecs" << "\n" << std::endl;

    // GPU: resize + Canny + download of the result back to the host, per image.
    cv::Ptr<cv::cuda::CannyEdgeDetector> canny_edg = cv::cuda::createCannyEdgeDetector(2.0, 100.0, 3, false);
    for (int n = 1; n <SIZE; n++) {
        auto start = std::chrono::high_resolution_clock::now();
        cv::cuda::resize(ImageDev, ImageDevArr[n], cv::Size(), 0.5*n, 0.5*n, cv::INTER_LINEAR);
        canny_edg->detect(ImageDevArr[n], Detected_EdgesDev[n]);
        (Detected_EdgesDev[n]).download(gpuresult[n]);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
        File2 << "Image Size: " << ImageDevArr[n].rows* ImageDevArr[n].cols << " " << "GPU Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
    }
    std::cout << "Process ended... \n" << std::endl;
    return 0;
}
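The results show that the upload step took a long time (the second line below; since it is the first CUDA call in the program, that figure probably also includes one-time CUDA context initialization, not just the copy itself):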
CPU Elapsed Time: 16.2129 msecs
GPU Elapsed Time: 827.039 msecs
Image Size: 49476 GPU Elapsed Time: 1.3695 msecs
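So if you only process one or two images afterwards, it isn't worth it. That's why people online advise that, when using cv::cuda, you should convert between cv::Mat and cv::cuda::GpuMat only once instead of going back and forth repeatedly; otherwise the time saved by the cv::cuda functions cannot make up for the time spent on conversion, and the whole thing ends up slower.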
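The trade-off can be made concrete with a bit of arithmetic: the GPU pays off once setup + n * gpu < n * cpu, i.e. n > setup / (cpu - gpu). Below is a minimal sketch of that calculation. `break_even_images` is a name I made up, and it assumes the setup cost is paid exactly once and that every image costs the same, which only roughly holds here since the three numbers above are for the smallest image (49476 pixels).

```cpp
#include <cassert>
#include <cmath>

// Smallest number of images n for which
//     setup_ms + n * gpu_ms < n * cpu_ms
// Returns -1 if the GPU is not faster per image, so it never pays off.
long break_even_images(double setup_ms, double cpu_ms, double gpu_ms) {
    assert(setup_ms >= 0.0);
    const double saved_per_image = cpu_ms - gpu_ms;
    if (saved_per_image <= 0.0) return -1;
    // n must be strictly greater than setup_ms / saved_per_image.
    return static_cast<long>(std::floor(setup_ms / saved_per_image)) + 1;
}
```

Plugging in the measurements above, `break_even_images(827.039, 16.2129, 1.3695)` returns 56: at this image size, cv::cuda would need about 56 images before the one-time cost is recovered.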
Image Size: 49476 CPU Elapsed Time: 16.5117 msecs
Image Size: 198170 CPU Elapsed Time: 40.3025 msecs
Image Size: 446082 CPU Elapsed Time: 81.2121 msecs
Image Size: 792680 CPU Elapsed Time: 110.101 msecs
Image Size: 1238230 CPU Elapsed Time: 148.415 msecs
Image Size: 1783530 CPU Elapsed Time: 186.113 msecs
Image Size: 2428048 CPU Elapsed Time: 228.306 msecs
Image Size: 3170720 CPU Elapsed Time: 261.014 msecs
Image Size: 4012344 CPU Elapsed Time: 316.615 msecs
Image Size: 4954250 CPU Elapsed Time: 363.326 msecs
Image Size: 5995374 CPU Elapsed Time: 410.894 msecs
Image Size: 7134120 CPU Elapsed Time: 479.375 msecs
Image Size: 8371818 CPU Elapsed Time: 509.868 msecs
Image Size: 9710330 CPU Elapsed Time: 596.871 msecs
GPU Elapsed Time: 811.702 msecs
Image Size: 49476 GPU Elapsed Time: 1.5237 msecs
Image Size: 198170 GPU Elapsed Time: 2.2596 msecs
Image Size: 446082 GPU Elapsed Time: 3.7014 msecs
Image Size: 792680 GPU Elapsed Time: 5.3606 msecs
Image Size: 1238230 GPU Elapsed Time: 6.7137 msecs
Image Size: 1783530 GPU Elapsed Time: 7.9725 msecs
Image Size: 2428048 GPU Elapsed Time: 9.5008 msecs
Image Size: 3170720 GPU Elapsed Time: 11.3495 msecs
Image Size: 4012344 GPU Elapsed Time: 13.556 msecs
Image Size: 4954250 GPU Elapsed Time: 16.0509 msecs
Image Size: 5995374 GPU Elapsed Time: 19.5233 msecs
Image Size: 7134120 GPU Elapsed Time: 22.7719 msecs
Image Size: 8371818 GPU Elapsed Time: 26.4892 msecs
Image Size: 9710330 GPU Elapsed Time: 28.1691 msecs
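The same question can be asked of this run as a whole: after how many images does the cumulative GPU time, including the one-time 811.702 ms, drop below the cumulative CPU time? A minimal sketch, with `crossover` and the `kRun*` constants being hypothetical names and the numbers copied from the log above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the 1-based number of images after which the cumulative GPU time
// (one-time setup included) first drops below the cumulative CPU time,
// or 0 if that never happens within the data.
std::size_t crossover(const std::vector<double>& cpu_ms,
                      const std::vector<double>& gpu_ms,
                      double gpu_setup_ms) {
    assert(cpu_ms.size() == gpu_ms.size());
    double cpu_total = 0.0;
    double gpu_total = gpu_setup_ms;
    for (std::size_t i = 0; i < cpu_ms.size(); ++i) {
        cpu_total += cpu_ms[i];
        gpu_total += gpu_ms[i];
        if (gpu_total < cpu_total) return i + 1;
    }
    return 0;
}

// Per-image timings (ms) from the 14-image run above.
const std::vector<double> kRunCpuMs = {
    16.5117, 40.3025, 81.2121, 110.101, 148.415, 186.113, 228.306,
    261.014, 316.615, 363.326, 410.894, 479.375, 509.868, 596.871};
const std::vector<double> kRunGpuMs = {
    1.5237, 2.2596, 3.7014, 5.3606, 6.7137, 7.9725, 9.5008,
    11.3495, 13.556, 16.0509, 19.5233, 22.7719, 26.4892, 28.1691};
const double kRunSetupMs = 811.702;
```

`crossover(kRunCpuMs, kRunGpuMs, kRunSetupMs)` returns 8: in this run the GPU path is behind for the first seven images and only pulls ahead from the eighth onward, helped by the fact that the images keep getting larger.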
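I think cv::cuda really only suits the case where you upload and download once and, in between, run a great many functions, or many operations on many images.
That is indeed the case.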