1、算法簡述
實現矩陣相加:Cn = An + Bn。這個例子雖然很簡單,但是由於矩陣元素之間相互獨立,每個元素可以非常容易地進行並行計算,可以非常理想地在OpenCL中實現。
2. C/C++實現
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
#include <iostream>
using namespace std;
/**
* \brief Basic integer array addition implemented in C/C++.
* \details A sample which shows how to add two integer arrays and store the result in a third array.
* No OpenCL code is used in this sample, only standard C/C++. The code executes only on the CPU.
* \return The exit code of the application, non-zero if a problem occurred.
*/
int main(void)
{
/* [Setup memory] */
/* Number of elements in the arrays of input and output data. */
int arraySize = 1000000;
/* Arrays to hold the input and output data. */
int* inputA = new int[arraySize];
int* inputB = new int[arraySize];
int* output = new int[arraySize];
/* [Setup memory] */
/* Fill the arrays with data. */
for (int i = 0; i < arraySize; i++)
{
inputA[i] = i;
inputB[i] = i;
}
/* [C/C++ Implementation] */
for (int i = 0; i < arraySize; i++)
{
output[i] = inputA[i] + inputB[i];
}
/* [C/C++ Implementation] */
/* Uncomment the following block to print results. */
/*
for (int i = 0; i < arraySize; i++)
{
cout << "i = " << i << ", output = " << output[i] << "\n";
}
*/
delete[] inputA;
delete[] inputB;
delete[] output;
}
3 Open基本實現
3.1 內核代碼實現
內核代碼的實現如下,其中指針的修飾符restrict是C99中的關鍵字,只用於限定指針。該關鍵字用於告知編譯器,所有修改該指針所指向內容的操作全部都是基於該指針的,即不存在其它進行修改操作的途徑;這樣的後果是幫助編譯器進行更好的代碼優化,生成更有效率的彙編代碼。
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
/**
* \brief Hello World kernel function.
* \param[in] inputA First input array.
* \param[in] inputB Second input array.
* \param[out] output Output array.
*/
/* [OpenCL Implementation] */
__kernel void hello_world_opencl(__global int* restrict inputA,
__global int* restrict inputB,
__global int* restrict output)
{
/*
* Set i to be the ID of the kernel instance.
* If the global work size (set by clEnqueueNDRangeKernel) is n,
* then n kernels will be run and i will be in the range [0, n - 1].
*/
int i = get_global_id(0);
/* Use i as an index into the three arrays. */
output[i] = inputA[i] + inputB[i];
}
/* [OpenCL Implementation] */
3.2 宿主機代碼實現
內核代碼中並沒有循環語句,只計算一個矩陣元素的值,每一個實例獲得一個獨一無二的所以需要運行的內核實例數目等同於矩陣元素個數。
/*
* Each instance of our OpenCL kernel operates on a single element of each array so the number of
* instances needed is the number of elements in the array.
*/
size_t globalWorksize[1] = {arraySize};
/* Enqueue the kernel */
if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
因爲我們並沒有設置內核間的依賴性,OpenCL設備可以用並行的方式自由地運行內核實例。現在並行化上的唯一限制是設備的容量。在前面的代碼運行之前,需要建立OpenCL,下面分別介紹與建立OpenCL相關的各項內容。
因爲現在的操作是在GPU而不是CPU中,我們需要知道任何使用數據的位置。知道數據是在GPU內存空間還是CPU內存空間是非常重要的。在桌面系統中,GPU和CPU有它們自己的內存空間,被相對低速率的總線分開,這意味着在GPU和CPU之間共享數據是一個代價高昂的操作。在大多數帶Mali-T600系列GPU的嵌入式系統中,GPU和CPU共享同一個內存,因此這使得以相對低的代價共享GPU和CPU之間內存成爲可能。
由於這些系統的差異,OpenCL支持多種分配和共享設備間內存的方式。下面是一種共享設備間內存的方式,目的是減少從一個設備到另一個設備的內存拷貝(在一個共享內存系統中)。
a. 要求OpenCL設備分配內存
在C/C++實現中,我們使用數組來分配內存。
/* Number of elements in the arrays of input and output data. */
int arraySize = 1000000;
/* Arrays to hold the input and output data. */
int* inputA = new int[arraySize];
int* inputB = new int[arraySize];
int* output = new int[arraySize];
在OpenCL中,我們使用內存緩衝區。內存緩衝區其實是一定大小的內存塊。爲了分配緩衝區,我們如下做:
/* Number of elements in the arrays of input and output data. */
cl_int arraySize = 1000000;
/* The buffers are the size of the arrays. */
size_t bufferSize = arraySize * sizeof(cl_int);
/*
* Ask the OpenCL implementation to allocate buffers for the data.
* We ask the OpenCL implemenation to allocate memory rather than allocating
* it on the CPU to avoid having to copy the data later.
* The read/write flags relate to accesses to the memory from within the kernel.
*/
bool createMemoryObjectsSuccess = true;
memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!createMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
儘管這看上去更加複雜,但其實這裏只有三個OpenCL API調用。唯一的區別是這裏我們檢查錯誤(這是一個好的做法),而C++中並不用做。b. 映射內存到局部指針
現在內存已分配,但是隻有OpenCL實現知道它的位置。爲了訪問CPU上的內存,我們把它們映射到一個指針。
/* Map the memory buffers created by the OpenCL implementation to pointers so we can access them on the CPU. */
bool mapMemoryObjectsSuccess = true;
cl_int* inputA = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[0], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
cl_int* inputB = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[1], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!mapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to map buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
現在這些指針可以想普通的C/C++指針那樣使用了。
c. 在CPU上初始化數據
因爲我們已有了指向內存的指針,這一步與在CPU上一樣。
for (int i = 0; i < arraySize; i++)
{
inputA[i] = i;
inputB[i] = i;
}
d. 取消映射緩衝區
爲了使OpenCL設備使用緩衝區,我們必須把它們在CPU上的映射取消。
/*
* Unmap the memory objects as we have finished using them from the CPU side.
* We unmap the memory because otherwise:
* - reads and writes to that memory from inside a kernel on the OpenCL side are undefined.
* - the OpenCL implementation cannot free the memory when it is finished.
*/
if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[0], inputA, 0, NULL, NULL)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[1], inputB, 0, NULL, NULL)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
e. 映射數據到內核
在我們調度內核運行之前,我們必須告訴內核哪些數據作爲輸入使用。這裏,我們映射內存對象到OpenCL內核函數的參數中。
bool setKernelArgumentsSuccess = true;
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 1, sizeof(cl_mem), &memoryObjects[1]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[2]));
if (!setKernelArgumentsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
f. 運行內核
對於內核代碼見前面,如何調度它則不作詳述。
g. 獲取運行結果
一旦計算結束,我們像映射輸入緩衝區那樣映射輸出緩衝區。然後,我們就可以使用指針讀取結果數據,然後取消緩衝區映射,就像前面那樣。
基本實現的宿主機的完整代碼如下:
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
#include "common.h"
#include "image.h"
#include <CL/cl.h>
#include <iostream>
using namespace std;
/**
* \brief Basic integer array addition implemented in OpenCL.
* \details A sample which shows how to add two integer arrays and store the result in a third array.
* The main calculation code is in an OpenCL kernel which is executed on a GPU device.
* \return The exit code of the application, non-zero if a problem occurred.
*/
int main(void)
{
cl_context context = 0;
cl_command_queue commandQueue = 0;
cl_program program = 0;
cl_device_id device = 0;
cl_kernel kernel = 0;
int numberOfMemoryObjects = 3;
cl_mem memoryObjects[3] = {0, 0, 0};
cl_int errorNumber;
if (!createContext(&context))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create an OpenCL context. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!createCommandQueue(context, &commandQueue, &device))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create the OpenCL command queue. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!createProgram(context, device, "assets/hello_world_opencl.cl", &program))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL program." << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
kernel = clCreateKernel(program, "hello_world_opencl", &errorNumber);
if (!checkSuccess(errorNumber))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Setup memory] */
/* Number of elements in the arrays of input and output data. */
cl_int arraySize = 1000000;
/* The buffers are the size of the arrays. */
size_t bufferSize = arraySize * sizeof(cl_int);
/*
* Ask the OpenCL implementation to allocate buffers for the data.
* We ask the OpenCL implemenation to allocate memory rather than allocating
* it on the CPU to avoid having to copy the data later.
* The read/write flags relate to accesses to the memory from within the kernel.
*/
bool createMemoryObjectsSuccess = true;
memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!createMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Setup memory] */
/* [Map the buffers to pointers] */
/* Map the memory buffers created by the OpenCL implementation to pointers so we can access them on the CPU. */
bool mapMemoryObjectsSuccess = true;
cl_int* inputA = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[0], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
cl_int* inputB = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[1], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!mapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to map buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Map the buffers to pointers] */
/* [Initialize the input data] */
for (int i = 0; i < arraySize; i++)
{
inputA[i] = i;
inputB[i] = i;
}
/* [Initialize the input data] */
/* [Un-map the buffers] */
/*
* Unmap the memory objects as we have finished using them from the CPU side.
* We unmap the memory because otherwise:
* - reads and writes to that memory from inside a kernel on the OpenCL side are undefined.
* - the OpenCL implementation cannot free the memory when it is finished.
*/
if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[0], inputA, 0, NULL, NULL)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[1], inputB, 0, NULL, NULL)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Un-map the buffers] */
/* [Set the kernel arguments] */
bool setKernelArgumentsSuccess = true;
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 1, sizeof(cl_mem), &memoryObjects[1]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[2]));
if (!setKernelArgumentsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Set the kernel arguments] */
/* An event to associate with the Kernel. Allows us to retrieve profiling information later. */
cl_event event = 0;
/* [Global work size] */
/*
* Each instance of our OpenCL kernel operates on a single element of each array so the number of
* instances needed is the number of elements in the array.
*/
size_t globalWorksize[1] = {arraySize};
/* Enqueue the kernel */
if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Global work size] */
/* Wait for kernel execution completion. */
if (!checkSuccess(clFinish(commandQueue)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed waiting for kernel execution to finish. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Print the profiling information for the event. */
printProfilingInfo(event);
/* Release the event object. */
if (!checkSuccess(clReleaseEvent(event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed releasing the event object. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Get a pointer to the output data. */
cl_int* output = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[2], CL_TRUE, CL_MAP_READ, 0, bufferSize, 0, NULL, NULL, &errorNumber);
if (!checkSuccess(errorNumber))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to map buffer. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* [Output the results] */
/* Uncomment the following block to print results. */
/*
for (int i = 0; i < arraySize; i++)
{
cout << "i = " << i << ", output = " << output[i] << "\n";
}
*/
/* [Output the results] */
/* Unmap the memory object as we are finished using them from the CPU side. */
if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[2], output, 0, NULL, NULL)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Release OpenCL objects. */
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
}
4 向量化你的OpenCL代碼
4.1 向量基礎
OpenCL設備可以通告它們爲不同數據類型的首選向量寬度,你可以使用這個信息來選擇一個內核。結果是,相當於該內核爲你正在運行的平臺做了優化。例如,一個設備可能僅有標量整數的硬件支持,而另一個設備則有寬度爲4的整數向量的硬件支持。可以寫兩個版本的內核,一個用於標量,一個用於向量,在運行時選擇正確的版本。
這裏是一個在特定設備上詢問首選整數向量寬度的例子。
/*
* Query the device to find out it's prefered integer vector width.
* Although we are only printing the value here, it can be used to select between
* different versions of a kernel.
*/
cl_uint integerVectorWidth;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, sizeof(cl_uint), &integerVectorWidth, NULL);
cout << "Prefered vector width for integers: " << integerVectorWidth << endl;
對於其它OpenCL數據類型也是一樣的。 每一個Mali T600系列GPU核最少有兩個128位寬度的ALU(算數邏輯單元),它們具有矢量計算能力。ALU中的絕大多數操作(例如,浮點加,浮點乘,整數加,整數乘),可以以128位向量數據操作(例如,char16,
short8, int4, float4)。使用前面講述的詢問方法來爲你的數據類型決定使用正確的向量大小。
當使用Mali
T600系列GPU時,我們推薦在任何可能的地方使用向量。
4.2 向量化代碼
首先,修改內核代碼以支持向量運算。對於Mali T600系列GPU來說,一個向量運算的時間與一個整數加法的時間是一樣的。具體代碼解讀,見下面代碼中的註釋部分。
__kernel void hello_world_vector(__global int* restrict inputA,
__global int* restrict inputB,
__global int* restrict output)
{
/*
* We have reduced the global work size (n) by a factor of 4 compared to the hello_world_opencl sample.
* Therefore, i will now be in the range [0, (n / 4) - 1].
*/
int i = get_global_id(0);
/*
* Load 4 integers into 'a'.
* The offset calculation is implicit from the size of the vector load.
* For vloadN(i, p), the address of the first data loaded would be p + i * N.
* Load from the data from the address: inputA + i * 4.
*/
int4 a = vload4(i, inputA);
/* Do the same for inputB */
int4 b = vload4(i, inputB);
/*
* Do the vector addition.
* Store the result at the address: output + i * 4.
*/
vstore4(a + b, i, output);
}
由於現在每個內核實例能夠實現多個加法運算,所以必須減少內核實例的數量,在宿主機代碼中的修改部分如下所示。
/*
* Each instance of our OpenCL kernel now operates on 4 elements of each array so the number of
* instances needed is the number of elements in the array divided by 4.
*/
size_t globalWorksize[1] = {arraySize / 4};
/* Enqueue the kernel */
if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
折減係數基於向量的寬度,例如,如果我們在內核中使用int8代替int4,折減係數此時則爲8。5 運行OpenCL樣例
(1). 在SDK根目錄的命令行提示符中
cd samples\hello_world_vector
cs-make install
這樣就編譯了向量化的OpenCL hello world樣例,拷貝了所有運行時需要的文件到SDK根目錄下的bin文件夾中。
(2) . 拷貝bin文件夾到目標板中
(3). 在板子上導航到該目錄,運行hello world二進制文件
chmod 777 hello_world_vector
./hello_world_vector