CUDA vs OpenCL performance on empty kernel
While measuring the performance of the same kernel on CUDA and OpenCL, I found something strange.
When I leave my kernel completely empty, with no input parameters and no calculation, CUDA gives me very poor performance compared to OpenCL.
CUDA kernel:
__global__ void kernel_empty() {}
CUDA host:
kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();
OpenCL kernel:
__attribute__((reqd_work_group_size(8, 8, 1)))
__kernel void kernel_empty() {}
OpenCL host:
cl_event perf_event;
size_t global_work_offset[3] = {0, 0, 0};
size_t global_work_size[3] = {10000, 10000, 1};
size_t local_work_size[3] = {8, 8, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, global_work_offset, global_work_size, local_work_size, 0, NULL, &perf_event);
Results: OpenCL returns 6 ms; CUDA returns 390 ms.
- The kernel works correctly on both APIs, because I am using them to calculate my stuff.
- There are no error codes on either side.
- Visual Studio 2010 is used, release mode.
- OpenCL 1.1 lib from NVIDIA GPU Computing Toolkit 5.5.
- CUDA lib 5.5 from NVIDIA GPU Computing Toolkit.
- The timings are also correct; I double-checked them with a CPU timer. With a huge grid, you can also see that CUDA is slow without any timer.
- The OpenCL timing uses clGetEventProfilingInfo.
- The CUDA timing uses cudaEventElapsedTime.
- The tests were run on the same PC with an NVIDIA Quadro K4000.
Can anyone explain why there is such a big difference?
In OpenCL, you specify the global work size (the total number of work-items to launch) and the local work size (the work-group size). In your example, you are launching 10000x10000 work-items in groups of 8x8.
In CUDA, you specify the block size (which corresponds to the work-group size) and the grid size, which is the number of blocks to launch. This means that your CUDA example launches 10000x10000 blocks, for a total of 80000x80000 CUDA threads.
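The arithmetic behind this can be checked on the host. The following is a minimal sketch in plain C++; the `Dim3` struct and both helper functions are hypothetical names introduced here for illustration and are not part of the CUDA or OpenCL APIs.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 3D size; not the CUDA dim3 type.
struct Dim3 { std::uint64_t x, y, z; };

// CUDA: the grid counts blocks, so each dimension holds grid * block threads.
constexpr std::uint64_t cuda_total_threads(Dim3 grid, Dim3 block) {
    return (grid.x * block.x) * (grid.y * block.y) * (grid.z * block.z);
}

// OpenCL: global_work_size already counts individual work-items.
constexpr std::uint64_t opencl_total_work_items(Dim3 global) {
    return global.x * global.y * global.z;
}

// Launch shapes taken from the question.
static_assert(cuda_total_threads({10000, 10000, 1}, {8, 8, 1}) == 6400000000ULL,
              "CUDA launch: 80000 x 80000 threads");
static_assert(opencl_total_work_items({10000, 10000, 1}) == 100000000ULL,
              "OpenCL enqueue: 10000 x 10000 work-items");
static_assert(cuda_total_threads({10000, 10000, 1}, {8, 8, 1}) ==
                  64 * opencl_total_work_items({10000, 10000, 1}),
              "the CUDA launch is 64 times larger");
```

The CUDA launch is 64 times larger than the OpenCL one, which is in line with the reported 390 ms versus 6 ms (a factor of about 65).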
So this CUDA kernel launch:
kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();
is equivalent to this OpenCL kernel enqueue:
size_t global_work_size[3] = {80000, 80000, 1};
size_t local_work_size[3] = {8, 8, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &perf_event);
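Going the other way, to make the CUDA launch do the same amount of work as the original OpenCL enqueue, the grid size would be the global work size divided by the block size in each dimension. A minimal sketch of that conversion in plain C++ follows; `Extent3`, `div_up` and `grid_for_global_size` are hypothetical helper names, not API functions.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 3D size; not the CUDA dim3 type.
struct Extent3 { std::size_t x, y, z; };

// Round-up division, so a partial block is still launched when the
// global size is not a multiple of the block size.
constexpr std::size_t div_up(std::size_t n, std::size_t d) {
    return (n + d - 1) / d;
}

// Number of CUDA blocks needed to cover an OpenCL-style global work size.
constexpr Extent3 grid_for_global_size(Extent3 global, Extent3 local) {
    return Extent3{div_up(global.x, local.x),
                   div_up(global.y, local.y),
                   div_up(global.z, local.z)};
}

// Sizes from the question: 10000 x 10000 work-items in 8 x 8 groups.
constexpr Extent3 matched_grid =
    grid_for_global_size({10000, 10000, 1}, {8, 8, 1});
static_assert(matched_grid.x == 1250 && matched_grid.y == 1250 && matched_grid.z == 1,
              "a 1250 x 1250 grid of 8 x 8 blocks covers 10000 x 10000 threads");
```

Under these assumptions, a launch of the form kernel_empty<<<dim3(1250, 1250, 1), dim3(8, 8, 1)>>>() would start the same 10000x10000 threads as the OpenCL example in the question, giving a like-for-like timing comparison.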