CUDA vs OpenCL performance on an empty kernel


While measuring the performance of the same kernel in CUDA and OpenCL, I came across something strange.

When I leave the kernel completely empty, with no input parameters and no calculations, CUDA gives me very poor performance compared to OpenCL.

CUDA kernel:

  __global__ void kernel_empty() {}

CUDA host:

  kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();

OpenCL kernel:

  __attribute__((reqd_work_group_size(8, 8, 1)))
  __kernel void kernel_empty() {}

OpenCL Host:

  cl_event perf_event;
  size_t global_work_offset[3] = {0, 0, 0};
  size_t global_work_size[3] = {10000, 10000, 1};
  size_t local_work_size[3] = {8, 8, 1};
  clEnqueueNDRangeKernel(queue, kernel, 3, global_work_offset, global_work_size, local_work_size, 0, NULL, &perf_event);

OpenCL takes 6 ms.

CUDA takes 390 ms.

  • The kernel works correctly in both APIs; I use it for my actual calculations.

  • Neither side returns an error code.

  • Visual Studio 2010 is used, in release mode.

  • The OpenCL 1.1 lib from the NVIDIA GPU Computing Toolkit 5.5 is used.

  • The CUDA 5.5 lib from the NVIDIA GPU Computing Toolkit is used.

  • The timing is also correct; I have double-checked it with a CPU timer. Besides, with a huge grid you can see that CUDA takes ages even without any timer.

    clGetEventProfilingInfo is used for OpenCL.

    cudaEventElapsedTime is used for CUDA (see the timing sketch after this list).

  • The tests were run on the same PC, with an NVIDIA Quadro K4000.
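For reference, the CUDA side is timed roughly like this; a minimal sketch, since the exact surrounding code is not shown above, and the event variable names are illustrative:

  // Minimal CUDA event timing sketch (names are illustrative).
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);                                    // mark start on the stream
  kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();  // the measured launch
  cudaEventRecord(stop);                                     // mark end on the stream
  cudaEventSynchronize(stop);                                // wait until the kernel has finished

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);                    // elapsed time in milliseconds
  cudaEventDestroy(start);
  cudaEventDestroy(stop);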

Can anyone explain why there is such a big difference?

In OpenCL, you specify the global work size (the total number of work-items to launch) and the local work size (the work-group size). In your example, you are launching 10000 * 10000 work-items, in groups of 8x8.

In CUDA, you specify the block size (corresponding to the work-group size) and the grid size, which is how many blocks to launch. That means your CUDA example is launching 10000x10000 blocks, which is a total of 80000x80000 CUDA threads (6.4 billion, compared to the 100 million work-items in the OpenCL call).

So, this CUDA kernel launch:

  kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();

is equivalent to this OpenCL kernel enqueue:

  size_t global_work_size[3] = {80000, 80000, 1};
  size_t local_work_size[3] = {8, 8, 1};
  clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &perf_event);
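Conversely, if the intent was to launch the same 10000x10000 work-items as the original OpenCL call, the CUDA grid has to count blocks rather than threads. A minimal sketch, assuming the total work size stays 10000x10000 and remains divisible by the 8x8 block:

  // Launch the same 10000x10000 threads as the OpenCL example:
  // the grid counts blocks, so divide the total work size by the block size.
  dim3 block(8, 8, 1);
  dim3 grid(10000 / block.x, 10000 / block.y, 1);  // 1250 x 1250 blocks
  kernel_empty<<<grid, block>>>();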
