Summary and Conclusions

The GPU doesn’t guarantee a shorter execution time. On the one hand, the OpenCL kernel is compiled just in time, which adds overhead; on the other, the data must first be copied to GPU RAM, which takes additional time. For special cases (large convolution kernels) and large volumes of data, you can still save time without even considering optimization strategies.

Far larger speed boosts are possible if you optimize the kernel functions. The threads in a work group share local memory, which is three orders of magnitude faster than the global GPU RAM. In the naive kernel, the elements of the convolution kernel matrix are fetched from global memory on every access. If, instead, the elements were loaded once per work group into local memory, the kernel would leverage the video card’s potential more efficiently. Additionally, image convolution offers further potential for optimization if you restrict the problem to separable kernels. However, I purposely did without improvements of this kind to keep the problem simple and provide an easier entry into OpenCL.

At the same time, you can view this article as a guide that will help you solve problems by running portions of your programs on the video card. For more in-depth information, I recommend the NVidia OpenCL Programming Guide [13], which investigates the video card’s hardware architecture, as well as the sample code in the ATI and NVidia SDKs. OpenCL developers will not want to be without the OpenCL specification [16] and the documentation for the C++ bindings [17].

Info
[1] Wikipedia SIMD:

The Author

Markus Roth is a student of Computer Science at the Karlsruhe Institute of Technology (KIT), Germany, where he is researching GPU-supported acceleration in computer vision at the Institute of Anthropomatics.
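The separable-kernel optimization mentioned in the summary can be illustrated on the CPU before moving it to the GPU. The following is a minimal sketch, not the article's code: the `Image` type, all function names, and the choice to clamp coordinates at the image border are assumptions made for illustration. It shows that a 2D convolution whose kernel is the outer product of a column vector and a row vector gives the same result as two 1D passes, which cuts the work per pixel from k*k to 2*k multiplications for a k x k kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major grayscale image (hypothetical helper type, not from the article).
struct Image {
    int width;
    int height;
    std::vector<float> data;

    Image(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0.0f) {}

    // Reads a pixel; coordinates outside the image are clamped to the border
    // (an assumed border policy).
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Direct 2D convolution with a full (2r+1) x (2r+1) kernel:
// k*k multiplications per pixel.
Image convolveFull(const Image& src, const std::vector<float>& kernel, int r) {
    const int k = 2 * r + 1;
    Image dst(src.width, src.height);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int j = -r; j <= r; ++j)
                for (int i = -r; i <= r; ++i)
                    sum += kernel[(j + r) * k + (i + r)] * src.at(x - i, y - j);
            dst.data[static_cast<std::size_t>(y) * src.width + x] = sum;
        }
    }
    return dst;
}

// One 1D convolution pass along x (horizontal == true) or along y.
Image convolve1D(const Image& src, const std::vector<float>& taps, bool horizontal) {
    const int r = static_cast<int>(taps.size()) / 2;
    Image dst(src.width, src.height);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int t = -r; t <= r; ++t)
                sum += taps[t + r] * (horizontal ? src.at(x - t, y) : src.at(x, y - t));
            dst.data[static_cast<std::size_t>(y) * src.width + x] = sum;
        }
    }
    return dst;
}

// Separable convolution: a row pass followed by a column pass,
// 2*k multiplications per pixel.
Image convolveSeparable(const Image& src, const std::vector<float>& row,
                        const std::vector<float>& col) {
    return convolve1D(convolve1D(src, row, true), col, false);
}

// Builds the full 2D kernel as the outer product col * row^T.
std::vector<float> outerProduct(const std::vector<float>& col,
                                const std::vector<float>& row) {
    std::vector<float> kernel(col.size() * row.size());
    for (std::size_t j = 0; j < col.size(); ++j)
        for (std::size_t i = 0; i < row.size(); ++i)
            kernel[j * row.size() + i] = col[j] * row[i];
    return kernel;
}

// Largest absolute per-pixel difference between two images of equal size.
float maxAbsDiff(const Image& a, const Image& b) {
    float m = 0.0f;
    for (std::size_t i = 0; i < a.data.size(); ++i)
        m = std::max(m, std::fabs(a.data[i] - b.data[i]));
    return m;
}
```

Calling `convolveFull(img, outerProduct(col, row), r)` and `convolveSeparable(img, row, col)` on the same image should agree up to floating-point rounding. On the GPU, each of the two 1D passes would presumably become its own kernel launch, so the saving only pays off once the convolution kernel is large enough to outweigh the second pass over the image.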
