To tackle this challenge, we propose HTrOP, a compilation approach and prototype implementation. HTrOP automatically analyzes a sequential CPU application, detects computational hotspots, and generates parallel OpenCL host and kernel code. We demonstrate its potential by offloading hotspots to different OpenCL-enabled resources (currently CPU, GPGPU and the manycore Intel Xeon Phi). Our contributions include:
Automatic transformation of suitable data-parallel loops into independent OpenCL-typical work-items that are executed in parallel.
A two-layered approach of identifying hotspots at compile time and refining offloading decisions at runtime based on parameters like input sizes, availability of accelerators, etc.
Infrastructure for offloading to and migrating between accelerators, while minimizing data transfer overheads by reusing data through application-specific, generated code parts.
A thorough evaluation of performance gains and energy savings with different accelerator targets, taking into account one-time and recurring overheads introduced by our approach. The evaluation includes a comparison to handwritten pragma-based OpenACC code for multicore CPUs and GPUs.
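To illustrate the first contribution, the following minimal sketch shows how a data-parallel loop can map onto independent work-items. It is not code generated by HTrOP: the SAXPY loop and the names saxpy_sequential, saxpy_work_item and saxpy_ndrange are hypothetical and chosen for illustration only. The corresponding OpenCL C kernel appears in a comment, and the work-item index space is emulated with a host loop so that the sketch runs as plain C.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential hotspot as it might appear in a CPU application
 * (hypothetical example, not taken from HTrOP). Every iteration
 * reads and writes only element i, so the loop is data-parallel. */
void saxpy_sequential(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Body of one work-item. The equivalent OpenCL C kernel would read:
 *
 *   __kernel void saxpy(const float a, __global const float *x,
 *                       __global float *y, const unsigned int n)
 *   {
 *       size_t gid = get_global_id(0);
 *       if (gid < n) y[gid] = a * x[gid] + y[gid];
 *   }
 *
 * Here the global ID is passed in explicitly so the sketch is plain C. */
void saxpy_work_item(size_t gid, size_t n, float a, const float *x, float *y)
{
    if (gid < n) { /* guard: the global size may be padded beyond n */
        y[gid] = a * x[gid] + y[gid];
    }
}

/* Emulates a one-dimensional NDRange launch on the host. An OpenCL
 * runtime would execute these work-items in parallel; the order is
 * irrelevant because no work-item touches another's element. */
void saxpy_ndrange(size_t global_size, size_t n, float a,
                   const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t gid = 0; gid < global_size; ++gid) {
        saxpy_work_item(gid, n, a, x, y);
    }
}
```

The loop counter i becomes the work-item's global ID, and the loop bound becomes a guard inside the kernel, which allows the global size to be rounded up to a multiple of the work-group size.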
The source code of our prototype implementation is available at github.com/pc2/htrop.