Introduction
The Intel® C++ Compiler can offload existing C/C++ data-parallel code to Intel(R) Processor Graphics with very few source code changes. This article shows how the compiler's offloading feature helps improve the performance of a Mandelbrot algorithm with only minor program changes.
Version
Intel(R) C++ Compiler 15.0
Solution
You may download code samples for Mandelbrot from https://software.intel.com/en-us/code-samples/intel-c-compiler
With Intel® Cilk™ Plus, the hotspot loop nest can be parallelized on the CPU as follows:
cilk_for (int j = 0; j < height; ++j) {
    #pragma simd
    for (int i = 0; i < width; ++i) {
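For reference, the serial, scalar baseline that this loop nest parallelizes might look like the following. This is only a minimal sketch in standard C++, not the code from the downloadable sample: the names mandel_depth and serial_mandelbrot, the float coordinate type, and the choice to store the raw iteration count in each pixel are all assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not from the article): escape-time iteration count
// for the point c = (cx, cy). Counts iterations of z = z*z + c performed
// while |z| <= 2, capped at max_depth.
static int mandel_depth(float cx, float cy, int max_depth) {
    float zx = 0.0f, zy = 0.0f;
    int depth = 0;
    while (depth < max_depth && zx * zx + zy * zy <= 4.0f) {
        const float next_zx = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = next_zx;
        ++depth;
    }
    return depth;
}

// Serial, scalar rendering of the rectangle [x0, x1) x [y0, y1) into a
// row-major buffer of width * height bytes. Assumes max_depth <= 255 so
// that the iteration count fits in one unsigned char.
void serial_mandelbrot(float x0, float y0, float x1, float y1,
                       int width, int height, int max_depth,
                       unsigned char* output) {
    assert(width > 0 && height > 0);
    assert(max_depth > 0 && max_depth <= 255);
    const float dx = (x1 - x0) / width;
    const float dy = (y1 - y0) / height;
    for (int j = 0; j < height; ++j) {       // outer loop: rows
        for (int i = 0; i < width; ++i) {    // inner loop: pixels in a row
            const int depth = mandel_depth(x0 + i * dx, y0 + j * dy, max_depth);
            output[j * width + i] = static_cast<unsigned char>(depth);
        }
    }
}
```

Every pixel is computed independently of the others, which is why the outer loop can be handed to cilk_for and the inner loop to SIMD lanes without changing the result.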
To make use of the compute units in Intel(R) Processor Graphics, we can use the compiler's offloading feature with simple changes:
#pragma offload target(gfx) pin(output: length(width * height * sizeof(unsigned char)))
cilk_for (int j = 0; j < height; ++j) {
    cilk_for _Simd (int i = 0; i < width; ++i) {
Here the pin clause declares that the output array is shared between the target and the host. Using pin substantially reduces the cost of offloading: instead of copying data to or from memory accessible by the target, the pin clause arranges for the host and the target to share the same physical memory area, which is much faster.
Using the "cilk_for _Simd" keywords on the inner (second) loop makes full use of the available parallelism while keeping SIMD vectorization.
We can improve the program further by using both CPU and GPU compute units in parallel. In the example below, the function cilk_simd_mandelbrot_execute executes on the CPU only, and the call to offload_simd_mandelbrot_execute runs on the target integrated GPU. By using Intel® Cilk™ Plus cilk_spawn, we can make the two tasks run in parallel:
cilk_spawn cilk_simd_mandelbrot_execute(x0, y0, x1, y0 + (y1-y0)*cpu_share,
    width, (int)height*cpu_share, max_depth, output);
offload_simd_mandelbrot_execute(x0, y0 + (y1-y0)*cpu_share, x1, y1,
    width, height-height*cpu_share, max_depth,
    output + (int)(height*cpu_share)*width);
cilk_sync;
Here cpu_share is the fraction of the workload (between 0 and 1) that runs on the CPU; the remainder runs on the GPU.
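The arithmetic behind this split can be sketched in standard C++ as below. The Band and Split types and the split_rows function are hypothetical names introduced only for this illustration; they are not part of the sample code. The sketch also makes two assumptions that differ slightly from the call above: the GPU row count is taken as the remainder after the CPU rows, and the boundary y coordinate is derived from the CPU row count. Both agree with the expressions in the call when height * cpu_share is a whole number, and they keep every row covered when it is not.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptor for one horizontal band of the image.
struct Band {
    float y_begin;       // y coordinate where the band starts
    float y_end;         // y coordinate where the band ends
    int rows;            // number of image rows in the band
    std::size_t offset;  // index of the band's first pixel in the output buffer
};

// The CPU takes the top band, the GPU takes the rest.
struct Split {
    Band cpu;
    Band gpu;
};

// Divide `height` rows between CPU and GPU according to cpu_share in [0, 1].
Split split_rows(float y0, float y1, int width, int height, float cpu_share) {
    assert(width > 0 && height > 0);
    assert(cpu_share >= 0.0f && cpu_share <= 1.0f);

    // Truncate toward zero, like the (int)(height*cpu_share) cast in the article.
    const int cpu_rows = static_cast<int>(height * cpu_share);

    // Boundary between the two bands, aligned to a whole row.
    const float y_mid = y0 + (y1 - y0) * cpu_rows / height;

    Split s;
    s.cpu = Band{y0, y_mid, cpu_rows, 0};
    s.gpu = Band{y_mid, y1, height - cpu_rows,
                 static_cast<std::size_t>(cpu_rows) * width};
    return s;
}
```

With cpu_share set to 0.5 and a height of 2048 rows, each device receives 1024 rows and the GPU band starts writing at row 1024 of the shared output buffer, so the two tasks never touch the same pixels and can safely run concurrently.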
Performance Results
We measured the performance on a laptop with the following processor: Intel(R) Core(TM) i5-4300U CPU @ 1.90 GHz 2.50 GHz.
To calculate a 1024*2048 image, we achieved 2.4x the performance of the previous fastest version, which ran on processor cores only, and a gain of more than 10x over the original serial, scalar Mandelbrot. In this test, cpu_share was set to 0.5 to divide the work evenly between processor cores and graphics.
References
https://software.intel.com/en-us/articles/using-intelr-c-compiler-with-intelr-processor-graphics
https://software.intel.com/en-us/articles/code-generation-options-for-intelr-graphics-technology