The following tuning tips can help you get better performance from kernels offloaded to the processor graphics:
- Offloaded loop nests must have enough iterations to occupy all hardware threads available on the processor graphics. Using perfectly nested parallel _Cilk_for loops allows parallelization across the dimensions of the parallel loop nest.
- Pragmas and code restructuring can help get offloaded code vectorized.
- Using __restrict and __assume_aligned keywords may help vectorization too.
- Using the pin clause of the offload pragma will eliminate data copying to/from the GPU.
- Scalar memory accesses are much less efficient than vector accesses. Using Cilk Plus array notation for memory accesses may help vectorize computation. A single memory access can handle up to 128 bytes. Gather/scatter operations on 4-byte elements are quite efficient, but those on 2-byte elements are slower. Gather/scatter operations may result from array sections with non-unit strides.
- Restructuring code to use an SOA (structure of arrays) data layout usually helps the compiler generate faster block memory accesses instead of gather/scatter operations.
- The compiler attempts to allocate all qualifying local variables in the register file. The main conditions are that the local variable must be small enough (less than 3K) and that its address is not passed to a function or assigned to a global pointer. Fully unrolling a for loop that indexes such a variable turns the accesses into a series of faster direct register accesses. On the other hand, excessive unrolling may cause code bloat that hits the current kernel size limit of about 250KB.
- When the register space overflows, the JIT compiler used at run time may still spill some register variables to memory, which destroys performance, so cache only ‘hot’ data in local arrays.
- Try changing the maximum number of threads by setting the GFX_MAX_THREAD_COUNT environment variable.
- Try different combinations of width and height for the number of threads within a thread group by setting the GFX_THREAD_GROUP_WIDTH and GFX_THREAD_GROUP_HEIGHT environment variables. The maximum number of threads allowed in a thread group on 4th generation Intel(R) Core(TM) processors is 64.
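
The data layout tip above can be illustrated with a minimal sketch in plain C. It deliberately leaves out the offload pragma and Cilk Plus syntax so that it builds with any C99 compiler, and it uses the standard `restrict` qualifier as a stand-in for `__restrict`. The `PointAoS` type and all function names are hypothetical, chosen only for this illustration. Whether a given compiler actually emits block accesses for the SOA loop depends on the compiler and the target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type, used only for illustration. */
typedef struct {
    float x, y, z, w;
} PointAoS;

/* AoS layout: consecutive x values are sizeof(PointAoS) bytes apart,
   so a vectorized version of this loop needs strided (gather) loads. */
float sum_x_aos(const PointAoS *restrict pts, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += pts[i].x;
    return sum;
}

/* SoA layout: the x values sit in their own contiguous array, so the
   same reduction reads memory with unit stride. */
float sum_x_soa(const float *restrict xs, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += xs[i];
    return sum;
}

/* One-time restructuring step: copy the x field out of the AoS data.
   restrict promises the compiler that dst and src do not overlap. */
void extract_x(float *restrict dst, const PointAoS *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i].x;
}
```

Both reductions compute the same value. The difference is only in the memory access pattern the compiler has to generate, which is what the SOA recommendation targets.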