Tuning Tips for compute offload to Intel(R) Processor Graphics

The following tips can help you tune a kernel for better performance on Processor Graphics:

Offloaded loop nests must have enough iterations to occupy all hardware threads available on Processor Graphics. Using perfectly nested parallel _Cilk_for loops allows parallelization in every dimension of the parallel loop nest.
Pragmas and code restructuring can be used to get offloaded code vectorized.
Using the __restrict and __assume_aligned keywords may also help vectorization.
Using the pin clause of the offload pragma eliminates data copying to and from the GPU.
Scalar memory accesses are much less efficient than vector accesses. Using Cilk Plus array notation for memory accesses may help vectorize computation. A single memory access can handle up to 128 bytes. Gather/scatter operations on 4-byte elements are quite efficient, but they are slower on 2-byte elements. Gather/scatter operations may result from array sections with non-unit strides.
Restructuring code to use a structure-of-arrays (SOA) data layout usually helps the compiler generate faster block memory accesses instead of scatter/gather operations.
The compiler attempts to allocate all qualifying local variables in the register file. To qualify, a local variable must be small enough (less than 3K), and its address must not be passed to a function or assigned to a global pointer. Fully unrolling a for loop that indexes such a variable turns the accesses into a series of faster direct register accesses. On the other hand, excessive unrolling may cause code bloat, which can hit the current kernel size limit of around 250KB.
When the register space overflows, the JIT compiler used at run time may still spill some register variables to memory, which severely degrades performance. Cache only ‘hot’ data in local arrays.
Try changing the maximum number of threads by setting the GFX_MAX_THREAD_COUNT environment variable.
Try different combinations of width and height for the number of threads within a thread group by setting the GFX_THREAD_GROUP_WIDTH and GFX_THREAD_GROUP_HEIGHT environment variables. The maximum number of threads allowed in a thread group on 4th generation Intel(R) Core(TM) processors is 64.
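
The tip about __restrict and __assume_aligned can be illustrated with a small sketch in portable C. It uses the standard C99 `restrict` qualifier in place of the compiler-specific `__restrict`, and checks the alignment with an assertion where the Intel compiler would be given an `__assume_aligned` hint. The function name `scale_add` and the 32-byte alignment are assumptions chosen for illustration, not values from the compiler documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += factor * src[i].
 * `restrict` promises the compiler that dst and src never overlap,
 * which removes a common obstacle to vectorization. */
void scale_add(float *restrict dst, const float *restrict src,
               float factor, size_t n)
{
    /* Assumed 32-byte alignment, verified here at run time. With the
     * Intel compiler the same promise would be stated as a hint via
     * __assume_aligned instead of being checked. */
    assert((uintptr_t)dst % 32 == 0 && (uintptr_t)src % 32 == 0);

    for (size_t i = 0; i < n; ++i)
        dst[i] += factor * src[i];
}
```

Callers must pass non-overlapping, suitably aligned buffers; violating the `restrict` promise is undefined behavior.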
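
The SOA tip is easier to see with the two layouts side by side. The following is a minimal sketch in plain C; the type names, the function names, and the array size are hypothetical. In the array-of-structures version, consecutive x values are separated by the other fields, so a vectorized loop needs strided (gather) loads. In the structure-of-arrays version, the x values are contiguous and can be read as blocks.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_POINTS = 8 };  /* illustrative size */

/* Array of structures: x values are separated by y and z (strided access). */
struct PointAoS { float x, y, z; };

/* Structure of arrays: each field is contiguous (unit-stride access). */
struct PointsSoA {
    float x[NUM_POINTS];
    float y[NUM_POINTS];
    float z[NUM_POINTS];
};

/* Sums the x field of the first n points in the AOS layout. */
float sum_x_aos(const struct PointAoS *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].x;
    return sum;
}

/* Sums the x field of the first n points in the SOA layout. */
float sum_x_soa(const struct PointsSoA *p, size_t n)
{
    assert(n <= NUM_POINTS);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p->x[i];
    return sum;
}
```

Both functions compute the same result; only the memory access pattern differs.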
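
The tips on register allocation and unrolling can be sketched as follows. This is an illustration in plain C under stated assumptions, not output from or a requirement of the offload compiler: the 4-tap filter, the name `fir4`, and the constant `TAPS` are hypothetical. The coefficients are the 'hot' data, cached in a small local array whose address never leaves the function, and the inner loop is unrolled by hand so that every access uses a constant index.

```c
#include <assert.h>
#include <stddef.h>

enum { TAPS = 4 };  /* small, compile-time constant trip count */

/* Hypothetical 4-tap filter: out[i] = sum over k of coeff[k] * in[i + k].
 * `in` must hold at least n_out + TAPS - 1 elements. */
void fir4(float *out, const float *in, const float *coeff, size_t n_out)
{
    assert(out != NULL && in != NULL && coeff != NULL);

    /* Small local copy of the hot coefficients. Its address is not passed
     * to another function or stored in a global pointer. */
    float c[TAPS];
    for (int k = 0; k < TAPS; ++k)
        c[k] = coeff[k];

    for (size_t i = 0; i < n_out; ++i) {
        /* Fully unrolled over the taps: constant indices into c. */
        out[i] = c[0] * in[i]     + c[1] * in[i + 1]
               + c[2] * in[i + 2] + c[3] * in[i + 3];
    }
}
```

Keeping the unrolled body this small also avoids the code bloat mentioned above; unrolling a loop with a large trip count in the same way would not be advisable.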
