With Intel® VTune™ Amplifier XE 2013 Update 12 and earlier, it was possible to profile OpenMP applications at the level of parallel regions, as described in the article "Profiling OpenMP* applications with Intel® VTune™ Amplifier XE" by Kirill Rogozhin.
Intel® VTune™ Amplifier XE 2013 Update 13 adds per-barrier information about load imbalance when profiling OpenMP applications. Using this feature requires Intel® Composer XE 2013 SP1 Update 1 or newer.
Consider code like this:

    #pragma omp parallel
    {
        // Code 1
        #pragma omp single
        { ... }                 // Implicit barrier (barrier 1)
        // Code 2
    }                           // Implicit join barrier (barrier 2)
    // Code 3
    #pragma omp parallel
    {
        // Code 4
        #pragma omp barrier     // Explicit barrier (barrier 3)
        // Code 5
        #pragma omp for
        for (...) {
            // Code 6
        }                       // Implicit for barrier (barrier 4)
        // Code 7
    }                           // Implicit join barrier (barrier 5)
Since parallel regions can have barriers inside them, per-region frame information is not enough to locate the imbalance. Instead, VTune Amplifier XE emits the following OpenMP frames, where each barrier ends one frame and starts the next:
    28: #pragma omp parallel          <- Frame 1 begin
        {
            // Code 1
    31:     #pragma omp single
            { ... }                   // Implicit barrier <- Frame 1 end (barrier 1)
            // Code 2                 <- Frame 2 begin
    34: }                             // Implicit join barrier <- Frame 2 end (barrier 2)
        // Code 3
    37: #pragma omp parallel          <- Frame 3 begin
        {
            // Code 4
    44:     #pragma omp barrier       <- Frame 3 end (barrier 3)
            // Code 5                 <- Frame 4 begin
    50:     #pragma omp for
            for (...) {
                // Code 6
    55:     }                         // Implicit for barrier <- Frame 4 end (barrier 4)
            // Code 7                 <- Frame 5 begin
    59: }                             // Implicit join barrier <- Frame 5 end (barrier 5)

To control the new feature, the OpenMP runtime introduces a new environment variable, KMP_FORKJOIN_FRAMES_MODE, which accepts values from 0 to 3.
KMP_FORKJOIN_FRAMES_MODE=0 disables per-barrier frames, so only the existing per-region frame functionality is available.
KMP_FORKJOIN_FRAMES_MODE=1 enables frames for all barriers, both explicit and implicit. In VTune Amplifier XE you will see something like this:
Note that the line number specifies the frame end point because the frame is associated with the barrier that ends it.
The corresponding Tasks and Frames view looks like this:
Setting KMP_FORKJOIN_FRAMES_MODE=2 provides more information about thread activity: how much time passes between the moment the first thread arrives at a barrier and the moment the last thread leaves it. In this mode VTune Amplifier XE will display barrier-imbalance frame domains like this:
Setting KMP_FORKJOIN_FRAMES_MODE=3 combines the per-barrier frames with the barrier imbalance information, so both the full frame timing and the imbalance portion are displayed:
Limitations
The information presented for programs that use OpenMP tasking may be hard to interpret, because threads that appear to be "waiting at a barrier" may actually be executing OpenMP tasks.
Conclusion
Per-barrier frame information gives a better understanding of the behavior of OpenMP applications that have implicit or explicit barriers inside parallel regions. It shows where load imbalance occurs, which makes it easier to improve performance.
Please note, however, that this new functionality has not been documented yet, so it may change slightly in future releases.