Let's talk about some important specifics of how FLOPS are computed. They can significantly affect how you interpret FLOPS data, especially in the Roofline chart.
At the moment, FLOPS are computed using self-time (noninclusive). This means that if you have nested loops, the FLOPS and arithmetic intensity computed for an outer loop do not include operations that happen in the inner loop. We recommend that you keep this in mind when you use FLOPS and Roofline information for an outer loop.
This becomes trickier when you call functions inside your loop. Again, a noninclusive approach is used, so operations that happen inside such a function are not counted toward the loop's FLOPS and arithmetic intensity, and the loop's FLOPS are computed from the loop's self time. Roofline results for such a loop can lead to wrong conclusions and a wrong action plan.
Consider the following example where a modified matrix multiplication is used.
double compute(double a, double b)
{
    double factor = a/b;
    return (((((1+factor)*factor+1)*factor+1)*factor+1)*factor+1)*factor+1;
}

void multiply_d_noinline(int arrSize, double **aMatrix, double **bMatrix, double **product)
{
    for(int i=0;i<arrSize;i++) {
        for(int j=0;j<arrSize;j++) {
            double sum = 0;
            for(int k=0;k<arrSize;k++) {
                #pragma noinline
                sum += compute(aMatrix[i][k],bMatrix[k][j]);
            }
            product[i][j] = sum;
        }
    }
}

void multiply_d_inline(int arrSize, double **aMatrix, double **bMatrix, double **product)
{
    for(int i=0;i<arrSize;i++) {
        for(int j=0;j<arrSize;j++) {
            double sum = 0;
            #pragma novector
            for(int k=0;k<arrSize;k++) {
                sum += compute(aMatrix[i][k],bMatrix[k][j]);
            }
            product[i][j] = sum;
        }
    }
}
In the multiply_d_noinline function, most of the computation is performed in the compute routine, which is called from the innermost loop. As a result, the computational and memory operations performed inside compute are excluded from the FLOPS and arithmetic intensity of that loop. In the multiply_d_inline function, all computations are inlined into the loop, so they are included in the loop's FLOPS metric calculation.
Let's see what the roofline plot looks like.
Although both loops do similar compute work, their positions on the plot diverge considerably. Moreover, the roofline plot suggests that the non-inlined version of the loop is memory bound and needs better cache usage, whereas the inlined loop is compute bound and vectorization is essential for improving its performance. What is actually required is better vectorization for both loops.
To account for these specifics, we recommend enabling "Loops and Functions" filtering in the filter bar. An extra dot then appears on the chart, representing the FLOPS of the compute function. So, when you interpret roofline data for a loop with nested calls, take into account not only the loop itself but also all the nested calls.
The sample code used in this article can be downloaded from the following link.