Cause:
The vectorizer cannot safely use aligned loads or stores for this data access, either because the data are not aligned to an n-byte boundary in memory, or because the compiler does not know the alignment. The compiler must use unaligned memory accesses, which may be less efficient. The value of n depends on the targeted instruction set and corresponds to the width of the vector instructions: 16 for Intel® SSE, 32 for Intel® AVX and 64 for Intel® AVX-512 instructions.
Example:
subroutine d_15134(x,y,z,index,m1,m2,mm)
  implicit none
  ! mm is not used in this routine; it is declared (as integer, an assumption)
  ! only so that the routine compiles under implicit none
  integer, intent(in ) :: m1, m2, mm
  real,    dimension(m1,m2), intent(in ) :: x,y
  real,    dimension(m1,m2), intent(out) :: z
  integer, dimension(m2),    intent(in ) :: index
  integer :: i, j

  !!dir$ assume_aligned x:32, y:32, z:32
  !!dir$ assume (mod(m1,8).eq.0)
  do j=1,m2
    do i=1,m1-1
      z(i,j) = x(i,index(j)) + x(i,j)*y(i+1,j)
    enddo
  enddo
end subroutine d_15134
With the Intel Compiler version 14.0, you must compile with -vec-report6 to get the alignment diagnostics. Without the aid of directives, the compiler does not know the alignment of the arrays x, y and z, and does not know the extent of the leading dimension. It therefore assumes that any memory access could be unaligned.
There are 3 main issues:
- The compiler does not know the absolute alignment, for example of z in the above code sample. It can often correct for this at run-time by peeling off some loop iterations, so that the loop kernel starts at a point where accesses to the first array are aligned.
- The compiler does not know the alignment of other arrays relative to the first, for example of x relative to z. If the compiler peels to align accesses to z, that may not align accesses to x. This can sometimes be worked around at run-time by generating two versions of the loop kernel, one where x and z have the same alignment, and one where they do not. But for larger numbers of arrays, the compiler cannot generate kernel versions corresponding to all the possible combinations of alignment.
- Even if array accesses in the inner loop are aligned for the first iteration of the outer loop over j, they will not be aligned for subsequent values of j (i.e., other columns of the matrix will not be aligned) unless the size of the first dimension (the column length) of x, y and z is a multiple of the number of elements that fit in a vector of width n bytes. For single precision, m1 needs to be a multiple of 4 for Intel SSE, 8 for Intel AVX or 16 for Intel AVX-512 instructions.
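The arithmetic behind the first and third issues can be sketched in a few lines of Fortran. This is only an illustration, not what the compiler actually generates: the column length m1 = 100 and the byte offset of 8 are hypothetical values chosen for the example, and the offset is assumed to be a multiple of the element size.

```fortran
program alignment_arithmetic
  implicit none
  ! Vector widths n in bytes for Intel SSE, Intel AVX and Intel AVX-512
  integer, parameter :: widths(3) = [16, 32, 64]
  character(len=7), parameter :: names(3) = ['SSE    ', 'AVX    ', 'AVX-512']
  integer, parameter :: m1 = 100      ! hypothetical column length
  integer, parameter :: offset = 8    ! hypothetical distance in bytes of the
                                      ! first element past an n-byte boundary
  real    :: probe
  integer :: elem_bytes, n, lanes, peel, k

  ! Size of a default real in bytes (4 for single precision)
  elem_bytes = storage_size(probe) / 8

  do k = 1, size(widths)
    n     = widths(k)
    lanes = n / elem_bytes            ! elements per vector
    ! Iterations to peel so that the next access starts on an n-byte boundary
    peel  = mod(n - mod(offset, n), n) / elem_bytes
    print '(a7,a,i0,a,i0,a,l1)', names(k),            &
          ': elements per vector = ', lanes,          &
          ', iterations to peel = ', peel,            &
          ', later columns stay aligned = ', mod(m1, lanes) == 0
  end do
end program alignment_arithmetic
```

With m1 = 100, later columns stay aligned only for the 16-byte case, because 100 is a multiple of 4 but not of 8 or 16.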
Resolution:
If the arrays x, y and z are aligned in the routine in which they are first declared, e.g. by using a switch such as -align array32byte or an ATTRIBUTES ALIGN directive, then other directives can be used to assert that alignment to the compiler in routines where the arrays are used. In the above example, the first directive asserts that the arrays x, y and z are always aligned on (at least) 32 byte boundaries in memory. The second directive asserts that m1, the extent of the first array dimension, is a multiple of 8. If these directives are uncommented, the inner loop is vectorized using mostly aligned memory accesses.
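A calling program that satisfies both assertions might look like the following sketch. The program name, the array sizes, the contents of idx and the value passed for mm are all assumptions made for illustration; mm is not used by the routine as shown, so a placeholder integer is passed. The program must be compiled and linked together with the example routine.

```fortran
program call_d_15134
  implicit none
  ! m1 = 64 is a multiple of 8, as the assume directive asserts
  integer, parameter :: m1 = 64, m2 = 16
  real    :: x(m1,m2), y(m1,m2), z(m1,m2)
  integer :: idx(m2), j
  ! Align the arrays on 32-byte boundaries where they are declared
  !dir$ attributes align : 32 :: x, y, z

  call random_number(x)
  call random_number(y)
  z = 0.0

  ! Any values in the range 1..m2 are valid column indices
  do j = 1, m2
    idx(j) = m2 + 1 - j
  end do

  ! The last argument is a placeholder for the unused mm
  call d_15134(x, y, z, idx, m1, m2, m1*m2)
  print *, z(1,1)
end program call_d_15134
```

Because each column holds 64 single-precision elements (256 bytes), every column of x, y and z starts on a 32-byte boundary once the first element does.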
Because the accesses to array y are offset by 1 element compared to accesses to x and z, the access to y must remain unaligned when the accesses to x and z are aligned. If that had not been the case, the two directives above could have been replaced by a single directive, !DIR$ VECTOR ALIGNED, which would assert that all memory accesses in the loop were aligned (here, to a 32 byte boundary, since we are compiling with -xavx). Care must be taken when using such alignment directives. Invalid assertions of alignment may lead to poor performance, incorrect results or a run-time error, depending on the context. However, careful alignment of data and ensuring the compiler knows the alignment can lead to improved performance.
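For comparison, here is a sketch of a variant in which y is accessed without the offset, so a single VECTOR ALIGNED directive can cover the inner loop. The routine name d_15134_aligned is hypothetical, and the unused argument mm is dropped. The assertion is valid only if the caller guarantees that x, y and z are aligned to the vector width and that m1 is a multiple of the number of elements per vector (8 for single precision with Intel AVX).

```fortran
subroutine d_15134_aligned(x,y,z,index,m1,m2)
  implicit none
  integer, intent(in ) :: m1, m2
  real,    dimension(m1,m2), intent(in ) :: x,y
  real,    dimension(m1,m2), intent(out) :: z
  integer, dimension(m2),    intent(in ) :: index
  integer :: i, j

  do j=1,m2
    ! Asserts that every memory access in the following loop is aligned
    !dir$ vector aligned
    do i=1,m1
      z(i,j) = x(i,index(j)) + x(i,j)*y(i,j)
    enddo
  enddo
end subroutine d_15134_aligned
```

If either condition does not hold at run-time, the assertion is invalid and the consequences described above apply.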