Accurate performance analysis is vital to optimizing an application on any architecture. A previous article describes several metrics recommended for a basic analysis of your application on the Intel® Xeon Phi™ coprocessor. Because the Intel Xeon Phi coprocessor depends heavily on its wide vector units to deliver the bulk of its promised performance, vectorization intensity is a key metric for assessing an application's performance.
Vectorization intensity can be calculated as:
Vectorization Intensity = (VPU_ELEMENTS_ACTIVE)/(VPU_INSTRUCTIONS_EXECUTED)
where VPU_ELEMENTS_ACTIVE is the number of vector elements active while executing a VPU instruction, or equivalently, the number of vector operations (since each instruction performs multiple vector operations),
and VPU_INSTRUCTIONS_EXECUTED is the number of vector instructions executed by a hardware thread.
Ideally, for an application that executes only perfectly vectorized single precision or double precision operations, vectorization intensity should be 16 for single precision and 8 for double precision. This is because the vector unit in the coprocessor is 512 bits wide and can therefore process either 16 single precision elements or 8 double precision elements in one instruction. However, there are scenarios in which this metric does not accurately reflect the behavior of the application. This article discusses two of them.
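As a rough illustration of the formula and the ideal values above, the following Python sketch computes vectorization intensity from two counter totals and compares it with the per-precision ideal. The function names and the counter readings are hypothetical and chosen only for illustration; they are not measured data or part of any Intel tool.

```python
VECTOR_WIDTH_BITS = 512  # width of the coprocessor's vector unit, as stated in the text


def vectorization_intensity(vpu_elements_active, vpu_instructions_executed):
    """Ratio of active vector elements to executed vector instructions."""
    if vpu_instructions_executed == 0:
        raise ValueError("no vector instructions were executed")
    return vpu_elements_active / vpu_instructions_executed


def ideal_intensity(element_bits):
    """Number of elements of the given size that fit in one vector register."""
    return VECTOR_WIDTH_BITS // element_bits


# Hypothetical counter totals (assumed values, not real measurements).
elements_active = 1_200_000
instructions_executed = 100_000

measured = vectorization_intensity(elements_active, instructions_executed)
print("measured intensity:", measured)
print("ideal single precision:", ideal_intensity(32))
print("ideal double precision:", ideal_intensity(64))
```

A measured value above the ideal for the data type in use, as in this made-up reading, is a hint that the counters are not counting what one might expect.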
Scenario 1: Compiler “Magic” when using Streaming Stores
The Stream* benchmark is a prime example of this scenario. If you analyze the Stream benchmark (built using these instructions) with the general-exploration analysis in Intel® VTune™ Amplifier XE, you will notice that the vectorization intensity reported for this benchmark is much greater than the expected value of 8 (since this is a double precision benchmark).
If you look at the assembly generated by the compiler for this benchmark, you will notice a number of vmovnrngoaps instructions. vmovnrngoaps is a streaming store instruction on the Intel Xeon Phi coprocessor. Streaming store instructions boost performance for vector-aligned, unmasked stores by avoiding Read For Ownership (RFO) accesses and writing the contents of the entire cache line to memory. However, unknown to the developer, the compiler makes an interesting choice of instructions. Instead of using the double precision instruction vmovnrngoapd, the compiler uses the single precision instruction vmovnrngoaps. The compiler can get away with this substitution because the hardware ultimately stores a full cache line of data, so it makes no difference whether the cache line is interpreted as 16 single precision or 8 double precision elements. The use of single precision instructions such as vmovnrngoaps in place of double precision instructions skews the vectorization intensity to be greater than 8.
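The skew can be illustrated with a toy model in Python. It assumes a hypothetical copy-style loop body consisting of exactly one double precision vector load and one streaming store, and it assumes the store is credited with 16 active elements when it is emitted as the single precision vmovnrngoaps. Real loops contain other instructions, so the resulting number is illustrative only and is not a measurement of the Stream benchmark.

```python
DP_LANES = 8   # 512 bits / 64-bit double precision elements
SP_LANES = 16  # 512 bits / 32-bit single precision elements


def intensity(active_elements_per_instruction):
    """Vectorization intensity for a list of per-instruction active-element counts."""
    return sum(active_elements_per_instruction) / len(active_elements_per_instruction)


# Hypothetical loop body: one vector load followed by one streaming store.
expected = intensity([DP_LANES, DP_LANES])  # store emitted as vmovnrngoapd
observed = intensity([DP_LANES, SP_LANES])  # store emitted as vmovnrngoaps (assumed to count 16 lanes)

print("expected:", expected)  # 8.0
print("observed:", observed)  # 12.0
```

Under these assumptions, the same double precision work reports an intensity above 8 purely because of which store mnemonic the compiler chose.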
Scenario 2: Using Scatter/Gather instructions
The Intel Xeon Phi coprocessor features Scatter/Gather instructions that allow manipulation of irregular data patterns in memory. As explained in this article, due to the special behavior of the mask register during the execution of each scatter/gather instruction, each such instruction must be executed repeatedly until all the bits in the mask register are reset (zero). Hence, a typical use of a gather/scatter instruction looks as follows:
..L10:
    vgatherdps (%r8,%zmm0,4), %zmm6{%k1}
    jkzd ..L9, %k1
    vgatherdps (%r8,%zmm0,4), %zmm6{%k1}
    jknzd ..L10, %k1
..L9:
As is evident from the above example, every gather/scatter instruction is followed by a conditional branch instruction.
The performance monitoring unit in the coprocessor counts both the scatter/gather instruction and the conditional branch instruction jknzd/jkzd as vector instructions, because the branch uses a mask register as its condition. However, since the number of set bits in the mask register is what identifies the number of active lanes, masking instructions such as jknzd and jkzd are not counted by VPU_ELEMENTS_ACTIVE.
Consider the following snippet:
Instruction # | Instruction | VPU_INSTRUCTIONS_EXECUTED | VPU_ELEMENTS_ACTIVE |
n | … | x | y |
n+1 | vgatherdps(…) | x+1 | y+16 |
n+2 | jkzd … | x+1+1 | y+16+0 |
In the best case, the gather instruction is executed only once. VPU_INSTRUCTIONS_EXECUTED is then incremented by 1, and because all 16 elements are loaded by that one instruction, VPU_ELEMENTS_ACTIVE is incremented by 16. For jkzd, VPU_INSTRUCTIONS_EXECUTED is again incremented by 1, but because jkzd is a masking instruction, it adds nothing to VPU_ELEMENTS_ACTIVE.
If we now calculate the vectorization intensity for this example, we get
Vectorization Intensity = (16+0)/(1+1)=8.
For scatter/gather instructions, therefore, the best vectorization intensity that one can achieve is 8 for single precision and 4 for double precision. Applications that make use of scatter/gather instructions will consequently exhibit much lower vectorization intensities than expected.
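The arithmetic above can be generalized in a small Python sketch. It uses a simplified counting model: each execution of the gather is followed by exactly one jkzd/jknzd, both count toward VPU_INSTRUCTIONS_EXECUTED, and across all passes each lane is counted once in VPU_ELEMENTS_ACTIVE. The last point, and the function name gather_intensity, are assumptions made for illustration rather than documented counter behavior.

```python
def gather_intensity(lanes, passes=1):
    """Intensity of one gather sequence under the simplified counting model.

    lanes:  total elements gathered (16 for single precision, 8 for double)
    passes: number of times the gather executes before the mask is all zero
    """
    if passes < 1:
        raise ValueError("the gather must execute at least once")
    elements_active = lanes   # assumed: each lane is counted once in total
    instructions = 2 * passes  # each gather plus its following mask branch
    return elements_active / instructions


print(gather_intensity(16))            # best case, single precision: 8.0
print(gather_intensity(8))             # best case, double precision: 4.0
print(gather_intensity(16, passes=4))  # gather repeated four times: 2.0
```

Under this model the reported intensity drops further whenever the gather has to be repeated, even though the same number of elements is loaded.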
In summary, it is important to understand that metrics provide best-effort estimates for analyzing applications, and in some cases they can take incorrect or even nonsensical values. If a metric seems off, it is generally a good idea to investigate it further to verify its correctness.