Channel: Intel Developer Zone Articles

Understanding the corner cases of Vectorization Intensity


Correct performance analysis of an application is vital to optimizing its performance on any architecture. A previous article describes several metrics recommended for a basic analysis of your application on the Intel® Xeon Phi™ coprocessor. Since the Intel Xeon Phi coprocessor depends heavily on its wide vector units to deliver the bulk of its promised performance, vectorization intensity becomes a key metric for assessing the performance of an application.

Vectorization intensity can be calculated as:

Vectorization Intensity = (VPU_ELEMENTS_ACTIVE)/(VPU_INSTRUCTIONS_EXECUTED)

where VPU_ELEMENTS_ACTIVE is the number of vector elements active while executing a VPU instruction, or equivalently, the number of vector operations (since each instruction performs multiple vector operations),

and VPU_INSTRUCTIONS_EXECUTED is the number of vector instructions executed by a hardware thread.

Ideally, for an application that executes only perfectly vectorized single precision or double precision operations, vectorization intensity should be 16 for single precision and 8 for double precision. This is because the vector unit in the coprocessor is 512 bits wide and hence can process either 16 single precision elements or 8 double precision elements in one instruction. However, there are some scenarios where this metric does not correctly reflect the behavior of the application. This article discusses two such scenarios.

Scenario 1: Compiler “Magic” When Using Streaming Stores

The STREAM* benchmark is a prime example of this scenario. If you were to analyze the STREAM benchmark (built using these instructions) using the general-exploration analysis in Intel® VTune™ Amplifier XE, you would notice that the vectorization intensity for this benchmark is much greater than the expected value of 8 (since this is a double precision benchmark).

If you were to look at the assembly generated by the compiler for this benchmark, you would notice a number of vmovnrngoaps instructions. Vmovnrngoaps is a streaming store instruction on the Intel Xeon Phi coprocessor. Streaming store instructions boost performance in the case of vector-aligned unmasked stores by avoiding Read For Ownership (RFO) accesses and writing the contents of an entire cache line to memory. However, unknown to the developer, the compiler makes an interesting choice of instructions. Instead of using the double precision instruction vmovnrngoapd, the compiler uses the single precision instruction vmovnrngoaps. The compiler can get away with this substitution because the hardware ultimately needs to store a full cache line of data, and it makes no difference whether that cache line is interpreted as 16 single precision or 8 double precision elements. The use of single precision instructions such as vmovnrngoaps in place of their double precision counterparts skews the vectorization intensity above 8.

Scenario 2: Using Scatter/Gather instructions

The Intel Xeon Phi coprocessor features scatter/gather instructions that allow manipulation of irregular data patterns in memory. As explained in this article, due to the special behavior of the mask register during the execution of each scatter/gather instruction, each such instruction must be executed repeatedly until all the bits in the mask register are reset (zero). Hence, a typical use of a gather instruction looks as follows:

..L10:
        vgatherdps (%r8,%zmm0,4), %zmm6{%k1}
        jkzd      ..L9, %k1
        vgatherdps (%r8,%zmm0,4), %zmm6{%k1}
        jknzd     ..L10, %k1
..L9:

As evident from the above example, every gather/scatter instruction is followed by a conditional branching instruction.

The performance monitoring unit in the coprocessor counts both the scatter/gather instruction and the conditional branch instruction jknzd/jkzd (since a mask register is used as the condition) as vector instructions. However, since the number of set bits in the mask register identifies the number of active lanes, mask instructions such as jknzd and jkzd do not contribute to VPU_ELEMENTS_ACTIVE.

Consider the following instruction sequence and the corresponding counter values:

Instruction #    Instruction       VPU_INSTRUCTIONS_EXECUTED    VPU_ELEMENTS_ACTIVE
n                                  x                            y
n+1              vgatherdps(…)     x+1                          y+16
n+2              jkzd …            x+1+1                        y+16+0

In the best case, the gather instruction is executed only once. Hence, VPU_INSTRUCTIONS_EXECUTED is incremented by 1, but since all 16 elements are loaded by that one instruction, VPU_ELEMENTS_ACTIVE is incremented by 16. In the case of jkzd, VPU_INSTRUCTIONS_EXECUTED is incremented by 1, but since jkzd is a mask instruction, it does not increment VPU_ELEMENTS_ACTIVE.

Now, if we were to calculate the vectorization intensity for this example, we would get

Vectorization Intensity = (16+0)/(1+1)=8.

Hence, for scatter/gather instructions, the best vectorization intensity one can achieve is 8 for single precision and 4 for double precision. As a result, applications that make heavy use of scatter/gather instructions will exhibit much lower vectorization intensities than expected.

In summary, it is important to understand that metrics provide best-effort estimates for analyzing applications, and in some cases these metrics can take on incorrect or even nonsensical values. If a metric seems off, it is generally a good idea to investigate further before drawing conclusions.

