
How Intel® AVX2 Improves Performance on Server Applications


The latest Intel® Xeon® processor E5 v3 family includes a feature called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve application performance in areas such as high performance computing, databases, and video processing. Here we explain the context and provide an example of how using Intel AVX2 improved performance for a well-known benchmark.

For existing vectorized code that uses floating point operations, you can gain a potential performance boost when running on newer platforms such as the Intel® Xeon® processor E5 v3 family by doing one of the following:

  1. Recompile your code using the Intel® compiler with the proper AVX2 switch to convert existing SSE code (a minimal compile sketch follows this list). See Intel® Compiler Options for Intel® SSE and Intel® AVX generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and processor-specific optimizations for more details.
  2. Modify your code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use AVX2 where supported.
  3. Use the AVX2 intrinsic instructions. Developers working in a high-level language (such as C or C++) can use Intel® intrinsics and recompile the code. See the Intel® Intrinsics Guide and the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
  4. Code in assembly instructions directly. Developers working in assembly can use the AVX2 instructions equivalent to their existing SSE code. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
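
As a minimal sketch of option 1, the loop below is ordinary C that a vectorizing compiler can turn into AVX2 code when given the appropriate switch. The file name and function are purely illustrative, and the switches in the comment are the usual Intel® compiler options; verify them against the compiler documentation referenced above.

    /* daxpy.c -- illustrative only; not taken from LINPACK.
     * Typical compile lines (check your compiler version's documentation):
     *   Linux:   icc  -O2 -xCORE-AVX2  -c daxpy.c
     *   Windows: icl  /O2 /QxCORE-AVX2 /c daxpy.c
     * With the AVX2 switch, the compiler is free to vectorize this loop with
     * 256-bit VMULPD/VADDPD or fused VFMADD instructions.
     */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* candidate for packed multiply-add */
    }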

In this article, I will share a simple experiment using the Intel® Optimized LINPACK benchmark to demonstrate the performance gain from Intel AVX2 for three different workload sizes (30K, 75K, and 100K) running on Windows* and Linux* operating systems. I will also share the list of AVX2 instructions that were executed and the equivalent SSE instructions for developers who are interested in direct coding.

I used the following platform for the experiment:

CPU & Chipset

Model/Speed/Cache: E5-2699 v3 QGN1, 2.3GHz, 45MB Cache, 145W TDP, C-1 Step

  • # of cores per chip: 18
  • # of sockets: 2
  • Chipset: Intel C610 "Wellsburg" series chipset, QS (B-1 step)
  • System bus: 9.6GT/s QPI

Platform

Brand/model: Intel EPSD Wildcat Pass

  • Chassis: Intel 2U Rackable
  • Baseboard: codenamed Wildcat Pass, 3 SPC DDR4
  • Board revision: Qual / PBA H30334-200
  • BIOS: SE5C610.86B.01.01.556.061320140714 BMC 0.20.6013 ME 03.00.05.402 SDR 0.10
  • DIMM slots: 24
  • Power supply: 1x 1100W Removable S-1100ADU00-201 (Rev S3)
  • CD ROM: TEAC Slim
  • Network (NIC): Onboard 10GbE

Memory

Memory Size: 128GB (8x16GB) DDR4 2133P
Brand/model: Samsung M393A2G40DB0-CPB
DIMM info: 16GB 2Rx4 PC4-2133P-RA0-10-DC0

Mass storage

Brand & model: Intel SSD S3500 Series (SSDSC2BB240G401)
Number/size/RPM/Cache: 1ea - 240GB
Plus Intel SSD P3700 Series (SSDPEDMD400G4)

Operating system

Microsoft* Windows Server 2012 R2 / SLES 11 SP3 Linux*

Procedure for running LINPACK:

  1. Download and install the following:
    1. Intel® Math Kernel Library – LINPACK Download
      http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
    2. Intel® Math Kernel Library (Intel® MKL)
      http://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
  2. Create input files for 30K, 75K, 100K from the “...\linpack” directory
  3. For optimal performance, make the following Operating System and BIOS setting changes before running LINPACK:
    1. Turn off hyper-threading in the BIOS.
    2. For Windows, set “MKL_CBWR=AVX2” on the command line and update the runme_xeon64.bat file to use the input files you previously created. For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script to use the input files you created.
    3. The results will be reported in Gflops, similar to Table 2.
  4. For Intel AVX runs, set “MKL_CBWR=AVX” and repeat the steps above.
  5. For Intel SSE runs, set “MKL_CBWR=SSE4_2” and repeat the steps above (see the command sketch after this procedure).
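
For illustration only, a run on this system looks roughly like the following. The paths are relative to the directory where the benchmark package was unpacked, and the scripts are assumed to have already been edited to point at the input files created in step 2.

    # Linux (bash) -- repeat with MKL_CBWR=AVX and MKL_CBWR=SSE4_2 for the comparison runs
    export MKL_CBWR=AVX2
    ./runme_xeon64

    :: Windows (cmd) -- same idea, repeated for each MKL_CBWR value
    set MKL_CBWR=AVX2
    runme_xeon64.bat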

What are the equivalent instructions for Intel AVX2, AVX, and SSE that were executed?

Table 1 lists equivalent instructions for Intel AVX2, Intel AVX, and Intel SSE (SSE/SSE2/SSE3/SSE4), which is a useful reference if you are thinking of moving your existing code to Intel AVX2.

| Intel AVX2 Instructions from the LINPACK Runs | Equivalent Intel AVX Instructions | Equivalent Intel SSE Instructions (SSE/SSE2/SSE3/SSE4) | Definitions |
|---|---|---|---|
| VADDPD | VADDPD | ADDPD | Add Packed Double-Precision Floating-Point Values |
| VADDSD | VADDSD | N/A | Add Scalar Double-Precision Floating-Point Values |
| VBROADCASTSD | VBROADCASTSD | N/A | Copy a 32-bit, 64-bit, or 128-bit memory operand to all elements of an XMM or YMM vector register. |
| VCMPPD | VCMPPD | N/A | Compare packed double-precision floating-point values |
| VCOMISD | VCOMISD | N/A | Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register |
| VDIVSD | VDIVSD | DIVSD | Divide low double-precision floating-point value in xmm2 by low double-precision floating-point value in xmm3/m64 |
| VEXTRACTF128 | VEXTRACTF128 | N/A | Extract 128 bits of float data from ymm2 and store results in xmm1/mem. |
| VEXTRACTI128 | N/A | N/A | Extract 128 bits of integer data from ymm2 and store results in xmm1/mem. |
| VFMADD213PD | N/A | N/A | Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0. |
| VFMADD213SD | N/A | N/A | Multiply scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0. |
| VFMADD231PD | N/A | N/A | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0, and put result in xmm0. |
| VFMADD231SD | N/A | N/A | Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0, and put result in xmm0. |
| VFNMADD213PD | N/A | N/A | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0. Put the result in xmm0. |
| VFNMADD213SD | N/A | N/A | Multiply scalar double-precision floating-point value from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0. Put the result in xmm0. |
| VFNMADD231PD | N/A | N/A | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and add to ymm0. Put the result in ymm0. |
| VINSERTF128 | VINSERTF128 | N/A | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged. |
| VMAXPD | N/A | N/A | Determines the maximum of float64 vectors. The corresponding Intel® AVX instruction is VMAXPD. |
| VMAXSD | N/A | N/A | Determines the maximum of scalar double-precision (float64) values. The corresponding Intel® AVX instruction is VMAXSD. |
| VMOVAPD | VMOVAPD | MOVAPD | Move Aligned Packed Double-Precision Floating-Point Values |
| VMOVAPS | VMOVAPS | MOVAPS | Move Aligned Packed Single-Precision Floating-Point Values |
| VMOVD | VMOVD | N/A | Move Doubleword |
| VMOVDQU | VMOVDQU | MOVDQU | Move Unaligned Double Quadword |
| VMOVHPD | VMOVHPD | MOVHPD | Move High Packed Double-Precision Floating-Point Value |
| VMOVQ | VMOVQ | N/A | Move Quadword |
| VMOVSD | VMOVSD | MOVSD | Move or Merge Scalar Double-Precision Floating-Point Value |
| VMOVUPD | VMOVUPD | MOVUPD | Move Unaligned Packed Double-Precision Floating-Point Values |
| VMOVUPS | VMOVUPS | N/A | Move Unaligned Packed Single-Precision Floating-Point Values |
| VMULPD | VMULPD | MULPD | Multiply Packed Double-Precision Floating-Point Values |
| VMULSD | VMULSD | N/A | Multiply Scalar Double-Precision Floating-Point Values |
| VPADDQ | N/A | N/A | Add Packed Quadword Integers |
| VPBLENDVB | N/A | N/A | Conditionally blends byte elements of the source vectors depending on bits in a mask vector |
| VPBROADCASTQ | N/A | N/A | Take qwords from the source operand and broadcast to all elements of the result vector |
| VPCMPEQD | N/A | N/A | Compares packed doublewords of two source vectors for equality |
| VPCMPGTQ | N/A | N/A | Compares packed quadwords of two source vectors for greater than |
| VPERM2F128 | VPERM2F128 | N/A | Permute 128-bit floating-point fields from the two source operands |
| VPSHUFD | VPSHUFD | N/A | Permutes 32-bit blocks of an int32 vector |
| VPXOR | VPXOR | PXOR | Logical Exclusive OR |
| VUCOMISD | VUCOMISD | UCOMISD | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS |
| VUNPCKHPD | VUNPCKHPD | UNPCKHPD | Unpack and Interleave High Packed Double-Precision Floating-Point Values |
| VUNPCKLPD | VUNPCKLPD | UNPCKLPD | Unpack and Interleave Low Packed Double-Precision Floating-Point Values |
| VXORPD | VXORPD | XORPD | Bitwise Logical XOR for Double-Precision Floating-Point Values |
| VXORPS | VXORPS | N/A | Performs bitwise logical XOR operation on float32 vectors |
| VZEROUPPER | VZEROUPPER | N/A | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use. |

Table 1 – Intel AVX2, AVX, and Intel SSE Equivalent Instructions

The list in Table 1 is just a subset; the full list can be obtained from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Intel AVX2 and Intel AVX complement each other, and the instructions are shared to provide the necessary functionality.
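
To make the mapping in Table 1 concrete, here is a hypothetical C sketch (not taken from LINPACK) of the same multiply-add loop written with SSE2 intrinsics, which compile to MULPD/ADDPD, and with AVX2/FMA intrinsics, where _mm256_fmadd_pd compiles to one of the VFMADD forms listed above (the exact 132/213/231 form is the compiler's choice). The function names and the omitted remainder handling are illustrative.

    #include <immintrin.h>

    /* y[i] += a * x[i] with SSE2: 2 doubles per iteration (MULPD + ADDPD).
       Remainder elements (n not a multiple of 2 or 4) are omitted for brevity. */
    void daxpy_sse2(int n, double a, const double *x, double *y)
    {
        __m128d va = _mm_set1_pd(a);
        for (int i = 0; i + 2 <= n; i += 2) {
            __m128d prod = _mm_mul_pd(va, _mm_loadu_pd(&x[i]));
            _mm_storeu_pd(&y[i], _mm_add_pd(prod, _mm_loadu_pd(&y[i])));
        }
    }

    /* Same loop with AVX2/FMA: 4 doubles per iteration, one fused multiply-add. */
    void daxpy_avx2(int n, double a, const double *x, double *y)
    {
        __m256d va = _mm256_set1_pd(a);
        for (int i = 0; i + 4 <= n; i += 4) {
            __m256d vy = _mm256_loadu_pd(&y[i]);
            vy = _mm256_fmadd_pd(va, _mm256_loadu_pd(&x[i]), vy);
            _mm256_storeu_pd(&y[i], vy);
        }
    }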

What is the performance gain from running the LINPACK benchmark with Intel AVX2 vs. Intel SSE enabled and Intel AVX2 vs. Intel AVX on the Intel Xeon E5-2699 v3 processor-based server?

Table 2 shows the results from the three different workloads running on Windows* and Linux*. In the “Ratio Intel AVX2 vs. Intel SSE” column, the numbers show that the LINPACK benchmark produces ~2.2x-2.8x better performance when running with the combination of an Intel AVX2 optimized LINPACK and an Intel AVX2 capable processor. For the Intel AVX2 vs. Intel AVX column, the numbers show that the LINPACK benchmark produces ~1.3x-1.6x better performance. This is just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.
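
The ratio columns are simply the Intel AVX2 result divided by the Intel SSE or Intel AVX result; for example, for the 30K workload on Windows, 735.59 / 331.75 ≈ 2.2 and 735.59 / 562.68 ≈ 1.3.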

| Windows* | Intel AVX2 (Gflops) | Intel AVX (Gflops) | Intel SSE4 (Gflops) | Ratio: Intel AVX2 vs. Intel SSE | Ratio: Intel AVX2 vs. Intel AVX |
|---|---|---|---|---|---|
| LINPACK 30K v11.1.3 | 735.59 | 562.68 | 331.75 | 2.2 | 1.3 |
| LINPACK 75K v11.1.3 | 952.93 | 589.18 | 347.99 | 2.7 | 1.6 |
| LINPACK 100K v11.1.3 | 959.90 | 597.66 | 350.51 | 2.7 | 1.6 |

| Linux* | Intel AVX2 (Gflops) | Intel AVX (Gflops) | Intel SSE4 (Gflops) | Ratio: Intel AVX2 vs. Intel SSE | Ratio: Intel AVX2 vs. Intel AVX |
|---|---|---|---|---|---|
| LINPACK 30K v11.1.3 | 822.35 | 574.78 | 335.41 | 2.3 | 1.4 |
| LINPACK 75K v11.1.3 | 964.23 | 610.63 | 346.73 | 2.8 | 1.6 |
| LINPACK 100K v11.1.3 | 985.31 | 611.71 | 353.34 | 2.8 | 1.6 |

Table 2 – Results and Performance Gain from the LINPACK benchmark running on a two-socket Intel® Xeon® E5-2699 v3 server

With the new AVX2 instructions and 256-bit registers in the Intel® Xeon® processor E5 v3 family, LINPACK achieved over 2x the performance of the SSE runs and over 1.3x the performance of the AVX runs.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: Intel® Xeon® processor E5-2699 v3 @ 2.30GHz, 45MB L3 cache, 18 core pre-production system. Intel SSD S3500 Series (SSDSC2BB240G401) + Intel® SSD DC P3700 Series @ 400GB, 128GB memory (8x16GB DDR4 -2133Mhz), BIOS by Intel Corporation Version: SE5C610.86B.01.01.556.061320140714 BMC 0.20.6013 ME 03.00.05.402 SDR 0.10, Power supply: 1x 1100W Removable S-1100ADU00-201, running Microsoft* Windows Server 2012 R2 / SLES 11 SP3 Linux*

For more information go to http://www.intel.com/performance

Conclusion

From our LINPACK experiment, we see compelling performance benefits when moving to an AVX2-enabled Intel® Xeon® processor. In this specific case, we saw a performance increase of ~2.2x-2.8x for AVX2 vs. SSE and ~1.3x-1.6x for AVX2 vs. AVX in our test environment. This is a strong case for developers who have SSE-enabled code and are weighing the benefit of moving to a newer Intel® Xeon® processor-based system with AVX2. The reference materials below can help developers learn how to migrate existing SSE code to Intel AVX2 code.

References

 

NOTICES

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

