Contents
1. Executive Summary
2. Introduction
3. Intel® Xeon® Processor E7-8800/4800 v3 Product Family Enhancements
3.1 Intel® Advanced Vector Extensions 2 (Intel® AVX2)
3.2 Haswell New Instructions (HNI)
3.3 Intel® Transactional Synchronization Extensions (Intel® TSX)
3.4 Support for DDR4 memory
3.5 Power Improvements
3.6 New RAS features
4. Brickland platform improvements
4.1 Virtualization features
4.2 New Security features
4.3 Intel® Node Manager 3.0
5. Conclusion
About the Author
1. Executive Summary
The Intel® Xeon® processor E7-8800/4800 v3 product family, formerly codenamed “Haswell-EX”, is a 4-socket platform based on Intel’s most recent microarchitecture, the new “tock” of Intel’s tick-tock cadence, built on 22 nm process technology. The new processors bring additional capabilities for business intelligence, database, and virtualization applications. Platforms based on the Intel Xeon processor E7-8800/4800 v3 product family yield up to a 40% average improvement in performance[1] compared to the previous generation, “Ivy Bridge-EX.”
The latest generation processor has many new hardware and software features. On the hardware side, additional cores and memory bandwidth, DDR4 memory support, power enhancements, virtualization enhancements, and security enhancements such as the System Management Mode external call trap can improve application performance and security significantly without any code changes. On the software side it adds Haswell New Instructions (HNI), Intel® Transactional Synchronization Extensions (Intel® TSX), and Intel® Advanced Vector Extensions 2 (Intel® AVX2); developers must enable these features in their applications to benefit from them. Haswell-EX also brings additional reliability, availability, and serviceability (RAS) capabilities, such as address-based mirroring, which gives granular control over critical memory regions and improves uptime.
2. Introduction
The Intel Xeon processor E7-8800/4800 v3 product family is based on the Haswell microarchitecture and has several enhancements over the Ivy Bridge EX microarchitecture. The platform supporting the Intel Xeon processor E7-8800/4800 v3 product family is based on the Intel C602J Chipset (codenamed “Brickland”). This paper discusses the new features available in the latest product family compared to the previous one. Each section includes information about what developers need to do to take advantage of the new features to improve application performance, security, and reliability.
3. Intel® Xeon® Processor E7-8800/4800 v3 Product Family Enhancements
Figure 1 shows an overview of the Intel Xeon processor E7-8800/4800 v3 product family microarchitecture. Processors in the family have up to 18 cores (compared to 15 in the predecessor), adding computing power, along with a larger, faster last-level cache and more memory bandwidth.
[1] Up to 40% average performance improvement claim based on the geometric mean of 12 key benchmark workloads comparing 4-socket servers using Intel® Xeon® processor E7-8890 v3 to similarly configured Intel® Xeon® processor E7-4890 v2. Source: Internal Technical Reports. See http://www.intel.com/performance/datacenter for more details.
Figure 1: Intel® Xeon® processor E7-8800/4800 v3 product family overview
The Intel Xeon processor E7-8800/4800 v3 product family includes the following new features:
- Intel® Advanced Vector Extensions 2 (Intel® AVX2)
- Haswell New Instructions (HNI)
- Intel® Transactional Synchronization Extensions (Intel® TSX)
- Support for DDR4 memory
- Power management feature improvements
- New RAS features
Table 1 compares the latest and previous generations of the product family.
Table 1. Feature Comparison of the Intel® Xeon® processor E7-8800/4800 v3 product family to the Intel® Xeon® processor E7-4800 v2 product family
Features | Intel® Xeon® processor E7-4800 v2 product family | Intel® Xeon® processor E7-8800/4800 v3 product family
---|---|---
Socket | R1 | R1
Cores | Up to 15 | Up to 18
Process technology | 22 nm | 22 nm
TDP | 155 W max. | 165 W max. (includes integrated voltage regulator)
Intel® QuickPath Interconnect (Intel® QPI) ports/speed | 3x Intel QPI 1.1, 8.0 GT/s max. | 3x Intel QPI 1.1, 9.6 GT/s max.
Core addressability | 46-bit physical / 48-bit virtual | 46-bit physical / 48-bit virtual
Last-level cache size | Up to 37.5 MB | Up to 45 MB
Memory DDR4 speeds | N/A | Perf mode: 1333, 1600 MT/s; Lockstep mode: 1333, 1600, 1866 MT/s
Memory DDR3 speeds | Perf mode: 1066, 1333 MT/s; Lockstep mode: 1066, 1333, 1600 MT/s | Perf mode: 1066, 1333, 1600 MT/s; Lockstep mode: 1066, 1333, 1600 MT/s
VMSE speeds | Up to 2667 MT/s | Up to 3200 MT/s
DIMMs/socket | 24 DIMMs (3 DIMMs per DDR3 channel) | 24 DIMMs (3 DIMMs per DDR4 or DDR3 channel)
RAS | Westmere EX baseline features + | Ivy Bridge EX baseline + eMCA Gen2 + address-based memory mirroring + multiple rank sparing + DDR4 recovery for command and parity errors
Intel® Integrated I/O | 32 PCIe* 3.0 lanes, 1x x4 DMI2 | 32 PCIe 3.0 lanes, 1x x4 DMI2
Security (the same on both families) | Intel® Trusted Execution Technology, Intel® Advanced Encryption Standard New Instructions, Intel® Platform Protection with Intel® OS Guard, Intel® Data Protection Technology with Intel® Secure Key
The rest of this paper discusses some of the main enhancements in the latest product family.
3.1 Intel® Advanced Vector Extensions 2 (Intel® AVX2)
Intel AVX extended the floating point vector instructions from 128 bits to 256 bits; Intel AVX2 extends the integer vector instructions to 256 bits as well. Intel AVX2 uses the same 256-bit YMM registers as Intel AVX. It adds fused multiply-add (FMA), gather, shift, and permute instructions and is designed to benefit high performance computing (HPC), database, and audio and video applications.
The fused multiply-add (FMA) instruction computes ±(a×b)±c with only one rounding. The a×b intermediate result is not rounded, so the instruction is more accurate than separate MUL and ADD instructions. FMA increases the performance and accuracy of many floating point computations such as matrix multiplication, dot product, and polynomial evaluation. With 256-bit registers, one FMA instruction operates on 8 single-precision or 4 double-precision elements. Since FMA combines two operations into one, more floating point operations can complete per cycle; additionally, because Haswell has two FMA units, peak FLOPS are doubled.
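As an illustration, the following minimal sketch uses the FMA intrinsic in a single-precision dot product; the function name dot8 and the assumption that n is a multiple of 8 are illustrative, not from this paper. Compile with, for example, gcc -mavx2 -mfma.

```c
#include <immintrin.h>

/* Dot product of two float arrays; n is assumed to be a multiple of 8. */
float dot8(const float *a, const float *b, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* acc = va*vb + acc, one rounding */
    }
    /* Horizontal sum of the 8 lanes. */
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```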
The gather instruction loads sparse elements into a single vector. It can gather 8 single-precision (dword) or 4 double-precision (qword) data elements into a vector register in a single operation. A base address points to the data structure in memory, and an index (offset) vector gives the offset of each element from that base. A mask register tracks which elements still need to be gathered; the gather operation is complete when the mask register contains all zeros.
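A hedged sketch of the gather operation using the AVX2 intrinsic _mm256_i32gather_ps follows; the names table and idx are illustrative. Compile with, for example, gcc -mavx2.

```c
#include <immintrin.h>

/* Gather table[idx[0]] .. table[idx[7]] into one 256-bit vector. */
__m256 gather8(const float *table, const int *idx)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* The scale of 4 converts 32-bit element indices into byte offsets. */
    return _mm256_i32gather_ps(table, vindex, 4);
}
```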
Other new operations in Intel AVX2 include an integer version of permute instructions and new broadcast and blend instructions.
3.2 Haswell New Instructions (HNI)
Haswell New Instructions include 4 crypto instructions that speed up public key and SHA encryption algorithms and 12 bit-manipulation (BMI) instructions that speed up compression and signal processing algorithms. The BMI instructions perform arbitrary bit-field manipulations, leading and trailing zero bit counts, trailing set-bit manipulations, and improved rotates and arbitrary-precision multiplies. They speed up algorithms that perform bit-field extraction and packing, bit-granular encoded data processing (compression algorithms, universal coding), arbitrary-precision multiplication, and hashes.
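For illustration, here is a small sketch of BMI-style bit manipulation with compiler intrinsics (PEXT, LZCNT, TZCNT); the mask value is arbitrary. Compile with, for example, gcc -mbmi -mbmi2 -mlzcnt on a 64-bit target.

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned long long x = 0x0123456789ABCDEFULL;

    /* PEXT pulls the bits selected by the mask into contiguous low bits. */
    unsigned long long field = _pext_u64(x, 0x00000000FFFF0000ULL);

    /* Leading and trailing zero counts, common in codecs and hashes. */
    unsigned long long lead  = _lzcnt_u64(x);
    unsigned long long trail = _tzcnt_u64(x);

    printf("field=%llx lead=%llu trail=%llu\n", field, lead, trail);
    return 0;
}
```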
To use HNI, you need an updated compiler, as shown in Table 2 below.
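Because these instructions fault on older processors, runtime detection is also advisable. The sketch below, which assumes a GCC or Clang toolchain with <cpuid.h> rather than anything from this paper, checks the CPUID leaf 7 feature bits for BMI1, HLE, AVX2, BMI2, and RTM.

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 1;                              /* CPUID leaf 7 not available */
    printf("BMI1: %u\n", (ebx >> 3)  & 1);
    printf("HLE : %u\n", (ebx >> 4)  & 1);
    printf("AVX2: %u\n", (ebx >> 5)  & 1);
    printf("BMI2: %u\n", (ebx >> 8)  & 1);
    printf("RTM : %u\n", (ebx >> 11) & 1);
    return 0;
}
```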
3.3 Intel® Transactional Synchronization Extensions (Intel® TSX)
Intel TSX transactionally executes lock-protected critical sections. It executes the instructions without acquiring the lock, thereby exposing hidden concurrency. Hardware manages transactional updates to registers and memory, and everything looks atomic from the software perspective. If transactional execution fails, hardware rolls back and restarts execution of the critical section.
Intel TSX has two different interfaces:
Hardware Lock Elision (HLE): adds two instruction prefixes, XACQUIRE and XRELEASE. Software uses these legacy-compatible hints to identify critical sections; the hints are ignored on legacy hardware. Instructions are executed transactionally without acquiring the lock. An abort causes re-execution without elision.
Restricted Transactional Memory (RTM): gives software two new instructions, XBEGIN and XEND, to mark critical sections. RTM is similar to HLE but offers a more flexible interface for performing lock elision. If transactional execution aborts, control transfers to the fallback target specified by the XBEGIN operand. The software fallback handler can implement any policy, such as exponential back-off; the choice is up to the developer and depends on the workload. XTEST and XABORT are additional instructions: XTEST lets software quickly determine whether it is executing inside a transaction, and XABORT explicitly aborts one. A minimal RTM sketch follows.
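The following is a minimal, hedged sketch of RTM-based lock elision with a spinlock fallback (GCC/Clang intrinsics, compiled with -mrtm); counter, fallback_lock, and the abort code 0xff are illustrative choices, and a production policy would typically retry before falling back.

```c
#include <immintrin.h>
#include <stdatomic.h>

static atomic_int fallback_lock;   /* 0 = free, 1 = held */
static long counter;

static void lock_fallback(void)   { while (atomic_exchange(&fallback_lock, 1)) _mm_pause(); }
static void unlock_fallback(void) { atomic_store(&fallback_lock, 0); }

void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Read the lock word inside the transaction: if another thread is in
         * the fallback path, abort so the two paths cannot race. */
        if (atomic_load(&fallback_lock))
            _xabort(0xff);
        counter++;                 /* critical section, executed transactionally */
        _xend();
    } else {
        /* Transaction aborted (conflict, capacity, etc.): take the real lock. */
        lock_fallback();
        counter++;
        unlock_fallback();
    }
}
```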
Intel TSX has a simple and clean ISA interface. This is particularly useful for shared memory multithreaded applications that employ lock-based synchronization mechanisms. For more details, please visit http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/
For more details on Intel AVX2, HNI, and Intel TSX, please refer to the Intel® architecture instruction set extensions programming reference manual at https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Table 2: Compiler support options for the new instructions
For more details on Intel® C++ Compiler, visit http://software.intel.com/en-us/intel-parallel-studio-xe.
3.4 Support for DDR4 memory
The Intel Xeon processor E7-8800/4800 v3 product family supports both DDR3 and DDR4 memory. The product family also supports the Intel® C112 or C114 scalable memory buffers. With 8 DDR4/DDR3 channels per socket and up to 24 DDR4 or DDR3 DIMMs per socket, the platform supports DDR4 LR-DIMMs of up to 64 GB (see Figure 2), for up to 6 terabytes of memory in a 4-socket/96-DIMM configuration.
Figure 2: Intel® Xeon® processor E7-8800/4800 v3 memory configuration
3.5 Power Improvements
The power improvements in Intel Xeon processor E7-8800/4800 v3 product family include:
- Per-core P-states (PCPS)
  - Each core can be set to the P-state requested by the operating system (OS)
- Uncore frequency scaling (UFS)
  - The uncore frequency is controlled independently of the core frequencies
  - Power is applied where it is most needed, optimizing performance
- Faster C-states
  - Waking a core from the C3 or C6 state takes time; that transition is faster on Haswell-EX
- Lower idle power
Contact your operating system (OS) provider for details on which OS supports these features.
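As a purely illustrative sketch (not from this paper), on a Linux system whose cpufreq driver exposes the usual sysfs files you can observe the frequency each core is currently running at, which reflects per-core P-state selection when the OS supports it; the path and the four-CPU loop below are assumptions.

```c
#include <stdio.h>

int main(void)
{
    char path[128];
    for (int cpu = 0; cpu < 4; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                       /* cpufreq may not be exposed */
        long khz;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%d: %ld kHz\n", cpu, khz);
        fclose(f);
    }
    return 0;
}
```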
3.6 New RAS features
The Intel Xeon processor E7-8800/4800 v3 product family includes these new RAS features:
- Enhanced machine check architecture recovery Gen2
- Address-based memory mirroring
- Multiple rank sparing
- DDR4 recovery
Enhanced machine check architecture recovery Gen2 (EMCA2): As shown in Figure 3, EMCA2 implements an enhanced firmware-first model (FFM) of fault handling: all corrected and uncorrected errors are first signaled to BIOS/SMM (System Management Mode), allowing the firmware (FW) to determine if and when errors need to be signaled to the virtual machine monitor (VMM), OS, or software (SW). Once the FW determines that an error needs to be reported, it updates the Machine Check Architecture (MCA) banks and/or, optionally, the enhanced error log data structure and signals the OS.
Figure 3: Enhanced machine check architecture recovery Gen2
Prior to EMCA2, legacy IA-32 MCA implemented error handling by logging all errors in architected registers (MCA banks) and signaling the OS/VMM. The enhanced error log is a capability that lets the BIOS present error logs to the OS in an architectural manner using data structures located in main memory. The OS can traverse the data structures pointed to by the EXTENDED_MCG_PTR MSR and locate the enhanced error log. The memory range used for error logs is preallocated and reserved by the FW at boot time, which allows the OS to provide a correct mapping for this range at all times. These memory buffers cannot be part of System Management RAM (SMRAM), since the OS cannot read SMRAM. The range must be 4-KB aligned and may be located below or above 4 GB.
To make use of this feature, both OS- and application-level enabling are required.
For details on enhanced MCA, see: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/enhanced-mca-logging-xeon-paper.pdf.
Address-based memory mirroring: Memory mirroring provides protection against uncorrectable memory errors that would otherwise result in a platform failure. Address-based memory mirroring provides further granularity to mirror memory by allowing the FW or OS to determine a range of memory addresses to be mirrored.
A pair of mirrored DIMMs forms a redundant group. In a mirror configuration, one set of memory DIMMs is designated the primary image and the other the secondary image. Memory writes are issued to both sets of DIMMs, while memory reads are issued to the primary image. When a correctable error is detected, the primary designation toggles and the read is issued to the “new” primary image. When an uncorrectable error is detected, the primary designation hard-fails over to the other image; the failed image cannot become the primary image again until the failed DIMMs have been replaced and the image rebuilt.
Memory mirroring reduces the available memory by one half. You cannot configure memory mirroring and memory RAID at the same time.
Figure 4: Address-based memory mirroring
Address-based mirroring enables more cost-effective mirroring by mirroring just the critical portion of memory versus mirroring the entire memory space.
To use this feature, OS enabling is required. Please contact your OS provider for details on which OS versions support or will support this feature.
Multiple rank sparing: Rank sparing provides dynamic failover of a failing rank to a spare rank behind the same memory controller. Multi-rank sparing allows more than one sparing event and therefore increases system uptime. The system BIOS provides several options: 1, 2, 3, or auto (default) mode. In auto mode, up to half of the available ranks can be allocated as spare ranks. A spare rank must be equal in size to or larger than the ranks it protects. As shown in Figure 5, if more than one rank qualifies by size, a non-terminating rank is selected as the spare (for example, rank 1 in a dual-rank (DR) DIMM, and rank 1 or 3 in a quad-rank (QR) DIMM). Multiple rank sparing is enabled in firmware and requires no OS involvement. It supports up to two spare ranks per DDR channel.
Figure 5: Multiple rank sparing
This is an OEM configuration option; no OS- or application-level enabling is required.
DDR4 recovery: With DDR3 technology, recovery from command and address parity errors was not feasible; these errors were reported as fatal, requiring a system reset. DDR4-based DIMMs incorporate logic that allows the integrated memory controller (iMC) to recover from command and address parity errors. Using this feature incurs a performance tradeoff of approximately 1%.
To make use of this feature, no OS or application level enabling is required.
Figure 6: DDR4 error recovery
4. Brickland platform improvements
Some of the new features of the Brickland platform include:
- New virtualization features
- New security features
- Intel® Node Manager 3.0
4.1 Virtualization features
Virtual Machine Control Structure (VMCS) shadowing: Nested virtualization allows a root Virtual Machine Monitor (VMM) to support guest VMMs; however, the additional Virtual Machine (VM) exits can hurt performance. As shown in Figure 7, VMCS shadowing directs the guest VMM’s VMREAD/VMWRITE instructions to a VMCS shadow structure, which reduces nesting-induced VM exits. VMCS shadowing increases efficiency by reducing virtualization latency.
Figure 7: VMCS Shadowing
This feature requires VMM enabling. Ask your VMM provider when this feature will be supported.
Cache Monitoring Technology (CMT): CMT (also known as “noisy neighbor” management) provides last-level cache occupancy monitoring, which allows the VMM to identify cache occupancy at the level of an individual application or VM. With this information, virtualization software can better schedule and migrate workloads.
This feature requires VMM enabling. Ask your VMM provider when this feature will be supported.
Extended Page Table (EPT) Accessed/Dirty (A/D) bits: In the previous-generation platform, accessed and dirty bits (A/D bits) were emulated in the VMM, and accessing them caused VM exits. Brickland implements EPT A/D bits in hardware to reduce VM exits (Figure 8), enabling efficient live migration of virtual machines and fault tolerance.
Figure 8: EPT A/D in HW
This feature requires VMM enabling. Ask your VMM provider when this feature will be supported.
Intel® VT-x latency reduction: Performance overhead arises from virtualization transition round trips: “exits” from VM to VMM and “entries” from VMM to VM caused by the handling of privileged instructions. Intel has made continuing enhancements to reduce transition times in each platform generation; Brickland reduces VMM overhead further and increases virtualization performance.
4.2 New Security features
System Management Mode (SMM) external call trap (SECT): SMM is an operating mode in which all normal execution (including the OS’s) is suspended and special separate software (usually firmware or a hardware-assisted debugger) runs in a high-privilege mode. SMM is entered in response to a system management interrupt (SMI) to run handler code. Without SECT, the SMI handler could execute malicious code placed in user memory; with SECT, the handler cannot invoke code in user memory.
Figure 9: SMM external call trap
BIOS level enabling is required to turn on this feature.
General Crypto Assist (Intel AVX2, 4th ALU, RORX for hashing): Intel AVX2 (256-bit integer operations, better bit manipulation, finer permute granularity) and the fourth ALU (arithmetic logic unit) help crypto algorithms run faster, while RORX accelerates hash algorithms. Please refer to the Intel® Architecture Instruction Set Extensions Programming Reference for more details on these instructions.
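As an illustration, here is a hash-style mixing step using the RORX intrinsic (_rorx_u32, BMI2; compile with, for example, gcc -mbmi2); mix() and the constants are illustrative, not taken from any particular hash standard.

```c
#include <immintrin.h>

/* One mixing step: RORX rotates without touching the flags register,
 * which keeps hash inner loops free of flag dependencies. */
unsigned int mix(unsigned int h, unsigned int word)
{
    h ^= word;
    return _rorx_u32(h, 7) * 0x9E3779B9u;
}
```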
Asymmetric Crypto Assist (MULX for public key): The new MULX instruction improves asymmetric (public key) cryptography and other large-integer arithmetic. Please refer to the Intel® Architecture Instruction Set Extensions Programming Reference for more details on this instruction.
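A hedged sketch of the 64x64-to-128-bit multiply that big-number (public key) arithmetic builds on, using the MULX intrinsic (_mulx_u64, BMI2, x86-64; compile with, for example, gcc -mbmi2); the function name is illustrative.

```c
#include <immintrin.h>

/* Full 128-bit product of two 64-bit operands: MULX returns the low half
 * and writes the high half through the pointer, without modifying flags. */
void mul_64x64(unsigned long long a, unsigned long long b,
               unsigned long long *lo, unsigned long long *hi)
{
    *lo = _mulx_u64(a, b, hi);
}
```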
Symmetric Crypto Assist (Intel® AES-NI optimization): Brickland includes enhancements and extensions for symmetric cryptography with Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) and beyond. Please refer to this article to find out more about Intel AES-NI and how to use it.
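For illustration, here is a single AES encryption round using the AES-NI intrinsic _mm_aesenc_si128 (compile with, for example, gcc -maes); a real cipher needs the full key schedule and 10/12/14 rounds, so the round key here is only a placeholder.

```c
#include <wmmintrin.h>

/* AESENC performs ShiftRows, SubBytes, MixColumns, and AddRoundKey
 * for one middle round of AES in a single instruction. */
__m128i aes_one_round(__m128i state, __m128i round_key)
{
    return _mm_aesenc_si128(state, round_key);
}
```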
PCH-ME Digital Random Number Generator (DRNG): The Manageability Engine (ME) is an independent and autonomous controller in the platform’s architecture. ME requires well secured communication methods given its autonomy and access to low-level platform mechanisms. Providing the ME with a high-quality randomization source is necessary to maximize platform security. PCH-ME DRNG Technology provides real entropy and generates highly unpredictable random numbers for encryption use by ME, isolated from other system resources.
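The ME’s DRNG is internal to the platform and not directly callable from host applications; application code that needs hardware-sourced random numbers typically uses the CPU’s own digital random number generator (Intel® Secure Key, listed in Table 1) through RDRAND. A minimal sketch, assuming a compiler with <immintrin.h> and -mrdrnd:

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned long long r;
    /* RDRAND can transiently report failure; the usual guidance is to retry. */
    for (int tries = 0; tries < 10; tries++) {
        if (_rdrand64_step(&r)) {
            printf("random: %llu\n", r);
            return 0;
        }
    }
    return 1;
}
```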
4.3 Intel® Node Manager 3.0
Brickland comes with Intel Node Manager 3.0, the latest version. Its improvements include:
- Predictive power limiting
  - Power throttles engage predictively as system power approaches the limit
- Power limit enforced during boot
  - The “boot spike” is controlled without complex IT processes or disabling cores
- Power management for the Intel® Xeon Phi™ coprocessor
  - Separate power limits and controls for the Intel Xeon Phi coprocessor domain and the rest of the platform
- Node Manager Power Thermal Utility (PTU)
  - Establishes key power characterization values for CPU and memory domains
  - Delivered as firmware
Please visit this link for more details on Intel Node Manager.
5. Conclusion
The Intel Xeon processor E7-8800/4800 v3 product family, combined with the Brickland platform, provides many new and improved features that can significantly improve performance and power efficiency on enterprise platforms.
About the Author
Sree Syamalakumari is a software engineer in the Software & Service Group at Intel Corporation. Sree holds a Master's degree in Computer Engineering from Wright State University, Dayton, Ohio.