Channel: Intel Developer Zone Articles

Performance of Multibuffer AES-CBC on Intel® Xeon® Processors E5 v3

This paper examines the impact of the multibuffer enhancements to OpenSSL* on the Intel® Xeon® processor E5 v3 family when performing AES block encryption in CBC mode. It focuses on the performance gains seen by the Apache* web server when managing a large number of simultaneous HTTPS requests using the AES128-SHA and AES128-SHA256 ciphers, and how they stack up against the more modern AES128-GCM-SHA256 cipher. With the E5 v3 generation of processors, web servers such as Apache can obtain significant increases in maximum throughput when using multibuffer-enabled algorithms for CBC mode encryption.

 

Background

One of the performance-limiting characteristics of the CBC mode of AES encryption is that it is not parallelizable. Each block of plaintext in the stream to be encrypted depends upon the encryption of the previous block as an input, as shown in Figure 1. Only the first block has no such dependency and substitutes an initialization vector, or IV, in its place.

Figure 1. CBC mode encryption

Mathematically, this is defined as:

C_i = E_k(P_i XOR C_(i-1))

where

C_0 = IV

From this definition, it’s clear that there are no opportunities for parallelization within the algorithm for the encryption of a single data stream. To encrypt any given data block, P_n, it must first be XOR’d with the previous cipher block, C_(n-1), which means that all preceding blocks must be encrypted, in order, from 1 to n-1. The CBC mode of encryption is a classically serial operation.
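The serial dependency can be illustrated with a short sketch. Note that this is a toy model, not a real cipher: the block "encryption" below is a stand-in XOR with the key, used only to show the chaining structure C_i = E_k(P_i XOR C_(i-1)).

```python
def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    # Stand-in for AES block encryption E_k; a real implementation would
    # use an actual block cipher here.
    return bytes(b ^ k for b, k in zip(block, key))

def cbc_encrypt(blocks, key, iv):
    ciphertext = []
    prev = iv                          # C_0 = IV
    for p in blocks:
        # Each iteration needs the previous ciphertext block as input...
        mixed = bytes(a ^ b for a, b in zip(p, prev))
        c = toy_encrypt_block(mixed, key)
        ciphertext.append(c)
        prev = c                       # ...so the loop cannot be parallelized.
    return ciphertext
```

The loop-carried dependency on `prev` is exactly what prevents parallel encryption of a single stream.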

The multibuffer approach introduced in “Processing Multiple Buffers in Parallel for Performance” describes a procedure for parallelizing algorithms such as CBC that are serial in nature. Operations are interleaved such that the latencies incurred while processing one data block are masked by active operations on another, independent data block. Through careful ordering of the machine instructions, the multiple execution units within a CPU core can be utilized to process more than one data stream in parallel within a single thread.

Multibuffer solutions generally require a job scheduler and an asynchronous application model, but the OpenSSL library is a synchronous framework so a job scheduler is not an option. The solution in this case is to break down the application buffer into TLS records of equal size that can be processed in parallel due to the explicit IV as described in “Improving OpenSSL Performance”. For a web server, the implication is that only file downloads from server to client—page fetches, media downloads, etc.—will see a performance boost. File uploads to the server will not.
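The record-splitting idea can be sketched as follows. This is a simplified illustration only: the record size is an invented placeholder, and the real OpenSSL code emits proper TLS record framing and interleaves the cipher operations at the instruction level.

```python
import os

RECORD_SIZE = 16 * 1024  # hypothetical record size, for illustration only

def split_into_records(buffer: bytes, record_size: int = RECORD_SIZE):
    """Cut an application buffer into fixed-size records, each paired with
    a fresh explicit IV, so no record depends on another's ciphertext and
    the records can be encrypted in parallel."""
    records = []
    for off in range(0, len(buffer), record_size):
        iv = os.urandom(16)            # explicit per-record IV
        records.append((iv, buffer[off:off + record_size]))
    return records
```

Because every record carries its own explicit IV, the serial CBC chain is broken at record boundaries, which is what makes the multibuffer interleaving possible.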

 

The Test Environment

The performance limits of Apache were tested by generating a large number of parallel connection requests and repeating those connections as rapidly as possible for a total of two minutes. At the end of the two minutes, the maximum connection latency across all requests was examined along with the resulting throughput. The number of simultaneous connections was adjusted between runs to find the maximum throughput that Apache could sustain for the duration without connection latencies exceeding two seconds. This latency limit was taken from the research paper “A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.

Apache was installed on a pre-production, two-socket Intel Xeon processor-based server system populated with two production E5-2697 v3 processors clocked at 2.60 GHz with Intel® Turbo Boost Technology on and Intel® Hyper-Threading Technology (Intel® HT Technology) off. The system was running SUSE Linux* Enterprise Server 12. Each E5 processor had 14 cores for a total of 28 hardware threads. Total system RAM was 64 GB. Networking for the server load was provided by a pair of Intel® Ethernet Converged Network Adapters, XL710-QDA2 (Intel® Ethernet CNA XL710-QDA2).

The SSL capabilities for Apache were provided by the OpenSSL library. OpenSSL is an open source library that implements the SSL and TLS protocols in addition to general-purpose cryptographic functions. The 1.0.2 release is optimized for the Intel Xeon processor E5 v3 family and contains the multibuffer enhancements. For more information on OpenSSL, see http://www.openssl.org/. The tests in this case study were made using release 1.0.2a.

Two versions of OpenSSL 1.0.2a were built so that the performance of the multibuffer enhancements could be compared to unenhanced code on the same release. Multibuffer support was forcibly removed by defining the preprocessor symbol OPENSSL_NO_MULTIBLOCK:

$ ./Configure -DOPENSSL_NO_MULTIBLOCK [options]

The server load was generated by up to six client systems as needed, a mixture of Intel Xeon processor E5 v2 and E5 v3 class hardware. Load generators were connected to the Apache server through 40 Gbps links. Two of the clients had a single Intel Ethernet CNA XL710-QDA2 card and were connected to one of the dual-port Intel Ethernet CNA XL710-QDA2 cards on the server. The remaining four load clients each had a single-port 10 Gbit card, and their bandwidth was aggregated via a 40 Gbit switch.

The network diagram for the test environment is shown in Figure 2.

Figure 2. Test network diagram

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an open source utility included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to network bandwidth and client CPU limitations, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results. Loads were distributed based on a simple weighting system that accounted for an individual client’s network bandwidth and processing power.
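The synchronization wrapper itself is not published with the article, but the weighting logic it describes can be sketched. The client names and weights below are invented for illustration; the actual weights would be derived from each client's measured bandwidth and CPU capacity.

```python
def distribute_connections(total: int, clients: dict) -> dict:
    """Apportion a total connection count across load clients in
    proportion to a per-client weight (bandwidth x CPU capacity).
    Integer rounding remainders go to the heaviest clients first,
    so the shares always sum to the requested total."""
    weight_sum = sum(clients.values())
    shares = {name: int(total * w / weight_sum) for name, w in clients.items()}
    remainder = total - sum(shares.values())
    for name in sorted(clients, key=clients.get, reverse=True):
        if remainder == 0:
            break
        shares[name] += 1
        remainder -= 1
    return shares
```

Each client would then launch its share of `ab` instances against the server and report its results back for collation.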

 

The Test Plan

The goal of the tests was to determine the maximum throughput that Apache could sustain throughout two minutes of repeated, incoming connection requests for a target file, and to compare the results for the multibuffer-enabled version of OpenSSL against the unenhanced version. Multibuffer benefits CBC mode encryption, so the AES128-SHA and AES128-SHA256 ciphers were chosen for analysis.

The secondary goal was to compare the multibuffer results against the more modern GCM mode of block encryption. For that comparison the AES128-GCM-SHA256 cipher was chosen.

This resulted in the following cases:

  • AES128-SHA, multibuffer ON
  • AES128-SHA, multibuffer OFF
  • AES128-SHA256, multibuffer ON
  • AES128-SHA256, multibuffer OFF
  • AES128-GCM-SHA256

For each case, performance tests were repeated for a fixed target file size, starting at 1 MB and increasing by powers of four up to 4 GB, where 1 GB = 1024 MB, 1 MB = 1024 KB, and 1 KB = 1024 bytes. The use of 1 MB files and larger minimized the impact of the key exchange on the session throughput. Keep-alives were disabled so that each connection resulted in fetching a single file.
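The progression of target file sizes described above can be enumerated directly:

```python
# Target file sizes: powers of four from 1 MB to 4 GB, using binary units
# (1 GB = 1024 MB, 1 MB = 1024 KB, 1 KB = 1024 bytes).
MB = 1024 * 1024

sizes = []
size = 1 * MB
while size <= 4096 * MB:               # 4 GB = 4096 MB
    sizes.append(size)
    size *= 4

print([s // MB for s in sizes])        # sizes in MB
```

This yields seven test points per cipher case: 1, 4, 16, 64, 256, 1024, and 4096 MB.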

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket)
  • 4 cores enabled (2 cores per socket)
  • 8 cores enabled (4 cores per socket)
  • 16 cores enabled (8 cores per socket)
  • 28 (all) cores enabled (14 cores per socket)

Intel HT Technology was disabled in all configurations. Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that Apache performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the projected performance of a system with many cores.
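The per-core projection described above amounts to linear scaling capped by whatever system limit is hit first. A minimal sketch, with purely illustrative numbers rather than measured values from this study:

```python
def projected_throughput(per_core_gbps: float, cores: int,
                         ceiling_gbps: float) -> float:
    """Estimate many-core throughput by scaling a measured per-core
    figure linearly, capped at the system's throughput ceiling."""
    return min(per_core_gbps * cores, ceiling_gbps)

# Hypothetical example: if one core sustained 1.5 Gbps, 28 cores would
# project to 42 Gbps, still below a 77 Gbps network ceiling (CPU-bound);
# at 5 Gbps per core the projection would saturate the network instead.
print(projected_throughput(1.5, 28, 77.0))
```

In practice the many-core runs deviate from this linear estimate exactly where shared resources (here, the network) become the bottleneck.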

The many-core runs test the scalability of the system and introduce the possibility of system resource limits beyond just CPU utilization.

 

System Configuration and Tuning

Apache was configured to use the event Multi-Processing Module (MPM), which implements a hybrid multi-process, multi-threaded server. This is Apache’s highest performance MPM and the default on systems that support both multiple threads and thread-safe polling.

To support the large number of simultaneous connections that might occur at the smaller target file sizes, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

Figure 3. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

Figure 4. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated TLS web server.

No other adjustments were made to the stock SLES 12 server image.

 

System Performance Limits

Before running the tests, the throughput limit of the server system was explored using unencrypted HTTP. Tests on the same target file sizes, with all cores active and the same constraint of a 2-second maximum connection latency, saw a maximum achievable throughput of just over 77 Gbps (with very little CPU utilization).

The exact reason for this performance limit is not known. A cursory investigation suggested that there may have been a configuration issue with the dual-port NIC when both ports were used simultaneously, yielding much less than the adapter's maximum throughput. In-depth debugging was not done, however, due to time constraints.

Results

The maximum throughputs in Gbps achieved for the AES128-SHA and AES128-SHA256 ciphers by file size are shown in Figure 5 and Figure 6. At the smallest file size, 1 MB, the multibuffer enhancements result in about a 44% gain on average, and at the larger file sizes this gain is as high as 115%. This holds true up through 8 cores. At 16 cores, the gains begin to drop off as the throughput reaches the ceiling of 77 Gbps. In the 28-core case, the unenhanced code has nearly reached the throughput ceiling, but with significantly higher CPU utilization as shown in Figure 7.

Figure 5. Maximum throughput on Apache* server by file size for given core counts using the AES128-SHA cipher

The AES128-SHA256 cipher shows even larger gains for the multibuffer enhanced code, with about a 65% improvement for 1 MB files and jumping to 130% at larger file sizes. Because the SHA256 hashing is more CPU intensive, the overall throughput is significantly lower than the SHA1-based cipher. A side effect of this lower performance is that the multibuffer code scales through the 16-core case, and the unenhanced code never reaches the throughput ceiling even when all 28 cores are active.

Figure 6. Maximum throughput on Apache* server by file size for given core counts using the AES128-SHA256 cipher

Figure 7. Maximum CPU utilization for Apache* server: AES128-SHA cipher and 28 cores

The performance of the multibuffer-enhanced ciphers is compared to the AES128-GCM-SHA256 cipher in Figure 8. The GCM cipher outperforms both of the multibuffer-enhanced ciphers, though AES128-SHA stays within about 20% of the GCM throughput. The AES128-SHA256 cipher is the lowest performer due to the larger CPU demands of the SHA256 hashing.

Figure 8. Maximum throughput on Apache* server by file size for given core counts, comparing CBC + multibuffer to GCM encryption

 

Conclusions

The multibuffer enhancements to AES CBC encryption in OpenSSL 1.0.2 provide a significant performance boost, yielding over 2x performance in some cases. Web sites that need to retain these older ciphers in their negotiation list can achieve performance that is nearly on par with GCM for page and file downloads.

Web site administrators considering moving to AES128-SHA256 to obtain the added security from the SHA256 hashing will certainly see a significant performance boost from multibuffer, but if at all possible they should switch to GCM, which offers significantly higher performance due to its design.

