Examining the Impact of the MULX Instruction on HAProxy* Performance
One of the key components of a large datacenter or cloud deployment is the load balancer. Providing a highly available service requires multiple redundant servers, transparent failover, the ability to distribute load evenly across those servers, and, of course, the appearance of a single server to the outside world, especially when negotiating SSL sessions. This is the role that the SSL-terminating load balancer is designed to fill, and it is a demanding one: every incoming session must be accepted, SSL-terminated, and transparently handed off to a back-end server as quickly as possible, since the load balancer is a concentration point and potential bottleneck for incoming traffic. This case study examines the impact of the Intel® Xeon® v3 processor and the new MULX instruction on the SSL handshake, and how the MULX-optimized algorithms inside OpenSSL* can significantly increase the load capacity of the open source load balancer, HAProxy*.
Background
The goal of this case study is to examine the impact of code optimized for the Intel Xeon v3 line of processors on the performance of the HAProxy load balancer.
One of the new features introduced with this processor family is the MULX instruction, an extension of the MUL instruction that takes explicit destination registers and does not modify any of the CPU's arithmetic flags. This gives greater flexibility when building algorithms at the assembly level, and in particular allows multiplications to be interleaved with add-carry instructions without corrupting the carry chain. As a result, MULX enables fast implementations of large integer arithmetic. More information on the MULX instruction can be found in the white paper, New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors.
This is relevant to HAProxy in SSL mode because many public key cryptography algorithms depend on modular exponentiation, which in turn requires the multiplication of very large integers. Accelerating these multiplications on the server directly impacts the performance of the SSL handshake: the faster they can be performed, the more handshakes the server can handle, and the more connections per second can be SSL-terminated and handed off to back-end servers.
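The effect of the optimized large-integer arithmetic can be observed in isolation, before HAProxy enters the picture, with the openssl speed utility. The sketch below is only illustrative; the install prefixes for the two OpenSSL builds are placeholders for wherever they are actually installed.

    # Compare public key performance of an unoptimized and an
    # optimized OpenSSL build (install prefixes are hypothetical)
    /opt/openssl-1.0.1g/bin/openssl speed rsa2048 ecdsap256 ecdhp256
    /opt/openssl-1.0.2-beta3/bin/openssl speed rsa2048 ecdsap256 ecdhp256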
The Test Environment
The performance limits of HAProxy were tested for various TLS cipher suites by generating a large number of parallel connection requests and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined, as was the resulting connection rate that was sustained against the HAProxy server. The number of simultaneous connections was adjusted between runs to find the maximum connection rate that HAProxy could sustain for the duration without session latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.
HAProxy v1.5.8 was installed on a pre-production, two-socket Intel Xeon server system populated with two pre-production E5-2697 v3 processors clocked at 2.60 GHz with Turbo on, running Ubuntu* Server 13.10. Each E5 processor had 14 cores, for a total of 28 physical cores, and with Hyper-Threading enabled the system could execute 56 threads in total. Total system RAM was 64 GB.
HAProxy is a popular, feature-rich, and high-performance open source load balancer and reverse proxy for TCP applications, with specific features designed for handling HTTP sessions. Beginning with version 1.5, HAProxy includes native SSL support on both sides of the proxy. More information on HAProxy can be found at http://www.haproxy.org/.
The SSL capabilities for HAProxy were provided by the OpenSSL library. OpenSSL is an Open Source library that implements the SSL and TLS protocols in addition to general purpose cryptographic functions. The 1.0.2 branch, in beta as of this writing, is enabled for the Intel Xeon v3 processor and supports the MULX instruction in many of its public key cryptographic algorithms. More information on OpenSSL can be found at http://www.openssl.org/.
The server load was generated by up to six client systems as needed, a mixture of Xeon E5 and Xeon E5 v2 class hardware. Each system was connected to the HAProxy server with one or more 10 Gbit direct connect links. The server had two 4x10 Gbit network cards, and two 2x10 Gbit network cards. Two of the clients had 4x10 Gbit cards, and the remaining four had a single 10 Gbit NIC.
The network diagram for the test environment is shown in Figure 1.
Figure 1. Test network diagram.
The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an open source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so the load generation had to be split across multiple processes and, due to client CPU demands, across multiple hosts.
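For reference, a single timed run against one cipher suite looks roughly like the following; the host name, request and concurrency counts, and cipher are placeholders chosen for illustration, and an SSL-enabled build of ab is assumed.

    # Two-minute timed run against the HAProxy status page:
    # -t caps the duration, -n raises the request ceiling so the time
    # limit governs, -c sets concurrent connections, and -Z selects
    # the TLS cipher suite
    ab -t 120 -n 10000000 -c 200 -Z ECDHE-ECDSA-AES128-SHA \
        https://haproxy.example.com/test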
Each Apache Benchmark instance is completely self-contained, however, so there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were therefore written to coordinate launching multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then to collate the results.
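That wrapper is not reproduced here, but conceptually it does little more than the simplified sketch below, which starts CPU-pinned ab instances on each load client over ssh and leaves their output behind for collection. The host names, core numbers, and paths are hypothetical, and the real tool also synchronized start times and merged the results.

    # Launch one ab instance per selected client core, pinned with
    # taskset, then wait for all of them to finish
    for host in client1 client2 client3; do
        for cpu in 0 1 2 3; do
            ssh "$host" "taskset -c $cpu ab -t 120 -n 10000000 -c 50 \
                -Z ECDHE-ECDSA-AES128-SHA https://haproxy.example.com/test \
                > /tmp/ab-$cpu.log 2>&1" &
        done
    done
    wait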
The Test Plan
The goal of the test was to determine the maximum load, in connections per second, that HAProxy could sustain over two minutes of repeated incoming connection requests, and to compare the Xeon v3 optimized code (which makes use of the MULX instruction) against previous-generation code that does not contain these enhancements. For this purpose, two versions of HAProxy were built: one against the optimized OpenSSL 1.0.2-beta3 release, and one against the unoptimized OpenSSL 1.0.1g release.
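The builds were produced along the following lines. The install paths are illustrative, but USE_OPENSSL, SSL_INC, and SSL_LIB are HAProxy's standard make variables for linking against a specific OpenSSL tree.

    # Build a static OpenSSL (repeat for 1.0.1g and 1.0.2-beta3)
    cd openssl-1.0.2-beta3
    ./config no-shared --prefix=/opt/openssl-1.0.2-beta3
    make && make install

    # Build HAProxy 1.5.8 against that OpenSSL
    cd ../haproxy-1.5.8
    make TARGET=linux2628 USE_OPENSSL=1 \
        SSL_INC=/opt/openssl-1.0.2-beta3/include \
        SSL_LIB=/opt/openssl-1.0.2-beta3/lib ADDLIB=-ldl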
To eliminate as many outside variables as possible, all incoming requests to HAProxy were for its internal status page, as configured by the monitor-uri parameter in its configuration file. This meant HAProxy did not have to depend on any external servers, networks or processes to handle the client requests. This also resulted in very small page fetches so that the TLS handshake dominated the session time.
To further stress the server, the keep-alive function was left off in Apache Benchmark, forcing all requests to establish a new connection to the server and negotiate their own sessions.
The key exchange algorithms that were tested are given in Table 1.
Table 1. Selected key exchange algorithms
| Key Exchange | Certificate Type |
|---|---|
| RSA | RSA, 2048-bit |
| DHE-RSA | RSA, 2048-bit |
| ECDHE-RSA | RSA, 2048-bit |
| ECDHE-RSA | ECC, NIST P-256 |
| ECDHE-ECDSA | ECC, NIST P-256 |
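When setting up each configuration, it is worth confirming that HAProxy actually negotiates the intended suite; a quick check with openssl s_client looks like the following, where the host name is a placeholder.

    # Request a specific cipher suite and inspect the negotiated session
    openssl s_client -connect haproxy.example.com:443 \
        -cipher ECDHE-ECDSA-AES128-SHA < /dev/null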
Since bulk encryption and cryptographic signing were not a significant part of the session, these were fixed at AES with a 128-bit key and SHA-1, respectively. Varying the AES key size, AES encryption mode, or SHA hashing scheme would not have had a meaningful impact on the results.
Tests for each cipher were run for the following hardware configurations:
- 2 cores enabled (1 core per socket), Hyper-Threading off
- 2 cores enabled (1 core per socket), Hyper-Threading on
- 28 cores enabled (all cores, both sockets), Hyper-Threading off
- 28 cores enabled (all cores, both sockets), Hyper-Threading on
Reducing the system to one active core per socket, the minimum configuration on the test system, effectively simulates a low-core-count system and ensures that HAProxy performance is limited by the CPU rather than by other system resources. These measurements can be used to estimate the performance per core, as well as to project the performance of systems with higher core counts.
The all-core runs test the full capabilities of the system, show how well the performance scales to a many-core deployment, and also introduce the possibility of hitting system resource limits beyond CPU utilization.
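The study does not state whether the reduced-core configurations were set in the BIOS or from the operating system; one way to approximate them without a reboot is to take logical CPUs offline through sysfs, as in the hypothetical sketch below (core numbering depends on the system's CPU topology).

    # Take a logical CPU offline (repeat for each core to disable)
    echo 0 > /sys/devices/system/cpu/cpu2/online

    # List which logical CPUs share a physical core, useful for
    # deciding what to leave enabled when Hyper-Threading is on
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list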
System Configuration and Tuning
HAProxy was configured to operate in multi-process mode, with one worker process for each hardware thread on the system. Because the ECDHE key exchanges were tested with both RSA and ECC certificates, HAProxy was configured to serve one certificate or the other as needed for each run.
An excerpt from the configuration file, haproxy.conf, is shown in Figure 2.
    global
        daemon
        pidfile /var/run/haproxy.pid
        user haproxy
        group haproxy
        crt-base /etc/haproxy/crt
        # Adjust to match the physical number of threads
        # including threads available via Hyper-Threading
        nbproc 56
        tune.ssl.default-dh-param 2048

    defaults
        mode http
        timeout connect 10000ms
        timeout client 30000ms
        timeout server 30000ms

    frontend http-in
        bind :443 ssl crt /etc/haproxy/crt
        # Uncomment to use the ECC certificate
        # bind :443 ssl crt combined-ecc.crt
        monitor-uri /test
        default_backend servers
Figure 2. Excerpt from HAProxy configuration
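The certificates referenced above can be generated with OpenSSL. The following sketch produces a self-signed ECC (NIST P-256) certificate and bundles it into the single combined PEM file that HAProxy expects; the subject and file names are placeholders.

    # Generate a P-256 key and a self-signed certificate, then combine
    # them into the PEM file referenced by the commented bind line
    openssl ecparam -genkey -name prime256v1 -out ecc.key
    openssl req -new -x509 -key ecc.key -out ecc.crt -days 365 \
        -subj "/CN=haproxy.example.com"
    cat ecc.crt ecc.key > /etc/haproxy/crt/combined-ecc.crt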
To support the large number of simultaneous connections, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:
    * soft nofile 150000
    * hard nofile 180000
Figure 3. Excerpt from /etc/security/limits.conf
Several kernel parameters were also adjusted in /etc/sysctl.conf (some of these settings are more relevant to bulk encryption than to the handshake itself):
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 30

    # Increase system IP port limits to allow for more connections
    net.ipv4.ip_local_port_range = 2000 65535
    net.ipv4.tcp_window_scaling = 1

    # number of packets to keep in backlog before the kernel starts
    # dropping them
    net.ipv4.tcp_max_syn_backlog = 3240000

    # increase socket listen backlog
    net.ipv4.tcp_max_tw_buckets = 1440000

    # Increase TCP buffer sizes
    net.core.rmem_default = 8388608
    net.core.wmem_default = 8388608
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_mem = 16777216 16777216 16777216
    net.ipv4.tcp_rmem = 16777216 16777216 16777216
    net.ipv4.tcp_wmem = 16777216 16777216 16777216
Figure 4. Excerpt from /etc/sysctl.conf
Some of these parameters are very aggressive, but the assumption is that this system is a dedicated load-balancer and SSL/TLS terminator.
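Both sets of changes can be applied without rebooting: the kernel parameters take effect as soon as they are loaded, while the file-descriptor limits apply to new login sessions (assuming pam_limits is in use, as it is on a stock Ubuntu install).

    # Load the kernel parameters from /etc/sysctl.conf
    sysctl -p

    # In a fresh login session, verify the new file descriptor limit
    ulimit -n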
No other adjustments were made to the stock Ubuntu 13.10 server image.
Results
Results for the 2-core and 28-core runs follow. Because all tests were run on the same hardware, all performance improvements are due solely to the algorithmic optimizations for the Xeon v3 processor.
Two Cores
The raw two-core results are shown in Figure 5 and the performance deltas are in Figure 6.
Figure 5. HAProxy performance on a 2-core system with Hyper-Threading off
Figure 6. Performance gains due to Xeon v3 optimizations
These results show significant improvements across all ciphers tested, ranging from 26% to nearly 220%. They also make a compelling argument for using ECC ciphers on Xeon v3 class systems: with an enabled version of OpenSSL, just two cores are able to handle a staggering 11,500 connections per second using an ECDHE-ECDSA key exchange, more than 4.5x the performance of an RSA key exchange alone. ECDHE-RSA with an ECC certificate is a distant second, but is still over 1.6x faster than an RSA key exchange. In addition to these performance gains, both of these ciphers offer the cryptographic property of perfect forward secrecy.
Figure 7. Performance gains from enabling Hyper-Threading
Gains from enabling Hyper-Threading were very modest. The optimized algorithms are structured to keep the core's execution resources as busy as possible, which leaves little room for the additional threads.
Twenty-eight Cores
The 28-core tests look at how well these performance gains scale to a many-core deployment. Ideally, the values would scale linearly with core count, so that a 28-core system would deliver 14x the performance of a 2-core system.
The raw results are shown in Figure 8 and Figure 9. The ECDHE-ECDSA handshake once again leads the pack, with HAProxy achieving over 53,000 connections per second. However, this maximum occurred at only 63% average CPU utilization on the server, implying that the tests ran up against system resource limits other than the CPU. For all other ciphers, the performance tests reached an average CPU utilization above 99%.
This is clearly shown in Table 2, where the performance scaling was between 8.0 and 8.8 for all handshake protocols except ECDHE-ECDSA, which was only 4.1.
Table 2. Performance scaling of the 28-core system over the 2-core system
| Cipher | Scaling |
|---|---|
| AES128-SHA (RSA) | 8.6 |
| DHE-RSA-AES128-SHA (RSA) | 8.7 |
| ECDHE-RSA-AES128-SHA (RSA) | 8.8 |
| ECDHE-RSA-AES128-SHA (ECC) | 8.0 |
| ECDHE-ECDSA-AES128-SHA (ECC) | 4.1 |
Figure 8. HAProxy performance on a 28-core system with Hyper-Threading off
Figure 9. Performance gains due to Xeon v3 optimizations
The effects of Hyper-Threading were more pronounced in the 28-core case. While the percentage gains were slightly higher for most key exchange protocols, DHE-RSA actually saw a performance penalty.
Figure 10. Performance impact from enabling Hyper-Threading
Conclusions
The optimizations for the Xeon v3 processor result in significant performance gains for the HAProxy load balancer with the selected ciphers. Each key exchange algorithm realized some benefit, ranging from roughly 20% to over 200%. While these percentages are impressive, it is the absolute performance figures relative to a straight RSA key exchange that are of greatest interest.
While straight RSA benefits from the Xeon v3 optimizations, the elliptic curve Diffie-Hellman key exchanges see much larger gains. ECDHE with an RSA certificate performs roughly on par with RSA, but moving to an ECC certificate nearly doubles the number of connections per second that HAProxy can handle. Moving up again to ECDHE-ECDSA, HAProxy can handle over 4.5 times as many connections per second as straight RSA in the two-core case.
Since the ECDHE ciphers provide perfect forward secrecy, there is simply no reason for a Xeon v3 server to use the older, RSA-only ciphers. The ECDHE ciphers not only offer this added security but higher performance as well (and, in the case of ECDHE-ECDSA, significantly higher performance). This does come at the cost of added load on the client, but the key exchange in TLS only takes place at session setup, so it is neither a significant burden nor much of a trade-off.
On a massively parallel installation with Hyper-Threading enabled, HAProxy can maintain connection rates exceeding 53,000 connections per second using the ECDHE-ECDSA cipher, and do so without fully utilizing the CPU. This was achieved on an out-of-the-box Linux distribution with only minimal system and kernel tuning, and it is conceivable that even higher connection rates could be reached if the system were optimized to remove the non-CPU bottlenecks. That task was beyond the scope of this study.
Hyper-Threading also offered a modest benefit to all ciphers except the DHE-RSA exchange. While this penalty is a little unusual, the low performance of DHE-RSA compared to the other algorithms makes it an impractical choice for a high-performance datacenter anyway. In general, the performance boost from Hyper-Threading, though modest, is still large enough that a dedicated HAProxy system would probably benefit from leaving it turned on.