Introduction
Communications software requires extremely high performance, with data being exchanged in a huge number of small packets. One of the tenets of developing Network Functions Virtualization (NFV) applications is that you virtualize as far as possible, but still optimize for the underlying hardware where necessary.
In this paper, I will talk you through three features of Intel® processors that you can use to optimize the performance of your NFV applications: cache allocation technology (CAT), Intel® Advanced Vector Extensions 2 (Intel® AVX2) for processing vectors of data, and Intel® Transactional Synchronization Extensions (Intel® TSX).
Solving priority inversion with CAT
When a low priority function steals resources from a high priority function, we call it priority inversion.
Not all virtual functions are equally important. A routing function, for example, would be time and performance critical, but a media encoding function wouldn’t be. It could afford to sometimes drop a packet without affecting the user experience because nobody will notice if a video drops from 20 frames per second to 19 frames per second.
The cache is organized by default so that the heaviest user gets the biggest share of it. The heaviest user won’t necessarily be the most important application, though. In fact, the opposite is often true. High priority applications are optimized by reducing their data to the smallest set possible. Low priority applications aren’t worth optimizing in that way, and so tend to consume more memory. Some are inherently memory-hungry too: a packet inspection function for statistical analysis would be low priority, for example, but would require a lot of memory and cache use.
Developers often assume that if they put a single high priority application on a particular core, it’s safe and can’t be affected by low priority applications. That’s not true, unfortunately. Each core has its own level 1 cache (L1, the fastest but smallest cache) and level 2 cache (L2, which is slightly bigger, and somewhat slower). There are separate L1 caches for data (L1D) and program code (L1I, where I stands for instructions). The slowest cache, L3, is shared between the cores in a processor. In Intel® processor architectures up to and including Broadwell, the L3 cache is fully inclusive, which means it contains everything in the L1 and L2 caches. Because of the way the fully inclusive cache works, if something is evicted from L3, it also disappears from the associated L1 and L2 caches. This means that a low priority application that needs space in the L3 cache can evict data from the L1 and L2 caches of a high priority application, even if it’s on a different core.
In the past, there has been a workaround to resolve this, called ‘warming up’. When functions compete for L3 cache, the winner is the application that accesses the memory more often. One solution, then, is for the high priority function to keep accessing the cache when it is idle. It’s not an elegant solution, but it is often good enough, and until recently there wasn’t an alternative. Now there is: The Intel® Xeon® processor E5 v3 family introduced cache allocation technology (CAT), which enables you to allocate cache according to your applications and classes of service.
Understanding the impact of priority inversion
To demonstrate the impact of priority inversion, I wrote a simple microbenchmark that periodically runs a linked list traversal in a high priority thread, while a memory copy function is constantly running in a low priority thread. The threads are pinned to different cores on the same socket. This simulates the worst possible contention, with the copy operation being memory hungry and highly likely to disturb the more important list access thread.
Here’s the C code:
// __rdtsc() and bool need these headers; in_copy, results, result,
// spin_sleep() and POOL_SIZE_L2_LINES are defined elsewhere in the benchmark.
#include <immintrin.h>
#include <stdbool.h>

// Build a linked list of size N with a pseudo-random pattern
void init_pool(list_item *head, int N, int A, int B)
{
    int C = B;
    list_item *current = head;
    for (int i = 0; i < N - 1; i++) {
        current->tick = 0;
        C = (A * C + B) % N;            // linear congruential step
        current->next = &head[C];
        current = current->next;
    }
}

// Touch the first N elements in a linked list
void warmup_list(list_item *current, int N)
{
    bool write = (N > POOL_SIZE_L2_LINES);
    for (int i = 0; i < N - 1; i++) {
        current = current->next;
        if (write) current->tick++;
    }
}

void measure(list_item *head, int N)
{
    unsigned long long i1, i2, avg = 0;
    for (int j = 0; j < 50; j++) {
        list_item *current = head;
#if WARMUP_ON
        while (in_copy) warmup_list(head, N);  // keep the list hot while the copy runs
#else
        while (in_copy) spin_sleep(1);         // just wait for the copy to finish
#endif
        i1 = __rdtsc();
        for (int i = 0; i < N; i++) {          // timed traversal of the whole list
            current->tick++;
            current = current->next;
        }
        i2 = __rdtsc();
        avg += (i2 - i1) / 50;
        in_copy = true;                        // let the copy thread run again
    }
    results[result++] = avg / N;               // average cycles per list element
}
It contains three functions:
- The init_pool() function initializes a linked list located in a big, sparse memory area, using a simple pseudorandom number generator. This stops list elements from sitting close together in memory, which would create spatial locality and disturb our measurements, because some elements would be prefetched automatically. Each item in the list is exactly one cache line.
- The warmup_list() function constantly traverses the linked list. We have to touch the specific data we want to keep in the cache, and doing so stops the list from being evicted from the L3 cache by the other threads.
- The measure() function either spin-waits or calls warmup_list() while the memory copy is in progress, depending on which benchmark we are running. It then times a traversal of the whole list and averages the cost per list element across the runs.
The results of the microbenchmark running on a 5th generation Intel® Core™ i7 processor are shown on the graph below, where the X axis is the total number of cache lines in the linked list, and the Y axis shows the average number of CPU cycles per linked list access. As the size of the linked list increases, it spills over from the L1D cache into L2 and L3 cache, and finally into main memory.
The baseline is the red-brown line that shows the program running without the memory copy thread, and so without any contention. The blue line shows the effect of priority inversion: the memory copy function results in the list access taking significantly longer. The impact is particularly strong when the list fits into the high speed L1 cache or the L2 cache. The impact is insignificant when the list is larger than can fit into the L3 cache.
The green line shows the effect of warming up when the memory copy function is running: it dramatically cuts the access times, bringing them much closer to the baseline.
If we enable CAT and allocate parts of the L3 cache for the exclusive use of each core, the results are very close to the baseline (too close to plot here!), which is exactly our goal.
How to enable CAT
First, make sure the platform supports CAT. You can use the CPUID instruction to check leaf 7, subleaf 0, which indicates whether CAT is available.
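Here is a minimal sketch of that check in C, using the cpuid.h helper provided by GCC and Clang. The exact bit to test (EBX bit 15, the platform QoS enforcement flag) is my assumption and should be verified against the Software Developer's Manual for your target processor:

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 7, subleaf 0: EBX bit 15 is assumed here to be the
       PQE/RDT allocation flag; details are enumerated in leaf 0x10. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 15)))
        printf("CAT appears to be supported\n");
    else
        printf("CAT is not supported on this CPU\n");
    return 0;
}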
If CAT is enabled and supported, there are model specific registers (MSRs) that can be programmed to allocate different parts of L3 to different cores.
Each socket has the MSRs IA32_L3_MASKn (e.g. 0xc90, 0xc91, 0xc92, 0xc93). These registers store a bitmask that indicates how much of the L3 cache to allocate to each class of service (COS). 0xc90 stores the cache allocation for COS0, 0xc91 for COS1 and so on.
For example, this chart shows some possible bitmasks for the different classes of service, showing how the cache might be shared with COS0 getting half, COS1 getting a quarter, and COS2 and COS3 getting an eighth each. 0xc90 would contain 11110000, and 0xc93 would contain 00000001, for example.
Direct Data I/O (DDIO) has its own hidden bitmask that allows high speed PCIe devices, such as network cards, to stream data into certain parts of the L3 cache. There is a possibility that this will conflict with the classes of service you define, so you have to take it into account when designing high throughput NFV applications. To detect a conflict, use Intel® VTune™ Amplifier XE to check for cache misses. Some BIOSes have a setting to view and change the DDIO mask.
Each core has an MSR IA32_PQR_ASSOC (0xc8f), which is used to specify which class of service applies to that core. The default is 0, which means the bitmask in MSR 0xc90 is used. (By default, the bitmask of 0xc90 is all 1s, to provide maximum cache availability).
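As a rough sketch of how these registers could be programmed from user space, the fragment below uses the Linux msr driver (/dev/cpu/<n>/msr), where a pwrite() at the MSR address writes the register. It assumes root privileges, the msr module loaded, the quarter-of-the-cache mask for COS1 from the chart above, and that the class of service sits in the upper 32 bits of IA32_PQR_ASSOC; treat it as an illustration, not a hardened tool:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Write one MSR through the Linux msr driver; the file offset is the MSR address. */
static int wrmsr_on_cpu(int cpu, uint32_t msr, uint64_t value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = (pwrite(fd, &value, sizeof(value), msr) == sizeof(value));
    close(fd);
    return ok ? 0 : -1;
}

int main(void)
{
    /* The mask MSRs are socket-scoped, so writing IA32_L3_MASK1 through core 0
       covers that socket. Give COS1 a quarter of an 8-bit mask (bits 3:2),
       then attach core 2 to COS1 (COS assumed to be in bits 63:32 of 0xC8F). */
    wrmsr_on_cpu(0, 0xC91, 0x0C);               /* IA32_L3_MASK1 = 00001100b */
    wrmsr_on_cpu(2, 0xC8F, (uint64_t)1 << 32);  /* core 2 -> COS1 */
    return 0;
}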
The most straightforward usage model for CAT in NFV is to allocate chunks of L3 using isolated bitmasks to different cores, and then pin your threads or VMs to the cores. If VMs have to share cores for execution, it is also possible to make a trivial patch to an OS scheduler to add a cache mask to threads running VMs, and dynamically enable it on each process schedule event.
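For the pinning itself, a thread can be bound to its core with the standard Linux affinity call. A minimal sketch, with an illustrative core number chosen by the caller:

#define _GNU_SOURCE          /* for pthread_setaffinity_np and CPU_SET */
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core so it stays inside the cache
 * partition allocated to that core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}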
There is also an unconventional way of using CAT for locking data in the cache. First, make an active cache mask and touch the data in memory so it is loaded to L3. Then disable the bits that represent this part of the L3 cache in any CAT bitmask that will be used in the future. The data will then be locked into L3 because there is no way to evict it (apart from DDIO). In an NFV application, this mechanism is useful to lock medium sized lookup tables for routing and packet inspection in the L3 cache, to enable constant access.
CAT configuration is described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 17.15.
Using Intel AVX2 for processing vectors
Single instruction multiple data (SIMD) instructions enable you to carry out the same operation on different pieces of data at the same time. They’re often used to speed up floating point processing, but integer versions of arithmetic, logical and data manipulation instructions are also available.
Depending on which processor you are using, you will have a different family of SIMD instructions available to you, and a different size of vector that the commands can process:
- SSE offers 128-bit vectors;
- Intel AVX2 offers integer instructions for 256-bit vectors and also introduces instructions for gather operations;
- AVX3 (now known as Intel® AVX-512), coming in future Intel® architectures, will offer 512-bit vectors.
A 128-bit vector could be used for two 64-bit variables, four 32-bit variables, or eight 16-bit variables, depending on the SIMD instructions you use. Larger vectors can accommodate more data items. Given the need for high-performance throughput in NFV applications, you should use the most advanced SIMD instructions (and supporting hardware) available, which is currently Intel AVX2.
The most common use of SIMD instructions is to perform the same operation on a vector of values at the same time, as shown in the picture. Here, the operation generating X1opY1 to X4opY4 is a single instruction that handles the data items X1 to X4 and Y1 to Y4 at the same time. In this example, the speed-up would be 4x compared to normal (scalar) execution, because four operations are processed at the same time. The speed-up can be as large as the number of elements in the SIMD vector. NFV applications often involve processing multiple packet streams in the same way, so SIMD is a natural fit for optimizing performance.
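As a minimal illustration of this pattern in C, the sketch below uses Intel AVX2 intrinsics to add eight pairs of 32-bit integers with a single instruction (here op is addition, and the 256-bit vector holds eight elements rather than the four shown in the picture):

#include <immintrin.h>

/* One AVX2 instruction adds eight pairs of 32-bit integers at once:
 * the X[i] op Y[i] pattern from the figure, with op = addition. */
void add8(const int *x, const int *y, int *out)
{
    __m256i vx = _mm256_loadu_si256((const __m256i *)x);   /* X1..X8 */
    __m256i vy = _mm256_loadu_si256((const __m256i *)y);   /* Y1..Y8 */
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(vx, vy));
}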
For simple loops, the compiler is often able to automatically vectorize operations using the latest SIMD instructions available in the CPU (if you use the correct compiler flags). The code can be optimized to use the most advanced instructions available on the hardware at run-time, or can be compiled for a specific target architecture.
SIMD operations also enable memory loads, copying up to 32 bytes (or 256 bits) from memory to a register, streaming loads between memory and the register bypassing the cache, and gathering data from different memory locations. You can also perform vector permutations, which shuffle data in a single register, and vector stores, which write up to 32 bytes from a register to memory.
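Here is a short sketch of some of these operations written with intrinsics; the 32-byte buffer alignment and the reverse-order permutation pattern are illustrative choices, not requirements of the technique:

#include <immintrin.h>

/* A 32-byte load, an in-register permutation, and a streaming
 * (non-temporal) store that bypasses the cache. Both buffers are
 * assumed to be 32-byte aligned. */
void shuffle_block(const int *src, int *dst)
{
    __m256i v   = _mm256_load_si256((const __m256i *)src);   /* 32-byte load  */
    __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);  /* reverse order */
    __m256i r   = _mm256_permutevar8x32_epi32(v, idx);        /* permute lanes */
    _mm256_stream_si256((__m256i *)dst, r);                   /* bypass cache  */
}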
memcpy and memmove are famous examples of essential routines that historically were implemented with SIMD instructions because the REP MOV instruction was too slow. The memcpy code in the system libraries has been regularly updated to take advantage of newer SIMD instructions, with a CPUID-based dispatch table selecting the most recent implementation the CPU supports. Even so, the libraries tend to lag behind the latest SIMD generations.
For example, the following memcpy routine using a trivial loop is based on an intrinsic (rather than using library code) so the compiler can optimize it for the latest SIMD instructions:
_mm256_store_si256((__m256i*)(dest++), _mm256_load_si256((__m256i*)(src++)));
It compiles to the following assembly code, to deliver twice the performance of recent libraries:
c5 fd 6f 04 04          vmovdqa (%rsp,%rax,1),%ymm0
c5 fd 7f 84 04 00 00    vmovdqa %ymm0,0x10000(%rsp,%rax,1)
The assembly code from the intrinsic will copy 32 bytes (256 bits) using the latest SIMD instructions available, while library code using SSE would only copy 16 bytes (128 bits).
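For completeness, the trivial loop the intrinsic above comes from might look like the following sketch, assuming 32-byte-aligned buffers and a length that is a multiple of 32 bytes:

#include <immintrin.h>
#include <stddef.h>

/* Copy n bytes, 32 bytes per iteration, using aligned AVX2 loads and stores. */
void copy256(void *dst, const void *src, size_t n)
{
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;
    for (size_t i = 0; i < n / 32; i++)          /* one iteration = 32 bytes */
        _mm256_store_si256(d + i, _mm256_load_si256(s + i));
}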
NFV applications also often need to perform a gather operation, loading data from several non-consecutive memory locations. For example, the network card might place the incoming packets in the cache using DDIO. The NFV application might only need to access the part of the network header with the destination IP address. Using a gather operation, the application could collect the data on eight packets at the same time.
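Written with intrinsics, such a gather might look like the sketch below; the fixed packet stride and the offset of the destination address field are hypothetical values chosen only for illustration:

#include <immintrin.h>
#include <stdint.h>

/* Gather the 32-bit destination-address field of eight packets in one
 * instruction. Packet layout, stride and field offset are hypothetical. */
void gather_dst_ips(const uint8_t *pkts, uint32_t *out)
{
    const int stride = 2048;   /* hypothetical distance between packet buffers */
    const int offset = 30;     /* hypothetical offset of the destination IP    */

    __m256i idx = _mm256_setr_epi32(0 * stride, 1 * stride, 2 * stride, 3 * stride,
                                    4 * stride, 5 * stride, 6 * stride, 7 * stride);
    /* Scale 1: each index is a byte offset from pkts + offset. */
    __m256i ips = _mm256_i32gather_epi32((const int *)(pkts + offset), idx, 1);
    _mm256_storeu_si256((__m256i *)out, ips);
}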
You don't need to write the gather by hand with an intrinsic or inline assembly, though, because the compiler can vectorize code similar to the program below, which is based on a benchmark that sums numbers from pseudorandom memory locations:
int a[1024];
int b[64];
int i, sum = 0;

for (i = 0; i < 1024; i++)
    a[i] = i;                  // data table
for (i = 0; i < 64; i++)
    b[i] = (i * 1051) % 1024;  // pseudo-random indices
for (i = 0; i < 64; i++)
    sum += a[b[i]];            // This line is vectorized using gather.
The last line compiles to the following assembly:
c5 fe 6f 40 80          vmovdqu -0x80(%rax),%ymm0
c5 ed fe f3             vpaddd %ymm3,%ymm2,%ymm6
c5 e5 ef db             vpxor %ymm3,%ymm3,%ymm3
c5 d5 76 ed             vpcmpeqd %ymm5,%ymm5,%ymm5
c4 e2 55 90 3c a0       vpgatherdd %ymm5,(%rax,%ymm4,4),%ymm7
While a single gather operation is significantly faster than a sequence of scalar loads, it only pays off if the data is already in the cache. If it is not, the data has to be fetched from memory, which can cost tens or hundreds of CPU cycles. With data in the cache, a speed-up of around 10x is possible; without it, the gain might be only 5 percent.
When you’re using techniques like this, it’s important to measure your application to identify where the bottlenecks are, and whether your application is spending time on copying or gathering data. You can measure your program performance using Intel VTune Amplifier.
Another Intel AVX2 (and general SIMD) feature useful for NFV workloads is the set of bitwise and logical operations. These speed up the implementation of custom cryptography code, and bit checks are useful for ASN.1 coders, often used for data in telecommunications. Intel AVX2 can also be used for faster string matching using advanced algorithms such as the Multiple Pattern Streaming SIMD Extensions Filter (MPSSEF).
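As a simple illustration of the bitwise side, the sketch below XORs a buffer against a 256-bit key, 32 bytes per instruction; it is an illustrative kernel, not real cryptography code, and assumes the length is a multiple of 32 bytes:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* XOR a buffer against a repeating 32-byte key, 256 bits at a time. */
void xor_block(uint8_t *buf, const uint8_t *key32, size_t len)
{
    __m256i key = _mm256_loadu_si256((const __m256i *)key32);
    for (size_t i = 0; i < len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        _mm256_storeu_si256((__m256i *)(buf + i), _mm256_xor_si256(v, key));
    }
}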
Intel AVX2 works well in virtual machines. There is no difference in performance and it does not normally cause virtual machine exits.
Using Intel TSX for greater scalability
One of the challenges of parallel programs is to avoid data races, which can occur when several threads are trying to use the same data item and at least one of them is modifying it. To avoid unpredictable results, the concept of the lock has often been used, with the first thread to use a data item blocking others from using it until it’s finished. That can be inefficient, though, if you have frequently contested locks or if the locks control a larger area of memory than strictly necessary.
Intel Transactional Synchronization Extensions provide processor instructions to elide locks with hardware memory transactions, which helps to achieve better scalability. It works like this: when the program enters a section that uses Intel TSX to guard memory locations, all memory accesses are recorded, and at the end of the protected section they are either atomically committed or rolled back. The rollback happens if there was a conflicting memory access from another thread during the execution that would cause a race condition (such as writing to a location that another transaction has read). A rollback can also occur if the memory access record becomes too big for the Intel TSX implementation, if there is an I/O instruction or syscall, or if there are exceptions or virtual machine exits. I/O causes a rollback because it interacts with the outside world and so cannot be executed speculatively. A syscall is a complex operation that changes the privilege ring and memory descriptors, so it is also very difficult to roll back.
A frequently seen usage example of Intel TSX is managing hash table accesses. Usually a hash table lock is implemented to guarantee consistent table accesses, but it comes at the expense of the waiting time for contending threads. The lock is often too coarse, locking the entire hash table, although it’s usually rare for threads to attempt to access the same elements of it at the same time. As the number of cores (and threads) goes up, the coarse lock prevents scaling.
As the diagram below shows, the coarse lock can result in one thread waiting for another thread to release the hash table, even though the threads are using different elements. Using Intel TSX enables both threads to execute straight away, with their results committed when they successfully reach the end of the transaction. The hardware detects conflicts on the fly, and aborts transactions which violate correctness. In the Intel TSX implementation, thread 2 should experience no waiting, and both threads complete much sooner. The per-hash-table lock is effectively converted into a fine-grained lock delivering improved performance. Intel TSX has tracking granularity for conflicts down to the level of a cache line (64 bytes).
There are two software interfaces used in Intel TSX to indicate code sections for transactional execution:
- Hardware Lock Elision (HLE) is backwards compatible and can be used relatively easily to improve scalability without large modifications to the lock library. HLE introduces a prefix for locked instructions; the prefix is a hint that tells the hardware to track the status of the lock without actually acquiring it. In the example above, that means accesses to different hash table elements no longer lead to locking, unless there is a conflicting write access to a value stored in the table. As a result, accesses are no longer serialized, so scalability is greatly improved across four threads.
- The Restricted Transactional Memory (RTM) interface introduces explicit instructions to start (XBEGIN), commit (XEND), and abort (XABORT) a transaction, and to test a transaction's status (XTEST). These instructions give locking libraries a more flexible way to implement lock elision: RTM allows the library to implement flexible transaction abort handling algorithms, which can be used to improve Intel TSX performance through optimistic transaction retry, transaction back-off, and other advanced techniques. Using the CPUID instruction, the library can fall back to an older lock implementation without RTM, keeping backwards compatibility for the user-level code. A minimal RTM sketch follows the links below.
To learn more about HLE and RTM, I recommend the following articles on Intel Developer Zone:
https://software.intel.com/en-us/blogs/2013/06/07/web-resources-about-intelr-transactional-synchronization-extensions
https://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell
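To give a flavor of the RTM interface, here is a minimal sketch of lock elision for the hash-table scenario above. The spinlock, the table_insert() function, and the retry policy are illustrative placeholders, not a real locking library:

#include <immintrin.h>       /* RTM intrinsics; build with -mrtm */
#include <stdatomic.h>

static atomic_int table_lock;             /* 0 = free, 1 = held */
void table_insert(int key, int value);    /* hypothetical hash-table update */

void insert_with_elision(int key, int value)
{
    for (int retry = 0; retry < 3; retry++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Put the lock word in our read set; abort if someone holds it,
               so the transactional and locked paths cannot race. */
            if (atomic_load_explicit(&table_lock, memory_order_relaxed) != 0)
                _xabort(0xff);
            table_insert(key, value);     /* speculative update */
            _xend();                      /* commit atomically */
            return;
        }
        /* Transaction aborted: fall through and maybe retry. */
    }
    /* Fallback path: take the coarse lock the old-fashioned way. */
    while (atomic_exchange_explicit(&table_lock, 1, memory_order_acquire))
        ;                                 /* spin until the lock is free */
    table_insert(key, value);
    atomic_store_explicit(&table_lock, 0, memory_order_release);
}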
As well as improving your own synchronization primitives with HLE or RTM, data plane NFV functions can benefit from Intel TSX by using the Data Plane Development Kit (DPDK).
When using Intel TSX, the main challenge is not implementing it, but estimating and measuring the performance. There are Performance Monitoring Unit counters that can be used by Linux* perf, Intel® Performance Counter Monitor and Intel VTune Amplifier to see how often Intel TSX has been executed and how successful the execution was (committed vs. aborted cycles).
Intel TSX should be used cautiously in NFV applications and tested thoroughly because I/O operations in an Intel TSX-protected region always involve a rollback, and many NFV functions use a lot of I/O. NFV applications should avoid contended locks. If they have to have locks, then lock elision can help to improve scalability.
The full specification of Intel TSX can be found in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Chapter 15.
About the Author
Alexander Komarov is an Application Engineer in the Intel Software and Services Group. For the last 10 years, Alexander's main job has been optimizing customer code to achieve better performance on current and upcoming Intel server platforms. This involves using Intel software development tools (profiler, compiler, and libraries) and always taking advantage of the latest instructions, microarchitecture, and architecture advancements of the newest x86 CPUs and chipsets.
Further information
For more information on NFV, see the following videos: