The cores and vector processors on modern multi-core processors are voracious consumers of data. If you do not organize your application to supply data at high-enough bandwidth to the cores, any work you do to vectorize or parallelize your code will be wasted because the cores will stall while waiting for data. The system will look busy, but not fast.
The following chart, which reflects the numbers for a hypothetical 4-processor system with 18 cores per processor, shows the:
- Bandwidth that each major component of a core/processor/memory subsystem can produce or consume
- Latencies from when the core requests data from a device until the data arrives
Note: This chart data (from various public sources) is approximate.
As you can see:
- The capabilities of the L1 cache, L2 cache, and vector processing units dwarf the other components, whether the data is streaming in or randomly accessed. To provide the core operations with data at the rate they can consume it, most of the data must come from the L1 or L2 caches.
- A small fraction can come from the in-processor L3 cache, but even 1% of the accesses traveling beyond that cache make the memory subsystem the dominant source of delays in applications that are load/store intensive.
- While the MCDRAM on the Intel® Xeon Phi™ processor code-named Knights Landing is a huge performance boost for memory-bound applications, it alone is clearly insufficient to keep the cores busy.
Performance Improvement Overview
You can often achieve huge performance gains by choosing better algorithms or using better coding techniques. There are many resources on this subject, so this topic is not covered here except for some best design practices at the end of this article. This article assumes you have the best algorithm and concentrates solely on reducing the time spent accessing the data.
Almost all data used across a long set of instructions – the working set or active data for those instructions – must be loaded into the L1 or L2 cache once and should be used from there repeatedly before it is finally evicted to make room for other needed data.
It is possible to get more active data into a cache by decreasing the number of bytes needed to store it. For example, use:
- 32-bit pointers instead of 64-bit pointers
- int array indices instead of pointers
- float instead of double
- char instead of bool
Avoid storing bytes that do not contain active data. Data is stored in the caches in aligned, contiguous, 64-byte quantities called cache lines. It is often possible to ensure all 64 bytes contain active data by rearranging data structures.
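As a hedged illustration of these techniques, here is a minimal sketch (the struct and its fields are hypothetical) showing how smaller types, an array index in place of a pointer, and moving a rarely used flag elsewhere shrink an element from 24 bytes to 8, so eight elements fit in each 64-byte cache line instead of two:

```cpp
#include <cstdint>

// Hypothetical original layout: 24 bytes per element after alignment
// padding, so only two complete elements fit in a 64-byte cache line.
struct ParticleBig {
    double       mass;    // 8 bytes
    ParticleBig *next;    // 8-byte pointer
    bool         active;  // 1 byte + 7 bytes of padding
};

// Compacted layout: float instead of double, a 32-bit index into the
// containing array instead of a 64-bit pointer, and the active flags of
// many elements packed into a separate bit mask elsewhere.
struct ParticleSmall {
    float         mass;  // 4 bytes
    std::uint32_t next;  // 4-byte array index
};

static_assert(sizeof(ParticleSmall) == 8,
              "eight elements per 64-byte cache line");
```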
It is unlikely the above techniques alone will fit your data into cache, but they may reduce your memory traffic beyond L2 significantly.
Minimizing data movement is your best opportunity for improvement.
- Consume the data on the core where it is produced, soon after it is produced.
- Put as many consumers of the data as possible on the same core.
- Consume the data before it is evicted from the cache or register.
- Use all the bytes of the 64-byte cache lines that are moved.
If you cannot sufficiently reduce data movement, try one or both of the following techniques. Although they do not transform a memory-bound situation into a compute-bound situation, they may let you overlap more memory and compute operations than the hardware does automatically, so an application that is alternately compute and memory bound changes to using both resources concurrently. You can use the following techniques to optimize a loop that is only a few lines of code, or an algorithm coded in many tens of thousands of lines of code.
- Start moving the data earlier. There are Intel® architecture prefetch instructions to support this.
- Do something useful while the data is moving. Most Intel architecture cores can perform out-of-order execution.
Performance Improvement Preparation
Before investing your time to ensure most data accesses hit in the registers, L1 cache, or L2 cache, measure performance to verify your application is indeed stalling while waiting for data. The Intel® VTune™ Amplifier XE (version 2016 and later) has enhanced support for doing these measurements.
After you identify the portions of the execution that are waiting for memory accesses, subdivide the relevant parts of application execution into steps where one step’s output data is another step’s input data. You can view each step as a whole, or subdivide each step the same way. A large step is often referred to as a phase.
For our purposes, a convenient representation is data flowing through a graph of compute nodes performing the steps and data nodes holding the data.
Often there is a startup phase that loads databases or other initial data from a file system into memory. (With non-volatile memory, the data may be available without this phase.)
For each step you need to know its data, the amount of data, and the number and pattern of accesses:
- The inputs, temporary data, shared data, and outputs may be anything from under 1 MB to over 1000 GB.
- Each data item can be read or written between zero and a huge number of times per step.
- Data item accesses can be evenly spread across the step, or concentrated at intervals during the step.
- Data items can be accessed in many patterns, such as sequentially, randomly, and in sequential bursts.
Consider the behavior of each step. Ask yourself, or measure, if the step is memory bound as it accesses its inputs, outputs, and temporaries. If it is, get answers to the following questions:
- Where are the step inputs produced, and how large are they?
- Where are the step outputs consumed, and how large are they?
- What access patterns are used to read the input and produce the output?
- Which portion of the outputs is affected by each portion of the inputs?
- How large is the temporary data and what access patterns does it have?
Once you have the answers, you are in a position to modify the code to reduce the memory traffic the accesses cause. For instance, you can replace

    produce a huge array; consume it item by item;

with

    loop { produce a small portion of the huge array; consume it item by item }

so the portions stay in one of the closer, faster, private caches.
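As a hedged sketch of that transformation (produceItem, consumeItem, and the chunk size are placeholders for your real computation), the loop below produces and consumes the data in cache-sized portions:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-item work standing in for the real computation.
double produceItem(std::size_t i) { return static_cast<double>(i) * 0.5; }
void   consumeItem(double x)      { (void)x; /* use x here */ }

void processChunked(std::size_t total) {
    // Small enough that the whole chunk stays in the L1 or L2 cache.
    const std::size_t chunkSize = 4096;
    std::vector<double> chunk(chunkSize);

    for (std::size_t base = 0; base < total; base += chunkSize) {
        const std::size_t n = std::min(chunkSize, total - base);
        // Produce a small portion of the huge array...
        for (std::size_t i = 0; i < n; i++)
            chunk[i] = produceItem(base + i);
        // ...and consume it while it is still in a nearby cache.
        for (std::size_t i = 0; i < n; i++)
            consumeItem(chunk[i]);
    }
}
```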
Performance Improvement Tactics
The operating system and the hardware choose the specific hardware used to execute instructions and store data. The choices meet requirements specified by the compilers, which in turn reflect the specifications implied by the programming language or explicit in the source code.
For instance, application source code specifies memory or file, but the operating system and hardware choose which levels of cache are used, when the data moves between caches, and which wires and transistors move the data.
The performance improvement tactics involve modifying the code or execution environment to guide these choices. If the choices matter enough, write code that enables and constrains them:
- Where to place the data – Specify an alignment in the allocating call instead of letting malloc or new default it for you (see the sketch after this list).
- When to move it – Execute instructions specifying prefetching, cache flushing, and reads and writes that do not go through the cache.
- Where to execute the step – Assign a computation to a thread bound to a core.
- When to do it – Change the priority of a thread, tile loops, and rearrange the code.
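For the first item in the list above, here is a minimal sketch of controlling placement at allocation time. It assumes a C++17 toolchain, where plain new honors alignas on an over-aligned type and std::aligned_alloc is available; the struct and sizes are placeholders.

```cpp
#include <cstdio>
#include <cstdlib>  // std::aligned_alloc, std::free, std::size_t

// alignas ensures each instance starts on its own 64-byte cache line;
// in C++17, plain new of an over-aligned type honors this automatically.
struct alignas(64) Counters {
    long hits;
    long misses;
};

int main() {
    // C11/C++17 aligned_alloc: the size must be a multiple of the alignment.
    const std::size_t n = 1024;
    float *data = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
    if (!data) return 1;

    Counters *c = new Counters{};  // 64-byte aligned because of alignas

    std::printf("data %p counters %p\n",
                static_cast<void*>(data), static_cast<void*>(c));

    delete c;
    std::free(data);
    return 0;
}
```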
Performance Improvement Tactics to Reduce Traffic
The solution for long stall times caused by memory reads or (rarely) writes is to reduce the number of transfers between the more distant and closer storage devices. This results in fewer cache misses and thus fewer or shorter stalls. You can reduce the number of transfers with the following tactics.
- Move more data per transfer by increasing the density of the information in the cache lines.
- Reuse the data moved to the closer caches by changing the access patterns (tiling).
- Store data near where it is produced, especially if it repeatedly changes before going to the next phase.
- Store data near where it is consumed, especially if it is produced just once.
- Keep more data in memory rather than in a file system – larger memories and persistence make this possible. Because accessing a file system is always slower than accessing memory, carefully consider whether you can keep the data in memory, perhaps in a compressed form.
If these tactics are insufficient, try increasing the bandwidth to the storage where the data is held. This requires changing the assignment of the data to storage devices, including changing the hardware if the needed hardware is not available.
Performance Improvement Tactics to Increase Available Bandwidth or Decrease Latency
- Use more of the hardware by splitting a computation into several computations on several cores.
- Request data before it is needed so it arrives in time.
- Find other things to do while waiting for data to arrive.
- If data is needed in several places, duplicate it.
The remainder of this article discusses each tactic in more detail.
Bring More Useful Data per Transfer
Objective: Reduce traffic.
Because transfers across most of the fabric move 64-byte cache lines, it is desirable to fit as much data as possible into these bytes.
To eliminate unnecessary data from a 64-byte cache line:
- Place used data into the fewest possible cache lines by both aligning the containing struct and ordering the data members appropriately.
- Place used and unused data into different variables or different cache lines within the variable.
- Use smaller numeric types, such as float instead of double.
- Use packed data members or bit masks.
Such techniques almost always result in faster execution, because the same data can be transferred in fewer transactions and hence in less time.
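As a hedged sketch of the first two points (the account structure and its fields are hypothetical), splitting rarely used fields out of the structure keeps the hot loop's cache lines full of data it actually reads:

```cpp
#include <string>
#include <vector>

// Before: every cache line the hot loop fetches also carries a name and a
// description that the loop never reads.
struct AccountMixed {
    double      balance;       // read every iteration
    double      interestRate;  // read every iteration
    std::string name;          // cold: only needed when printing statements
    std::string description;   // cold
};

// After: the hot loop streams through densely packed hot data only.
struct AccountHot  { double balance; double interestRate; };
struct AccountCold { std::string name; std::string description; };

double accrue(std::vector<AccountHot> &hot) {
    double total = 0.0;
    for (AccountHot &a : hot) {   // four accounts per 64-byte cache line
        a.balance *= 1.0 + a.interestRate;
        total += a.balance;
    }
    return total;                 // the AccountCold data never moved
}
```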
Tile
Objective: Reduce traffic.
If possible, reorder accesses to reuse data before dropping or flushing it. This is a critical optimization, and you should be aware of the research into cache-aware and cache-oblivious algorithms, especially techniques such as loop tiling. Read more in Tiling.
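For readers who want a concrete picture before diving into that article, here is a minimal tiled matrix transpose; the matrix is assumed square and row-major, and the tile edge of 64 is a placeholder to tune against your cache sizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Transpose dst = src^T for an N x N row-major matrix.
// Working tile by tile keeps the rows of src and the columns of dst
// resident in the L1/L2 caches while they are being reused.
void transposeTiled(const std::vector<float> &src, std::vector<float> &dst,
                    std::size_t N) {
    const std::size_t B = 64;  // tile edge; tune so two B x B tiles fit in cache
    for (std::size_t ii = 0; ii < N; ii += B) {
        for (std::size_t jj = 0; jj < N; jj += B) {
            const std::size_t iMax = std::min(ii + B, N);
            const std::size_t jMax = std::min(jj + B, N);
            for (std::size_t i = ii; i < iMax; i++)
                for (std::size_t j = jj; j < jMax; j++)
                    dst[j * N + i] = src[i * N + j];
        }
    }
}
```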
Place Data on Closer or Faster Devices
Objective: Reduce traffic.
Tiling, which uses caches to reduce data movement, is an instance of a more general idea: Place data on closer or faster devices.
Typically the devices closer to a core have higher bandwidth and lower latency. Moving all the data is often not an option, because the closer storage usually has a higher cost and smaller size. One exception is moving data from another node to the node containing the core, but in this case the data moves away from another core, which also carries a penalty.
Storing data closer may not require adding devices to the system if some of the existing hardware is not fully used or if other data can move farther from the core. Deciding which data to move can be difficult, especially for applications with long run times that are hard to extrapolate from sample workloads.
Control Data Placement
To control where data is placed relative to the thread that writes or reads it, you need to control:
- Where the thread is executing – This is done by specifying the thread-to-processor affinity.
- Where the physical memory assigned to the virtual memory for the data is placed – This is done using operating system calls, although these calls may be hidden in a library. There are new compiler directives and libraries for placing data in high-bandwidth memory (HBM).
- When data is accessed in relation to other uses of the cache – This is done using tiling and instructions such as clflushopt and prefetch.
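As a minimal Linux-specific sketch of the first two controls above, the code below pins a worker thread to a core with pthread_setaffinity_np and relies on the operating system's default first-touch policy: because the pinned thread performs the first writes, the pages land on that core's NUMA node. The core number and buffer size are placeholders.

```cpp
#include <pthread.h>  // pthread_setaffinity_np (GNU extension)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

// Pin the calling thread to one core so its data placement is predictable.
static void pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int core = 3;  // placeholder core id
    std::vector<double> data;

    std::thread worker([&] {
        pinToCore(core);
        // First touch: the pages are allocated on the NUMA node of this
        // core because this pinned thread performs the first writes.
        data.assign(1u << 20, 0.0);
        for (double &x : data) x += 1.0;  // later accesses stay node-local
    });
    worker.join();
    return 0;
}
```

Explicit placement calls, such as libnuma's numa_alloc_onnode or memkind's hbw_malloc for high-bandwidth memory, give finer control but are beyond this sketch.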
Store Data Near Where It Is Produced
If your algorithm repeatedly updates data during a step, then it is clearly ideal if the data stays in the L1 or L2 cache of the producing system until its final value is determined.
Use tiling to achieve this effect.
Store Data Near Where It Is Consumed
If the data is only written once, it starts in the L1 cache and drifts outwards as its cache slot is needed for something else. Ideally, the consumer should get data from the closest common cache before it is flushed beyond this level.
Use tiling to achieve this effect.
Handle Shared Variables
If two or more variables are in the same cache line and are accessed by different cores, the whole cache line moves between the cores, slowing down both cores.
To fix this:
- Move the variables to separate cache lines.
- Change to not sharing the variable.
- Accumulate changes in a local variable, and only rarely modify the shared variable.
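A minimal sketch of the first fix (the counters are hypothetical): alignas(64) gives each counter its own cache line, so two cores incrementing different counters no longer pass one line back and forth.

```cpp
#include <atomic>
#include <thread>

// Each counter occupies its own 64-byte cache line, so the two cores below
// never contend for the same line (no false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
    char pad[64 - sizeof(std::atomic<long>)];
};

PaddedCounter counterA, counterB;

int main() {
    std::thread t1([] { for (int i = 0; i < 1000000; i++) counterA.value++; });
    std::thread t2([] { for (int i = 0; i < 1000000; i++) counterB.value++; });
    t1.join();
    t2.join();
    return 0;
}
```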
Keep Data in Memory
Non-volatile DIMMs (NVDIMMs) are becoming available, as are fast SSDs. If your application spends a lot of time doing I/O, consider keeping the data in NVDIMMs or faster I/O devices.
Use More of the Hardware
Objective: Increase available bandwidth or decrease latency.
The extra hardware may be an additional core, an additional processor, or an additional motherboard. You may need to split a step into several steps that can be spread across the additional hardware. Be careful: You may need to move the inputs to more cores than before, and the extra movement must be offset by the amount of computation done on the extra cores. You may need to execute many tens, and perhaps hundreds, of instructions or accesses on a new core for every 64 bytes moved to recover the cost of moving them.
Be sure you are getting more of the critical hardware – any hardware shared by the existing core and the additional core is a potential bottleneck. For instance, the moved step may need to use its L1 cache effectively, because two cores thrashing a shared L2 or L3 cache may run slower than a single core doing all the work itself.
If enough work is moved and the L1 and L2 caches are effectively used, this change can decrease the elapsed time by a factor almost equal to the number of cores used.
If you can accomplish this on a large scale, you may be able to spread the step over a whole MPI cluster. If you can keep portions of the data on specific nodes and not move them around, this change can decrease the elapsed time by the number of nodes used.
Duplicate Data If It Is Needed in Several Places
Objective: Increase available bandwidth or decrease latency.
The caches are good at duplicating data being read; however, if the caches are too small, it may be more effective to keep copies of the data in the memories near the cores that require it, rather than fetching the data from memory attached to a different processor.
The caches are less effective when data is written by several cores. In this case, accumulate updates in local storage before combining them into shared data. OpenMP* technology automatically does this with reductions, but other frameworks may require explicit coding.
It may be best to write the duplicate when first computed (if there are few rewrites) or after rewriting is finished (if there are many rewrites).
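For example, an OpenMP reduction gives each thread a private copy of the accumulator and combines the copies once at the end, so the shared result is written only a handful of times rather than on every iteration. A minimal sketch (compile with your compiler's OpenMP flag):

```cpp
#include <cstddef>
#include <vector>

// Each thread accumulates into its own private copy of 'total'; OpenMP
// combines the copies once at the end, instead of every iteration writing
// a shared cache line.
double sumAll(const std::vector<double> &data) {
    double total = 0.0;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(data.size());
    #pragma omp parallel for reduction(+ : total)
    for (std::ptrdiff_t i = 0; i < n; i++)
        total += data[i];
    return total;
}
```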
Request Data Before It Is Needed
Objective: Increase available bandwidth or decrease latency.
Some processors are good at automatically prefetching data accessed by constant strides in a loop. For other situations, you may need to use the instructions and techniques described in such materials as Compiler Prefetching for the Intel® Xeon Phi™ coprocessor.
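Here is a minimal sketch of software prefetching with the _mm_prefetch intrinsic: when the loop walks an index array the hardware prefetcher cannot predict, issue a prefetch a fixed distance ahead. The distance of 16 iterations is a placeholder to tune.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Sum values selected by an index array the hardware prefetcher cannot
// predict, prefetching a fixed distance ahead of the current iteration.
double gatherSum(const double *values, const int *indices, std::size_t n) {
    const std::size_t distance = 16;  // placeholder prefetch distance to tune
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + distance < n) {
            // Start moving a future element toward the L1 cache now.
            _mm_prefetch(
                reinterpret_cast<const char*>(&values[indices[i + distance]]),
                _MM_HINT_T0);
        }
        sum += values[indices[i]];
    }
    return sum;
}
```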
Find Other Things to Do While Waiting for Data to Arrive
Objective: Increase available bandwidth or decrease latency.
Processors capable of out-of-order execution, such as Intel® Xeon® processors and later generations of Intel® Xeon Phi™ processors, may be able to execute other nearby instructions while waiting. For other cases, you may be able to restructure your code to move such operations nearby.
If nothing else helps, and the L1 and L2 caches have room for several threads to run without increasing the cache misses, you may need to put more threads on the core to exploit its hyperthreading capability.
Efficiently Access Far Memory That Is Near Another Processor
Objective: Increase available bandwidth or decrease latency.
If there is a processor nearer the memory containing the data, it may be more effective to have cores on that processor do the operations.
One of the most difficult programs to optimize can be modelled as randomly accessing a huge array:

    size_t hugeNUMAArray[trillionsOfEntries];

    void updateManyEntries() {
        for (size_t i = 0; i < loopCount; i++) {
            hugeNUMAArray[randomBetween(0, trillionsOfEntries)]++;
        }
    }
If the hugeNUMAArray is far enough away, the time taken to fetch the entry and send it back may be more than the time taken to identify the nearby processor and send a thread on that processor a message to update the entry.
This is especially true if several operations can be sent in a single message. Use all the techniques above to optimize this approach.
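Here is a hedged sketch of that batching idea. The ownerNode function, node count, and range partitioning are assumptions standing in for however your system maps indices to NUMA nodes; a real implementation would also pin each worker to a core on its node, as in the affinity sketch earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kNodes = 4;  // placeholder NUMA node count

// Hypothetical mapping from array index to owning node (range partition).
std::size_t ownerNode(std::size_t index, std::size_t arraySize) {
    return index / (arraySize / kNodes + 1);
}

void updateManyEntriesBatched(std::vector<std::uint64_t> &hugeArray,
                              const std::vector<std::size_t> &randomIndices) {
    // 1. Bucket the requests by owning node instead of touching far memory
    //    one random entry at a time.
    std::vector<std::vector<std::size_t>> buckets(kNodes);
    for (std::size_t idx : randomIndices)
        buckets[ownerNode(idx, hugeArray.size())].push_back(idx);

    // 2. One worker per node applies its whole batch; in a real system the
    //    worker would be pinned to a core on that node.
    std::vector<std::thread> workers;
    for (std::size_t node = 0; node < kNodes; node++)
        workers.emplace_back([&, node] {
            for (std::size_t idx : buckets[node]) hugeArray[idx]++;
        });
    for (std::thread &t : workers) t.join();
}
```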
Best Design Practices
Algorithms That Sequentially Access Data
The ideal streaming algorithm:
- Reads its inputs exactly once, from a nearby cache into the closest private cache.
- Has all its temporary data in a private cache.
- Writes all its outputs exactly once into a cache near the consumer, where it stays until the consumer needs it.
If the data is too large to be kept in the caches, it is important to use all the bytes transferred between the caches and dual inline memory modules (DIMMs). But be aware that data is transferred in 64-byte cache lines; do not intermingle useful and useless data within a cache line.
Algorithms That Randomly Access Data
This section applies if you have determined you cannot efficiently transform an algorithm with non-sequential reads to an algorithm with sequential reads.
Note: There are rare algorithms that serially read their data but randomly write it. The writes are rarely a problem because the processor does not wait for a write to complete.
The performance of non-sequential reads may vary significantly across processors. Processors that support out-of-order execution may be able to hide some of the read delays behind other operations. Similarly, hyperthreading may allow another thread to use the core when one is stalled. However, sharing the L1 and L2 caches may limit effectiveness.
For processor cores that do not perform out-of-order execution (such as the first generation of Intel Xeon Phi processors), and for cases where out-of-order execution cannot hide the read delays, you can improve performance by prefetching data using:
- _mm_prefetch or similar instructions
- A second thread to load the data into the L3 cache
These techniques are well documented in such materials as Compiler Prefetching for the Intel® Xeon Phi™ coprocessor (PDF).
Summary
The previous article, MCDRAM and HBM, gives more details about Intel’s on-package memory, one of Intel’s new hardware technologies. This article discusses the algorithms and analysis that are critical to getting the best possible performance out of an Intel Xeon Phi processor and its MCDRAM. If you have not already read the article on Tiling, we recommend you do so now; otherwise, the next article is How Memory Is Accessed, which gives a detailed description of how data moves through the system – knowledge that will help you become a real expert in optimizing memory traffic.
You may also want to read Performance Improvement Opportunities with NUMA, a series that covers the basics of efficiently using Intel’s new memory technologies.
About the Author
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. He has always been fascinated by memory, and enjoyed a recent plane ride because the passenger beside him was a neuroscientist who gracefully answered Bevin’s questions about the current understanding of how human memory works. Hint: It is much less reliable than computer memory.
Resources
- Intel® Xeon Phi™ Coprocessor High Performance Programming – Book on tools and approaches for memory system optimization by James Reinders and Jim Jeffers
- Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors, 2nd Edition – Book on tools and approaches for memory system optimization by Andrey Vladimirov, Ryo Asai, and Vadim Karpusenko
- Intel® Product Specifications (ARK)
- Intel QuickPath Interconnect
- Haswell L2 cache bandwidth to L1 (64 bytes/cycle)?
- Memory Latencies on Intel® Xeon® Processor E5-4600 and E7-4800 product families
- Intel® Memory Latency Checker v3.1
- Why Intel added cache partitioning in Broadwell
- Memory subsystem performance
- Cache and Memory