To improve the performance of applications and kernels we are constantly searching for novel Best Known Methods, or BKMs, but as our searches grow more esoteric it is important to keep the basics in mind and to remember how many performance improvements rely on them. This article describes some common BKMs for improving parallel performance and shows how they apply across both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. The advice collected here should help you speed up your code, whether it runs on an Intel Xeon Phi coprocessor or on an Intel Xeon processor as host.
BKM: Beware the strictures of Amdahl’s Law; use it to your advantage.
Amdahl's Law is a fundamental restriction on the performance of parallel applications. It comes from the simple observation that any parallel program can be divided into parts that do multiple calculations simultaneously and parts that don't. For the parts that do work in parallel, adding more processing units can reduce the time that portion takes to execute, but it won't improve the time of the parts that can only use one unit. That is true whether the units are independent hardware threads or lanes of a vector processing unit. The parallel and vector-processing parts of the program are time-compressible with the application of more units, but the serial and scalar parts are not, and they can quickly come to dominate the total execution time. It takes surprisingly little serial and scalar code in an algorithm to seriously hamper the utilization of a highly parallel vector machine; even as little as 1% serial time can be a significant factor. Take advantage of Amdahl's Law by driving more and more of the total computation into parallel and vector operations. Then strive to make those parallel and vector codes distribute the work as evenly as possible and use all the vector lanes as much as possible. Finally, keep in mind that as the concurrent work is driven into smaller intervals, there are some fixed system overheads that do not compress and can also come to dominate: the fork/join required to dispatch and synchronize a large number of threads to parallelize a loop can cost around 30,000 cycles. Gustafson was right: it is not enough to increase the parallel fraction; you may also want to increase the workload size to make full use of the added threads and vector lanes.
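To put a number on this (a generic illustration, not a measurement of any particular system), Amdahl's Law gives the maximum speedup on N processing units as S(N) = 1 / (s + (1 - s)/N), where s is the serial fraction. With s = 0.01 and N = 240, roughly the number of hardware threads an Intel Xeon Phi coprocessor offers, S is about 1 / (0.01 + 0.99/240), or roughly 71x: that 1% of serial time has already thrown away more than two thirds of the potential 240x speedup.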
Corollary: Focus your performance tuning interest where the application/kernel spends its time.
Another important insight within Amdahl's Law is that scale matters. Improvements to serial or scalar code have a modest effect, but similar improvements to parallel code gain a multiplier equal to the number of active threads or processes. Likewise, the code that takes up the greatest part of the overall execution time is the best candidate for optimization. Locating these optimization opportunities is the principal reason for "hot spot" analysis: identifying those areas of a program or computational kernel that occupy the hardware threads for the greatest amount of time. Getting these code sections and functions to run faster should have the biggest impact on overall execution time. Intel® Architecture provides powerful facilities for collecting various measures of how the processors are running, performance events that might point to bottlenecks in the code, such as cache misses. But you can waste a lot of time for little benefit trying to fix code that shows a high density of such events without also checking whether that same code shows up as a hot spot. Serial hot spots may also be candidates for parallelization and possibly vectorization. Hot spot analysis, and its correlation with performance monitoring events, is the bread and butter of Intel's VTune™ Amplifier XE. Use it to find the high-use code sections and to identify possible bottlenecks and resource crunches in those areas.
BKM: Manage limited system resources by consolidating their allocation and minimizing their use while maximizing reuse.
Completing computational work generally requires various system resources: threads to run the work, memory to hold data, memory bandwidth to deliver data to the processors and back again, communications bandwidth to send and receive data from other nodes, time to measure performance or time-stamp transactions. For most of these resources there is some burden in acquiring and releasing them and some cost for exceeding their limits. Allocating memory or getting the current time may cost a system call, with the attendant delay of visiting the system kernel. Communications code may have to share the channel with other concurrently running code, and relies on I/O drivers in an essentially serial kernel. Parsimony is the watchword for dealing with these resources, and many layers have already been provided to help developers economize: all the major threading models, including OpenMP*, Intel Threading Building Blocks (Intel TBB) and Intel Cilk Plus, employ thread pools so that they don't need to create new threads every time work is dispatched. Intel TBB has a memory allocation package that distributes memory pools to individual threads, enabling local allocations without a global resource lock. Continue these economical practices in your own coding: be cognizant of the overhead of particular system calls and use them sparingly. Plan dynamic data use to include reuse. A few big memory allocations at the beginning may be more efficient than smaller allocations scattered throughout the code. Cache blocking (described later) can improve memory performance by keeping the current data footprint small enough to take advantage of data reuse opportunities, avoiding extra fetches from memory.
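A minimal sketch of the allocate-once-and-reuse pattern in C (the buffer size, step count, and kernel here are hypothetical placeholders for your own code):

#include <stdlib.h>

#define NSTEPS 1000
#define N      (1 << 20)

/* Hypothetical per-step kernel; stands in for real work on the buffer. */
static void simulate_step(double *scratch, int n, int step)
{
    for (int i = 0; i < n; ++i)
        scratch[i] = (double)(i + step);
}

void run(void)
{
    /* One large allocation up front, reused every timestep, instead of
       calling malloc/free inside the loop and paying the allocator cost
       (and possible lock contention under many threads) NSTEPS times. */
    double *scratch = (double *)malloc(N * sizeof(double));
    if (!scratch)
        return;

    for (int step = 0; step < NSTEPS; ++step)
        simulate_step(scratch, N, step);   /* reuse the same buffer */

    free(scratch);
}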
BKM: Look before you leap: Determine whether your application/kernel has coprocessor potential by testing certain features on the host.
Applications with the potential to show better performance on the Intel Xeon Phi coprocessor exhibit certain characteristics that can be measured on a host system. Try measuring how well the application scales on the host: run a workload with only a single execution thread, then compare that performance to the same work using half or all the available host threads, or other counts in between. Does the performance scale over that range? If parallel efficiency declines as you step through higher thread counts, then additional hardware threads alone are unlikely to provide speedup. But take care in your measurements: dividing the work evenly across many threads can run into a natural load imbalance. To get good balance and a fair assessment of scaling, loop counts should be at least ten times the number of available threads. Bandwidth limited on the host? Measure memory bandwidth with the same thread counts and the same workload, and compare it against measurements of an optimized McCalpin Stream* Triad on that machine. If the application is not approaching the bandwidth limits of an Intel Xeon processor, it won't benefit from the higher bandwidth an Intel Xeon Phi coprocessor can provide. What about vectorization potential? Vectorization is enabled by default at optimization level -O2 and above; to disable it, use the "-no-vec -no-simd" compiler options. Compare the performance of your workload between the vectorized and non-vectorized cases. Does vectorization improve the performance of your application? If so, and if the speedup is significant relative to the maximum vector speedup your platform offers, that is a good indicator that your application is taking advantage of vectorization. If not, work on improving the vectorization efficiency of your hot loops by following these tips: http://software.intel.com/en-us/articles/vectorization-essential. To benefit from the greater parallel capabilities of Intel Xeon Phi coprocessors relative to an Intel Xeon processor, the application must make use of well over 100 threads, and must either be able to benefit from higher memory bandwidth or be effectively vectorized. For more information, see this article on Intel Many Integrated Core Architecture suitability analysis. You can also find materials on this subject at finding the right fit for your application on Intel® Xeon and Intel Xeon Phi™ processors.
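A minimal sketch of such a host scaling check, assuming OpenMP; the kernel do_work() and the array size are placeholders for your own workload. It times the same work at one thread and at the full thread count and reports the ratio (compile with your compiler's OpenMP option):

#include <omp.h>
#include <stdio.h>

#define N (1 << 24)

static double a[N], b[N];

/* Hypothetical workload: a simple streaming kernel parallelized over i. */
static void do_work(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 2.5 * b[i] + 1.0;
}

/* Time the same workload at a given thread count. */
static double time_with_threads(int nthreads)
{
    omp_set_num_threads(nthreads);
    double t0 = omp_get_wtime();
    do_work();
    return omp_get_wtime() - t0;
}

int main(void)
{
    for (int i = 0; i < N; ++i)
        b[i] = (double)i;

    int p = omp_get_max_threads();   /* capture before changing the count */
    do_work();                       /* warm up the thread pool and caches */

    double t1 = time_with_threads(1);
    double tp = time_with_threads(p);
    printf("1 thread: %.3f s, %d threads: %.3f s, speedup %.1fx\n",
           t1, p, tp, t1 / tp);
    return 0;
}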
BKM: Align vector data on 64-byte boundaries to maximize vectorization throughput, but be sure to let the compiler know all the data alignments required to enable it to generate efficient vectorized code.
Current Intel Architecture processors employ 64-byte cache lines, and 64 bytes is also the width of one Intel Xeon Phi coprocessor vector register. If data are not aligned for a vector read, supplying those data requires extra instructions and reads of multiple cache lines. By properly aligning arrays when they are created, using compiler directives such as __declspec(align(64)) for static data or _mm_malloc for dynamic data, we can minimize the number of cache line reads required to load and store the data. But we're not done yet. In addition to aligning the data, we must be certain the compiler is aware of those alignments in order to take advantage of them. Compilers may not always be able to determine the alignment of individual references, particularly for array parameters and pointers passed into functions. If data access is via pointers, the compiler must be certain the pointer is aligned. If using array indexing, both the array and the index must represent aligned locations. Use the appropriate __assume_aligned and __assume clauses before any affected data references in C/C++, or use the -align array64byte command-line option and the ASSUME_ALIGNED directive in Fortran, where the same concern applies to aligning both base arrays and their indices. For more details, read Data Alignment to Assist Vectorization.
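A small C sketch of both halves of that advice (allocate aligned, then tell the compiler), using the Intel compiler's _mm_malloc/_mm_free and __assume_aligned; the function and array names are illustrative only:

#include <xmmintrin.h>   /* _mm_malloc, _mm_free */

/* Caller guarantees x and y were allocated with 64-byte alignment. */
void scale(float *x, const float *y, float s, int n)
{
    /* Tell the compiler what it cannot prove on its own: both pointers
       refer to 64-byte-aligned data, so it can emit aligned vector loads. */
    __assume_aligned(x, 64);
    __assume_aligned(y, 64);

    for (int i = 0; i < n; ++i)
        x[i] = s * y[i];
}

void example(int n)
{
    /* 64-byte alignment matches the cache line and the coprocessor's
       vector register width. */
    float *x = (float *)_mm_malloc(n * sizeof(float), 64);
    float *y = (float *)_mm_malloc(n * sizeof(float), 64);

    for (int i = 0; i < n; ++i)
        y[i] = (float)i;
    scale(x, y, 3.0f, n);

    _mm_free(x);   /* memory from _mm_malloc must go back via _mm_free */
    _mm_free(y);
}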
BKM: To maximize the likelihood that the compiler can effectively vectorize loops, learn the syntax for specifying vector operations in your language of choice and look for high trip-count and multiply-nested loops.
Loops over arrays of data are often the most accessible structures for applying parallelism and vectorization. Frequently parallelization is applied to the outer loop levels and vectorization to the inner. While vectors can be gathered together from non-contiguous memory, better optimizations result when they come from contiguous memory. Better still, when the index incremented in the innermost loop progresses through adjacent memory locations in each array, the vectors can be read from memory directly into vector registers and the innermost scalar loop becomes a vector loop. To learn more about maximizing vector element utilization, see http://software.intel.com/en-us/articles/utilizing-full-vectors.
Effective preparation for vectorization also places some requirements on the source: there can be no backward loop-carried dependences; for example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2. If there are function calls in the loop, those functions should be inline-able within the loop, or be vector elemental functions that the compiler can use directly as vector components. The loops should have predetermined iteration counts, and avoiding conditional checks within each iteration makes vectorization a lot easier. For more details, see http://software.intel.com/en-us/articles/requirements-for-vectorizable-loops/.
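As a hypothetical illustration of the dependence rule, the first loop below carries a backward loop-carried dependence (each iteration reads a value written by the previous one) and cannot be vectorized as written, while the second has independent iterations, a known trip count, and unit-stride accesses:

/* Not vectorizable as written: a[i] depends on a[i-1] computed in the
   previous iteration (a loop-carried dependence). */
void prefix_like(float *a, int n)
{
    for (int i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}

/* Vectorizable: each iteration is independent, the trip count is known
   on entry, and the accesses march through memory with unit stride. */
void axpy(float *restrict y, const float *restrict x, float alpha, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}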
Corollary: Loop interchange is a powerful technique for bringing iteration over adjacent data into the innermost loop, where it is available for vectorization; in fact, the Intel compiler will attempt this transformation for cases it recognizes.
The easiest explanation here comes by way of a classic example:
void matmul3(float *c[], float *a[], float *b[], int msize)
{
    int i, j, k;
    for (i = 0; i < msize; ++i) {
        for (j = 0; j < msize; ++j) {
            for (k = 0; k < msize; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
This is a basic algorithm for matrix multiply, where the destination array c receives the product of arrays a and b, a canonical implementation laying out the operations over iterators i, j and k. But note that the innermost loop is incrementing k, meaning that each subsequent iteration requires a new a[i][k] and a new b[k][j]. Moreover, in the b[k][j] term, each subsequent value is in the next row, which in row-major languages like C and C++ means the next value is probably in a different cache line. But try swapping the two innermost loops:
void matmul3(float *c[], float *a[], float *b[], int msize)
{
    int i, j, k;
    for (i = 0; i < msize; ++i) {
        for (k = 0; k < msize; ++k) {
            for (j = 0; j < msize; ++j) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
All the same combinations of loop indices will be touched, so this changes only the order of evaluation, not the terms that will be summed together. But look what it does to that sum. The a-term is now constant across the inner loop: no innermost index changes at all. And for the b-term the innermost iteration now occurs in the second, "column" index, so subsequent iterations address adjacent values in the array. Not only could those values all come from the same cache line, they could be loaded together into a common vector and then multiplied by a vector built by broadcasting the a-term value into all of the vector lanes. In most cases, this version will run much faster.
When invoked at -O3, the Intel compiler performs advanced loop optimizations to enable efficient vectorization. Check the compiler optimization report (-opt-report 3) to see details of which optimizations were performed on particular loop nests. The report also offers useful tips for cases where a manual permutation of a loop nest may be beneficial (when the compiler could not perform the transformation automatically).
BKM: Cache Blocking can boost timely data reuse, minimizing memory access volume and reducing memory pressure.
Intel Xeon Phi coprocessors use special memory devices to maximize deliverable bandwidth, at the cost of extra memory latency. A common way to hide such latency is prefetching, and in particular arranging memory accesses to proceed linearly through the data of streaming-style algorithms, where each datum may be needed only once, so as to take advantage of hardware prefetch and compiler-injected software prefetch (for more details see http://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-4.pdf). However, if particular data are needed in multiple calculations, they have the potential to be reused; an optimal program minimizes the number of times any particular datum must be fetched from memory. Cache blocking offers an attractive way to increase such reuse. Imagine a calculation on a 3D array wherein, for each element at each time step, a new value is computed as a combination of the adjacent old values. The simplest loop nest spans each dimension of the array, and those spans may be very large. You can imagine the calculation proceeding across the volume, line by line in one dimension, plane by plane in two dimensions. Such array operations using data from adjacent planes might reuse those data if they are still in cache when the next plane is processed. Cache blocking improves that possibility by reducing the size of the plane processed by each thread before moving on to an adjacent area of the next plane. In such adjacent-element operations, data fetched for the next plane may be reused when it becomes the current plane, and then again when it becomes the previous plane. http://software.intel.com/en-us/articles/cache-blocking-techniques has more details on the technique of cache blocking.
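A minimal sketch of the idea, reduced to two dimensions and with hypothetical sizes; instead of sweeping whole rows, the loop walks the array in narrow column blocks so the rows touched for a block stay cached while they are reused:

#define NI    4096
#define NJ    4096
#define BLOCK 256   /* columns per block; tune so a block's rows fit in cache */

/* Hypothetical 2D analogue of the 3D example: each output point combines
   the value above, at, and below it from the previous timestep's array. */
void blocked_update(float (*out)[NJ], float (*in)[NJ])
{
    for (int jj = 0; jj < NJ; jj += BLOCK) {
        int jmax = (jj + BLOCK < NJ) ? jj + BLOCK : NJ;

        /* Sweep all rows for this narrow column block before moving on.
           The slice of row i+1 fetched here is reused as row i on the
           next iteration and as row i-1 on the one after that, while it
           is still resident in cache. */
        for (int i = 1; i < NI - 1; ++i)
            for (int j = jj; j < jmax; ++j)
                out[i][j] = (in[i - 1][j] + in[i][j] + in[i + 1][j]) * (1.0f / 3.0f);
    }
}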
BKM: Best performance comes from favoring data organizations that bring together the array components participating in vector operations, maximizing the use of content from each cache line. The ever-longer vector lengths offered by newer architectures can often be used more efficiently by reorganizing Arrays of Structures (AoS) into corresponding Structures of Arrays (SoA).
In C and C++ there is a natural tendency to organize data into structures and objects, and then to array those objects in an indexable order to represent a complex system. These structures may stay small and focused, or may grow to include data needed for other calculations or tests. Such multi-use data structures pose a tradeoff in organizing data for the most efficient cache use: is it better to split an array of structures into several smaller, parallel-indexed arrays, reducing the number of cache line reads needed to pull in the vector data? Or does the overall calculation benefit from pulling in all of the components when fetching the vector data?
If all the components of a data structure participate in a vector operation, there is still a further change you may want to consider. Newer architectures such as the Intel Xeon Phi coprocessor continue to raise the bar on vector widths, currently 16 floats per vector unit. Suppose you are dealing with position data, structs of X, Y and Z. You could fit about 5 objects' worth in each vector, with one lane idle. Or you could try reorganizing the data, putting all the Xs, Ys and Zs together: the first 16 objects' Xs go into one vector, the next 16 into the next vector. If your calculations can use vector components arrayed in this form, it is worth a test to see whether such reorganization benefits the whole computational job. For more details on data organization see http://software.intel.com/en-us/articles/memory-layout-transformations.
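A hedged sketch of the two layouts for that position example (the type and function names are illustrative):

#define N 1024

/* Array of Structures: one object's x, y, z are adjacent, so a 16-float
   vector spans parts of several objects and mixes components. */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N];

/* Structure of Arrays: all xs are contiguous, all ys, all zs, so 16
   consecutive xs load straight into one vector register. */
struct PointsSoA {
    float x[N];
    float y[N];
    float z[N];
};
struct PointsSoA points_soa;

/* With SoA, a per-component update becomes a clean unit-stride loop. */
void shift_x(struct PointsSoA *p, float dx)
{
    for (int i = 0; i < N; ++i)
        p->x[i] += dx;
}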
So the advice here is to match data layout to the data access patterns inherent in the calculations: if the algorithm can compute with all Xs, all Ys, and all Zs coalesced together as vectors then the data might best be organized that way. A successful example of that can be seen here: http://software.intel.com/en-us/articles/a-case-study-comparing-aos-arrays-of-structures-and-soa-structures-of-arrays-data-layouts.
If on the other hand your algorithm benefits overall by pulling in whole objects, then it would be wise to look for opportunities to employ gather/scatter techniques. With both vector-gather-scatter and vector-gather-prefetch instructions, the Intel Xeon Phi coprocessor is well suited to handle these complexities, but sometimes the compiler requires a little help to recognize an opportunity. http://software.intel.com/en-us/articles/bkm-coaxing-the-compiler-to-vectorize-structured-data-via-gathers shows how to help the compiler recognize such opportunities, when SoA organization is impractical.
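A small illustration of the kind of indexed-access loop where that gather support comes into play; the index array and the function are hypothetical:

/* Indirect reads through idx[]: the values a[idx[i]] are not contiguous
   in memory, so vectorizing this loop requires gather instructions, and
   the compiler may need a hint (see the pragma discussion below) before
   it will consider the loop safe to vectorize. */
float sum_indexed(const float *a, const int *idx, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[idx[i]];
    return sum;
}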
BKM: Vectorize! Vectorize! Vectorize! And let -vec-report be your friend.
The Intel Xeon Phi coprocessor is capable of great arithmetic performance, with lots of powerful vector instructions and high memory bandwidth to support them, but these can only be useful if they get used. In some sense this could be another corollary of Amdahl's Law. Intel compilers can do many things to improve the vector performance of particular kernels if they recognize the opportunities. Optimal vectorization requires certain prerequisites, such as data alignment and loop independence, and the Intel compiler can give you plenty of information about when loops vectorize and why they don't. Use -vec-report3 or the new -vec-report6 as described in http://software.intel.com/en-us/articles/overview-of-vectorization-reports-and-new-vec-report6.
Corollary: Use pragma simd (or !DEC$ SIMD) to enforce vectorization of loops (but pay attention to vec-report).
When analyzing vectorization opportunities the compiler is very conservative, always favoring correctness over speed. When faced with data of unknown origin (e.g., passed into a function via a pointer), the compiler will likely balk at vectorization, as vec-report will show. Experience suggests that if you are certain vectorization is safe in a particular loop that the compiler does not vectorize, using #pragma simd often provides the best, most predictable benefit. For an example of #pragma simd used for outer-loop vectorization, see http://software.intel.com/en-us/articles/outer-loop-vectorization.
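A minimal sketch of that pattern, assuming the programmer has verified that the pointers never overlap within the loop:

/* The compiler cannot prove that dst and src never overlap when all it
   sees are two pointers, so by default it may decline to vectorize this
   loop (vec-report will say so).  #pragma simd asserts that vectorizing
   is safe; that guarantee is now the programmer's responsibility. */
void update(float *dst, const float *src, float a, int n)
{
    #pragma simd
    for (int i = 0; i < n; ++i)
        dst[i] += a * src[i];
}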
BKM: In loops, size matters.
Though there are excellent examples of applied parallelism using pipelines and more exotic task structures, the workhorse for parallel execution, both multi-threaded and vector, is still the loop. But available resources may determine an optimal loop size for each architecture and algorithm. Sometimes the best advice is to fuse adjacent loops that run over the same index ranges in order to maximize the reuse of data while it is available in cache. And sometimes a loop engages so much data movement that its working set exceeds the cache, or it saturates available memory bandwidth (less likely but still possible on the Intel Many Integrated Core Architecture), or it uses all the available store buffers. In these cases there may be enough thrashing of machine resources that dividing the work into a couple of loops, even if it necessitates some recalculation, may prove faster overall. Performance analysis can reveal much about the operation of each loop. Hot spot analysis can identify the most important loops; correlation with specific events, like those indicating cache misses or high memory bandwidth, can help identify candidates for fusion or fission, but ultimately the algorithm and its resource requirements determine whether either is advisable. Some basic guidelines: avoid more than about 16 simultaneous address streams to make best use of the hardware prefetcher, and try to fit the working set of all the threads on each core into the L1 cache, and the working set across the cores into the L2. The compiler performs advanced loop transformations (loop fusion, distribution, etc.) at the -O3 optimization level. Check the output of -opt-report to understand the compiler's optimizations for a loop nest, and then see whether you can tune for more performance by doing further optimizations by hand. The compiler also supports several options and pragmas to fine-tune optimization behavior; see http://software.intel.com/en-us/articles/getting-started-with-intel-composer-xe-2013-compiler-pragmas-and-directives.
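As a hypothetical example of fusion, the two loops below traverse the same array over the same range; combining them means each element of a[] is loaded from memory once rather than twice:

/* Before: two passes over a[], so a[] is streamed from memory twice. */
void separate(float *b, float *c, const float *a, int n)
{
    for (int i = 0; i < n; ++i)
        b[i] = a[i] * 2.0f;
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + 1.0f;
}

/* After fusion: one pass, so a[i] is loaded once and reused while it is
   still in a register or cache.  The reverse transformation (fission) is
   the right move when a single loop's working set or address streams
   overwhelm the cache, the prefetcher, or the store buffers. */
void fused(float *b, float *c, const float *a, int n)
{
    for (int i = 0; i < n; ++i) {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] + 1.0f;
    }
}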
This is our initial stab at common Best Known Methods for optimizing parallel vector programs, but not the last. We expect to update this document based on the feedback we get and upon receipt of other performance BKMs that come to be regarded as common knowledge among the performance tuning cognoscenti. Stay tuned!