Modern high performance computers are built with a combination of resources including: multi-core processors, many core processors, large caches, high speed memory, high bandwidth inter-processor communications fabric, and high speed I/O capabilities. High performance software needs to be designed to take full advantage of these wealth of resources. Whether re-architecting and/or tuning existing applications for maximum performance or architecting new applications for existing or future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources. Consider this a starting point for information regarding Code Modernization. When it comes to performance, your code matters!
Building parallel versions of software can enable applications to run a given data set in less time, run multiple data sets in a fixed amount of time, or run large-scale data sets that are prohibitive with un-optimized software. The success of parallelization is typically quantified by measuring the speedup of the parallel version relative to the serial version. In addition to that comparison, however, it is also useful to compare that speedup relative to the upper limit of the potential speedup. That issue can be addressed using Amdahl's Law and Gustafson's Law.
Good code design takes into consideration several levels of parallelism.
- The first level of parallelism is Vector parallelism (within a core) where identical computational instructions are performed on large chunks of data. Both scalar and parallel portions of code will benefit from the efficient use of vector computing.
- A second level of parallelism called Thread parallelism, is characterized by a number of cooperating threads of a single process, communicating via shared memory and collectively cooperating on a given task.
- The third level is when many codes have been developed in the style of independent cooperating processes, communicating with each other via some message passage system. This is called distributed memory Rank parallelism, so named as each process is given a unique rank number.
Developing code which uses all three levels of parallelism effectively, efficiently, and with high performance is optimal for modernizing code.
Factoring into these considerations is the impact of the memory model of the machine: amount and speed of main memory, memory access times with respect to location of memory, cache sizes and numbers, and requirements for memory coherence.
Poor data alignment for vector parallelism will generate a huge performance impact. Data should be organized in a cache friendly way. If it is not, performance will suffer, when the application requests data that’s not in the cache. The fastest memory access occurs when the needed data is already in cache. Data transfers to and from cache are in cache-lines, and as such if the next piece of data is not within the current cache-line or is scattered amongst multiple cache-lines, the application may have poor cache efficiency.
Divisional and transcendental math functions are expensive even when directly supported by the instruction set. If your application uses many division and square root operations within the run-time code, the resulting performance may be degraded because of the limited functional units within the hardware; the pipeline to these units may be dominated. Since these instructions are expensive, the developer may wish to cache frequently used values to improve performance.
There is no “one recipe, one solution” technique. A great deal depends on the problem being solved and the long term requirements for the code, but a good developer will pay attention to all levels of optimization, both for today’s requirements and for the future.
Intel has built a full suite of tools to aid in code modernization - compilers, libraries, debuggers, performance analyzers, parallel optimization tools and more. Intel even has webinars, documentation, training examples, and best known methods and case studies which are all based on over thirty years of experience as a leader in the development of parallel computers.
Code Modernization 5 Stage Framework for Multi-level Parallelism
The Code Modernization optimization framework takes a systematic approach to application performance improvement. This framework takes an application though five optimization stages, each stage iteratively improving the application performance. But before you start the optimization process, you should consider if the application needs to be re-architected (given the guidelines below) to achieve the highest performance, and then follow the Code Modernization optimization framework.
By following this framework, an application can achieve the highest performance possible on Intel® Architecture. The stepwise approach helps the developer achieve the best application performance in the shortest possible time. In another words, it allows the program to maximize its use of all parallel hardware resources in the execution environment. The stages:
- Leverage optimization tools and libraries: Profile the workload using Intel® VTune™ Amplifier to identify hotspots. Use Intel C++ compiler to generate optimal code and apply optimized libraries such as Intel® Math Kernel Library, Intel® TBB, and OpenMP* when appropriate.
- Scalar, serial optimization: Maintain the proper precision, type constants, and use appropriate functions and precision flags.
- Vectorization: Utilize SIMD features in conjunction with data layout optimizations Apply cache-aligned data structures, convert from arrays of structures to structure of arrays, and minimize conditional logic.
- Thread Parallelization: Profile thread scaling and affinitize threads to cores. Scaling issues typically are a result of thread synchronization or inefficient memory utilization.
- Scale your application from multicore to many core (distributed memory Rank parallelism): Scaling is especially important for highly parallel applications. Minimize the changes and maximize the performance as the execution target changes from one flavor of the Intel architecture (Intel® Xeon® processor) to another (Intel® Xeon Phi™ Coprocessor).
Code Modernization – The 5 Stages in Practice
Stage 1
At the beginning of your optimization project, select an optimizing development environment. The decision you make at this step will have a profound influence in the later steps. Not only will it affect the results you get, it could substantially reduce the amount of work to do. The right optimizing development environment can provide you with good compiler tools, optimized, ready-to-use libraries, and debugging and profiling tools to pinpoint exactly what the code is doing at the runtime.
Stage 2
Once you have exhausted the available optimization solutions, in order to extract greater performance from your application you will need to begin the optimization process on the application source code. Before you begin active parallel programming, you need to make sure your application delivers the right results before you vectorize and parallelize it. Equally important, you need to make sure it does the minimum number of operations to get that correct result. You should look at the data and algorithm related issues such as:
- Choosing the right floating point precision
- Choosing the right approximation method accuracy; polynomial vs. rational
- Avoiding jump algorithms
- Reducing the loop operation strength by using iteration calculations
- Avoiding or minimizing conditional branches in your algorithms
- Avoiding repetitive calculations, using previously calculated results.
You may also have to deal with language-related performance issues. If you have chosen C/C++, the language related issues are:
- Use explicit typing for all constants to avoid auto-promotion
- Choose the right types of C runtime function, e.g. doubles vs. floats:
exp()
vs.expf()
;abs()
vs.fabs()
- Explicitly tell compiler about point aliases
- Explicitly Inline function calls to avoid overhead
Stage 3
Try vector level parallelism. First try to vectorize the inner most loop. For efficient vector loops, make sure that there is minimal control flow divergence and that memory accesses are coherent. Outer loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. But, in some cases, the number of iterations in the innermost loop is small. In this case, inner-loop vectorization is not profitable. However, if an outer loop contains more work, a combination of elemental functions, strip-mining, and pragma/directive SIMD can force vectorization at this outer, profitable level.
- SIMD performs best on “packed” and aligned input data, and by its nature penalizes control divergences. In addition, good SIMD and thread performance on modern hardware can be obtained if the application implementation puts a focus on data proximity.
- If the innermost loop does not have enough work (e.g., the trip count is very low; the performance benefit of vectorization can be measured) or there are data dependencies that prevent vectorising the innermost loop, try vectorising the outer loop. The outer loop is likely to have control flow divergence; especially of the trip count of the inner loop is different for each iteration of the outer loop. This will limit the gains from vectorization. The memory access of the outer loop is more likely to be divergent than that of an inner loop. This will result in gather / scatter instructions instead of vector loads and stores and will significantly limit scaling due to vectorization. Data transformations, such as transposing a two dimensional array, may alleviate these problems, or look at switching from Arrays of Structures to Structures of Arrays.
- When the loop hierarchy is shallow, the above guideline may result in a loop that needs to be both parallelized and vectorized. In that case, that loop has to both provide enough parallel work to compensate for the overhead and also maintain control flow uniformity and memory access coherence.
- Check out the Vectorization Essentials for more details.
Stage 4
Now we get to thread level parallelization. Identify the outermost level and try to parallelize it. Obviously, this requires taking care of potential data races and moving data declaration to inside the loop as necessary. It may also require that the data be maintained in a cache efficient manner, to reduce the overhead of maintaining the data across multiple parallel paths. The rationale for the outermost level is to try to provide as much work as possible to each individual thread. Amdahl’s law states: The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Since the amount of work needs to compensate for the overhead of parallelization, it helps to have as large a parallel effort in each thread as possible. If the outermost level cannot be parallelized due to unavoidable data dependencies, try to parallelize at the next-outermost level that can be parallelized correctly.
- If the amount of parallel work achieved at the outermost level appears sufficient for the target hardware and likely to scale with a reasonable increase of parallel resources, you are done. Do not add more parallelism, as the overhead will be noticeable (thread control overhead will negate any performance improvement) and the gains are unlikely.
- If the amount of parallel work is insufficient, e.g. as measured by core scaling that only scales up to a small core count and not to the actual core count, attempt to parallelize additional layer, as outmost as possible. Note that you don’t necessarily need to scale the loop hierarchy to all the available cores, as there may be additional loop hierarchies executing in parallel.
- If step 2 did not result in scalable code, there may not be enough parallel work in your algorithm. This may mean that partitioning a fixed amount of work among many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work. Perhaps the algorithms can be scaled to do more work, for example by trying on a bigger problem size.
- Make sure your parallel algorithm is cache efficient. If it is not, rework it to be cache efficient, as cache inefficient algorithms do not scale with parallelism.
- Check out the Intel Guide for Developing Multithreaded Applications series for more details.
Stage 5
Lastly we get to multi-node (Rank) parallelism. To many developers message passing interface (MPI) is a black box the “just works” behind the scenes, to transfer data from one MPI task (process) to another. The beauty of MPI for the developer is that the algorithmic coding is hardware independent. The concern that developers have, is that with the many core architecture with 60+ cores, the communication between tasks may create a communication storm either within a node or across nodes. To mitigate these communication bottlenecks, applications should employ hybrid techniques, employing a few MPI tasks and many OpenMP threads.
- Check out the Parallelization using Intel® MPI for more details.
A well-optimized application should address vector parallelization, multi-threading parallelization, and multi-node (Rank) parallelization. However to do this efficiently it is helpful to use a standard step-by-step methodology to ensure each stage level is considered. The stages described here can be (and often are) reordered depending upon the specific needs of each individual application; you can iterate in a stage more than once to achieve the desired performance.
Experience has shown that all stages must at least be considered to ensure an application delivers great performance on today’s scalable hardware as well as being well positioned to scale effectively on upcoming generations of hardware.
Give it a try!