
High-Performance, Modern Code Optimizations for Computational Fluid Dynamics


Modern server farms consist of a large number of heterogeneous, energy-efficient, and very high-performance computing nodes connected to each other through a high-bandwidth network interconnect. Such systems pose one of the biggest challenges for engineers and scientists today: how to solve complex, real-world problems by efficiently using the enormous computational horsepower available from the vast arrays of multi-core processors that make up these systems. To accomplish this, we need to understand the hierarchical nature of the underlying hardware and use hierarchical parallelism in software to break computational problems into discrete parts that fully use the available computing power.

A major trend in recent years is the growing gap between processor and memory speeds. Because of this, getting the data from memory is usually the most expensive operation, while the computation itself is cheap. Thus, to get optimal performance, software must also be tuned for efficient memory utilization, which requires careful use of various memory hierarchy layers available in the form of caches.

The governing partial differential equations (PDEs) used for solving computational fluid dynamics (CFD) challenges in aerospace apply to various other fields of science as well. Moreover, the methodology for discretizing these PDEs on a finite mesh and solving them using explicit and implicit time-integration schemes can be adapted to applications from various industries and to scientific study. As such, the straightforward code-modernization framework presented here, with upper bounds on performance set by the Roofline Model and Amdahl’s Law, can be extrapolated to optimize a variety of applications and workloads, yielding faster execution times and improved software performance.

About SU2

SU2 is an open-source CFD analysis and design software suite released by the Aerospace Design Laboratory at Stanford University in 2012. The suite enables high-performance, scalable Reynolds-Averaged Navier-Stokes (RANS) calculations using explicit and implicit time integration. A recent paper jointly authored by Intel and the Stanford University team focused on performance optimizations of SU2. The team investigated opportunities for parallelism in the software components and searched for highly scalable algorithms. This work is an outcome of the Intel® Parallel Computing Center (IPCC) established at Stanford University with Prof. Juan Alonso’s research group.

  The SU2 optimizations are classified into three categories:

  1. Fine-grained parallelism using Open Multi-Processing (OpenMP or OMP)
  2. Single-Instruction, Multiple-Data (SIMD) Vectorization
  3. Memory optimizations

Hierarchical Parallelization

Current processors expose parallelism at multiple levels even within a single node: multiple processors (2-4 sockets), many cores and threads within each processor (up to 72 threads), and SIMD execution units within each core. The compiler is adept at taking care of these to a great extent. However, to effectively use all these levels of parallelism, consider using explicit hierarchical parallelism in your software.

Exploiting the hierarchical nature of the hardware via hierarchical parallelism in software is important for optimal performance. An unstructured-grid flow solver comprises a diverse range of kernels with varying compute and memory requirements, irregular data accesses, as well as variable and limited amounts of instruction-, vector-, and thread-level parallelism. By breaking the problem down into pieces, however, solutions do emerge. For a high-level breakdown of the types of compute kernels associated with computational fluid dynamics, please see a recent paper by Mudigere, et al.

For the work discussed here, the authors chose to optimize an inviscid, transonic ONERA M6 workload together with the Runge-Kutta (RK) explicit time-stepping scheme. This forms a building block for more involved turbulent and implicit time-stepping simulations, while retaining the edge loops that are the top hotspots for the other workloads as well. Moreover, because this workload exercises the bare bones of the SU2 solver, it has one of the lowest overall compute intensities (that is, flops per byte of memory accessed) and is therefore the most difficult to optimize.

Fine-grained Parallelization with OpenMP* (OMP)

OMP is an API that supports multi-platform, shared-memory, multiprocessing programming in C*, C++*, and Fortran* on most processor architectures and operating systems. Consisting of a set of compiler directives, library routines, and environment variables, OMP can be implemented in two ways:

  1. The loop-level, OMP parallel-regions method is easy to implement by incrementally adding OMP parallelization. It is less error prone because of the implicit barriers at the end of the parallel regions, but it incurs a large fork-join overhead because the number of parallel regions is usually fairly high.
  2. The high-level, functional OMP approach involves a single OMP parallel region at a very high level in the program. This approach resembles the well-known Message Passing Interface (MPI) domain decomposition. Here the iteration space for the edge loops is pre-divided by coloring the edges such that all edges of one color belong to a given OMP thread (a minimal sketch contrasting the two approaches follows this list).
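
As a rough illustration, here is a minimal sketch contrasting the two approaches on a generic edge loop. The function names, the per-edge kernel, and the simple block partition that stands in for the edge coloring are illustrative assumptions, not SU2 code.

    #include <omp.h>
    #include <vector>

    // Hypothetical stand-in for a per-edge flux computation (not the SU2 kernel).
    static double compute_flux(int e) { return 0.5 * e; }

    // Approach 1: loop-level parallelism. A parallel region (and its fork-join
    // overhead) is created for every edge loop in the solver.
    void residual_loop_level(std::vector<double>& residual) {
        #pragma omp parallel for
        for (int e = 0; e < (int)residual.size(); ++e)
            residual[e] = compute_flux(e);
    }

    // Approach 2: a single high-level parallel region. Each thread iterates only
    // over the edge range it owns (a simple block partition stands in for the
    // edge coloring), much like an MPI-style domain decomposition.
    void residual_high_level(std::vector<double>& residual) {
        const int nEdges = (int)residual.size();
        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();
            const int nth = omp_get_num_threads();
            const int begin = (int)((long long)nEdges * tid / nth);
            const int end   = (int)((long long)nEdges * (tid + 1) / nth);
            for (int e = begin; e < end; ++e)
                residual[e] = compute_flux(e);
        }
    }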

Both OMP approaches were implemented in SU2; however, the high-level approach was retained because it showed better performance. Figures 1(a) and (b) show the OMP strong scaling for the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor, respectively. Note that Intel® Hyper-Threading Technology is enabled for the Intel Xeon processors, so two OMP threads are affinitized (compactly) to a physical core for these processors, while four OMP threads are affinitized (again compactly) to a physical core for the Intel Xeon Phi coprocessors to take advantage of the four hardware threads per physical core. This helps hide the latency associated with in-order execution on a core of the Intel Xeon Phi coprocessor. Figures 1(a) and (b) show the results for both small and large meshes for the processor (in Figure 1(a)) and the coprocessor (in Figure 1(b)). For the Intel Xeon processor, the maximum scaling achieved is 11.06x for the small mesh and 12.28x for the large mesh. For the Intel Xeon Phi coprocessor, the corresponding scaling results are 31.72x and 44.28x. The large mesh shows better scaling than the small mesh because the effects of OMP load imbalance diminish as the amount of computation increases. This effect is even more pronounced on the Intel Xeon Phi coprocessor because the number of OMP threads is higher.

(a) Intel® Xeon® results.

(b) Intel® Xeon® Phi™ coprocessor results.

Figure 1. OMP strong scaling.  As indicated, two OMP threads are affinitized to a physical core for the Intel® Xeon® processor and four OMP threads are affinitized to a physical core for the Intel® Xeon Phi™ coprocessor. Compact affinity is used in both cases. 

The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. Amdahl’s Law, also known as Amdahl’s Argument, is used here to find the maximum expected improvement to an overall system when only part of the system is improved. This is often used in parallel computing to predict the theoretical maximum speedup that can be achieved using multiple processors.
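
For reference, Amdahl’s Law can be written as follows, where p is the parallelizable fraction of the run time and N is the number of processors or threads:

    S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

For example, if 95 percent of the run time is parallelizable, the speedup can never exceed 20x regardless of the thread count, which makes this bound a useful sanity check against the strong-scaling results in Figure 1.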

Balancing Work and Minimizing Dependencies

For OMP thread scheduling, a combination of dynamic and static scheduling is used for different edge loops to balance their various costs. Atomic operations are required in the dynamically scheduled case because of write contention at shared nodes.

In the statically scheduled case, the edges of the mesh are “pre-colored” so each thread knows what part of the mesh it owns. Decomposing the edge graph balances work by evenly distributing edges while minimizing dependencies at shared nodes (the “edge cuts” of the edge graph). In order to eliminate contention at the shared nodes, all edges that touch a shared node are replicated on each thread that shares the node. The appropriate data structures for these repeated edges are then added to the code to eliminate contention, and the result is similar to a halo layer approach in a distributed-memory application. The subdomains can then be further reordered, vectorized, and so forth. In this methodology, no atomic operations are required.
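
The difference between the two scheduling strategies can be sketched with a schematic edge loop that scatters a flux to the two end nodes of each edge. The array names, the flux stand-in, and the chunk size are illustrative assumptions, not the SU2 data structures.

    // Illustrative per-edge flux; not the SU2 kernel.
    static double compute_flux(int e) { return 1.0e-3 * e; }

    // Dynamically scheduled edge loop: different threads may update the same
    // node, so the scatter must be protected with atomic operations.
    void scatter_dynamic(int nEdges, const int* node0, const int* node1,
                         double* node_residual) {
        #pragma omp parallel for schedule(dynamic, 256)
        for (int e = 0; e < nEdges; ++e) {
            const double flux = compute_flux(e);
            #pragma omp atomic
            node_residual[node0[e]] += flux;
            #pragma omp atomic
            node_residual[node1[e]] -= flux;
        }
    }

    // Statically pre-colored edge loop: a thread walks only the edges it owns
    // (edges touching shared nodes are replicated per thread, like a halo
    // layer), so it writes only into its own residual copy and no atomics are
    // needed.
    void scatter_colored(int nOwnedEdges, const int* owned_edges,
                         const int* node0, const int* node1,
                         double* my_node_residual) {
        for (int k = 0; k < nOwnedEdges; ++k) {
            const int e = owned_edges[k];
            const double flux = compute_flux(e);
            my_node_residual[node0[e]] += flux;
            my_node_residual[node1[e]] -= flux;
        }
    }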

Vectorization

Vectorization is critical for achieving high performance on modern CPUs and coprocessors. In addition to scalar units, the Intel Xeon processor used in this study has four-wide double-precision (DP) SIMD units that support a wide range of SIMD instructions through Intel® Advanced Vector Extensions (Intel® AVX). In a single cycle, the processor can issue a four-wide DP floating-point multiply or add.

The Intel Xeon Phi coprocessor used in this study has eight-wide DP SIMD units for vector instructions. Thus, one can achieve a 4x and 8x speedup over scalar code for double-precision computation on an Intel Xeon processor and an Intel Xeon Phi coprocessor, respectively. As such, vectorization is even more important for the coprocessor.

Note that Amdahl’s Law also extends to vectorization. If the vector units are not used efficiently, that is, if a large portion of the code remains scalar, a calculable performance penalty is incurred.
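
As an illustrative calculation (the 80 percent figure is hypothetical, not a measurement from SU2): if 80 percent of the run time is spent in code that uses the four-wide DP SIMD units and the rest stays scalar, the best achievable speedup over purely scalar code is

    S = \frac{1}{(1 - 0.8) + \frac{0.8}{4}} = \frac{1}{0.4} = 2.5

well below the 4x hardware peak. With the eight-wide units of the coprocessor, the same 80 percent vector fraction yields only about 3.3x against an 8x peak, which is why the scalar remainder matters even more there.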

Strategies for High Compute Intensity

Functions with a very high compute intensity are great candidates for vectorization. By implementing outer-loop vectorization, savings are achieved by computing on multiple edges simultaneously in multiple SIMD lanes. This contrasts with vectorizing within edges, a method with lower SIMD efficiency. Outer-loop vectorization is often better than innermost-loop vectorization, especially when the innermost loops perform very little computation and have low trip counts.

It is critically important that the parameters passed into the vectorized (elemental) functions are accessed in a unit-stride fashion for different values of the loop iteration index (elemental functions are a feature of the Intel compiler). When this is not possible for the application’s original variables, which may be two-dimensional arrays or double pointers, packing a local copy of these variables into temporary arrays is useful. These temporary arrays are typically small (a small multiple of the SIMD length is sufficient). Sometimes a transpose is required while copying the original variables into the temporary arrays so that the fastest-changing dimension varies linearly with the vector stride. Other clauses, such as uniform and linear, can be used in the elemental function definition to give hints to the compiler.
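
A minimal sketch of this pattern is shown below, using the portable OpenMP declare simd syntax as a stand-in for the Intel compiler’s elemental (vector) function attribute. The function and array names, the four-variable state, and the four-wide SIMD length are illustrative assumptions.

    // SIMD-enabled ("elemental") helper: 'vars' and 'gamma' are the same for
    // every SIMD lane (uniform), while the index 'i' advances by one per lane
    // (linear), so vars[i] becomes a unit-stride vector load.
    #pragma omp declare simd uniform(vars, gamma) linear(i:1) notinbranch
    static double lane_flux(const double* vars, int i, double gamma) {
        return gamma * vars[i];   // stand-in for the real per-edge computation
    }

    enum { SIMD_LEN = 4, NVAR = 4 };   // four-wide DP SIMD (Intel AVX) assumed

    // Pack one SIMD_LEN block of edge data from a double-pointer layout into a
    // small unit-stride temporary, transposing so that the edge index becomes
    // the fastest-varying dimension, then call the elemental function.
    void flux_block(double** edge_vars /* [edge][var] */, int first_edge,
                    double gamma, double* out /* [SIMD_LEN] */) {
        double packed[NVAR][SIMD_LEN];
        for (int v = 0; v < NVAR; ++v)
            for (int lane = 0; lane < SIMD_LEN; ++lane)
                packed[v][lane] = edge_vars[first_edge + lane][v];

        #pragma omp simd
        for (int lane = 0; lane < SIMD_LEN; ++lane)
            out[lane] = lane_flux(packed[0], lane, gamma);
    }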

 As mentioned before, vectorization is done on the outer-loop, which is a loop over the edges of the mesh. This leads to four edges (on the Intel Xeon processor) being processed concurrently within a thread. To address the possible dependency across these edges, the write-out part is separated from the compute part, with the computed values stored in a temporary buffer for each SIMD width of edges. After the compute, scalar operations are used to write out results from the temporary buffer. The performance impact from the scalar write-out is minimal, because it is amortized by a large amount of compute in the vectorized kernels.
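
The separation of the vectorized compute from the scalar write-out might look roughly like the following sketch; the flux stand-in, the residual arrays, and the block size of four are illustrative assumptions rather than the actual Centered Residual kernel.

    enum { EDGE_SIMD = 4 };   // DP SIMD width assumed for the Intel Xeon processor

    // Stand-in for the vectorized per-edge flux (the real kernel is much larger).
    #pragma omp declare simd
    static double edge_flux(int e) { return 1.0e-3 * e; }

    void centered_residual(int nEdges, const int* node0, const int* node1,
                           double* node_residual) {
        for (int e0 = 0; e0 + EDGE_SIMD <= nEdges; e0 += EDGE_SIMD) {
            double buf[EDGE_SIMD];

            // Vectorized compute: EDGE_SIMD edges are processed concurrently in
            // the SIMD lanes, with results held in a small temporary buffer.
            #pragma omp simd
            for (int lane = 0; lane < EDGE_SIMD; ++lane)
                buf[lane] = edge_flux(e0 + lane);

            // Scalar write-out: handles the possible dependency when two of the
            // EDGE_SIMD edges share a node; its cost is amortized by the compute
            // above.
            for (int lane = 0; lane < EDGE_SIMD; ++lane) {
                const int e = e0 + lane;
                node_residual[node0[e]] += buf[lane];
                node_residual[node1[e]] -= buf[lane];
            }
        }
        // The last (nEdges % EDGE_SIMD) edges would be handled by a scalar
        // remainder loop, omitted here.
    }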

Memory Optimizations

Roofline is a performance model used to estimate an upper bound on the performance of various numerical methods and operations running on multi-core, many-core, or accelerator processor architectures. The most basic Roofline model can be used to bound floating-point performance as a function of machine peak performance, machine peak bandwidth, and arithmetic intensity. The model can be used to assess the quality of attained performance by combining locality, bandwidth, and different parallelization paradigms into a single performance figure. One can examine the resultant Roofline figure in order to determine both the implementation and inherent performance limitations. Most kernels in second-order accurate CFD codes are memory bandwidth bound; therefore, the Roofline model gives a good upper bound on performance for these.
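
In its simplest form, the Roofline bound on attainable floating-point performance can be written as follows, where P_peak is the machine peak floating-point rate, B_peak is the peak memory bandwidth, and I is the arithmetic intensity of the kernel in flops per byte:

    P_{\mathrm{attainable}} = \min\!\left( P_{\mathrm{peak}},\; I \times B_{\mathrm{peak}} \right)

A memory-bandwidth-bound kernel sits on the sloped part of this roofline, so its upper bound is simply I times B_peak, which is the relevant limit for most of the second-order CFD kernels discussed here.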

A number of approaches are available when attempting to improve memory performance, and specific techniques are described here as implemented in SU2. In general, the idea is to apply optimizations in order to improve the spatial and temporal locality of data.

Three particular techniques used for improving data locality compared to the baseline SU2 version are as follows:

  1. Minimize cache misses with edge/vertex reordering
  2. Allocate class objects more intelligently
  3. Change the data structures from array-of-structures (AOS) to structures-of-arrays (SOA)

Minimize Cache Misses

The first approach for memory optimization is a reordering of the nodes (unknowns) to minimize cache misses. This is accomplished via a Reverse Cuthill-McKee (RCM) algorithm that minimizes the bandwidth of the adjacency matrix of the mesh. An adjacency matrix encodes the edge connections of an unstructured mesh: its rows and columns correspond to vertices, and a non-zero entry means that two vertices are connected by an edge. Figures 2(a) and (b) show the adjacency matrix before and after the RCM transformation, respectively. Applying the RCM re-numbering reduces the matrix bandwidth from 170,691 to 15,515 for the smaller tetrahedral ONERA M6 mesh, and this reduction translates directly into improved cache utilization.
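
For concreteness, the matrix bandwidth being minimized is just the largest |i - j| over all edges (i, j) after renumbering. A small helper that evaluates it for a given vertex permutation might look like this (the edge-list representation is an assumption, not the SU2 data structure):

    #include <algorithm>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Bandwidth of the mesh adjacency matrix under a given vertex numbering:
    // the maximum |perm[i] - perm[j]| over all edges (i, j). RCM chooses a
    // permutation that keeps this value small, so the two end points of every
    // edge end up close together in memory.
    int adjacency_bandwidth(const std::vector<std::pair<int, int> >& edges,
                            const std::vector<int>& perm /* old -> new index */) {
        int bw = 0;
        for (const auto& e : edges)
            bw = std::max(bw, std::abs(perm[e.first] - perm[e.second]));
        return bw;
    }

Evaluating this measure before and after the renumbering reproduces the kind of reduction quoted above (170,691 down to 15,515 for the small mesh).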

Smarter Memory Allocation

The second approach for improving memory performance involves reworking some of the class structure in SU2 to support more parallelization- and cache-friendly initialization of class data. The key idea is to reduce the working-set sizes and the number of indirect memory accesses as much as possible, since indirect memory accesses lead to gather-scatter instructions that incur a very large performance penalty. For this purpose, the CNumerics class (the parent class) has been modified so that it is purely virtual with no class data, while the child classes allocate all of the data necessary for computing fluxes along the edges. This speeds up the code and also simplifies the parallelization of the flux loops using OMP. Another example is an improved (contiguous) memory allocation approach for the variables that are stored at each node (our unknowns) within the CVariable class. This guarantees that memory for the objects is allocated in a contiguous array (C-style), rather than using typical C++ per-object allocations.
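
A minimal sketch of the allocation change is shown below; the function names and the flat double array are simplifications for illustration, not the actual CVariable interface.

    #include <cstddef>
    #include <vector>

    // Baseline pattern: one heap allocation per node, reached through a double
    // pointer. Consecutive nodes may be far apart in memory, and every access
    // pays an extra indirection (which tends to become gather/scatter code).
    double** allocate_scattered(int nPoint, int nVar) {
        double** node = new double*[nPoint];
        for (int i = 0; i < nPoint; ++i)
            node[i] = new double[nVar]();      // unknowns for node i
        return node;
    }

    // Contiguous (C-style) pattern: a single block holds the unknowns of all
    // nodes back to back; node i simply starts at offset i * nVar, so an edge
    // loop walks linearly through memory with no pointer chasing.
    std::vector<double> allocate_contiguous(int nPoint, int nVar) {
        return std::vector<double>(static_cast<std::size_t>(nPoint) * nVar, 0.0);
    }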

Reduce Memory Footprint

Another major memory optimization performed is a change in class structure from AOS (array-of-structures) to SOA (structures-of-arrays) as explained in detail in a recent AOS-to-SOA case study. This is another strategy for reducing the indirect memory accesses.  The baseline code is written in AOS form for the key C++ CSolver class of the code. That is, the CSolver class contains a double pointer to an object of the CVariable class, which contains the solution variables (unknowns) at a given vertex of the mesh (such as fluid pressure p, or fluid velocity vector <u, v, w> for example). Thus, when accessing these quantities in an edge-loop, one requires an indirect access by de-referencing the CVariable object at the given vertex. These variables are stored in memory as [p1,u1,v1,w1], [p2,u2,v2,w2], …, where xi denotes the variable x at point i. The memory address for each set of vertex data can be spread out quite a bit, which results in expanded working sets. This structure has been modified to SOA format where the CVariable class has been removed entirely, and the required variables at all of the vertices are stored contiguously as members of the CSolver class itself. This avoids the need for indirect access. In SOA format, the variables are stored in memory as: [p1,u1,v1,w1,p2,u2,v2,w2,…], which compacts the working sets and results in cache-efficient traversal of the edge-loop.

(a) Before RCM

(b) After RCM

Figure 2. Effect of RCM re-numbering on the edge adjacency matrix of the ONERA M6 mesh.

Because AOS provides a more modular software design, a tension exists between performance and programming flexibility that must be balanced as needed. In this example, the AOS-to-SOA transformations are coded using pre-processor directives. Be sure to compile with AOS-to-SOA enabled if you desire better performance.
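
A sketch of the two layouts, selectable with a hypothetical preprocessor switch in the spirit described above, is given below; the class names and accessors are illustrative simplifications of the CSolver/CVariable relationship, not the actual SU2 classes.

    #include <vector>

    #define USE_SOA 1   // hypothetical compile-time switch between the layouts

    #if !USE_SOA
    // AOS baseline: the solver stores a pointer per vertex to a variable
    // object, so every read of p, u, v, w in an edge loop goes through an
    // extra indirection, and vertex data is scattered across the heap.
    struct VertexVars { double p, u, v, w; };

    struct Solver {
        VertexVars** node;                       // node[iPoint] -> heap object
        double pressure(int i) const { return node[i]->p; }
    };
    #else
    // SOA-style layout: the per-vertex object is gone, and the unknowns of all
    // vertices live in one contiguous member array of the solver, laid out as
    // [p1,u1,v1,w1, p2,u2,v2,w2, ...], which compacts the working set.
    struct Solver {
        enum { nVar = 4 };                       // p, u, v, w
        std::vector<double> vars;                // size = nPoint * nVar
        double pressure(int i) const { return vars[i * nVar + 0]; }
    };
    #endif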

Performance Results Achieved

Performance results given in this section were obtained using all of the optimizations described in this article. The results were gathered on the Intel Xeon processor and on the Intel Xeon Phi coprocessor with native execution. (Native execution on the Intel Xeon Phi coprocessor means that the code binaries are compiled for direct execution on the coprocessor, and the host is not involved in the computation at all.)

Host with Intel® Xeon® processor: Intel® Xeon® E5, 2.70 GHz, 2 x 12 cores (dual-socket workstation), 64 GB DDR3 1600 MHz RAM, Hyper-Threading (HT) enabled
Intel® Xeon Phi™ coprocessor: Intel® Xeon Phi™ C0-7120A, 1.238 GHz, 61 cores, 16 GB GDDR5 RAM, Turbo enabled
Tools: Intel® Composer XE 2015 (beta)

Table 1. Machine and tools configuration used to generate performance results.

Figures 3(a) and (b) show the speedups obtained by adding various optimizations for both the Intel Xeon processor and the Intel Xeon Phi coprocessor, respectively. Results for both the small and large ONERA M6 meshes are shown. The simulation is run for 100 nonlinear iterations, and the time per iteration for the 100th iteration is taken as the performance metric. 

(a) Intel® Xeon® results.

(b) Intel® Xeon® Phi™ coprocessor results.

Figure 3. Fine-grained single-node optimizations.

Results on the Intel® Xeon® Processor

The speedups shown in Figure 3(a) are relative to the Base (Message Passing Interface, or MPI only) code, which is run with 48 MPI ranks using all 24 physical cores (48 ranks because Intel® Hyper-Threading Technology is enabled). Note that hybridization to MPI+OMP improves the performance by 1.11x for the small mesh. However, it does not make a difference for the large mesh. This is because the small mesh is more sensitive to memory latencies. For the large mesh, sufficient compute occurs to hide some of these latencies and, hence, a significant speedup is not achieved from hybridization.

Further adding the AOS-to-SOA transformations produces a very noticeable jump in speedup for both the small and large meshes; this is the most productive optimization. Adding auto OMP scheduling (as described in Roland W. Green’s paper on OMP loop scheduling) yields additional speedup, again more so for the small mesh because it is more sensitive to load imbalance among threads than the large mesh is. Finally, adding vectorization delivers about 10 percent overall speedup for both small and large meshes. This is a significant gain from vectorization given that only a single kernel (the Centered Residual) was vectorized; the other kernels did not have enough compute intensity for vectorization to be worthwhile.

Results on the Intel® Xeon Phi™ Coprocessor

For the Intel Xeon Phi coprocessor, the code was executed natively on the coprocessor without host involvement. Overall, the picture for the coprocessor looks similar to that for the Intel Xeon processor. This is a big advantage for the Intel Xeon Phi coprocessor because there is no need to write and maintain different codebases for host and coprocessor. The optimizations done to improve host performance help significantly in improving the coprocessor performance and vice-versa.

Conclusion

By placing a particular emphasis on parallelism (both fine- and coarse-grained), vectorization, efficient memory usage, and identification of the best-suited algorithms for modern hardware, the authors have discussed how to optimize SU2 for massively parallel simulations in several key areas:

  1. Code profiling and understanding current bottlenecks
  2. Implementation of coarse- and fine-grain parallelism approaches via a hierarchical framework in software
  3. Optimizing the code for vectorization and efficient memory usage

Future work will build on the current effort and focus on the performance of the code in massively parallel settings for other workloads, including the implicit turbulent solver. We are also investigating novel, scalable linear-system solvers for the implicit solution of the governing equations of computational fluid dynamics.

The Team

The key members of the Intel team include Gaurav Bansal from the Intel Software and Services Group and Dheevatsa Mudigere, Alexander Heinecke, and Mikhail Smelyanskiy from Intel Labs; the key members of the Stanford University team are Thomas Economon and Prof. Juan J. Alonso.

Resources

Read the complete paper entitled Towards High-Performance Optimizations of the Unstructured Open-Source SU2 Suite at:  arc.aiaa.org/doi/abs/10.2514/6.2015-1949

For details on Intel® Parallel Studio XE, see:  software.intel.com/en-us/intel-parallel-studio-xe

