The two “Pearls” books contain an outstanding collection of code modernization examples, complete with discussions by the software developers of how the code was modified and commentary on what worked as well as what did not! Code for these real-world applications is available for download from http://lotsofcores.com whether you have bought the books or not. The figures are freely available as well, a real bonus for instructors who choose to use these examples when teaching code modernization techniques. The books, edited by James Reinders and Jim Jeffers, had 67 contributors for volume 1 and 73 contributors for volume 2.
Experts wrote about their experiences in adding parallelism to their real-world applications. Most examples illustrate their results on processors and on the Intel® Xeon Phi™ coprocessor. The key issues of scaling, locality of reference, and vectorization are recurring themes, as each contributed chapter explains the thinking behind adding parallelism to the application. The actual code is shown and discussed, with step-by-step reasoning and analysis of the results. While OpenMP* and MPI are the dominant methods for parallelism, the books also include usage of TBB, OpenCL, and other models. There is a balance of Fortran, C, and C++ throughout. With such a diverse collection of real-world examples, the opportunities to learn from other experts are quite amazing.
Volume 1 includes the following chapters:
Foreword by Sverre Jarp, CERN
Chapter 1: Introduction
Chapter 2: From ‘Correct’ to ‘Correct & Efficient’: A Hydro2D Case Study with Godunov’s Scheme
Chapter 3: Better Concurrency and SIMD on HBM
Chapter 4: Optimizing for Reacting Navier-Stokes Equations
Chapter 5: Plesiochronous Phasing Barriers
Chapter 6: Parallel Evaluation of Fault Tree Expressions
Chapter 7: Deep-Learning and Numerical Optimization
Chapter 8: Optimizing Gather/Scatter Patterns
Chapter 9: A Many-Core Implementation of the Direct N-body Problem
Chapter 10: N-body Methods
Chapter 11: Dynamic Load Balancing Using OpenMP 4.0
Chapter 12: Concurrent Kernel Offloading
Chapter 13: Heterogeneous Computing with MPI
Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor
Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment
Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors
Chapter 17: NWChem: Quantum Chemistry Simulations at Scale
Chapter 18: Efficient Nested Parallelism on Large-Scale Systems
Chapter 19: Performance Optimization of Black-Scholes Pricing
Chapter 20: Data Transfer Using the Intel COI Library
Chapter 21: High-Performance Ray Tracing
Chapter 22: Portable Performance with OpenCL
Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations
Chapter 24: Profiling-Guided Optimization
Chapter 25: Heterogeneous MPI optimization with ITAC
Chapter 26: Scalable Out-of-Core Solvers on a Cluster
Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization
Chapter 28: Morton Order Improves Performance
Volume 2 includes the following chapters:
Foreword by Dan Stanzione, TACC
Chapter 1: Introduction
Chapter 2: Numerical Weather Prediction Optimization
Chapter 3: WRF Goddard Microphysics Scheme Optimization
Chapter 4: Pairwise DNA Sequence Alignment Optimization
Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery
Chapter 6: Amber PME Molecular Dynamics Optimization
Chapter 7: Low Latency Solutions for Financial Services
Chapter 8: Parallel Numerical Methods in Finance
Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization
Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism In Practice
Chapter 11: Visual Search Optimization
Chapter 12: Radio Frequency Ray Tracing
Chapter 13: Exploring Use of the Reserved Core
Chapter 14: High Performance Python Offloading
Chapter 15: Fast Matrix Computations on Asynchronous Streams
Chapter 16: MPI-3 Shared Memory Programming Introduction
Chapter 17: Coarse-Grain OpenMP for Scalable Hybrid Parallelism
Chapter 18: Exploiting Multilevel Parallelism with OpenMP
Chapter 19: OpenCL: There and Back Again
Chapter 20: OpenMP vs. OpenCL: Difference in Performance?
Chapter 21: Prefetch Tuning Optimizations
Chapter 22: SIMD functions via OpenMP
Chapter 23: Vectorization Advice
Chapter 24: Portable Explicit Vectorization Intrinsics
Chapter 25: Power Analysis for Applications and Data Centers