Purpose
This article provides instructions for accessing, building, and running the miniGhost code on Intel® Xeon® processors and Intel® Xeon Phi™ Coprocessors.
Introduction
miniGhost is a finite difference mini-application that implements a difference stencil across a homogeneous three-dimensional domain.
The kernels it contains are:
- computation of stencil operations,
- inter-process boundary (halo, ghost) exchange, and
- global summation of grid values.
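To make the stencil and summation kernels concrete, here is a minimal C sketch of a 3D 7-point averaging stencil applied to one variable stored with a one-point ghost layer, plus the local part of a grid sum. It is not code from miniGhost (which is written in Fortran); the IDX layout and the function names are assumptions made for illustration. In a parallel run, the ghost layer would be filled by the halo exchange, and each rank's partial sum would be combined across ranks (for example with MPI_Allreduce).

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of point (i, j, k) in a grid of (nx+2) x (ny+2) x (nz+2)
 * points. Indices 0 and n+1 along each axis form the ghost (halo) layer;
 * indices 1..n are the interior points owned by this process. */
#define IDX(i, j, k, nx, ny) \
    ((size_t)(k) * ((nx) + 2) * ((ny) + 2) + (size_t)(j) * ((nx) + 2) + (size_t)(i))

/* 3D 7-point averaging stencil: each interior point of `out` becomes the
 * mean of the matching point of `in` and its six face neighbors. The ghost
 * layer of `in` must already hold the neighboring processes' values. */
static void stencil_3d7pt(const double *in, double *out, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    for (int k = 1; k <= nz; k++)
        for (int j = 1; j <= ny; j++)
            for (int i = 1; i <= nx; i++)
                out[IDX(i, j, k, nx, ny)] =
                    (in[IDX(i, j, k, nx, ny)] +
                     in[IDX(i - 1, j, k, nx, ny)] + in[IDX(i + 1, j, k, nx, ny)] +
                     in[IDX(i, j - 1, k, nx, ny)] + in[IDX(i, j + 1, k, nx, ny)] +
                     in[IDX(i, j, k - 1, nx, ny)] + in[IDX(i, j, k + 1, nx, ny)]) / 7.0;
}

/* Local contribution to the global summation: the sum of interior points.
 * In an MPI run, each rank's partial sum would then be combined across
 * ranks (for example with MPI_Allreduce). */
static double interior_sum(const double *g, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    double sum = 0.0;
    for (int k = 1; k <= nz; k++)
        for (int j = 1; j <= ny; j++)
            for (int i = 1; i <= nx; i++)
                sum += g[IDX(i, j, k, nx, ny)];
    return sum;
}
```

For example, placing a value of 7.0 at interior point (2, 2, 2) of a 4 x 4 x 4 grid of zeros and applying stencil_3d7pt spreads it as 1.0 over that point and its six neighbors, so interior_sum returns 7.0 both before and after. A conservation check of this kind is one use for a global summation.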
miniGhost was mainly designed to study the performance characteristics of the BSPMA configuration within the context of computations widely used across a variety of scientific algorithms.
In the BSPMA (bulk synchronous parallel with message aggregation) model, the face data for each variable is accumulated into user-managed buffers. The buffers are then transmitted to (up to) six neighbor processes, and the selected stencil is applied to each variable. The other model is SVAF (single variable, aggregated face data); this article focuses on the BSPMA model only.
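The aggregation step of BSPMA can be sketched in C as follows. The names pack_xlow_face and unpack_xhigh_ghost and the IDX layout are hypothetical and are not miniGhost routines. The low-x face of every variable is copied into one contiguous buffer, so a single message per neighbor carries all variables, and the receiving process copies that buffer into its high-x ghost layer. The other five faces would be handled the same way.

```c
#include <assert.h>
#include <stddef.h>

/* Same ghost-padded layout: (nx+2) x (ny+2) x (nz+2) points per variable. */
#define IDX(i, j, k, nx, ny) \
    ((size_t)(k) * ((nx) + 2) * ((ny) + 2) + (size_t)(j) * ((nx) + 2) + (size_t)(i))

/* Sender side: copy the interior points on the low-x face (i == 1) of every
 * variable into one contiguous buffer, variable by variable. The buffer must
 * hold num_vars * ny * nz values. Returns the number of values packed. */
static size_t pack_xlow_face(const double *const *vars, int num_vars,
                             int nx, int ny, int nz, double *buf)
{
    assert(num_vars > 0 && nx > 0 && ny > 0 && nz > 0);
    size_t n = 0;
    for (int v = 0; v < num_vars; v++)
        for (int k = 1; k <= nz; k++)
            for (int j = 1; j <= ny; j++)
                buf[n++] = vars[v][IDX(1, j, k, nx, ny)];
    return n;
}

/* Receiver side: the low-x face sent by the process on our high-x side fills
 * our high-x ghost layer (i == nx + 1), unpacked in the same order as packed. */
static void unpack_xhigh_ghost(double *const *vars, int num_vars,
                               int nx, int ny, int nz, const double *buf)
{
    assert(num_vars > 0 && nx > 0 && ny > 0 && nz > 0);
    size_t n = 0;
    for (int v = 0; v < num_vars; v++)
        for (int k = 1; k <= nz; k++)
            for (int j = 1; j <= ny; j++)
                vars[v][IDX(nx + 1, j, k, nx, ny)] = buf[n++];
}
```

With message aggregation, the number of messages per exchange depends on the number of neighbors (at most six) rather than on the number of variables, at the cost of the extra copy into and out of the buffers.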
miniGhost serves as a proxy (miniapp) for CTH, the shock physics code from Sandia National Laboratories.
Code Access
miniGhost is part of the Mantevo project. To get access to the code, refer to https://mantevo.org/packages/.
Note that version 0.9 of the code was used for the performance measurements and optimizations in this article.
Build Directions
The reference version of miniGhost can be built for serial execution or parallel (MPI + OpenMP) execution:
Build
The following examples show the build settings for each Intel Xeon processor family and for the Intel Xeon Phi coprocessor. Choose the settings that match your target processor.
- Extract the source files and change to the source directory:
tar -xf MiniGhost<ver>.tar
cd source
- Source the environment scripts for the latest Intel® compilers and Intel® MPI Library. The following example is for Intel® Compiler 15.0.1 (20141023) and Intel® MPI Library 5.0.2.044, and shows which MPI script to use for processor or coprocessor support. (The parenthetical comments are not part of the commands.)
source /opt/intel/composer_xe_2015.1.133/bin/compilervars.sh intel64
(for Intel® Xeon® processor):
source /opt/intel/impi/5.0.2.044/intel64/bin/mpivars.sh
(for Intel® Xeon Phi™ Coprocessor):
source /opt/intel/impi/5.0.2.044/mic/bin/mpivars.sh
- To build for the Intel® Xeon® processor v2 family (formerly codenamed Ivy Bridge):
Make the following changes in makefile.mpi
FC=mpiifort
CC=mpiicc
CFLAGS += -Df2c_ -O3 -openmp -g -xCORE-AVX-I
FFLAGS += -D_MG_INT4 -D_MG_REAL8 -O3 -openmp -g -xCORE-AVX-I
Build command:
make -f makefile.mpi
- To build for the Intel® Xeon® processor v3 family (formerly codenamed Haswell):
Make the following changes in makefile.mpi
FC=mpiifort
CC=mpiicc
CFLAGS += -Df2c_ -O3 -openmp -g -xCORE-AVX2
FFLAGS += -D_MG_INT4 -D_MG_REAL8 -O3 -openmp -g -xCORE-AVX2
Build command:
make -f makefile.mpi
- To build for the Intel Xeon Phi Coprocessor (formerly codenamed Knights Corner):
Make the following changes in makefile.mpi
FC=mpiifort
CC=mpiicc
CFLAGS += -Df2c_ -O3 -openmp -g -mmic
FFLAGS += -D_MG_INT4 -D_MG_REAL8 -O3 -openmp -g -mmic
Build command:
make -f makefile.mpi
Compiler Flags Used
Compiler Flag | Effect |
---|---|
-O3 | Optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs |
-xCORE-AVX-I | May generate Intel® Advanced Vector Extensions (Intel® AVX), including instructions in Intel® Core 2™ processors in process technologies smaller than 32nm, Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors |
-xCORE-AVX2 | May generate Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors |
-mmic | Build an application that runs natively on Intel® Many Integrated Core Architecture (Intel® MIC Architecture) |
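One way to confirm that a binary was built with the intended target flag is to check the compiler's predefined macros. The sketch below assumes that the compiler defines __AVX__ and __AVX2__ when AVX or AVX2 code generation is enabled (as GCC, Clang, and the Intel compiler do) and that the Intel compiler defines __MIC__ for -mmic builds; target_isa and print_target_isa are hypothetical helpers, not part of miniGhost.

```c
#include <stdio.h>

/* Return a label for the most capable instruction set the compiler targeted,
 * based only on predefined macros (assumed to be set as described above). */
static const char *target_isa(void)
{
#if defined(__MIC__)
    return "Intel MIC (-mmic)";
#elif defined(__AVX2__)
    return "AVX2 (e.g. -xCORE-AVX2)";
#elif defined(__AVX__)
    return "AVX (e.g. -xCORE-AVX-I)";
#else
    return "no AVX macros defined";
#endif
}

/* Print the label; for example, call this once at startup from main(). */
static void print_target_isa(void)
{
    printf("compiled for: %s\n", target_isa());
}
```

Because the check happens at compile time, it reports what the compiler was asked to generate, not what the machine running the binary supports.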
Run Directions
To list the runtime input parameters, use the --help option:
./miniGhost.x --help
See MG_OPTIONS.F for a list of all parameterized options.
In the following instructions, select the command appropriate for the processor you are using.
- Move to the “run” directory:
cd run
Be sure to run the appropriate binary for each architecture.
- Source the Intel Compilers and Intel MPI Library as appropriate for the architecture.
- For Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697v2), execute the following:
export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact,verbose
mpirun -n 12 ../source/miniGhost.ivb --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 3 --npy 2 --npz 2 --error_tol 8
- For Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3), execute the following:
export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact,verbose
mpirun -n 14 ../source/miniGhost.hsw --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 7 --npy 2 --npz 1 --error_tol 8
- For both Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697 v2) and Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), first source the Intel Compilers and Intel MPI Library, then execute the following:
Create two executable files, run.ivb and run.knc, containing the following commands:
run.ivb:
../source/miniGhost.ivb --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8
run.knc:
../source/miniGhost.knc --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8
Enable coprocessor support in the Intel MPI Library, select the appropriate MPI fabric, and launch the job:
export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
mpiexec.hydra -host `hostname` -n 1 -env OMP_NUM_THREADS 48 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.ivb : -host `hostname`-mic0 -wdir `pwd` -n 1 -env OMP_NUM_THREADS 240 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.knc
- For Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3) and the Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), execute the following:
Create two executable files, run.hsw and run.knc, containing the following commands:
run.hsw:
../source/miniGhost.hsw --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8
run.knc:
../source/miniGhost.knc --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8
Enable coprocessor support in the Intel MPI Library, select the appropriate MPI fabric, and launch the job:
export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
mpiexec.hydra -host `hostname` -n 1 -env OMP_NUM_THREADS 56 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.hsw : -host `hostname`-mic0 -wdir `pwd` -n 1 -env OMP_NUM_THREADS 240 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.knc
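The launch settings above follow two simple consistency rules, sketched below in C with a hypothetical launch_cfg structure. First, the --npx x --npy x --npz process grid accounts for every MPI rank: 3 x 2 x 2 = 12, 7 x 2 x 1 = 14, and 2 x 1 x 1 = 2 for the two symmetric-mode ranks. Second, ranks times OMP_NUM_THREADS gives the threads per host: 12 x 4 = 48 and 14 x 4 = 56, the same values used for OMP_NUM_THREADS on the host in symmetric mode. We assume two-socket nodes with Intel® Hyper-Threading Technology enabled, where these totals equal the hardware thread count (2 sockets x 12 cores x 2 threads = 48 for the E5-2697 v2, 2 x 14 x 2 = 56 for the E5-2697 v3), and that the 240 coprocessor threads correspond to 60 cores x 4 threads.

```c
#include <stdbool.h>

/* Hypothetical description of one launch configuration. */
struct launch_cfg {
    int ranks;         /* total MPI ranks (mpirun -n, or the sum over hosts) */
    int npx, npy, npz; /* --npx, --npy, --npz process grid */
};

/* The ranks are arranged as an npx x npy x npz process grid, so every
 * dimension must be positive and their product must equal the rank count. */
static bool decomposition_ok(const struct launch_cfg *c)
{
    return c->npx > 0 && c->npy > 0 && c->npz > 0 &&
           c->npx * c->npy * c->npz == c->ranks;
}

/* OpenMP threads requested on one host: ranks placed there times
 * OMP_NUM_THREADS. Compare against the host's hardware thread count. */
static int host_threads(int ranks_on_host, int omp_num_threads)
{
    return ranks_on_host * omp_num_threads;
}
```

Checking both rules before submitting a job can catch a mismatched --npx/--npy/--npz setting or an oversubscribed node before any compute time is spent.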
Performance and Optimizations
In the MG_FLUX_ACCUMULATE subroutine (MG_FLUX_ACCUMULATE.F), six loops were parallelized with !$OMP PARALLEL DO / !$OMP END PARALLEL DO directives to eliminate some serial sections.
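The OpenMP change can be illustrated with a hypothetical C analogue of one such loop. face_flux_xlow is not the actual miniGhost code, and we assume only that the loops accumulate values over a boundary face. The C equivalent of the Fortran !$OMP PARALLEL DO / !$OMP END PARALLEL DO pair is #pragma omp parallel for. Because every iteration adds into the same scalar, a reduction clause is needed to avoid a data race. If OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>

/* Hypothetical flux-style loop: sum the interior values on the low-x face
 * (i == 1) of a ghost-padded (nx+2) x (ny+2) x (nz+2) grid. The k iterations
 * are split across threads; reduction(+ : flux) gives each thread a private
 * partial sum and adds the partial sums together at the end of the loop. */
static double face_flux_xlow(const double *g, int nx, int ny, int nz)
{
    double flux = 0.0;
#pragma omp parallel for reduction(+ : flux)
    for (int k = 1; k <= nz; k++)
        for (int j = 1; j <= ny; j++)
            flux += g[(size_t)k * (nx + 2) * (ny + 2) + (size_t)j * (nx + 2) + 1];
    return flux;
}
```

With a reduction, the order in which partial sums are combined can vary between runs, so floating-point results may differ in the last bits from the serial loop.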
Below is the performance speedup of miniGhost (v0.9) on Intel® Xeon® processors and the Intel® Xeon Phi™ Coprocessor, using the Intel® Xeon® processor E5-2697 v2 as the baseline.