miniGhost on Intel® Xeon® processors and Intel® Xeon Phi™ Coprocessor

Purpose

This article provides code access, build, and run directions for the miniGhost code on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.

Introduction

miniGhost is a finite difference mini-application that implements a difference stencil across a homogeneous three-dimensional domain.

The kernels that it contains are:
- computation of stencil operations (a sketch in Fortran follows this list),
- inter-process boundary (halo, ghost) exchange,
- global summation of grid values.
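
As an illustration of the stencil kernel, here is a minimal sketch in Fortran (the language of the miniGhost sources) of a sweep with a simple 7-point stencil over a grid with a one-cell ghost layer. The stencil choice, array names, and sizes are assumptions for illustration and are not taken from miniGhost, which selects its stencil at run time (e.g., --stencil 24).

    ! Minimal sketch of a difference-stencil sweep (7-point average) over a
    ! 3D grid with a one-cell ghost layer. Illustrative only; not miniGhost code.
    program stencil_sketch
       implicit none
       integer, parameter :: nx = 64, ny = 64, nz = 64
       real(kind=8), allocatable :: grid_in(:,:,:), grid_out(:,:,:)
       integer :: i, j, k

       ! Indices 0 and n+1 hold the ghost (halo) cells filled by the boundary exchange.
       allocate(grid_in(0:nx+1, 0:ny+1, 0:nz+1), grid_out(0:nx+1, 0:ny+1, 0:nz+1))
       grid_in  = 1.0d0
       grid_out = 0.0d0

       ! Apply the stencil to every interior point; ghost cells are read, not written.
       do k = 1, nz
          do j = 1, ny
             do i = 1, nx
                grid_out(i,j,k) = ( grid_in(i,j,k)                      &
                                  + grid_in(i-1,j,k) + grid_in(i+1,j,k) &
                                  + grid_in(i,j-1,k) + grid_in(i,j+1,k) &
                                  + grid_in(i,j,k-1) + grid_in(i,j,k+1) ) / 7.0d0
             end do
          end do
       end do

       print *, 'Interior sum after one sweep:', sum(grid_out(1:nx,1:ny,1:nz))
    end program stencil_sketch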

miniGhost was mainly designed to study the performance characteristics of the BSPMA configuration within the context of computations widely used across a variety of scientific algorithms.

In the BSPMA (bulk synchronous parallel with message aggregation) model, the face data for each variable is accumulated into user-managed buffers. The buffers are then transmitted to (up to) six neighbor processes, and the selected stencil is applied to each variable. The other model is SVAF (single variable, aggregated face data), but this article focuses on the BSPMA model only; a minimal sketch of the BSPMA exchange follows.
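
As a rough illustration of BSPMA-style message aggregation, the following sketch assumes just two MPI ranks, one neighbor, and two variables whose faces are packed into a single user-managed buffer and exchanged in one message, followed by the global-summation kernel. All names and sizes are illustrative; this is not code from the miniGhost sources.

    ! BSPMA sketch: aggregate one face of several variables into a single
    ! buffer, exchange it with a neighbor, then reduce a global sum.
    ! Assumes exactly two MPI ranks (mpirun -n 2); illustrative only.
    program bspma_sketch
       use mpi
       implicit none
       integer, parameter :: n = 32, num_vars = 2
       real(kind=8) :: var(n, n, num_vars)
       real(kind=8) :: send_buf(n*num_vars), recv_buf(n*num_vars)
       real(kind=8) :: local_sum, global_sum
       integer :: rank, nprocs, neighbor, ierr, v

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
       if (nprocs /= 2) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)

       var = real(rank + 1, kind=8)
       neighbor = 1 - rank        ! the single neighbor in this two-rank sketch

       ! Message aggregation: copy one face of every variable into one buffer
       ! so the neighbor receives a single message instead of one per variable.
       do v = 1, num_vars
          send_buf((v-1)*n+1 : v*n) = var(n, :, v)
       end do

       call MPI_Sendrecv(send_buf, n*num_vars, MPI_DOUBLE_PRECISION, neighbor, 0, &
                         recv_buf, n*num_vars, MPI_DOUBLE_PRECISION, neighbor, 0, &
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)

       ! Global summation of grid values (the third miniGhost kernel).
       local_sum = sum(var)
       call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                          MPI_COMM_WORLD, ierr)
       if (rank == 0) print *, 'Global sum of all variables:', global_sum

       call MPI_Finalize(ierr)
    end program bspma_sketch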

miniGhost serves as a proxy (or mini-app) for the CTH shock physics code from Sandia National Laboratories.

Code Access

miniGhost is part of the Mantevo project. To access the code, refer to https://mantevo.org/packages/

Note that v0.9 of the code was used for the performance and optimization work described in this article.

Build Directions

The reference version of miniGhost can be built for serial execution or parallel (MPI + OpenMP) execution:

Build

The following examples illustrate different commands for different Intel Xeon processors. Choose the command for the appropriate processor for your build.

NOTE: In the following code examples, extra spacing marks where a very long command line ends; single-spaced code examples are each one command line.

  1. Uncompress the source files.
    tar -xf MiniGhost<ver>.tar
    cd source
  2. Source the latest Intel® compilers (C/C++ and Fortran) and the Intel® MPI Library. The following example is for the Intel® Compiler (15.0.1 20141023) and the Intel® MPI Library (5.0.2.044); choose the mpivars.sh line that matches the processor or coprocessor support you need. (The parenthetical comments are not part of the commands.)
    source /opt/intel/composer_xe_2015.1.133/bin/compilervars.sh intel64

    (for Intel® Xeon® processor):

    source /opt/intel/impi/5.0.2.044/intel64/bin/mpivars.sh

    (for Intel® Xeon Phi™ Coprocessor):

    source /opt/intel/impi/5.0.2.044/mic/bin/mpivars.sh
  3. To build for the Intel® Xeon® processor v2 family (formerly codenamed Ivy Bridge):

    Make the following changes in makefile.mpi

        FC=mpiifort
        CC=mpiicc
        CFLAGS += -Df2c_ -O3 -openmp -g -xCORE-AVX-I
        FFLAGS += -D_MG_INT4 -D_MG_REAL8 -O3 -openmp -g -xCORE-AVX-I

        Build Command

        make -f makefile.mpi
    			
  4. To build for the Intel® Xeon® processor v3 family (formerly codenamed Haswell):

    Make the following changes in makefile.mpi

        FC=mpiifort
        CC=mpiicc
        CFLAGS += -Df2c_ -O3 -openmp -g -xCORE-AVX2
        FFLAGS += -D_MG_INT4 -D_MG_REAL8 -O3 -openmp -g -xCORE-AVX2

        Build Command

        make -f makefile.mpi
    			
  5. To build for the Intel® Xeon Phi™ Coprocessor (codenamed Knights Corner):

    Make the following changes in makefile.mpi

        FC=mpiifort
        CC=mpiicc
        CFLAGS += -Df2c_ -O3 -openmp -g -mmic
        FFLAGS += -D_MG_INT4 -D_MG_REAL8 -O3 -openmp -g -mmic

        Build Command

        make -f makefile.mpi
    			

Compiler Flags Used

Compiler Flag    Effect
-O3              Optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs.
-xCORE-AVX-I     May generate Intel® Advanced Vector Extensions (Intel® AVX), including instructions in Intel® Core 2™ processors in process technologies smaller than 32nm, Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors.
-xCORE-AVX2      May generate Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors.
-mmic            Build an application that runs natively on Intel® Many Integrated Core Architecture (Intel® MIC Architecture).

Run Directions

Run-time input parameters can be listed with the --help option:

./miniGhost.x --help

Check MG_OPTIONS.F for a list of all parameterized options.

In the following instructions, select the executing command appropriate for the processor you are using.

  1. Move to the “run” directory:
    		cd run

    Be sure to run the appropriate binary for each architecture.

  2. Source the Intel Compilers and Intel MPI Library as appropriate for the architecture.
  3. For the Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697 v2), execute the following:
    		export OMP_NUM_THREADS=4
    		export I_MPI_PIN_DOMAIN=omp
    		export KMP_AFFINITY=compact,verbose
    
    		mpirun -n 12 ../source/miniGhost.ivb --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 3 --npy 2 --npz 2 --error_tol 8
    		
  4. For Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3), execute the following:
    		export OMP_NUM_THREADS=4
    		export I_MPI_PIN_DOMAIN=omp
    		export KMP_AFFINITY=compact,verbose
    
    		mpirun -n 14 ../source/miniGhost.hsw --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 7 --npy 2 --npz 1 --error_tol 8
    		
  5. For both Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697 v2) and Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), first source the Intel Compilers and Intel MPI Library, then execute the following:

    Create two executable script files, run.ivb and run.knc, with the following contents:

    run.ivb:

    			../source/miniGhost.ivb --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8

    run.knc:

    			../source/miniGhost.knc --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8
    
    Enable MPI on the coprocessor:

    			export I_MPI_MIC=enable

    Select the appropriate MPI fabric and launch the job in symmetric mode:

    			export I_MPI_FABRICS=shm:dapl
    
    			export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
    
    			mpiexec.hydra -host `hostname` -n 1 -env OMP_NUM_THREADS 48 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.ivb : -host `hostname`-mic0 -wdir `pwd` -n 1 -env OMP_NUM_THREADS 240 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.knc
    			
  6. For Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3) and the Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), execute the following:

    Create two executable script files, run.hsw and run.knc, with the following contents:

    run.hsw:

    			../source/miniGhost.hsw --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8

    run.knc:

    			../source/miniGhost.knc --scaling 1 --nx 445 --ny 445 --nz 445 --num_vars 40 --num_spikes 1 --debug_grid 1 --report_diffusion 21 --percent_sum 100 --num_tsteps 20 --stencil 24 --comm_method 10 --report_perf 1 --npx 2 --npy 1 --npz 1 --error_tol 8
    
    Enable MPI on the coprocessor:

    			export I_MPI_MIC=enable

    Select the appropriate MPI fabric and launch the job in symmetric mode:

    			export I_MPI_FABRICS=shm:dapl
    
    			export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
    
    			mpiexec.hydra -host `hostname` -n 1 -env OMP_NUM_THREADS 56 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.hsw : -host `hostname`-mic0 -wdir `pwd` -n 1 -env OMP_NUM_THREADS 240 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./run.knc
    			

Performance and Optimizations

In the MG_FLUX_ACCUMULATE subroutine (MG_FLUX_ACCUMULATE.F), six loops were parallelized with !$OMP PARALLEL DO / !$OMP END PARALLEL DO directives to eliminate some serial sections; a sketch of this kind of change follows.
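
Below is a minimal sketch of that kind of change: an independent loop nest wrapped in !$OMP PARALLEL DO so its iterations are shared among threads. The loop body and array names are assumptions for illustration and are not taken from MG_FLUX_ACCUMULATE.F.

    ! Sketch of the OpenMP change described above: a previously serial loop
    ! nest distributed across threads with !$OMP PARALLEL DO. Illustrative
    ! only; compile with the same -openmp (or -qopenmp) flag used above.
    program flux_accumulate_sketch
       use omp_lib
       implicit none
       integer, parameter :: nx = 128, ny = 128, nz = 128
       real(kind=8), allocatable :: grid(:,:,:), flux_out(:,:)
       integer :: j, k

       allocate(grid(nx, ny, nz), flux_out(ny, nz))
       grid = 1.0d0

       ! Each (j,k) column is independent, so the outer loop can be shared
       ! among the OpenMP threads; j must be private to each thread.
    !$OMP PARALLEL DO PRIVATE(j)
       do k = 1, nz
          do j = 1, ny
             flux_out(j, k) = sum(grid(:, j, k))
          end do
       end do
    !$OMP END PARALLEL DO

       print *, 'Threads available:', omp_get_max_threads(), &
                '  accumulated total:', sum(flux_out)
    end program flux_accumulate_sketch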

The performance speedup of miniGhost (v0.9) was measured on Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor, with the Intel® Xeon® processor E5-2697 v2 used as the baseline.

References and Resources

[1] Richard F. Barrett, Courtenay T. Vaughan, and Michael A. Heroux. MiniGhost: A Miniapp for Exploring Boundary Exchange Strategies Using Stencil Computations in Scientific Parallel Computing.
http://prod.sandia.gov/techlib/access-control.cgi/2012/122437.pdf

[2] MiniGhost details as part of the NERSC-8/Trinity Benchmarks
https://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/minighost/

