Recipe: Building and Running MILC on Intel® Xeon® Processors and Intel® Xeon Phi™ Processors

Introduction

MILC software represents a set of codes written by the MIMD Lattice Computation (MILC) collaboration used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD (Multiple Instruction, Multiple Data) parallel machines. “Strong interactions” are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus. MILC applications address fundamental questions in high energy and nuclear physics and are directly related to major experimental programs in these fields. MILC is one of the largest compute cycle users at many U.S. and European supercomputing centers.

This article provides code access, build, and run directions for the “ks_imp_rhmc” application on Intel® Xeon® processors and Intel® Xeon Phi™ processors. ks_imp_rhmc is a dynamical RHMC (rational hybrid Monte Carlo) code for staggered fermions. In addition to the naive and asqtad staggered actions, the highly improved staggered quark (HISQ) action is supported.

Currently, the conjugate gradient (CG) solver in the code uses the QPhiX library. Efforts are ongoing to integrate other operations (gauge force (GF), fermion force (FF)) with the QPhiX library as well.

The QPhiX library provides sparse solvers and Dslash kernels for Lattice QCD simulations optimized for Intel® architectures.

Code Access

Two components are required: the MILC software and the QPhiX library. The MILC software can be downloaded from GitHub: https://github.com/milc-qcd/milc_qcd. Download the master branch; QPhiX support for the CG solvers is integrated into this branch.

The QPhiX library and code generator for use with Wilson-Clover fermions (for example, for use with chroma) are available from https://github.com/jeffersonlab/qphix.git and https://github.com/jeffersonlab/qphix-codegen.git, respectively. For the most up-to-date version, we suggest using the devel branch of QPhiX. The MILC-specific QPhiX version is currently not open source; please contact the MILC collaboration for access to the QPhiX (MILC) branch.

Build Directions

Compile the QPhiX Library

Users need to build QPhiX first before building the MILC package.

The QPhiX library is distributed as two tar files, mbench*.tar and qphix-codegen*.tar. Untar both archives.
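For reference, a minimal sketch of the unpacking step (the exact archive names depend on the version you received):

tar -xvf qphix-codegen*.tar
tar -xvf mbench*.tar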

Build qphix-codegen

The files with intrinsics for QPhiX are built in the qphix-codegen directory.

Enter the qphix-codegen directory.

Edit line #3 in “Makefile_xyzt” to enable the “milc=1” variable.
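If you prefer to script the edit, a hedged sed one-liner works (this assumes the milc variable is assigned on line 3 of Makefile_xyzt, as noted above):

sed -i '3s/.*/milc=1/' Makefile_xyzt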

Compile as:

source /opt/intel/compiler/<version>/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/mpi/intel64/bin/mpivars.sh
make -f Makefile_xyzt avx512 -- [for Intel® Xeon Phi™ Processors]
make -f Makefile_xyzt avx2 -- [for Intel® Xeon® v3/v4 Processors]

Build mbench

Enter the mbench directory.

Edit line #3 in “Makefile_qphixlib”, set “mode=mic” to compile with Intel® AVX-512 for Intel® Xeon Phi™ Processor and “mode=avx” to compile with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for Intel® Xeon® Processors.

Edit line #13 in “Makefile_qphixlib” to enable MPI. Set ENABLE_MPI = 1.
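The same two edits can be scripted with sed (a sketch; it assumes the mode and ENABLE_MPI assignments sit on lines 3 and 13 of Makefile_qphixlib, as noted above):

sed -i '3s/.*/mode=mic/' Makefile_qphixlib        # use mode=avx for Intel® Xeon® Processors
sed -i '13s/.*/ENABLE_MPI = 1/' Makefile_qphixlib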

Compile as:

make -f Makefile_qphixlib mode=mic AVX512=1 -- [Intel® Xeon Phi™ Processor]
make -f Makefile_qphixlib mode=avx AVX2=1 -- [Intel® Xeon® Processors]

Compile MILC Code

Install/download the master branch from the above GitHub location.

Download the Makefile.qphix file from the following location:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

Copy the Makefile.qphix to the corresponding application directory. In this case, copy the Makefile.qphix to the “ks_imp_rhmc” application directory and rename it as Makefile.
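A minimal sketch of fetching and installing the Makefile (assumes wget is available and that the MILC checkout is named milc_qcd):

cd <path-to>/milc_qcd/ks_imp_rhmc
wget http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/Makefile.qphix
cp Makefile.qphix Makefile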

Make the following changes to the Makefile (an illustrative check of the resulting settings appears after this list):

  • On line #17 - Add/uncomment the appropriate ARCH variable:
    • For example, ARCH = knl (compile with Intel AVX-512 for Intel® Xeon Phi™ Processor architecture).
    • For example, ARCH = bdw (compile with Intel AVX2 for Intel® Xeon® Processor architecture).
  • On line #28 - Change MPP variable to “true” if you want MPI.
  • On line #34 - Pick the PRECISION you want:
    • 1 = Single, 2 = Double. We use Double for our runs.
  • Starting at line #37 - The compiler setup should work as-is if the directions above were followed; if not, customize starting at line #40.
  • On line #124 - Setup of Intel compiler starts:
    • Based on ARCH it will use the appropriate flags.
  • On line #395 - QPhiX customization starts:
    • On line #399 - Set QPHIX_HOME to the correct QPhiX path (the path to the mbench directory).
    • The appropriate QPhiX FLAGS will be set if the above is defined correctly.
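After editing, a quick grep can confirm the key settings (an illustrative check; exact spacing and line numbers may differ between MILC versions):

grep -n -E '^(ARCH|MPP|PRECISION|QPHIX_HOME)' Makefile
# Expected output, roughly:
#   ARCH = knl                      (or bdw for Intel® Xeon® Processors)
#   MPP = true                      (MPI enabled)
#   PRECISION = 2                   (double precision)
#   QPHIX_HOME = <path-to>/mbench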

Compile as:

Enter the ks_imp_rhmc directory. The Makefile with the above changes should be in this directory. Source the latest Intel® compilers and Intel® MPI Library.

make su3_rhmd_hisq -- Build su3_rhmd_hisq binary
make su3_rhmc_hisq -- Build su3_rhmc_hisq binary

Compile the above binaries for Intel® Xeon Phi™ Processor and Intel® Xeon® Processor (edit Makefile accordingly).
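Putting the build step together, a hedged end-to-end sketch (paths and compiler versions are placeholders):

cd <path-to>/milc_qcd/ks_imp_rhmc
source /opt/intel/compiler/<version>/bin/compilervars.sh intel64
source /opt/intel/impi/<version>/mpi/intel64/bin/mpivars.sh
make su3_rhmd_hisq      # su3_rhmd_hisq binary used in the runs below
make su3_rhmc_hisq      # su3_rhmc_hisq binary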

Run Directions

Input Files

There are two required input files: params.rest and rat.m013m065m838.

They can be downloaded from here:

http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/.

The file rat.m013m065m838 defines the residues and poles of the rational functions needed in the calculation. The file params.rest sets all the run time parameters, including the lattice size, the length of the calculation (number of trajectories), and the precision of the various conjugate-gradient solutions.
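A sketch of fetching the two files (assumes wget is available and that the files are served directly under the recipe URL above; place them in the run directory created in the next section):

wget http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/params.rest
wget http://denali.physics.indiana.edu/~sg/MILC_Performance_Recipe/rat.m013m065m838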

In addition, a params.<lattice-size> file with the required lattice size will be created at run time. This file consists of the lattice size (nx * ny * nz * nt) to run, with the contents of params.rest appended to it.

The Lattice Sizes

The size of the four-dimensional space-time lattice is controlled by the “nx, ny, nz, nt” parameters.

As an example, consider a problem of size (nx x ny x nz x nt) = 32 x 32 x 32 x 64 running on 64 MPI ranks. To weak scale this problem, a user would begin by multiplying nt by 2, then nz by 2, then ny by 2, then nx by 2, and so on, so that the dimensions are doubled in a round-robin fashion as the rank count doubles.

This is illustrated in the table below (see also the sketch after the table). The original problem size is 32 x 32 x 32 x 64. To keep the elements per rank constant (weak scaling) at 128 ranks, first multiply nt by 2 (32 x 32 x 32 x 128). Similarly, for 512 ranks, multiply nt by 2, nz by 2, and ny by 2 from the original problem size to keep the same elements per rank.

Ranks             64         128        256        512
nx                32         32         32         32
ny                32         32         32         64
nz                32         32         64         64
nt                64         128        128        128
Total Elements    2097152    4194304    8388608    16777216
Multiplier        1          2          4          8
Elements/Rank     32768      32768      32768      32768

Table: Illustrates Weak Scaling of Lattice Sizes
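The round-robin doubling can be expressed as a small helper script (a sketch only; the script name weak_scale.sh and its argument handling are illustrative):

#!/bin/bash
# Starting from the 32 x 32 x 32 x 64 lattice on 64 ranks, double one dimension
# per doubling of the rank count, cycling through nt, nz, ny, nx.
nx=32; ny=32; nz=32; nt=64
ranks=64
target_ranks=${1:-512}              # e.g., ./weak_scale.sh 512
dims=(nt nz ny nx)
i=0
while [ "$ranks" -lt "$target_ranks" ]; do
  d=${dims[$((i % 4))]}
  eval "$d=\$(( $d * 2 ))"          # double the next dimension in the cycle
  ranks=$((ranks * 2))
  i=$((i + 1))
done
echo "ranks=$ranks lattice=${nx}x${ny}x${nz}x${nt}"
# ./weak_scale.sh 512  ->  ranks=512 lattice=32x64x64x128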

Running with MPI x OpenMP*

The calculation takes place on a four-dimensional hypercubic lattice, representing three spatial dimensions and one time dimension. The quark fields have values on each of the lattice points, and the gluon field has values on each of the links connecting nearest-neighbor lattice sites.

The lattice is divided into equal subvolumes, one per MPI rank, and the MPI ranks can be thought of as being organized into a four-dimensional grid. The grid dimensions can be controlled with the params.rest file; of course, they must be integer factors of the corresponding lattice dimensions. For example, a 4 x 4 x 4 x 3 grid of 192 ranks would divide a 48 x 48 x 48 x 120 lattice into 12 x 12 x 12 x 40 subvolumes.

Each MPI rank executes the same code. The calculation requires frequent exchanges of quark and gluon values between MPI ranks that own neighboring lattice sites. Within a single MPI rank, the site-by-site calculation is threaded using OpenMP* directives, which have been inserted throughout the code. The most time-consuming part of production calculations is the CG solver. In the QPhiX version of the CG solver, the data layout and the thread-level calculation are further organized to take advantage of the SIMD (single instruction, multiple data) lanes of the Intel Xeon and Intel Xeon Phi processors.

Running the Test Cases

  1. Create a “run” directory in the top-level directory and add the input files obtained above.
  2. cd <milc>/run
    Note: Run the appropriate binary for each architecture.
  3. Create the lattice volume (with nx, ny, nz, and nt set in the shell beforehand):
    cat << EOF > params.${nx}x${ny}x${nz}x${nt}
    prompt 0
    nx $nx
    ny $ny
    nz $nz
    nt $nt
    EOF
    cat params.rest >> params.${nx}x${ny}x${nz}x${nt}

    For this performance recipe, we evaluate the single node and multinode (16 nodes) performance with the following weak scaled lattice volume:

    Single Node (nx * ny * nz * nt): 24 x 24 x 24 x 60

    Multinode [16 nodes] (nx * ny * nz * nt): 48 x 48 x 48 x 120

  4. Run on the Intel Xeon processor (E5-2697 v4).
    Source the latest Intel compilers and Intel MPI Library.
    • Intel® Parallel Studio 2017 or later is recommended.

    Single Node:

    mpiexec.hydra -n 12 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.24x24x24x60

    Multinode (16 nodes, via Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)):

    # Create a runScript (run-bdw) #
    <path-to>/ks_imp_rhmc/su3_rhmd_hisq.bdw < params.48x48x48x120
    #Intel® OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 12 <path-to>/run-bdw
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt
  5. Run on the Intel Xeon Phi processor (7250).
    Source the latest Intel compilers and Intel MPI Library.
    • Intel® Parallel Studio 2017 or later is recommended.

    Single Node:

    mpiexec.hydra -n 20 -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.24x24x24x60

    Multinode (16 nodes, via Intel OP HFI):

    # Create a runScript (run-knl) #
    numactl -p 1 <path-to>/ks_imp_rhmc/su3_rhmd_hisq.knl < params.48x48x48x120
    #Intel OPA fabric-related environment variables#
    export I_MPI_FABRICS=shm:tmi
    export I_MPI_TMI_PROVIDER=psm2
    export PSM2_IDENTIFY=1
    export I_MPI_FALLBACK=0
    #Create nodeconfig.txt with the following#
    -host <hostname1> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    …..
    …..
    …..
    -host <hostname16> -env OMP_NUM_THREADS 3 -env KMP_AFFINITY 'granularity=fine,scatter,verbose' -n 20 <path-to>/run-knl
    #mpirun command#
    mpiexec.hydra -configfile nodeconfig.txt

Performance Results and Optimizations

The output prints the total time to solution for the entire application, which takes into account the time for the different solvers and operators (for example, CG solver, fermion force, link fattening, gauge force, and so on).

The performance chart below shows the speedup relative to the 2S Intel Xeon processor E5-2697 v4, based on total run time.

[Figure: Speedup relative to the 2S Intel® Xeon® processor E5-2697 v4]

The optimizations in the QPhiX library include data layout changes to target vectorization and the generation of packed, aligned loads/stores; cache blocking; load balancing; and improved code generation for each architecture (Intel Xeon processor, Intel Xeon Phi processor) with corresponding intrinsics where necessary. See the References and Resources section for details.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                     Intel® Xeon® Processor E5-2697 v4                Intel® Xeon Phi™ Processor 7250F
Sockets / TDP                 2S / 290W                                        1S / 215W
Frequency / Cores / Threads   2.3 GHz / 36 / 72                                1.4 GHz / 68 / 272
DDR4                          8 x 16 GB 2400 MHz                               6 x 16 GB 2400 MHz
MCDRAM                        N/A                                              16 GB Flat
Cluster/Snoop Mode            Home                                             Quadrant
Memory Mode                   N/A                                              Flat
Turbo                         OFF                                              OFF
BIOS                          SE5C610.86B.01.01.0016.033120161139              GVPRCRB1.86B.0010.R02.1606082342
Operating System              Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64)   Oracle Linux* 7.2 (3.10.0-229.20.1.el6.x86_64)

MILC Build Configurations

The following configurations were used for the above recipe and performance testing.

MILC Version                  Master branch as of 28 January 2017
Intel® Compiler Version       2017.1.132
Intel® MPI Library Version    2017.0.098
MILC Makefiles Used           Makefile.qphix, Makefile_qphixlib, Makefile

References and Resources

  1. MIMD Lattice Computation (MILC) Collaboration: http://physics.indiana.edu/~sg/milc.html
  2. QPhiX Case Study: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/qphix-case-study/
  3. MILC Staggered Conjugate Gradient Performance on Intel® Xeon Phi™ Processor: https://anl.app.box.com/v/IXPUG2016-presentation-10
