For single-node runs, refer to the recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ processors
Purpose
This recipe describes a step-by-step process for getting, building, and running NAMD (a scalable molecular dynamics code) on the Intel® Xeon Phi™ processor and Intel® Xeon® processor family to achieve better performance.
Introduction
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details for how to build NAMD on Intel Xeon Phi processor and Intel Xeon processor E5 family. You can learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/.
Building and Running NAMD on a Cluster of Intel® Xeon® processor E5-2697 v4 (formerly Broadwell, BDW), Intel® Xeon Phi™ processor 7250 (formerly Knights Landing, KNL), and Intel® Xeon® Gold 6148 processor (formerly Skylake, SKX) systems
Download the code
- Download the latest NAMD source code from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
- Download the OpenFabrics Interfaces (OFI) library. NAMD uses Charm++/OFI for multi-node runs.
- You can use the installed OFI library, which comes with the IFS package, or download and build it manually.
- To check the version of the installed OFI, use the "fi_info --version" command (OFI 1.4.2 was used here).
- The OFI library can be downloaded from https://github.com/ofiwg/libfabric/releases.
- Download Charm++ with OFI support:
From here: http://charmplusplus.org/download/
or
git clone: http://charm.cs.illinois.edu/gerrit/charm.git
- Download fftw3: http://www.fftw.org/download.html
Version 3.3.4 is used in this run.
- Download the apoa1 and stmv workloads: http://www.ks.uiuc.edu/Research/namd/utilities/
A sketch of the download and extraction steps follows this list.
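For reference, a minimal download-and-unpack sketch is shown below. The exact archive names and versions (libfabric 1.4.2, fftw 3.3.4) are assumptions based on the versions used in this recipe; adjust them to whatever releases you actually download.

# Assumed archive names/versions; adjust to the releases you actually use.
wget https://github.com/ofiwg/libfabric/releases/download/v1.4.2/libfabric-1.4.2.tar.bz2
tar xjf libfabric-1.4.2.tar.bz2          # becomes <libfabric_root_path>

git clone http://charm.cs.illinois.edu/gerrit/charm.git   # becomes <charm_root_path>

wget http://www.fftw.org/fftw-3.3.4.tar.gz
tar xzf fftw-3.3.4.tar.gz                # becomes <fftw_root_path>

# ApoA1 and STMV benchmark inputs
wget http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
wget http://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar xzf apoa1.tar.gz && tar xzf stmv.tar.gz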
Build the Binaries
- Set the environment for compilation:
CC=icc; CXX=icpc; F90=ifort; F77=ifort
export CC CXX F90 F77
source /opt/intel/compiler/<version>/compilervars.sh intel64
- Build the OFI library (you can skip this step if you want to use the installed OFI library):
- cd <libfabric_root_path>
- ./autogen.sh
- ./configure --prefix=<libfabric_install_path> --enable-psm2
- make clean && make -j12 all && make install
- The custom OFI build can then be used via LD_PRELOAD or LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=<libfabric_install_path>/lib:${LD_LIBRARY_PATH}
mpiexec.hydra …
or
LD_PRELOAD=<libfabric_install_path>/lib/libfabric.so mpiexec.hydra …
- Build fftw3:
- cd <fftw_root_path>
- ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
Use -xCORE-AVX512 for SKX, -xMIC-AVX512 for KNL, and -xCORE-AVX2 for BDW.
- make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
- Build multi-node version of Charm++:
- cd <charm_root_path>
- ./build charm++ ofi-linux-x86_64 icc smp --basedir <libfabric_root_path> --with-production "-O3 -ip" -DCMK_OPTIMIZE
- Build NAMD:
- Modify the arch/Linux-x86_64-icc to look like the following (select one of the FLOATOPTS options depending on the CPU type):
NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux64-iccstatic

# For KNL
FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
# For SKX
FLOATOPTS = -ip -xCORE-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
# For BDW
FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE

CXX = icpc -std=c++11 -DNAMD_KNL
CXXOPTS = -static-intel -O2 $(FLOATOPTS)
CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
CXXCOLVAROPTS = -O2 -ip
CC = icc
COPTS = -static-intel -O2 $(FLOATOPTS)
- Compile NAMD
- ./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch ofi-linux-x86_64-smp-icc --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose
- cd Linux-x86_64-icc
- make clean && gmake -j
- Build memopt NAMD binaries:
Same as the BDW/KNL build above, but add the extra option --with-memopt to the config command (see the example below).
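For example, the memopt configure line differs from the regular one only by the extra flag. This is a sketch using the same placeholders as above; note that config creates the same Linux-x86_64-icc build directory, so you may want to run it in a separate copy of the source tree to keep both binaries.

# Memory-optimized build: identical to the regular config line, plus --with-memopt.
./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch ofi-linux-x86_64-smp-icc --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --with-memopt --charm-opts -verbose
cd Linux-x86_64-icc
make clean && gmake -j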
Other Setup
Change the following lines in the *.namd file for both the stmv and apoa1 workloads:
numsteps: 1000
outputtiming: 20
outputenergies: 600
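A minimal sketch of that edit with GNU sed is shown below. It assumes the three parameters already appear in each input file (NAMD config keywords are case-insensitive, hence the /I flag); if a keyword is missing from your copy of the file, add the line by hand instead. The paths are placeholders for the extracted workload directories.

# Hypothetical paths; point these at the extracted workload directories.
for f in apoa1/apoa1.namd stmv/stmv.namd; do
    sed -i -e 's/^numsteps.*/numsteps 1000/I' \
           -e 's/^outputtiming.*/outputtiming 20/I' \
           -e 's/^outputenergies.*/outputenergies 600/I' "$f"
done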
Run the Binaries
- Set the environment for launching:
- source /opt/intel/compiler/<version>/compilervars.sh intel64
- source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
- Specify the host names to run on in a "hosts" file (one host name per line)
- export MPPEXEC="mpiexec.hydra -hostfile ./hosts"
- export PSM2_SHAREDCONTEXTS=0 (if you use PSM2 < 10.2.85)
- Launch the task (for example with N nodes, with 1 process per node and PPN cores):
$MPPEXEC -n N -ppn 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0
For example, for BDW (PPN=72):
$MPPEXEC -n 8 -ppn 1 ./namd2 +ppn 71 <workload_path> +pemap 1-71 +commap 0
For example, for KNL (PPN=68, without hyper-threads):
$MPPEXEC -n 8 -ppn 1 ./namd2 +ppn 67 <workload_path> +pemap 1-67 +commap 0
For example, for KNL (with 2 hyper-threads per core):
$MPPEXEC -n 8 -ppn 1 ./namd2 +ppn 134 <workload_path> +pemap 0-66+68 +commap 67
- For KNL with MCDRAM in flat mode:
$MPPEXEC -n N -ppn 1 numactl -p 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0
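The launch line can also be derived from the per-node core count. The sketch below is illustrative only: the node names and the PPN value are assumptions, and it simply reproduces the +ppn/+pemap/+commap arithmetic described above for the one-process-per-node case.

# Hypothetical node names; one host per line, as read by "mpiexec.hydra -hostfile ./hosts".
printf '%s\n' node01 node02 node03 node04 node05 node06 node07 node08 > hosts

N=8        # number of nodes (must match the hosts file)
PPN=68     # logical processors used per node, as in the examples above (72 for BDW, 68 for KNL)
export MPPEXEC="mpiexec.hydra -hostfile ./hosts"   # as defined above

# One process per node: core 0 handles communication, cores 1..PPN-1 run worker threads.
$MPPEXEC -n $N -ppn 1 ./namd2 +ppn $((PPN-1)) <workload_path> \
    +pemap 1-$((PPN-1)) +commap 0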
Remarks
To achieve better scaling across multiple nodes, increase the number of communication threads by running more processes per node (for example 1, 2, 4, 8, 13, or 17 processes per node, each contributing one communication thread). For example, the following command runs on N KNL nodes with 17 processes per node and 8 threads per process (7 worker threads and 1 communication thread):
$MPPEXEC -n $(($N*17)) -ppn 17 numactl -p 1 ./namd2 +ppn 7 <workload_path> +pemap 0-67,68-135:4.3 +commap 71-135:4
Basic Charm++/OFI knobs (passed as additional namd2 command-line arguments):
- +ofi_eager_maxsize: (default: 65536) Threshold between the buffered and RMA paths.
- +ofi_cq_entries_count: (default: 8) Maximum number of entries to read from the completion queue with each call to fi_cq_read().
- +ofi_use_inject: (default: 1) Whether to use buffered send.
- +ofi_num_recvs: (default: 8) Number of pre-posted receive buffers.
- +ofi_runtime_tcp: (default: off) During the initialization phase, the OFI EP names need to be exchanged among all nodes.
By default, the exchange is done with both PMI and OFI. If this flag is set, the exchange is done with PMI only.
For example:
$MPPEXEC -n 2 -ppn 1 ./namd2 +ppn 1 <workload_path> +ofi_eager_maxsize 32768 +ofi_num_recvs 16
Best performance results reported on a cluster of up to 128 Intel® Xeon Phi™ processor nodes (ns/day; higher is better)
Workload / Number of nodes (2 HT per core) | 1 | 2 | 4 | 8 | 16
stmv (ns/day) | 0.55 | 1.05 | 1.86 | 3.31 | 5.31

Workload / Number of nodes (2 HT per core) | 8 | 16 | 32 | 64 | 128
stmv.28M (ns/day) | 0.152 | 0.310 | 0.596 | 1.03 | 1.91