Purpose
This code recipe describes how to get, build, and use the NWChem* code that includes support for the Intel® Xeon Phi™ Coprocessor with Intel® Many-Integrated Core (MIC) architecture.
Introduction
NWChem provides scalable computational chemistry tools that treat large scientific computational chemistry problems efficiently and can take advantage of parallel computing resources, from high-performance parallel supercomputers to conventional workstation clusters.
NWChem software handles:
- Biomolecules, nanostructures, and solid-state systems
- From quantum to classical, and all combinations
- Ground and excited states
- Gaussian basis functions or plane-waves
- Wide scalability, from one to thousands of processors
- Properties and relativistic effects
NWChem is actively developed by a consortium of developers and maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL) in Washington State. The code is distributed as open-source under the terms of the Educational Community License version 2.0 (ECL 2.0).
The current version of NWChem can be downloaded from http://www.nwchem-sw.org. Current support for Intel® Xeon Phi™ coprocessors is included in the latest development version, which can be downloaded at https://svn.pnl.gov/svn/nwchem/trunk.
Code Access
The NWChem code supports the Intel® Language Extensions for Offload, which offload operations from the Intel® Xeon® processor (referred to as ‘host’ in this document) to the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document), both in a single node and in a cluster environment.
To get access to the code and test workloads:
- Use the following Subversion* command to download the code:
% svn checkout https://svn.pnl.gov/svn/nwchem/trunk nwchem
This will check out the development version and place it in a folder called “nwchem” on your system.
Build Directions
The build of NWChem with offload support for Intel Xeon Phi coprocessors is split into three steps.
- Configure NWChem for your system.
- Enable offload support.
- Build NWChem.
Configure
Set the following configuration options (the following are in bash syntax):
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX_UBOUND=65536
export USE_MPI=y
export NWCHEM_MODULES=all\ python
export USE_MPIF=y
export USE_MPIF4=y
export MPI_HOME=$I_MPI_HOME/intel64
export MPI_INCLUDE="$MPI_HOME"/include
export MPI_LIB="$MPI_HOME"/lib
export LIBMPI="-lmpi -lmpigf -lmpigi -lrt -lpthread"
export MKLROOT=/msc/apps/compilers/intel/14.0/composer_xe_2013_sp1.1.106/mkl/
export SCALAPACK_LIB=" -mkl -openmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK="$SCALAPACK_LIB"
export LAPACK_LIB="-mkl -openmp -lpthread -lm"
export BLAS_LIB="$LAPACK_LIB"
export BLASOPT="$LAPACK_LIB"
export USE_SCALAPACK=y
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export LAPACK_SIZE=8
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export USE_CPPRESERVE=y
export USE_NOFSCHECK=y
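These settings assume the Intel® compilers, Intel® MPI, and Intel® MKL environments are already loaded (note that MPI_HOME is derived from $I_MPI_HOME). If they are not, source the vendor environment scripts first; a minimal sketch, with install paths that will differ on your system:

source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64   # compilers and MKL
source /opt/intel/impi/4.1.3/bin64/mpivars.sh                              # sets I_MPI_HOME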
Enable Offload Support
Set the following environment variables to enable offload support:
export USE_OPENMP=1
export USE_OFFLOAD=1
Build
To build NWChem, issue the following commands:
cd $NWCHEM_TOP/src
make FC=ifort CC=icc AR=xiar
This will build NWChem with support for Intel Xeon Phi coprocessors for the CCSD(T) method (as of 12 August 2014). Coprocessor support for more NWChem methods will follow in the future.
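If you are building from a fresh checkout, NWChem also expects NWCHEM_TOP and NWCHEM_TARGET to be set and the configuration step to be run before the main build; a minimal sketch, assuming a 64-bit Linux cluster and that the checkout lives in $HOME/nwchem (the nwchem_config step only needs to run once):

export NWCHEM_TOP=$HOME/nwchem    # path of the Subversion checkout; adjust to your system
export NWCHEM_TARGET=LINUX64      # 64-bit Linux target
cd $NWCHEM_TOP/src
make nwchem_config                # generates makefiles for the modules selected above
make FC=ifort CC=icc AR=xiar      # compile with the Intel compilers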
Running Workloads Using NWChem CCSD(T) Method
To run the CCSD(T) method you will need to use a proper NWChem input file that triggers this module. You can find an example input file in the Appendix of this document. Other input files that use the CCSD(T) method can be found on the NWChem website at http://www.nwchem-sw.org.
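The part of the input file that triggers this module is the tce block together with the tce task directive; the example in the Appendix uses the following fragment:

tce
  freeze atomic
  ccsd(t)
  thresh 1e-4
  maxiter 10
  io ga
  tilesize 24
end
task tce energy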
To run the code only on hosts in the traditional mode using plain Global Arrays (GA), run the following command:
$ OMP_NUM_THREADS=1 mpirun -np 768 -perhost 16 nwchem input.nw
This command will execute NWChem using a file called “input.nw” with 768 GA ranks and 16 processes per node (a total of 48 machines).
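The rank count is simply the number of nodes times the ranks per node; a small launcher sketch with the values from this example (NODES and PERHOST are illustrative variable names, not NWChem settings):

NODES=48                     # machines in the allocation
PERHOST=16                   # GA ranks per node
NP=$((NODES * PERHOST))      # 48 x 16 = 768 GA ranks in total
OMP_NUM_THREADS=1 mpirun -np $NP -perhost $PERHOST nwchem input.nw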
To enable OpenMP* threading on the host and use fewer total GA ranks run the following command:
$ OMP_NUM_THREADS=2 mpirun -np 384 -perhost 8 nwchem input.nw
This directs NWChem to use eight GA ranks per node and launches two OpenMP threads for each process on the node. Because it uses fewer GA ranks, less communication takes place; thus, you should observe a speed-up compared to the plain method above.
Our next step is to enable offloading to the Intel Xeon Phi coprocessor by executing this command:
$ NWC_RANKS_PER_DEVICE=2 OMP_NUM_THREADS=4 mpirun -np 384 -perhost 8 nwchem input.nw
The NWC_RANKS_PER_DEVICE environment variable enables offloading if it is set to an integer larger than 0. It also controls how many GA ranks from the host will offload to each of the compute node’s coprocessors. In the example, we assume that the node contains two coprocessors and that NWChem should allocate two GA ranks per coprocessor. Hence, 4 out of 8 GA ranks assigned to a particular compute node will offload to the coprocessors. While a rank offloads, its host core is idle; thus, we double the number of OpenMP threads for the host (OMP_NUM_THREADS=4) so that another GA rank can fill the idle core with work.
NWChem itself automatically detects the available coprocessors in the system and properly partitions them for optimal use.
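To summarize the offload run, here is the same launch command with the rank arithmetic spelled out in comments:

# 48 nodes x 8 GA ranks per node = 384 GA ranks in total.
# NWC_RANKS_PER_DEVICE=2: two host ranks offload to each of the node's two
#   coprocessors, so 4 of the 8 ranks per node drive the coprocessors.
# OMP_NUM_THREADS=4: doubled so spare host cores stay busy while ranks offload.
$ NWC_RANKS_PER_DEVICE=2 OMP_NUM_THREADS=4 mpirun -np 384 -perhost 8 nwchem input.nw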
For best performance, you should also enable turbo mode on both the host system and the coprocessors, plus set the following environment variable to use large pages on the coprocessor devices:
export MIC_USE_2MB_BUFFER=16K
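If your job launcher does not automatically forward environment variables to the ranks, Intel MPI’s -genv option can push them explicitly; a minimal sketch:

# -genv places the variable in the environment of every MPI rank
$ mpirun -genv MIC_USE_2MB_BUFFER 16K -np 384 -perhost 8 nwchem input.nw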
In all of the above cases, NWChem will produce the output files as requested in the input file.
Near the end of NWChem’s console log, you will find a line that reports the total runtime consumed:
Total times  cpu: <cpu seconds>  wall: <wall seconds>
The reported runtimes will show considerable speedup for the OpenMP threaded version, as well as the offload version. Of course, the exact runtimes will depend on your system configuration. Experiment with the above settings to control OpenMP and offloading in order to find the best possible values for your system.
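When comparing configurations, it is convenient to collect just the timing lines; assuming each run’s console output was redirected to a .log file:

# Extract the final timing line from each run's log for side-by-side comparison
$ grep "Total times" *.log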
Performance Testing
The following chart shows the speedups achieved with NWChem using the configurations listed below. Your performance may vary depending on your system configuration, system optimizations, and the NWChem settings described above.
Testing Platform Configurations
| Nodes | Intel® Xeon® processor cores | Intel® Xeon Phi™ coprocessor cores | Heterogeneous cores |
|---|---|---|---|
| 130 | 2080 | 15600 | 17680 |
| 230 | 3680 | 27600 | 31280 |
| 360 | 5760 | 43200 | 48960 |
| 450 | 7200 | 54000 | 61200 |
Server Configuration:
- Atipa Visione vf442, 2-socket/16 cores, Intel® C600 IOH
- Processors: Two Intel® Xeon® processor E5-2670 @ 2.60GHz (8 cores) with Intel® Hyper-Threading Technology
- Operating System: Scientific Linux* 6.5
- Memory: 128GB DDR3 @ 1333 MHz
- Coprocessors: 2X Intel® Xeon Phi™ Coprocessor 5110P, GDDR5 with 3.6 GT/s, Driver v3.1.2-1, FLASH image/micro OS 2.1.02.390
- Intel® Composer XE 14.0.1.106
Appendix: Example Input File
start example
title example
echo

memory stack 4800 mb heap 200 mb global 4800 mb noverify

geometry units angstrom noprint
symmetry c1
C  -0.7143   6.0940  -0.00
C   0.7143   6.0940  -0.00
C   0.7143  -6.0940   0.00
C  -0.7143  -6.0940   0.00
C   1.4050   4.9240  -0.00
C   1.4050  -4.9240   0.00
C  -1.4050  -4.9240   0.00
C  -1.4050   4.9240   0.00
C   1.4027   2.4587  -0.00
C  -1.4027   2.4587   0.00
C   1.4027  -2.4587  -0.00
C  -1.4027  -2.4587   0.00
C   1.4032  -0.0000  -0.00
C  -1.4032   0.0000   0.00
C   0.7258   1.2217  -0.00
C  -0.7258   1.2217   0.00
C   0.7258  -1.2217   0.00
C  -0.7258  -1.2217   0.00
C   0.7252   3.6642  -0.00
C  -0.7252   3.6642   0.00
C   0.7252  -3.6642   0.00
C  -0.7252  -3.6642   0.00
H  -1.2428   7.0380  -0.00
H   1.2428   7.0380  -0.00
H   1.2428  -7.0380   0.00
H  -1.2428  -7.0380   0.00
H   2.4878   4.9242  -0.00
H  -2.4878   4.9242   0.00
H   2.4878  -4.9242  -0.00
H  -2.4878  -4.9242   0.00
H   2.4862   2.4594  -0.00
H  -2.4862   2.4594   0.00
H   2.4862  -2.4594  -0.00
H  -2.4862  -2.4594   0.00
H   2.4866  -0.0000  -0.00
H  -2.4866   0.0000   0.00
end

basis spherical noprint
H S
   13.0100000   0.0196850
    1.9620000   0.1379770
    0.4446000   0.4781480
H S
    0.1220000   1.0000000
H P
    0.7270000   1.0000000
#BASIS SET: (9s,4p,1d) -> [3s,2p,1d]
C S
 6665.0000000   0.0006920  -0.0001460
 1000.0000000   0.0053290  -0.0011540
  228.0000000   0.0270770  -0.0057250
   64.7100000   0.1017180  -0.0233120
   21.0600000   0.2747400  -0.0639550
    7.4950000   0.4485640  -0.1499810
    2.7970000   0.2850740  -0.1272620
    0.5215000   0.0152040   0.5445290
C S
    0.1596000   1.0000000
C P
    9.4390000   0.0381090
    2.0020000   0.2094800
    0.5456000   0.5085570
C P
    0.1517000   1.0000000
C D
    0.5500000   1.0000000
#END
end

scf
#thresh 1.0e-10
#thresh 1.0e-4
#tol2e 1.0e-10
#tol2e 1.0e-8
#noscf
singlet
rhf
vectors input atomic output pent_cpu_768d.movecs
direct
noprint "final vectors analysis"
multipole
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
maxiter 10
io ga
tilesize 24
end

set tce:pstat t
set tce:nts t

task tce energy