Purpose
This code recipe describes how to get, build, and use the NWChem* code that includes support for the Intel® Xeon Phi™ Coprocessor with Intel® Many-Integrated Core (MIC) architecture.
Introduction
NWChem provides scalable computational chemistry tools that treat large scientific computational chemistry problems efficiently and can take advantage of parallel computing resources, from high-performance parallel supercomputers to conventional workstation clusters.
NWChem software handles:
- Biomolecules, nanostructures, and solid-state systems
- From quantum to classical, and all combinations
- Ground and excited states
- Gaussian basis functions or plane-waves
- Wide scalability, from one to thousands of processors
- Properties and relativistic effects
NWChem is actively developed by a consortium of developers and maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL) in Washington State. The code is distributed as open-source under the terms of the Educational Community License version 2.0 (ECL 2.0).
The current version of NWChem can be downloaded from http://www.nwchem-sw.org. Current support for Intel® Xeon Phi™ coprocessors is included in the latest development version, which can be downloaded at https://svn.pnl.gov/svn/nwchem/trunk.
Code Access
The NWChem code supports the Intel® Language Extensions for Offload, which offload operations from the Intel® Xeon® processor (referred to as ‘host’ in this document) to the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document), both in a single node and in a cluster environment.
To get access to the code and test workloads:
- Use the following Subversion* command to download the code:
% svn checkout https://svn.pnl.gov/svn/nwchem/trunk nwchem
This will check out the development version and place it in a folder called “nwchem” on your system.
Build Directions
The build of NWChem with offload support for Intel Xeon Phi coprocessors is split into three steps.
- Configure NWChem for your system.
- Enable offload support.
- Build NWChem.
Configure
Set the following configuration options (the following are in bash syntax):
export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX_UBOUND=65536
export USE_MPI=y
export NWCHEM_MODULES=all\ python
export USE_MPIF=y
export USE_MPIF4=y
export MPI_HOME=$I_MPI_HOME/intel64
export MPI_INCLUDE="$MPI_HOME"/include
export MPI_LIB="$MPI_HOME"/lib
export LIBMPI="-lmpi -lmpigf -lmpigi -lrt -lpthread"
export MKLROOT=/msc/apps/compilers/intel/14.0/composer_xe_2013_sp1.1.106/mkl/
export SCALAPACK_LIB=" -mkl -openmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK="$SCALAPACK_LIB"
export LAPACK_LIB="-mkl -openmp -lpthread -lm"
export BLAS_LIB="$LAPACK_LIB"
export BLASOPT="$LAPACK_LIB"
export USE_SCALAPACK=y
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export LAPACK_SIZE=8
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export USE_CPPRESERVE=y
export USE_NOFSCHECK=y
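These settings assume the Intel® compilers, Intel® MPI, and Intel® MKL environments are already loaded (note that MPI_HOME is derived from $I_MPI_HOME). If they are not, source the vendor environment scripts first; a minimal sketch, with install paths that will differ on your system:

source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64   # compilers and MKL
source /opt/intel/impi/4.1.3/bin64/mpivars.sh                              # sets I_MPI_HOME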
Enable Offload Support
Set the following environment variables to enable offload support:
export USE_OPENMP=1
export USE_OFFLOAD=1
Build
To build NWChem, issue the following commands:
cd $NWCHEM_TOP/src
make FC=ifort CC=icc AR=xiar
This will build NWChem with support for Intel Xeon Phi coprocessors for the CCSD(T) method (as of 12 August 2014). Coprocessor support for more NWChem methods will follow in the future.
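If you are building from a fresh checkout, NWChem also expects NWCHEM_TOP and NWCHEM_TARGET to be set and the configuration step to be run before the main build; a minimal sketch, assuming a 64-bit Linux cluster and that the checkout lives in $HOME/nwchem (the nwchem_config step only needs to run once):

export NWCHEM_TOP=$HOME/nwchem    # path of the Subversion checkout; adjust to your system
export NWCHEM_TARGET=LINUX64      # 64-bit Linux target
cd $NWCHEM_TOP/src
make nwchem_config                # generates makefiles for the modules selected above
make FC=ifort CC=icc AR=xiar      # compile with the Intel compilers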
Running Workloads Using NWChem CCSD(T) Method
To run the CCSD(T) method you will need to use a proper NWChem input file that triggers this module. You can find an example input file in the Appendix of this document. Other input files that use the CCSD(T) method can be found on the NWChem website at http://www.nwchem-sw.org.
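The part of the input file that triggers this module is the tce block together with the tce task directive; the example in the Appendix uses the following fragment:

tce
  freeze atomic
  ccsd(t)
  thresh 1e-4
  maxiter 10
  io ga
  tilesize 24
end
task tce energy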
To run the code only on hosts in the traditional mode using plain Global Arrays (GA), run the following command:
$ OMP_NUM_THREADS=1 mpirun -np 768 -perhost 16 nwchem input.nw
This command will execute NWChem using a file called “input.nw” with 768 GA ranks and 16 processes per node (a total of 48 machines).
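The rank count is simply the number of nodes times the ranks per node; a small launcher sketch with the values from this example (NODES and PERHOST are illustrative variable names, not NWChem settings):

NODES=48                     # machines in the allocation
PERHOST=16                   # GA ranks per node
NP=$((NODES * PERHOST))      # 48 x 16 = 768 GA ranks in total
OMP_NUM_THREADS=1 mpirun -np $NP -perhost $PERHOST nwchem input.nw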
To enable OpenMP* threading on the host and use fewer total GA ranks run the following command:
$ OMP_NUM_THREADS=2 mpirun -np 384 -perhost 8 nwchem input.nw
This directs NWChem to use eight GA ranks per node and launches two OpenMP threads for each process on the node. Because it uses fewer GA ranks, less communication takes place; thus, you should observe a speed-up compared to the plain method above.
Our next step is to enable offloading to the Intel Xeon Phi coprocessor by executing this command:
$ NWC_RANKS_PER_DEVICE=2 OMP_NUM_THREADS=4 mpirun -np 384 -perhost 8 nwchem input.nw
The NWC_RANKS_PER_DEVICE environment variable enables offloading if it is set to an integer larger than 0. It also controls how many GA ranks from the host will offload to each of the compute node’s coprocessors. In the example, we assume that the node contains two coprocessors and that NWChem should allocate two GA ranks per coprocessor. Hence, 4 out of 8 GA ranks assigned to a particular compute node will offload to the coprocessors. While a rank offloads, its host core is idle; thus, we double the number of OpenMP threads for the host (OMP_NUM_THREADS=4) so that another GA rank can fill the idle core with work.
NWChem itself automatically detects the available coprocessors in the system and properly partitions them for optimal use.
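To summarize the offload run, here is the same launch command with the rank arithmetic spelled out in comments:

# 48 nodes x 8 GA ranks per node = 384 GA ranks in total.
# NWC_RANKS_PER_DEVICE=2: two host ranks offload to each of the node's two
#   coprocessors, so 4 of the 8 ranks per node drive the coprocessors.
# OMP_NUM_THREADS=4: doubled so spare host cores stay busy while ranks offload.
$ NWC_RANKS_PER_DEVICE=2 OMP_NUM_THREADS=4 mpirun -np 384 -perhost 8 nwchem input.nw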
For best performance, you should also enable turbo mode on both the host system and the coprocessors, plus set the following environment variable to use large pages on the coprocessor devices:
export MIC_USE_2MB_BUFFER=16K
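If your job launcher does not automatically forward environment variables to the ranks, Intel MPI’s -genv option can push them explicitly; a minimal sketch:

# -genv places the variable in the environment of every MPI rank
$ mpirun -genv MIC_USE_2MB_BUFFER 16K -np 384 -perhost 8 nwchem input.nw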
In all of the above cases, NWChem will produce the output files as requested in the input file.
Near the end of NWChem’s console log, you will find a line that reports the total runtime consumed:
Total times  cpu: <cpu seconds>  wall: <wall seconds>
The reported runtimes will show considerable speedup for the OpenMP threaded version, as well as the offload version. Of course, the exact runtimes will depend on your system configuration. Experiment with the above settings to control OpenMP and offloading in order to find the best possible values for your system.
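When comparing configurations, it is convenient to collect just the timing lines; assuming each run’s console output was redirected to a .log file:

# Extract the final timing line from each run's log for side-by-side comparison
$ grep "Total times" *.log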
Performance Testing
The following chart shows the speedups achieved with NWChem using the configurations listed below. Your performance may vary depending on your system configuration, system optimizations, and the NWChem settings described above.
Testing Platform Configurations
| Nodes | Intel® Xeon® processor cores | Intel® Xeon Phi™ coprocessor cores | Heterogeneous cores |
|---|---|---|---|
| 130 | 2080 | 15600 | 17680 |
| 230 | 3680 | 27600 | 31280 |
| 360 | 5760 | 43200 | 48960 |
| 450 | 7200 | 54000 | 61200 |
Server Configuration:
- Atipa Visione vf442, 2-socket/16 cores, Intel® C600 IOH
- Processors: Two Intel® Xeon® processor E5-2670 @ 2.60GHz (8 cores) with Intel® Hyper-Threading Technology
- Operating System: Scientific Linux* 6.5
- Memory: 128GB DDR3 @ 1333 MHz
- Coprocessors: 2X Intel® Xeon Phi™ Coprocessor 5110P, GDDR5 with 3.6 GT/s, Driver v3.1.2-1, FLASH image/micro OS 2.1.02.390
- Intel® Composer XE 14.0.1.106
Appendix: Example Input File
start example
title example
echo

memory stack 4800 mb heap 200 mb global 4800 mb noverify

geometry units angstrom noprint
symmetry c1
C  -0.7143   6.0940  -0.00
C   0.7143   6.0940  -0.00
C   0.7143  -6.0940   0.00
C  -0.7143  -6.0940   0.00
C   1.4050   4.9240  -0.00
C   1.4050  -4.9240   0.00
C  -1.4050  -4.9240   0.00
C  -1.4050   4.9240   0.00
C   1.4027   2.4587  -0.00
C  -1.4027   2.4587   0.00
C   1.4027  -2.4587  -0.00
C  -1.4027  -2.4587   0.00
C   1.4032  -0.0000  -0.00
C  -1.4032   0.0000   0.00
C   0.7258   1.2217  -0.00
C  -0.7258   1.2217   0.00
C   0.7258  -1.2217   0.00
C  -0.7258  -1.2217   0.00
C   0.7252   3.6642  -0.00
C  -0.7252   3.6642   0.00
C   0.7252  -3.6642   0.00
C  -0.7252  -3.6642   0.00
H  -1.2428   7.0380  -0.00
H   1.2428   7.0380  -0.00
H   1.2428  -7.0380   0.00
H  -1.2428  -7.0380   0.00
H   2.4878   4.9242  -0.00
H  -2.4878   4.9242   0.00
H   2.4878  -4.9242  -0.00
H  -2.4878  -4.9242   0.00
H   2.4862   2.4594  -0.00
H  -2.4862   2.4594   0.00
H   2.4862  -2.4594  -0.00
H  -2.4862  -2.4594   0.00
H   2.4866  -0.0000  -0.00
H  -2.4866   0.0000   0.00
end

basis spherical noprint
H S
   13.0100000   0.0196850
    1.9620000   0.1379770
    0.4446000   0.4781480
H S
    0.1220000   1.0000000
H P
    0.7270000   1.0000000
#BASIS SET: (9s,4p,1d) -> [3s,2p,1d]
C S
 6665.0000000   0.0006920  -0.0001460
 1000.0000000   0.0053290  -0.0011540
  228.0000000   0.0270770  -0.0057250
   64.7100000   0.1017180  -0.0233120
   21.0600000   0.2747400  -0.0639550
    7.4950000   0.4485640  -0.1499810
    2.7970000   0.2850740  -0.1272620
    0.5215000   0.0152040   0.5445290
C S
    0.1596000   1.0000000
C P
    9.4390000   0.0381090
    2.0020000   0.2094800
    0.5456000   0.5085570
C P
    0.1517000   1.0000000
C D
    0.5500000   1.0000000
#END
end

scf
#thresh 1.0e-10
#thresh 1.0e-4
#tol2e 1.0e-10
#tol2e 1.0e-8
#noscf
singlet
rhf
vectors input atomic output pent_cpu_768d.movecs
direct
noprint "final vectors analysis"
multipole
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
maxiter 10
io ga
tilesize 24
end

set tce:pstat t
set tce:nts t

task tce energy