Explicit offload for Quantum ESPRESSO

Purpose

This code recipe describes how to get, build, and use the Quantum ESPRESSO code that includes support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture. This recipe focuses on how to run this code using explicit offload.

Code Access

Quantum ESPRESSO is an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. The Quantum ESPRESSO code is maintained by the Quantum ESPRESSO Foundation and is available under the GPLv2 licensing agreement. The code supports the offload mode of operation of the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document) in a single node and in a cluster environment.

To get access to the code and test workloads:

  1. Download the latest Quantum ESPRESSO version from http://www.quantum-espresso.org/download/
  2. Clone the linear algebra package libxphi from GitHub:

    $ git clone https://github.com/cdahnken/libxphi

Build Directions

  1. Untar the Quantum ESPRESSO tarball
    $ tar xzf espresso-5.1.tar.gz
  2. Source the Intel® compiler and Intel® MPI Library

    $ source /opt/intel/composer_xe_2013_sp1.4.211/bin/compilervars.sh intel64
    $ source /opt/intel/impi/latest/bin64/mpivars.sh
  3. Change to the espresso directory and run the configure script
    $ cd espresso-5.1
    $ export SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64"
    $ export LAPACK_LIBS="-mkl=parallel"
    $ export BLAS_LIBS="-mkl=parallel"
    $ export FFT_LIBS="-mkl=parallel"
    $ export MPIF90=mpiifort
    $ export AR=xiar
    $ ./configure --enable-openmp
  4. Edit make.sys so that it has the following configuration:
    MANUAL_DFLAGS = -D__KNC_OFFLOAD
    DFLAGS = -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__OPENMP $(MANUAL_DFLAGS)
    MOD_FLAG = -I<PATH_TO_PW> -I
    MPIF90 = mpif90
    CC = icc
    F77 = ifort
    BLAS_LIBS =  "-mkl=parallel"
    BLAS_LIBS_SWITCH = external
    LAPACK_LIBS = "-mkl=parallel"
    LAPACK_LIBS_SWITCH = external
    SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
    FFT_LIBS =  "-mkl=parallel"

    You can add “-xHost -ansi-alias” to CFLAGS as well as FFLAGS.

  5. Build the Quantum ESPRESSO PW binary
    $ make pw -j16

    You should now have bin/pw.x

  6. Change to the directory you cloned libxphi to and execute the build script. Make sure you do this in the same shell in which you sourced the Intel compiler and Intel MPI Library.
    $ cd libxphi
    $ ./build-library.sh

    You should now find two libraries: libxphi.so and libmkl_proxy.so

The build process is now complete.
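
As an optional sanity check (assuming a standard Intel MPSS installation; the exact commands and output depend on your MPSS version), you can verify that the binary links against Intel MKL and that the coprocessor is online before moving on:

$ ldd bin/pw.x | grep mkl      # should list the Intel MKL shared libraries
$ micctrl -s                   # should report the coprocessor(s) as "online"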

Run Directions

A single Quantum ESPRESSO instance on a single node

The Quantum ESPRESSO binary compiled above already has support for accelerated (offloaded) 3D FFT. In addition, the library libxphi.so contains a number of linear algebra routines invoked by Quantum ESPRESSO, in particular the numerically intensive ZGEMM BLAS3 routine for complex matrix-matrix multiplication. Instead of executing this routine on the host via the Intel® Math Kernel Library (Intel MKL), libxphi blocks the matrices and buffers them asynchronously to the coprocessor, where Intel MKL executes the multiplication of the blocks and transfers the results back. When the Quantum ESPRESSO binary is built as described above, it contains dynamic calls to the ZGEMM routine, which are normally satisfied by Intel MKL. To get the offloaded ZGEMM in place, libxphi.so needs to be preloaded:

$ export LD_LIBRARY_PATH=$PATH_TO_LIBXPHI:$LD_LIBRARY_PATH
$ LD_PRELOAD="$PATH_TO_LIBXPHI/libxphi.so" ./pw.x <pw arguments>

The last line executes the Quantum ESPRESSO binary pw.x with offloaded ZGEMM support. To make this easier, we provide a shell script that facilitates this preloading and just takes the binary and its arguments as input, so that the execution of an offloaded run would look like this:

$ <PATH_TO_LIBXPHI>/xphilibwrapper.sh <PATH_TO_PW>/pw.x <pw arguments>

In this case, Quantum ESPRESSO executes as a single instance with OpenMP* threads (by default, as many as the host has cores) and offloads FFT and ZGEMM to all the cores of the Intel Xeon Phi coprocessor.
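
Putting these pieces together, a complete single-node run could look like the following sketch (the paths, the input file name, and the thread count are placeholders to adapt to your system):

$ export LD_LIBRARY_PATH=<PATH_TO_LIBXPHI>:$LD_LIBRARY_PATH
$ export OMP_NUM_THREADS=24     # optional; adjust to the number of host cores
$ <PATH_TO_LIBXPHI>/xphilibwrapper.sh <PATH_TO_PW>/pw.x < input-file.in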

Tuning the linear algebra offloading

To tune the offloading process, we need to understand the ZGEMM routine, which executes matrix-matrix multiplication

C=αA∙B+βC

where α and β are complex numbers and C, A, and B are matrices of dimension M×N, M×K, and K×N, respectively. The library libxphi.so blocks this matrix-matrix multiplication so that the resulting block-matrix multiplication consists of smaller blocks that are continuously streamed to the coprocessor and back. The size of those blocks is defined by three parameters m, n, and k, where m×n, m×k, and k×n are the dimensions of the C-, A-, and B-blocks, respectively. By default, libxphi blocks the matrices in sizes of m = n = k = 1024. You can experiment with these values to achieve better performance, depending on your workload size. We have found that making m and n somewhat larger (m = n = 2048) and varying the size of k (between 512 and 1024) can yield very good results.
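
To get a feeling for the numbers (a back-of-the-envelope estimate, not a measurement): with the default m = n = k = 1024, each block of C, A, and B holds 1024 × 1024 complex double-precision elements, i.e., 16 MiB per block, so roughly 48 MiB are streamed over PCIe* per block multiplication while about 8 · 1024³ ≈ 8.6 · 10⁹ floating-point operations are performed on the coprocessor. Larger m and n increase this compute-to-transfer ratio, which is why enlarging the blocks tends to help as long as they still fit comfortably in coprocessor memory.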

Block size can be set via the environment variables QE_MIC_BLOCKSIZE_M, QE_MIC_BLOCKSIZE_N, and QE_MIC_BLOCKSIZE_K. For example:

$ export QE_MIC_BLOCKSIZE_M=2048
$ export QE_MIC_BLOCKSIZE_N=2048
$ export QE_MIC_BLOCKSIZE_K=512

An additional setting is required to avoid offloading small matrices, which might be computed more efficiently on the host than on the coprocessor. With QE_MIC_OFFLOAD_THRESHOLD you can define the minimum amount of floating-point work a matrix multiplication must involve in order to be offloaded. The setting

$ export QE_MIC_OFFLOAD_THRESHOLD=20

achieves good results.
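
Putting the tuning knobs together, a tuned invocation could then look like this (the block sizes are just the starting points suggested above, and input-file.in is a placeholder):

$ export QE_MIC_BLOCKSIZE_M=2048
$ export QE_MIC_BLOCKSIZE_N=2048
$ export QE_MIC_BLOCKSIZE_K=512
$ export QE_MIC_OFFLOAD_THRESHOLD=20
$ <PATH_TO_LIBXPHI>/xphilibwrapper.sh <PATH_TO_PW>/pw.x < input-file.in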

Partitioning the coprocessor

Partitioning the coprocessor leverages the advantages of multi-processing vs. multi-threading. It is somewhat similar to running Message Passing Interface (MPI) ranks on the coprocessor (a.k.a. the symmetric usage model), although here the MPI ranks run only on the host. Varying the number of ranks on the host can be used to partition each coprocessor into independent sets of threads. Independent thread partitions are created via the KMP_PLACE_THREADS environment variable. In addition, the OFFLOAD_DEVICES environment variable makes it possible to use multiple coprocessors within the same system. Of course there is nothing wrong with using plain OpenMP instead of this proposed method; however, we found that partitioning the coprocessor unlocks more performance: it simply trades the implicit synchronization at the end of parallel regions for completely independent executions. To ease the tuning process, a script is provided that generates the appropriate “mpirun” command line.

$ ~/mpirun/mpirun.sh -h
    -n: list of comma separated node names
    -p: number of processes per socket (host)
    -q: number of processes per mic (native)
    -s: number of sockets per node
    -d: number of devices per node
    -e: number of CPU cores per socket
    -t: number of CPU threads per core
    -m: number of MIC cores per device
    -r: number of MIC cores reserved
    -u: number of MIC threads per core
    -a: affinity (CPU) e.g., compact
    -b: affinity (MIC) e.g., balanced
    -c: schedule, e.g., dynamic
    -0: executable (rank-0)
    -x: executable (host)
    -y: executable (mic)
    -z: prefixed mic name
    -i: inputfile (<)
    -w: wrapper
    -v: dryrun

The script “mpirun.sh” inspects the system hardware in order to provide defaults for all of the above arguments. It then launches “mpirun.py,” which builds and launches the command line for “mpirun.” This initial inspection, for example, avoids using multiple host sockets in case there is only one coprocessor attached to the system (and thereby avoids data transfers to a “remote” coprocessor). Any default provided by the launcher script “mpirun.sh” can be overridden on the command line (while still leveraging all other default settings). Please note that the script also supports symmetric execution (“-y”, etc.), which is discussed elsewhere.
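
For reference, the following is a rough sketch of the kind of per-rank offload environment such a partitioning relies on when a 61-core coprocessor is split into four partitions (the exact variables and values generated by mpirun.py, as well as the precise KMP_PLACE_THREADS syntax accepted by your OpenMP* runtime, may differ):

$ export MIC_ENV_PREFIX=MIC                  # forward MIC_* variables to the coprocessor side
$ export MIC_KMP_PLACE_THREADS=15c,4t,0o     # rank 0: 15 cores, 4 threads/core, core offset 0
$ export OFFLOAD_DEVICES=0                   # restrict this rank's offload to the first coprocessor

The remaining three ranks would use core offsets of 15, 30, and 45, respectively, so that the partitions do not overlap.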

Here is an example of running Quantum ESPRESSO with four partitions on each coprocessor:

$ ./mpirun.sh -p4 \
    -w <PATH_TO_LIBXPHI>/xphilibwrapper.sh \
    -x <PATH_TO_PW>/pw.x \
    -i <input-file.in>

Any argument passed at the end of the command line is simply forwarded to the next underlying mechanism if not consumed by option processing. If you need to pass arguments to the executable using “<”, you can use the script’s “-i” option; otherwise, options for the executable can be simply appended to the above command line.
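
For example (an illustrative sketch using pw.x's standard -npool option for k-point pools), appending the option to the command line above forwards it to the executable:

$ ./mpirun.sh -p4 \
    -w <PATH_TO_LIBXPHI>/xphilibwrapper.sh \
    -x <PATH_TO_PW>/pw.x \
    -i <input-file.in> \
    -npool 2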

The number of ranks per host socket (“-p”) divides not only the number of cores per host processor but also the number of cores of each coprocessor. Some ratios therefore leave a few cores unused. On the other hand, the coprocessor usually has more cores than a single host socket/processor, so this is usually acceptable and in any case part of tuning the number of partitions.
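
As a concrete example, with “-p4” on a 12-core Intel Xeon E5-2697 v2 socket each rank gets 3 host cores, while a 61-core coprocessor is split into four 15-core partitions, leaving one coprocessor core unused (which is commonly reserved for the coprocessor OS anyway; see the “-r” option above).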

Performance

Figure 1: Performance of Quantum ESPRESSO 5.0.3 executing the GRIR443 benchmark on 16 Intel® Xeon® processors E5-2697 v2 and 16 Intel® Xeon Phi™ coprocessors 7120A.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Application parameterization

  • -npool=2
  • 2 MPI ranks/socket
  • 6 threads/MPI rank

Platform configuration

Host configuration

  • Intel® Xeon® processor E5-2697 v2
  • 64GB DDR3-1600
  • RHEL 6.4
  • Intel® Turbo Boost Technology /EIST/SMT/NUMA enabled

MIC configuration

  • Intel® Xeon Phi™ coprocessor 7120A, 61 cores, 1.238 GHz
  • MPSS 2.1.6720-16
  • ECC enabled, Turbo disabled

Software configuration

  • icc 14.0.0 update 1, Intel® MPI Library 4.1.1.036
