Purpose
This code recipe describes how to get, build, and use the LAMMPS* code for the Intel® Xeon Phi™ coprocessor.
Introduction
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS*) is a classical molecular dynamics code. LAMMPS has potentials for solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. LAMMPS can be used to model atoms, or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale.
LAMMPS runs on single processors or in parallel using message-passing techniques with a spatial-decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.
LAMMPS is distributed as open source code under the terms of the GNU General Public License. The current version can be downloaded at http://lammps.sandia.gov/download.html. Links are also included to older F90/F77 versions. Periodic releases are also available on SourceForge*.
LAMMPS is distributed by Sandia National Laboratories, a U.S. Department of Energy laboratory. The main authors of LAMMPS are listed on the LAMMPS site along with contact info and other contributors. Find out more about LAMMPS at http://lammps.sandia.gov.
Code Support for Intel® Xeon Phi™ coprocessor
LAMMPS* with Intel Xeon Phi coprocessor support is expected to be released as an Intel-optimized package between July and September of 2014. The release will include support for potentials that allow simulation of soft matter, biomolecules, and materials. Contact your Intel representative about access prior to September 2014.
Build Directions
Building LAMMPS for Intel Xeon Phi coprocessor is similar to a normal LAMMPS build. A makefile supporting offload and vectorization for CPU routines will be included. An example build will include the following code:
> source /opt/intel/compiler/2013_sp1.1.106/bin/iccvars.sh intel64
> source /opt/intel/impi/4.1.2.040/bin64/mpivars.sh
> cd src
> make yes-user-intel
> make intel_offload
> echo "LAMMPS executable is src/lmp_intel_offload"
Run Directions
To run LAMMPS on Intel Xeon Phi coprocessor:
- Edit your run script as you would for other accelerator packages (OPT, GPU, USER-OMP). See the figure below.
- Run LAMMPS as you would normally. The modified code handles the offloading to the coprocessor. See the figure below.
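As a sketch of the first step, accelerator packages in LAMMPS are typically enabled with a `package` command in the input script (or equivalent command-line switches). The exact syntax of the pre-release Intel package may differ; the lines below are an illustration modeled on the other accelerator packages, and the keyword names (`mode`, `balance`) are assumptions to confirm against the released documentation:

```
# Hypothetical input-script lines (keywords assumed, not confirmed)
package intel 1 mode mixed balance -1   # 1 coprocessor, mixed precision, dynamic load balance
suffix intel                            # select the Intel-accelerated pair/bond styles
```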
LAMMPS will simulate the time evolution of the input system of atoms or other particles, as specified in the input script, writing output that includes atom positions, thermodynamic quantities, and other statistics.
Expected output and performance can be checked against the log files provided in the examples directory.
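A typical run and sanity check might look like the following. The executable name matches the build above; the input file and MPI rank count are placeholders to adapt to your system:

```
> mpirun -np 24 ./lmp_intel_offload -in in.rhodo
> grep "Loop time" log.lammps    # compare timings and thermo output against the example logs
```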
Optimizations and Usage Model
A load balancer offloads part of the neighbor-list and non-bonded force calculations to the Intel Xeon Phi coprocessor so that they run concurrently with calculations on the CPU. This is achieved by using the offload API to run routines well suited to many-core chips on both the CPU and the coprocessor: the same C++ routine is run twice, once with an offload flag, to support concurrent calculation.
The dynamic load balancing allows for concurrent 1) data transfer between host and coprocessor, 2) calculation of neighbor-list, non-bonded, bond, and long-range terms, and 3) some MPI* communications. It continuously updates the fraction of offloaded work to minimize idle time. A standard LAMMPS "fix" object manages concurrency and synchronization.
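For example, instead of letting the balancer tune the split at runtime, the offloaded fraction could be pinned to a fixed value. The `balance` keyword and its values shown here are assumptions based on the description above, not confirmed syntax:

```
# Hypothetical: offload a fixed 50% of the eligible work to the coprocessor
package intel 1 balance 0.5
# Hypothetical: a negative value lets the balancer adjust the fraction dynamically
package intel 1 balance -1
```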
The Intel package adds support for single-, mixed-, and double-precision calculations on both the CPU and the coprocessor, as well as vectorization (AVX on the CPU and 512-bit vector instructions on the coprocessor). This can provide significant speedups for the routines on the CPU, too.
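The precision mode would then be selected per run; for instance (the `mode` keyword and value names are assumed for illustration):

```
# Hypothetical precision selection
package intel 1 mode single   # fastest: single precision throughout
package intel 1 mode mixed    # single-precision forces with double-precision accumulation
package intel 1 mode double   # full double precision
```

Mixed precision is often a reasonable default, trading little accuracy for most of the single-precision speedup.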
Performance Testing
The advantages of using the Intel package are illustrated below by comparing the baseline MPI/OpenMP* routines in LAMMPS with the optimized routines running on the CPU only or on the CPU with offload to the coprocessor. Results are provided for the Rhodopsin* benchmark distributed with LAMMPS, scaled to 256,000 atoms.
The Rhodopsin benchmark simulates the movement of a protein in the retina that plays an important role in the perception of light. The protein is simulated in a solvated lipid bilayer using the CHARMM* force field with Particle-Particle Particle-Mesh long-range electrostatics and SHAKE* constraints. The simulation is performed at a temperature of 300 K and a pressure of 1 atm. The results on a single node and 32 nodes of the Endeavor cluster (configuration below) are shown, demonstrating a speedup of up to 2.15X when using the Intel Xeon Phi coprocessor.
Figure Right: Rhodopsin protein benchmark with atoms in initial configuration.
Testing Platform Configurations
The following hardware was used for the above recipe and performance testing.
Endeavor Cluster Configuration:
- 2-socket/24 cores:
- Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading Technology
- Network: InfiniBand* Architecture Fourteen Data Rate (FDR)
- Operating System: Red Hat Enterprise Linux* 2.6.32-358.el6.x86_64.crt1 #4 SMP Fri May 17 15:33:33 MDT 2013 x86_64 x86_64 x86_64 GNU/Linux
- Memory: 64GB
- Coprocessor: 2X Intel Xeon Phi coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading Technology, Memory: 15872 MB
- Intel® Many-core Platform Software Stack Version 2.1.6720-19
- Intel® Compiler 2013 SP1.1.106 (icc version 14.0.1)
- Compile flags:
-O3 -xHost -fno-alias -fno-omit-frame-pointer -unroll-aggressive -opt-prefetch
-mP2OPT_hpo_fast_reduction=F
-offload-option,mic,compiler,"-fimf-domain-exclusion=15 -mGLOB_default_function_attrs=\"gather_scatter_loop_unroll=5\""