I. Overview
This article describes how to obtain, compile, and run an optimized version of the MASNUM WAVE workload (0.2-degree high resolution) on Intel® Xeon® processors and Intel® Xeon Phi™ processors.
The source for this version of MASNUM WAVE as well as the workload can be obtained by contacting Prof. Zhenya Song at songroy@fio.org.cn.
II. Introduction
MASNUM WAVE is a third-generation surface wave model that was proposed in the early 1990s by LAGFD (Laboratory of Geophysical Fluid Dynamics) at FIO. The application simulates and predicts wave processes by solving the wave energy spectrum balance equation and its complicated characteristic equations in wave-number space. It is written in Fortran* and parallelized with MPI.
This version of MASNUM WAVE is optimized for performance on both Intel Xeon processors and Intel Xeon Phi processors. Optimizations in this package include:
- Removing repeated computation
- Vectorization by loop unrolling, loop interchange, and removing data dependencies
- Multithreading support with OpenMP*
- Compiler options tuning
III. Preliminaries
- To build this package, install Intel® MPI Library 5.1 or higher and Intel® Parallel Studio XE 2016 or higher on your host system.
- Please contact Prof. Zhenya Song at songroy@fio.org.cn to get the optimized MASNUM WAVE source package and test workload.
- Set up the Intel MPI Library and Intel® Fortran Compiler environments:
> source /opt/intel/compilers_and_libraries_<version>/linux/mpi/bin64/mpivars.sh
> source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
- To run MASNUM WAVE on the Intel Xeon Phi processor, reboot the system with SNC-4 cluster mode and cache memory mode selected in the BIOS settings. Please refer to "Intel® Xeon Phi™ Processor - Memory Modes and Cluster Modes: Configuration and Use Cases" for more details on memory configuration.
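Before building, it can help to confirm that the environment scripts took effect and that the Intel Xeon Phi system booted in the intended mode. The following is a minimal sketch, not part of the MASNUM WAVE package. The helper names `check_tools` and `count_numa_nodes` are hypothetical, and the tool names in the commented example (`mpiifort`, `ifort`) are an assumption about what the package makefiles invoke. The sketch also assumes that a Linux system in SNC-4 cluster mode with cache memory mode exposes four NUMA nodes under `/sys/devices/system/node`.

```shell
# Report whether each named tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Count NUMA node directories under a sysfs-style root.
# The default root is the standard Linux location; pass another
# directory as the first argument to inspect a different tree.
count_numa_nodes() {
    root="${1:-/sys/devices/system/node}"
    count=0
    for d in "$root"/node[0-9]*; do
        if [ -d "$d" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Example use (tool names are assumptions, see the note above):
#   check_tools mpirun mpiifort ifort
#   [ "$(count_numa_nodes)" -eq 4 ] || echo "system does not expose 4 NUMA nodes"
```

If `count_numa_nodes` reports a different value on the Intel Xeon Phi system, the BIOS cluster and memory mode settings are worth rechecking before running the workload.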
IV. Build MASNUM WAVE for the Intel Xeon processor
- Unpack the source code into a directory of your choice, for example /home/<user>:
> tar xvf WAVE_opt.tar.bz2
- Build the executables for the Intel Xeon processor.
> cd /home/<user>/WAVE/src/bin
> cp makfile.cpu makefile
The executables are located at /home/<user>/WAVE/exp, with the name masnum.wam.mpi.cpu.
V. Build MASNUM WAVE for the Intel Xeon Phi processor
- Build the executables for the Intel Xeon Phi processor.
> cd /home/<user>/WAVE/src/bin
> cp makfile.knl makefile
This builds the executables for the Intel Xeon Phi processor. They are located at /home/<user>/WAVE/exp/, with the name masnum.wam.mpi.knl.
VI. Run MASNUM WAVE on the Intel Xeon processor and Intel Xeon Phi processor
- Run MASNUM WAVE with the test workload on the Intel Xeon processor.
> cd /home/<user>/WAVE/exp
> mpirun -n 36 -env OMP_NUM_THREADS 1 ./masnum.wam.mpi.cpu
- Run MASNUM WAVE with the test workload on the Intel Xeon Phi processor. Make sure all the binary and workload files are located on the Intel Xeon Phi (KNL) system.
> cd /home/<user>/WAVE/exp
> mpirun -n 34 -env OMP_NUM_THREADS 8 ./masnum.wam.mpi.knl
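The two launch lines differ only in the rank count, the OpenMP thread count, and the executable name. As an illustration, the sketch below prints the command for a given configuration instead of running it, so it can be checked before launching. The helper names `build_run_cmd` and `total_threads` are hypothetical and not part of the package. Note that the option dashes must be plain ASCII hyphens for `mpirun` to accept them.

```shell
# Print (do not execute) the mpirun command for a given configuration.
#   $1 = number of MPI ranks
#   $2 = OpenMP threads per rank
#   $3 = executable name in the current directory
build_run_cmd() {
    ranks="$1"
    threads="$2"
    exe="$3"
    echo "mpirun -n $ranks -env OMP_NUM_THREADS $threads ./$exe"
}

# Total software threads launched = ranks x threads per rank.
total_threads() {
    echo $(( $1 * $2 ))
}

build_run_cmd 36 1 masnum.wam.mpi.cpu
build_run_cmd 34 8 masnum.wam.mpi.knl
echo "Threads on the Intel Xeon Phi run: $(total_threads 34 8)"
```

To actually launch a run, the printed line can be pasted into the shell from /home/<user>/WAVE/exp.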
VII. Performance gain
For the test workload, the following graph shows the speedup achieved on the Intel Xeon Phi processor relative to the Intel Xeon processor. The results are:
- Up to 1.25x faster with the Intel® Xeon Phi™ processor 7210 compared to the two-socket Intel® Xeon® processor E5-2697 v4.
- Up to 1.41x faster with the Intel® Xeon Phi™ processor 7250 compared to the two-socket Intel Xeon processor E5-2697 v4.
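The speedup figures above are ratios of elapsed times: the baseline time on the Intel Xeon processor divided by the time on the Intel Xeon Phi processor. A minimal sketch of that calculation follows. The helper name `speedup` is hypothetical, and the timings in the example are placeholder values chosen only to show the arithmetic; they are not measurements from this workload.

```shell
# speedup BASELINE_SECONDS CANDIDATE_SECONDS
# Prints baseline/candidate with two decimals; values above 1 mean
# the candidate system finished faster than the baseline.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

# Placeholder timings (not measured data): baseline 100 s, candidate 80 s.
speedup 100 80
```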
Comments on the performance improvement on the Intel Xeon Phi processor:
- MASNUM WAVE has good parallel scalability and benefits from more cores. On the Intel Xeon processor, however, it is memory bandwidth bound and Hyper-Threading does not improve performance: the 36 cores deliver the same best performance with 1 thread or 2 threads per core. On the Intel® Xeon Phi™ processor 7250, the best performance is achieved with 34 MPI ranks and 8 threads per rank, which makes good use of all 272 logical cores. In our tests, 8 threads per rank gave the best multithread scalability with OpenMP, and reducing the number of MPI ranks from 68 to 34 also reduces MPI communication time.
- MASNUM WAVE is well vectorized, so the wider vector registers available with Intel® AVX-512 improve performance significantly.
- MASNUM WAVE also benefits from MCDRAM because it is memory bandwidth bound.
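The 34 ranks by 8 threads configuration is one of several ways to fill the 272 logical cores of the Intel Xeon Phi processor 7250. The sketch below lists the combinations for power-of-two thread counts, which may be useful when repeating the scaling experiment on another system. The helper name `list_decompositions` and the set of thread counts tried are assumptions for illustration, not something prescribed by the package.

```shell
# List rank/thread combinations that exactly fill a number of logical cores.
#   $1 = number of logical cores to fill
list_decompositions() {
    logical="$1"
    for threads in 1 2 4 8 16; do
        # Keep only thread counts that divide the logical core count evenly.
        if [ $((logical % threads)) -eq 0 ]; then
            echo "$((logical / threads)) ranks x $threads threads"
        fi
    done
}

list_decompositions 272
```

Each listed combination can then be timed with the corresponding -n and OMP_NUM_THREADS values to find the best balance between MPI communication and OpenMP scaling.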
Testing platform configuration:
Intel Xeon processor E5-2697 v4: Dual-socket Intel Xeon processor E5-2697 v4, 2.3 GHz, 18 cores/socket, 36 cores, 72 threads (HT and Turbo ON), DDR4 128 GB, 2400 MHz, Oracle Linux* Server release 6.7.
Intel Xeon Phi processor 7210 (64 cores): Intel Xeon Phi processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (HT and Turbo ON), 1600 MHz uncore freq., MCDRAM 16 GB 6.4 GT/s, BIOS 10D42, DDR4 96 GB, 2133 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.
Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (HT and Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 10R00, DDR4 96 GB, 2400 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.