Overview
This document describes how to obtain, build, and run the WRF model in symmetric mode across multiple nodes equipped with Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. It also covers the WRF software configuration and affinity settings needed to extract the best performance from a multi-node symmetric mode run.
Introduction
The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, application scientists, and others. Please see http://www.wrf-model.org/index.php for more details on this system. The source code and input files can be downloaded from the NCAR website. The latest version as of this writing is WRFV3.6. In this article, we use the conus2.5km benchmark.
WRF is used by many private and public organizations across the world for weather and climate prediction.
WRF has a relatively flat profile on Intel Architecture over many functions for atmospheric dynamics and physics: advection, microphysics, etc.
Technology (Hardware/Software)
System | Intel Xeon E5-2697 v2 @ 2.7 GHz
Coprocessor | Intel Xeon Phi coprocessor 7120A @ 1.23 GHz
Intel® MPI | 4.1.1.036
Intel® Compiler | composer_xe_2013_sp1.1.106
Intel® MPSS | 6720-21
We used the above hardware and software configuration for all of our testing.
Note: This article assumes that you are running the workload on the aforementioned hardware configuration. If you are using Intel Xeon Phi coprocessor model 7110 cards, please use the following instructions on 8 nodes instead of 4. To run the workload on 4 nodes, you need Intel Xeon Phi coprocessors with 16GB memory; since the 7110 model coprocessors have 8GB memory, you will need more than 4 Intel Xeon Phi coprocessor cards.
Note: Please use netcdf-3.6.3 and pnetcdf-1.3.0 for I/O.
Multi Node Symmetric Intel Xeon + Intel Xeon Phi coprocessor (4 Nodes)
Compile WRF for the Coprocessor
- Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
- Source the Intel MPI for the coprocessor and the Intel Compiler for intel64
- source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
- source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
- On bash, export the paths to the netcdf and pnetcdf builds for the coprocessor. Having netcdf and pnetcdf built for the Intel Xeon Phi coprocessor is a prerequisite.
- export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/netcdf/mic/
- export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.5/pnetcdf/mic/
- Turn on Large file IO support
- export WRFIO_NCD_LARGE_FILE_SUPPORT=1
- cd into the ../WRFV3/ directory, run ./configure, and select the option to build for the Intel Xeon Phi coprocessor (MIC architecture), option 17. At the next prompt, for nesting options, hit return for the default, which is 1.
- In the configure.wrf that is created, delete -DUSE_NETCDF4_FEATURES and replace -O3 with -O2
- Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
- Run ./compile wrf >& build.mic
- This will build a wrf.exe in the ../WRFV3/main folder.
- For a new, clean build, run ./clean -a and repeat the process.
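The coprocessor build steps above can be collected into a single script. The sketch below uses the example paths from this article (adapt them to your system) and, for safety, only prints each step via a `run` helper; replace `run` with direct execution once the plan looks right:

```shell
#!/bin/sh
# Dry-run sketch of the coprocessor build sequence from this article.
# Paths are the article's examples and must be adapted to your cluster.
run() { echo "+ $*"; }   # prints each step instead of executing it

run source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
run source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
run export NETCDF=/path/to/netcdf/mic/        # netcdf built for the coprocessor
run export PNETCDF=/path/to/pnetcdf/mic/      # pnetcdf built for the coprocessor
run export WRFIO_NCD_LARGE_FILE_SUPPORT=1
run cd WRFV3
run ./configure     # choose option 17 (MIC architecture), nesting default 1
# then edit configure.wrf: drop -DUSE_NETCDF4_FEATURES, change -O3 to -O2
run ./compile wrf   # wrf.exe lands in main/
```

The same skeleton applies to the host build with the intel64 mpivars, the host netcdf/pnetcdf paths, and configure option 21.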
Compile WRF for Intel Xeon processor-based host
- Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
- Source the latest Intel MPI for intel64 and latest Intel Compiler (as an example below)
- source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
- source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
- Export the path for the host netcdf and pnetcdf. Having netcdf and pnetcdf built for the host is a prerequisite.
- export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/netcdf/xeon/
- export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.5/pnetcdf/xeon/
- Turn on Large file IO support
- export WRFIO_NCD_LARGE_FILE_SUPPORT=1
- cd into the WRFV3 directory created in step #1, run ./configure, and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc (dm+sm)". At the next prompt, for nesting options, hit return for the default, which is 1.
- In the configure.wrf that is created, delete -DUSE_NETCDF4_FEATURES and replace -O3 with -O2
- Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
- Run ./compile wrf >& build.snb.avx . This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compiles, set the environment variable J to "-j 4" or whatever number of parallel make tasks you wish to use.)
- For a new, clean build, run ./clean -a and repeat the process.
Run WRF Conus2.5km in Symmetric Mode
- Download the CONUS2.5_rundir from http://www2.mmm.ucar.edu/WG2bench/conus_2.5_v3/
- Follow the READ-ME.txt to build the wrf input files.
- The namelist.input file must be altered. The changes are as follows:
- In the &time_control section, edit the values as below:
- restart_interval = 360,
- io_form_history = 2,
- io_form_restart = 2,
- io_form_input = 2,
- io_form_boundary = 2,
- Remove "perturb_input =.true." from the &domains section and replace with "nproc_x =8,"
- Add "tile_strategy =2," under the &domains section.
- Add "use_baseparam_fr_nml =.true." under the &dynamics section.
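Taken together, the namelist edits above produce sections shaped like the fragment below. Only the lines shown are the edits from this article; the `...` stands for the benchmark's remaining settings, which must come from the conus2.5km namelist.input itself:

```
&time_control
 restart_interval     = 360,
 io_form_history      = 2,
 io_form_restart      = 2,
 io_form_input        = 2,
 io_form_boundary     = 2,
 ...
/

&domains
 nproc_x              = 8,     ! replaces perturb_input = .true.
 tile_strategy        = 2,
 ...
/

&dynamics
 use_baseparam_fr_nml = .true.,
 ...
/
```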
- Create a new directory called CONUS2.5_rundir (e.g., ../WRFV3/CONUS2.5_rundir). Inside CONUS2.5_rundir, create 2 directories, "mic" and "x86", and copy the contents of ../WRFV3/run/ into both the "mic" and "x86" directories.
- Copy the Intel Xeon Phi coprocessor binary into the CONUS2.5_rundir/mic directory and copy the Intel Xeon binary into the CONUS2.5_rundir/x86 directory.
- cd into CONUS2.5_rundir and execute WRF as follows on 4 nodes (i.e., 4 coprocessors + 4 Intel Xeon processors) in symmetric mode. To run conus2.5km, you need access to 4 nodes (example shown below).
Script to run on Intel Xeon Phi + Intel Xeon (symmetric mode)
The nodes used in this example are: node01 node02 node03 node04
When you request nodes, make sure you have a large stack size: MIC_ULIMIT_STACKSIZE=365536
source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
./run.symmetric
Below is the run.symmetric to run the code in symmetric mode:
run.symmetric script
#!/bin/sh
mpiexec.hydra \
  -host node01 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe : \
  -host node02 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe : \
  -host node03 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe : \
  -host node04 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe : \
  -host node01-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh : \
  -host node02-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh : \
  -host node03-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh : \
  -host node04-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY=turnaround -env OMP_SCHEDULE=static -env KMP_STACKSIZE=190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh
In ../CONUS2.5_rundir/mic, create a wrf.sh file as below.
Below is the wrf.sh that is needed for the Xeon Phi part of the runscript.
wrf.sh script
export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH
/path/to/CONUS2.5_rundir/mic/wrf.exe
- You will have 80 rsl.error.* and 80 rsl.out.* files in your CONUS2.5_rundir directory.
- Do a 'tail -f rsl.error.0000' and when you see 'wrf: SUCCESS COMPLETE WRF' your run is successful.
- After the run, compute the total simulation time with the scripts below. The mean value, which gives the Average Time Step (ATS), is the figure of merit for WRF (the lower, the better).
Parsing scripts
gettiming.sh is the parsing script; it extracts the per-step timings from rsl.out.0000 and feeds them to stats.awk:
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -719 | awk '{print $9}' | awk -f stats.awk

stats.awk:

BEGIN { a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
  i++
  a += $1
  if ( $1 > max ) max = $1
  if ( $1 < min ) min = $1
}
END {
  printf("---\n%10s %8d\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n%10s %15f\n",
         "items:", i, "max:", max, "min:", min, "sum:", a,
         "mean:", a/(i*1.0), "mean/max:", (a/(i*1.0))/max)
}
Validation
To validate that a completed WRF run is correct, check the following:
- It should generate a wrf_output file.
- diffwrf your_output wrfout_reference > diffout_tag
- The 'DIGITS' column should have high values (>3). If so, the WRF run is considered valid.
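The DIGITS check can be scripted. The sketch below assumes DIGITS is the last column of diffout_tag; the sample file it creates is a made-up illustration of that layout, not real diffwrf output, so adjust the column index to match what diffwrf actually emits on your system:

```shell
# Create a hypothetical diffout_tag purely to illustrate the check;
# in a real run this file comes from: diffwrf your_output wrfout_reference > diffout_tag
cat > diffout_tag <<'EOF'
Field   Ndifs   RMS (1)   RMS (2)   DIGITS
U       0       1.0       1.0       6
V       0       1.0       1.0       5
EOF

# Flag any field whose DIGITS (assumed to be the last column) is 3 or lower.
awk 'NR > 1 { if ($NF + 0 <= 3) bad = 1 } END { exit bad }' diffout_tag \
  && echo "WRF run considered valid" \
  || echo "DIGITS too low: check the run"
```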
Compiler Options
- -mmic : build an application that natively runs on Intel® Xeon Phi™ Coprocessor
- -openmp : enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)
- -O3 : enable aggressive optimizations by the compiler.
- -opt-streaming-stores always : generate streaming stores
- -fimf-precision=low : low precision for higher performance
- -fimf-domain-exclusion=15 : gives the lowest-precision sequences for single precision and double precision.
- -opt-streaming-cache-evict=0 : turn off all cache line evicts.
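In configure.wrf these options end up on the compiler flag lines. The fragment below is an illustrative sketch of how they might be placed, not a verbatim copy of a generated configure.wrf (the exact variable layout depends on the configure option chosen):

```
FCOPTIM = -O2 -opt-streaming-stores always \
          -fimf-precision=low -fimf-domain-exclusion=15 \
          -opt-streaming-cache-evict=0
OMP     = -openmp
# -mmic applies only to the coprocessor build (configure option 17);
# -O3 is replaced by -O2 as described earlier in this article.
```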
Conclusion
This document shows how to compile and run the WRF conus2.5km workload on an Intel cluster containing both Intel Xeon processor-based systems and Intel Xeon Phi coprocessors, and it showcases the benefit of adding Intel Xeon Phi coprocessors, in a 4-node symmetric mode run, over a homogeneous Intel Xeon processor-based installation.
About the Author
Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel SSG).