Purpose
This article provides code access, build, and run directions for the Baffin-Bay test cases of the HBM code, running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.
Introduction
The Danish Meteorological Institute (DMI) operates HBM, a regional 3D ocean model for the North Sea and Baltic Sea, to provide information about the physical state of Danish and nearby waters in the near future.
The origin of the HBM code dates back to the BSHcmod ocean circulation model, whose development was initiated in the 1990s at the Bundesamt für Seeschifffahrt und Hydrographie (BSH) in Hamburg, Germany; one branch of the model code has been maintained and further developed at DMI. The DMI implementation has undergone extensive revision, in cooperation between DMI, BSH, FMI (Finnish Meteorological Institute), and MSI (Marine Systems Institute, Estonia).
The HBM code started from a snapshot of the DMIcmod repository in 2009. The current implementation is as close to a total rewrite of the original BSHcmod as one can get without formally being one, and an attempt has been made to merge the model code development into a common project, with the intention of joining the forces of experienced modelers from around the Baltic Sea. This model code project carries the name HBM, the HIROMB-BOOS (Baltic Operational Oceanographic System) Model.
HBM calculates sea level, water temperature, salinity, ocean currents, and ice thickness and concentration at 15-minute intervals. Temperature, salinity, and currents are calculated for the entire water column, from top to bottom, at approximately 50 fixed-depth levels. Sea level at selected coastal points is calculated at 10-minute time resolution. HBM runs every six hours (four times per day), and each run produces a five-day forecast of the three-dimensional state of the ocean.
HBM is not an open-source or community model code, so the full HBM results are not shared in this code recipe. However, the advection routines (part of the HBM code) will be made available under a BSD license later this year; the results will be updated once that code is released.
Code Access
Jacob Weismann Poulsen and Per Berg from DMI are the main developers and maintainers of the HBM code. As stated above, HBM is not an open-source or community model code; individuals and third parties can gain access to the code by signing a Memorandum of Understanding (MoU).
To get access to the code and test cases, please contact Per Berg (per@dmi.dk) or Jacob Weismann Poulsen (jwp@dmi.dk).
Build Directions
The build system uses autoconf. The configure step sets mandatory compiler settings and distinguishes between the different build incarnations. If you do not specify any configure options, you get a serial build.
Note that configure generates a file called Makefile.include containing the compiler flag settings. To change compiler flags, you can rerun configure, adjust the generated Makefile.include file, or pass the new flags directly to make, for example make FCFLAGS="-O3".
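If you want to experiment with different flags, the three mechanisms above can be used as follows (a minimal sketch; the -O2 value is only an illustration, not a recommended setting):
# 1. Rerun configure with the new flags
FC="mpiifort" FCFLAGS='-O2' ./configure --enable-openmp --enable-mpi --enable-contiguous && make
# 2. Edit the generated Makefile.include by hand, then rebuild
${EDITOR:-vi} Makefile.include && make
# 3. Override the flags on the make command line for a single build
make FCFLAGS='-O2'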
All the following build examples are written in BASH syntax. Each example generates a single binary called cmod, which you can rename as needed for a specific architecture.
Configuration Examples
--enable-contiguous
--enable-openmp
--enable-mpi
--enable-openmp --enable-mpi --enable-contiguous
--enable-openacc --enable-contiguous
--enable-openacc --enable-mpi
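As a minimal sketch of how these option sets are used (the full per-architecture commands are given in the Build section below):
# Pure OpenMP build; configure with no options at all gives a serial build
./configure --enable-openmp && make
# Hybrid MPI + OpenMP build with contiguous data structures
FC="mpiifort" ./configure --enable-openmp --enable-mpi --enable-contiguous && make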
Build
The following examples illustrate the build commands for different Intel Xeon processors. Choose the command for the processor you are building for.
NOTE: In the following code examples, each command is a single command line, even where it wraps across several display lines.
Decompress the source files.
tar -zxvf hbm-<version>.tar.gz
cd hbm-<version>
Source the Intel® compilers and Intel® MPI Library. The following example is for the Intel® Compiler (14.0.2 20140120) and Intel® MPI Library (4.1.3.049); use the line that matches the type of processor or coprocessor support you need. (The parenthetical comments are not part of the commands.)
source /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.3.049/intel64/bin/mpivars.sh (for Intel® Xeon® processor)
source /opt/intel/impi/4.1.3.049/mic/bin/mpivars.sh (for Intel® Xeon Phi™ Coprocessor)
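As a quick sanity check (not part of the original recipe), you can verify that the intended Intel MPI Fortran wrapper is picked up by the environment:
# Confirm the mpiifort wrapper is on PATH and report the underlying compiler version
which mpiifort
mpiifort -v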
To build for the Intel® Xeon® processor v2 family (formerly codenamed Ivy Bridge):
FC="mpiifort" FCFLAGS='-O3 -xCORE-AVX-I' ./configure --enable-openmp --enable-mpi --enable-contiguous && make
To build for the Intel® Xeon® processor v3 family (formerly codenamed Haswell):
FC="mpiifort" FCFLAGS='-O3 -xCORE-AVX2' ./configure --enable-openmp --enable-mpi --enable-contiguous && make
To build for the Intel Xeon Phi Coprocessor:
FC="mpiifort" FCFLAGS='-O3 -mmic -fimf-precision=low -fimf-domain-exclusion=15 -opt-streaming-stores always -opt-streaming-cache-evict=0' ./configure --enable-openmp --enable-mpi --enable-contiguous --host=x86_64-k1om-linux --build=x86_64-unknown-linux && make
Compiler Flags Used
Compiler Flag | Effect |
---|---|
-O3 | Optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs |
-xCORE-AVX-I | May generate Intel® Advanced Vector Extensions (Intel® AVX), including instructions in Intel® Core 2™ processors in process technology smaller than 32nm, Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors |
-xCORE-AVX2 | May generate Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors |
-mmic | Build an application that runs natively on Intel® Many Integrated Core Architecture (Intel® MIC Architecture) |
-fimf-precision=low | Equivalent to accuracy-bits = 11 (single precision); accuracy-bits = 26 (double precision) |
-fimf-domain-exclusion=15 | Indicates the domain on which a function is evaluated |
-opt-streaming-stores always | Enables generation of streaming stores under the assumption that the application is memory bound |
-opt-streaming-cache-evict=0 | Turns off generation of the cache line eviction instructions when streaming loads/stores are used (specific to Intel MIC Architecture) |
Run Directions
The Test Cases
Two kinds of test cases are provided: irregular test cases and regular cubic test cases. The irregular Baffin-Bay test cases are based on bathymetry data from ETOPO2 and contain two binary files, nidx.bin and hz.bin.
The regular cubic test cases are completely specified by options.nml, where the three dimensions of the cube are set as follows:
&optionlist
  mi = 100
  mj = 100
  mk = 50
  cube = .true.
Running with MPI
To run with more than one task, use the correct decomposition file. Assuming you want to run with ${MPITASKS} tasks, execute the following:
cp mpi_decompo_${MPITASKS}x1.txt mpi_decompo.txt
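For example, a sketch of a hybrid run with 12 MPI tasks and 4 OpenMP threads per task (the mpiexec.hydra options mirror the run commands later in this article):
MPITASKS=12
cp mpi_decompo_${MPITASKS}x1.txt mpi_decompo.txt
mpiexec.hydra -n ${MPITASKS} -env OMP_NUM_THREADS 4 -env I_MPI_PIN_DOMAIN omp ./cmod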
Generating New MPI Decompositions
To generate MPI decompositions, you first need to build the code without MPI support, i.e. without --enable-mpi. Follow the build directions above, but omit the MPI option. For instance:
./configure && make
Then add the four lines below to the options list in options.nml:
decomp_version = 1
nproci = 1
nprocj = 10
only_islices = .true.
Here nprocj is the number of decompositions that will be generated; the above setting generates 10 files, mpi_decompo_Nx1.txt, where N runs from 1 to 10.
Then run the cmod binary; it generates the mpi_decompo_Nx1.txt files and exits:
./cmod
You only need to do this once on the host (for example, a system with the Intel Xeon processor v2 family); the same decompositions can be used for the other architectures.
The above steps produce several decompositions, not all of which are needed. When there are multiple candidates for a given number of tasks, we want to select the best one.
To select the appropriate decomposition files, execute the following command in the location where the MPI decomposition files were generated. This creates a NEW directory with the right decomposition files.
./select_mpi_decompo_from_pruned_gen.sh
You can now use the generated decomposition files together with a build that includes MPI support, as explained above.
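The whole generation workflow can be summarized in a short sketch (the options.nml edit is done by hand; the file and script names are taken from the steps above):
# 1. Build without MPI support
./configure && make
# 2. Edit options.nml by hand: add decomp_version, nproci, nprocj, and only_islices inside the &optionlist group, as shown above
# 3. Run cmod; it writes mpi_decompo_Nx1.txt for N = 1..nprocj and exits
./cmod
# 4. Copy the best candidate decompositions into the NEW directory
./select_mpi_decompo_from_pruned_gen.sh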
Running the Baffin-Bay Test Cases
In the following instructions, select the run command appropriate for the processor you are using.
Move to the BaffinBay directory:
cd BaffinBay_*
Be sure to run the appropriate binary for each architecture.
Source the Intel Compilers and Intel MPI Library as appropriate for the architecture.
For Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697v2), execute the following:
cp mpi_decompo_12x1.txt mpi_decompo.txt
mpiexec.hydra -n 12 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_ivb
For Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3), execute the following:
cp mpi_decompo_14x1.txt mpi_decompo.txt
mpiexec.hydra -n 14 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_hsw
For Intel® Xeon Phi™ Coprocessor 7120A, be sure to source the Intel Compilers and Intel MPI Library, then execute the following:
export I_MPI_MIC=enable
cp mpi_decompo_6x1.txt mpi_decompo.txt
mpiexec.hydra -n 6 -env OMP_NUM_THREADS 40 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc
For both Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697 v2) and Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), first source the Intel Compilers and Intel MPI Library, then execute the following:
export I_MPI_MIC=enable
Select the appropriate MPI fabric, for example:
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
cp mpi_decompo_18x1.txt mpi_decompo.txt
mpiexec.hydra -host `hostname` -n 12 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_ivb : -host `hostname`-mic0 -wdir `pwd` -n 6 -env OMP_NUM_THREADS 40 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc
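The same symmetric launch can be written with the rank split made explicit; the 12 host ranks plus 6 coprocessor ranks add up to the 18 tasks expected by mpi_decompo_18x1.txt (a sketch, with line continuations added only for readability):
HOST_RANKS=12    # MPI ranks on the Intel Xeon host
MIC_RANKS=6      # MPI ranks on the Intel Xeon Phi coprocessor (mic0)
mpiexec.hydra -host `hostname` -n ${HOST_RANKS} -env OMP_NUM_THREADS 4 \
  -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_ivb : \
  -host `hostname`-mic0 -wdir `pwd` -n ${MIC_RANKS} -env OMP_NUM_THREADS 40 \
  -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc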
For both Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3) and Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), first source the Intel Compilers and Intel MPI Library, then execute the following:
export I_MPI_MIC=enable
Select the appropriate MPI fabric, for example:
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
cp mpi_decompo_20x1.txt mpi_decompo.txt
mpiexec.hydra -host `hostname` -n 14 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_hsw : -host `hostname`-mic0 -wdir `pwd` -n 6 -env OMP_NUM_THREADS 40 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc
Performance and Optimizations
Note that currently the best performance for the HBM simulation code is achieved in symmetric mode with the aforementioned MPI ranks and OpenMP threads. This is still an ongoing project and further optimizations are in progress, so the run configuration is subject to change in the future.
Many technical reports have been published on HBM code optimizations. An upcoming book, High Performance Parallelism Pearls, dedicates a chapter to the subject, with a focus on the following:
- Code restructuring, such as data structures for better locality,
- Exploiting the available threads and SIMD lanes for better concurrency at the thread and loop level, and thus using the maximum available memory bandwidth on both Intel Xeon processors and Intel Xeon Phi coprocessors.
See the References and Resources section for more details.
About the Authors
Karthik Raman (Intel Corporation, United States)
Karthik is a Software Architect at Intel focusing on silicon performance analysis and optimization of HPC workloads for the Intel Xeon Phi coprocessor. His work covers optimal compiler code generation, vectorization, and assessment of key architectural features for performance. He helps deliver transformative methods and tools to expose new opportunities and insights.
Karthik received his Master of Science in Electrical Engineering (with a specialization in VLSI and computer architecture) from the University of Southern California, Los Angeles.
Jacob Weismann Poulsen (Danish Meteorological Institute, Denmark)
Jacob is an HPC and scientific programming consultant for the research departments at DMI. Jacob is noted for his expertise in analyzing and optimizing applications within the meteorological field.
Jacob received his Master of Science in Computer Science from the University of Copenhagen. He has over five years of teaching experience and has been at DMI for more than 15 years.
Per Berg (Danish Meteorological Institute, Denmark)
Per applies his mathematical modeling and scientific computing education to developing modeling software for applications in seismic exploration and, more recently, for water environments (estuaries, oceans). Working for both private companies and public institutes, Per has been involved in numerous projects that apply models to solve engineering and scientific problems.
Per received his Master of Engineering from the Technical University of Denmark. He has over 20 years of experience in the weather domain (water environment, ocean) and has worked at DMI for over 10 years.
References and Resources
[1] Per Berg and Jacob Weismann Poulsen. Implementation details for HBM. DMI Technical Report No. 12-11. Technical report, DMI, Copenhagen, 2012. http://beta.dmi.dk/fileadmin/Rapporter/TR/tr12-11.pdf
[2] Jacob Weismann Poulsen and Per Berg. More details on HBM – general modelling theory and survey of recent studies. DMI Technical Report No. 12-16. Technical report, DMI, Copenhagen, 2012. http://beta.dmi.dk/fileadmin/Rapporter/TR/tr12-16.pdf
[3] Jacob Weismann Poulsen and Per Berg. Thread scaling with HBM. DMI Technical Report No. 12-20. Technical report, DMI, Copenhagen, 2012. www.dmi.dk/fileadmin/user_upload/Rapporter/tr12-20.pdf
[4] Better Concurrency and SIMD on HBM, a chapter in High Performance Parallelism Pearls – Multicore and Many-core Programming Approaches (book to be published)