Purpose
This article provides code access, build, and run directions for the Baffin-Bay test cases of the HBM code, running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.
Introduction
The Danish Meteorological Institute (DMI) operates HBM, a regional 3D ocean model for the North Sea and Baltic Sea, to provide information about the physical state of Danish and nearby waters in the near future.
The origin of the HBM code dates back to the BSHcmod ocean circulation model, whose development was initiated in the 1990s at the Bundesamt für Seeschifffahrt und Hydrographie (BSH) in Hamburg, Germany; one branch of the model code has been maintained and further developed at DMI. The DMI implementation has undergone extensive revision, in cooperation between DMI, BSH, FMI (Finnish Meteorological Institute), and MSI (Marine Systems Institute, Estonia).
The HBM code started from a snapshot of the DMIcmod repository in 2009. The current implementation is as close to a total rewrite of the original BSHcmod as one can get without formally being one, and an attempt has been made to merge the model code development into a common project, with the intention of joining the forces of experienced modelers from around the Baltic Sea. This model code project carries the name HBM, the HIROMB-BOOS (Baltic Operational Oceanographic System) Model.
HBM calculates sea level, water temperature, salinity, ocean currents, and ice thickness and concentration at 15-minute intervals. Temperature, salinity, and currents are calculated for the entire water column, from top to bottom, at approximately 50 fixed-depth levels. Sea level at selected coastal points is calculated at 10-minute time resolution. HBM runs every six hours (four times per day), and each run produces a five-day forecast of the three-dimensional state of the ocean.
HBM is not an open-source or community model code, so the full HBM results are not shared in this code recipe. However, the advection routines (part of the HBM code) will be made available under a BSD license later this year; the results will be updated once that code is released.
Code Access
Jacob Weismann Poulsen and Per Berg from DMI are the main developers and maintainers of the HBM code. As stated above, HBM is not an open-source or community model code; individuals and third parties can gain access to the code by signing a Memorandum of Understanding (MoU).
To get access to the code and test cases, please contact Per Berg (per@dmi.dk) or Jacob Weismann Poulsen (jwp@dmi.dk).
Build Directions
The build system uses autoconf. The configure step sets mandatory compiler settings and distinguishes between the different build incarnations. If you do not specify any configure options, you get a serial build.
Note that configure generates a file called Makefile.include containing the compiler flag settings. To change compiler flags, you can rerun configure, adjust the generated Makefile.include file, or pass the new flags directly to make, for example make FCFLAGS="-O3".
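If you want to experiment with different flags, the three mechanisms above can be used as follows (a minimal sketch; the -O2 value is only an illustration, not a recommended setting):
# 1. Rerun configure with the new flags
FC="mpiifort" FCFLAGS='-O2' ./configure --enable-openmp --enable-mpi --enable-contiguous && make
# 2. Edit the generated Makefile.include by hand, then rebuild
${EDITOR:-vi} Makefile.include && make
# 3. Override the flags on the make command line for a single build
make FCFLAGS='-O2'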
All the following build examples are written in BASH syntax. Each example generates a single binary called cmod, which you can rename as needed for a specific architecture.
Configuration Examples
--enable-contiguous
--enable-openmp
--enable-mpi
--enable-openmp --enable-mpi --enable-contiguous
--enable-openacc --enable-contiguous
--enable-openacc --enable-mpi
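As a minimal sketch of how these option sets are used (the full per-architecture commands are given in the Build section below):
# Pure OpenMP build; configure with no options at all gives a serial build
./configure --enable-openmp && make
# Hybrid MPI + OpenMP build with contiguous data structures
FC="mpiifort" ./configure --enable-openmp --enable-mpi --enable-contiguous && make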
Build
The following examples illustrate the build commands for different Intel Xeon processors. Choose the command for the processor you are building for.
NOTE: In the following code examples, each command is a single command line, even where it wraps across several display lines.
Decompress the source files.
tar -zxvf hbm-<version>.tar.gz
cd hbm-<version>
Source the Intel® compilers and Intel® MPI Library. The following example is for the Intel® Compiler (14.0.2 20140120) and Intel® MPI Library (4.1.3.049); use the line that matches the type of processor or coprocessor support you need. (The parenthetical comments are not part of the commands.)
source /opt/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.3.049/intel64/bin/mpivars.sh (for Intel® Xeon® processor)
source /opt/intel/impi/4.1.3.049/mic/bin/mpivars.sh (for Intel® Xeon Phi™ Coprocessor)
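As a quick sanity check (not part of the original recipe), you can verify that the intended Intel MPI Fortran wrapper is picked up by the environment:
# Confirm the mpiifort wrapper is on PATH and report the underlying compiler version
which mpiifort
mpiifort -v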
To build for the Intel® Xeon® processor v2 family (formerly codenamed Ivy Bridge):
FC="mpiifort" FCFLAGS='-O3 -xCORE-AVX-I' ./configure --enable-openmp --enable-mpi --enable-contiguous && make
To build for the Intel® Xeon® processor v3 family (formerly codenamed Haswell):
FC="mpiifort" FCFLAGS='-O3 -xCORE-AVX2' ./configure --enable-openmp --enable-mpi --enable-contiguous && make
To build for the Intel Xeon Phi Coprocessor:
FC="mpiifort" FCFLAGS='-O3 -mmic -fimf-precision=low -fimf-domain-exclusion=15 -opt-streaming-stores always -opt-streaming-cache-evict=0' ./configure --enable-openmp --enable-mpi --enable-contiguous --host=x86_64-k1om-linux --build=x86_64-unknown-linux && make
Compiler Flags Used
Compiler Flag | Effect |
---|---|
-O3 | Optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs |
-xCORE-AVX-I | May generate Intel® Advanced Vector Extensions (Intel® AVX), including instructions in Intel® Core 2™ processors in process technology smaller than 32nm, Intel® SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel® processors |
-xCORE-AVX2 | May generate Intel® Advanced Vector Extensions 2 (Intel® AVX2), Intel AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors |
-mmic | Build an application that runs natively on Intel® Many Integrated Core Architecture (Intel® MIC Architecture) |
-fimf-precision=low | Equivalent to accuracy-bits = 11 (single precision); accuracy-bits = 26 (double precision) |
-fimf-domain-exclusion=15 | Indicates the domain on which a function is evaluated |
-opt-streaming-stores always | Enables generation of streaming stores under the assumption that the application is memory bound |
-opt-streaming-cache-evict=0 | Turns off generation of the cache line eviction instructions when streaming loads/stores are used (specific to Intel MIC Architecture) |
Run Directions
The Test Cases
Two kinds of test cases are provided: irregular test cases and regular cubic test cases. The irregular Baffin-Bay test cases are based on bathymetry data from ETOPO2 and contain two binary files, nidx.bin and hz.bin.
The regular cubic test cases are completely specified by options.nml, where the three dimensions of the cube are set as follows:
&optionlist
  mi = 100
  mj = 100
  mk = 50
  cube = .true.
Running with MPI
To run with more than one task, use the correct decomposition file. Assuming you want to run with ${MPITASKS} tasks, execute the following:
cp mpi_decompo_${MPITASKS}x1.txt mpi_decompo.txt
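For example, a sketch of a hybrid run with 12 MPI tasks and 4 OpenMP threads per task (the mpiexec.hydra options mirror the run commands later in this article):
MPITASKS=12
cp mpi_decompo_${MPITASKS}x1.txt mpi_decompo.txt
mpiexec.hydra -n ${MPITASKS} -env OMP_NUM_THREADS 4 -env I_MPI_PIN_DOMAIN omp ./cmod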
Generating New MPI Decompositions
To generate MPI decompositions, you first need to build the code without MPI support, i.e. without --enable-mpi. Follow the build directions above, but omit the MPI option. For instance:
./configure && make
Then add the four lines below to the options list in options.nml:
decomp_version = 1
nproci = 1
nprocj = 10
only_islices = .true.
Here nprocj is the number of decompositions that will be generated; the above setting generates 10 files, mpi_decompo_Nx1.txt, where N runs from 1 to 10.
Then run the cmod binary; it generates the mpi_decompo_Nx1.txt files and exits:
./cmod
You only need to do this once on the host (for example, a system with the Intel Xeon processor v2 family); the same decompositions can be used for the other architectures.
The above steps produce several decompositions, not all of which are needed. When there are multiple candidates for a given number of tasks, we want to select the best one.
To select the appropriate decomposition files, execute the following command in the location where the MPI decomposition files were generated. This creates a NEW directory with the right decomposition files.
./select_mpi_decompo_from_pruned_gen.sh
You can now use the generated decomposition files together with a build that includes MPI support, as explained above.
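The whole generation workflow can be summarized in a short sketch (the options.nml edit is done by hand; the file and script names are taken from the steps above):
# 1. Build without MPI support
./configure && make
# 2. Edit options.nml by hand: add decomp_version, nproci, nprocj, and only_islices inside the &optionlist group, as shown above
# 3. Run cmod; it writes mpi_decompo_Nx1.txt for N = 1..nprocj and exits
./cmod
# 4. Copy the best candidate decompositions into the NEW directory
./select_mpi_decompo_from_pruned_gen.sh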
Running the Baffin-Bay Test Cases
In the following instructions, select the run command appropriate for the processor you are using.
Move to the BaffinBay directory:
cd BaffinBay_*
Be sure to run the appropriate binary for each architecture.
Source the Intel Compilers and Intel MPI Library as appropriate for the architecture.
For Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697v2), execute the following:
cp mpi_decompo_12x1.txt mpi_decompo.txt
mpiexec.hydra -n 12 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_ivb
For Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3), execute the following:
cp mpi_decompo_14x1.txt mpi_decompo.txt
mpiexec.hydra -n 14 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_hsw
For Intel® Xeon Phi™ Coprocessor 7120A, be sure to source the Intel Compilers and Intel MPI Library, then execute the following:
export I_MPI_MIC=enable
cp mpi_decompo_6x1.txt mpi_decompo.txt
mpiexec.hydra -n 6 -env OMP_NUM_THREADS 40 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc
For both Intel Xeon processor v2 family (e.g. Intel® Xeon® processor E5-2697 v2) and Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), first source the Intel Compilers and Intel MPI Library, then execute the following:
export I_MPI_MIC=enable
Select the appropriate MPI fabric, for example:
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
cp mpi_decompo_18x1.txt mpi_decompo.txt
mpiexec.hydra -host `hostname` -n 12 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_ivb : -host `hostname`-mic0 -wdir `pwd` -n 6 -env OMP_NUM_THREADS 40 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc
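The same symmetric launch can be written with the rank split made explicit; the 12 host ranks plus 6 coprocessor ranks add up to the 18 tasks expected by mpi_decompo_18x1.txt (a sketch, with line continuations added only for readability):
HOST_RANKS=12    # MPI ranks on the Intel Xeon host
MIC_RANKS=6      # MPI ranks on the Intel Xeon Phi coprocessor (mic0)
mpiexec.hydra -host `hostname` -n ${HOST_RANKS} -env OMP_NUM_THREADS 4 \
  -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_ivb : \
  -host `hostname`-mic0 -wdir `pwd` -n ${MIC_RANKS} -env OMP_NUM_THREADS 40 \
  -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc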
For both Intel Xeon processor v3 family (e.g. Intel® Xeon® processor E5-2697 v3) and Intel Xeon Phi Coprocessor 7120A (Symmetric Mode), first source the Intel Compilers and Intel MPI Library, then execute the following:
export I_MPI_MIC=enable
Select the appropriate MPI fabric, for example:
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u
cp mpi_decompo_20x1.txt mpi_decompo.txt
mpiexec.hydra -host `hostname` -n 14 -env OMP_NUM_THREADS 4 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_hsw : -host `hostname`-mic0 -wdir `pwd` -n 6 -env OMP_NUM_THREADS 40 -env KMP_AFFINITY compact,verbose -env I_MPI_PIN_DOMAIN omp ./cmod_knc
Performance and Optimizations
Note that currently the best performance for the HBM simulation code is achieved in symmetric mode with the aforementioned MPI ranks and OpenMP threads. This is still an ongoing project and further optimizations are in progress, so the run configuration is subject to change in the future.
Many technical reports have been published on HBM code optimizations. An upcoming book, High Performance Parallelism Pearls, dedicates a chapter to the subject, with a focus on the following:
- Code restructuring, such as data structures for better locality,
- Exploiting the available threads and SIMD lanes for better concurrency at the thread and loop level, and thus using the maximum available memory bandwidth on both Intel Xeon processors and Intel Xeon Phi coprocessors.
See the References and Resources section for more details.
About the Authors
Karthik Raman (Intel Corporation, United States)
Karthik is a Software Architect at Intel focusing on silicon performance analysis and optimization of HPC workloads for the Intel Xeon Phi coprocessor. His work covers optimal compiler code generation, vectorization, and assessment of key architectural features for performance. He helps deliver transformative methods and tools to expose new opportunities and insights.
Karthik received his Master of Science in Electrical Engineering (with a specialization in VLSI and computer architecture) from the University of Southern California, Los Angeles.
Jacob Weismann Poulsen (Danish Meteorological Institute, Denmark)
Jacob is an HPC and scientific programming consultant for the research departments at DMI. Jacob is noted for his expertise in analyzing and optimizing applications within the meteorological field.
Jacob received his Master of Science in Computer Science from the University of Copenhagen. He has over five years of teaching experience and has been at DMI for more than 15 years.
Per Berg (Danish Meteorological Institute, Denmark)
Per applies his mathematical modeling and scientific computing education to developing modeling software for applications in seismic exploration and, more recently, for water environments (estuaries, oceans). Working for both private companies and public institutes, Per has been involved in numerous projects that apply models to solve engineering and scientific problems.
Per received his Master of Engineering from the Technical University of Denmark. He has over 20 years of experience in the weather domain (water environment, ocean) and has worked at DMI for over 10 years.
References and Resources
[1] Per Berg and Jacob Weismann Poulsen. Implementation details for HBM. DMI Technical Report No. 12-11. Technical report, DMI, Copenhagen, 2012. http://beta.dmi.dk/fileadmin/Rapporter/TR/tr12-11.pdf
[2] Jacob Weismann Poulsen and Per Berg. More details on HBM – general modelling theory and survey of recent studies. DMI Technical Report No. 12-16. Technical report, DMI, Copenhagen, 2012. http://beta.dmi.dk/fileadmin/Rapporter/TR/tr12-16.pdf
[3] Jacob Weismann Poulsen and Per Berg. Thread scaling with HBM. DMI Technical Report No. 12-20. Technical report, DMI, Copenhagen, 2012. www.dmi.dk/fileadmin/user_upload/Rapporter/tr12-20.pdf
[4] Better Concurrency and SIMD on HBM, a chapter in High Performance Parallelism Pearls – Multicore and Many-core Programming Approaches (book to be published)