NOAA NIM with Support for Intel® Xeon Phi™ Coprocessor

The Non-hydrostatic Icosahedral Model (NIM) is a weather forecasting model developed by NOAA. G6 K96 is a smaller data set that scales best up to 4 cluster nodes; G9 is useful for studying larger clusters. The code supports symmetric-mode operation, in which the Intel® Xeon® processor (referred to as ‘host’ in this document) and the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document) compute together, both on a single node and in a cluster environment.

To get access to the code:

Unzip the kit to a directory named nim/. The NIM benchmark contains five main subdirectories under the nim/ directory: xeonphinim_r2201/, data/, sms_r604/, F2C-ACC_V5.5/ and gptl_v5.0/. The xeonphinim_r2201/ subdirectory contains source code for NIM's dynamics core, along with scripts to build and run the NIM dynamics; data/ contains the input data sets used by the NIM dynamics core; sms_r604/ contains the Scalable Modeling System; F2C-ACC_V5.5/ contains the F2C-ACC code translator; and gptl_v5.0/ contains the General Purpose Timing Library.
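
A minimal sketch of unpacking the kit and checking its layout, assuming the kit is delivered as a zip archive (the archive name below is a placeholder, not the actual file name):

  mkdir -p nim && cd nim
  unzip /path/to/nim_kit.zip      # nim_kit.zip is a placeholder for the actual kit archive
  ls
  # expected: F2C-ACC_V5.5/  data/  gptl_v5.0/  sms_r604/  xeonphinim_r2201/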

Build Directions

(Disclaimer: this document describes how to build and run on a specific cluster environment referred to as Endeavor. You will need to adjust these instructions accordingly for your cluster environment.)

  1. Get an interactive compute node on Endeavor to build for Intel® Xeon® processors and for Intel® Xeon Phi™ coprocessors
    bsub -R '1*{select[kncc0] span[ptile=1]}' -q hoelleq -Is -W 700 \
      -l MIC_ULIMIT_STACKSIZE=365536 -l MIC_TIMINGDEVICE=tsc /bin/bash
  2. You will need the Intel® Composer XE 2013 (or newer) C/C++ and Fortran compilers and Intel® MPI Library 4.1.1 or newer.
    1. You can obtain Intel® Composer XE, which includes the Intel® C/C++ and Fortran Compilers, from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.
  3. Set environment variables for Intel Composer XE and Intel MPI Library.
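    A minimal sketch, assuming a default /opt/intel installation (the paths and version numbers are examples; adjust them to your system):
      source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
      source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh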
  4. All scripts mentioned here are provided in the NIM application kit.
  5. Building Libraries and Tools Used by NIM (SMS, GPTL, Ruby)
    NIM can run in symmetric mode. This means we need to compile both for the host and for the coprocessor.
    1. SMS
      1. Copy or soft-link one of the (GNU Make-compatible) profile* files from sms_r604/etc/ into the sms_r604/ directory and rename it "profile".
      2. Edit the profile file, supplying an appropriate value for each variable. The variables most likely to require custom definitions are CC, CCSERIAL, FC, FCSERIAL, FFLAGS, FIXED and FREE (the last two are the flags indicating Fortran fixed- or free-form source for the compiler indicated by the FC and FCSERIAL variables). The variable TREETOP refers to the sms_r604/ directory during the build.
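        An illustrative fragment of a host profile, assuming the Intel compilers and the Intel MPI wrapper scripts (the values are examples, not shipped defaults):
          CC       = mpiicc      # MPI C compiler wrapper
          CCSERIAL = icc         # serial C compiler
          FC       = mpiifort    # MPI Fortran compiler wrapper
          FCSERIAL = ifort       # serial Fortran compiler
          FFLAGS   = -O2
          FIXED    = -fixed      # ifort flag for fixed-form Fortran source
          FREE     = -free       # ifort flag for free-form Fortran source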
      3. In the sms_r604/ directory, type "make". If it succeeds, type "make install".
      4. If "make install" succeeds, new bin/, include/ and lib/ directories will have been created under the root-level sms/ (or sms_phi/, see below) directory.

      SMS must be built twice to support symmetric mode on machines with the coprocessor. In this case, first build for the host by starting with an etc/profile.* file that does not contain "phi" in the name. This will install SMS in the root-level sms/ directory. Then clean by executing "make distclean" and rebuild starting with an etc/profile.* file that does contain "phi" in the name. This will install the coprocessor SMS in the root-level sms_phi/ directory.
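
      A minimal sketch of the two-pass build, run from the sms_r604/ directory (the profile names are placeholders; pick real files from etc/ per the rules above):
        ln -sf etc/profile.<host-variant> profile    # host pass: a profile without "phi" in its name
        make && make install                         # installs into the root-level sms/ directory
        make distclean
        ln -sf etc/profile.<phi-variant> profile     # coprocessor pass: a profile with "phi" in its name
        make && make install                         # installs into the root-level sms_phi/ directory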

    2. GPTL

      NIM uses the GPTL timing library (jmrosinski.github.io/GPTL/) to measure wall-clock execution times for NIM components. We used v5.0, provided in gptl_v5.0/. The gptl_v5.0/INSTALL file includes instructions for building the library and installing it in the root-level gptl/ directory.

      GPTL must be built twice to support symmetric mode on machines with the coprocessor, such as TACC's Stampede system. In this case, first build for the host by starting with a macros.make.* file that does not contain "phi" in the name. For example, on TACC Stampede, if you do "cp jrmacros/macros.make.tacc ./macros.make", "make", and then "make install", it should work right out of the box with no modifications. This will install GPTL in the root-level gptl/ directory. Then "make clean" and rebuild starting with a macros.make.* file that does contain "phi" in the name. This will install the coprocessor GPTL in the root-level gptl_phi/ directory.
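
      A minimal sketch of the two-pass build, run from the gptl_v5.0/ directory (macros.make.tacc is the host example named above; the "phi" file name is a placeholder):
        cp jrmacros/macros.make.tacc ./macros.make            # host pass: no "phi" in the name
        make && make install                                  # installs into the root-level gptl/ directory
        make clean
        cp jrmacros/macros.make.<phi-variant> ./macros.make   # coprocessor pass: name contains "phi"
        make && make install                                  # installs into the root-level gptl_phi/ directory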

    3. Ruby

      The SMS source-to-source translator requires Ruby 1.9.x with YAML support. If your system does not have a suitable Ruby, you need to build it from source (a consolidated sketch follows the steps below):
      1. Obtain the LibYAML source from http://pyyaml.org/wiki/LibYAML. Version 0.1.4 is known to work. Unpack the archive, configure, make and make install.
      2. Obtain the Ruby source from http://ruby-lang.org. Version 1.9.3 is known to work. Unpack the archive, configure, make and make install. NOTE that you must use the --with-opt-dir= argument when running configure, providing as its value the path to your LibYAML installation from the previous step.
      3. Verify your installation
        $ /path/to/new/ruby -e "puts RUBY_VERSION" # should print 1.9.3 or similar
        $ /path/to/new/ruby -e "require 'yaml'"    # should return silently
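      A consolidated sketch of the three steps above, assuming the 0.1.4 and 1.9.3-p484 tarballs and install prefixes under $HOME (all names and paths here are examples):
        tar xzf yaml-0.1.4.tar.gz && cd yaml-0.1.4
        ./configure --prefix=$HOME/libyaml-0.1.4 && make && make install
        cd ..
        tar xzf ruby-1.9.3-p484.tar.gz && cd ruby-1.9.3-p484
        ./configure --prefix=$HOME/ruby-1.9.3 --with-opt-dir=$HOME/libyaml-0.1.4
        make && make install
        $HOME/ruby-1.9.3/bin/ruby -e "puts RUBY_VERSION"   # verify, as in step 3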
  6. Building NIM
    Go to the xeonphinim_r2201/src directory.
    1. Intel® Xeon Phi™ Coprocessor build
      1. Open macros.make.endeavorxeonphi.
        Make sure SMS__RUBY and SMS point to the correct directories, for example:
        export SMS__RUBY=/panfs/users/Xjrosin/ruby-1.9.3-p484-install/bin/ruby
        SMS = /home/Xjrosin/sms_r604/xeonphi
      2. At the command prompt, type: ./makenim -i
        Then select:
        1. arch: endeavorxeonphi
        2. Set underlying hardware: cpu
        3. Set parallelism: parallel
        4. Set threading: yes
        5. Set bitwise: no
        6. Set double_precision: false
      3. Two directories are created in xeonphinim_r2201/:
        1. run_endeavorxeonphi_cpu_parallel_ompyes_bitwiseno
        2. src_endeavorxeonphi_cpu_parallel_ompyes_bitwiseno
    2. Intel® Xeon® Processor build
      1. Open macros.make.endeavor
        Make sure SMS__RUBY and SMS point to the correct directories, for example:
        export SMS__RUBY=/panfs/users/Xjrosin/ruby-1.9.3-p484-install/bin/ruby
        SMS = /home/Xjrosin/sms_r604/xeon
      2. At the command prompt, type: ./makenim -i
        Then select:
        1. arch: endeavor
        2. Set underlying hardware: cpu
        3. Set parallelism: parallel
        4. Set threading: yes
        5. Set bitwise: no
        6. Set double_precision: false
      3. Two directories are created in xeonphinim_r2201/:
        1. run_endeavor_cpu_parallel_ompyes_bitwiseno
        2. src_endeavor_cpu_parallel_ompyes_bitwiseno
  7. Running NIM for the G6 data set

    cd run_endeavor_cpu_parallel_ompyes_bitwiseno

    1. Create NIMnamelist for different configurations (a note on how ComputeTasks relates to node count follows the configurations below)
      1. General Settings
        1. MaxQueueTime = '30' ! Run time for the complete job (HH:MM:SS)
        2. DataDir = "/home/ajha1/panfs_users_ajha1/workloads/NIM/nim2/nimdata/G6"
          Point to your G5/G6/G9 data location
        3. G6 specific
          &CNTLnamelist
          glvl= 6   ! Grid level
          gtype= 2   ! Grid type: Standard recursive (0), Modified recursive (2), Modified great circle (3)
          SubdivNum= 2 2 2 2 2 2 2 2 2 2 2 2   ! subdivision number for each recursive refinement
          nz= 96   ! Number of vertical levels
          ArchvTimeUnit= 'ts'   ! ts:timestep; hr:hour dy:day
          itsbeg= 1
          RestartBegin= 0   ! Begin restart if .ne.0
          ForecastLength= 100   ! Total number of timesteps (100/day),(2400/hr),(2400*60/min), (9600/ts)
          ArchvIntvl= 100   ! Archive interval (in ArchvTimeUnit) to do output  (10-day), (240-hr), (240*60-min), (960-ts)
          minmaxPrintInterval= 100   ! Interval to print out MAXs and MINs
          physics= 'none'   ! GRIMS or GFS or none for no physics
          GravityWaveDrag= .true.   ! True means calculate gravity wave drag
          yyyymmddhhmm= "200707170000"   ! Date of the model run
          pertlim= 0.   ! Perturbation bound for initial temperature (1.e-7 is good for 32-bit roundoff)
          curve= 3   ! 0: ij order, 1: Hilbert curve order (only for all-bisection refinement), 2:ij block order, 3: Square Layout
          NumCacheBlocksPerPE= 1   ! Number of cache blocks per processor. Only applies to ij block order
          tiles_per_thread= 1   ! multiplies OMP_NUM_THREADS to give num_tiles for GRIMS
          dyn_phy_barrier= .false.   ! artificial barrier before and after physics for timing
          read_amtx= .false.   ! false means NIM computes amtx* arrays (MUCH faster than reading from a file!)
          writeoutput= .false.
        4. ComputeTasks = 1         ! Compute tasks for NIM (set to 1 for serial)
        5. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 240
      2. 2-Node Host
        1. ComputeTasks = 2         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 240
      3. 4-Node Host
        1. ComputeTasks = 4    ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
              cpu_cores_per_node = 48
              max_compute_tasks_per_node = 1
              omp_threads_per_compute_task = 48
              num_write_tasks = 0
              max_write_tasks_per_node = 1
              root_own_node = .false.
              icosio_debugmsg_on = .false.
              max_compute_tasks_per_mic = 0
              omp_threads_per_mic_mpi_task = 240
      4. 1-Node Coprocessor
        1. ComputeTasks = 1      ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 240
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 240
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 0
      5. 2-Node Coprocessor
        1. ComputeTasks = 2         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 240
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 240
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 0
      6. 4-Node Coprocessor
        1. ComputeTasks = 4         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 240
             max_compute_tasks_per_node = 1
             omp_threads_per_compute_task = 240
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 0
             omp_threads_per_mic_mpi_task = 0
      7. 1-Node SYMMETRIC
        1. ComputeTasks = 2         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 2
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 1
             omp_threads_per_mic_mpi_task = 240
      8. 2-Node SYMMETRIC
        1. ComputeTasks = 4         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 2
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 1
             omp_threads_per_mic_mpi_task = 240
      9. 4-Node SYMMETRIC
        1. ComputeTasks = 8         ! Compute tasks for NIM (set to 1 for serial)
        2. Number of Cores available to MPI
          &TASKnamelist
             cpu_cores_per_node = 48
             max_compute_tasks_per_node = 2
             omp_threads_per_compute_task = 48
             num_write_tasks = 0
             max_write_tasks_per_node = 1
             root_own_node = .false.
             icosio_debugmsg_on = .false.
             max_compute_tasks_per_mic = 1
             omp_threads_per_mic_mpi_task = 240
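      A note on task counts (an observation from the configurations above, not an additional setting): in each configuration, ComputeTasks equals the number of nodes multiplied by max_compute_tasks_per_node. For example, the 2-node symmetric run uses 2 nodes x 2 compute tasks per node (one host MPI rank plus one coprocessor MPI rank), giving ComputeTasks = 4.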
    2. Running
      1. Copy the NIMnamelist for the desired cluster configuration (1, 2, or 4 nodes) to NIMnamelist
      2. Run on Host
        1. Execute ./endeavorsubnim.host
      3. Run on Coprocessor
        1. Execute ./endeavorsubnim.mic
      4. Run SYMMETRIC
        1. Execute ./endeavorsubnim.symm
    3. Results
      1. Execution of the above script will create a directory such as
        G6_K96_NONE_P1_44452
      2. cd to that directory; it contains run files such as
        1. stdout – standard output file; check it for any errors
        2. nodeconfig.ivb.txt – MPI node configuration file
        3. NIMnamelist – the run configuration file generated above
        4. taskinfo.yaml – MPI task information
        5. timing.0 – timing for MPI rank 0
        6. timing.summary – timing summary for the overall run
          1. Grab the wallmax for MainLoop – this is your application runtime (a one-liner for extracting it follows the sample output below)
    name      ncalls  nranks  mean_time  std_dev   wallmax (rank thread)   wallmin (rank thread)
    Total          1       1    102.552    0.000   102.552 (    0     0)   102.552 (    0     0)
    MainLoop       1       1     68.203    0.000    68.203 (    0     0)    68.203 (    0     0)
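
    A minimal way to pull the MainLoop wallmax out of timing.summary, assuming the column layout shown above (wallmax is the sixth whitespace-separated field):
      grep MainLoop timing.summary | awk '{print $6}'   # prints 68.203 for the sample above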
