The MPI Performance Snapshot (MPS) is a scalable, lightweight performance tool for MPI applications. It collects a variety of MPI application statistics (such as communication, activity, and load balance) and presents them in an easy-to-read format. The tool is not available separately but is provided as part of the Intel® Trace Analyzer and Collector installation. This article serves as a quick getting-started guide.
The MPI Performance Snapshot addresses the following problems that arise when analyzing MPI applications scaled out to thousands of ranks:
- Cluster sizes continue to grow, so applications are being run at ever larger scale
- Profiling at larger scale collects large amounts of data, which can quickly become unmanageable
- With so much data gathered, it is hard to identify the key metrics to track
MPS combines lightweight statistics from the Intel® MPI Library with OS and hardware-level counters to provide a high-level categorization of your application: MPI vs. OpenMP load-imbalance information, memory usage, and a breakdown of MPI vs. computation vs. serial time.
Prerequisites:
- Intel® Compilers version 15.0.1 or higher: this provides accurate OpenMP runtime data
- Intel® MPI Library version 5.0.3 or higher: this provides accurate MPI runtime data
- Intel® Trace Analyzer and Collector version 9.0.3 or higher: the MPI Performance Snapshot is included in the package
Optional software:
- Performance Application Programming Interface (PAPI) library version 5.3.0 or higher: provides OS and hardware counters
Once you have all tools installed, make sure your environment is set up properly (assuming all Intel tools are available under /opt/intel):
# Environment setup for the Intel Compiler
$ source /opt/intel/composer_xe_2015/bin/compilervars.sh intel64

# Environment setup for the Intel MPI Library (xyz is the latest build number)
$ source /opt/intel/impi/5.1.0.xyz/bin64/mpivars.sh

# Environment setup for MPS (xyz is the latest build number)
$ source /opt/intel/itac/9.1.0.xyz/bin/mpsvars.sh
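As an optional sanity check, you can confirm that the tools now resolve to the sourced installations and that MPI can launch a trivial job (these two commands are our own addition, not part of the original walkthrough):

# Verify the tools on PATH come from the sourced installs
$ which mpirun mpiicc
# Launch a trivial single-process job to confirm MPI works
$ mpirun -n 1 hostname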
For this guide, we'll be using the Poisson application shipped with the Intel Trace Analyzer and Collector. Copy the <install_dir>/examples/poisson directory into your home directory and check the input file:
$ cp -R /opt/intel/itac/9.1.0.006/examples/poisson $HOME
$ cd $HOME/poisson
$ cat inp
3200 2 16
Now build the application and run it on your machine:
$ make
$ mpirun -mps -n 32 ./poisson
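The same launch line carries over to multi-node runs. A purely illustrative sketch (node1 and node2 are placeholder hostnames; -ppn sets the per-node process count):

# Hypothetical two-node run: 16 processes per node across node1 and node2
$ mpirun -mps -n 32 -ppn 16 -hosts node1,node2 ./poisson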
In both cases, the only additional option is the -mps flag, which loads the utilities MPS needs to track application metrics while the code runs. Once the run completes, two extra files are created:
- stats.txt contains native MPI statistics provided by the Intel MPI Library
- app_stat.txt contains expanded statistics provided by MPS
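You can peek at both files directly before any post-processing, for example:

# Show the first few lines of each statistics file
$ head stats.txt app_stat.txt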
While the raw data is human-readable, MPS does some extra post-processing, including analysis of all recorded metrics. The final step is to pass both files to the tool as arguments:
$ mps ./stats.txt ./app_stat.txt
Let's take a look at the output to dig deeper into the data:
| Summary information
|--------------------------------------------------------------------
  Application : ./poisson
  Number of ranks: 32
  Used statistics: app_stat.txt, stats.txt

| WallClock time :           6.37 sec
|   Total application lifetime. The time is elapsed time for the slowest process.
|   This metric is the sum of the MPI Time and the Computation time below.

| MPI Time:                  2.36 sec    37.56%
|   Time spent inside the MPI library. High values are usually bad.
|   This value is HIGH. The application is Communication-bound.
|   This might be caused by:
|     - High wait times inside the library - see the MPI Imbalance metric below.
|     - Active communications - see the diagrams 'MPI Time per Rank' (key '-m'
|       or '-m -D' for per MPI-function details) & 'Collective Operations Time
|       per Rank' (key '-t' or '-t -D' for per MPI-function details).
|     - Unoptimized settings of the MPI library. You can tune Intel(R) MPI
|       Library for your application and cluster configuration using the mpitune
|       utility available as part of the library package.

| MPI Imbalance:             2.34 sec    37.24%
|   Mean unproductive wait time per-process spent in the MPI library calls
|   when a process is waiting for data. This time is part of the MPI time
|   above. High values are usually bad.
|   This value is HIGH. The application workload is NOT well balanced
|   between MPI ranks.
|   For more details about the MPI communication scheme use Intel(R) Trace
|   Analyzer and Collector available as part of Intel(R) Parallel Studio
|   XE Cluster Edition.
...
MPS clearly shows what percentage of your wall-clock time is spent in MPI and tells you whether that's good or bad. A quick comparison shows that the time spent in MPI (2.36 sec) is almost entirely accounted for by the MPI imbalance value (2.34 sec). That means nearly all of the MPI time is unproductive waiting, and the application's load balance should be optimized.
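The summary text itself names the keys for deeper per-rank views; following the same invocation pattern as above:

# 'MPI Time per Rank' diagram, with per MPI-function details
$ mps -m -D ./stats.txt ./app_stat.txt
# 'Collective Operations Time per Rank' diagram, with per MPI-function details
$ mps -t -D ./stats.txt ./app_stat.txt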
The next step is to look at the most time-consuming MPI functions in the application by passing the -f flag to mps:
$ mps -f ./stats.txt ./app_stat.txt
| Function summary for all ranks
|----------------------------------------------------------------------
| Function      Time(sec)   Time(%)   Volume(MB)   Volume(%)     Calls
|----------------------------------------------------------------------
  SendRecv          50.37     59.96        39.06       99.96      9200
  Allreduce         24.20     28.81         0.01        0.03      1600
  Init               8.83     10.51         0.00        0.00        32
  Bcast              0.55      0.66         0.00        0.00        32
  Gather             0.05      0.06         0.00        0.01        32
  Finalize           0.00      0.00         0.00        0.00        32
|======================================================================
| TOTAL             84.01    100.00        39.08      100.00     10928
Based on this data, we should focus our optimization efforts on the SendRecv and Allreduce routines as potential hotspots.
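If library settings rather than the application itself turn out to be the limiter, the summary output above also points at the mpitune utility. A minimal sketch of an application-specific tuning session follows; the exact option names can vary between library versions, so verify them with mpitune --help before relying on this:

# Tune the library for this specific launch line and save the result (illustrative)
$ mpitune --application "mpirun -n 32 ./poisson" --output-file ./poisson.conf
# Re-run with the tuned settings applied
$ mpirun -tune ./poisson.conf -n 32 ./poisson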
A PDF version of this quick-start guide can be downloaded here: Tutorial: Analyzing MPI Applications with the MPI Performance Snapshot.
We also encourage you to check out the full MPI Performance Snapshot User's Guide - the latest version will always be posted on the Intel Trace Analyzer and Collector documentation page.
As always, we encourage your feedback. If you've used the tool and have run into problems, submit an issue at Intel® Premier Customer Support or let us know your general impressions over at the Intel® Clusters and HPC Technology forums.