The MPI Performance Snapshot (MPS) is a scalable, lightweight performance tool for MPI applications. It collects a variety of MPI application statistics (such as communication, activity, and load balance) and presents them in an easy-to-read format. The tool is not available separately but is provided as part of the Intel® Trace Analyzer and Collector installation. This article serves as a quick getting-started guide.
The MPI Performance Snapshot addresses the following problems that arise when analyzing MPI applications scaled out to thousands of ranks:
- Cluster sizes continue to grow, so applications are being run at ever larger scale
- Profiling at larger scale collects large amounts of data, which can quickly become unmanageable
- With so much data gathered, it is hard to identify the key metrics to track
MPS combines lightweight statistics from the Intel® MPI Library with OS and hardware-level counters to provide a high-level categorization of your application: MPI vs. OpenMP load-imbalance information, memory usage, and a breakdown of MPI vs. computation vs. serial time.
Prerequisites:
- Intel® Compilers version 15.0.1 or higher: this provides accurate OpenMP runtime data
- Intel® MPI Library version 5.0.3 or higher: this provides accurate MPI runtime data
- Intel® Trace Analyzer and Collector version 9.0.3 or higher: the MPI Performance Snapshot is included in the package
Optional software:
- Performance Application Programming Interface (PAPI) library version 5.3.0 or higher: provides OS and hardware counters
Once you have all tools installed, make sure your environment is set up properly (assuming all Intel tools are available under /opt/intel):
# Environment setup for the Intel Compiler
$ source /opt/intel/composer_xe_2015/bin/compilervars.sh intel64

# Environment setup for the Intel MPI Library (xyz is the latest build number)
$ source /opt/intel/impi/5.1.0.xyz/bin64/mpivars.sh

# Environment setup for MPS (xyz is the latest build number)
$ source /opt/intel/itac/9.1.0.xyz/bin/mpsvars.sh
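As an optional sanity check, you can confirm that the tools now resolve to the sourced installations and that MPI can launch a trivial job (these two commands are our own addition, not part of the original walkthrough):

# Verify the tools on PATH come from the sourced installs
$ which mpirun mpiicc
# Launch a trivial single-process job to confirm MPI works
$ mpirun -n 1 hostname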
For this guide, we'll be using the Poisson application shipped with the Intel Trace Analyzer and Collector. Copy the <install_dir>/examples/poisson directory into your home directory and check the input file:
$ cp -R /opt/intel/itac/9.1.0.006/examples/poisson $HOME
$ cd $HOME/poisson
$ cat inp
3200 2 16
Now build the application and run it on your machine:
$ make
$ mpirun -mps -n 32 ./poisson
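The same launch line carries over to multi-node runs. A purely illustrative sketch (node1 and node2 are placeholder hostnames; -ppn sets the per-node process count):

# Hypothetical two-node run: 16 processes per node across node1 and node2
$ mpirun -mps -n 32 -ppn 16 -hosts node1,node2 ./poisson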
In both cases, the only additional option is the -mps flag, which loads the utilities MPS needs to track application metrics while the code runs. Once the run completes, two extra files are created:
- stats.txt contains native MPI statistics provided by the Intel MPI Library
- app_stat.txt contains expanded statistics provided by MPS
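You can peek at both files directly before any post-processing, for example:

# Show the first few lines of each statistics file
$ head stats.txt app_stat.txt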
While the raw data is human-readable, MPS does some extra post-processing, including analysis of all recorded metrics. The final step is to pass both files to the tool as arguments:
$ mps ./stats.txt ./app_stat.txt
Let's take a look at the output to dig deeper into the data:
| Summary information
|--------------------------------------------------------------------
  Application : ./poisson
  Number of ranks: 32
  Used statistics: app_stat.txt, stats.txt

| WallClock time :           6.37 sec
|   Total application lifetime. The time is elapsed time for the slowest process.
|   This metric is the sum of the MPI Time and the Computation time below.

| MPI Time:                  2.36 sec    37.56%
|   Time spent inside the MPI library. High values are usually bad.
|   This value is HIGH. The application is Communication-bound.
|   This might be caused by:
|     - High wait times inside the library - see the MPI Imbalance metric below.
|     - Active communications - see the diagrams 'MPI Time per Rank' (key '-m'
|       or '-m -D' for per MPI-function details) & 'Collective Operations Time
|       per Rank' (key '-t' or '-t -D' for per MPI-function details).
|     - Unoptimized settings of the MPI library. You can tune Intel(R) MPI
|       Library for your application and cluster configuration using the mpitune
|       utility available as part of the library package.

| MPI Imbalance:             2.34 sec    37.24%
|   Mean unproductive wait time per-process spent in the MPI library calls
|   when a process is waiting for data. This time is part of the MPI time
|   above. High values are usually bad.
|   This value is HIGH. The application workload is NOT well balanced
|   between MPI ranks.
|   For more details about the MPI communication scheme use Intel(R) Trace
|   Analyzer and Collector available as part of Intel(R) Parallel Studio
|   XE Cluster Edition.
...
MPS clearly shows what percentage of your wall-clock time is spent in MPI and tells you whether that's good or bad. A quick comparison shows that the time spent in MPI (2.36 sec) is almost entirely accounted for by the MPI imbalance value (2.34 sec). That means nearly all of the MPI time is unproductive waiting, and the application's load balance should be optimized.
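The summary text itself names the keys for deeper per-rank views; following the same invocation pattern as above:

# 'MPI Time per Rank' diagram, with per MPI-function details
$ mps -m -D ./stats.txt ./app_stat.txt
# 'Collective Operations Time per Rank' diagram, with per MPI-function details
$ mps -t -D ./stats.txt ./app_stat.txt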
The next step is to look at the most time-consuming MPI functions in the application by passing the -f flag to mps:
$ mps -f ./stats.txt ./app_stat.txt
| Function summary for all ranks
|----------------------------------------------------------------------
| Function      Time(sec)   Time(%)   Volume(MB)   Volume(%)     Calls
|----------------------------------------------------------------------
  SendRecv          50.37     59.96        39.06       99.96      9200
  Allreduce         24.20     28.81         0.01        0.03      1600
  Init               8.83     10.51         0.00        0.00        32
  Bcast              0.55      0.66         0.00        0.00        32
  Gather             0.05      0.06         0.00        0.01        32
  Finalize           0.00      0.00         0.00        0.00        32
|======================================================================
| TOTAL             84.01    100.00        39.08      100.00     10928
Based on this data, we should focus our optimization efforts on the SendRecv and Allreduce routines as potential hotspots.
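If library settings rather than the application itself turn out to be the limiter, the summary output above also points at the mpitune utility. A minimal sketch of an application-specific tuning session follows; the exact option names can vary between library versions, so verify them with mpitune --help before relying on this:

# Tune the library for this specific launch line and save the result (illustrative)
$ mpitune --application "mpirun -n 32 ./poisson" --output-file ./poisson.conf
# Re-run with the tuned settings applied
$ mpirun -tune ./poisson.conf -n 32 ./poisson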
A PDF version of this quick-start guide can be downloaded here: Tutorial: Analyzing MPI Applications with the MPI Performance Snapshot.
We also encourage you to check out the full MPI Performance Snapshot User's Guide - the latest version will always be posted on the Intel Trace Analyzer and Collector documentation page.
As always, we encourage your feedback. If you've used the tool and have run into problems, submit an issue at Intel® Premier Customer Support or let us know your general impressions over at the Intel® Clusters and HPC Technology forums.