
Using Intel® VTune™ Amplifier on Cray* XC systems


Introduction

The goal of this article is to describe in detail how to install VTune Amplifier and how to use it for application performance analysis, both of which are somewhat specific to Cray’s programming environment (PE). We will refer to CLE 6.0, the Cray installation and configuration model for software on Cray XC systems [1]. The installation part of the article targets site administrators and system support staff responsible for the Cray XC programming environment, while the data collection and analysis part is intended for Cray XC system users.

 

Installation 

Cray CLE 6.0 provides a set of compilers, performance analysis tools and run-time libraries, including the Intel Compiler and the Intel MPI Library. However, VTune Amplifier is not part of it and requires additional effort to install into the programming environment.

According to the Cray CLE 6.0 documentation [2], installation of additional software into a PE image root is performed on the system's System Management Workstation (SMW). The PE image root is then pushed to the boot node so that it can be mounted by a group of Data Virtualization Service (DVS) servers and then mounted on the system's login and compute nodes.

Cray presents the PE image root model as system and hardware agnostic: the same PE image root can also be used for other systems, such as eLogin systems or another Cray XC. A feature of Image Management and Provisioning System (IMPS) images is that they are easily "cloned" by leveraging rpm and zypper. This allows a site to test new PE releases and also makes reverting to previous PE releases easier. However, the sampling driver part of the VTune installation is not system agnostic and must be built against the exact supported Linux kernel used on the nodes where data is collected. This will be shown later in the example.

VTune Amplifier is installed on the SMW by using chroot to access the PE image root. You need to copy the VTune installation package to the PE image root, run the VTune installation procedure and create a VTune modulefile.

The Craypkg-gen tool is used to generate a modulefile so that third-party programming software like VTune can be used in the same manner as the components of the Cray Programming Environment. Before running it you need to define the USER_INSTALL_DIR environment variable, which for VTune would be /opt/intel.
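For example, on the SMW inside the PE image root:

smw # export USER_INSTALL_DIR=/opt/intel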

The Craypkg-gen ‘-m’ option will create the modulefile:

$ craypkg-gen -m $USER_INSTALL_DIR/vtune_amplifier_xe_2017.0.2.478468

The ‘-m’ option also creates a set_default script that will make the associated modulefile the default version that is used by the module command. For this example, the following set_default script was created:

$USER_INSTALL_DIR/admin-pe/set_default_craypkg/set_default_vtune_amplifier_xe_2017.0.2.478468

Executing the generated set_default script will result in a “module load vtune” loading the vtune_amplifier_xe/2017.0.2.478468 modulefile.
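As a sketch of this step (the paths follow the example above; the module becomes available to users once the PE image is provisioned to the nodes):

smw # $USER_INSTALL_DIR/admin-pe/set_default_craypkg/set_default_vtune_amplifier_xe_2017.0.2.478468

and later, on a login node:

$ module load vtune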

 

Example of installing VTune Amplifier 2017

With the CLE 6.0 Programming Environment software installed onto a PE image root, download the Intel VTune Amplifier 2017 package and copy it to the PE image root:

smw # export PECOMPUTE=/var/opt/cray/imps/image_roots/<pe_compute_cle_6.0_imagename>
smw # cp vtune_amplifier_xe_2017_update1.tar.gz $PECOMPUTE/var/tmp

Note that instead of the standalone VTune installation package you may have the whole Intel Parallel Studio XE package, for example parallel_studio_xe_2016_update1.tgz. In that case the installation differs only in that you select the VTune component.

If you are not using a FlexLm license server, which requires a certain configuration, copy a registered license file to the PE image for interactive installation:

smw # cp l_vtune_amplifier_xe_2017_p.lic $PECOMPUTE/var/tmp

Or copy the license file to the default Intel licenses directory:

smw # cp l_vtune_amplifier_xe_2017_p.lic $PECOMPUTE/opt/intel/licenses

Perform a chroot to PE image:

smw # chroot $PECOMPUTE

Untar the VTune Amplifier package:

smw # cd /var/tmp
smw # tar xzvf vtune_amplifier_xe_2017_update1.tar.gz

By default, the VTune installer is interactive and requires the administrator to respond to prompts. You might want to consult with the Intel® VTune™ Amplifier XE Installation Guide before proceeding.

smw # cd vtune_amplifier_xe_2017_update1/
smw # ./install.sh

Follow the command prompts to install the product.

If you need a non-interactive VTune installation, refer to the Automated Installation of Intel® VTune™ Amplifier XE help article.

 

Once the installer flow reaches the sep driver installation, you can either postpone that step or provide a path to the sources of the Linux kernel that runs on the Cray compute nodes.

Note: the Cray SMW 8.0 is based on the SLES 12 system, which might not be the same as the one running on the compute nodes. In this case you need to provide a path to the target OS kernel headers when requested by the VTune installer.

In the case of postponed driver installation, go through the following steps (assuming that the compute node Linux kernel sources are unpacked to /usr/src/target_linux).

Use the GCC environment for building:

smw #  module swap PrgEnv-cray PrgEnv-gnu

Set the CC environment variable so that 'cc' is used as the compiler:

smw # export CC=cc

Build the drivers (two kernel drivers will be built):

smw # cd vtune_amplifier_xe_2017/sepdk/src
smw # ./build-driver -ni --kernel-src-dir=$PECOMPUTE/usr/src/target_linux

Install the drivers with access permitted to a user group (by default, the driver access group name is ‘vtune’ and the driver permissions are 660):

smw # ./insmod-sep3 -r -g <group>

By default, the driver will be installed in the current /sepdk/src directory. If you need to change it, use the --install-dir option with the insmod-sep3 script.

Refer to the <vtune-install-dir>/sepdk/src/README.txt document for more details on building the driver.

Create the VTune modulefile with the following steps:

smw # module load craypkg-gen
smw # craypkg-gen -m $PECOMPUTE/opt/intel/vtune_amplifier_xe_2017.0.2.478468
smw # /opt/intel/vtune_amplifier_xe_2017.0.2.478468/amplxe-vars.sh

The above procedure will create the modulefile $PECOMPUTE/modulefiles/vtune_amplifier_xe/2017.0.2.478468.

You might want to edit the newly created modulefile to specify path variables.

 

Collecting profile data with VTune Amplifier

In order to collect profiling data for further analysis you need to run the VTune collector along with your application on the system. There are several ways to launch an application for analysis, and in general they are described in the VTune Amplifier Help pages.

On Cray systems applications are normally run by submitting batch jobs, and the same applies to VTune. It is generally recommended to use the VTune command line tool, "amplxe-cl", to collect profiling data on compute nodes via batch jobs, and then use the VTune GUI, “amplxe-gui”, to display the results on a login node of the system.

However, the job scheduler utilities used to submit tasks, as well as the compilers and MPI libraries used to build parallel applications, may vary depending on site-specific requirements. This creates additional complexity for performance data collection with VTune or any other profiling tool. Below we give some common recipes for running performance data collection with the two most frequently used job schedulers.

 

Slurm* workload manager and srun command

Here is an example of a job script for analysis of a pure MPI application:

#!/bin/bash -l
#SBATCH --partition debug
#SBATCH --vtune
#SBATCH --time 01:00:00
#SBATCH --nodes 2
#SBATCH --job-name myjob

module unload darshan
module load vtune
srun -n 64 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

This script will run the advanced-hotspots analysis on the a.out program running on two nodes with 64 tasks in total. The VTune options mean the following:

-collect advanced-hotspots: the type of analysis used by the VTune collector (this is a hardware event-based collection, as are general-exploration and memory-access)

--trace-mpi: allows the collectors to trace MPI code and determine the MPI rank IDs if the code is linked against a non-Intel MPI library. When using the Intel MPI Library this option can be omitted.

-r my_res_dir: the name of the results directory, which will be created in the current directory

It is highly recommended to create the results directory on the fast Lustre file system. VTune needs to flush trace data from memory to disk frequently, so it is not recommended to put results on a global file system that is projected to the compute nodes via the Cray DVS layer, since DVS may not fully support the mmap functionality required by the VTune collector.
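For example, assuming the site exposes a Lustre scratch area through a $SCRATCH environment variable (the variable name is site specific), the srun line could become:

srun -n 64 amplxe-cl -collect advanced-hotspots -r $SCRATCH/my_res_dir --trace-mpi -- ./a.out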

In the script you need to unload the darshan module before profiling your code, as the VTune collector might interfere with this I/O characterization tool. Note that the darshan tool might not be installed on your system at all.

The --vtune batch flag is needed to dynamically load (insmod) the sampling driver for hardware event collection during the job.

Note the length of your job. Even if '-t' is set to 1 hour, it does not mean that VTune will be collecting data for the whole application run time. By default, the size of the results directory is limited, and when the trace file reaches this limit VTune stops the collection while the application continues to run. The implication is that performance data will be collected only for an initial part of the application run, starting from its beginning. To overcome this limitation consider either increasing the results directory size limit or decreasing the sampling frequency.
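As a sketch of both adjustments (the values here are purely illustrative; check 'amplxe-cl -help collect advanced-hotspots' in your installation for the exact options and defaults), the srun line could raise the data limit to 2 GB and sample every 10 ms instead of the default:

srun -n 64 amplxe-cl -collect advanced-hotspots -data-limit=2000 -knob sampling-interval=10 -r my_res_dir --trace-mpi -- ./a.out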

 

If your application uses a hybrid parallelization approach combining MPI and OpenMP, your job script for VTune analysis might look like the following:

#!/bin/bash -l
#SBATCH --partition debug
#SBATCH --vtune
#SBATCH --time 01:00:00
#SBATCH --nodes 2
#SBATCH --job-name myjob

module unload darshan
module load vtune
export OMP_NUM_THREADS=32
srun -n 2 -c 32 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

As you can see, the task and thread assignment syntax remains the same for srun, and as with a pure MPI application you specify amplxe-cl as the task to execute; srun takes care of distributing the a.out tasks between the compute nodes. In this case VTune creates only two per-node results directories, named my_res_dir.<nodename>. The per-OpenMP-thread results are aggregated in each per-node trace file.

One downside of this approach is that VTune will analyze every task and create results for each MPI rank in the job. This is not a problem when a job is distributed among a few ranks, but with hundreds or thousands of tasks you might end up with an enormous amount of performance data and a very long analysis finalization time. In this case you might want to collect a profile for a single MPI rank or a subset of ranks, leveraging the multiple program configuration of srun. This approach is described in the article [3].

For the aforementioned example you need to create a separate configuration file that will define which MPI ranks will be analyzed.

$ cat srun_config.conf
0-1022 ./a.out
1023 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

And in the job script the srun line will look like the following:

srun -n 32 -c 32 --multi-prog ./srun_config.conf

 

Application Level Placement Scheduler* (ALPS) and aprun command

With ALPS, running VTune via the aprun command is similar to the Slurm/srun experience. Just make sure you are using the --trace-mpi option so that VTune keeps one collector instance per node with multiple MPI ranks.

For a pure MPI application your job script for VTune analysis might look like the following [4]:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR

aprun -n 32 -N 16 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

where:

-n: number of processes

-N: number of processes per node

In case of a hybrid parallelization approach with combination of MPI and OpenMP:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 32 -N 2 -d 8 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

where:

-d: depth, or the number of CPUs assigned per process

If you would like to analyze just one node, you need to modify the script to use multiple executables:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR

aprun -n 16 ./a.out : -n 16 -N 16 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

 

Known limitations on VTune Amplifier collection in Cray XC systems

1. By default the Cray compiler produces static binaries. The general recommendation is to use dynamic linking for profiling under VTune Amplifier where possible (see the linking example after this item) to avoid a set of limitations the tool has with profiling static binaries. If dynamic linking cannot be applied, the following VTune Amplifier limitations should be taken into account:

a) PIN-based analysis types do not work with static binaries out of the box, reporting the following message:

Error: Binary file of the analysis target does not contain symbols required for profiling. See the 'Analyzing Statically Linked Binaries' help topic for more details.

This impacts the hotspots, concurrency, and locks and waits collections, and also the memory access collection with memory object instrumentation. See https://software.intel.com/en-us/node/609433 for how to work around the issue.

 

b) PMU-based analysis crashes on static binaries with the OpenMP RTL from the 2017 Gold and earlier Intel compiler versions.

To work around the issue, use a wrapper script that unsets the following variables:

unset INTEL_LIBITTNOTIFY64
unset INTEL_ITTNOTIFY_GROUPS

The issue was fixed in the Intel OpenMP RTL that is part of Intel Compiler 2017 Update 1 and later.
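A minimal sketch of such a wrapper, assuming it is saved as an executable wrapper.sh next to the binary (the script name is only illustrative):

#!/bin/bash
# Clear the ITT notification variables set up for the profiled process,
# then run the real application with its original arguments.
unset INTEL_LIBITTNOTIFY64
unset INTEL_ITTNOTIFY_GROUPS
exec "$@"

The application is then launched through the wrapper, for example: srun -n 64 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./wrapper.sh ./a.out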

 

c) Information based on the User API will not be collected, including user pauses, resumes, frames and tasks defined by the user in the source code; OpenMP instrumentation-based statistics such as serial time vs. parallel time and imbalance on barriers; and rank number capturing to enrich process names with MPI rank numbers.
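If dynamic linking is an option, a common way to switch to it with the Cray compiler wrappers is the CRAYPE_LINK_TYPE variable or the -dynamic flag. A sketch, assuming a source file my_app.c (consult your site's CrayPE documentation for specifics):

$ export CRAYPE_LINK_TYPE=dynamic
$ cc -o a.out my_app.c

or, equivalently:

$ cc -dynamic -o a.out my_app.c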

 

2. If the VTune result directory is placed on a file system projected by Cray DVS, VTune emits an error that the result cannot be finalized.

To work around the issue, place the VTune result directory on a file system without Cray DVS projection (scratch, etc.) using the '-r' VTune command line option.

 

3. It is required to add PMI_NO_FORK=1 to the application environment to make MPI profiling work and to avoid the MPI application hanging under profiling.
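For example, add the following line to the job script before the srun or aprun command:

export PMI_NO_FORK=1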

 

Analyzing data with VTune Amplifier

VTune Amplifier provides powerful visual tools for multi-process, multi-threaded and single-threaded performance analysis. In most cases it is better to use the VTune GUI for opening collected results, although the command-line tool provides very similar reporting functionality. To do so, log in to a login node, load the vtune module and launch the VTune GUI:

$ module load vtune
$ amplxe-gui

In the GUI you need to open the .amplxe project file in the corresponding results directory created during data collection. The VTune Amplifier GUI exposes a lot of graphical controls and objects, so performance-wise it is better for remote users to run an in-place X server and open a client X Window terminal using VNC* or NX* [5] software.
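If a remote GUI session is impractical, the collected results can also be inspected with the command-line reporting functionality on a login node, for example (using the result directory name from the collection examples above):

$ amplxe-cl -report summary -r my_res_dir
$ amplxe-cl -report hotspots -r my_res_dir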

 

*Other names and brands may be claimed as the property of others.

 

References

[1] http://docs.cray.com/PDF/XC_Series_Software_Installation_and_Configuration_Guide_CLE60UP02_S-2559.pdf

[2] https://cug.org/proceedings/cug2016_proceedings/includes/files/pap127.pdf

[3] Running Intel® Parallel Studio XE Analysis Tools on Clusters with Slurm* / srun

[4] http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=;f=man/alpsm/10/cat1/aprun.1.html

[5] https://en.wikipedia.org/wiki/NX_technology

