Measuring Load Imbalance Using the Intel® VTune™ Amplifier XE

OpenMP on the Intel® Xeon Phi™ coprocessor performs as well as on Intel® Xeon processors. However, the slower clock on the Intel Xeon Phi coprocessor and the sheer number of threads accentuate OpenMP overhead. In most cases, the real problem is either load imbalance or a significant amount of serial execution, and only rarely the overhead itself.

Let’s take a look at the following Intel VTune screenshot.

As is apparent from the screenshot, the libiomp5.so shared object that implements the OpenMP runtime takes 24.7% of the CPU time. It comes as no surprise that developers may blame OpenMP for the suboptimal performance. To see what is really going on, it helps to understand how most OpenMP programs work.

OpenMP is a fork-join parallel model, meaning that an OpenMP program starts with a single master thread executing serial code. When a parallel region is encountered, the master thread forks into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and the master thread continues executing serial code.

It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for synchronization; but it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code. In such a program, CPU time will show up in the OpenMP runtime in two cases:

1. When the master thread is executing a serial region, the slave threads will be spinning in the OpenMP runtime waiting for the next parallel region. We call this serial time.

2. When a thread finishes a parallel region, it will spin in the barrier waiting for the other threads to finish. We call this load imbalance.
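To make the load imbalance case concrete, here is a small arithmetic sketch. The per-thread times are made-up numbers chosen for illustration, not measurements from the screenshot. It shows how a single slow thread can make barrier spinning account for a large share of the CPU time in a parallel region.

```shell
# Hypothetical busy times (in ms) for four threads in one parallel region.
times="10 10 10 40"

max=0      # busy time of the slowest thread = wall time of the region
busy=0     # total time the threads spend doing useful work
n=0        # number of threads

for t in $times; do
    if [ "$t" -gt "$max" ]; then
        max=$t
    fi
    busy=$((busy + t))
    n=$((n + 1))
done

total=$((max * n))      # CPU time charged to the region across all threads
spin=$((total - busy))  # CPU time spent spinning in the barrier

echo "region wall time: ${max} ms"
echo "CPU time spinning in the barrier: ${spin} of ${total} ms"
```

With these numbers, three threads each wait 30 ms for the fourth, so more than half of the CPU time in the region would be attributed to the OpenMP runtime even though the runtime itself is not the bottleneck.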

Before Intel VTune Amplifier XE 2013 Update 5, it was difficult to distinguish these two cases, and there was no way to correlate time spent in the OpenMP runtime with the source code of the program. Intel VTune Amplifier XE 2013 Update 5, together with Intel® Composer XE 2013 Update 2, now gives developers the tools to see where time is lost to OpenMP serial time and where it is lost to load imbalance. The OpenMP runtime library in Intel Composer XE 2013 Update 2 contains frame markers that Intel VTune uses to break down the CPU time in OpenMP into its constituent parallel regions and serial regions. The screenshot below gives an example:

Intel VTune has had the concept of frames for some time. Originally, frames were intended to correspond to video frames, and could be distinguished by frame domain, which is essentially a name for the frame. In the above screenshot, the frame domain is the OpenMP region name as reported by the instrumented libiomp5.so contained in Intel Composer XE 2013 Update 2. The name consists of the name of the function containing the parallel region together with the beginning and ending line numbers of the region. There is also an entry for the serial part of the code, which is shown as "Outside any frame". In addition to the domain name, Intel VTune also provides the number of times the region was executed and the total wall clock time spent in the region.

The frames can be expanded to see the functions called in each frame or in the serial part, thus giving a function profile by parallel region. Intel VTune also provides the time for each thread in the parallel region, which helps determine which threads, if any, are starved for work.

This frame information can be collected on the host as well as on the Intel Xeon Phi coprocessor. On the host, it can be collected by simply setting the environment variable KMP_FORKJOIN_FRAMES=1 before running the application. On the Intel Xeon Phi coprocessor, however, the collection requires more work because of the way in which Intel VTune collects data on the coprocessor. To collect the frame information, Intel VTune sets environment variables that also need to be propagated to the coprocessor. These variables are designed to work seamlessly with OpenCL programs on the coprocessor, but require some extra effort for OpenMP programs.
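For the host case, a minimal sketch might look like the following. The executable name ./a.out and the choice of the hotspots analysis type are assumptions made for illustration; the article itself only specifies the KMP_FORKJOIN_FRAMES setting.

```shell
# Sketch only: ./a.out and the "hotspots" analysis type are assumptions.
# Ask the OpenMP runtime to emit fork/join frame markers.
export KMP_FORKJOIN_FRAMES=1

if command -v amplxe-cl >/dev/null 2>&1; then
    # Profile the host application; frames are recorded alongside the samples.
    amplxe-cl -collect hotspots -- ./a.out
else
    echo "amplxe-cl not found in PATH; set up the VTune environment first" >&2
fi
```

Exporting the variable in the same shell that launches the collector ensures the profiled application inherits it.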

The following environment variables are used internally by Intel VTune to collect the frame information in OpenCL applications running on the Intel Xeon Phi coprocessor:

__OCL_MIC_INTEL_LIBITTNOTIFY64 – Dynamic library used to collect frame information on the card.

__OCL_MIC_INTEL_ITTNOTIFY_CONFIG – Configuration parameters for the dynamic library.

__OCL_MIC_USERAPICOLLECTOR_LOG_DIR – Log directory on the card.

Collecting frame information in native applications running on the Intel Xeon Phi coprocessor

To collect frame information for native applications, the developer needs to create a wrapper script that propagates the values of the above environment variables to the coprocessor. The script needs to strip the __OCL_MIC_ prefix and set the unprefixed variables on the coprocessor. This is typically done by using ssh. Assuming that the executable on the coprocessor is /tmp/a.out and the script is called run.sh, the script should contain:

ssh mic0 "KMP_FORKJOIN_FRAMES=1 \
INTEL_LIBITTNOTIFY64=$__OCL_MIC_INTEL_LIBITTNOTIFY64 \
INTEL_ITTNOTIFY_CONFIG=$__OCL_MIC_INTEL_ITTNOTIFY_CONFIG \
USERAPICOLLECTOR_LOG_DIR=$__OCL_MIC_USERAPICOLLECTOR_LOG_DIR \
/tmp/a.out"
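The prefix stripping described above can also be written as a loop, which makes the mapping from the prefixed to the unprefixed names explicit. This is a sketch under assumptions: the helper name build_remote_command is hypothetical, and it assumes the variable values contain no spaces or shell metacharacters that would need extra quoting on the remote side.

```shell
# Hypothetical helper: prints the command line to run on the coprocessor.
# $1 is the path of the executable on the card.
build_remote_command() {
    cmd="KMP_FORKJOIN_FRAMES=1"
    for name in INTEL_LIBITTNOTIFY64 INTEL_ITTNOTIFY_CONFIG USERAPICOLLECTOR_LOG_DIR; do
        # Read the value of the __OCL_MIC_-prefixed variable set on the host.
        eval "value=\${__OCL_MIC_${name}:-}"
        # Pass it on under the unprefixed name.
        cmd="$cmd $name=$value"
    done
    printf '%s %s\n' "$cmd" "$1"
}

# Dry run: show what would be sent to the card instead of invoking ssh.
echo "would run: ssh mic0 \"$(build_remote_command /tmp/a.out)\""
```

Printing the command before running it is a quick way to confirm that the variables were actually set; if the script is started outside the collector, the values will be empty.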

To start the collection, execute run.sh by running the Intel VTune command line, amplxe-cl, on the host:

amplxe-cl -collect knc-lightweight-hotspots sh run.sh

Collecting frame information in Offload applications running on the Intel Xeon Phi coprocessor

The method for offload collection is almost identical to that of the native collection, with the exception that ssh isn’t used. The following example assumes that you have the MIC_ENV_PREFIX environment variable set to MIC.

As in the native example, start the collection by executing run.sh with the Intel VTune command line, amplxe-cl, on the host:

amplxe-cl -collect knc-lightweight-hotspots sh run.sh

where run.sh contains:

export MIC_INTEL_LIBITTNOTIFY64=$__OCL_MIC_INTEL_LIBITTNOTIFY64
export MIC_INTEL_ITTNOTIFY_CONFIG=$__OCL_MIC_INTEL_ITTNOTIFY_CONFIG
export MIC_USERAPICOLLECTOR_LOG_DIR=$__OCL_MIC_USERAPICOLLECTOR_LOG_DIR
./a.out
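The same exports can be sketched as a loop that follows whatever value MIC_ENV_PREFIX holds and warns when a source variable is empty. This is an illustration only: the fallback to MIC and the warning text are assumptions, and it relies on the offload runtime passing variables that carry the MIC_ENV_PREFIX prefix to the coprocessor with the prefix removed.

```shell
# Fall back to MIC if no prefix is configured (assumption matching the example).
: "${MIC_ENV_PREFIX:=MIC}"
export MIC_ENV_PREFIX

for name in INTEL_LIBITTNOTIFY64 INTEL_ITTNOTIFY_CONFIG USERAPICOLLECTOR_LOG_DIR; do
    # Read the value that the collector set on the host.
    eval "value=\${__OCL_MIC_${name}:-}"
    if [ -z "$value" ]; then
        echo "warning: __OCL_MIC_${name} is empty; was this script started by amplxe-cl?" >&2
    fi
    # Re-export under the offload prefix, for example MIC_INTEL_LIBITTNOTIFY64.
    export "${MIC_ENV_PREFIX}_${name}=$value"
done
```

The offload executable (./a.out in the example) would then be launched from the same script, after the loop.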

Viewing the frame data collected

Once you have collected frame data using the Intel VTune command line, you can view the data in the Intel VTune GUI. By default, the results are stored in the current directory on the host, in a folder named r00**lh. To view them, open the created project in Intel VTune: File>Open>Project>r00**lh. The collected frame data appears in the Bottom-up tab of the results when you select a grouping based on Frame Domain.

