High Bandwidth Memory (HBM): how will it benefit your application?

Purpose

The first step in deciding whether your application can benefit from MCDRAM or High Bandwidth Memory (HBM) is to assess its memory bandwidth utilization.

This article provides basic instructions on how to profile and evaluate memory bandwidth utilization for your application using Intel® VTune™ Amplifier on Intel® Xeon® processors (IvyBridge/Haswell) and Intel® Xeon Phi™ Coprocessors (Knights Corner).

Instructions

Collecting Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ Coprocessors using Intel® VTune™ Amplifier


Source the latest version of the Intel® VTune™ Amplifier.
- Make sure to use Intel® VTune™ Amplifier 2015 Update 1 or later.
  
  Example:
  source /opt/intel/vtune_amplifier_xe_2015./amplxe-vars.sh
Create an appropriate run script to run your application on Xeon and/or Xeon Phi
- Make sure to compile your application with “-g” to provide debug information
Make sure to set the path to resolve any compiler and MPI dependencies for your application running on Xeon or Xeon Phi

Example:
source /opt/intel/composer_xe_2015.1.133/bin/compilervars.sh intel64
source /opt/intel/impi/5.0.2.044/bin/mpivars.sh
Collecting Bandwidth:

Intel® Xeon® processors (IvyBridge/Haswell)

amplxe-cl -collect bandwidth -r <your-result-dir> -- ./<your-application> (or run script)

Intel® Xeon Phi™ coprocessors (Knights Corner)

Syntax while running the application on Xeon Phi coprocessor natively
(e.g. ssh mic0 "cd /tmp ; ./a.out")

Run the amplxe-cl command from the host only
amplxe-cl -target-system=mic-native:`hostname`-mic<N> -collect bandwidth -r <your-result-dir> -- <full-path-to-app-to-launch-on-TARGET_CARD>

Syntax while running the application on Xeon Phi Coprocessor from host (e.g. mpirun from host, offload, OpenCL etc.):
Run the amplxe-cl command from the host only
amplxe-cl -target-system=mic-host-launch:`hostname`-mic<N> -collect bandwidth -r <your-result-dir> -- <full-path-to-app-to-launch-on-host>
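The invocations above differ only in the `-target-system` option and in what follows `--`. As a rough sketch, a small wrapper can print the matching command line for review before you run it. The function name `vtune_bw_cmd`, its argument order, and the example host name `node01` are hypothetical; the options themselves are only those shown above.

```shell
#!/bin/sh
# Print (do not run) an amplxe-cl bandwidth collection command.
# Hypothetical helper: vtune_bw_cmd <mode> <result-dir> <app> [card]
#   mode: host | mic-native | mic-host-launch
#   card: coprocessor name, e.g. "$(hostname)-mic0" (required for mic modes)
vtune_bw_cmd() {
    mode=$1; result_dir=$2; app=$3; card=${4:-}
    case "$mode" in
        host)
            target="" ;;
        mic-native|mic-host-launch)
            if [ -z "$card" ]; then
                echo "error: mode '$mode' needs a card name" >&2
                return 1
            fi
            target="-target-system=$mode:$card " ;;
        *)
            echo "error: unknown mode '$mode'" >&2
            return 1 ;;
    esac
    echo "amplxe-cl ${target}-collect bandwidth -r $result_dir -- $app"
}

# Example: native run on the first coprocessor of a host named node01.
vtune_bw_cmd mic-native r001bw /tmp/a.out node01-mic0
```

Printing the command instead of executing it keeps the sketch safe to try on a machine without VTune installed; once the line looks right, it can be pasted into the shell on the host.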
Some additional handy VTune commands:
- The default VTune limit for the result data for any profile collection is 500 MB. You can add the following knob to your “amplxe-cl” command to increase the size or even make it unlimited (by specifying 0)
  
  -data-limit=<size-in-MB> (default is 500)
  
  Limit the amount of raw data to be collected. For unlimited data size, specify 0.
- If you believe that your application is really huge and long running, you can reduce the size of the data collected by adding the knob below to your “amplxe-cl” command
  
  -target-duration-type=veryshort | short | medium | long (default is ‘short’)
  
  This value affects the size of collected data. For long running targets, sampling interval is increased to reduce the result size. For hardware event-based analysis types, the duration estimate affects a multiplier applied to the configured Sample after value.
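As an illustration of how the two knobs fit into a collection command, the sketch below validates the values and prints the resulting options. The helper name `vtune_knobs` and the result directory `r002bw` are hypothetical; the option names and allowed values are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the two optional knobs for an amplxe-cl command.
#   $1: data limit in MB (0 means unlimited; the default is 500)
#   $2: target duration type (veryshort | short | medium | long)
vtune_knobs() {
    limit=$1; duration=$2
    case "$limit" in
        ''|*[!0-9]*)
            echo "error: data limit must be a non-negative integer" >&2
            return 1 ;;
    esac
    case "$duration" in
        veryshort|short|medium|long) ;;
        *)
            echo "error: unknown duration type '$duration'" >&2
            return 1 ;;
    esac
    echo "-data-limit=$limit -target-duration-type=$duration"
}

# Example: unlimited result size for a long-running application.
echo "amplxe-cl -collect bandwidth $(vtune_knobs 0 long) -r r002bw -- ./app"
```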
After collecting the Bandwidth Profile for your application, the next step is to view and analyze the results.

Viewing the Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ Coprocessors using Intel® VTune™ Amplifier

You need to open a “VNC” session to open the results in the VTune GUI.
Source the latest version of the Intel® VTune™ Amplifier.

Example:
source /opt/intel/vtune_amplifier_xe_2015./amplxe-vars.sh
Open the result using the VTune GUI
amplxe-gui <your-result-dir>
In the “Summary” tab you can see the “Average Bandwidth” reported for your application.

Example:
The results in the snapshots below are from one of Sandia’s Mantevo mini-apps on Intel® Xeon® Processors (2 Socket Haswell) and Intel® Xeon Phi™ coprocessor (Knights Corner).
Note for Intel® Xeon®: Package_0 is Socket 0; Package_1 is Socket 1. The bandwidth is reported for each socket of the N-Socket Xeon® processor you run on.

Intel® Xeon® (Haswell)
Intel® Xeon Phi™ Coprocessor (Knights Corner)
In the “Bottom-up” tab you can observe the “Peak Bandwidth” utilized and also a time-line view of your application’s bandwidth utilization.
- You can also see the Read Bandwidth and Write Bandwidth utilization separately in this view.
  
  Example:
  The results in the snapshots below are from one of Sandia’s Mantevo mini-apps on Intel® Xeon® Processors (Haswell) and Intel® Xeon Phi™ coprocessor (Knights Corner).
  
  Intel® Xeon® (Haswell)
  The total peak bandwidth reported for this Dual Socket Haswell run is 52.599 * 2 = 105.198 GB/s.
  
  Intel® Xeon Phi™ coprocessor (Knights Corner)
  The total peak bandwidth reported for this Xeon Phi coprocessor run is 158.580 GB/s.
In addition, you can select a portion of the timeline view and then zoom in and filter on that region (by clicking and dragging on the timeline, as shown in the snapshot below). The GUI then shows bandwidth utilization numbers for the zoomed-in region only.
- This is especially important for applications which have a long initialization. Using this zoom-in feature, we can focus only on the required part (or phase) of the application.
  Snapshot: Users can zoom in on a particular region of the timeline by clicking and dragging on the timeline, and then selecting the “Zoom in and Filter by Selection” menu option.

Analyzing the Bandwidth Profile on Intel® Xeon® processors and/or Intel® Xeon Phi™ Coprocessors using Intel® VTune™ Amplifier

Understanding your application's memory bandwidth profile and its limits is important mainly because:
- Bandwidth bottlenecks increase the latency at which cache misses are serviced.
This matters even more for the Intel® Xeon Phi™ Coprocessor (Knights Landing), since the on-package high bandwidth memory (MCDRAM: up to 16 GB) will have ~3x to 4x the memory bandwidth of DDR4.
- Hence it is important to know which data structures/hot arrays should be allocated to MCDRAM as opposed to DDR4.
- But that is the next step. Before that, one has to understand whether the application has a high memory footprint (> MCDRAM size) and whether it is indeed memory bandwidth limited.
The theoretical memory bandwidth peaks for the Intel® Xeon® processors (IvyBridge/Haswell) and Intel® Xeon Phi™ coprocessor (Knights Corner) can be computed as follows:

Intel® Xeon® (IvyBridge/Haswell):

Theoretical Peak (GB/s) [Per Socket] = (MT/s) * 8 bytes/Clock * <num channels> / 1000

Example:

For Dual Socket Haswell (2133 MT/s; 4 Channels per Socket):

Theoretical Peak (GB/s) [Per Socket] = (2133 * 8 * 4) / 1000 = 68.256 GB/s

Thus, Theoretical Peak for Dual Socket = 68.256 * 2 = 136.512 GB/s

Intel® Xeon Phi™ Coprocessor (Knights Corner):

Theoretical Peak (GB/s) = (MT/s) * 4 bytes/Clock * <num channels> / 1000

Example:

For Intel® Xeon Phi™ Coprocessor (Knights Corner) (5500 MT/s; 16 Channels):

Theoretical Peak (GB/s) = (5500 * 4 * 16) / 1000 = 352 GB/s
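Both formulas have the same shape: transfer rate times bytes per transfer times channel count, divided by 1000. A minimal sketch of that arithmetic, assuming `awk` is available for the floating-point math, is shown here; the function name `peak_gbs` is hypothetical and the inputs are the figures quoted in the two examples.

```shell
#!/bin/sh
# Theoretical peak bandwidth in GB/s.
#   $1: transfer rate in MT/s
#   $2: bytes per transfer per channel (8 for Xeon, 4 for Knights Corner)
#   $3: number of memory channels
peak_gbs() {
    LC_ALL=C awk -v mts="$1" -v bytes="$2" -v ch="$3" \
        'BEGIN { printf "%.3f\n", mts * bytes * ch / 1000 }'
}

per_socket=$(peak_gbs 2133 8 4)       # Haswell, one socket
dual_socket=$(LC_ALL=C awk -v p="$per_socket" 'BEGIN { printf "%.3f\n", p * 2 }')
knc=$(peak_gbs 5500 4 16)             # Knights Corner, 16 channels

echo "Haswell per socket:  $per_socket GB/s"
echo "Haswell dual socket: $dual_socket GB/s"
echo "Knights Corner:      $knc GB/s"
```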
However, due to certain memory limitations and bottlenecks, it is not always possible to achieve the theoretical memory bandwidth limits.
- Hence, it is also necessary to compare the bandwidth of the profiled code with that of a benchmark that is bandwidth limited by design (such as the STREAM benchmark).
The peak STREAM Triad performance for the specified Intel® Xeon® Processors (IvyBridge/Haswell) and Intel® Xeon Phi™ coprocessor (Knights Corner) is shown below:
IvyBridge: 2.70 GHz Dual Socket, 12 cores/socket, EIST/Turbo on, SMT on, 64 GB RAM DDR3 1600 8*8GB
Haswell: 2.60 GHz Dual Socket, 14 cores/socket, EIST/Turbo on, SMT on, 64 GB RAM DDR4 2133 8*8GB
Knights Corner: 1.23 GHz, 61 Cores/node, 5.5 GT/s, ECC on, Turbo off

	IvyBridge	Haswell	Knights Corner
STREAM Triad (GB/s)	87	110	177
Comparing the bandwidth obtained for the Sandia Mantevo mini-app shown above against these peaks:

Intel® Xeon® Processor (Haswell):

Profiled Bandwidth from VTune (Dual Socket): 105.198 GB/s

Theoretical Peak (Dual Socket): 136.512 GB/s

STREAM Triad: 110 GB/s

The profiled bandwidth for the application is ~77% of the theoretical peak and ~95% of the practical peak (STREAM Triad). Thus this application is indeed memory bandwidth limited on Haswell (>75% of the theoretical and/or practical peaks).
Intel® Xeon Phi™ coprocessor (Knights Corner):

Profiled Bandwidth from VTune: 158.580 GB/s

Theoretical Peak: 352 GB/s

STREAM Triad: 177 GB/s

The profiled bandwidth for the application is ~45% of the theoretical peak and ~90% of the practical peak (STREAM Triad). Thus this application is indeed memory bandwidth limited even on Knights Corner (>75% of the theoretical and/or practical peaks).
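The comparison above can be repeated for your own measurements with a few lines of shell. This is only a sketch of the rule of thumb used in this article (more than 75% of the theoretical and/or STREAM Triad peak), assuming `awk` for the arithmetic; the function names `pct_of`, `is_bw_limited` and `report` are hypothetical.

```shell
#!/bin/sh
# Percentage of a reference bandwidth, rounded to one decimal place.
#   $1: profiled bandwidth (GB/s), $2: reference bandwidth (GB/s)
pct_of() {
    LC_ALL=C awk -v m="$1" -v ref="$2" 'BEGIN { printf "%.1f\n", 100 * m / ref }'
}

# Rule of thumb from the text: treat the run as bandwidth limited when the
# profiled bandwidth exceeds 75% of the theoretical and/or STREAM Triad peak.
#   $1: profiled, $2: theoretical peak, $3: STREAM Triad (all in GB/s)
is_bw_limited() {
    LC_ALL=C awk -v m="$1" -v theo="$2" -v stream="$3" \
        'BEGIN { exit !(m > 0.75 * theo || m > 0.75 * stream) }'
}

report() {
    # $1: label, $2: profiled, $3: theoretical peak, $4: STREAM Triad
    verdict="not bandwidth limited"
    if is_bw_limited "$2" "$3" "$4"; then verdict="bandwidth limited"; fi
    echo "$1: $(pct_of "$2" "$3")% of theoretical, $(pct_of "$2" "$4")% of STREAM Triad, $verdict"
}

report "Haswell (dual socket)" 105.198 136.512 110
report "Knights Corner" 158.580 352 177
```

For a dual-socket Xeon run, pass the sum of the per-socket peak values reported by VTune as the profiled bandwidth, as was done for the Haswell figure above.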
The next step in this exercise would be to understand the application’s data structures, find the arrays which are bandwidth hungry, determine their memory profile, and allocate those arrays to the on-package High Bandwidth Memory (HBM) accordingly.
- This topic will be described in another white paper in the near future.
