
Getting Ready for Intel® Xeon Phi™ x200 Product Family


This article demonstrates some of the techniques application developers can use to best prepare their applications for the upcoming Intel® Xeon Phi™ x200 product family – codename Knights Landing (KNL).


1.    Introduction

The Intel® Xeon Phi™ x100 family of coprocessors was the first generation of the Intel® Xeon Phi™ product family. It offered energy-efficient scaling, enhanced vectorization capabilities, and high local memory bandwidth. Its important features include more than 60 cores (240+ threads), up to 16 GB of GDDR5 memory with 352 GB/s of bandwidth, and the ability to run Linux* with standard tools and languages. Some applications used these many-core coprocessors by offloading compute-intensive workloads, while others used the Intel® Xeon® host system and the Intel® Xeon Phi™ coprocessors simultaneously, each crunching its own portion of the workload.

Some applications perform well under this paradigm, while for others the benefit of accelerated computing is not enough to make up for the cost of moving data between the host and the coprocessor over PCIe. From the application developer's perspective, this can be a serious problem.

The Intel® Xeon Phi™ x200 product family – codename Knights Landing (KNL) – is offered both as a processor and as a coprocessor. As a processor, KNL needs no host to support it: it boots a full operating system by itself. For applications that were limited by the overhead of data transfer on Knights Corner (KNC), the first generation of Intel® Xeon Phi™ coprocessors, all data processing can be completed on the KNL node itself, either in high bandwidth near memory or in slower DDR4, without moving data back and forth across a PCIe bus between host and accelerator. The coprocessor version of KNL offers an offload paradigm similar to KNC, but with the added advantages of improved parallelism and greater single-thread performance. For both the processor and the coprocessor versions of the Intel® Xeon Phi™ x200 product family, however, it is important that applications effectively use as many cores as possible in parallel, and that they explore and utilize the enhanced vectorization capabilities, to achieve significant performance gains. Cluster applications must also support fabric scaling. Moreover, it is highly likely that applications optimized for Knights Corner will also perform well on the next generation of the Intel® Xeon Phi™ product family.

2.    About This Document

The first part of this document lists important features of the Intel® Xeon Phi™ x200 product family. It then demonstrates how currently available tools such as the Intel® Software Development Emulator and Intel® VTune™ Amplifier can be used to prepare for the upcoming KNL processors and coprocessors. It also lists programming and optimization techniques that are already known from the Intel® Xeon Phi™ x100 products, as well as new techniques suitable for the Intel® Xeon Phi™ x200 processors and coprocessors. Wherever possible, pointers are given to already published best known methods that can help application developers apply these optimization techniques. This document does not explain architecture or instruction-level details, nor is it intended for system administrators who want to set up or manage their Knights Landing systems. Most of the discussion centers on the KNL processor, as it will be the first release of the Intel® Xeon Phi™ x200 product family.

3.    Intel® Xeon Phi™ x200 Product Overview

Figure 1. KNL package[1] overview

3.1    On Package Micro-architecture Information

Some of the architectural highlights of the Intel® Xeon Phi™ x200 product family – codename Knights Landing – are as follows:

  • Up to 72 cores (in 36 tiles) connected in a 2D mesh architecture with improved on-package latency
  • 6 channels of DDR4 supporting up to 384 GB with a sustained bandwidth of more than 80 GB/s
  • Up to 16 GB of high performance on-package memory (MCDRAM) with a sustained bandwidth of ~500 GB/s, supporting flexible memory modes including cache and flat
  • Each tile can be drawn as follows:

 

Figure 2. KNL Tile

  • Each core is based on an Intel® Atom™ core with many HPC enhancements, such as:
    • 4 Threads/core
    • Deep Out-of-Order buffers
    • Gather/scatter in hardware
    • Advanced branch prediction
    • High cache bandwidth
  • 2x 512b Vector Processing Units per core with support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • 3x Single-thread performance compared to Knights Corner
  • Binary compatible with Intel® Xeon® processors
  • Cache-coherent
  • Support for Intel® Omni Scale™ fabric integration

3.2    Performance

  • 3+ Teraflops of double-precision peak theoretical performance per single KNL node
  • Power efficiency (over 25% better than discrete coprocessor)[2] – over 10 GF/W
  • Standalone bootable processor with ability to run Linux and Windows OS
  • Platform memory capacity comparable to Intel® Xeon® processors

3.3    Programming Standards Supported

  • OpenMP
  • Message Passing Interface (MPI)
  • Fortran
  • Intel® Threading Building Blocks and Intel® Cilk™ Plus
  • C/C++

4.   Application Readiness for Knights Landing

Similar to the first generation of Intel® Xeon Phi™ coprocessors, scaling and vectorization are two fundamental considerations for achieving high performance on Knights Landing. Moreover, the Intel® Xeon Phi™ x200 processors can use the high bandwidth memory (MCDRAM) as separate addressable memory. For certain memory-bound applications, modifying the allocation of some data structures to use this high bandwidth memory can further boost performance.

4.1    Scaling

In order to obtain performance benefits with the Intel® Xeon Phi™ product families, it is very important that the application scales with an increasing number of cores. To check scaling, create a graph of performance as you run your application with various numbers of threads, either on Intel® Xeon® processors or on Intel® Xeon Phi™ x100 coprocessors. Depending on your programming environment, you can change an appropriate environment variable (for example, OMP_NUM_THREADS for OpenMP) or a configuration parameter to vary the number of threads. In some cases, as you increase the number of threads, you may also want to increase the size of the dataset to ensure there is enough work for all the threads, so that the benefits of parallel execution are not subsumed by the overhead of thread creation and maintenance.
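Shown below is a minimal sketch of such a scaling experiment, assuming nothing more than OpenMP and a hypothetical triad-style kernel; it times the same loop at increasing thread counts so that the results can be graphed.

//A minimal thread-scaling sweep (sketch): times one OpenMP kernel at
//1, 2, 4, ... threads, equivalent to varying OMP_NUM_THREADS externally
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void){
       const unsigned long n = 64UL * 1024 * 1024;  //fixed problem size (strong scaling)
       double *a = malloc(n * sizeof(double));
       double *b = malloc(n * sizeof(double));
       unsigned long i;
       for(i = 0; i < n; i++){ a[i] = 1.0*i; b[i] = 2.0*i; }

       int t, tmax = omp_get_max_threads();
       for(t = 1; t <= tmax; t *= 2){
              omp_set_num_threads(t);
              double start = omp_get_wtime();
#pragma omp parallel for
              for(i = 0; i < n; i++)
                     b[i] = 3.14*a[i] + b[i];
              printf("Threads = %d, time = %.3f s\n", t, omp_get_wtime() - start);
       }
       free(a); free(b);
       return 0;
}

Plotting the reported times against the thread count shows how well the loop scales; for a weak-scaling experiment, n would be increased along with t.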

4.2    Vectorization

The Intel® Xeon Phi™ x200 product family supports Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions in addition to the Intel® SSE, AVX, and AVX2 instruction sets. A single instruction can thus process twice the number of data elements of AVX/AVX2 and four times that of SSE. These instructions also represent a significant leap beyond the 512-bit SIMD support available in the first generation of Intel® Xeon Phi™ coprocessors.

With AVX-512, the Intel® Xeon Phi™ x200 product family offers higher performance for the most demanding computational tasks. The AVX-512 foundation instructions provide 32 vector registers, each 512 bits wide; eight dedicated mask registers; 512-bit operations on packed floating-point or packed integer data; embedded rounding controls (overriding global settings); embedded broadcast; embedded floating-point and memory fault suppression; new operations; additional gather/scatter support; high speed math instructions; and compact representation of large displacement values. In addition to the foundation instructions, Knights Landing also supports three additional capabilities: Intel® AVX-512 Conflict Detection Instructions (CDI), Intel® AVX-512 Exponential and Reciprocal Instructions (ERI), and Intel® AVX-512 Prefetch Instructions (PFI). These provide, respectively, efficient conflict detection that allows more loops to be vectorized, exponential and reciprocal operations, and new prefetch capabilities.
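As a concrete illustration of the wider registers, here is a minimal sketch of the Appendix B DAXPY kernel written with AVX-512F intrinsics. The function name is ours, and the sketch assumes vectorSize is a multiple of 8 and that A and B are 64-byte aligned:

//DAXPY with AVX-512F intrinsics (sketch): 8 doubles per 512-bit register
#include <immintrin.h>

void daxpy_avx512(double A[], double PI, double B[], unsigned long vectorSize){
       __m512d vpi = _mm512_set1_pd(PI);            //broadcast the scalar to all 8 lanes
       unsigned long i;
       for(i = 0; i < vectorSize; i += 8){
              __m512d va = _mm512_load_pd(&A[i]);   //aligned 64-byte load
              __m512d vb = _mm512_load_pd(&B[i]);
              vb = _mm512_fmadd_pd(vpi, va, vb);    //vb = PI*va + vb in one FMA
              _mm512_store_pd(&B[i], vb);
       }
}

In most cases the compiler generates equivalent code from a plain loop when the appropriate -x switch is used, so intrinsics like these are usually reserved for loops the auto-vectorizer cannot handle.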

As part of the application readiness efforts for the Intel® Xeon Phi™ x200 product family, support for AVX-512 can currently be evaluated using the Intel® Software Development Emulator (Intel® SDE) on an Intel® Xeon® processor. It has been extended for AVX-512 and is available at https://software.intel.com/en-us/articles/intel-software-development-emulator. Intel® SDE is a software emulator mainly used to emulate future instruction sets. It is not cycle accurate and can be very slow (up to 100x slowdown). However, its instruction mix report provides useful information, such as dynamic instruction execution counts and per-function instruction count breakdowns, for evaluating compiler code generation.

The compiler switch to enable AVX-512 for KNL is -xMIC-AVX512 (Intel® Compilers 14.0 and later).

4.2.1   SDE Example

The sample code shown in Appendix B can be used to demonstrate how Intel® SDE helps evaluate differences in compiler code generation for AVX, AVX2, and AVX-512.

  • Download the latest version of Intel® SDE from https://software.intel.com/en-us/articles/intel-software-development-emulator. The version used in the following example is 7.15.
  • Extract the SDE and set the environment variable to use sde/sde64
    $ tar -xjvf sde-external-7.15.0-2015-01-11-lin.tar.bz2
    $ cd sde-external-7.15.0-2015-01-11-lin
    $ export PATH=`pwd`:$PATH 
  • Use the latest Intel® Compilers (14.0+) and compile with the “-xMIC-AVX512” knob to generate a Knights Landing (KNL) binary
    //Compiling for KNL
    $ icc -openmp -g -O3 -xMIC-AVX512 -o simpleDAXPY_knl simpleDAXPY.c
    //Compiling for Haswell
    $ icc -openmp  -g -O3 -xCORE-AVX2 -o simpleDAXPY_hsw simpleDAXPY.c
    //Compiling for Ivy Bridge
    $ icc -openmp -g -O3 -xCORE-AVX-I -o simpleDAXPY_ivb simpleDAXPY.c
  • In order to simplify the analysis, set the number of threads to 1
    $ export OMP_NUM_THREADS=1
  • Generate instruction mix reports for AVX, AVX2[3]  and AVX-512 to compare performance metrics
    //Generating report for KNL
    $ sde -knl -mix -top_blocks 100 -iform 1 -omix sde-mix-knl.txt -- ./simpleDAXPY_knl 64 40
    
    // Generating report for Ivy Bridge
    $ sde -ivb -mix -top_blocks 100 -iform 1 -omix sde-mix-ivb.txt -- ./simpleDAXPY_ivb 64 40
    
    // Generating report for Haswell
    $ sde -hsw -mix -top_blocks 100 -iform 1 -omix sde-mix-hsw.txt -- ./simpleDAXPY_hsw 64 40
  • We compare the dynamic count of total instructions executed to get a rough estimate of the overall improvement in application performance with the AVX-512 instruction set versus AVX and AVX2. This can be done quickly by parsing the generated instruction mix reports as follows:
    //Getting instruction count with AVX-512
    $ grep total sde-mix-knl.txt | head -n 1
    *total                                                   5493008680
    
    //Getting instruction count with AVX2
    $ grep total sde-mix-hsw.txt | head -n 1
    *total                                                   6488866275
    
    //Getting instruction count with AVX
    $ grep total sde-mix-ivb.txt | head -n 1
    *total                                                   7850210690
  • Reduction in total dynamic instruction execution count
    Change of instruction set      Reduction in dynamic instruction count
    AVX  -> AVX-512                30.03%
    AVX2 -> AVX-512                15.34%

Thus it can be observed that current and future generations of Intel hardware rely strongly on SIMD[4] performance. In order to write efficient and unconstrained parallel programs, application developers must fully exploit the vectorization capabilities of the hardware and understand the benefits of explicit vector programming. This can be achieved by restructuring vector loops, by using explicit SIMD directives (#pragma simd), or by using compiler intrinsics. Compiler auto-vectorization may also achieve the goal in many cases.
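As a small example of explicit vector programming, the sketch below forces vectorization of a reduction, a pattern the auto-vectorizer does not always pick up on its own. The OpenMP 4.0 form #pragma omp simd is used here; the Intel-specific #pragma simd accepts an equivalent reduction clause:

//Explicitly vectorized dot product (sketch)
double dot_product(const double A[], const double B[], unsigned long n){
       double sum = 0.0;
       unsigned long i;
#pragma omp simd reduction(+:sum)
       for(i = 0; i < n; i++)
              sum += A[i] * B[i];
       return sum;
}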

4.3 High Bandwidth Memory and Supported Memory Modes

4.3.1 Introduction to MCDRAM

The next generation of the Intel® Xeon Phi™ product family can include up to 16 GB of on-package high bandwidth memory – Multi-Channel DRAM (MCDRAM). It can provide up to 5x the bandwidth of DDR and 5x the power efficiency of GDDR5. MCDRAM supports NUMA[5] and can be configured in cache, flat, and hybrid modes. The mode must be selected and configured at boot time.

4.3.2   Cache Mode

In cache mode, all of MCDRAM behaves as a memory-side, direct-mapped cache in front of DDR4. As a result, there is only a single visible pool of memory, and MCDRAM appears as a high bandwidth, high capacity L3 cache. The advantage of using MCDRAM as cache is that legacy applications do not require any modifications. So if your application has good data locality, is not memory bound (i.e., DDR bandwidth bound), and the majority of its critical data structures fit in MCDRAM, this mode will work well for you.

4.3.3   Flat Mode

In flat mode, MCDRAM is exposed as software-visible, OS-managed addressable memory (as a separate NUMA node), so memory can be selectively allocated to your advantage on either DDR4 or MCDRAM. With slight modifications to your software to enable use of both types of memory at the same time, the flat model can deliver uncompromising performance. If your application is DDR bandwidth limited, you can boost its performance by investigating bandwidth-critical hotspots and selectively allocating the critical data structures in high bandwidth memory.
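A minimal sketch of such selective allocation, using the hbwmalloc interface described in Appendix A, is shown below. The helper name alloc_bw_critical is ours, and the caller must remember which allocator was used so that the matching hbw_free or free is called later:

//Place a bandwidth-critical array in MCDRAM when present, in DDR otherwise (sketch)
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

double *alloc_bw_critical(size_t n){
       if(hbw_check_available() == 0)        //0 means high bandwidth memory is available
              return (double *)hbw_malloc(n * sizeof(double));
       printf("MCDRAM not available, falling back to DDR\n");
       return (double *)malloc(n * sizeof(double));
}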

4.3.4   Hybrid Mode

The hybrid model offers a bit of both worlds: at boot time, MCDRAM is configured in flat mode and a portion (25% or 50%) of it is configured as cache.

4.3.5   DDR Bandwidth Analysis

One of the important steps in deciding which memory mode will work best for you is to analyze the memory behavior of your application. Start by asking whether your application is DDR bandwidth bound. If it is, can you find hotspots where peak bandwidth is attained and identify which data structures are bandwidth critical? Is it possible to fit those bandwidth-critical data structures in MCDRAM?

To help answer these questions, we use the sample source in Appendix B to demonstrate how Intel® VTune™ Amplifier can be used to analyze peak DDR bandwidth and identify bandwidth-critical data structures.

4.3.5.1 Sample Kernels

DAXPY – We use a simplified daxpy routine in which a vector is multiplied by a constant and added to another vector. It has been modified to use OpenMP parallel for with explicit vectorization via the simd clause:

//A simple DAXPY kernel
void run_daxpy(double A[], double PI, double B[], unsigned long vectorSize){
       unsigned long i = 0;
#pragma omp parallel for simd
        for(i=0; i<vectorSize; i++){
              B[i] = PI*A[i] + B[i];
        }
       return;
}


swap_low_and_high() – A dummy subroutine that performs a special swap, as shown below:

    Input Array:  A B C D E F G H
    Output Array: A E C G B F D H

A similar kind of rearrangement is commonly seen after the filtering step of a Discrete Wavelet Transform, to separate low and high frequency elements.

//Rearranging odd- and even-position elements into low and high vectors
void swap_low_and_high(unsigned long vectorSize, double C[]){
       unsigned long i = 0, j = 0;
       unsigned long half = vectorSize/2;
       double temp = 0.0;

#pragma omp parallel for private(temp, j)
       for(i=0; i<half; i+=2){
              j = half + i;      //partner index in the upper half
              temp = C[i+1];
              C[i+1] = C[j];
              C[j] = temp;
       }
       return;
}

4.3.5.2 Analysis Using Sample Source

  • Set up the environment for Compiler and Intel® VTune Amplifier
$ source /opt/intel/composerxe/bin/compilervars.sh intel64
$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
  • Compile and profile bandwidth for simpleDAXPY application
$ icc -g -O3 -o simpleDAXPY_ddr simpleDAXPY.c -openmp –lpthread
$ amplxe-cl --collect bandwidth -r daxpy_swap_BW -- numactl --membind=0 --cpunodebind=0 ./simpleDAXPY_ddr 512 5

Note
-   In order to simplify the analysis, we bind the application to run on only one socket.
-   The number of array elements selected here is 512M (512 x 1024 x 1024), and both DAXPY and SWAP_LOW_HIGH are repeated 5 times to generate enough samples for analysis

  • Analyze the bandwidth profile using Intel® VTune Amplifier
$ amplxe-gui daxpy_swap_BW

Figure 3. Bandwidth Profile

From Figure 3, it can be seen that the simpleDAXPY application attains a single-socket bandwidth of ~57 GB/s, which is comparable to the practical peak bandwidth of a Haswell[6] system with 4 DDR channels per socket. Hence it can be inferred that the application is DDR memory bandwidth bound.

Note
-   Peak bandwidth here refers to the STREAM benchmark Triad result in GB/s
-   The single-socket peak theoretical bandwidth for the experimental setup is
2133 (MT/s) * 8 (Bytes/transfer) * 4 (Channels/socket) / 1000 = ~68 GB/s

  • Identify bandwidth critical data structures

To identify which data structures should be allocated in high bandwidth memory, it is important to look at some of the core counters that correlate with DDR bandwidth. Two such counters of interest are MEM_LOAD_UOPS_RETIRED.L3_MISS and MEM_LOAD_UOPS_RETIRED.L2_MISS_PS. These core hardware event counters can be collected by profiling the application with Intel® VTune™ Amplifier as follows:

$ amplxe-cl -collect-with runsa -data-limit=0 -r daxpy_swap_core_counters -knob event-config=UNC_M_CAS_COUNT.RD,UNC_M_CAS_COUNT.WR,CPU_CLK_UNHALTED.THREAD,CPU_CLK_UNHALTED.REF_TSC,MEM_LOAD_UOPS_RETIRED.L3_MISS,MEM_LOAD_UOPS_RETIRED.L2_MISS_PS -- numactl --membind=0 --cpunodebind=0 ./simpleDAXPY_ddr 512 5
  • Analyze the core hardware event counters using Intel® VTune Amplifier GUI
$ amplxe-gui daxpy_swap_core_counters

With the Hardware Event Counts viewpoint selected, we look at the PMU Events graph.

To see the filtered graph shown in Figure 4, uncheck the Thread box and select MEM_LOAD_UOPS_RETIRED.L3_MISS in the Hardware Event Count drop-down box.

Now, by looking at the sources of the most LLC[7] misses, the data structures that contribute to peak bandwidth can be identified. As shown in Figure 5, the region of peak LLC misses can be zoomed into and filtered by selection to get further information about the contributors.

Figure 4. MEM_LOAD_UOPS_RETIRED.L3_MISS Profile

 

Figure 5. Zoom In and Filter the region of peak L3 cache misses

 

As shown in Figure 6, right-click the top contributor and view the source to find the exact line in the source code where the LLC misses peak.

Figure 6. Top Contributor of L3 Misses
 

Figure 7. Suggestions for BW Critical data structures

 

  • Changing allocations to high bandwidth memory

HBWMALLOC is a new memory allocation library that abstracts NUMA programming details and helps application developers use high bandwidth memory on the Intel® Xeon® and Intel® Xeon Phi™ x200 product families from both Fortran and C. Using the interface is as simple as replacing malloc calls with hbw_malloc in C, or placing explicit declarations in Fortran, as shown below:

C - Application program interface

//Allocate 1000 floats from DDR
float   *fv;
fv = (float *)malloc(sizeof(float) * 1000);


//Allocate 1000 floats from MCDRAM
float   *fv;
fv = (float *)hbw_malloc(sizeof(float) * 1000);

 

FORTRAN – Application program interface

//Allocate arrays from MCDRAM & DDR
c     Declare arrays to be dynamic
      REAL, ALLOCATABLE:: A(:), B(:), C(:)
!DEC$ ATTRIBUTES, FASTMEMORY :: A
      NSIZE=1024
c
c     allocate array ‘A’ from MCDRAM
c
      ALLOCATE (A(1:NSIZE))
c
c     Allocate arrays that will come from DDR
c
      ALLOCATE  (B(NSIZE), C(NSIZE))

Please refer to Appendix A for further details about hbwmalloc, the high bandwidth memory allocation library.

The above profiling results suggest that arrays A and B are the data structures that should be preferentially allocated in high bandwidth memory.

In order to allocate arrays A and B in high bandwidth memory, the following modifications are made:

.
.
.
#ifndef USE_HBM      //If not using high bandwidth memory
       double *A = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
       double *B = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
       double *C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#else
       //Allocating A and B in High Bandwidth Memory
       //hbw_posix_memalign() returns a nonzero error code on failure,
       //so its return value is checked rather than the pointer
       double *A, *B, *C;
       if(hbw_posix_memalign((void**)&A, ALIGNSIZE, limit*sizeof(double)) != 0){
              printf("Unable to allocate on HBM: A\n");
       }
       printf("Allocating array A in High Bandwidth Memory\n");
       if(hbw_posix_memalign((void**)&B, ALIGNSIZE, limit*sizeof(double)) != 0){
              printf("Unable to allocate on HBM: B\n");
       }
       printf("Allocating array B in High Bandwidth Memory\n");
       C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#endif

.
.
.

#ifndef USE_HBM
       _mm_free(A);
       _mm_free(B);
       _mm_free(C);
#else
       hbw_free(A);
       hbw_free(B);
       _mm_free(C);

#endif

.
.
.

Note
-   In order to run an application built with high bandwidth memory allocations, the LD_LIBRARY_PATH environment variable must be updated to include the paths to the memkind and jemalloc libraries
-   For example: $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/build/memkind/lib:$HOME/build/jemalloc/lib

  • Simulation of low bandwidth and high bandwidth performance gap

Figure 8. Simulating KNL NUMA behavior using 2 socket Intel® Xeon®

At this time, since we do not have access to actual KNL hardware, we study the behavior of high bandwidth memory using the concept of Non-Uniform Memory Access (NUMA). We simulate low bandwidth and high bandwidth regions by allocating and accessing arrays from two separate NUMA nodes (i.e., near memory and far memory).

Example:

Let's first compile and execute simpleDAXPY without high bandwidth memory allocations. Here we bind the application to socket 0 and bind all memory allocations to socket 1.

$ icc -g -O3 -o simpleDAXPY_ddr simpleDAXPY.c -openmp –lpthread
$ numactl --membind=1 --cpunodebind=0 ./simpleDAXPY_ddr 512 5

//Output
Running with selected parameters:
No. of Vector Elements : 512M
Repetitions = 5
Threads = 16
Time - DAXPY (ms): 4074
Time - SWAP_LOW_HIGH (ms): 2051

Set NUMA node 0 (socket 0) as the high bandwidth memory node as follows:

$  export MEMKIND_HBW_NODES=0

Note
-   Explicit configuration of the HBW node is only required for simulation. In the presence of actual high bandwidth memory (MCDRAM), the memkind library will automatically identify the high bandwidth memory nodes

Now we compile and execute simpleDAXPY with high bandwidth allocations using the memkind library, as follows. Since we bind memory allocations to node 1 (i.e., socket 1) and bind the application to node 0 (i.e., socket 0), by default all allocations are placed in far memory; memory on NUMA node 0 is selected only when high bandwidth allocations are explicitly made using hbw_malloc* calls. This simulates the KNL behavior we might observe when MCDRAM is configured in flat or hybrid mode.

$ icc -O3 -DUSE_HBM -I$HOME/build/memkind/include -L$HOME/build/memkind/lib/ -L$HOME/build/jemalloc/lib/ -o simpleDAXPY_hbm simpleDAXPY.c -openmp -lmemkind -ljemalloc -lnuma -lpthread
$ numactl --membind=1 --cpunodebind=0 ./simpleDAXPY_hbm 512 5

//Output
Running with selected parameters:
No. of Vector Elements : 512M
Repetitions = 5
Allocating array A in High Bandwidth Memory
Allocating array B in High Bandwidth Memory
Threads = 16
Time - DAXPY (ms): 2355
Time - SWAP_LOW_HIGH (ms): 2068

Note
-   The performance improvement reported here is due to both reduced latency and improved bandwidth
-   At the time this white paper was written, the difference in execution times with and without hbw_malloc could only be observed on systems running RHEL* 7.0 and later. This may be due to a software bug in the handling of membind in operating systems before RHEL* 7.0

 

4.4 Other Optimization Techniques

In addition to better scaling, increased vectorization intensity, and use of high bandwidth memory, there are a number of user-level optimizations that can be applied to improve application performance. These advanced techniques proved successful for the first generation of Intel® Xeon Phi™ coprocessors and should be helpful for application development on the Intel® Xeon Phi™ x200 product family as well. Some of these optimizations aid the compiler, while others involve restructuring code to extract additional performance from your application. In order to achieve peak performance, keep the following optimizations in mind (a short AoS-to-SoA sketch follows the list):

Optimization Technique: Related Reading

  • Cache blocking: Cache Blocking Techniques
  • Loop unrolling, prefetching, tiling, unit-stride memory access: Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors
  • Array of Structures (AoS) to Structure of Arrays (SoA) transformation: Case study comparing AoS and SoA data layouts for compute intensive loop
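To make the last technique concrete, the sketch below contrasts the two layouts (the type and function names are hypothetical). In the SoA layout each field is stored contiguously, so a loop over one field makes unit-stride accesses that vectorize efficiently:

//AoS versus SoA (sketch): SoA gives unit-stride, vector-friendly access
#define N 1024

struct PointAoS  { double x, y, z; };            //x values sit 24 bytes apart (strided)
struct PointsSoA { double x[N], y[N], z[N]; };   //x values are contiguous

struct PointAoS  points_aos[N];                  //AoS layout, shown for contrast
struct PointsSoA points_soa;                     //SoA layout used by the loop below

void scale_x(double factor){
       int i;
#pragma omp simd
       for(i = 0; i < N; i++)
              points_soa.x[i] *= factor;         //unit stride: contiguous vector loads/stores
}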

5.    References

Intel® Xeon Phi™ Coprocessor code named “Knights Landing” - Application Readiness (https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-code-named-knights-landing-application-readiness)

What disclosures has Intel made about Knights Landing (https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing) 

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors (https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf) 

Knights Corner: Your Path to Knights Landing (https://software.intel.com/en-us/videos/knights-corner-your-path-to-knights-landing)

Intel® Software Development Emulator (https://software.intel.com/en-us/articles/intel-software-development-emulator)

Intel® Architecture Instruction Set Extensions Programming Reference - Intel® AVX-512 is detailed in Chapters 2-7 (https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference)

AVX-512 Instructions (https://software.intel.com/en-us/blogs/2013/avx-512-instructions)

High Bandwidth Memory (HBM): how will it benefit your application? (https://software.intel.com/en-us/articles/high-bandwidth-memory-hbm-how-will-it-benefit-your-application)

GitHub - memkind and jemalloc (https://github.com/memkind)  

 

Appendix A – HBWMALLOC

NAME

       hbwmalloc - The high bandwidth memory interface

SYNOPSIS

       #include <hbwmalloc.h>

       Link with -ljemalloc -lnuma -lmemkind -lpthread

       int hbw_check_available(void);
       void* hbw_malloc(size_t size);
       void* hbw_calloc(size_t nmemb, size_t size);
       void* hbw_realloc (void *ptr, size_t size);
       void hbw_free(void *ptr);
       int hbw_posix_memalign(void **memptr, size_t alignment, size_t size);
       int hbw_posix_memalign_psize(void **memptr, size_t alignment, size_t size, int pagesize);
       int hbw_get_policy(void);
       void hbw_set_policy(int mode);
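A minimal usage sketch of this interface (with error handling kept to a bare minimum):

//hbwmalloc usage sketch; link as above: -lmemkind -ljemalloc -lnuma -lpthread
#include <stdio.h>
#include <hbwmalloc.h>

int main(void){
       if(hbw_check_available() != 0){
              printf("No high bandwidth memory nodes found\n");
              return 1;
       }
       double *v = (double *)hbw_malloc(1000 * sizeof(double));  //1000 doubles from MCDRAM
       if(v == NULL)
              return 1;
       v[0] = 3.14;                //use like any other heap allocation
       hbw_free(v);                //hbw allocations must be released with hbw_free
       return 0;
}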

Installing jemalloc

//jemalloc and memkind can be downloaded from https://github.com/memkind
$ unzip jemalloc-memkind.zip
$ cd jemalloc-memkind
$ autoconf
$ mkdir obj
$ cd obj/
$ ../configure --enable-autogen --with-jemalloc-prefix=je_ --enable-memkind --enable-safe --enable-cc-silence --prefix=$HOME/build/jemalloc
$ make
$ make build_doc
$ make install

Installing memkind

$ unzip memkind-master.zip
$ cd memkind-master
$ ./autogen.sh
$ ./configure --prefix=$HOME/build/memkind --with-jemalloc=$HOME/build/jemalloc
$ make && make install

Update LD_LIBRARY_PATH to include locations of memkind and jemalloc

 

Appendix B – simpleDAXPY.c

/*
 *  Copyright (c) 2015 Intel Corporation.
 *  Intel Corporation All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <omp.h>
#include <immintrin.h>       //for _mm_malloc/_mm_free
#ifdef USE_HBM
#include <hbwmalloc.h>       //for hbw_posix_memalign/hbw_free
#endif

#define ALIGNSIZE 64

//A simple DAXPY kernel
void run_daxpy(double A[], double PI, double B[], unsigned long vectorSize){
       unsigned long i = 0;
#pragma omp parallel for simd
        for(i=0; i<vectorSize; i++){
              B[i] = PI*A[i] + B[i];
        }
       return;
}

//Rearranging odd- and even-position elements into low and high vectors
void swap_low_and_high(unsigned long vectorSize, double C[]){
       unsigned long i = 0, j = 0;
       unsigned long half = vectorSize/2;
       double temp = 0.0;
#pragma omp parallel for private(temp, j)
       for(i=0; i<half; i+=2){
              j = half + i;      //partner index in the upper half
              temp = C[i+1];
              C[i+1] = C[j];
              C[j] = temp;
       }
       return;
}


int main(int argc, char *argv[]){

       struct timeval tBefore, tAfter;
       unsigned long timeDAXPY = 0, timeSwap = 0;
       unsigned long i = 0;
       unsigned int j = 0;
       unsigned long limit = 0;
       unsigned int repetitions = 0;

       if (argc < 3){
              printf("Enter Number of Elements in Millions and number of repetitions\nEg: ./simpleDAXPY 64 5\n");
              printf("Running with default settings:\n");
              printf("No. of Vector Elements : 64M\nRepetitions = 1\n");

              limit = 64 * 1024 * 1024;
              repetitions = 1;
       }
       else
       {
              limit = atoi(argv[1]) * 1024 * 1024;
              repetitions = atoi(argv[2]);
              printf("Running with selected parameters:\n");
              printf("No. of Vector Elements : %dM\nRepetitions = %d\n", atoi(argv[1]), atoi(argv[2]));
        }

#ifndef USE_HBM
       double *A = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
       double *B = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
       double *C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#else
       double *A, *B, *C;
       //Allocating A and B in High Bandwidth Memory
       //hbw_posix_memalign() returns a nonzero error code on failure
       if(hbw_posix_memalign((void**)&A, ALIGNSIZE, limit*sizeof(double)) != 0){
              printf("Unable to allocate on HBM: A\n");
       }
       printf("Allocating array A in High Bandwidth Memory\n");
       if(hbw_posix_memalign((void**)&B, ALIGNSIZE, limit*sizeof(double)) != 0){
              printf("Unable to allocate on HBM: B\n");
       }
       printf("Allocating array B in High Bandwidth Memory\n");
       C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#endif


#pragma omp parallel for simd
        for(i=0; i<limit; i++){
                A[i] = (double)1.0*i;
                B[i] = (double)2.0*i;
                C[i] = (double)4.0*i;
        }

       double PI = (double)22/7;
       printf("Threads = %d\n", omp_get_max_threads());
       for(j = 0; j<repetitions; j++){

              gettimeofday(&tBefore, NULL);
              run_daxpy(A, PI*(j+1), B, limit);
              gettimeofday(&tAfter, NULL);
              timeDAXPY += ((tAfter.tv_sec - tBefore.tv_sec)*1000L +(tAfter.tv_usec - tBefore.tv_usec)/1000);


              gettimeofday(&tBefore, NULL);
              swap_low_and_high(limit, C);
              gettimeofday(&tAfter, NULL);
              timeSwap += ((tAfter.tv_sec - tBefore.tv_sec)*1000L +(tAfter.tv_usec - tBefore.tv_usec)/1000);

       }

       printf("Time - DAXPY (ms): %ld\n", timeDAXPY);
       printf("Time – SWAP_LOW_HIGH (ms): %ld\n", timeAverage);

#ifndef USE_HBM
       _mm_free(A);
       _mm_free(B);
       _mm_free(C);
#else
       hbw_free(A);
       hbw_free(B);
       _mm_free(C);
#endif

       return 0;
}
 

[1]  This diagram is for conceptual purposes only and only illustrates a processor and memory – it is not to scale and does not include all functional areas of the processor, nor does it represent actual component layout. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

[2] Projected result based on internal Intel analysis using estimated performance and power consumption of a rack sized deployment of Intel® Xeon® processors and Knights Landing coprocessors as compared to a rack with Knights Landing processors only 

[3] The instruction mix report for AVX2 was generated using the non-OpenMP version of the code. At this time, the version of SDE available externally fails to generate an instruction mix report when used with the "-hsw" flag and OpenMP. This issue has been fixed, and the fix will be released in the next version of SDE.

[4] SIMD – Single Instruction Multiple Data

[5] NUMA – Non Uniform Memory Access

[6] Experiment setup

  • 2 Socket 14 core Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz

  • 4 DDR channels per socket

  • Red Hat* Enterprise Linux Server release 7.0

[7] LLC – Last Level Cache

 

 

