
Tutorial on Intel® Xeon Phi™ Processor Optimization


Download File [TAR 20KB]

1. Introduction

In this tutorial, we demonstrate some possible ways to optimize an application to run on the Intel® Xeon Phi™ processor. The optimization process in this tutorial is divided into three parts:

  • The first part describes the general optimization techniques that are used to vectorize (data parallelism) the code.
  • The second part describes how thread-level parallelism is added to utilize all the available cores in the processor.
  • The third part optimizes the code by enabling memory optimization on the Intel Xeon Phi processor.

We conclude the tutorial with a graph showing the performance improvement achieved at each optimization step.

This work is organized as follows: a serial, suboptimal sample code is used as the base. We then apply general optimization techniques to that code to obtain the vectorized version, and add threaded parallelism to the vectorized code to obtain the parallel version. Finally, we use Intel® VTune™ Amplifier to analyze the memory bandwidth of the parallel code and improve performance further using the high bandwidth memory. All three versions of the code (mySerialApp.c, myVectorizedApp.c, and myParallelApp.c) are included as an attachment with this tutorial.

The sample code is a streaming application with two large sets of buffers holding the inputs and outputs. The input buffers contain the coefficients of quadratic equations; the output buffers hold the roots of each equation. For simplicity, the coefficients are chosen so that each quadratic equation always has two real roots.

Consider the quadratic equation:

    a*x^2 + b*x + c = 0

The two roots are the solutions given by the well-known formula:

    x1,2 = ( -b ± sqrt(b^2 - 4*a*c) ) / (2*a)

The conditions for having two real, distinct roots are a ≠ 0 and b^2 - 4*a*c > 0.

2. Hardware and Software

The program runs on a preproduction Intel® Xeon Phi™ processor, model 7250, with 68 cores clocked at 1.4 GHz, 96 GB of DDR4 RAM, and 16 GB of Multi-Channel Dynamic Random Access Memory (MCDRAM). With 4 hardware threads per core, this system can run a total of 272 hardware threads. We installed Red Hat Enterprise Linux* 7.2, the Intel® Xeon Phi™ processor software version 1.3.1, and Intel® Parallel Studio XE 2016 update 3 on this system.

To check the processor type and the number of logical processors in your system, inspect /proc/cpuinfo. For example:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 87
model name      : Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
stepping        : 1
microcode       : 0xffff0180
cpu MHz         : 1515.992
cache size      : 1024 KB
physical id     : 0
siblings        : 272
core id         : 0
cpu cores       : 68
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt
bogomips        : 2793.59
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
...

The complete output from the test system shows 272 CPUs, or hardware threads. Note that the flags field lists the instruction extensions avx512f, avx512pf, avx512er, and avx512cd; these are the Intel AVX-512 extensions that the Intel Xeon Phi processor supports.

You can also display information about the CPU by running lscpu:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                272
On-line CPU(s) list:   0-271
Thread(s) per core:    4
Core(s) per socket:    68
Socket(s):             1
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 87
Model name:            Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping:              1
CPU MHz:               1365.109
BogoMIPS:              2793.59
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-271
NUMA node1 CPU(s):

The above command shows that the system has 1 socket, 68 cores per socket, and 272 CPUs. It also indicates 2 NUMA nodes: all 272 CPUs belong to NUMA node 0, while node 1 has none (it holds only the MCDRAM, as shown in Section 6). For more information on NUMA, please refer to An Intro to MCDRAM (High Bandwidth Memory) on Knights Landing.

Before analyzing and optimizing the sample program, compile the program and run the binary to get the baseline performance.

3. Benchmarking the Baseline Code

A naïve implementation of the solution is shown in the attached program mySerialApp.c. The coefficients a, b, and c are grouped in the structure Coefficients; the roots x1 and x2 are grouped in the structure Roots. Coefficients and roots are single-precision floating-point numbers. Each coefficient tuple corresponds to a root tuple. The program allocates N coefficient tuples and N root tuples, where N is a large number (N = 512M elements or, to be exact, 512*1024*1024 = 536,870,912 elements). The two structures are shown below:

struct Coefficients {
    float a;
    float b;
    float c;
} coefficients;

struct Roots {
    float x1;
    float x2;
} roots;

This simple program computes the real roots x1 and x2 according to the above formula. We use the standard system timer to measure the computation time; buffer allocation and initialization time are not measured. The program repeats the calculation 10 times.
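In essence, the serial kernel looks like the following sketch (illustrative only; the attached mySerialApp.c is the authoritative version). Note the double-precision sqrt() and integer literals, which Section 4.2 removes:

for (j = 0; j < ITERATIONS; j++)
{
    for (i = 0; i < N; i++)
    {
        /* One tuple at a time, in scalar mode */
        float a = coefficients[i].a, b = coefficients[i].b, c = coefficients[i].c;
        roots[i].x1 = (-b + sqrt(b*b - 4*a*c)) / (2*a);
        roots[i].x2 = (-b - sqrt(b*b - 4*a*c)) / (2*a);
    }
}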

To start, benchmark the application by compiling the baseline code using the Intel® C++ Compiler:

$ icc mySerialApp.c

By default, the compiler compiles with the switch -O2, which optimizes for maximum speed. We then run the application:

$ ./a.out
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
SERIAL
Elapsed time in msec: 461,222 (after 10 iterations)

The output indicates that for that large number of entries (N = 512M elements), the system takes 461,222 msec to complete 10 iterations of streaming the data, computing the roots, and storing the results. For each coefficient tuple, the program calculates one root tuple. Note that this baseline code takes advantage of neither the large number of available cores nor the SIMD instructions, since it runs in serial, scalar mode (one thread processing one tuple element at a time). Therefore, only one hardware thread (CPU) is busy; all the remaining CPUs are idle. You can verify this by generating a vectorization report (*.optrpt) using the compiler options -qopt-report=5 -qopt-report-phase:vec.

$ icc mySerialApp.c -qopt-report=5 -qopt-report-phase:vec

After measuring the baseline code performance, we can start vectorizing the code.

4. Vectorizing Code

4.1. Change the array of structures to a structure of arrays; do not use multiple layers in buffer allocation

The first way to improve the code performance is to convert the array of structures (AoS) to a structure of arrays (SoA). SoA increases the amount of data accessed with unit stride. Instead of defining a large number of coefficient tuples (a, b, c) and root tuples (x1, x2), we rearrange the data so that it is allocated as 5 large arrays: a, b, c, x1, and x2 (refer to the program myVectorizedApp.c). In addition, instead of using malloc to allocate memory, we use _mm_malloc to align the data on a 64-byte boundary (see the next section).

float *coef_a  = (float *)_mm_malloc(N * sizeof(float), 64);
float *coef_b  = (float *)_mm_malloc(N * sizeof(float), 64);
float *coef_c  = (float *)_mm_malloc(N * sizeof(float), 64);
float *root_x1 = (float *)_mm_malloc(N * sizeof(float), 64);
float *root_x2 = (float *)_mm_malloc(N * sizeof(float), 64);
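Buffers allocated with _mm_malloc() must later be released with _mm_free() rather than free(); for example:

_mm_free(coef_a);
_mm_free(coef_b);
_mm_free(coef_c);
_mm_free(root_x1);
_mm_free(root_x2);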

4.2. Further improvement: removing type conversion, data alignment

The next step is to remove unnecessary type conversions. For example, the function sqrt() takes a double-precision argument, but since we pass single-precision values in this program, the compiler must convert them to double precision. To remove this unnecessary conversion, use sqrtf() instead of sqrt(). Similarly, use single-precision constants instead of integer constants, and so on; for example, instead of 4, use 4.0f. Note that 4.0 (without the f suffix) is a double-precision floating-point number, while 4.0f is a single-precision one.
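As an illustration (a sketch, not the attached code verbatim), the root computation becomes pure single precision:

/* Before: sqrt() and the integer literal 4 force double-precision intermediates */
x1[i] = (-b[i] + sqrt(b[i]*b[i] - 4*a[i]*c[i])) / (2*a[i]);

/* After: sqrtf() and float literals keep the whole expression in single precision */
x1[i] = (-b[i] + sqrtf(b[i]*b[i] - 4.0f*a[i]*c[i])) / (2.0f*a[i]);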

Data alignment allows data to move efficiently to and from memory. For the Intel Xeon Phi processor, data movement is optimal when the starting address of the data lies on a 64-byte boundary, just as for the Intel® Xeon Phi™ coprocessor. To help the compiler vectorize, you need to allocate memory with 64-byte alignment and use pragmas/directives at the point of use to tell the compiler that memory accesses are aligned. Vectorization performs best with properly aligned data. In this document, vectorization refers to the ability to process multiple data elements in a single instruction (SIMD).

In the above example, to align heap-allocated data, we use _mm_malloc() and _mm_free() to allocate and free the arrays. Note that _mm_malloc() behaves like malloc() but takes the alignment (in bytes) as a second argument, which is 64 for the Intel Xeon Phi processor. We then need to inform the compiler: insert __assume_aligned(a, 64) before the code that uses an array a to indicate that the array is aligned, or add the clause #pragma vector aligned before a loop to inform the compiler that all arrays accessed in that loop are aligned.
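A minimal sketch of both hints, using the arrays allocated above (the loop body is illustrative):

__assume_aligned(coef_a, 64);    /* coef_a starts on a 64-byte boundary */
__assume_aligned(root_x1, 64);   /* so does root_x1 */

/* Or assert that every array accessed in the next loop is aligned: */
#pragma vector aligned
for (i = 0; i < N; i++)
    root_x1[i] = coef_a[i] * coef_b[i];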

4.3. Use auto-vectorization, run a compiler report, and disable vectorization via a compiler switch

Vectorization refers to programming techniques that use the available Vector Processing Units (VPUs) to perform one operation on multiple values simultaneously. Auto-vectorization is the capability of a compiler to identify opportunities in loops and vectorize them accordingly. You can take advantage of the Intel compiler's auto-vectorization feature, since it is enabled at the default optimization level -O2 and higher.

For example, when the mySerialApp.c sample code is compiled with the Intel compiler icc, the compiler by default looks for vectorization opportunities in the loops. However, certain conditions must hold (the loop trip count must be known, single entry and single exit, straight-line code, the innermost loop of a nest, and so on) in order for the compiler to vectorize these loops. You can help the compiler vectorize by providing additional information.

To determine whether or not your code was vectorized, generate a vectorization report by specifying the options -qopt-report=5 -qopt-report-phase:vec. The compiler then writes a vectorization report (*.optrpt) that tells you whether or not each loop was vectorized, with a brief explanation. Note that the vectorization report option is -qopt-report=<n>, where n specifies the level of detail.

4.4. Compile with optimization level -O3

Now compile the application with optimization level -O3. This optimization level targets maximum speed and performs more aggressive optimization than the default level -O2.

With auto-vectorization, in each loop iteration, instead of processing one element at a time, the compiler packs 16 single-precision floating-point numbers into a 512-bit vector register and performs the operation on the whole vector.

$ icc myVectorizedApp.c -O3 -qopt-report -qopt-report-phase:vec -o myVectorizedApp

The compiler generates the following output files: a binary, myVectorizedApp, and a vectorization report, myVectorizedApp.optrpt. To run the binary:

$ ./myVectorizedApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
Elapsed time in msec: 30496 (after 10 iterations)

The binary runs with only one thread but with vectorization. The myVectorizedApp.optrpt report should confirm that all the inner loops are vectorized.

To compare, also compile the program with the -no-vec option:

$ icc myVectorizedApp.c -O3 -qopt-report -qopt-report-phase:vec -o myVectorizedApp-noVEC -no-vec
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location

Now run the myVectorizedApp-noVEC binary:

$ ./myVectorizedApp-noVEC
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
Elapsed time in msec: 180375 (after 10 iterations)

This time the report shows that the loops were not vectorized because auto-vectorization was disabled, as expected.

We can now observe two improvements. The improvement from the original version (461,222 msec) to the no-vec version (180,375 msec) comes from the general optimization techniques; the improvement from the non-vectorized version (180,375 msec) to the vectorized one (30,496 msec) comes from auto-vectorization.

Even with this improvement in performance, it is still only one thread performing the calculation. The code can be further enhanced by enabling multiple threads running in parallel to take advantage of the multi-core architecture.

5. Enabling Multi-Threading

5.1. Thread-level parallelism: OpenMP*

To take advantage of the large number of cores on the Intel Xeon Phi processor (68 cores on this system), you can scale the application by running OpenMP threads in parallel. OpenMP is the standard API and programming model for shared-memory parallelism.

To use OpenMP threads, you need to include the header file "omp.h" and link the code with the flag -qopenmp. In the myParallelApp.c program, the following directive is added before the for-loop:

#pragma omp parallel for simd

This pragma directive, added before the for-loop, tells the compiler to create a team of threads and divide the work of the for-loop into chunks; each thread executes a number of chunks according to the OpenMP runtime schedule. The simd construct indicates that multiple iterations of the loop can be executed concurrently with SIMD instructions; it tells the compiler to ignore assumed vector dependencies in the loop, so use it carefully.

In this program, thread-level parallelism and vectorization occur in the same loop, and each thread starts at its own lower bound of the loop. To keep each thread's chunk aligned under OpenMP static scheduling, we restrict the parallel loop to the first N1 iterations, where N1 is a multiple of numthreads * 16 (16 single-precision floats fill one 512-bit vector), and process the remaining N - N1 iterations serially.

#pragma omp parallel
#pragma omp master
    {
        int tid = omp_get_thread_num();
        numthreads = omp_get_num_threads();

        printf("thread num=%d\n", tid);
        printf("Initializing\r\n");

        // Assuming OpenMP static scheduling, limit the parallel loop size to N1,
        // a multiple of (numthreads * 16), instead of N
        N1 = ((N / numthreads)/16) * numthreads * 16;
        printf("numthreads = %d, N = %d, N1 = %d, num-iters in remainder serial loop = %d, parallel-pct = %f\n", numthreads, N, N1, N-N1, (float)N1*100.0/N);
    }

And the loop that computes the roots becomes:

for (j=0; j<ITERATIONS; j++)
    {
#pragma omp parallel for simd
#pragma vector aligned
        for (i=0; i<serial; i++)   // First N1 (= serial) iterations: parallel and vectorized
        {
            x1[i] = (- b[i] + sqrtf((b[i]*b[i] - 4.0f*a[i]*c[i])) ) / (2.0f*a[i]);
            x2[i] = (- b[i] - sqrtf((b[i]*b[i] - 4.0f*a[i]*c[i])) ) / (2.0f*a[i]);
        }

#pragma vector aligned
        for (i=serial; i<vectorSize; i++)   // Remainder (N - N1 iterations): vectorized but serial
        {
            x1[i] = (- b[i] + sqrtf((b[i]*b[i] - 4.0f*a[i]*c[i])) ) / (2.0f*a[i]);
            x2[i] = (- b[i] - sqrtf((b[i]*b[i] - 4.0f*a[i]*c[i])) ) / (2.0f*a[i]);
        }
    }

Now you can compile the program and link it with -qopenmp:

$ icc myParallelApp.c -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o myParallelApp -qopenmp

Check the myParallelApp.optrpt report to confirm that all loops are vectorized and parallelized with OpenMP.

5.2. Use environment variables to set the number of threads and to set affinity

The OpenMP implementation starts a number of threads in parallel. By default, the number of threads is set to the number of hardware threads in the system, so 272 OpenMP threads would run by default here. However, we can use the OMP_NUM_THREADS environment variable to set the number of OpenMP threads. For example, the command below starts 68 OpenMP threads:

$ export OMP_NUM_THREADS=68

Thread affinity (the ability to bind an OpenMP thread to a CPU) can be set with the KMP_AFFINITY environment variable. To distribute the threads evenly across the system, set the variable to scatter:

$ export KMP_AFFINITY=scatter
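Other binding policies exist; for example, compact places threads as close together as possible, which can help threads that share data but is not used in this tutorial:

$ export KMP_AFFINITY=compact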

Now run the program using all the cores in the system and vary the number of threads running per core. Here is the output from our test runs comparing the performance when running 1, 2, 3, and 4 threads per core.

Running 1 thread per core in the test system:

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 1722 (after 10 iterations)

Running 2 threads per core:

$ export OMP_NUM_THREADS=136
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 136, N = 536870912, N1 = 536869248, num-iters in remainder serial loop = 1664, parallel-pct = 99.999690
Starting Compute on 136 threads
Elapsed time in msec: 1781 (after 10 iterations)

Running 3 threads per core:

$ export OMP_NUM_THREADS=204
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 204, N = 536870912, N1 = 536869248, num-iters in remainder serial loop = 1664, parallel-pct = 99.999690
Starting Compute on 204 threads
Elapsed time in msec: 1878 (after 10 iterations)

Running 4 threads per core:

$ export OMP_NUM_THREADS=272
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 272, N = 536870912, N1 = 536867072, num-iters in remainder serial loop = 3840, parallel-pct = 99.999285
Starting Compute on 272 threads
Elapsed time in msec: 1940 (after 10 iterations)

From the above results, the best performance (1722 msec) is obtained when running one thread per core across all 68 cores.

6. Optimizing the Code for the Intel Xeon Phi Processor

6.1. Memory bandwidth optimization

There are two types of memory on the system: 16 GB of on-package MCDRAM, and 96 GB of traditional on-platform DDR4 RAM in 6 channels (with the option to extend to a maximum of 384 GB). MCDRAM bandwidth is about 500 GB/s, while DDR4 peak bandwidth is about 90 GB/s.

There are three possible configuration modes for the MCDRAM: flat mode, cache mode, or hybrid mode. If MCDRAM is configured as addressable memory (flat mode), the user can explicitly allocate memory in MCDRAM. If MCDRAM is configured as cache, the entire MCDRAM is used as a last-level cache between the L2 cache and DDR4 memory. If MCDRAM is configured as hybrid, a portion of MCDRAM is used as cache and the rest as addressable memory. The pros and cons of these configurations are summarized below:

Flat
  • Pros: User can control MCDRAM allocation to take advantage of the high bandwidth memory
  • Cons: User needs to use numactl or modify the code

Cache
  • Pros: Transparent to the user; extends the cache hierarchy
  • Cons: Potentially increases the latency of loads/stores that fall through to DDR4

Hybrid
  • Pros: The application can take advantage of both flat and cache modes
  • Cons: Carries the cons of both flat and cache modes

With respect to the Non-Uniform Memory Access (NUMA) architecture, the Intel Xeon Phi processor appears as one or two nodes depending on how the MCDRAM is configured: as cache, the processor appears as 1 NUMA node; as flat or hybrid, it appears as 2 NUMA nodes. Note that Clustering Mode can further partition the Intel Xeon Phi processor into up to 8 NUMA nodes; however, Clustering Mode is not covered in this tutorial.

The numactl utility can be used to display the NUMA nodes in the system. For example, executing "numactl -H" on this system, where MCDRAM is configured in flat mode, shows two NUMA nodes: node 0 consists of the 272 CPUs and the 96 GB of DDR4 memory, while node 1 consists of the 16 GB of MCDRAM.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92888 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15926 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

The "numactl" tool can be used to allocate memory in certain NUMA modes. In this example, the node 0 contains all CPUs and the on-platform memory DDR4, while node 1 has the on-packet memory MCDRAM. The switch –m , or –-membind, can be used to force the program to allocate memory to a NUMA node number.

To force the application to allocate memory from DDR4 (node 0), run the command:

$ numactl -m 0 ./myParallelApp

This is equivalent to running without numactl, since node 0 is the default local node:

$ ./myParallelApp

Now run the application with 68 OpenMP threads:

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68

$ numactl -m 0 ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 1730 (after 10 iterations)

To display another view of the NUMA nodes, run the command lstopo. This command displays not only the NUMA nodes, but also the L1 and L2 caches associated with them.
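For example (lstopo is part of the hwloc package):

$ lstopo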

6.2. Analyze memory usage

Is this application bandwidth bound? Use the Intel VTune Amplifier to analyze memory accesses. DDR4 DRAM peak performance bandwidth is about 90 GB/s (gigabytes per second), while MCDRAM memory peak performance is around 500 GB/s.

Install Intel VTune Amplifier on your system, and then run the following Intel VTune Amplifier command to collect memory access information while the application allocates DDR memory:

$ export KMP_AFFINITY=scatter; export OMP_NUM_THREADS=68; amplxe-cl -collect memory-access -- numactl -m 0 ./myParallelApp

You can view bandwidth utilization of your application by looking at the “Bandwidth Utilization Histogram” field. The histogram shows that DDR bandwidth utilization is high.

[Figure: Bandwidth Utilization Histogram]

Looking at the memory access profile, we observe that the peak DDR4 bandwidth reaches 96 GB/s, which is around the peak performance bandwidth of DDR4. The result suggests that this application is memory bandwidth bound.

[Figure: Memory access profile; peak DDR4 bandwidth reaches 96 GB/s]

Looking at the memory allocation in the application, we see that it allocates 5 large arrays of 512M elements each (to be precise, 512 * 1024 * 1024 elements). Each element is a single-precision float (4 bytes), so each array is about 4 bytes * 512M = 2 GB, and the total allocation is 5 * 2 GB = 10 GB. This fits comfortably in the 16 GB MCDRAM, so allocating the memory in MCDRAM (flat mode) is likely to benefit the application.
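As an alternative to numactl, a program can request MCDRAM explicitly through the memkind library's hbwmalloc interface. The attached code does not do this, but a minimal sketch (assuming libmemkind is installed; link with -lmemkind) looks like:

#include <hbwmalloc.h>   /* memkind's high-bandwidth memory API */

/* Allocate one of the five arrays from MCDRAM, 64-byte aligned */
float *coef_a = NULL;
if (hbw_check_available() == 0)   /* returns 0 when MCDRAM is available */
    hbw_posix_memalign((void **)&coef_a, 64, N * sizeof(float));

/* ... compute as before ... */

hbw_free(coef_a);   /* hbw_* allocations must be released with hbw_free() */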

To allocate the memory in MCDRAM (node 1), pass the argument -m 1 to the command numactl as below:

$ numactl -m 1 ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 498 (after 10 iterations)

We clearly see a significant performance improvement when the application allocates memory in MCDRAM.

For comparison purposes, run the following Intel VTune Amplifier command to collect memory access information while the application allocates MCDRAM memory:

$ export KMP_AFFINITY=scatter; export OMP_NUM_THREADS=68; amplxe-cl -collect memory-access -- numactl -m 1 ./myParallelApp

The histogram now shows that DDR bandwidth utilization is low and MCDRAM utilization is high:

[Figure: Bandwidth Utilization Histogram]

Looking at the memory access profile, we observe that DDR4 peak bandwidth reaches 2.3 GB/s, while MCDRAM peak bandwidth reaches 437 GB/s.

[Figure: Memory access profile; DDR4 peak bandwidth reaches 2.3 GB/s, MCDRAM peak bandwidth reaches 437 GB/s]

6.3. Compiling using the compiler knob -xMIC-AVX512

The Intel Xeon Phi processor supports x87, Intel® Streaming SIMD Extensions (Intel® SSE), Intel® SSE2, Intel® SSE3, Supplemental Streaming SIMD Extensions 3, Intel® SSE4.1, Intel® SSE4.2, Intel® Advanced Vector Extensions (Intel® AVX), Intel Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel AVX-512) Instruction Set Architecture (ISA). It does not support Intel® Transactional Synchronization Extensions.

Intel AVX-512 is implemented in the Intel Xeon Phi processor, which supports the following groups: Intel AVX-512F, Intel AVX-512CD, Intel AVX-512ER, and Intel AVX-512PF. Intel AVX-512F (Intel AVX-512 Foundation Instructions) extends the Intel AVX and Intel AVX2 SIMD instructions to 512-bit vector registers; Intel AVX-512CD (Intel AVX-512 Conflict Detection) enables efficient conflict detection to allow more loops to be vectorized; Intel AVX-512ER (Intel AVX-512 Exponential and Reciprocal Instructions) provides instructions for base-2 exponential functions, reciprocals, and inverse square roots; and Intel AVX-512PF (Intel AVX-512 Prefetch Instructions) is useful for reducing memory operation latency.

To take advantage of Intel AVX-512, compile the program with the compiler knob -xMIC-AVX512:

$ icc myParallelApp.c -o myParallelApp-AVX512 -qopenmp -O3 -xMIC-AVX512

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68

$ numactl -m 1 ./myParallelApp-AVX512
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 316 (after 10 iterations)

Note that you can run the following command, which generates an assembly file named myParallelApp.s:

$ icc -O3 myParallelApp.c -qopenmp -xMIC-AVX512 -S -fsource-asm

By examining the assembly file, you can confirm that Intel AVX-512 instructions were generated.
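For instance, a quick (if crude) check is to count references to the 512-bit zmm registers in the generated assembly:

$ grep -c zmm myParallelApp.s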

6.4. Using the -no-prec-div, -no-prec-sqrt, and -fp-model fast=2 optimization flags

If high precision is not required, we can relax the floating-point model by compiling with -fp-model fast=2, which enables more aggressive (but less safe) floating-point optimizations: the compiler uses faster, less precise implementations of division and square root. For example:

$ icc myParallelApp.c -o myParallelApp-AVX512-FAST -qopenmp -O3 -xMIC-AVX512 -no-prec-div -no-prec-sqrt -fp-model fast=2
$ export OMP_NUM_THREADS=68

$ numactl -m 1 ./myParallelApp-AVX512-FAST
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 310 (after 10 iterations)

6.5. Configuring MCDRAM as cache

In the BIOS settings, configure MCDRAM as cache and reboot the system. The numactl utility should now report only one NUMA node, since MCDRAM configured as cache is transparent to this utility:

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 94409 MB
node distances:
node   0
  0:  10

Recompile the program:

$ icc myParallelApp.c -o myParallelApp-AVX512-FAST -qopenmp -O3 -xMIC-AVX512 -no-prec-div -no-prec-sqrt -fp-model fast=2

And run it:

$ export OMP_NUM_THREADS=68
$ ./myParallelApp-AVX512-FAST
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 325 (after 10 iterations)

Observe that there is no additional benefit from using MCDRAM as cache for this application (325 msec, versus 310 msec with flat mode).

7. Summary and Conclusion

In this tutorial, the following topics were discussed:

  • Memory alignment
  • Vectorizing
  • Generating the compiler report to assist code analysis
  • Inspecting /proc/cpuinfo and using the command-line utilities lscpu, numactl, and lstopo
  • Using OpenMP to add thread-level parallelism
  • Setting environment variables
  • Using Intel VTune Amplifier to profile bandwidth utilization
  • Allocating MCDRAM memory using numactl
  • Compiling with the Intel AVX-512 flag to get better performance

The graph below shows the performance improvement at each step from the baseline code: general optimization with data alignment, vectorization, adding thread-level parallelism, allocating MCDRAM memory in flat mode, compiling with Intel AVX-512, compiling with the reduced-precision flags, and using MCDRAM as cache.

[Figure: Performance improvement at each optimization step (runtime in seconds)]

We were able to reduce the execution time significantly by making use of all the available cores, Intel AVX-512 vectorization, and MCDRAM bandwidth.


About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel Sample Source Code License Agreement

