Optimizing Execution Performance for an Isotropic Double Precision 3D Finite Difference Stencil Algorithm on the Intel® Xeon Phi™ Coprocessor

Abstract

This paper discusses applying programming restructuring transformations to a baseline 3D stencil application so that the stencil executable can exploit the architectural features of the Intel® Xeon Phi™ coprocessor for double precision computation on the Linux* operating system. Application restructuring transformations in the form of loop-tiling (loop-blocking), OpenMP* pragmas, SIMD vectorization pragmas, and loop interchange are applied to the baseline 3D stencil algorithm. Intel® VTune™ Amplifier XE instrumentation is used to provide guidance on what types of application restructuring should be applied to the baseline 3D stencil algorithm.

  1. Isotropic Double Precision 3D Finite Difference Stencil Algorithm for the Intel® Xeon Phi™ Coprocessor
    1. Introduction
    2. Intel® Xeon Phi™ Coprocessor Architecture
    3. Levels of Parallelism for Intel® Xeon Phi™ Coprocessor
  2. Tiling Transformation in the X, Y, and Z Dimensions as applied to the Original 3D Stencil Data Structure
    1. OpenMP Threading Parallelization
    2. SIMD Vectorization
  3. Experimental Results for the 3D Stencil on the Intel® Xeon Phi™ Coprocessor
    1. Performance Results from the Initial Tiling Algorithm
    2. Performance Improvements from Tile Size Adjustments
    3. Performance Improvements from Loop Interchange and Tile Size Adjustments
  4. User Download
    1. Working with the Original 3D Finite Difference Stencil Source
    2. Working with the Loop Interchanged 3D Finite Difference Stencil Source
  5. References

1. Isotropic Double Precision 3D Finite Difference Stencil Algorithm for the Intel® Xeon Phi™ Coprocessor

1.1. Introduction

This paper will focus on using loop-tiling4 (loop-blocking), OpenMP* threading, vectorization, loop interchange, and Intel® VTune™ Amplifier XE analysis to performance-tune a 3D stencil scientific application for the Intel® Xeon Phi™ coprocessor. All floating-point computation is done in double precision.

Partial differential equation (PDE) solvers comprise a significant fraction of scientific applications in such diverse areas as heat diffusion, electromagnetics, and fluid dynamics3. These applications are often implemented using iterative finite-difference techniques, which sweep over a spatial grid, performing nearest neighbor computations called stencils. Stencil codes such as the Jacobi and Gauss-Seidel kernels are used in many scientific and engineering applications2. An example of a 7-point Jacobi stencil2 is as follows:

for t=1..nt                                                      // nt time steps
    for i3=1..n3-1                                               // Third dimension in space
        for i2=1..n2-1                                           // Second dimension in space
            for i1=1..n1-1                                       // First dimension in space
                b[i1,i2,i3] += a[i1-1,i2,i3] + a[i1+1,i2,i3]     // Space first dimension
                             + a[i1,i2-1,i3] + a[i1,i2+1,i3]     // Space second dimension
                             + a[i1,i2,i3-1] + a[i1,i2,i3+1]     // Space third dimension
            done
        done
    done
    swap b ↔ a
done

The above matrix assignment is in contrast to the Gauss-Seidel technique, where there are data dependencies between the right-hand-side operands a[i1-1,i2,i3], a[i1,i2-1,i3], and a[i1,i2,i3-1] and the left-hand-side assignment to a[i1,i2,i3]2:

for t=1..nt                                                      // nt time steps
    for i3=1..n3-1                                               // Third dimension in space
        for i2=1..n2-1                                           // Second dimension in space
            for i1=1..n1-1                                       // First dimension in space
                a[i1,i2,i3] += a[i1-1,i2,i3] + a[i1+1,i2,i3]     // Space first dimension
                             + a[i1,i2-1,i3] + a[i1,i2+1,i3]     // Space second dimension
                             + a[i1,i2,i3-1] + a[i1,i2,i3+1]     // Space third dimension
            done
        done
    done
done

 

1.2. Intel® Xeon Phi™ Coprocessor Architecture

There are 61 cores on an Intel® Xeon Phi™ coprocessor, and each core has 4 hardware threads. Each hardware thread can execute 512-bit Single Instruction Multiple Data (SIMD) vector instructions. A hardware thread has 32 vector registers, each 512 bits wide (Figure 1), and each register can hold 16 single precision operands or 8 double precision operands. Each core has 64 KB of L1 cache, divided into a 32 KB L1 instruction cache (I-cache) and a 32 KB L1 data cache (D-cache), plus a 512 KB unified L2 cache that holds both instructions and data.

Figure 1 – Intel® Xeon Phi™ coprocessor vector register configuration for a single hardware thread

1.3. Levels of Parallelism for Intel® Xeon Phi™ Coprocessor

An Intel® Xeon Phi™ coprocessor can support three levels of parallelism:

  1. As mentioned previously, there are 61 cores and therefore thread parallelism can be expressed across the 61 cores.
  2. Within each core there are four hardware threads available.
  3. Each hardware thread can execute single instruction, multiple data (SIMD) vector instructions.

This means that with 61 cores, and 4 hardware threads per core, 244 threads can be launched for an application. Each thread can further exploit SIMD vector instructions.
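To illustrate how these levels combine in practice, consider the following minimal sketch. It is not taken from the stencil package; the array names, problem size, and chunk length are chosen only for illustration. An OpenMP parallel loop spreads chunks of work across the available hardware threads, while an Intel compiler SIMD pragma vectorizes the inner loop within each thread:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N      (1 << 20)    /* illustrative problem size             */
#define CHUNK  4096         /* illustrative per-thread chunk of work */

int main(void)
{
    double *a = (double *) malloc(N * sizeof(double));
    double *b = (double *) malloc(N * sizeof(double));
    double *c = (double *) malloc(N * sizeof(double));

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Levels 1 and 2: OpenMP threads are spread across the cores and the
     * hardware threads within each core (for example, 244 threads when
     * run natively with OMP_NUM_THREADS=244). */
    #pragma omp parallel for
    for (int i = 0; i < N; i += CHUNK) {
        /* Level 3: the inner loop is vectorized into 512-bit SIMD
         * instructions, each operating on 8 double precision operands. */
        #pragma simd
        for (int j = i; j < i + CHUNK; j++)
            c[j] = a[j] + b[j];
    }

    printf("c[0] = %f, max threads = %d\n", c[0], omp_get_max_threads());
    free(a); free(b); free(c);
    return 0;
}

Built natively for the coprocessor with OpenMP support (for example, using the -mmic option mentioned in Section 4), each thread executes its chunks with 8-wide double precision SIMD instructions.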

2. Tiling Transformation in the X, Y, and Z Dimensions as applied to the Original 3D Stencil Data Structure

In the preceding sections, the hierarchical cache architecture and the hierarchical levels of parallelism of the Intel® Xeon Phi™ coprocessor have been discussed at a high level. The discussion now turns to ways of restructuring a programming application so that it can exploit the architectural features of the Intel® Xeon Phi™ coprocessor for double precision computation. Tiling (blocking) is a transformation which combines strip-mining with loop permutation to form small tiles of loop iterations which are executed together to exploit data locality1. Tiling has two main benefits:

  1. It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
  2. It reduces the number of iterations of the loop by a factor equal to the length of each "vector", that is, the number of operations performed per SIMD instruction. For double precision vector operations on the Intel® Xeon Phi™ coprocessor, the iteration count is reduced by a factor of 8: eight double precision data items are processed per single 512-bit vector SIMD instruction.

The original Jacobi 3D stencil algorithm from Borges looks something like the following4:

for t=1..nt                                                              // nt time steps
    for i3=4..n3-4                                                       // Third dimension in space
        for i2=4..n2-4                                                   // Second dimension in space
            for i1=4..n1-4                                               // First dimension in space
                div = coeff[0]*prev[i1,i2,i3]
                for r=1..4                                               // 8th order stencil
                    div += coeff[r]*(prev[i1+r,i2,i3] + prev[i1-r,i2,i3]     // Space first dimension
                                   + prev[i1,i2+r,i3] + prev[i1,i2-r,i3]     // Space second dimension
                                   + prev[i1,i2,i3+r] + prev[i1,i2,i3-r])    // Space third dimension
                done
                next[i1,i2,i3] = 2*prev[i1,i2,i3] - next[i1,i2,i3] + div*vel[i1,i2,i3]   // time update
            done
        done
    done
    swap prev ↔ next
done

Note that this is a 25-point, 3-dimensional Jacobi stencil algorithm4.

Recall that each Intel® Xeon Phi™ coprocessor core has a 512 KB L2 cache and 64 KB of L1 cache, where the L1 cache is split evenly between instructions and data and the L2 cache holds both. The lattice traversal order of the stencil algorithm shown above may not run efficiently in its present form, because for large problem sizes the data elements that the stencil reuses along the third dimension may not all fit in the cache.

Tiling is a program transformation that compilers can apply to capture this reuse requirement of 3-dimensional stencils. To understand tiling, let us look at a simple lattice data structure. For a lattice “a” with dimensions a[1][4][2], the subscripting reference patterns would be:

Figure 2 – Subscript indices for lattice “a” with dimensions a[1][4][2]

A pictorial lattice representation for lattice “a”, with dimensions a[1][4][2], might look something like the illustration shown in Figure 3.

Figure 3 – Lattice “a” with dimensions a[1][4][2]

With respect to Figure 3, the linearized storage mapping in computer memory for a row-major order data layout would look something like what is shown in Figure 4:

Figure 4 – Linearized storage representation for lattice “a” with dimensions a[1][4][2]

In this simple example from Figure 4, the overall storage could be broken into tiles of, say, size 2. Each tile could be processed by an individual core on the Intel® Xeon Phi™ coprocessor, and each of the two elements in a tile could be acted upon by a hardware thread within that core.
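A minimal sketch of this layout is shown below, assuming the C row-major convention of Figure 4; the helper name and the tile loop are illustrative only. It computes the linear offset of each element of a[1][4][2] and then walks the linearized storage one tile of two elements at a time:

#include <stdio.h>

#define D1 1
#define D2 4
#define D3 2
#define TILE 2   /* illustrative tile size from the Figure 4 discussion */

/* Row-major linear offset of a[i][j][k] for a lattice with dimensions a[D1][D2][D3]. */
static int offset(int i, int j, int k)
{
    return (i * D2 + j) * D3 + k;
}

int main(void)
{
    double a[D1 * D2 * D3];

    /* Fill the lattice so that each element records its own linear offset. */
    for (int i = 0; i < D1; i++)
        for (int j = 0; j < D2; j++)
            for (int k = 0; k < D3; k++)
                a[offset(i, j, k)] = (double) offset(i, j, k);

    /* Walk the linearized storage one tile at a time; on the coprocessor,
     * each tile could be handed to a different core or hardware thread. */
    for (int t = 0; t < D1 * D2 * D3; t += TILE) {
        printf("tile starting at linear index %d:", t);
        for (int e = t; e < t + TILE; e++)
            printf(" %.0f", a[e]);
        printf("\n");
    }
    return 0;
}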

2.1. OpenMP Threading Parallelization

The parallelization approach by Borges4 is driven by a tiling methodology: threaded tile-blocks of size n1_Tblock × n2_Tblock × n3_Tblock partition the data across the cores and the hardware threads available within each core of the Intel® Xeon Phi™ coprocessor. The original n1 × n2 × n3 domain is decomposed into a list of indices describing n1_Tblock × n2_Tblock × n3_Tblock blocks. This list has length num_blocks = num_n1_blocks × num_n2_blocks × num_n3_blocks = ceiling((n1-2×R)/n1_Tblock) × ceiling((n2-2×R)/n2_Tblock) × ceiling((n3-2×R)/n3_Tblock), where R = 4 is the stencil radius, and it is padded so that each block description is aligned on cache line boundaries. The threading is expressed with OpenMP semantics4:

int index=0;
for (int i3b=4; i3b<n3-4; i3b+=n3_Tblock)
  for(int i2b=4; i2b<n2-4; i2b+=n2_Tblock)
    for(int i1b=4; i1b<n1-4; i1b+=n1_Tblock) {
      blocking[index].i1_idx = i1b;
      blocking[index].i2_idx = i2b;
      blocking[index].i3_idx = i3b;
      index++;
    }

#pragma omp parallel for num_threads(num_threads) schedule(dynamic)       \
   firstprivate (n1, n2, n3, num_blocks, n1_Tblock, n2_Tblock, n3_Tblock) \
   shared( coeff, prev, next, vel, blocking)
   for (int i=0; i<num_blocks; i++) {
     int i1b = blocking[i].i1_idx;
     int i2b = blocking[i].i2_idx;
     int i3b = blocking[i].i3_idx;
     apply_stencil(next, prev, vel, coeff, i1b, i2b, i3b, n1, n2, n3, n1_Tblock, n2_Tblock, n3_Tblock);
   }
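The excerpt above indexes a blocking array whose element type is not shown. A minimal sketch of what such a block descriptor could look like is given below; the type name block_t, the padding, and the allocation helper are assumptions for illustration rather than the actual definitions in the downloadable source:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical descriptor for one n1_Tblock x n2_Tblock x n3_Tblock tile.
 * The padding keeps each descriptor on its own 64-byte cache line, matching
 * the statement that the block list is aligned on cache line boundaries. */
typedef struct
{
    int i1_idx;     /* starting index of the block in the first dimension  */
    int i2_idx;     /* starting index of the block in the second dimension */
    int i3_idx;     /* starting index of the block in the third dimension  */
    int pad[13];    /* pad the 12 bytes of indices out to 64 bytes         */
} block_t;

/* Allocate the block list for num_blocks tiles, aligned to a cache line. */
static block_t *alloc_blocking(int num_blocks)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, (size_t) num_blocks * sizeof(block_t)) != 0)
        return NULL;
    return (block_t *) p;
}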

2.2. SIMD Vectorization

Vectorization4 is applied to the innermost loop of the stencil algorithm as follows:

// apply 8th order ISO stencil on block [i1b..i1b+n1_Tb] x [i2b..i2b+n2_Tb] x [i3b..i3b+n3_Tb]
void apply_stencil(double *ptr_next, double *ptr_prev, double *ptr_vel, double *coeff,  // arrays
                   const int i1b,   const int i2b,   const int i3b,                     // block indices
                   const int n1,    const int n2,    const int n3,                      // full domain
                   const int n1_Tb, const int n2_Tb, const int n3_Tb) {                 // block sizes

  const int n1_end = n1-4, n2_end = n2-4, n3_end = n3-4;
  const double c0 = coeff[0], c1 = coeff[1], c2 = coeff[2], c3 = coeff[3], c4 = coeff[4];
  const int n1n2 = n1*n2;
  const int n1_2 = 2*n1, n1n2_2 = 2*n1n2;
  const int n1_3 = 3*n1, n1n2_3 = 3*n1n2;
  const int n1_4 = 4*n1, n1n2_4 = 4*n1n2;

  // MIN is assumed to be defined elsewhere, e.g. #define MIN(a,b) (((a)<(b))?(a):(b))
  const int n3b_end = MIN(i3b+n3_Tb, n3_end);
  const int n2b_end = MIN(i2b+n2_Tb, n2_end);
  const int n1b_end = MIN(i1b+n1_Tb, n1_end);

  for (int i3=i3b; i3<n3b_end; i3++) {
    double *prev = &ptr_prev[i3*n1n2+i2b*n1];
    double *next = &ptr_next[i3*n1n2+i2b*n1];
    double *vel  = &ptr_vel [i3*n1n2+i2b*n1];
    for (int i2=i2b; i2<n2b_end; i2++, prev+=n1, next+=n1, vel+=n1) {

#pragma vector always
#pragma simd
      for (int i1=i1b; i1<n1b_end; i1++) {
        double tmp = c0*  prev[i1]
                   + c1*( prev[i1+1]      + prev[i1-1] +
                          prev[i1+n1]     + prev[i1-n1] +
                          prev[i1+n1n2]   + prev[i1-n1n2])
                   + c2*( prev[i1+2]      + prev[i1-2] +
                          prev[i1+n1_2]   + prev[i1-n1_2] +
                          prev[i1+n1n2_2] + prev[i1-n1n2_2])
                   + c3*( prev[i1+3]      + prev[i1-3] +
                          prev[i1+n1_3]   + prev[i1-n1_3] +
                          prev[i1+n1n2_3] + prev[i1-n1n2_3])
                   + c4*( prev[i1+4]      + prev[i1-4] +
                          prev[i1+n1_4]   + prev[i1-n1_4] +
                          prev[i1+n1n2_4] + prev[i1-n1n2_4]);
        next[i1] = 2.0*prev[i1] - next[i1] + tmp*vel[i1];
      }
    }
  }
}

Note that loop unrolling was attempted using the unroll pragma, but it did not provide any performance benefit.

3. Experimental Results for the 3D Stencil on the Intel® Xeon Phi™ Coprocessor

With the stencil example from Borges4, the goal of tiling is to perform a domain decomposition of the stencil computation with a 3-dimensional tile. The question regarding the tiles is: what should be the values of the three dimension components si, sj, and sk for the tile subscript references2 i, j, and k (e.g., Figure 4)? Leopold2 suggests using rectangular tiles of shape (N-2) × sj × (sk × L/2) for a lattice of order N × N × N, where L is the cache line size, and sj and sk are the blocking factors for the j and k dimensions of the three-dimensional tile.

Using the formula of Borges4 for estimating the tile dimensions for 512KB of L2 cache on the Intel® Xeon Phi™ Coprocessor, but also accounting for double precision (DP) floating point objects for the 3D stencil computation, we have:

si × s × s × ((8 DP objects per cache line) / 2) × (8 bytes per DP object) ≤ 512 KB (L2 cache size)

where si is the unit stride in the i dimension and has a tiling dimension of about 1 KB. In the inequality above, sj and sk have been replaced by the variable s:

s² ≤ 512 KB / ((8 × 8 / 2) × 1 KB)

s² ≤ 16

s ≤ 4

To compute the minimal size2 for tile dimension sk as a function of sj = s ≤ 4, the following equation is used:

sk = (L/2) × sj

where L is the cache line size in terms of double precision word objects:

sk = (8/2) × 4

sk = 16
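As a quick sanity check of this derivation, the small sketch below (illustrative only, not part of the downloadable package) evaluates the left-hand side of the inequality for a range of s values and compares it against the 512 KB L2 cache:

#include <stdio.h>

/* Left-hand side of the tile-sizing inequality above:
 *   si x sj x sk x ((8 DP objects per cache line) / 2) x (8 bytes per DP object)
 * si is taken as the ~1 KB unit-stride tiling dimension used in the text. */
static double tile_footprint_kb(double sj, double sk)
{
    const double si_kb         = 1.0;   /* unit-stride tiling dimension, ~1 KB */
    const double objs_per_line = 8.0;   /* DP objects per 64-byte cache line   */
    const double bytes_per_obj = 8.0;   /* bytes per double precision object   */

    return si_kb * sj * sk * (objs_per_line / 2.0) * bytes_per_obj;
}

int main(void)
{
    const double l2_kb = 512.0;         /* per-core L2 cache size in KB */

    /* With sj = sk = s = 4, the footprint just reaches the 512 KB bound. */
    for (double s = 2.0; s <= 6.0; s += 1.0)
        printf("s = %.0f -> footprint %.0f KB (%s 512 KB L2)\n",
               s, tile_footprint_kb(s, s),
               tile_footprint_kb(s, s) <= l2_kb ? "fits in" : "exceeds");
    return 0;
}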

3.1. Performance Results from the Initial Tiling Algorithm

As mentioned earlier, Borges4 has developed a parameterized stencil application http://software.intel.com/sites/default/files/blog/331716/iso-3dfd-v1.tar.gz which measures floating-point-operations-per-second performance. The experiments which follow were run in native mode on an Intel® Xeon Phi™ coprocessor. The results that you achieve will probably vary from the experimental output reported here. This is because of varying system configurations for the Intel® Xeon Phi™ coprocessor and the version of the Intel® C/C++ compiler that is being used.

We will start out the Intel® Xeon Phi™ coprocessor native experiments by setting the tile parameters heuristically to:

n1_Tblock=928 n2_Tblock=5 n3_Tblock=124

Notice that the value of n3_Tblock, which is equivalent to the tile size value sk, is initially set to 124. The tiling parameters above, when running on 61 cores and using 3 hardware threads per core, give results which look something like:

time:  4.93 sec

throughput:  1365.05 MPoints/s

flops:  45.05 GFlops

Can we improve the performance? To do this we will rely on Intel® VTune™ Amplifier XE.

Algorithmic tuning as described by Cepeda5 for the Intel® Xeon Phi™ coprocessor involves code-restructuring transformations such as adding parallelism, tuning I/O, or choosing more efficient data structures or library routines. Algorithmic optimization generally relies on knowledge of the application’s hotspots and familiarity with the source code, and it aims to improve performance for the application in general.

The general method to follow for performance analysis with Intel® VTune™ Amplifier XE is5:

  1. Select a hotspot (a function with a large percentage of the application’s total CPU cycles).
  2. Evaluate the efficiency of that hotspot using the metrics in Section 4 of https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding.
  3. If inefficient, check each applicable metric in Section 5 of https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding. If a metric’s value is below the suggested threshold, or unacceptable by other standards, use the additional information in the guide https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding to find and fix the problem.
  4. Repeat until all significant hotspots have been evaluated.

As an example, Intel® VTune™ Amplifier XE is capable of collecting hardware events such as VPU_ELEMENTS_ACTIVE and DATA_READ_OR_WRITE, and this allows one to compute a performance metric such as L1 Compute to Data Access Ratio, which is a function of VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE.

The application needs to be recompiled with the “-g” option if the 3D stencil source has not already been compiled with it. Compiling the application with “-g” allows VTune™ to map instrumentation results back to the source. Note that compiling an application with the “-g” compiler option along with optimization does not typically affect execution performance. For this VTune™ experiment, we will collect event data on 17 hardware counters (Figure 5), and the following VTune™ collector options will be used on Linux*:

-collect-with runsa-knc -allow-multiple-runs -knob event-config=${vtevents}

where the Bourne Shell variable “vtevents” is assigned the list of hardware events:

vtevents=CPU_CLK_UNHALTED,INSTRUCTIONS_EXECUTED,VPU_ELEMENTS_ACTIVE,DATA_READ_MISS_OR_WRITE_MISS,L2_DATA_READ_MISS_MEM_FILL,L2_DATA_WRITE_MISS_MEM_FILL,L1_DATA_HIT_INFLIGHT_PF1,DATA_READ_OR_WRITE,EXEC_STAGE_CYCLES,L2_DATA_READ_MISS_CACHE_FILL,L2_DATA_WRITE_MISS_CACHE_FILL,DATA_PAGE_WALK,LONG_DATA_PAGE_WALK,VPU_INSTRUCTIONS_EXECUTED,L2_VICTIM_REQ_WITH_DATA,HWP_L2MISS,SNP_HITM_L2

Note that one can configure the VTune™ Collector GUI on Linux* OS, and export the command-line information referenced above.

Figure 5 – Intel® VTune™ Amplifier XE Summary Panel showing 17 hardware events based on the tiling parameters n1_Tblock=928 n2_Tblock=5 n3_Tblock=124

For this VTune™ experiment, one can reach the VTune™ source panel by clicking on the Top-Down Tree panel so as to correlate the hardware events with the source listing (Figure 6).

Figure 6 – Source Panel with the mouse arrow showing the “Go to Biggest Function Hotspot” button about to be pushed (black arrow with red circle surrounding the arrow)

In essence with the “Go to Biggest Function Hotspot” button, we want to find the corresponding source code fragment that is the main bottleneck in the application and possibly use the tiling dimension parameters to improve execution performance of the 3D stencil algorithm.

Using VTune™ for this referenced 3D stencil application, we will look at the following hardware events from Figure 5:

Event                           Meaning
DATA_READ_OR_WRITE              The number of loads and stores seen by a thread’s L1 data cache.
DATA_READ_MISS_OR_WRITE_MISS    The number of demand loads or stores that miss a thread’s L1 cache.
VPU_INSTRUCTIONS_EXECUTED       The number of VPU instructions executed by the thread.
VPU_ELEMENTS_ACTIVE             The number of vector elements active for a VPU instruction, or the number of vector operations (since each instruction performs multiple vector operations).

From these VTune™ event counters, we can assemble metrics from Cepeda5, which may provide guidance in how to improve application vectorization performance based on tiling:

Metric                           Formula                                              Investigate if:                           Double Precision Metric (n1_Tblock=928 n2_Tblock=5 n3_Tblock=124)
L1 Compute to Data Access Ratio  VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE             < Vectorization Intensity                 6
L2 Compute to Data Access Ratio  VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS   < 100 × L1 Compute to Data Access Ratio   27
Vectorization Intensity          VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED      < 8 (DP), < 16 (SP)                       8

In the table above, “SP” is an acronym for single precision, and “DP” is an acronym for double precision.

The L2 Compute to Data Access Ratio is the result of dividing VPU_ELEMENTS_ACTIVE by DATA_READ_MISS_OR_WRITE_MISS5. Similarly, Vectorization Intensity is defined as the quotient of VPU_ELEMENTS_ACTIVE divided by VPU_INSTRUCTIONS_EXECUTED. The table also has an “Investigate if” column, which gives the threshold below which a metric may indicate a performance problem. If a metric does not meet its threshold, it may be worthwhile to investigate the cause of the bottleneck.

The L2 Compute to Data Access Ratio shows the average number of vectorized operations that occur for each L2 access5. Applications that are able to block data for the L1 cache, or reduce data access in general, will have higher numbers for this ratio. As a baseline, the threshold of 100 times the L1 ratio has been used, meaning there should be roughly 1 L2 data access for every 100 L1 data accesses. Like the L1 metric, it includes all vectorized operations (including data movement) in the numerator.
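Since these metrics are simple ratios of raw event totals, a small helper such as the hedged sketch below can compute them from the counts reported by Intel® VTune™ Amplifier XE and flag any that fall below the double precision “Investigate if” thresholds. The struct, field names, and sample counts are invented for illustration:

#include <stdio.h>

/* Raw per-run hardware event totals as reported by the VTune collector;
 * the struct name and fields are illustrative. */
typedef struct {
    double vpu_elements_active;
    double vpu_instructions_executed;
    double data_read_or_write;
    double data_read_miss_or_write_miss;
} stencil_events;

static void report_metrics(const stencil_events *e)
{
    const double l1_ratio  = e->vpu_elements_active / e->data_read_or_write;
    const double l2_ratio  = e->vpu_elements_active / e->data_read_miss_or_write_miss;
    const double intensity = e->vpu_elements_active / e->vpu_instructions_executed;

    printf("Vectorization Intensity        : %6.2f %s\n", intensity,
           intensity < 8.0 ? "(investigate: < 8 for DP)" : "");
    printf("L1 Compute to Data Access Ratio: %6.2f %s\n", l1_ratio,
           l1_ratio < intensity ? "(investigate: < Vectorization Intensity)" : "");
    printf("L2 Compute to Data Access Ratio: %6.2f %s\n", l2_ratio,
           l2_ratio < 100.0 * l1_ratio ? "(investigate: < 100 x L1 ratio)" : "");
}

int main(void)
{
    /* Placeholder event counts; substitute the totals from a real collection. */
    stencil_events e = { 8.0e10, 1.0e10, 1.2e10, 3.0e9 };
    report_metrics(&e);
    return 0;
}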

3.2. Performance Improvements from Tile Size Adjustments

For native mode on the Intel® Xeon Phi™ coprocessor, we can adjust the tiling parameters to, say, the following set of values:

n1_Tblock=928 n2_Tblock=5 n3_Tblock=16

The tiling dimension value of n3_Tblock=16 matches the theoretical value that was computed at the beginning of Section 3 above for double precision 3D stencil computation.

If we rerun the VTune™ experiments, the metrics change as follows:

Metric                           Formula                                              Investigate if:                           Double Precision Metric (n1_Tblock=928 n2_Tblock=5 n3_Tblock=16)
L1 Compute to Data Access Ratio  VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE             < Vectorization Intensity                 7
L2 Compute to Data Access Ratio  VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS   < 100 × L1 Compute to Data Access Ratio   27
Vectorization Intensity          VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED      < 8 (DP), < 16 (SP)                       8

We obtain stencil computation results that look something like the following with 61 core threads and 3 hardware threads per core:

time:  4.49 sec

throughput:  1501.28 MPoints/s

flops:  49.54 GFlops

The L1 Compute to Data Access Ratio changed from 6 to 7, which is an improvement over the original run, although it is still below the vectorization intensity of 8. The GFlops value also improved with the revised tiling parameters. However, the L2 Compute to Data Access Ratio did not increase for the new tile settings:

n1_Tblock=928 n2_Tblock=5 n3_Tblock=16

We can proceed to further adjust the tiling parameters to, say, a new set of values:

n1_Tblock=928 n2_Tblock=5 n3_Tblock=5

If we rerun the VTune™ experiments, the following metrics are generated; note that the L2 Compute to Data Access Ratio has increased, which is what we want:

Metric                           Formula                                              Investigate if:                           Double Precision Metric (n1_Tblock=928 n2_Tblock=5 n3_Tblock=5)
L1 Compute to Data Access Ratio  VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE             < Vectorization Intensity                 7
L2 Compute to Data Access Ratio  VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS   < 100 × L1 Compute to Data Access Ratio   28
Vectorization Intensity          VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED      < 8 (DP), < 16 (SP)                       8

Using 61 core threads and 3 hardware threads per core, we also get stencil computation results that look something like:

time:  3.83 sec

throughput:  1758.12 MPoints/s

flops:  58.02 GFlops

Note in the execution output above that we have improved the GFlops value for this third run of the 3D stencil algorithm.

3.3. Performance Improvements from Loop Interchange and Tile Size Adjustments

If we revisit the OpenMP loop organization from the original Jacobi 3D stencil algorithm from Borges4:

for(int it=1; it<nreps; it++) { // Start of timing loop

                    …

#pragma omp parallel for num_threads(num_threads) schedule(dynamic)       \
   firstprivate (n1, n2, n3, num_blocks, n1_Tblock, n2_Tblock, n3_Tblock) \
   shared( coeff, prev, next, vel, blocking)
   for (int i=0; i<num_blocks; i++) {
     int i1b = blocking[i].i1_idx;
     int i2b = blocking[i].i2_idx;
     int i3b = blocking[i].i3_idx;
     apply_stencil(next, prev, vel, coeff, i1b, i2b, i3b, n1, n2, n3, n1_Tblock, n2_Tblock, n3_Tblock);
   }

                    …

} // End of timing loop

notice that the timing loop encapsulates the OpenMP parallel region. For most parallel regions6, the best parallelization performance occurs when the number of fork and join operations is minimized, which happens when the OpenMP parallel region is at the outermost nest level. For the above code fragment, suppose that loop interchange6 were applied as part of the loop optimization process so that the “it” loop is executed within each fork operation; that is, the timing loop would be placed within the “apply_stencil” function above. From a theoretical perspective, restructuring the loop nesting so that the timing loop is part of the computation may improve data reuse in the time dimension7. Applying such a restructuring transformation to the original 3-dimensional stencil algorithm and using the following tiling values for the 3D stencil executable:

n1_Tblock=928 n2_Tblock=7 n3_Tblock=1

yields performance values that look something like:

time:  3.60 sec

throughput:  1871.31 MPoints/s

flops:  61.75 GFlops

using 61 core threads and 3 hardware threads per core. Note that the lattice data structure for the stencil is array expanded by one dimension to account for the time component8; the extent of this added dimension is 2. A sketch of the restructured loop organization is shown below.
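The sketch is illustrative only: apply_stencil_block, the single lattice array with an extent-2 time dimension, and the parity indexing are assumptions, and any block-boundary coordination required between time steps is omitted; the actual code is in the loop-interchange sources described in Section 4.2.

/* Illustrative sketch only: the parallel region forks once, and each thread
 * iterates the nreps time steps over its assigned blocks.  The lattice is
 * assumed to be a single array expanded with a time dimension of extent 2,
 * and the prev/next roles alternate with the parity of the time step. */
#pragma omp parallel for num_threads(num_threads) schedule(dynamic)       \
   firstprivate (n1, n2, n3, num_blocks, n1_Tblock, n2_Tblock, n3_Tblock, nreps) \
   shared( coeff, lattice, vel, blocking)
   for (int i=0; i<num_blocks; i++) {
     int i1b = blocking[i].i1_idx;
     int i2b = blocking[i].i2_idx;
     int i3b = blocking[i].i3_idx;
     for (int it=1; it<nreps; it++) {              // timing loop now inside the fork
       int t_prev = (it - 1) & 1;                  // slab holding the previous time step
       int t_next = it & 1;                        // slab receiving the updated values
       apply_stencil_block(&lattice[t_next*n1*n2*n3],   // "next" slab
                           &lattice[t_prev*n1*n2*n3],   // "prev" slab
                           vel, coeff, i1b, i2b, i3b,
                           n1, n2, n3, n1_Tblock, n2_Tblock, n3_Tblock);
     }
   }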

Feel free to gather VTune™ events and formulate the appropriate metrics as described in sections 3.1 and 3.2 for this loop-interchange version of the stencil.

4. User Download

The application source for the 3D finite difference stencil algorithm for single and double precision computation can be downloaded from the URL:

http://software.intel.com/sites/default/files/blog/331716/iso-3dfd-v1.tar.gz

The package consists of a directory called “iso-3dfd_V2” that contains the source for the original implementation of the 3D stencil algorithm and associated source for the loop-interchanged version. An Intel® C++ Compiler is required that supports the -mmic command-line option.

4.1. Working with the Original 3D Finite Difference Stencil Source

The makefile that is used for building the original 3D stencil is called “Makefile_orig”. Table 1 provides information for building the 3D stencils in either single or double precision. The scripts ./float_orig_mic_native.sh and ./double_orig_mic_native.sh exemplify how to run the sample code in either single precision or double precision mode respectively, with KMP_AFFINITY="granularity=thread,balanced" and 4 threads per core using 61 cores natively on the Intel® Xeon Phi™ coprocessor. The Shell scripts ./float_orig_offload.sh and ./double_orig_offload.sh run the sample code in respectively single precision and double precision offload mode on the Intel® Xeon Phi™ coprocessor:

Single Precision Native
    Clean directory:  make -f Makefile_orig clean arch=mic precision=float
    Make executable:  make -f Makefile_orig arch=mic precision=float
    Shell script:     float_orig_mic_native.sh

Single Precision Offload
    Clean directory:  make -f Makefile_orig clean arch=offload precision=float
    Make executable:  make -f Makefile_orig arch=offload precision=float
    Shell script:     float_orig_offload.sh

Double Precision Native
    Clean directory:  make -f Makefile_orig clean arch=mic precision=double
    Make executable:  make -f Makefile_orig arch=mic precision=double
    Shell script:     double_orig_mic_native.sh

Double Precision Offload
    Clean directory:  make -f Makefile_orig clean arch=offload precision=double
    Make executable:  make -f Makefile_orig arch=offload precision=double
    Shell script:     double_orig_offload.sh

Table 1– Makefile commands and Shell scripts for the original 3D stencil algorithm

4.2. Working with the Loop Interchanged 3D Finite Difference Stencil Source

The makefile that is used for building the executables for the loop-interchanged version of the 3D stencil is called “Makefile_loop_interchange”. Table 2 provides information for building the 3D stencils in either single or double precision. The scripts ./float_loop_interchange_mic_native.sh and ./double_loop_interchange_mic_native.sh exemplify how to run the sample code natively in single precision or double precision mode, respectively, on the Intel® Xeon Phi™ coprocessor. The Shell scripts ./float_loop_interchange_offload.sh and ./double_loop_interchange_offload.sh run the sample code in single precision and double precision offload mode, respectively, on the Intel® Xeon Phi™ coprocessor:

Single Precision Native
    Clean directory:  make -f Makefile_loop_interchange clean arch=mic precision=float
    Make executable:  make -f Makefile_loop_interchange arch=mic precision=float
    Shell script:     float_loop_interchange_mic_native.sh

Single Precision Offload
    Clean directory:  make -f Makefile_loop_interchange clean arch=offload precision=float
    Make executable:  make -f Makefile_loop_interchange arch=offload precision=float
    Shell script:     float_loop_interchange_offload.sh

Double Precision Native
    Clean directory:  make -f Makefile_loop_interchange clean arch=mic precision=double
    Make executable:  make -f Makefile_loop_interchange arch=mic precision=double
    Shell script:     double_loop_interchange_mic_native.sh

Double Precision Offload
    Clean directory:  make -f Makefile_loop_interchange clean arch=offload precision=double
    Make executable:  make -f Makefile_loop_interchange arch=offload precision=double
    Shell script:     double_loop_interchange_offload.sh

Table 2– Makefile commands and Shell scripts for the loop interchanged 3D stencil algorithm

5. References

  1. G. Rivera, C. W. Tseng, “Tiling Optimizations for 3D Scientific Computations”, Proceedings of 2000 IEEE/ACM Conference on Supercomputing, November 2000, pp. 1-23.
  2. C. Leopold, “Tight Bounds on Capacity Misses for 3D Stencil Codes”, Proceedings of the International Conference on Computational Science, Amsterdam, April 2002, pp. 843-852.
  3. K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, “Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors”, SIAM Review, Vol. 51, No. 1, 2009, pp. 129-159.
  4. http://software.intel.com/en-us/blogs/2012/10/26/experiences-in-developing-seismic-imaging-code-for-intel-xeon-phi-coprocessor
  5. https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding
  6. M. Wolfe, “More Iteration Space Tiling”, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, November 1989, pp. 655-664.
  7. S. M. Faizur Rahman, Q. Yi, A. Qasem, “Understanding Stencil Code Performance on Multicore Architectures”, Proceedings of the 8th ACM International Conference on Computing Frontiers, May 2011, p. 30.
  8. S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yelick, “The Potential of the Cell Processor for Scientific Computing”, Proceedings of the 3rd ACM International Conference on Computing Frontiers, May 2006, pp. 9-20.

*Other brands and names are the property of their respective owners

