Optimizing and Running ISO 3DFD with Support for Intel® Xeon Phi™ Coprocessor
Introduction
Finite difference is a simple and efficient mathematical tool that helps solve differential equations. In seismic applications, wave propagation is governed by this kind of equation.
Propagating a seismic wave is a CPU-intensive task. In this article, we explain how to implement and optimize a three-dimensional isotropic finite-difference kernel to run on the Intel® Xeon® Processor v2 family and the Intel® Xeon Phi™ Coprocessor.
This approach has the following advantages:
- Implementation is straightforward (no transformation to frequency domain).
- Precision can be easily adapted by changing the size of the stencil.
Wave Propagation Overview
The isotropic acoustic wave equation is:

$$\frac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p$$

where $\nabla^2 p$ is the Laplacian of the pressure field $p$ and $c$ is the velocity of the medium.

Developing this equation gives:

$$\frac{\partial^2 p}{\partial t^2} = c^2 \left( \frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} + \frac{\partial^2 p}{\partial z^2} \right)$$

Then using finite differences (2nd order in time) gives:

$$\frac{p_{t+\Delta t} - 2p_t + p_{t-\Delta t}}{\Delta t^2} = c^2\, \nabla^2_{FD}\, p_t$$

where $\nabla^2_{FD}$ is the finite-difference approximation of the Laplacian; along each axis, for example $x$:

$$\frac{\partial^2 p}{\partial x^2} \approx \frac{1}{\Delta x^2} \left( c_0\, p_t(x) + \sum_{i=1}^{n} c_i \big( p_t(x + i\Delta x) + p_t(x - i\Delta x) \big) \right)$$

Here $c_0$ to $c_n$ are the coefficients of the finite differences.
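For illustration, here is the standard set of central-difference coefficients for an 8th-order approximation of the second derivative (HALF_LENGTH = 4); the 16th-order set used in this article is built the same way. This listing is our illustration, not code from the package:

```c
// Standard 8th-order central-difference coefficients for a second
// derivative (HALF_LENGTH = 4), before the division by dx^2 discussed below.
static const float coeff_8th_order[5] = {
    -205.0f /  72.0f,   // c0
       8.0f /   5.0f,   // c1
      -1.0f /   5.0f,   // c2
       8.0f / 315.0f,   // c3
      -1.0f / 560.0f    // c4
};
```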
Then, the pressure at the next time step can be expressed from the pressure at the current and previous time steps:

$$p_{t+\Delta t} = 2p_t - p_{t-\Delta t} + c^2 \Delta t^2\, \nabla^2_{FD}\, p_t$$
Implementation and Optimizations
Now let’s look at implementations and optimizations for the Intel Xeon Phi Coprocessor.
Dev00 – implementation
Now that we’ve introduced the formulation with the finite difference scheme, implementation is straightforward. We will implement the propagation with a 16th order in space and 2nd order in time.
Finite-difference codes are organized around a stencil. The stencil represents all the neighbor cells needed to update a single cell. An 8th order stencil is shown in Figure 1.
The finite differences are computed in the three dimensions as shown in Figure 2:
Figure 2. 3-dimensional Stencil Computation
Before going into the code, some optimizations can already be realized:
- Because $\Delta t^2$ is always multiplied by $c^2$, and because $c[x,y,z]$ is constant over the time steps, we can directly pre-compute $vel[x,y,z] = c^2[x,y,z]\,\Delta t^2$ (see the sketch after this list).
- In the finite difference in space computation, the sampling is the same in x, y, and z. As $\Delta x^2 = \Delta y^2 = \Delta z^2$, we can simply divide the finite-difference coefficients by $\Delta x^2$ once and for all.
- The equation indicates that we need three arrays to store the pressure at times $t-\Delta t$, $t$, and $t+\Delta t$. Actually, only two arrays are really needed: Figure 3 illustrates the three-array process and shows that $p_{t+\Delta t}$ requires only one cell of $p_{t-\Delta t}$, so those two arrays can be merged.
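A minimal sketch of the first two pre-computations (the variable names `dt`, `dx`, and `c` are illustrative, not taken from the package):

```c
// Fold c^2 * dt^2 into a single array read by the kernel...
for (size_t i = 0; i < (size_t)n1 * n2 * n3; i++)
    vel[i] = c[i] * c[i] * dt * dt;   // vel[x,y,z] = c^2[x,y,z] * dt^2

// ...and divide the space coefficients by dx^2 once for all
// (the sampling is identical in x, y, and z).
for (int ir = 0; ir <= HALF_LENGTH; ir++)
    coeff[ir] /= (dx * dx);
```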
Let’s start by implementing the kernel that solves the equation:
```c
void iso_3dfd_it(float *ptr_next, float *ptr_prev, float *ptr_vel, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2; //This value will be used later
    #pragma omp parallel for OMP_SCHEDULE OMP_N_THREADS collapse(2) default(shared)
    for(int iz=0; iz<n3; iz++) {
        for(int iy=0; iy<n2; iy++) {
            for(int ix=0; ix<n1; ix++) {
                if( ix>=HALF_LENGTH && ix<(n1-HALF_LENGTH) &&
                    iy>=HALF_LENGTH && iy<(n2-HALF_LENGTH) &&
                    iz>=HALF_LENGTH && iz<(n3-HALF_LENGTH) ) {
                    int offset = iz*dimn1n2 + iy*n1 + ix;
                    float value = 0.0;
                    value += ptr_prev[offset]*coeff[0];
                    for(int ir=1; ir<=HALF_LENGTH; ir++) {
                        value += coeff[ir] * (ptr_prev[offset + ir]         + ptr_prev[offset - ir]);         // x
                        value += coeff[ir] * (ptr_prev[offset + ir*n1]      + ptr_prev[offset - ir*n1]);      // y
                        value += coeff[ir] * (ptr_prev[offset + ir*dimn1n2] + ptr_prev[offset - ir*dimn1n2]); // z
                    }
                    ptr_next[offset] = 2.0f*ptr_prev[offset] - ptr_next[offset] + value*ptr_vel[offset];
                }//end if
            }}}//3 for loops
}//end function
```
Here, HALF_LENGTH represents half the size of our stencil (8). We need to check that we are not computing the finite differences at the borders of the array, to avoid segmentation faults. Note that the function prototype contains parameters that are not used yet; those parameters will be used later for some specific optimizations.
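The macros OMP_SCHEDULE and OMP_N_THREADS used in the pragma come from the build configuration and are not shown in this article. Plausible definitions (an assumption on our part, not the package's actual settings) would be:

```c
// Assumed definitions; the package may choose a different schedule.
#define OMP_SCHEDULE  schedule(dynamic)
#define OMP_N_THREADS num_threads(num_threads)
```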
As we are working with two arrays, we need to switch those two arrays after one iteration. This is done here:
```c
void iso_3dfd(float *ptr_next, float *ptr_prev, float *ptr_vel, float *coeff,
              const int n1, const int n2, const int n3, const int num_threads,
              const int nreps, const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    for(int it=0; it<nreps; it+=2){
        iso_3dfd_it(ptr_next, ptr_prev, ptr_vel, coeff,
                    n1, n2, n3, num_threads, n1_Tblock, n2_Tblock, n3_Tblock);
        // here's where boundary conditions + halo exchanges happen

        // Swap previous & next between iterations
        iso_3dfd_it(ptr_prev, ptr_next, ptr_vel, coeff,
                    n1, n2, n3, num_threads, n1_Tblock, n2_Tblock, n3_Tblock);
    } // time loop
}
```
We won’t describe the main file in detail, because nothing critically important happens there except the array allocation.
In order to benefit from aligned loads on Intel® Xeon Phi™ Coprocessor, data should be aligned on the first cells that are updated. Because of the border checking condition, we actually start updating the array with a padding of HALF_LENGTH. A good practice here is to work with aligned data directly from this padding.
To guarantee that each new row is also aligned starting from the HALF_LENGTH padding, we impose that the x dimension be a multiple of 16 (the number of floats in an Intel Xeon Phi Coprocessor SIMD register). The data allocation is shown here:
```c
size_t nsize = p.n1*p.n2*p.n3;
float *prev_base = (float*)_mm_malloc( (nsize+16+MASK_ALLOC_OFFSET(0 ))*sizeof(float), 64);
float *next_base = (float*)_mm_malloc( (nsize+16+MASK_ALLOC_OFFSET(16))*sizeof(float), 64);
float *vel_base  = (float*)_mm_malloc( (nsize+16+MASK_ALLOC_OFFSET(32))*sizeof(float), 64);

if( prev_base==NULL || next_base==NULL || vel_base==NULL ){
    printf("couldn't allocate CPU memory prev_base=%p next_base=%p vel_base=%p\n",
           prev_base, next_base, vel_base);
    printf(" TEST FAILED!\n"); fflush(NULL);
    exit(-1);
}

// Align working pointers on the HALF_LENGTH padding
p.prev = &prev_base[16 - HALF_LENGTH + MASK_ALLOC_OFFSET(0 )];
p.next = &next_base[16 - HALF_LENGTH + MASK_ALLOC_OFFSET(16)];
p.vel  = &vel_base [16 - HALF_LENGTH + MASK_ALLOC_OFFSET(32)];
```
This part of the code deserves some comments. Because we want the data to be aligned on the HALF_LENGTH padding, we shift the pointer so that ptr[HALF_LENGTH] is 64-byte aligned (64 bytes = 16 floats). To do this, we add the alignment factor (16) and subtract HALF_LENGTH.
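As a quick sanity check (our addition, not part of the original sources), the guarantee can be asserted once the pointers have been shifted:

```c
#include <assert.h>
#include <stdint.h>

// Illustrative check: ptr[HALF_LENGTH] must fall on a 64-byte boundary
// after the 16 - HALF_LENGTH shift has been applied.
static void check_halo_alignment(const float *ptr) {
    assert(((uintptr_t)(ptr + HALF_LENGTH) % 64) == 0);
}
```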
The macro MASK_ALLOC_OFFSET is used to add another padding that is a multiple of our alignment factor (16 floats, or 64 bytes). This technique reduces cache evictions that would occur if identical allocations gave the three arrays the same cache address translations.
If we run this kernel with standard sizes, the performance results are as shown in Table 1.
Table 1. Dev00 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 78 | 78 |
| MPoints/s | 41.12 | 369.51 |
| GFlop/s | 3.2 | 28.8 |
Dev01 – Removing the condition
The first thing we can do here is to remove the if statement that checks whether we are processing data within the array boundaries. This can easily be done by modifying the ranges of the three for loops.
```c
void iso_3dfd_it(float *ptr_next, float *ptr_prev, float *ptr_vel, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2;
    /* We store the initial addresses of the pointers for direct access */
    float *ptr_init_next = ptr_next;
    float *ptr_init_prev = ptr_prev;
    float *ptr_init_vel  = ptr_vel;
    // To avoid the condition in the inner loop, we start at HALF_LENGTH
    // and stop at dim-HALF_LENGTH
    #pragma omp parallel for OMP_SCHEDULE OMP_N_THREADS collapse(2) default(shared)
    for(int iz=HALF_LENGTH; iz<n3-HALF_LENGTH; iz++) {
        for(int iy=HALF_LENGTH; iy<n2-HALF_LENGTH; iy++) {
            for(int ix=HALF_LENGTH; ix<n1-HALF_LENGTH; ix++) {
                int offset = iz*dimn1n2 + iy*n1 + ix;
                float value = 0.0;
                value += ptr_prev[offset]*coeff[0];
                for(int ir=1; ir<=HALF_LENGTH; ir++) {
                    value += coeff[ir] * (ptr_prev[offset + ir]         + ptr_prev[offset - ir]);
                    value += coeff[ir] * (ptr_prev[offset + ir*n1]      + ptr_prev[offset - ir*n1]);
                    value += coeff[ir] * (ptr_prev[offset + ir*dimn1n2] + ptr_prev[offset - ir*dimn1n2]);
                }
                ptr_next[offset] = 2.0f*ptr_prev[offset] - ptr_next[offset] + value*ptr_vel[offset];
            }}}
}
```
Results for this version are shown in Table 2.

Table 2. Dev01 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 78 | 78 |
| MPoints/s | 58.69 | 378.54 |
| GFlop/s | 4.6 | 29.5 |
These numbers were obtained with the following parameters:
- NX = 256
- NY = 256
- NZ = 256
- We process 100 iterations
- We run on 24 threads on the host processor and 244 threads on the Intel® Xeon Phi™ Coprocessor
Dev02 – Cache blocking
Instead of processing the array cell by cell from the beginning to the end of the allocated area, we create blocks to improve the data locality. Implementing cache blocking can be done by adding three for loops over the three existing for loops. The drawback is that we need to use three new parameters for specifying the size of the blocks in each of the X, Y, and Z dimensions. Setting those new parameters can be difficult. Usually, what works well is to block over the Y and Z dimensions, but not on the X dimension.
```c
void iso_3dfd_it(float *ptr_next_base, float *ptr_prev_base, float *ptr_vel_base, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2;
    int n3End = n3 - HALF_LENGTH;
    int n2End = n2 - HALF_LENGTH;
    int n1End = n1 - HALF_LENGTH;
    #pragma omp parallel for OMP_SCHEDULE OMP_N_THREADS collapse(3) default(shared)
    for(int bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
        for(int by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
            for(int bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
                int izEnd = MIN(bz+n3_Tblock, n3End);
                int iyEnd = MIN(by+n2_Tblock, n2End);
                int ixEnd = MIN(n1_Tblock, n1End-bx);
                for(int iz=bz; iz<izEnd; iz++) {
                    for(int iy=by; iy<iyEnd; iy++) {
                        float* ptr_next = ptr_next_base + iz*dimn1n2 + iy*n1 + bx;
                        float* ptr_prev = ptr_prev_base + iz*dimn1n2 + iy*n1 + bx;
                        float* ptr_vel  = ptr_vel_base  + iz*dimn1n2 + iy*n1 + bx;
                        for(int ix=0; ix<ixEnd; ix++) {
                            float value = 0.0;
                            value += ptr_prev[ix]*coeff[0];
                            for(int ir=1; ir<=HALF_LENGTH; ir++) {
                                value += coeff[ir] * (ptr_prev[ix + ir]         + ptr_prev[ix - ir]);
                                value += coeff[ir] * (ptr_prev[ix + ir*n1]      + ptr_prev[ix - ir*n1]);
                                value += coeff[ir] * (ptr_prev[ix + ir*dimn1n2] + ptr_prev[ix - ir*dimn1n2]);
                            }
                            ptr_next[ix] = 2.0f*ptr_prev[ix] - ptr_next[ix] + value*ptr_vel[ix];
                        }
                    }}//end of inner iterations
            }}}//end of cache blocking
}
```
The performance for this version is as shown in Table 3.
Table 3. Dev02 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 78 | 78 |
| MPoints/s | 508.8 | 418.2 |
| GFlop/s | 39.7 | 32.4 |
Here, the performance improvement on the Intel Xeon Phi Coprocessor is partly due to the fact that the compiler was also able to vectorize the code. Also note that in this implementation, we chose to remap the x dimension so that the ix index starts at 0 and ends at ixEnd-1.
Dev03 – Separation of the omp clauses
Instead of using the single combined directive

```c
#pragma omp parallel for OMP_SCHEDULE OMP_N_THREADS collapse(3) default(shared)
```

we can use two directives: one for the parallel section and the other for the for loops. Doing this allows us to declare variables private to the OpenMP threads, and the same variables can be re-used from one iteration to the next.

Here we won’t see a big performance difference between the two configurations because, right now, our iterations handle few variables. But with many more variables, the benefit can be significant.
```c
void iso_3dfd_it(float *ptr_next_base, float *ptr_prev_base, float *ptr_vel_base, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2;
    int n3End = n3 - HALF_LENGTH;
    int n2End = n2 - HALF_LENGTH;
    int n1End = n1 - HALF_LENGTH;
    #pragma omp parallel OMP_N_THREADS default(shared)
    {
        float* ptr_next;
        float* ptr_prev;
        float* ptr_vel;
        float value;
        int izEnd;
        int iyEnd;
        int ixEnd;
        #pragma omp for OMP_SCHEDULE collapse(3)
        for(int bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
            for(int by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
                for(int bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
                    izEnd = MIN(bz+n3_Tblock, n3End);
                    iyEnd = MIN(by+n2_Tblock, n2End);
                    ixEnd = MIN(n1_Tblock, n1End-bx);
                    for(int iz=bz; iz<izEnd; iz++) {
                        for(int iy=by; iy<iyEnd; iy++) {
                            ptr_next = &ptr_next_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_prev = &ptr_prev_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_vel  = &ptr_vel_base [iz*dimn1n2 + iy*n1 + bx];
                            for(int ix=0; ix<ixEnd; ix++) {
                                value = 0.0;
                                value += ptr_prev[ix]*coeff[0];
                                for(int ir=1; ir<=HALF_LENGTH; ir++) {
                                    value += coeff[ir] * (ptr_prev[ix + ir]         + ptr_prev[ix - ir]);
                                    value += coeff[ir] * (ptr_prev[ix + ir*n1]      + ptr_prev[ix - ir*n1]);
                                    value += coeff[ir] * (ptr_prev[ix + ir*dimn1n2] + ptr_prev[ix - ir*dimn1n2]);
                                }
                                ptr_next[ix] = 2.0f*ptr_prev[ix] - ptr_next[ix] + value*ptr_vel[ix];
                            }//end x
                        }}//end y and z
                }}}//end cache blocking
    }//end parallel
}
```
The performance for this version is as shown in Table 4.
Table 4. Dev03 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 78 | 78 |
| MPoints/s | 520.4 | 418.8 |
| GFlop/s | 40.6 | 32.7 |
Dev04 – Adding pragmas
Here, our kernel is compiled with the option -fno-alias, which tells the compiler that pointers are not aliased. If you don’t want to compile with this option, you can specify locally, for a specific for loop, that the pointers are not aliased.
Indicating that pointers are not aliased allows the compiler to vectorize the code. The compiler then uses its own heuristics to decide whether vectorization will bring more performance.
To help the compiler vectorize the code, you can add different pragmas:

- `#pragma ivdep` indicates that your pointers are not aliased, but the compiler will still decide whether vectorization is worthwhile.
- `#pragma simd` forces the vectorization. In some conditions it can lead to segmentation faults.

Other pragmas can also help vectorization. For example, with `#pragma unroll`, the compiler increments your for loop index by 2, 3, or more instead of by 1, and hard-codes the corresponding 2, 3, or more iterations at each step, as sketched below.
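For instance, unrolling the coefficient loop by a factor of two is roughly equivalent to hand-writing the following (our illustration, shown for the x-direction terms only and assuming HALF_LENGTH is even):

```c
// Hand-written equivalent of #pragma unroll(2) on the coefficient loop.
for (int ir = 1; ir <= HALF_LENGTH; ir += 2) {
    value += coeff[ir]     * (ptr_prev[ix + ir]     + ptr_prev[ix - ir]);
    value += coeff[ir + 1] * (ptr_prev[ix + ir + 1] + ptr_prev[ix - ir - 1]);
}
```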
```c
void iso_3dfd_it(float *ptr_next_base, float *ptr_prev_base, float *ptr_vel_base, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2;
    int n3End = n3 - HALF_LENGTH;
    int n2End = n2 - HALF_LENGTH;
    int n1End = n1 - HALF_LENGTH;
    #pragma omp parallel OMP_N_THREADS default(shared)
    {
        float* ptr_next;
        float* ptr_prev;
        float* ptr_vel;
        float value;
        int izEnd;
        int iyEnd;
        int ixEnd;
        #pragma omp for OMP_SCHEDULE collapse(3)
        for(int bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
            for(int by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
                for(int bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
                    izEnd = MIN(bz+n3_Tblock, n3End);
                    iyEnd = MIN(by+n2_Tblock, n2End);
                    ixEnd = MIN(n1_Tblock, n1End-bx);
                    for(int iz=bz; iz<izEnd; iz++) {
                        for(int iy=by; iy<iyEnd; iy++) {
                            ptr_next = &ptr_next_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_prev = &ptr_prev_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_vel  = &ptr_vel_base [iz*dimn1n2 + iy*n1 + bx];
                            #pragma ivdep
                            for(int ix=0; ix<ixEnd; ix++) {
                                value = 0.0;
                                value += ptr_prev[ix]*coeff[0];
                                #if defined(MODEL_CPU)
                                #pragma unroll(16)
                                #endif
                                #pragma ivdep
                                for(int ir=1; ir<=HALF_LENGTH; ir++) {
                                    value += coeff[ir] * (ptr_prev[ix + ir]         + ptr_prev[ix - ir]);
                                    value += coeff[ir] * (ptr_prev[ix + ir*n1]      + ptr_prev[ix - ir*n1]);
                                    value += coeff[ir] * (ptr_prev[ix + ir*dimn1n2] + ptr_prev[ix - ir*dimn1n2]);
                                }
                                ptr_next[ix] = 2.0f*ptr_prev[ix] - ptr_next[ix] + value*ptr_vel[ix];
                            }//end x
                        }}//end y and z
                }}}//end cache blocking
    }//end parallel
}//end function
```
Performance results for this version are shown in Table 5.
Table 5. Dev04 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 78 | 78 |
| MPoints/s | 515.6 | 439.8 |
| GFlop/s | 40.2 | 34.3 |
Dev05 – __assume_aligned and manual unrolling
We’ve seen in the dev00 version that the data are aligned with a padding of HALF_LENGTH. This information can be repeated later in the code so that the compiler can perform more optimizations.
On the Intel Xeon Phi Coprocessor, unaligned loads carry a significant performance penalty. Because the files are compiled separately, the compiler has no way to know that the kernel uses aligned data. To help it, developers can add the __assume_aligned directive. This directive takes two parameters:
- A pointer
- The alignment factor
Because dimension(x) % 16 == 0 and array[0][0][HALF_LENGTH] is 64-byte aligned, we can guarantee that, whatever Z and Y are, array[Z][Y][HALF_LENGTH] will always be aligned.
Also, the innermost loop (the one that computes the finite differences for each coefficient) can be unrolled manually. Completely unrolling this loop can also help vectorization. Manual unrolling is easy and worthwhile here, because there are few iterations (eight in our case). To make the code more readable, we chose to use a macro that adds two symmetric points of the stencil:
```c
#define FINITE_ADD(ix, off) ((ptr_prev[ix + off] + ptr_prev[ix - off]))
```
We also pre-compute all the offsets required for accessing data in the Y and Z dimensions. Those computations are done on each thread, between the `#pragma omp parallel` and `#pragma omp for` directives.
```c
void iso_3dfd_it(float *ptr_next_base, float *ptr_prev_base, float *ptr_vel_base, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2;
    int n3End = n3 - HALF_LENGTH;
    int n2End = n2 - HALF_LENGTH;
    int n1End = n1 - HALF_LENGTH;
    #pragma omp parallel OMP_N_THREADS default(shared)
    {
        float* ptr_next;
        float* ptr_prev;
        float* ptr_vel;
        float value;
        int izEnd;
        int iyEnd;
        int ixEnd;
        const int vertical_1 = n1,     vertical_2 = n1*2,     vertical_3 = n1*3,     vertical_4 = n1*4;
        const int front_1 = dimn1n2,   front_2 = dimn1n2*2,   front_3 = dimn1n2*3,   front_4 = dimn1n2*4;
        const float c0=coeff[0], c1=coeff[1], c2=coeff[2], c3=coeff[3], c4=coeff[4];
        // At this point, we must handle the possible stencil sizes.
        #if ( HALF_LENGTH == 8 )
        const int vertical_5 = n1*5,   vertical_6 = n1*6,     vertical_7 = n1*7,     vertical_8 = n1*8;
        const int front_5 = dimn1n2*5, front_6 = dimn1n2*6,   front_7 = dimn1n2*7,   front_8 = dimn1n2*8;
        const float c5=coeff[5], c6=coeff[6], c7=coeff[7], c8=coeff[8];
        #endif
        __assume_aligned((void*)vertical_1, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_2, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_3, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_4, CACHELINE_BYTES);
        __assume_aligned((void*)front_1, CACHELINE_BYTES);
        __assume_aligned((void*)front_2, CACHELINE_BYTES);
        __assume_aligned((void*)front_3, CACHELINE_BYTES);
        __assume_aligned((void*)front_4, CACHELINE_BYTES);
        // Handle all stencil sizes
        #if ( HALF_LENGTH == 8 )
        __assume_aligned((void*)vertical_5, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_6, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_7, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_8, CACHELINE_BYTES);
        __assume_aligned((void*)front_5, CACHELINE_BYTES);
        __assume_aligned((void*)front_6, CACHELINE_BYTES);
        __assume_aligned((void*)front_7, CACHELINE_BYTES);
        __assume_aligned((void*)front_8, CACHELINE_BYTES);
        #endif
        __declspec(align(CACHELINE_BYTES)) float div[n1_Tblock];
        #pragma omp for OMP_SCHEDULE collapse(3)
        for(int bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
            for(int by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
                for(int bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
                    izEnd = MIN(bz+n3_Tblock, n3End);
                    iyEnd = MIN(by+n2_Tblock, n2End);
                    ixEnd = MIN(n1_Tblock, n1End-bx);
                    for(int iz=bz; iz<izEnd; iz++) {
                        for(int iy=by; iy<iyEnd; iy++) {
                            ptr_next = &ptr_next_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_prev = &ptr_prev_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_vel  = &ptr_vel_base [iz*dimn1n2 + iy*n1 + bx];
                            __assume_aligned(ptr_next, CACHELINE_BYTES);
                            __assume_aligned(ptr_prev, CACHELINE_BYTES);
                            __assume_aligned(ptr_vel, CACHELINE_BYTES);
                            #pragma ivdep
                            for(int ix=0; ix<ixEnd; ix++) {
                                value = ptr_prev[ix]*c0
                                    + c1 * FINITE_ADD(ix, 1)
                                    + c1 * FINITE_ADD(ix, vertical_1)
                                    + c1 * FINITE_ADD(ix, front_1)
                                    + c2 * FINITE_ADD(ix, 2)
                                    + c2 * FINITE_ADD(ix, vertical_2)
                                    + c2 * FINITE_ADD(ix, front_2)
                                    + c3 * FINITE_ADD(ix, 3)
                                    + c3 * FINITE_ADD(ix, vertical_3)
                                    + c3 * FINITE_ADD(ix, front_3)
                                    + c4 * FINITE_ADD(ix, 4)
                                    + c4 * FINITE_ADD(ix, vertical_4)
                                    + c4 * FINITE_ADD(ix, front_4)
                                #if ( HALF_LENGTH == 8 )
                                    + c5 * FINITE_ADD(ix, 5)
                                    + c5 * FINITE_ADD(ix, vertical_5)
                                    + c5 * FINITE_ADD(ix, front_5)
                                    + c6 * FINITE_ADD(ix, 6)
                                    + c6 * FINITE_ADD(ix, vertical_6)
                                    + c6 * FINITE_ADD(ix, front_6)
                                    + c7 * FINITE_ADD(ix, 7)
                                    + c7 * FINITE_ADD(ix, vertical_7)
                                    + c7 * FINITE_ADD(ix, front_7)
                                    + c8 * FINITE_ADD(ix, 8)
                                    + c8 * FINITE_ADD(ix, vertical_8)
                                    + c8 * FINITE_ADD(ix, front_8)
                                #endif
                                ;
                                ptr_next[ix] = 2.0f*ptr_prev[ix] - ptr_next[ix] + value*ptr_vel[ix];
                            }// end x
                        }}//end Y and Z
                }}}//end of cache blocking
    }//end of parallel
}//end of function
```
Note that the number of additions changes here. The results are shown in Table 6.
Table 6. Dev05 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 77 | 77 |
| MPoints/s | 512.3 | 1147.7 |
| GFlop/s | 39.4 | 88.4 |
Dev06 – Changing the kernel factorization
The previous version suggests a different factorization. In the dev05 version, each coefficient is used for three multiplications. Another solution is to add the FINITE_ADD results in the three dimensions first, and then multiply the sum by the coefficient. This reduces the number of multiplications, with no impact on the number of additions.
One remark: to get as close as possible to the processor’s peak performance, a code must have a perfect balance between additions and multiplications, because the CPU contains two separate pipelines, one for additions and the other for multiplications. In the dev05 version, we had 50 additions for 27 multiplications; in the dev06 version, we reduce the number of multiplications even further (50 additions and 11 multiplications). On the Intel Xeon Processor v2 family, because of out-of-order execution, we don’t expect better performance. But on the Intel Xeon Phi Coprocessor, things are different: changing the ratio of additions to multiplications can modify the performance because of the in-order architecture.
```c
void iso_3dfd_it(float *ptr_next_base, float *ptr_prev_base, float *ptr_vel_base, float *coeff,
                 const int n1, const int n2, const int n3, const int num_threads,
                 const int n1_Tblock, const int n2_Tblock, const int n3_Tblock){
    int dimn1n2 = n1*n2;
    int n3End = n3 - HALF_LENGTH;
    int n2End = n2 - HALF_LENGTH;
    int n1End = n1 - HALF_LENGTH;
    #pragma omp parallel OMP_N_THREADS default(shared)
    {
        float* ptr_next;
        float* ptr_prev;
        float* ptr_vel;
        float value;
        int izEnd;
        int iyEnd;
        int ixEnd;
        const int vertical_1 = n1,     vertical_2 = n1*2,     vertical_3 = n1*3,     vertical_4 = n1*4;
        const int front_1 = dimn1n2,   front_2 = dimn1n2*2,   front_3 = dimn1n2*3,   front_4 = dimn1n2*4;
        const float c0=coeff[0], c1=coeff[1], c2=coeff[2], c3=coeff[3], c4=coeff[4];
        // At this point, we must handle the possible stencil sizes.
        #if ( HALF_LENGTH == 8 )
        const int vertical_5 = n1*5,   vertical_6 = n1*6,     vertical_7 = n1*7,     vertical_8 = n1*8;
        const int front_5 = dimn1n2*5, front_6 = dimn1n2*6,   front_7 = dimn1n2*7,   front_8 = dimn1n2*8;
        const float c5=coeff[5], c6=coeff[6], c7=coeff[7], c8=coeff[8];
        #endif
        __assume_aligned((void*)vertical_1, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_2, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_3, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_4, CACHELINE_BYTES);
        __assume_aligned((void*)front_1, CACHELINE_BYTES);
        __assume_aligned((void*)front_2, CACHELINE_BYTES);
        __assume_aligned((void*)front_3, CACHELINE_BYTES);
        __assume_aligned((void*)front_4, CACHELINE_BYTES);
        // Handle all stencil sizes
        #if ( HALF_LENGTH == 8 )
        __assume_aligned((void*)vertical_5, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_6, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_7, CACHELINE_BYTES);
        __assume_aligned((void*)vertical_8, CACHELINE_BYTES);
        __assume_aligned((void*)front_5, CACHELINE_BYTES);
        __assume_aligned((void*)front_6, CACHELINE_BYTES);
        __assume_aligned((void*)front_7, CACHELINE_BYTES);
        __assume_aligned((void*)front_8, CACHELINE_BYTES);
        #endif
        __declspec(align(CACHELINE_BYTES)) float div[n1_Tblock];
        #pragma omp for OMP_SCHEDULE collapse(3)
        for(int bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
            for(int by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
                for(int bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
                    izEnd = MIN(bz+n3_Tblock, n3End);
                    iyEnd = MIN(by+n2_Tblock, n2End);
                    ixEnd = MIN(n1_Tblock, n1End-bx);
                    for(int iz=bz; iz<izEnd; iz++) {
                        for(int iy=by; iy<iyEnd; iy++) {
                            ptr_next = &ptr_next_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_prev = &ptr_prev_base[iz*dimn1n2 + iy*n1 + bx];
                            ptr_vel  = &ptr_vel_base [iz*dimn1n2 + iy*n1 + bx];
                            __assume_aligned(ptr_next, CACHELINE_BYTES);
                            __assume_aligned(ptr_prev, CACHELINE_BYTES);
                            __assume_aligned(ptr_vel, CACHELINE_BYTES);
                            #pragma ivdep
                            for(int ix=0; ix<ixEnd; ix++) {
                                value = ptr_prev[ix]*c0
                                    + c1 * (FINITE_ADD(ix, 1) + FINITE_ADD(ix, vertical_1) + FINITE_ADD(ix, front_1))
                                    + c2 * (FINITE_ADD(ix, 2) + FINITE_ADD(ix, vertical_2) + FINITE_ADD(ix, front_2))
                                    + c3 * (FINITE_ADD(ix, 3) + FINITE_ADD(ix, vertical_3) + FINITE_ADD(ix, front_3))
                                    + c4 * (FINITE_ADD(ix, 4) + FINITE_ADD(ix, vertical_4) + FINITE_ADD(ix, front_4))
                                #if ( HALF_LENGTH == 8 )
                                    + c5 * (FINITE_ADD(ix, 5) + FINITE_ADD(ix, vertical_5) + FINITE_ADD(ix, front_5))
                                    + c6 * (FINITE_ADD(ix, 6) + FINITE_ADD(ix, vertical_6) + FINITE_ADD(ix, front_6))
                                    + c7 * (FINITE_ADD(ix, 7) + FINITE_ADD(ix, vertical_7) + FINITE_ADD(ix, front_7))
                                    + c8 * (FINITE_ADD(ix, 8) + FINITE_ADD(ix, vertical_8) + FINITE_ADD(ix, front_8))
                                #endif
                                ;
                                ptr_next[ix] = 2.0f*ptr_prev[ix] - ptr_next[ix] + value*ptr_vel[ix];
                            }// end x
                        }}//end Y and Z
                }}}//end of cache blocking
    }//end of parallel
}//end of function
```
Here, the number of flops by point is reduced compared to the previous versions. The results are shown in Table 7.
Table 7. Dev06 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 61 | 61 |
| MPoints/s | 543.5 | 1145.7 |
| GFlop/s | 33.1 | 69.9 |
Dev07 – First touch
This optimization only applies to NUMA systems (which the Intel Xeon Phi Coprocessor is not). On a multi-socket system, a CPU accesses the DIMMs local to its socket faster than remote ones. Figure 4 illustrates this principle with a two-socket system, where each socket has four DIMMs.
Figure 4. Two-socket System Memory Access
In order to limit the slower remote accesses, data must be mapped correctly in memory. This can be achieved with the first-touch technique. When data are allocated (with `malloc`, `_mm_malloc`, etc.), only the address space is reserved by the operating system; the data are not yet physically mapped in memory. The mapping happens when the data are first written: the thread that “touches” the data first places it close to the CPU running that thread.
Therefore, if the data are placed in the same way that they are accessed later, we can benefit from fast accesses all along the program’s execution.
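A minimal sketch of the principle on a flat array (our illustration; this simple split would only match an equally simple compute loop, which is why the real implementation below mirrors the full blocked decomposition):

```c
// First-touch sketch: pages are physically mapped near the thread that
// writes them first, so initialize with the same parallel decomposition
// that the compute loops will use later.
#pragma omp parallel for schedule(static)
for (long i = 0; i < nsize; i++)
    prev_base[i] = 0.0f;
```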
Implementing the first touch technique is quite easy in this code. We just need to initialize the data with the same OpenMP splitting that will be used later for processing and computing the wave propagation.
The first touch’s implementation is shown in the following code.
```c
void first_touch(float* tab_base, long n1, long n2, long n3,
                 long n1_Tblock, long n2_Tblock, long n3_Tblock, long num_threads){
    long n3End = n3 - HALF_LENGTH;
    long n2End = n2 - HALF_LENGTH;
    long n1End = n1 - HALF_LENGTH;
    long dimn1n2 = n1*n2;
    #pragma omp parallel OMP_N_THREADS
    {
        float* tab;
        #pragma omp for collapse(3)
        for(long bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
            for(long by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
                for(long bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
                    long izEnd = MIN(bz+n3_Tblock, n3End);
                    long iyEnd = MIN(by+n2_Tblock, n2End);
                    long ixEnd = MIN(n1_Tblock, n1End-bx);
                    // Lower Z halo
                    if(bz==HALF_LENGTH){
                        for(long i=0; i<bz; i++){
                            for(long iy=by; iy<iyEnd; iy++) {
                                tab = &tab_base[i*dimn1n2 + iy*n1 + bx];
                                for(long ix=0; ix<ixEnd; ix++) tab[ix] = 0.f;
                            }}
                    }
                    // Upper Z halo
                    if(izEnd>=n3End){
                        for(long i=n3End; i<n3; i++){
                            for(long iy=by; iy<iyEnd; iy++) {
                                tab = &tab_base[i*dimn1n2 + iy*n1 + bx];
                                for(long ix=0; ix<ixEnd; ix++) tab[ix] = 0.f;
                            }}
                    }
                    // Lower Y halo
                    if(by==HALF_LENGTH){
                        for(long iz=bz; iz<izEnd; iz++) {
                            for(long i=0; i<by; i++){
                                tab = &tab_base[iz*dimn1n2 + i*n1 + bx];
                                for(long ix=0; ix<ixEnd; ix++) tab[ix] = 0.f;
                            }}
                    }
                    // Upper Y halo
                    if(iyEnd>=n2End){
                        for(long iz=bz; iz<izEnd; iz++) {
                            for(long i=n2End; i<n2; i++){
                                tab = &tab_base[iz*dimn1n2 + i*n1 + bx];
                                for(long ix=0; ix<ixEnd; ix++) tab[ix] = 0.f;
                            }}
                    }
                    // Lower X halo
                    if(bx==HALF_LENGTH){
                        for(long iz=bz; iz<izEnd; iz++) {
                            for(long iy=by; iy<iyEnd; iy++) {
                                tab = &tab_base[iz*dimn1n2 + iy*n1];
                                for(long ix=0; ix<HALF_LENGTH; ix++) tab[ix] = 0.f;
                            }}
                    }
                    // Upper X halo
                    if(ixEnd>=n1End){
                        for(long iz=bz; iz<izEnd; iz++) {
                            for(long iy=by; iy<iyEnd; iy++) {
                                tab = &tab_base[iz*dimn1n2 + iy*n1 + ixEnd];
                                for(long ix=0; ix<HALF_LENGTH; ix++) tab[ix] = 0.f;
                            }}
                    }
                    // Interior of the block
                    for(long iz=bz; iz<izEnd; iz++) {
                        for(long iy=by; iy<iyEnd; iy++) {
                            tab = &tab_base[iz*dimn1n2 + iy*n1 + bx];
                            for(long ix=0; ix<ixEnd; ix++) tab[ix] = 0.f;
                        }}
                }}}
    }
}
```
This function is called for each of the arrays just after allocation. It’s important to have the same OpenMP decomposition later in the code, as follows:
```c
#pragma omp parallel OMP_N_THREADS
{
    float* tab;
    #pragma omp for collapse(3)
    for(long bz=HALF_LENGTH; bz<n3End; bz+=n3_Tblock){
        for(long by=HALF_LENGTH; by<n2End; by+=n2_Tblock){
            for(long bx=HALF_LENGTH; bx<n1End; bx+=n1_Tblock){
```
Performance results for this version are shown in Table 8.

Table 8. Dev07 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 61 | 61 |
| MPoints/s | 540.4 | 1666.2 |
| GFlop/s | 32.9 | 101.6 |
Dev08 – Intrinsics
Using intrinsics can further improve the performance on Intel Xeon Phi Coprocessors. Ideally, we would utilize the SIMD architecture through the explicit vector programming available in Intel compilers. However, at the time of this work, the Intel compilers did not fully support the pattern needed in this application. As a workaround, we use intrinsics to demonstrate the performance that is possible. We expect the Intel compiler to continue to improve in upcoming releases. The manual vectorization is straightforward along the Y and Z axes, because of the unit-stride accesses (Figure 5).
For each coefficient of the finite difference, we can directly load four SIMD registers. Then, we sum the four registers and multiply the result by the associated coefficient.
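A sketch of that idea for one coefficient, using Intel Xeon Phi Coprocessor intrinsics with illustrative variable names (the actual code ships in the package):

```c
// Y and Z contributions for coefficient coeff[ir]: four aligned,
// unit-stride loads, one 4-way sum, one multiply-accumulate.
__m512 yp = _mm512_load_ps(&ptr_prev[ix + ir*n1]);       // y + ir
__m512 ym = _mm512_load_ps(&ptr_prev[ix - ir*n1]);       // y - ir
__m512 zp = _mm512_load_ps(&ptr_prev[ix + ir*dimn1n2]);  // z + ir
__m512 zm = _mm512_load_ps(&ptr_prev[ix - ir*dimn1n2]);  // z - ir

__m512 sum  = _mm512_add_ps(_mm512_add_ps(yp, ym), _mm512_add_ps(zp, zm));
__m512 c_ir = _mm512_set1_ps(coeff[ir]);
value_v = _mm512_add_ps(value_v, _mm512_mul_ps(c_ir, sum));  // accumulate
```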
On the X axis, things are more complicated because we need to sum values from the same register as shown in Figure 6.
On the Intel Xeon Phi Coprocessor, we can actually code this in a very efficient way thanks to the instruction `_mm512_alignr_epi32`, which shifts 32-bit values across a pair of SIMD registers. For the 16th-order finite difference in space, updating one vector requires us to load only three vectors. Figure 7 shows how to proceed for coefficients c0 and c1; the shift is done with `_mm512_alignr_epi32`.
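A sketch of the X-axis trick for coefficient c1, again with illustrative names; three aligned loads provide shifted views of the data without any unaligned access:

```c
// Build the p[x-1] and p[x+1] vectors from three aligned loads, using
// _mm512_alignr_epi32 to shift floats across register boundaries.
__m512i left  = _mm512_castps_si512(_mm512_load_ps(&ptr_prev[ix - 16]));
__m512i curr  = _mm512_castps_si512(_mm512_load_ps(&ptr_prev[ix]));
__m512i right = _mm512_castps_si512(_mm512_load_ps(&ptr_prev[ix + 16]));

// p[x+1 .. x+16]: concatenation (right:curr) shifted right by 1 element
__m512 plus1  = _mm512_castsi512_ps(_mm512_alignr_epi32(right, curr, 1));
// p[x-1 .. x+14]: concatenation (curr:left) shifted right by 15 elements
__m512 minus1 = _mm512_castsi512_ps(_mm512_alignr_epi32(curr, left, 15));

__m512 c1_v = _mm512_set1_ps(coeff[1]);
value_v = _mm512_add_ps(value_v, _mm512_mul_ps(c1_v, _mm512_add_ps(plus1, minus1)));
```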
Here we need to maintain two different versions, because the sets of intrinsics for Intel Xeon Phi Coprocessor and Intel Xeon Processor v2 family are not the same. The code can be found in the package.
The results for this optimization are shown in Table 9.
Table 9. Dev08 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 69 | 69 |
| MPoints/s | 603.1 | 1338.2 |
| GFlop/s | 41.6 | 92.3 |
We can see that the compiler produced better code in dev07 for the Intel Xeon Processor v2 family.
Dev09 – Using FMA
In this version, we are still using intrinsics, but we generalize the use of FMA (fused multiply-add). This also means that the Intel Xeon Processor v2 family version stays the same as in dev08 (there is no FMA on the Intel Xeon Processor v2 family).
Using only FMAs offers an advantage here: the finite-difference coefficient can stay in the same register for a longer time. In the dev08 version, by contrast, six vectors were loaded and many temporary results were generated and combined (Figure 8).
In the dev09 version, things look simpler (Figure 9).
Figure 9. Dev09 Process
The temporary result always stays in registers and is carried from one FMA to the next. The finite-difference coefficient also stays in a register, and the load distribution is better.
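Reusing the names from the dev08 sketch, the c1 update collapses into a single fused instruction (again an illustration, not the package code):

```c
// The coefficient vector c1_v stays resident in its register and the
// accumulator value_v is updated in one FMA:
// value_v += c1_v * (p[x+1] + p[x-1])
value_v = _mm512_fmadd_ps(c1_v, _mm512_add_ps(plus1, minus1), value_v);
```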
The results for this optimization are shown in Table 10.
Table 10. Dev09 Results

| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 98 | 69 |
| MPoints/s | 632.4 | 1353.4 |
| GFlop/s | 61.9 | 93.4 |
Getting better results
In HPC, analysis models such as the Roofline Model can help the developer predict the upper-bound performance of a given algorithm.
If we apply this model to the dev00 version on both the Intel Xeon Phi Coprocessor and the Intel Xeon Processor v2 family, we get the two charts in Figure 10:
Figure 10. Roofline Model Results for Dev00
To be accurate, these charts should be redrawn after each optimization, because the number of additions and multiplications can change from one version to another. Still, the roofline model gives a hint about the maximum performance we can expect.
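For reference, the standard roofline bound ties together peak compute throughput, peak memory bandwidth, and the kernel's arithmetic intensity:

$$\text{Attainable GFlop/s} = \min\left(\text{Peak GFlop/s},\ \frac{\text{Flops}}{\text{Bytes moved}} \times \text{Peak GB/s}\right)$$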
Here, the roofline model indicates that on the Intel Xeon Processor v2 family we should be able to reach more than 337.5 GFlop/s. This value is based on the number of additions, multiplications, loads, and stores in a single iteration of the kernel. Because the dev00 version performs 78 flops per point, the maximum throughput of our kernel should be 337.5/78 × 1000 = 4327 MPoints/s.
On the Intel Xeon Phi Coprocessor, the same reasoning gives 658.12/78 × 1000 = 8437 MPoints/s.
We are still pretty far from those expectations.
The roofline model assumes that the code is highly parallel, perfectly vectorized, and that the hardware benefits from a perfect cache system. Through the ten versions presented above, we improved the parallelism, the vectorization, and the data locality, but that is not enough.
Part of the problem appears at compilation and execution time. Compilers offer so many options influencing performance that it is very difficult to find the best settings. Our application also takes many parameters, such as the array sizes, the cache-blocking sizes, and the number of threads and iterations, and these parameters likewise affect performance.
In order to optimize those parameters, we applied an auto-tuning tool, which searched the parameter space to find an improved set of parameters to use.
With this tool, we achieved the following results on Intel Xeon Phi Coprocessor and Two-socket Intel Xeon Processor v2 system (Table 11).
Table 11. Further Optimization with Parameter Settings
| | Intel Xeon Phi Coprocessor | Two-socket Intel® Xeon® Processor v2 System |
|---|---|---|
| Flops per point | 98 | 69 |
| MPoints/s | 6100 | 3500 |
| GFlop/s | 597 | 241 |
Parameter tuning indeed brought performance closer to the theoretical maximums predicted by the Roofline Model. As we conduct more auto-tuning studies, opportunities may emerge to fine-tune compiler heuristics as well as application-specific data layout settings.
Appendix
Configuration
Performance testing for the results provided in the tables in this paper was done on the following test system. For more information go to http://www.intel.com/performance.
| Component | Specification |
|---|---|
| System | Two-socket server |
| Host Processor | Intel® Xeon® Processor E5-2697 V2; 2.70 GHz |
| Host Cores/Threads | 12/24 |
| Host Memory | 65865 MB |
| Coprocessor | Intel® Xeon Phi™ Coprocessor 7120P, 61 cores, 1.24 GHz |
| Compiler | Intel® C++ Compiler Version 1.2.144, Driver 3.3-1, MPSS 3.3 |
| Host OS | Linux; Version 2.6.32-431.3.1.el6.x86 |
Notes
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.