Recipe: Building and Running YASK (Yet Another Stencil Kernel) on Intel® Processors

Overview

YASK (Yet Another Stencil Kernel) is a framework for exploring and tuning the design of HPC stencil kernels. It supports techniques including vector folding, cache blocking, memory layout, loop construction, and temporal wave-front blocking. YASK contains a specialized source-to-source translator that converts scalar C++ stencil code to SIMD-optimized code. With proper tuning, a stencil kernel on the Intel® Xeon Phi™ processor can run up to 2.8 times faster than the same program on an Intel® Xeon® processor. This advantage can be attributed to the Intel Xeon Phi processor's high-bandwidth memory and 512-bit SIMD instructions.

Introduction

An important subset of HPC is stencil computation, which updates data values over space and time. Conceptually, the kernel of a typical 3D iterative Jacobi stencil computation can be shown by the following pseudo-code, which iterates over the points in a 3D grid:

for t = 1 to T do
  for i = 1 to Dx do
    for j = 1 to Dy do
      for k = 1 to Dz do
        u(t + 1, i, j, k) ← S(t, i, j, k)
      end for
    end for
  end for
end for

where T is the number of time-steps; Dx, Dy, and Dz are the problem-size dimensions; and S(t, i, j, k) is the stencil function. For very simple 1D and 2D stencils, modern compilers can often recognize the data access patterns and optimize code generation to take advantage of vector registers and cache lines, but for more complicated stencils, combined with modern multi-core processors with shared caches and memories, the task of producing optimal code is beyond the scope of most compilers.
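As a concrete instance of the pseudo-code above, the following Python sketch applies a 7-point averaging stencil to the interior of a small grid. The stencil function, the helper names `make_grid` and `jacobi_sweeps`, and the fixed-boundary treatment are assumptions chosen for illustration; they are not taken from YASK.

```python
def make_grid(n, value=0.0):
    """Return an n x n x n grid (nested lists) filled with `value`."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


def jacobi_sweeps(u, steps):
    """Apply `steps` Jacobi time-steps to the 3D grid `u` and return the result.

    Assumed stencil S: the average of a point and its six axis neighbours.
    Boundary points are left unchanged.
    """
    dx, dy, dz = len(u), len(u[0]), len(u[0][0])
    for _ in range(steps):
        # Jacobi iteration: write into a fresh copy so every read sees time t.
        nxt = [[row[:] for row in plane] for plane in u]
        for i in range(1, dx - 1):
            for j in range(1, dy - 1):
                for k in range(1, dz - 1):
                    nxt[i][j][k] = (
                        u[i][j][k]
                        + u[i - 1][j][k] + u[i + 1][j][k]
                        + u[i][j - 1][k] + u[i][j + 1][k]
                        + u[i][j][k - 1] + u[i][j][k + 1]
                    ) / 7.0
        u = nxt
    return u


# Usage: a single non-zero point spreads to its six neighbours in one step.
grid = make_grid(5)
grid[2][2][2] = 7.0
result = jacobi_sweeps(grid, 1)
print(result[2][2][2], result[1][2][2], result[1][1][1])
```

Even in this small form, every output point reads seven input points, most of which are also read by neighbouring output points. That reuse is what vectorization and cache blocking try to exploit.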

YASK is a tool that lets a user experiment with different data layouts, such as vector folding, and with different loop structures, either of which may yield better-performing code than compiler optimizations alone. YASK is currently focused on single-node OpenMP optimizations.

The following graphic shows the typical YASK usage model:

[Figure: High-level components]

Introductory Tutorial

A complete introductory tutorial can be found in the documentation section of the YASK website. The tutorial walks a user through the steps needed to build and run YASK jobs.

Vector Folding Customization

Vector folding, also known as multi-dimensional vectorization, is the process of packing vector registers with blocks of data that are not necessarily contiguous, in order to optimize data and cache reuse. For a complete discussion of vector folding, please refer to the paper titled “Vector Folding: improving stencil performance via multi-dimensional SIMD-vector representation.” Folding vectors by hand is complicated and error-prone, so YASK provides a software tool that translates standard sequential code into new code, which can then be compiled into a faster, more efficient executable.

Download detailed Vector Folding paper [PDF 330 KB]
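To make the data-reuse argument for vector folding more tangible, the sketch below counts how many aligned input vectors must be read to compute one output vector of a 3D star stencil, for different fold shapes that all hold eight elements (for example, double precision in a 512-bit register). This is only a simplified counting model: the function names, the star-shaped stencil, and the radius of 4 are assumptions for illustration, and it does not reproduce YASK's generated code.

```python
from itertools import product


def star_offsets(radius):
    """Offsets of a 3D star stencil: the centre plus `radius` points each way per axis."""
    offsets = {(0, 0, 0)}
    for axis in range(3):
        for r in range(1, radius + 1):
            for sign in (-1, 1):
                off = [0, 0, 0]
                off[axis] = sign * r
                offsets.add(tuple(off))
    return offsets


def vectors_touched(fold, radius):
    """Count the aligned input vectors read to compute one output vector.

    `fold` is the (x, y, z) shape of the block of grid points held in one
    SIMD vector, e.g. (8, 1, 1) for a traditional 1D vector.
    """
    offsets = star_offsets(radius)
    touched = set()
    for point in product(*(range(n) for n in fold)):
        for off in offsets:
            # Floor division maps a grid point to the aligned block that holds it.
            touched.add(tuple((p + o) // n for p, o, n in zip(point, off, fold)))
    return len(touched)


for fold in [(8, 1, 1), (4, 2, 1), (2, 2, 2)]:
    print(fold, vectors_touched(fold, radius=4))
```

Under this model, the more compact fold shapes touch fewer distinct input vectors than the 1D shape, which is the kind of bandwidth saving that vector folding targets.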

Loop Structure Customization

In combination with vector folding, executing loops across multiple threads yields additional performance. By allowing a user to experiment with loop structure via OpenMP constructs, YASK offers yet another avenue for code optimization. There are three main loop control customizations: ‘Rank’ loops break the problem into OpenMP regions, ‘Region’ loops break each OpenMP region into cache blocks, and ‘Block’ loops iterate over each vector cluster in a cache block.
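The three loop levels described above can be pictured as nested tiling. The following sketch shows the idea in one dimension only; the function names and the sizes are hypothetical, and the real YASK loops are generated C++ with OpenMP, which this sketch does not attempt to model.

```python
def tiles(begin, end, step):
    """Yield (start, stop) sub-ranges of [begin, end), each at most `step` long."""
    for start in range(begin, end, step):
        yield start, min(start + step, end)


def visit(domain, region, block, cluster):
    """Yield the start index of every vector cluster in a 1D domain.

    Three nested levels of tiling mirror the 'Rank', 'Region', and 'Block'
    loops: regions within the problem, cache blocks within a region, and
    vector clusters within a cache block.
    """
    for r0, r1 in tiles(0, domain, region):        # 'Rank' loop over regions
        for b0, b1 in tiles(r0, r1, block):        # 'Region' loop over cache blocks
            for c0, _ in tiles(b0, b1, cluster):   # 'Block' loop over vector clusters
                yield c0


# Usage: 64 points, regions of 32, cache blocks of 16, clusters of 4 points.
print(list(visit(64, 32, 16, 4)))
```

Whatever sizes are chosen, each cluster is visited exactly once; only the order and grouping change, which is why these sizes can be tuned freely for cache behavior.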

Performance

AWP-ODC: One of the stencils included in YASK is awp-odc, a staggered-grid finite difference scheme used to approximate the 3D velocity-stress elastodynamic equations: http://hpgeoc.sdsc.edu/AWPODC. Applications using this stencil simulate the effect of earthquakes to help evaluate designs for buildings and other at-risk structures. Using a problem size of 1024*384*768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.8x compared to the Intel® Xeon® processor E5-2697 v4.

AWP-ODC

Configuration details: YASK HPC Stencils, AWP-ODC kernel

Intel® Xeon® processor E5-2697 v4: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz (Turbo ON), 18 Cores/Socket, 36 Cores, 72 Threads (HT on), DDR4 128GB, 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

Download code from https://github.com/01org/yask and install per included directions
make stencil=awp arch=hsw cluster=x=2,y=2,z=2 fold=y=8 omp_schedule=guided mpi=1
./stencil-run.sh -arch hsw -ranks 2 -bx 74 -by 192 -bz 20 -pz 2 -dx 512 -dy 384 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96GB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

Download code from https://github.com/01org/yask and install per included directions
make stencil=awp arch=knl INNER_BLOCK_LOOP_OPTS='prefetch(L1,L2)'
./stencil-run.sh -arch knl -bx 128 -by 32 -bz 32 -dx 1024 -dy 384 -dz 768
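The block sizes in the two AWP-ODC command lines can be related to the problem size with a little arithmetic. The sketch below assumes that `-dx`, `-dy`, and `-dz` give the domain size handled by each rank and that `-bx`, `-by`, and `-bz` give the cache-block size; this reading of the option names is an assumption, not something stated in this recipe, and the helper name `blocks_per_rank` is hypothetical.

```python
def blocks_per_rank(domain, block):
    """Return the number of cache blocks per dimension and in total.

    Assumes `domain` is the per-rank grid size and `block` the cache-block
    size, both as (x, y, z) tuples. Partial blocks at the edges count as one.
    """
    counts = [(d + b - 1) // b for d, b in zip(domain, block)]  # ceiling division
    total = 1
    for c in counts:
        total *= c
    return counts, total


# Values copied from the two command lines above.
print("knl:", blocks_per_rank((1024, 384, 768), (128, 32, 32)))
print("hsw:", blocks_per_rank((512, 384, 768), (74, 192, 20)))
```

If the per-rank reading is right, the two ranks of 512 points in x used on the Intel Xeon processor would together cover the same 1024*384*768 problem that the Intel Xeon Phi processor runs as a single rank.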

ISO3DFD: Another of the stencils included in YASK is iso3dfd, a 16th-order in space, 2nd-order in time, finite-difference code found in seismic imaging software used by energy-exploration companies to predict the location of oil and gas deposits. Using a problem size of 1536*1024*768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.6x compared to the Intel® Xeon® processor E5-2697 v4.

ISO3DFD

Configuration details: YASK HPC Stencils, iso3dfd kernel

Recipe (Intel® Xeon® processor E5-2697 v4):

Download code from https://github.com/01org/yask and install per included directions
make stencil=iso3dfd arch=hsw mpi=1
./stencil-run.sh -arch hsw -ranks 2 -bx 256 -by 64 -bz 64 -dx 768 -dy 1024 -dz 768

Recipe (Intel® Xeon Phi™ processor 7250):

Download code from https://github.com/01org/yask and install per included directions
make stencil=iso3dfd arch=knl
./stencil-run.sh -arch knl -bx 192 -by 96 -bz 96 -dx 1536 -dy 1024 -dz 768
