By utilizing the strengths of the Intel® Xeon Phi™ coprocessor, the authors of chapter 3 of High Performance Parallelism Pearls were able to improve and modernize their code and “achieve great scaling, vectorization, bandwidth utilization and performance/watt.” The authors (Jacob Weismann Poulsen, Karthik Raman and Per Berg) note, “The thinking process and techniques used in this chapter have wide applicability: focus on data locality and then apply threading and vectorization techniques.” In particular, they write about the advection routine from the HIROMB‐BOOS‐Model (HBM), which initially underperformed on the Intel Xeon Phi coprocessor. After restructuring the code, they achieved a 3x performance improvement: they changed the data structures for better data locality and exploited the available threads and SIMD lanes for greater concurrency at the thread and loop levels, making full use of the available memory bandwidth. To avoid data licensing issues, the example code provided in High Performance Parallelism Pearls uses a Baffin Bay setup generated from the freely available ETOPO2 data set.