Intel Advisor will soon offer a major step forward in memory performance optimization: a new visual “Roofline” bounds-and-bottlenecks analysis.
This new feature provides insights beyond vectorization, such as memory usage and the quality of algorithm implementation.
If you want to try it out or influence the development of this new Roofline feature, sign up for the early access program by sending a request to vector_advisor@intel.com.
Accelerate your application: Tuning existing vectorization and adding new vectorization is easy with the visually intuitive Vectorization Advisor tool in the Intel® Advisor. Try out the new vectorization capabilities available in the Intel® Parallel Studio Beta Update, such as expanded memory access patterns analysis, FLOPS information, and special features for the second generation of the Intel® Xeon Phi™ processor (code name Knights Landing), which uses the AVX-512 instruction set. Register for the Intel® Parallel Studio XE 2017 beta program at https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta#howto, or download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.
Roofline Modeling
Roofline modeling was first proposed by Berkeley researchers Samuel Williams, Andrew Waterman, and David Patterson in the 2009 paper "Roofline: An Insightful Visual Performance Model for Multicore Architectures".
A Roofline model provides insight into how your application works by helping you answer the following questions:
- Does my application work optimally on the current hardware?
- What limits performance? Is my application workload memory or compute bound?
- What is the right strategy to improve application performance?
The model plots data to help you visualize application compute- and memory-bandwidth ceilings by measuring two parameters:
- Operational intensity – the number of floating-point operations per byte transferred from memory
- Floating-point performance – in GFLOPS (billions of floating-point operations per second)
The proximity of the data points to the model lines (rooflines) shows the degree of optimization.
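The relationship between the two parameters can be written as a short calculation. The following is a minimal Python sketch of the classic roofline bound, in which attainable performance is the smaller of the compute ceiling and the memory ceiling (bandwidth multiplied by operational intensity). The function names and the peak and bandwidth figures are illustrative assumptions, not measurements of any particular platform and not part of Intel Advisor.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound in GFLOPS for a kernel with the given operational
    intensity (FLOP/byte) on a platform with the given ceilings."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def fraction_of_roof(measured_gflops, intensity, peak_gflops, bandwidth_gbs):
    """How close a measured data point sits to its roofline (1.0 = on the roof)."""
    return measured_gflops / attainable_gflops(intensity, peak_gflops, bandwidth_gbs)


if __name__ == "__main__":
    # Hypothetical platform: 100 GFLOPS peak, 20 GB/s memory bandwidth.
    PEAK, BW = 100.0, 20.0
    # Hypothetical kernels as (operational intensity, measured GFLOPS).
    for intensity, measured in [(0.5, 6.0), (10.0, 40.0)]:
        roof = attainable_gflops(intensity, PEAK, BW)
        share = fraction_of_roof(measured, intensity, PEAK, BW)
        print(f"intensity={intensity} FLOP/byte: roof={roof} GFLOPS, "
              f"measured reaches {share:.0%} of it")
```

A data point whose fraction is close to 1.0 sits near its roofline and has little headroom left under that ceiling; a small fraction suggests there is room for optimization.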
Consider the roofline plot in Fig. 1.
The kernels on the right-hand side are more compute bound, and the higher they sit on the Y-axis, the closer they get to the FP peak. The performance of these kernels is bounded by the compute capabilities of the platform. To improve the performance of kernel 3, consider migrating it to a highly parallel platform, such as the Intel Xeon Phi processor, where the compute ceiling and memory throughput are higher. For kernel 2, vectorization can be considered as a performance improvement strategy, because it is far from the ceiling.
The kernels toward the left-hand side of the plot are memory bound, and the higher they sit on the Y-axis, the more tightly they are bound by the DRAM and cache peak bandwidth of the platform. To increase the performance of these kernels (shifting their plot position to the right, where the performance ceiling is higher), consider improving the algorithm or its implementation to perform more computations per data item. These kernels may also run faster on an Intel Xeon Phi processor because of its greater memory bandwidth.
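The split between the left-hand and right-hand sides of the plot can be made concrete with the ridge point, the operational intensity at which the sloped memory roof meets the horizontal compute roof. The sketch below, in Python, classifies a kernel by which side of the ridge point it falls on. The helper names and the platform figures are hypothetical and chosen only for illustration.

```python
def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity (FLOP/byte) at which the sloped memory roof
    meets the horizontal compute roof."""
    return peak_gflops / bandwidth_gbs


def classify(intensity, peak_gflops, bandwidth_gbs):
    """Label a kernel by the side of the ridge point it falls on."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory bound"
    return "compute bound"


if __name__ == "__main__":
    # Hypothetical platform: 100 GFLOPS peak, 20 GB/s memory bandwidth.
    PEAK, BW = 100.0, 20.0
    print("ridge point:", ridge_point(PEAK, BW), "FLOP/byte")
    for intensity in (0.5, 10.0):
        print(intensity, "FLOP/byte ->", classify(intensity, PEAK, BW))
```

Performing more computations per data item raises a kernel's operational intensity, which moves it to the right and, once it passes the ridge point, out of the memory-bound region.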
Intel Advisor Roofline Feature
Intel Advisor implements the "Cache-aware roofline" model described in the paper "Cache-aware Roofline model: Upgrading the loft" by Aleksandar Ilic, Frederico Pratas, and Leonel Sousa. It provides additional insight by addressing all levels of the memory/cache hierarchy:
- Sloped rooflines illustrate the performance levels achievable if all the data fits into the respective cache.
- Horizontal lines show the peak achievable performance levels if vectorization and other CPU resources are used effectively.
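The two kinds of lines listed above can be sketched as a small table of ceilings evaluated at one operational intensity. The Python example below is an illustration only: the cache-level bandwidths, the scalar and vector peaks, and the function names are all assumed values and hypothetical helpers, and they do not reflect how Intel Advisor computes or draws its rooflines.

```python
def roofs_at(intensity, bandwidths_gbs, peaks_gflops):
    """Return every ceiling (GFLOPS) at the given operational intensity:
    one sloped roof per memory level plus the horizontal compute roofs."""
    roofs = {name: bw * intensity for name, bw in bandwidths_gbs.items()}
    roofs.update(peaks_gflops)
    return roofs


def first_roof_above(measured_gflops, intensity, bandwidths_gbs, peaks_gflops):
    """Name of the lowest ceiling at or above a measured data point,
    or None if the point exceeds every ceiling."""
    roofs = roofs_at(intensity, bandwidths_gbs, peaks_gflops)
    above = {name: value for name, value in roofs.items()
             if value >= measured_gflops}
    return min(above, key=above.get) if above else None


if __name__ == "__main__":
    # Hypothetical bandwidths in GB/s and compute peaks in GFLOPS.
    bandwidths = {"L1": 400.0, "L2": 150.0, "L3": 80.0, "DRAM": 20.0}
    peaks = {"scalar FP peak": 10.0, "vector FP peak": 80.0}
    # A hypothetical loop at 0.25 FLOP/byte that achieves 8 GFLOPS.
    print(roofs_at(0.25, bandwidths, peaks))
    print(first_roof_above(8.0, 0.25, bandwidths, peaks))
```

The lowest ceiling above a loop hints at the next limit to address: a sloped roof points toward cache or memory optimization, while a horizontal roof points toward vectorization or better use of other CPU resources.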
Intel Advisor places a dot for every loop in the Roofline plot. Consider the Intel Advisor roofline plot in Fig. 2. Most of the loops require additional cache-use optimization. Loops to the right of the plotted blue data point fall below the scalar execution roofline and therefore require vectorization.
You can examine your application's performance opportunities by applying our experimental Roofline Vector Advisor and browsing through the most heavily loaded loops. The size of each circle denotes the execution time of the loop.