by Jason Fletcher
Introduction
The Itanium® processor is an industrial-strength execution engine, but faulty code can bring any machine to a crawl. In many cases, simple and seemingly innocent segments of code run significantly slower than they could. Thus, good developers must have techniques to identify and correct slow-running portions of code.
The VTune™ Performance Analyzer is an important tool in the discovery of what causes code to perform sub-optimally. This article walks through the steps needed to identify and correct some performance problems on the Itanium processor. It uses the VTune Performance Analyzer and sample code to illustrate the following concepts:
- Using performance counters to organize data and optimize 64-bit applications
- Avoiding performance issues due to mixing real and integer structures
- Minimizing the impact of BACK_END_BUBBLE
- Organizing data to positively impact performance counters
Identify a Processor Back-End Bubble
A 'bubble' is defined as any delay in the processor. The 'back end' is the place where instructions are retired when they are complete. There are five main causes of bubbles in the Itanium processor:
- Pipeline flush (BE_FLUSH_BUBBLE)
- Stalls in the L1 data-cache or floating-point processing-unit pipelines (BE_L1D_FPU_BUBBLE)
- Stalls in the execution stage of the pipeline due to data not being available (BE_EXE_BUBBLE)
- A need for the Register Stack Engine to free registers for the current stack (BE_RSE_BUBBLE)
- A lack of instructions coming from the Front End (BACK_END_BUBBLE.FE)
The following matrix-multiplication code (MatrixMultiply.c) provides an example of code that runs sub-optimally (the printf statement is present solely to make sure the compiler does not optimize the code into nothingness):
#include "stdio.h" #include "stdlib.h" int main () { int i, j, k; int a[512][512], b[512][512], c[512][512]; for (i = 0; i < 512; i++) { for (j = 0; j < 512; j++) { a[i][j] = rand(); b[i][j] = rand(); c[i][j] = rand(); } } for (i = 0; i < 512; i++) { for (j = 0; j < 512; j++) { for (k = 0; k < 512; k++) { c[i][j] = c[i][j] + a[i][k] * b[k][j]; } } } printf("Multiply done, %d", c[i][j]); }
Open a 64-bit command window and compile the code:
ecl MatrixMultiply.c /o matrix.exe
Copy the code to a 64-bit machine if it is not there already. Open the VTune analyzer and start a new project (the example is called 'Matrix' for convenience). Start a new Sampling Wizard and select Win32/Win64/Linux profiling. Enter the path to your executable in the “Application to Launch” field and uncheck “Run Activity when done with wizard.” Click “Finish.”
Next, select “Configure” on the menu bar and “Modify ... <sampling> collector.” Click on “Events” and add the BACK_END_BUBBLE-ALL counter. Click on the green arrow in the VTune analyzer window and results similar to the following should appear:
Notice that the execution of the triple-nested loop took more than five billion clockticks and that the event BACK_END_BUBBLE-ALL consumed 4.5 billion clockticks. These figures mean that, on a 1 GHz Itanium processor, the matrix multiply took 5.3 seconds and spent 4.5 seconds (85% of the execution time) waiting on the back end.
If BACK_END_BUBBLE had not been the culprit, other sampling runs using a counter with -ALL at the end of the name could have been used, or even run simultaneously with the BACK_END_BUBBLE run. For the sake of simplicity, this example takes the shortcut of using only one counter.
Create a VTune Analyzer Activity to Identify the Root Cause
Now that the cause has been determined to be bubbles in the back end of the processor, more prying must be done to determine the root cause of the trouble. Go back to “Modify <…><sampling> collector” and a new window should pop up:
Choose the first option, "Make a copy of the Activity and modify the copy" (this option will be used for the remainder of this article). When the sampling box appears, add the counters listed below in the right side of the window. (There should be a 'Sample After' column in the right window; it was moved out of the picture in order to show the full counter name.)
After applying the changes and exiting the window, rename the new project to “BACK_END root cause.” Next, run the new activity by pressing the green arrow on the VTune analyzer toolbar.
At the end of the run, notice that BE_EXE_BUBBLE-ALL has three billion events associated with it and that every other counter added is an order of magnitude smaller:
Choose to modify the collector and make a copy of the activity again. Remove all the counters and keep CPU_CYCLES or IA64_INST_RETIRED-THIS if desired. Add all counters with BE_EXE as a prefix. Leave the collector, rename the project to BE_EXE, and then run it. The results look somewhat like this:
The problem areas are GRALL and FRALL, which refer to the general and floating-point registers. The VTune Performance Analyzer online help states that GRALL and FRALL events occur when registers depend on one another or when an instruction is waiting for a load to return its result into a register. Examining the core of the matrix code shows that each iteration is independent of the others, so the problem must be waiting for loads to complete. The primary cause of load stalls is that the required data is not in cache, which implies inefficient memory access.
Improve the Efficiency of the Code Based on the Root Cause
Studying the core loops shows that in every iteration, new elements of matrices a and b are accessed, as the excerpt below shows:
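For reference, the core loops from the listing above are:

for (i = 0; i < 512; i++) {
    for (j = 0; j < 512; j++) {
        for (k = 0; k < 512; k++) {
            /* a[i][k] walks along a row; b[k][j] walks down a column */
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}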
Take a moment for a quick review of how matrices are stored in memory; the layout can be visualized as a single straight line of elements. Given a segment of memory that can hold 1024 elements, memory position 0000 holds element [0][0], 0001 holds [0][1], 0002 holds [0][2], and so on until 0511 holds [0][511]. The next row of data starts at the next available memory address.
This means that memory position 0512 holds element [1][0], 0513 holds [1][1], and so on until 1023 holds [1][511]. This layout also affects the cache: when data is fetched from one memory location, data in nearby locations is loaded as well, because memory accesses tend to be close to one another.
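As a rough sketch of this row-major layout for the 512-column matrices in the example (the helper below is purely illustrative and not part of the sample program), the flat offset of an element can be computed as follows:

#include <stddef.h>

/* Illustrative helper: row-major offset of element [row][col] in a
   512-column matrix of int.  Consecutive columns are adjacent in
   memory; consecutive rows are 512 elements (2 KB) apart. */
size_t element_offset(size_t row, size_t col)
{
    return (row * 512 + col) * sizeof(int);
}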
Matrix c is static for the 512 iterations of k. Matrix a accesses successive elements of memory as k increments, but matrix b jumps a full row (512 elements) ahead with each iteration, so its accesses are regular but not consecutive. This means that matrix a takes advantage of the cache, and matrix b does not.
Suppose that the first-level cache can hold 64 elements of a matrix and the second-level cache can hold 256 elements. When the first element of matrix a is loaded, 64 values are brought into the fastest cache; those 64 elements plus additional ones are also loaded into the second-level cache, for a total of 256 values ready for quick access.
Matrix a therefore does not have to wait for a load from main memory for at least another 255 loop iterations. Matrix b, however, has to wait for memory on every iteration, because its row index increments with each iteration, rather than its column index as with matrix a. All the data that gets loaded into cache with each of its memory accesses is wasted, because none of the neighboring values will be used soon.
In an effort to improve the way matrix b accesses memory, let us swap the positions of loops j and k, like so:
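A minimal sketch of the interchanged loops (only the nesting order of j and k changes; the loop body is unchanged):

for (i = 0; i < 512; i++) {
    for (k = 0; k < 512; k++) {
        for (j = 0; j < 512; j++) {
            /* j now varies fastest, so b and c are walked along their rows */
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}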
Matrix c now gets accessed sequentially, but it changes every iteration. Matrix a still gets accessed sequentially and becomes static for 512 iterations, as matrix c did earlier. Matrix b is accessed sequentially, because j increments every iteration instead of k. Therefore, the program should take advantage of spatial locality (memory elements being next to each other) and run faster. Make the change in code, recompile, and replace the old executable.
Right click the “BACK_END root cause” activity and select “Copy Object.” Right-click on the parent folder and paste. Rename the new “BACK_END root cause” to “BE root cause 02.” Run the sampling activity again:
Notice that the total number of CPU_CYCLES has been reduced by half! BE_EXE_BUBBLE-ALL is still a large consumer of clockticks, though, and there is more that can be done to improve performance.
Further Refine Code Efficiency
The next step in tuning this code is to take advantage of the Itanium processor's ability to issue up to six instructions per clock cycle. Since we have established that the loop iterations are independent of one another, it is possible to unroll the loops, executing multiple computations per iteration.
The way to unroll loops is to write code similar to this:
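A sketch of one possible unrolling, assuming the interchanged i-k-j loop order from the previous section and an unroll factor of four (512 divides evenly by 4, so no cleanup loop is needed):

for (i = 0; i < 512; i++) {
    for (k = 0; k < 512; k++) {
        for (j = 0; j < 512; j += 4) {
            /* four independent computations per iteration, which the
               compiler can schedule into parallel issue slots */
            c[i][j]     = c[i][j]     + a[i][k] * b[k][j];
            c[i][j + 1] = c[i][j + 1] + a[i][k] * b[k][j + 1];
            c[i][j + 2] = c[i][j + 2] + a[i][k] * b[k][j + 2];
            c[i][j + 3] = c[i][j + 3] + a[i][k] * b[k][j + 3];
        }
    }
}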
This way, four computations are done per iteration. After recompiling and replacing the executable, the VTune analyzer reports the following:
Conclusion
Like any hardware platform, the Itanium processor is dependent for performance upon the quality of the code it executes. Seemingly innocuous code can produce performance issues that are difficult to identify without systematic analysis using proven techniques. Subjecting application code to such analysis prior to its release is a vital step in development that can prevent performance deficits and the associated threats to your application's competitive advantage.
Developers should identify bubbles, or processor delays, in 64-bit applications and subject them to root-cause analysis using the VTune Performance Analyzer. The VTune environment provides a means of sampling performance data for iterative analysis that gradually zeroes in on suboptimal code and then determines the root cause of the associated performance issues so that they can be corrected.
The following steps represent best practices for using the VTune analyzer to discover the causes of poor performance on the Itanium processor:
- Run the high-level -ALL counters, primarily BACK_END_BUBBLE-ALL.
- If necessary, run the mid-level -ALL counters that are underneath the primary high-level culprit.
- Run the low-level counters underneath the counter with the highest number of events.
- Read the VTune analyzer online-help entry related to the specific low-level counter to find out what causes the event to happen.
- Analyze the code, keeping in mind the cause of the event.
- Change the code.
- Recompile.
This sequence of steps should be repeated iteratively as long as it continues to generate significant performance increases.
Additional Resources
Intel, the world's largest chipmaker, also provides an array of value-added products and information to software developers:
- The Intel® Software Partner Program provides software vendors with Intel's latest technologies, helping member companies to improve product lines and grow market share.
- Intel® Developer Zone offers free articles and training to help software developers maximize code performance and minimize time and effort.
- Intel Software Development Products include Compilers, Performance Analyzers, Performance Libraries and Threading Tools.
- IT@Intel, through a series of white papers, case studies, and other materials, describes the lessons it has learned in identifying, evaluating, and deploying new technologies.
About the Author
Jason Fletcher works as a software engineer for the Software and Solutions Group at Intel, enabling software for the Itanium® and Intel® Xeon® processor platforms.