Channel: Intel Developer Zone Articles

Simple optimization methodology with Intel System Studio (VTune, C++ Compiler, Cilk Plus)


Introduction:

 In this article, we introduce a simple optimization methodology that combines Intel® Cilk™ Plus and the Intel® C++ Compiler, guided by performance analysis with Intel® VTune™ Amplifier. Intel® System Studio 2015, which contains all of these components, was used for this article.

  • Intel® VTune™ Amplifier is an integrated performance analyzer that helps developers analyze complex code and identify bottlenecks quickly.
  • Intel® C++ Compiler generates optimized code that runs on IA-32 and Intel 64 architectures. It also provides a number of features that help developers improve performance easily.
  • Intel® Cilk™ Plus, a C/C++ language extension included in the Intel® C++ Compiler, allows you to improve performance by adding parallelism to new or existing C or C++ programs.

Strategy:

 We will use one of the code examples from the VTune tutorials, tachyon_amp_xe, as our target code for performance optimization. This example draws a picture of a set of complicated objects.

 


 

 

The performance optimization methodology that can be applied to this sample is described below.

  1. Run Basic Hotspots Analysis or General Exploration Analysis on the example project from within the integrated IDE, for instance Visual Studio* 2013.
  2. Identify hotspots and other opportunities for optimization.
  3. Modify the code of the detected hotspot.
  4. Examine the compiler's optimization options.
  5. Apply parallelism to the parallelization candidates.

Optimization:

< Test Environment >

 OS : Windows 8.1

 Tool Suite : Intel® System Studio for Windows Update 3 

 IDE : Microsoft Visual Studio 2013

 

< Step 1 : Interpret & Analyze the result data >

  • Run General Exploration Analysis (if that is not possible, use Basic Hotspots Analysis) to find hotspots. Since this example code was written for practicing hotspot detection and performance improvement, it is helpful to follow the tachyon_amp_xe example page for this particular hotspot search. After running the example with VTune, we get the results described below.
  • The elapsed time of the application was 44.834s. This is the performance baseline that we will work to reduce.
  • For this sample application, the 'initialize_2D_buffer' function, which took 18.945s to execute, shows up at the top of the list as the hottest function. We will try to optimize this most time-consuming function.

 

  • The CPU Usage Histogram shows that this sample does not make use of parallelism. Therefore, we may be able to use multiple threads to handle the heavy tasks more quickly.

 

< Step 2 : Algorithmic approach for 'initialize_2D_buffer' >

 

 

 

  • As we saw earlier, the 'initialize_2D_buffer' function took the longest time to execute and retired the largest number of instructions. If we can optimize anything for a performance gain, this function is where we are likely to get the largest benefit.

  • When you double-click the function name, VTune Amplifier opens the source file positioned at the most time-consuming line of the function. For 'initialize_2D_buffer', this is the line that initializes a memory array through non-sequential memory locations. The sample code already contains a faster alternative 'for loop'.

  • In the actual code of 'initialize_2D_buffer', the first for loop does not fill the target array consecutively, while the second for loop is designed to do the same task consecutively. By using the second for loop, we can get a performance benefit.

  • After replacing the first for loop with the second one, we can observe some performance improvement. Let's look at the new VTune profiling results.

  • Compared to the previous results, the total elapsed time has been reduced from 44.834s to 35.742s, which is about x1.25 faster than before. For the target function alone, the time went from 18.945s to 11.318s, which is about x1.67 faster.
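The difference between the two loop orders can be sketched as follows. This is only an illustration under assumptions: the function names, signatures, and fill value are hypothetical stand-ins and are not the code shipped with the sample. It assumes a row-major buffer of width * height elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two loops in the sample's buffer
// initialization; names, signatures, and fill value are assumptions.

// Slow variant: the inner loop walks down a column, so consecutive writes
// land `width` elements apart in a row-major buffer.
void fill_column_order(std::vector<unsigned int>& buf,
                       std::size_t width, std::size_t height) {
    assert(buf.size() == width * height);
    for (std::size_t x = 0; x < width; ++x)
        for (std::size_t y = 0; y < height; ++y)
            buf[y * width + x] = 0xFFFFFFFFu;
}

// Fast variant: the inner loop walks along a row, so consecutive writes
// touch adjacent memory addresses.
void fill_row_order(std::vector<unsigned int>& buf,
                    std::size_t width, std::size_t height) {
    assert(buf.size() == width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            buf[y * width + x] = 0xFFFFFFFFu;
}
```

Both functions produce the same buffer contents. Only the order of the memory accesses differs, which is why swapping the loops is a safe change to try first.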

 

< Step 3 : Compiler Optimization Options >

  • We often overlook the automatic optimization abilities that compilers have. In this case, we simply enable the Intel C++ Compiler's optimization option, which is triggered by adding '/O3' when compiling. We can also enable it through the GUI. To use '/O3', the Intel C++ Compiler must first be set as the project's compiler.

 

  • Changing this single option sometimes brings great performance benefits. For a detailed explanation of the optimization option '/O[n]', refer to the Intel C++ Compiler documentation. The new results show that the task now takes 24.979s to finish. It was 35.742s before, so this is about x1.43 faster.

< Step 4 : Adding Parallelism by Cilk Plus >

  • Parallel programming is a very broad area in itself, and there are many ways to implement parallelism in a system to take advantage of multi-core platforms. This time, we introduce Intel Cilk Plus, a language extension that is fairly easy to apply and that schedules work intelligently.

  • By analyzing the code together with VTune's results, we can find the point where the heaviest routine is called repeatedly. That point can be a good parallelization candidate. Usually, it can be found by looking at the caller/callee tree and tracing back from the root hotspot until you reach a parallelizable spot to test.

  • In this case, it was the 'draw_trace' function in find_hotspots.cpp. Adding a simple 'cilk_for' to parallelize the target task here dynamically distributes the lines to draw across multiple threads instead of a single thread. As a result, you can visually observe 4 threads (the test machine is dual-core with Hyper-Threading Technology) drawing different lines simultaneously.

  • The painting job now takes 11.656s to finish, which is a big improvement over the time it took at the beginning. Let's take a look at the VTune results.

  • The total elapsed time is 13.117s, which is about x1.9 faster than the previous result. We can also observe that the multiple cores are being utilized efficiently.
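A minimal sketch of this kind of change is shown below. It is not the sample's 'draw_trace' code: the function names and the per-pixel routine are hypothetical, and the sketch assumes that compilers with Cilk Plus enabled define the `__cilk` macro. Without Cilk Plus, the loop falls back to a plain serial 'for'.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: compilers with Cilk Plus enabled define __cilk. Elsewhere the
// line loop falls back to an ordinary serial for.
#if defined(__cilk)
#include <cilk/cilk.h>
#define LINE_LOOP cilk_for
#else
#define LINE_LOOP for
#endif

// Hypothetical per-pixel work standing in for the ray tracer's pixel routine.
inline unsigned int shade_pixel(int x, int y) {
    return static_cast<unsigned int>(x * 31 + y * 17);
}

// Renders every scan line. Each iteration writes only to its own row, so the
// iterations are independent and the outer loop can be run in parallel.
void render_lines(std::vector<unsigned int>& image, int width, int height) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    LINE_LOOP (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            image[static_cast<std::size_t>(y) * width + x] = shade_pixel(x, y);
    }
}
```

Only the outer loop over lines is changed. This works because no line reads or writes data that belongs to another line. Any state shared between iterations would have to be made private to each iteration before applying 'cilk_for'.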

Summary:

  • The total elapsed time has decreased from 44.834s to 13.117s, which is about x3.42 faster.
  • This result was achieved through a simple VTune analysis, a small code change in the hotspot, an Intel C++ Compiler option, and a Cilk Plus feature.
  • Intel System Studio's components are designed as a solution that helps developers easily improve their products.
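The speedup factors quoted in this article can be rechecked with a small helper like the one below. The function name is hypothetical, and the elapsed times are the ones reported above.

```cpp
#include <cassert>

// Speedup factor: elapsed time before a change divided by the time after it.
// The helper name is an assumption made for this illustration.
double speedup(double before_s, double after_s) {
    assert(after_s > 0.0);
    return before_s / after_s;
}
```

For example, speedup(44.834, 35.742) is about 1.25, speedup(35.742, 24.979) is about 1.43, speedup(24.979, 13.117) is about 1.90, and the overall speedup(44.834, 13.117) is about 3.42.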
