Introduction
This tutorial presents a step-by-step guide to performance analysis, bottleneck identification, and rendering optimization of an OpenGL ES* 3.0 application on Android*. The sample application, entitled “City Racer,” simulates a road race through a stylized urban setting. Performance analysis of the application is done using the Intel® INDE Graphics Performance Analyzers (Intel® INDE GPA) tool suite.
The combined city and vehicle geometry consists of approximately 230K polygons (690K vertices) with diffuse mapped materials lit by a single non-shadow casting directional light. The provided source material includes the code, project files, and art assets required to build the application, including the source code optimizations identified throughout this tutorial.
Acknowledgements
This tutorial is an Android and OpenGL ES 3.0 version of the Intel Graphics Performance Workshop for 3rd Generation Intel® Core™ Processor (Ivy Bridge) (PDF) created by David Houlton. It ships with Intel INDE GPA.
Tutorial Organization
This tutorial guides you through four successive optimization steps. At each step the application is analyzed with Intel INDE GPA to identify a specific performance bottleneck. An appropriate optimization is then toggled within the application to overcome the bottleneck, and the application is analyzed again to measure the performance gained. The optimizations applied are generally in line with the guidelines provided in the Developer’s Guide for Intel® Processor Graphics (PDF).
Over the course of the tutorial, the applied optimizations improve the rendering performance of City Racer by 83%.
Prerequisites
- The City Racer sample is built using the Android API version 20 and the Android NDK version 10.
- Performance analysis is done using the Intel INDE GPA tool suite.
- Intel INDE GPA is compatible with most Android devices; however, an x86-based Android device will provide the most detailed profiling metrics.
City Racer Sample Application
City Racer is logically divided into race simulation and rendering subcomponents. Race simulation includes modeling vehicle acceleration, braking, turning parameters, and AI for track following and collision avoidance. The race simulation code is in the track.cpp and vehicle.cpp files and is not affected by any of the optimizations applied over the course of this tutorial.
The rendering component draws the vehicles and scene geometry using OpenGL ES 3.0 and our internally developed CPUT framework. The initial version of the rendering code represents a first-pass effort and contains several performance-limiting design choices.
Mesh and texture assets are loaded from the Media/defaultScene.scene file. Individual meshes are tagged as either pre-placed scenery items, instanced scenery with per-instance transformation data, or vehicles for which the simulation provides transformation data. There are several cameras in the scene: one follows each car and an additional camera allows the user to freely explore the scene. All performance analysis and code optimizations are targeted at the vehicle-follow camera mode.
For the purpose of this tutorial, City Racer is designed to start in a paused state, which allows you to walk through each profiling step with identical data sets. City Racer can be unpaused by unchecking the Pause check box in the City Racer HUD or by setting g_Paused = false at the top of CityRacer.cpp.
Optimization Potential
Consider the City Racer application as a functional but non-optimized prototype. In its initial state it provides the visual result desired, but not the rendering performance. It has a number of techniques and design choices in place that limit rendering performance and are representative of those you’d find in a typical game-in-development. The goal of the optimization phase of development is to identify the performance bottlenecks one by one, make code changes to overcome them, and measure the improvements achieved.
Note that this tutorial addresses only a small subset of all possible optimizations that could be applied to City Racer. In particular, it only considers optimizations that can be applied completely in source code, without any changes to the model or texture assets. Other asset-changing optimizations are excluded here simply because they become somewhat cumbersome to implement in tutorial format, but they can be identified using Intel INDE GPA tools and should be considered in a real-world game optimization.
Performance numbers shown in this document were captured on an Intel® Atom™ processor-based system (codenamed Bay Trail) running Android. The numbers may differ on your system, but relative performance relationships should be similar and logically lead to the same performance optimizations.
The optimizations to be applied over the course of the tutorial are found in CityRacer.cpp. They can be toggled through City Racer’s HUD or through direct modification in CityRacer.cpp.
CityRacer.cpp
bool g_Paused = true;
bool g_EnableFrustumCulling = false;
bool g_EnableBarrierInstancing = false;
bool g_EnableFastClear = false;
bool g_DisableColorBufferClear = false;
bool g_EnableSorting = false;
They are enabled one by one as you progress through the optimization steps. Each variable controls the substitution of one or more code segments to achieve the optimization for that step of the tutorial.
Optimization Tutorial
The first step is to build and deploy City Racer on an Android device. If your Android environment is set up correctly, the buildandroid.bat file located in CityRacer/Game/Code/Android will perform these steps for you.
Next, launch Intel INDE GPA Monitor, right click the system tray icon, and select System Analyzer.
System Analyzer will show a list of possible platforms to connect to. Choose your Android x86 device and press “Connect.”
When System Analyzer connects to your Android device, it will display a list of applications available for profiling. Choose City Racer and wait for it to launch.
While City Racer is running, press the frame capture button to capture a snapshot of a GPU frame to use for analysis.
Examine the Frame
Open Frame Analyzer for OpenGL* and choose the City Racer frame you just captured, which will allow you to examine GPU performance in detail.
The timeline at the top is laid out in equally spaced ‘ergs’ of work, each of which usually corresponds to an OpenGL draw call. For a more traditional timeline display, select GPU Duration on both the X and Y axes. This quickly shows which ergs consume the most GPU time and where we should initially focus our efforts. If no ergs are selected, the panel on the right shows the GPU time for the entire frame, which is 55ms.
Optimization 1 – Frustum Culling
Viewing all of the draws shows that many items are drawn that never appear on the screen. Changing the Y-axis to Post-Clip Primitives makes this visible: the gaps in this view mark the draws that are wasted because their geometry is entirely clipped.
The buildings in City Racer are combined into groups according to spatial locations. We can cull out the groups not visible and thus eliminate the GPU work associated with them. By toggling the Frustum Culling check box, each draw will be run through a view-frustum culling routine on the CPU before being submitted to the GPU.
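The CPU-side test described above can be sketched as follows. This is a minimal illustration, not the routine used in City Racer: the `Vec3`, `Plane`, and `Aabb` types and the `IsVisible` function are hypothetical names, and the sketch assumes each building group has an axis-aligned bounding box and that the six frustum planes have normals pointing into the frustum.

```cpp
#include <array>
#include <cmath>

// Hypothetical minimal types; the CPUT framework has its own equivalents.
struct Vec3 { float x, y, z; };

// Plane stored as dot(n, p) + d = 0, with n pointing into the frustum.
struct Plane { Vec3 n; float d; };

// Axis-aligned bounding box stored as a center and half-extents.
struct Aabb { Vec3 center; Vec3 halfExtent; };

// Returns false only when the box lies entirely outside at least one plane.
// The test is conservative: a box near a frustum corner may pass even though
// it is not actually visible, which costs a draw but never drops geometry.
bool IsVisible(const Aabb& box, const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum) {
        // Radius of the box when projected onto the plane normal.
        const float radius = box.halfExtent.x * std::fabs(p.n.x) +
                             box.halfExtent.y * std::fabs(p.n.y) +
                             box.halfExtent.z * std::fabs(p.n.z);
        // Signed distance from the box center to the plane.
        const float distance = p.n.x * box.center.x +
                               p.n.y * box.center.y +
                               p.n.z * box.center.z + p.d;
        if (distance < -radius) {
            return false;  // Entirely behind this plane: skip the draw.
        }
    }
    return true;  // Inside or intersecting the frustum: submit the draw.
}
```

A renderer would call a test like this once per group before submitting its draw call, so the cost is a handful of multiply-adds per group on the CPU in exchange for skipping the group's vertex processing on the GPU.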
Turn on the Frustum Culling check box and use System Analyzer to capture another frame. Once the frame is captured, open it again in Frame Analyzer.
By viewing this frame we can see the number of draws is reduced by 22% from 740 to 576 and our overall GPU time is reduced by 18%.
Optimization 2 – Instancing
While frustum culling reduced the overall number of ergs, there are still a great many small ergs (highlighted in yellow) which, taken cumulatively, add up to a non-trivial amount of GPU time.
By examining the geometry for these ergs we can see the majority of them are the concrete barriers which line the sides of the track.
We can eliminate much of the overhead involved in these draws by combining them into a single instanced draw. By toggling the Barrier Instancing check box the barriers will be combined into a single instanced draw thus removing the need for the CPU to submit each one of them via a draw to the GPU.
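One common way to set up such an instanced draw is to pack every barrier's world transform into a single contiguous buffer that is uploaded once. The sketch below shows only that CPU-side packing step; the `Mat4` and `Barrier` types and `PackInstanceTransforms` are hypothetical, and the actual City Racer implementation may organize its instance data differently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Mat4 { float m[16]; };     // 4x4 matrix, column-major
struct Barrier { Mat4 world; };   // one concrete barrier and its placement

// Packs all per-barrier transforms into one flat float array, 16 floats per
// instance, suitable for upload as a per-instance vertex attribute buffer.
std::vector<float> PackInstanceTransforms(const std::vector<Barrier>& barriers) {
    std::vector<float> packed;
    packed.reserve(barriers.size() * 16);
    for (const Barrier& b : barriers) {
        packed.insert(packed.end(), b.world.m, b.world.m + 16);
    }
    return packed;
}

// With an OpenGL ES 3.0 context current, the packed data would typically be
// uploaded with glBufferData, the four matrix columns exposed as vec4
// attributes advanced once per instance via glVertexAttribDivisor(loc, 1),
// and all barriers drawn with one call:
//   glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
//                           nullptr, instanceCount);
```

The design trade-off is that the per-draw state setup and submission cost is paid once instead of once per barrier, at the price of a slightly more complex vertex shader that reads the transform from instance attributes.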
Turn on the Barrier Instancing check box and use System Analyzer to capture another frame. Once the frame is captured, open it with Frame Analyzer.
By viewing this frame we can see the number of draws is reduced by 90% from 576 to 60.
Draw calls before concrete barrier instancing (top) and after instancing (bottom)
Additionally, the GPU duration is reduced by 71% to 13ms.
Optimization 3 – Front to Back Sorting
The term “overdraw” refers to writing to the same pixel multiple times in a frame; this consumes pixel fill rate and increases frame rendering time. Examining the Samples Written metric shows us that each pixel is being written to approximately 1.8 times per frame (Samples Written / Resolution).
Sorting the draws from front to back before rendering is a relatively straightforward way to reduce overdraw because the GPU pipeline will reject any pixels occluded by previous draws.
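A front-to-back sort can be as small as the sketch below. It assumes each draw carries a precomputed distance from the camera (for example, to the center of its bounding volume); the `DrawItem` type and `SortFrontToBack` function are hypothetical and not taken from the City Racer source.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw record for illustration only.
struct DrawItem {
    int   meshId;     // which mesh to draw
    float viewDepth;  // distance from the camera; smaller means closer
};

// Orders draws nearest-first so that closer geometry fills the depth buffer
// early and the depth test can reject occluded pixels from later draws.
void SortFrontToBack(std::vector<DrawItem>& draws) {
    std::sort(draws.begin(), draws.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```

Sorting per draw rather than per triangle is only an approximation of true depth order, but it is cheap and is usually enough to cut a measurable share of overdraw for opaque geometry.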
Turn on the Sort Front to Back check box and use System Analyzer to capture another frame. Once the frame is captured, open it with Frame Analyzer.
By viewing this frame we can see the Samples Written metric decreased by 6% and our overall GPU time is reduced by 8%.
Optimization 4 – Fast Clear
A final look at our draw times shows the first erg is taking the longest individual GPU time. Selecting this erg reveals that it’s not a draw but a glClear call.
Intel’s GPU hardware has an optimization path that performs a ‘fast clear’ in a fraction of the time it takes a traditional clear. A fast clear can be performed by setting the glClearColor to all black or all white (0, 0, 0, 0 or 1, 1, 1, 1).
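The condition stated above can be captured in a small check. This is an illustrative sketch only: `IsFastClearColor` is a hypothetical helper, and it encodes just the rule given in this tutorial (all components 0 or all components 1), not the driver's actual decision logic.

```cpp
// Returns true when the clear color matches one of the two values this
// tutorial lists as eligible for the fast clear path: (0, 0, 0, 0) or
// (1, 1, 1, 1). Exact comparison is intentional, since the values are
// constants chosen by the application rather than computed results.
bool IsFastClearColor(float r, float g, float b, float a) {
    const bool allZero = r == 0.0f && g == 0.0f && b == 0.0f && a == 0.0f;
    const bool allOne  = r == 1.0f && g == 1.0f && b == 1.0f && a == 1.0f;
    return allZero || allOne;
}

// With an OpenGL ES context current, an eligible clear would look like:
//   glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
//   glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
```

A check like this could be used in a debug build to flag clear colors that fall off the fast path, for example opaque black (0, 0, 0, 1), which does not match either listed value.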
Turn on the Fast Clear check box and use System Analyzer to capture another frame. Once the frame is captured, open it with Frame Analyzer.
By viewing this frame we can see the GPU duration for the clear has decreased by 87% over the regular clear, from 1.2ms to 0.2ms.
As a result, the overall frame duration of the GPU is decreased by 24% to 9.2ms.
Conclusion
This tutorial has taken a representative early-stage game application and used the Intel INDE GPA to analyze application behavior and make targeted changes to improve performance. The changes made and improvements realized were:
Optimization | Before | After | % Improvement |
Frustum Culling | 55.2ms | 45.0ms | 18% |
Instancing | 45.0ms | 13.2ms | 71% |
Sorting | 13.2ms | 12.1ms | 8% |
Fast Clear | 12.1ms | 9.2ms | 24% |
Overall GPU Optimizations | 55.2ms | 9.2ms | 83% |
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
Overall, from the initial implementation of City Racer to the best optimized version, we demonstrate a rendering performance improvement of 300%, from 11 fps to 44 fps. Because this implementation starts out significantly sub-optimal, a developer applying these techniques to a real-world game will probably not see the same absolute performance gain.
Nevertheless, the primary goal of this tutorial is not to optimize this specific sample application, but to demonstrate the potential performance gains you can find by following the recommendations in the Developer’s Guide for Intel Processor Graphics and the usefulness of Intel INDE GPA in finding and measuring those improvements.