Abstract
Games for smartphones and tablets are the most popular category on app stores. In the early days, mobile devices had significant CPU and GPU constraints that affected performance. So most games had to be simple. Now that CPU and GPU performance has increased, more high-end games are being produced. Nevertheless, a mobile processor still has less performance than a PC processor.
With the growth of the mobile market, many PC game developers are now making games for mobile platforms. However, the design decisions and graphic resources of a traditional PC game are not a good fit for mobile processors and may not perform well. This article shows how to analyze and improve the performance of a mobile game and how to optimize graphic resources for a mobile platform, using mTricks Looting Crown as an example. The IA version of Looting Crown is available at the following link.
https://play.google.com/store/apps/details?id=com.barunsonena.looting
Figure 1. mTricks Looting Crown
1. Introduction
mTricks has significant experience in PC game development using a variety of commercial game engines. While planning its next project, mTricks forecasted that the mobile market was ready for a complex MMORPG, given the performance growth of mobile CPUs and GPUs. So it changed the game target platform for its new project from the PC to mobile.
mTricks first ported the PC codebase to Android*. However, performance was lower than expected on the target mobile platforms, including an Intel® Atom™ processor-based platform (code named Bay Trail).
mTricks was encountering two problems that often face PC developers who transition to mobile:
- The low processing power of the mobile processor means that traditional PC graphic resources and designs are unsuitable.
- Due to capability and performance variations among mobile CPUs and GPUs, game display and performance vary on different target platforms.
2. Executive summary
Looting Crown is an SNRPG (Social Network + RPG) style game that supports full 3D graphics and several multiplayer modes (PvP, PvE, and Clan vs. Clan). mTricks developed and optimized the game on a Bay Trail reference design; Table 1 lists its specification.
Table 1. Bay Trail reference design specification and 3DMark score
Bay Trail reference design 10” | |
---|---|
CPU | Intel® Atom™ processor Quad Core 1.46 GHz |
RAM | 2GB |
Resolution | 2560 x 1440 |
3DMark ICE Storm Unlimited Score | 15,094 |
Graphics score | 13,928 |
Physics score | 21,348 |
mTricks used Intel® Graphics Performance Analyzers (Intel® GPA) to find CPU and GPU bottlenecks during development and used the analysis to solve issues of graphic resources and performance.
The baseline performance was 23 FPS. Figure 2 shows the GPU Busy and Target App CPU Load statistics during a 2-minute run: average GPU Busy is about 91%, and Target App CPU Load is about 27%.
Figure 2. Comparing CPU and GPU load of the baseline version with Intel® GPA System Analyzer
3. Where is the bottleneck between CPU and GPU?
There are two ways to determine whether the bottleneck is in the CPU or the GPU. One is to use an override mode, and the other is to change the CPU frequency.
Intel GPA System Analyzer provides the “Disable Draw Calls” override mode to help developers find where the bottleneck is between CPU and GPU. Run the game with and without this override, compare the results, and apply the following guidelines:
Table 2. How to analyze games with Disable Draw Calls override mode
Performance change for “Disable Draw Calls” override mode | Bottleneck |
---|---|
If FPS doesn’t change much | The game is CPU bound; use the Intel® GPA Platform Analyzer or Intel® VTune™ Amplifier to determine which functions are taking the most time |
If FPS improves | The game is GPU bound; use the Intel GPA Frame Analyzer to determine which draw calls are taking the most time |
Intel GPA System Analyzer can simulate the application performance with various CPU settings, which is useful for bottleneck analysis. To determine whether your application performance is CPU bound, do the following:
1. Verify that your application is not Vertical Sync (Vsync) bound. Check the Vsync status: Vsync is enabled if you see the gray highlight in the Intel GPA System Analyzer Notification pane.
   - If Vsync is disabled, proceed to step 2.
   - If Vsync is enabled, review the frame rate in the top-right corner of the Intel GPA System Analyzer window. If the frame rate is around 60 FPS, your application is Vsync bound, and there is no opportunity to increase FPS. Otherwise, proceed to step 2.
2. Force a different CPU frequency using the sliders in the Platform Settings pane (Figure 3) of the Intel GPA System Analyzer window. If the FPS value changes when you modify the CPU frequency, the application is likely to be CPU bound.
Figure 3. Modify the CPU frequency in the Platform Settings pane
Table 3 shows the simulation results for Looting Crown. With the “Disable Draw Calls” override on, the FPS remained unchanged, which would normally indicate that the game was CPU bound. However, the “Highest CPU freq.” override also did not change the FPS, implying that Looting Crown was GPU bound. To resolve this contradiction, we returned to the data in Figure 2, which showed that the GPU load was about 91% and the CPU load was about 27% on the Bay Trail device. The CPU could not be fully utilized because of the GPU bottleneck, so we proceeded with the plan to optimize GPU usage first and then retest.
Table 3. The FPS result of the baseline version with Disable Draw Calls and Highest CPU Frequency.
Bay Trail device | FPS |
---|---|
Original | 23 |
Disable Draw Calls | 23 |
Highest CPU freq. | 23 |
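The decision logic of Tables 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the reasoning, not part of Intel GPA: the `Probe` record, the `classify` function, the 5% tolerance, and the 85% busy threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Measurements from one profiling session (hypothetical record)."""
    baseline_fps: float
    no_draw_calls_fps: float  # FPS with the "Disable Draw Calls" override
    max_cpu_freq_fps: float   # FPS with the CPU forced to its highest frequency
    gpu_busy_pct: float
    app_cpu_load_pct: float


def classify(p: Probe, tolerance: float = 0.05, busy_threshold: float = 85.0) -> str:
    """Combine both experiments; fall back to utilization when they disagree."""
    def improves(fps: float) -> bool:
        return fps > p.baseline_fps * (1.0 + tolerance)

    gpu_signal = improves(p.no_draw_calls_fps)  # removing GPU work helped
    cpu_signal = improves(p.max_cpu_freq_fps)   # a faster CPU helped

    if gpu_signal and not cpu_signal:
        return "GPU bound"
    if cpu_signal and not gpu_signal:
        return "CPU bound"
    # Flat or conflicting results: look at which unit is saturated.
    if p.gpu_busy_pct >= busy_threshold and p.gpu_busy_pct > p.app_cpu_load_pct:
        return "GPU bound (by utilization)"
    if p.app_cpu_load_pct >= busy_threshold:
        return "CPU bound (by utilization)"
    return "inconclusive"


# Baseline numbers from Table 3 and Figure 2.
baseline = Probe(23, 23, 23, gpu_busy_pct=91, app_cpu_load_pct=27)
print(classify(baseline))
```

With the baseline numbers, neither override changes the FPS, so the sketch falls through to the utilization check, which is the same tie-break the authors applied by hand.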
4. Identifying GPU bottlenecks
We found that the performance bottleneck was in the GPU. As a next step, we analyzed the cause of the GPU bottleneck with Intel GPA Frame Analyzer. Figure 4 shows the captured frame information of the baseline version.
Figure 4. Intel® GPA Frame Analyzer view of the baseline version
4.1 Decrease the number of draw calls by merging hundreds of static meshes into one static mesh and using a bigger texture
Tables 4 and 5 show the information captured by Intel GPA Frame Analyzer.
Table 4. The captured frame information of the baseline version
Metric | Value |
---|---|
Total Ergs | 1,726 |
Total Primitive Count | 122,204 |
GPU Duration | 23 ms |
Time to show frame | 48 ms |
Table 5. Draw call cost of the baseline version
Type | Erg | Time(ms) | % |
---|---|---|---|
Clear | 0 | 0.2 ms | 0.5 % |
Ocean | 1 | 6 ms | 13.7 % |
Terrain | 2~977 | 20 ms | 41.9 % |
Grass | 19~977 | 18 ms | 39.0 % |
Character, building and effect | 978~1676 | 19 ms | 40.6 % |
UI | 1677~1725 | 1 ms | 3.4 % |
The total time for “Terrain” is 20 ms, of which “Grass” accounts for 18 ms, or about 90% of the “Terrain” processing time. We therefore analyzed why “Grass” takes so long to draw.
Figures 5 and 6 show the output of the ergs for “Terrain” and “Grass”.
Figure 5. The terrain
Figure 6. Texture of “Grass”
Looting Crown drew the terrain by drawing a small grass quad repeatedly, which resulted in 960 draw calls in “Terrain”. Each small grass quad takes very little time to draw, but every draw call carries its own overhead, so issuing hundreds of them is expensive. We therefore recommended reducing the number of draw calls by merging the hundreds of static meshes into one static mesh and using a bigger texture. Table 6 shows the result.
Table 6. Comparison of draw cost between small and big texture
Measure | Small texture | Big texture |
---|---|---|
Draw time | 18 ms | 6 ms |
Number of ergs | 960 | 1 |
Figure 7. The changed terrain
Although each tile was simple, the tile-based terrain required a lot of draw calls. By reducing the number of draw calls, we saved 12 ms on drawing the “Grass”.
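The idea of merging many small meshes into one can be sketched as follows. This is not mTricks' code or the Havok API: `make_quad` and `merge_meshes` are hypothetical helpers, and the 32 x 30 tile grid is an assumption chosen only because it yields 960 quads. The point is that the merged mesh carries the same geometry but can be submitted in a single draw call.

```python
def make_quad(x, z, size=1.0):
    """Return (vertices, indices) for one flat grass tile made of two triangles."""
    vertices = [
        (x, 0.0, z),
        (x + size, 0.0, z),
        (x + size, 0.0, z + size),
        (x, 0.0, z + size),
    ]
    indices = [0, 1, 2, 0, 2, 3]
    return vertices, indices


def merge_meshes(meshes):
    """Concatenate meshes into one vertex/index buffer pair, re-basing indices."""
    merged_vertices, merged_indices = [], []
    for vertices, indices in meshes:
        base = len(merged_vertices)  # index offset of this mesh in the merged buffer
        merged_vertices.extend(vertices)
        merged_indices.extend(base + i for i in indices)
    return merged_vertices, merged_indices


# 32 x 30 = 960 tiles: one draw call each if submitted separately (assumed layout).
tiles = [make_quad(col, row) for row in range(30) for col in range(32)]
vertices, indices = merge_meshes(tiles)

print(len(tiles), "separate draws ->", 1, "merged draw")
print(len(vertices), "vertices,", len(indices), "indices")
```

A single draw call generally uses one set of bound textures, so the merged tiles have to sample from the same texture. That is why merging the meshes goes together with switching to one bigger texture.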
4.2 Optimizing graphics resources
Tables 7 and 8 show the new information captured by Intel GPA Frame analyzer after applying the big texture for grass.
Table 7. The captured frame information of the 1st optimization version
Metric | Value |
---|---|
Total Ergs | 179 |
Total Primitive Count | 27,537 |
GPU Duration | 24 ms |
Time to show frame | 27 ms |
Table 8. Draw call cost of the 1st optimization version
Type | Erg | Time(ms) | % |
---|---|---|---|
Clear | 0 | 2 ms | 10.4 % |
Ocean | 18 | 6 ms | 23.6 % |
Terrain | 1~17, 19, 23~96 | 14 ms | 54.3 % |
Grass | 19 | 6 ms | 23.2 % |
Character, building and effect | 20~22, 97~131 | 1 ms | 5.9 % |
UI | 132~178 | 1 ms | 5.7 % |
We then checked whether the game was still GPU bound by repeating the “Disable Draw Calls” and “Highest CPU Frequency” measurements.
Table 9. The FPS result of 1st optimization version with “Disable Draw Calls” and “Highest CPU Frequency”
Bay Trail device | FPS |
---|---|
Original | 40 |
Disable Draw Calls | 60 |
Highest CPU freq. | 40 |
In Table 9, the “Disable Draw Calls” simulation increased the FPS, while the “Highest CPU Frequency” simulation did not change it, so Looting Crown was still GPU bound. We also checked the CPU load and GPU Busy again.
Figure 8. CPU and GPU load of the 1st optimization version with Intel® GPA System Analyzer
Figure 8 shows that the GPU load is about 99% and the CPU load is about 13% on Bay Trail. Because of the GPU bottleneck, the CPU still could not be a source of speedup.
Looting Crown was originally developed for PCs, so the existing graphic resources were not suitable for mobile devices, which have lower GPU and CPU processing power. We made the following optimizations to the graphic resources:
- Minimizing draw calls
  - Reduced the number of materials: The number of object materials was reduced from 10 to 2.
  - Reduced the number of particle layers.
- Minimizing the number of polygons
  - Applied LOD (level of detail) for characters using the “Simplygon” tool.
    Figure 9. A character with progressively reduced LOD
  - Minimized the number of polygons used for terrain: First, we minimized the number of polygons for faraway mountains that did not require much detail. Second, we minimized the number of polygons for flat terrain that could be represented by two triangles.
- Using optimized light maps
  - Removed the dynamic lights for “Time of Day”.
  - Minimized the light map size of each mesh: Reduced the number of light maps used for the background.
- Minimizing the changes of render states
  - Reduced the number of materials, which also reduced render state changes and texture changes.
- Decoupling the animation part in static mesh
  - The Havok engine did not support a partial update of only the animated part of an object, so an object with just a small moving mesh was updated in full, including its static mesh part. We therefore separated the animated part (the smoke, circled in red in Figure 10) from the rest of the object, dividing it into two separate object models.
Figure 10. Decoupled animation of the smoke from the static mesh
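The LOD item in the list above relies on picking a simpler mesh as a character moves away from the camera. The sketch below shows one common way to make that choice by distance. The `LOD_LEVELS` thresholds and triangle counts are invented for illustration and are not Looting Crown's or Simplygon's values.

```python
import math

# (maximum camera distance, triangle count) per level; hypothetical values.
LOD_LEVELS = [
    (10.0, 5000),     # LOD0: full detail up close
    (30.0, 2000),     # LOD1
    (80.0, 600),      # LOD2
    (math.inf, 150),  # LOD3: lowest detail for everything farther away
]


def select_lod(camera_pos, object_pos):
    """Return (lod_index, triangle_count) for the first level whose range covers the object."""
    distance = math.dist(camera_pos, object_pos)
    for index, (max_distance, triangles) in enumerate(LOD_LEVELS):
        if distance <= max_distance:
            return index, triangles
    raise AssertionError("unreachable: the last level has infinite range")


camera = (0.0, 0.0, 0.0)
characters = [(3.0, 0.0, 4.0), (0.0, 0.0, 25.0), (60.0, 0.0, 0.0), (200.0, 0.0, 0.0)]

total = sum(select_lod(camera, pos)[1] for pos in characters)
print("triangles with LOD:", total)
print("triangles without LOD:", len(characters) * LOD_LEVELS[0][1])
```

Only the nearest character keeps its full-detail mesh, so the primitive count submitted to the GPU drops while nearby characters look unchanged.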
4.3 Apply Z-culling efficiently
When an object is rendered by the 3D graphics card, its three-dimensional data is projected into two-dimensional (x-y) screen space, and the Z-buffer, or depth buffer, stores the depth (z coordinate) of each screen pixel. If two objects in the scene cover the same pixel, the GPU compares their depths and overwrites the current pixel only if the new object is closer to the observer, so the Z-buffer reproduces the usual depth perception correctly. Z-culling takes advantage of this test: when the closest objects are drawn first, the pixels of farther objects fail the depth test and can be skipped. This improves performance when rendering hidden surfaces.
In Looting Crown, there were two kinds of terrain drawing: ocean drawing and grass drawing. Because large portions of the ocean were behind grass, much of the ocean area was hidden. However, the ocean was rendered earlier than the grass, which prevented efficient Z-culling. Figures 11 and 12 show the GPU duration of drawing ocean and grass, respectively; erg 18 is for ocean and erg 19 is for grass. If grass is rendered before ocean, the depth test shows that the hidden ocean pixels do not need to be drawn, which reduces the GPU duration of drawing the ocean. Figure 13 shows the ocean drawing cost after the second optimization: the GPU duration decreased from 6 ms to 0.3 ms.
Figure 11. Ocean drawing cost of 1st optimization
Figure 12. Grass drawing cost of 1st optimization
Figure 13. Ocean draw cost of 2nd optimization
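The effect of draw order on the depth test can be reproduced with a toy software rasterizer. The sketch below is an illustration only: the 64 x 36 framebuffer, the depth values, and the assumption that grass covers the left 60 columns are made up, and real GPUs implement early depth rejection in hardware. It counts how many fragments pass the depth test, and would therefore be shaded, for each draw order.

```python
def render(layers, width, height):
    """Draw layers in order with a less-than depth test.

    Returns the number of fragments per layer that pass the test,
    i.e. the fragments a GPU with early depth rejection would shade.
    """
    depth = [[1.0] * width for _ in range(height)]  # cleared to the far plane
    shaded = {}
    for name, z, covers in layers:
        count = 0
        for y in range(height):
            for x in range(width):
                if covers(x, y) and z < depth[y][x]:
                    depth[y][x] = z  # closer fragment wins
                    count += 1
        shaded[name] = count
    return shaded


WIDTH, HEIGHT = 64, 36
ocean = ("ocean", 0.9, lambda x, y: True)    # far, covers the whole screen
grass = ("grass", 0.5, lambda x, y: x < 60)  # near, hides most of the ocean (assumed)

ocean_first = render([ocean, grass], WIDTH, HEIGHT)
grass_first = render([grass, ocean], WIDTH, HEIGHT)

print("ocean first:", ocean_first)  # ocean shades every pixel, then grass overdraws it
print("grass first:", grass_first)  # ocean shades only the pixels grass left uncovered
```

With these assumed numbers, the ocean shades 2,304 fragments when drawn first but only 144 when drawn after the grass. This mirrors the direction, though not the magnitude, of the drop from 6 ms to 0.3 ms reported above.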
Results
By taking these steps, mTricks optimized all graphics resources for mobile devices without compromising graphics quality. The number of ergs decreased from 1,726 to 124, and the primitive count decreased from 122,204 to 9,525.
Figure 14. The change of graphics resource
Figure 15 and Table 10 show the outcome of all these optimizations. After optimizations, FPS changed from 23 FPS to 60 FPS on the Bay Trail device.
Figure 15. FPS Increase
Table 10. Changed FPS, GPU Busy, and App CPU Load
Metric | Baseline | 1st Optimization | 2nd Optimization |
---|---|---|---|
FPS | 23 FPS | 45 FPS | 60 FPS |
GPU Busy(%) | 91% | 99% | 71% |
App CPU Load(%) | 27% | 13% | 22% |
After the first optimization, the game was still GPU bound on Bay Trail. The second optimization reduced the GPU workload by optimizing the graphic resources and Z-buffer usage. Finally, the Bay Trail device hit 60 FPS. Because Android uses Vsync, 60 FPS is the maximum performance on the Android platform.
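The relationship between frame time and the Vsync cap can be expressed as a small helper. This is a simplified model that assumes a 60 Hz display and ignores the frame-rate quantization that Vsync can introduce; `fps_for_frame_time` is a hypothetical name. The 48 ms and 27 ms inputs are the “Time to show frame” values of single captured frames in Tables 4 and 7, so they only roughly track the average FPS reported in Table 10.

```python
def fps_for_frame_time(frame_ms: float, refresh_hz: float = 60.0) -> float:
    """Frame rate implied by a frame time, capped at the display refresh rate."""
    if frame_ms <= 0:
        raise ValueError("frame time must be positive")
    return min(1000.0 / frame_ms, refresh_hz)


budget_ms = 1000.0 / 60.0  # time available per frame at 60 Hz
print(f"frame budget: {budget_ms:.1f} ms")
for frame_ms in (48.0, 27.0, 10.0):
    print(f"{frame_ms:.0f} ms -> {fps_for_frame_time(frame_ms):.1f} FPS")
```

Under this model, any frame that finishes within about 16.7 ms reports 60 FPS, so further GPU savings past that point no longer show up in the frame rate.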
Conclusion
When you start to optimize a game, first determine where the application bottleneck is. Intel GPA can help you do this with some powerful analytic tools. If your game is CPU bound, then Intel VTune Amplifier is a helpful tool. If your game is GPU bound, then you can find more detail using Intel GPA. To fix GPU bottlenecks, you can try to find an efficient way of reducing draw calls, polygon count, and render state changes. You can also check the right size of terrain texture, animation objects, light maps, and the right order of z-buffer culling.
About the Authors
Tai Ha is an application engineer focusing on enabling online games in the APAC region. He has been working for Intel since 2005, covering Intel® Architecture optimization on Healthcare, Server, Client, and Mobile platforms. Before joining Intel, Tai had worked since 1999 as a security middleware architect for biometric companies based in Santa Clara, USA. He received his BS in Computer Science from Hanyang University, Korea.
Jackie Lee is an Applications Engineer with Intel's Software Solutions Group, focused on performance tuning of applications on Intel® Atom™ platforms. Prior to Intel, Jackie Lee worked at LG in the electronics CTO department. He received his MS and BS in Computer Science and Engineering from The ChungAng University.
References
The Looting Crown IA version is available on Google Play:
https://play.google.com/store/apps/details?id=com.barunsonena.looting
Intel® Graphics Performance Analyzers
https://software.intel.com/en-us/vcsource/tools/intel-gpa
Havok
http://www.havok.com
mTricks
https://www.facebook.com/mtricksgame
Intel, the Intel logo, and Atom are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.