By Steve Hughes
Introduction
Most people, including gamers, pigeonhole computing devices as either desktop or mobile and expect high-end effects on their desktop apps and lower level streamlined effects on their mobile devices. They usually accept the gulf between the devices and don’t complain. However, when I started looking at the 4th generation Intel® Atom™ processor (codenamed Bay Trail) late last year, I realized that it is no lazy piece of HW. In fact, I saw the potential to add some significant desktop-style effects to the right game and with a bit of work produce a real showpiece app to demonstrate its capabilities. After a quick look around, I decided to work with Gameloft on its racing title GT Racing 2 (GTR2). I already knew the team at Gameloft, and they’ve always been eager to go the extra mile to optimize performance and make their games stand out.
In this article I will describe the effects implemented by Gameloft in GTR2 and focus on how we managed to fit those effects into the 30 frames per second (FPS) budget we had set ourselves. We were also limited by time since we wanted to show off what we’d achieved at GDC in SF 2014, and the end of 2013 was already fast approaching.
The effects
The exact effects used in the game were chosen by Gameloft, as they knew which effects they most wanted to include. This is only fair: they know their engine, and we needed to get the effects in quickly so we could spend time on optimization. Figures 1 and 2 show the before and after images, making clear how much we managed to enhance the image with the extra CPU and GPU time we had on the x86 device.
Figure 1. The normal appearance of GTR2 on existing ARM* devices. Great models but a bit of a letdown with normal lighting.
Figure 2. GTR2 on Intel® Atom™ processor-based tablets showing enhanced visuals from the bloom and light shaft effects.
Light Shafts
To achieve the light shafts effect, the sun is rendered to a second render target and a radial blur is then applied, smearing the image outward in the direction away from the sun’s position. This is carried out on a low-resolution render target in several passes, and the results look like this:
Figure 3. Initial render of the sun to a low-resolution render target. The sun here is occluded by opaque scene objects to get the shape seen here.
Figure 4. Next, a set of radial blur passes is applied to get the image on the right. From the original partially obscured sun we now get the glow loosely representing airborne particles colliding with direct sunlight.
Figure 5. The blurred image is then added back in to the original frame and the final effect can be seen here. This effect is applied in real time during the race.
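The radial blur step can be sketched on the CPU as follows. This is a minimal Python illustration and not Gameloft's shader: the function name `radial_blur`, the sample count, and the decay value are all assumptions chosen for readability. Each pixel gathers samples along the line toward the sun, so brightness from the sun smears outward, away from its position.

```python
def radial_blur(image, sun_x, sun_y, samples=8, decay=0.9):
    """One radial blur pass over a grayscale image (list of rows of floats).

    Hypothetical helper: samples and decay are illustrative values only.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Step from this pixel toward the sun in equal increments.
            dx = (sun_x - x) / samples
            dy = (sun_y - y) / samples
            sx, sy = float(x), float(y)
            total, weight = 0.0, 1.0
            for _ in range(samples):
                # Nearest-pixel lookup, clamped to the image bounds.
                ix = min(max(int(sx + 0.5), 0), width - 1)
                iy = min(max(int(sy + 0.5), 0), height - 1)
                total += image[iy][ix] * weight
                weight *= decay  # samples nearer the sun count for less
                sx += dx
                sy += dy
            out[y][x] = total / samples
    return out


# A 5x5 "sun only" render target: one bright pixel in the centre.
sun_mask = [[0.0] * 5 for _ in range(5)]
sun_mask[2][2] = 1.0
shafts = radial_blur(sun_mask, sun_x=2, sun_y=2)
```

Running the function several times on its own output would approximate the multiple passes mentioned above; on a real GPU the same gather would live in a fragment shader working on the low-resolution target.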
Bloom
Bloom is a fairly stock effect but easy to get wrong. The object of bloom is to simulate the effect of a sudden bright light in a scene saturating the image and leaking out into the scene around it.
Bloom is completed in three stages:
1: The original scene is filtered to remove any dark pixels leaving only the bright pixels in the scene. This image is written to another render target (Figure 6).
Figure 6. Light pixels are extracted from the original image.
2: The new render target containing the light pixels is then blurred. This is to simulate the bright pixels leaching into the surrounding dark pixels (Figure 7).
Figure 7. Light pixel image is then blurred.
3: Finally the blurred light pixel render target is added to the original scene to produce the bloom effect (Figure 8).
Figure 8. Blurred light pixel images are added to original scene with some scaling to produce the final bloom image.
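The three bloom stages can be sketched in a few lines of Python on a tiny grayscale image. This is only an illustration under assumptions: the function names, the threshold, the scale, and the 3x3 box blur (a stand-in for whatever blur kernel the game actually uses) are not taken from GTR2.

```python
def bright_pass(scene, threshold):
    """Stage 1: keep only pixels brighter than the threshold."""
    return [[p if p > threshold else 0.0 for p in row] for row in scene]


def box_blur(image):
    """Stage 2: 3x3 box blur with clamped edges (stand-in for the real blur)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for oy in (-1, 0, 1):
                for ox in (-1, 0, 1):
                    sy = min(max(y + oy, 0), height - 1)
                    sx = min(max(x + ox, 0), width - 1)
                    total += image[sy][sx]
            out[y][x] = total / 9.0
    return out


def add_bloom(scene, blurred, scale):
    """Stage 3: add the scaled blur back onto the scene, clamped to 1.0."""
    return [[min(s + b * scale, 1.0) for s, b in zip(srow, brow)]
            for srow, brow in zip(scene, blurred)]


# One bright pixel surrounded by dim ones.
scene = [[0.2, 0.2, 0.2],
         [0.2, 1.0, 0.2],
         [0.2, 0.2, 0.2]]
bright = bright_pass(scene, threshold=0.5)
result = add_bloom(scene, box_blur(bright), scale=0.5)
```

After the final stage the dim neighbours of the bright pixel are slightly lifted, which is the "leaking" the effect is meant to simulate.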
Depth of field
To achieve depth of field, we start with the game scene and apply a horizontal and vertical Gaussian blur pass.
Figure 9. The original game scene
After the two stages, we can see that the whole image is now blurred (Figure 10). We now have a blurred image and the original sharp image, along with a depth buffer from the original render pass. The next step is to select a depth value which will be our focal point - such as the center of the car. For each pixel on the screen, we blend the blurred image and the sharp image based on the difference between the depth of the current pixel and the focal point depth value. Pixels farther away in depth from the focal point will have a greater contribution from the blurred image, while pixels with a depth value close to the focal point will have a greater contribution from the sharp image.
Figure 10. Blurred out of focus copy of game scene
Figure 11. Depth of field in action on Bay Trail. We left this effect to the menu and other non-game screens, because accurate distance vision is important in racing.
The net result (Figure 11) is a fairly good approximation to depth of field images such as you would get from a camera.
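The per-pixel blend described above can be written down compactly. The sketch below is an assumption-laden illustration in Python: `focus_range` (the depth distance at which a pixel becomes fully blurred) is a hypothetical parameter, and a linear ramp is only one possible blend curve.

```python
def depth_of_field(sharp, blurred, depth, focal_depth, focus_range):
    """Blend sharp and blurred images per pixel by distance from the focal depth.

    focus_range is a hypothetical tuning value: pixels at least this far
    (in depth units) from focal_depth use the blurred image only.
    """
    out = []
    for srow, brow, drow in zip(sharp, blurred, depth):
        row = []
        for s, b, d in zip(srow, brow, drow):
            # 0.0 at the focal point, rising linearly to 1.0 at focus_range.
            t = min(abs(d - focal_depth) / focus_range, 1.0)
            row.append(s * (1.0 - t) + b * t)
        out.append(row)
    return out


# Two pixels: one exactly at the focal depth, one far behind it.
final = depth_of_field(
    sharp=[[1.0, 1.0]],
    blurred=[[0.5, 0.5]],
    depth=[[0.3, 0.9]],
    focal_depth=0.3,
    focus_range=0.2,
)
```

The in-focus pixel keeps its sharp value, while the distant pixel takes the blurred value.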
Heat Haze
Figure 12. Heat haze was reserved for the start grid, where it gave a realistic heat feel to the cars before the race start.
Heat haze effects try to simulate the air shimmer you see rising from heated objects in sunlight (Figure 12). The effect is created by applying an animating distortion effect to the original color buffer. To confine the effect to the region around the car the effect is masked by an alpha channel image (Figure 13).
Figure 13. Heat haze mask generated from the camera viewpoint.
The effect was confined to the starting grid because accurate distance vision is essential to successful racing.
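As a rough illustration of a masked, animated distortion, the Python sketch below shifts each pixel's horizontal lookup by a sine wave that changes over time and is scaled by the mask. The sine-based wobble, the function name, and the `amplitude` and `frequency` parameters are assumptions for this example; the article does not say how the game generates its distortion.

```python
import math


def heat_haze(color, mask, time, amplitude=1.0, frequency=2.0):
    """Distort a grayscale color buffer horizontally, confined by a mask.

    Hypothetical sketch: mask values are 0.0 (no haze) to 1.0 (full haze).
    """
    height, width = len(color), len(color[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Horizontal wobble that varies per row and animates over time,
            # scaled by the mask so unmasked pixels stay put.
            offset = math.sin(y * frequency + time) * amplitude * mask[y][x]
            sx = min(max(math.floor(x + offset + 0.5), 0), width - 1)
            row.append(color[y][sx])
        out.append(row)
    return out


color = [[0.0, 1.0, 2.0]]
untouched = heat_haze(color, [[0.0, 0.0, 0.0]], time=1.0)
shimmered = heat_haze(color, [[1.0, 1.0, 1.0]], time=math.pi / 2)
```

With a zero mask the buffer is returned unchanged, which is how the effect stays confined to the region around the car.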
Getting started on Optimization
Developers often view game optimization as a path of diminishing returns. By that I mean a lot of work generally goes into optimizing a game to an average frame time of 33ms for 30FPS, but there is generally no point optimizing a game past 30FPS because that is the rate at which it will be expected to run. However, on mobile devices this is not true. In all we had about 12ms worth of effects to add, which would have increased the frame time to 45ms (nearer 22FPS). This meant we had to remove 12ms from an already optimized game to achieve our final target frame rate with all the effects turned on.
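The budget arithmetic is simple enough to check in a few lines of Python. The variable and function names here are just for illustration; the numbers are the ones quoted above.

```python
def fps_from_frame_ms(frame_ms):
    """Convert an average frame time in milliseconds to frames per second."""
    return 1000.0 / frame_ms


budget_ms = 33.0    # frame time for roughly 30FPS
effects_ms = 12.0   # estimated cost of the new effects

with_effects_ms = budget_ms + effects_ms
required_savings_ms = with_effects_ms - budget_ms

print(f"{with_effects_ms:.0f}ms per frame is about "
      f"{fps_from_frame_ms(with_effects_ms):.1f}FPS")
print(f"Savings needed to get back to budget: {required_savings_ms:.0f}ms")
```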
The place to start in any optimization process is to look at whether the game is GPU or CPU bound, that is, to determine whether GPU or CPU code needs optimizing to improve frame time. Using System Analyzer, part of Graphics Performance Analyzers (GPA), we captured data for the following graph:
Figure 14. GPU Busy hovers around 90-100% for most of the race, while the CPU averages around 25%. It’s fairly clear that the app is GPU bound, which is reasonable for a racing game.
It’s pretty easy to make this graph. Simply add the metrics you want to System Analyzer, then hit the “CSV” button to dump them out to a CSV file. You can then load the file into Excel* or other graphing software.
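If you would rather script the analysis than use a spreadsheet, something like the following Python sketch could average the exported metrics. The column names and the three sample rows are made up for illustration and are not real measurements or the actual System Analyzer export format; check the header of your own CSV file and adjust accordingly.

```python
import csv
import io

# Made-up sample in a hypothetical layout: one column per metric.
SAMPLE_CSV = """GPU Busy %,CPU Load %
95,24
92,26
98,25
"""


def average_metrics(csv_text):
    """Return the mean of every numeric column in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals, count = {}, 0
    for row in reader:
        count += 1
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + float(value)
    return {name: total / count for name, total in totals.items()}


averages = average_metrics(SAMPLE_CSV)
# Crude heuristic: the busier processor is the likely bottleneck.
gpu_bound = averages["GPU Busy %"] > averages["CPU Load %"]
```

For a real capture you would read the file from disk with `open(path, newline="")` and pass the file object to `csv.DictReader` instead of using the in-memory sample.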
A lot of developers don’t know that GPA works great on x86 mobile devices. It’s a great set of tools and well worth looking at.
Drilling down on a frame
We captured a number of frames from the game before the effects were added using System Analyzer then opened them with Frame Analyzer to see what low hanging fruit we could find. Figure 15 shows a frame of the game before the effects were added that I used a lot in the early stages:
Figure 15. Frame is split in to two halves. Some big GPU events occur in the last half.
First and most obvious are the two calls to glClear() in the second half of the frame highlighted in purple in the frame graph. This is an issue I often find in engines - render targets tend to be cleared first even though they are going to be fully written to later. Removing these was an easy fix that gave us about 5ms, getting us well on the way.
The big blue bar in between the two glClear() calls is an interesting event. We had been experimenting with the screen size of earlier development kits that Gameloft received, and with very large screens (2560x1900) it was more efficient for them to render to a lower-resolution back buffer and then upsample to the full-size screen. The event in question is the upsample from the back buffer to the screen. This is a huge event and needed some scrutiny. What I found here was that most of the time the EUs on the GPU were stalled waiting for the texture sampler on this erg (GPA’s term for a unit of GPU work such as a draw call). This made sense because the fragment shader was very simple and the texture being copied was huge (>8MB), so naturally the shader would spend a lot of time waiting for the data it needed in order to complete. This led me to think that we could probably render to a full-size render target and get rid of the upsample. The net result was not a performance improvement, because the time we gained was used up by rendering to the larger target. What we did gain was a fair bit of visual improvement.
The last thing of note in this frame was the 4 big ergs labelled A, B, C, and D. You may have noticed that my approach here is to look at all the big ergs and see what can be done to remove or reduce them. That’s the best way to get started with Graphics Performance Analyzers. In cases like these 4 ergs, we could do very little: these are the 4 cars in view in the frame. This is a racing game so it is only right to devote a fair amount of rendering time to the cars.
Platform Analyzer Investigation
One place we looked for performance was Platform Analyzer, which is a relatively new tool in GPA. With Platform Analyzer you can look at the CPU / GPU holistically and see how the queues are managed on the GPU (Figure 16). I was startled to see that we had a problem with the driver that was hurting us:
Figure 16. From Platform Analyzer. Horizontal scale is time; the stacked chart at the top shows queue depth on the GPU.
At first it looks like the GPU queues are always full and everything is fine. However, looking closely at the marked points we noticed that about every 10 frames an event occurred that stalled the whole process and drained the queue, grinding almost to a halt before starting up again. We spent some time looking for a periodic draw call that had some kind of dependency, but it was hard to know what to look for.
This one turned out to be an Intel graphics driver issue. As often happens with prerelease HW, the drivers were still being worked on. The stall had been fixed a few weeks before, but we hadn’t updated the driver because we were otherwise happy with the one we had. We’re not sure of the actual improvement we got from a frames-per-second perspective, but we did get a much smoother frame rate as a result of the driver fix.
Drilling down into the effects
At this point we had gained about 5ms frame time and improved the visual quality. We still needed to find about another 7ms so we decided to look at the effects themselves. We weren’t going to skimp on visual quality, but since they were all new we thought there might be some performance gains to find.
Figure 17. Frame Analyzer capture with bloom and light-shaft effects added. Note that glClears() are gone, so predictably there is a lot more time spent in the second half of the frame, where all the post processing for the effects takes place.
Looking at this frame (Figure 17) we were drawn to the ergs labelled B and C, which turned out to be blur and bright passes for the bloom effect. These were consuming 3-4ms each which we figured looked a little high. After investigation Gameloft made some significant changes here to the effect which resulted in a significant performance increase.
Firstly, they found that the blur stages were being executed on a full-screen texture. This was reduced to a quarter of the screen size, and the result was that the blur almost dropped from the Frame Analyzer display altogether.
Secondly, the bright pass render target was in full HD. Gameloft found that this could be safely reduced in size to about half screen without visual changes, gaining another significant increase in performance.
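To get a feel for why shrinking a render target helps so much, it is worth counting pixels, since the cost of a full-screen pass scales roughly with the number of pixels shaded. The Python sketch below assumes a 1920x1080 (full HD) target and assumes the scale factor applies to each axis; the article does not say whether "quarter" and "half" refer to each dimension or to total area, so treat the exact ratios as illustrative.

```python
def scaled_pixels(width, height, scale):
    """Pixel count of a render target scaled by `scale` along each axis."""
    return int(width * scale) * int(height * scale)


full = scaled_pixels(1920, 1080, 1.0)      # 1920 x 1080
half = scaled_pixels(1920, 1080, 0.5)      # 960 x 540
quarter = scaled_pixels(1920, 1080, 0.25)  # 480 x 270

print(f"half-size target shades {half / full:.2%} of the pixels")
print(f"quarter-size target shades {quarter / full:.2%} of the pixels")
```

Under this per-axis assumption, halving each dimension cuts the shaded pixels to a quarter and quartering each dimension cuts them to a sixteenth, which is consistent with the blur almost disappearing from the capture.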
After the bloom render targets had been optimized and we had gained some performance, we started to look more closely at the bloom itself. The general consensus was that the bloom looked a little washed out (Figure 18), so after verifying that the blur and bright pass textures looked ok, we took a look at the shaders.
The math in the bloom shader looked a little complex compared to a typical bloom shader like this fragment:
lowp vec4 bloom = texture2D(blur, vCoord0) * 1.5 - threshold;
gl_FragColor = bloom * bloomFactor;
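For readers who do not write GLSL, the same per-pixel math can be mirrored in Python. This sketch makes two assumptions that the fragment itself does not state: that `threshold` and `bloomFactor` are scalars, and that the result is written to a normalized color target, so each channel ends up clamped to the 0.0 to 1.0 range.

```python
def bloom_fragment(blur_rgba, threshold, bloom_factor):
    """Python mirror of the typical bloom fragment for a single pixel.

    blur_rgba is the RGBA value sampled from the blurred texture.
    threshold and bloom_factor are assumed to be scalars here.
    """
    def clamp01(value):
        return min(max(value, 0.0), 1.0)

    # bloom = texture2D(blur, vCoord0) * 1.5 - threshold
    bloom = [channel * 1.5 - threshold for channel in blur_rgba]
    # gl_FragColor = bloom * bloomFactor (clamped by the normalized target)
    return tuple(clamp01(channel * bloom_factor) for channel in bloom)


pixel = bloom_fragment((1.0, 0.5, 0.2, 1.0), threshold=0.5, bloom_factor=1.0)
```

Channels that fall below the threshold after scaling go negative and clamp to zero, so only the brighter parts of the blurred image contribute to the bloom.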
As an experiment I used a little-known feature of GPA Frame Analyzer where you can modify shaders in a captured frame and recompile them to see the difference in appearance, performance, etc. It didn’t take long to write a shader that did a simple bloom within the confines of the frame (you could change the source, but you couldn’t touch inputs or constants in GPA at the time).
The shader ran a tiny bit faster than the original shader, but the significant contribution from the shader changes was the visual quality. As a result, a new shader was created for the bloom pass which made the effect significantly better. Compare Figure 18 with Figure 19 to see the difference we saw.
Figure 18. Bloom effect showing the “washed out feel” of the shadows and the rocks on the left.
Figure 19. New bloom, which looks almost HDR compared to the old one.
Conclusions
The aim of this project was to take a game already optimized to 30FPS and optimize it further to gain enough ms per frame to allow room for about 12ms of effects to be added. We managed to pull about 5ms from the game itself and save another 7ms from the effects themselves and as a result of driver fixes. We managed to prove that modern mobile devices like Bay Trail are capable of executing effects that were previously reserved for consoles and desktop GPUs. None of what we did would have been possible without GPA and a great working relationship with Gameloft.
About Gameloft
A leading global publisher of digital and social games, Gameloft® has established itself as one of the top innovators in its field since 2000. Gameloft creates games for all digital platforms, including feature phones, smartphones, tablets, set-top boxes and connected TVs. Gameloft operates its own established franchises such as Asphalt®, Real Football®, Modern Combat and Order & Chaos®, and also partners with major rights holders including Marvel®, Hasbro®, FOX®, Mattel® and Disney®. Gameloft is present on all continents, distributes its games in over 120 countries and employs over 5,200 developers.
For more information, consult http://www.gameloft.com.
About the Author
Steve is a senior application engineer at Intel, providing technical support to game developers in the areas of 3D graphics enabling and multi-threading solutions on PC and mobile devices. Steve has 14 years of experience as a programmer in the gaming industry where he worked on 11 titles, went through 2 bankruptcies, and generally had a good time. Steve is a keen gamer, writes and plays music, and isn’t a writer!