The release of Epic’s Unreal* Engine 4.19 marks a new chapter in optimizing for Intel technology, particularly in the case of optimizing for multicore CPUs. In the past, game engines followed console design points in terms of graphics features and performance. In general, most games weren’t optimized for the CPU, which left a lot of PC performance sitting idle. Intel’s work with Unreal Engine 4 seeks to unlock that potential as soon as developers start working in the engine, so that games can take full advantage of the extra CPU computing power that a PC platform provides.
Intel's enabling work for Unreal Engine version 4.19 delivered the following:
- Increased the number of worker threads to match a user’s CPU
- Increased the throughput of the cloth physics system
- Integrated support for Intel® VTune™ Amplifier
Each of these advances enables Unreal Engine users to take full advantage of Intel® Architecture and harness the power of multicore systems. Systems such as cloth physics, dynamic fracturing, and CPU particles all benefit, as does interoperability with Intel tools such as Intel VTune Amplifier and the Intel C++ Compiler. This white paper discusses the key improvements in detail and gives developers more reasons to consider the Unreal Engine for their next PC title.
Unreal* Engine History
Back in 1991, Tim Sweeney founded Epic MegaGames (later dropping the “Mega”) while still a student at the University of Maryland. His first release was ZZT*, a shareware puzzle game. He wrote the game in Turbo Pascal using an object-oriented model, and one of the happy results was that users could actually modify the game’s code. Level editors were already common, but this was a big advance.
In the years that followed, Epic released popular games such as Epic Pinball*, Jill of the Jungle*, and Jazz Jackrabbit*. In 1995, Sweeney began work on a first-person shooter to capitalize on the success of games such as DOOM*, Wolfenstein*, Quake*, and Duke Nukem*. In 1998, Epic released Unreal*, probably the best-looking shooter of its time, offering more detailed graphics and capturing industry attention. Soon, other developers were calling and asking about licensing the Unreal Engine (UE) for their own games.
In an article for IGN in 2010, Sweeney recalled that the team was thrilled by the inquiries, and said their early collaboration with those partners defined the style of their engine business from day one. They continue to use, he explained, “a community-driven approach, and open and direct communication between licensees and our engine team.” By focusing on creating cohesive tools and smoothing out technical hurdles, their goal was always to unleash the creativity of the gaming community. They also provided extensive documentation and support, something early engines often lacked.
Today, the UE powers most of the top revenue-producing titles in the games industry. In an interview with VentureBeat in March 2017, Sweeney said developers have made more than USD 10 billion to date with Unreal games. “We think that Unreal Engine’s market share is double the nearest competitor in revenues,” Sweeney said. “This is despite the fact that Unity* has more users. This is by virtue of the fact that Unreal is focused on the high end. More games in the top 100 on Steam* in revenue are Unreal, more than any other licensable engine competitor combined.”
Intel Collaboration Makes Unreal Engine Better
Game developers who currently license the UE can easily take advantage of the optimizations described here. The work will help them grow market share for their games by broadening the range of available platforms, from laptops and tablets with integrated graphics to high-end desktops with discrete graphics cards. The optimizations will benefit end users on most PC-based systems by ensuring that platforms can deliver high-end effects such as dynamic cloth and interactive physics. In addition, optimized Intel tools will continue to make Intel Architecture a preferred platform.
According to Jeff Rous, Intel Developer Relations Engineer, the teams at Intel and Epic Games have collaborated since the late 1990s. Rous has personally worked on UE optimization for about six years, involving extensive collaboration and vibrant communication with Epic engineers over email and conference calls, as well as visits to Epic headquarters in North Carolina two or three times a year for week-long deep dives. He has worked on specific titles, such as Epic’s own Fortnite* Battle Royale, as well as UE code optimization.
Prior to the current effort, Intel worked closely with Epic on previous UE4 releases. There is a series of optimization tutorials at the Intel® Developer Zone, starting with the Unreal* Engine 4 Optimization Tutorial, Part 1. The tutorials cover the tools developers can use inside and outside of the engine, as well as best practices for the editor and for scripting that help increase the frame rate and stability of a project.
Intel® C++ Compiler Enhancements
For UE 4.12, Intel added support for the Intel C++ Compiler into the public engine release. Intel C++ Compilers are standards-based C and C++ tools that speed application performance. They offer seamless compatibility with other popular compilers, dev environments, and operating systems, and boost application performance through superior optimizations and single instruction multiple data (SIMD) vectorization, integration with Intel® Performance Libraries, and by leveraging the latest OpenMP* 5.0 parallel programming models.
Figure 1: Scalar and vectorized loop versions with Intel® Streaming SIMD Extensions, Intel® Advanced Vector Extensions, and Intel® Advanced Vector Extensions 512.
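As a rough illustration of the kind of loop Figure 1 refers to, the sketch below shows a scalar multiply-add loop whose iterations are independent of one another. The function name ScaleAndAdd is hypothetical and is not engine code. Whether a compiler actually emits Intel SSE, Intel AVX, or Intel AVX-512 instructions for it depends on the compiler, its flags, and the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example, not engine code: computes y[i] = a * x[i] + y[i].
// No iteration reads a value written by another iteration, so a vectorizing
// compiler is free to process several elements per instruction.
void ScaleAndAdd(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const float* in = x.data();
    float* out = y.data();
    const std::size_t n = y.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = a * in[i] + out[i];
    }
}
```

A scalar build handles one element per iteration. A vectorized build of the same source handles a group of elements per iteration, with the group size set by the register width of the chosen instruction set.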
Since UE 4.12, Intel has continued to keep the code base up to date, and tests on the Infiltrator workload show significant improvements in frame rates.
Texture compression improvement
UE4 also launched with support for Intel’s fast texture compressor, which is built with ISPC. ISPC stands for Intel® SPMD (single program, multiple data) program compiler, and allows developers to easily target multicore CPUs and both current and future instruction sets through the use of a code library. Before the ISPC texture compression library was integrated, ASTC (Adaptive Scalable Texture Compression), the newest and most advanced texture compression format, would often take minutes to compress a single texture. On the Sun Temple* demo (part of the UE4 sample scenes pack), the time it took to compress all textures went from 68 minutes to 35 seconds, with better quality than the reference encoder that was used previously. This allows content developers to build their projects faster, saving hours per week of a typical developer’s time.
Optimizations for UE 4.19
Intel’s work specifically with UE 4.19 offers multiple benefits for developers. At the engine level, optimizations improve scaling mechanisms and tasking. Other work at the engine level ensures that the rendering process isn’t a bottleneck due to CPU utilization.
In addition, the many middleware systems employed by game developers will also benefit from optimizations. Physics, artificial intelligence, lighting, occlusion culling, virtual reality (VR) algorithms, vegetation, audio, and asynchronous computing all stand to benefit.
To help understand the benefits of the changes to the tasking system in 4.19, an overview of the UE threading model is useful.
UE4 threading model
Figure 2 represents time, going from left to right. The game thread runs ahead of everything else, while the render thread is one frame behind the game thread. Whatever is displayed thus runs two frames behind.
Figure 2: Understanding the threading model of Unreal Engine 4.
Physics work is generated on the game thread and executed in parallel. Animation is also evaluated in parallel. Evaluating the animation in parallel was used to good effect in the recent VR title, Robo Recall*.
The game thread, shown in Figure 3, handles updates for gameplay, animation, physics, networking, and most importantly, actor ticking.
Developers can control the order in which objects tick by using Tick Groups. Tick Groups don’t provide parallelism, but they do allow developers to control dependent behavior to better schedule parallel work. This is vital to ensure that any parallel work does not cause a game thread bottleneck later.
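The ordering idea behind Tick Groups can be sketched as follows. This is a simplified illustration, not the engine's API: the TickGroup enum, the Tickable struct, and BuildTickOrder are hypothetical names, and the three groups are only a subset chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified tick groups: lower values tick earlier in the frame.
enum class TickGroup { PrePhysics = 0, DuringPhysics = 1, PostPhysics = 2 };

struct Tickable
{
    std::string Name;
    TickGroup Group;
};

// Returns object names in the order they would tick. A stable sort keeps the
// original relative order of objects that share a group.
std::vector<std::string> BuildTickOrder(std::vector<Tickable> Objects)
{
    std::stable_sort(Objects.begin(), Objects.end(),
        [](const Tickable& A, const Tickable& B) { return A.Group < B.Group; });

    std::vector<std::string> Order;
    Order.reserve(Objects.size());
    for (const Tickable& Object : Objects)
    {
        Order.push_back(Object.Name);
    }
    return Order;
}
```

Grouping does not make anything run in parallel by itself. It only guarantees that, for example, an object that reads physics results ticks after the objects that feed the simulation, which is what lets the dependent work be scheduled safely.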
Figure 3: Game thread and related jobs.
As shown below in Figure 4, the render thread handles generating render commands to send to the GPU. Basically, the scene is traversed, and then command buffers are generated to send to the GPU. The command buffer generation can be done in parallel to decrease the time it takes to generate commands for the whole scene and kick off work sooner to the GPU.
Figure 4: The render thread model relies on breaking draw calls into chunks.
Each frame is broken down into phases that are done one after another. Within each phase, the render thread can go wide to generate the command lists for that phase:
- Depth prepass
- Base pass
- Translucency
- Velocity
Breaking the frame into chunks enables farming them into worker tasks with a parallel command list that can be filled up with the results of those tasks. Those get serialized back and used to generate draw calls. The engine doesn’t join worker threads at the call site, but instead joins at sync points (end of phases), or at the point where they are used if fast enough.
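A minimal sketch of this chunk-and-join pattern is shown below, using standard C++ futures rather than the engine's task graph. All names (CommandList, RecordChunk, BuildPhaseCommands) are hypothetical, and a command is represented as a plain string purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

using CommandList = std::vector<std::string>;

// Records commands for draw calls in [Begin, End) into a private list, so
// worker tasks never write to shared state.
CommandList RecordChunk(const std::vector<int>& DrawCalls, std::size_t Begin, std::size_t End)
{
    CommandList Commands;
    for (std::size_t i = Begin; i < End; ++i)
    {
        Commands.push_back("Draw " + std::to_string(DrawCalls[i]));
    }
    return Commands;
}

// Splits one phase into chunks, records each chunk on its own task, then
// joins at the end of the phase and concatenates the lists in chunk order.
CommandList BuildPhaseCommands(const std::vector<int>& DrawCalls, std::size_t ChunkSize)
{
    assert(ChunkSize > 0);
    std::vector<std::future<CommandList>> Tasks;
    for (std::size_t Begin = 0; Begin < DrawCalls.size(); Begin += ChunkSize)
    {
        const std::size_t End = std::min(Begin + ChunkSize, DrawCalls.size());
        Tasks.push_back(std::async(std::launch::async,
            [&DrawCalls, Begin, End] { return RecordChunk(DrawCalls, Begin, End); }));
    }

    CommandList Phase;
    for (std::future<CommandList>& Task : Tasks)
    {
        CommandList Chunk = Task.get(); // sync point: wait for this chunk
        Phase.insert(Phase.end(), Chunk.begin(), Chunk.end());
    }
    return Phase;
}
```

Because the results are gathered in chunk order, the final list matches what a single-threaded traversal would have produced, even though the chunks may finish in any order.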
Audio thread
The main audio thread is analogous to the render thread, and acts as the interface for the lower-level mixing functions by performing the following tasks:
- Evaluating sound cue graphs
- Building wave instances
- Handling attenuation, and so on
The audio thread is the thread that all user-exposed APIs (such as Blueprints and Gameplay) interact with. The decoding and source-worker tasks decode the audio information, and also perform processing such as spatialization and head-related transfer function (HRTF) unpacking. (HRTF is vital for players in VR, as the algorithms allow users to detect differences in sound location and distance.)
The audio hardware thread is a single platform-dependent thread (for example, XAudio2* on Microsoft Windows*), which renders directly to the output hardware and consumes the mix. This isn’t created or managed by UE, but the optimization work will still impact thread usage.
There are two types of tasks: decoding and source worker.
- Decoding: decodes a block of compressed source files. Uses double buffering to decode compressed audio as it's being played back.
- Source Worker: performs the actual source processing for sources, including sample rate conversion, spatialization (HRTF), and effects. The number of Source Workers is configurable in an INI file.
  - If you have four workers and 32 sources, each worker will mix eight sources.
  - Source Worker tasks are highly parallelizable, so you can increase the number of workers if you have more CPU power.
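The arithmetic in the four-worker example can be sketched as a small helper. SourcesPerWorker is a hypothetical name, and the way a remainder is spread across workers is an assumption made for this illustration; the text above only describes the evenly divisible case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns how many sources each worker mixes. When the sources do not divide
// evenly, the first workers each take one extra (an assumption for this sketch).
std::vector<int> SourcesPerWorker(int NumSources, int NumWorkers)
{
    assert(NumWorkers > 0 && NumSources >= 0);
    std::vector<int> Counts(static_cast<std::size_t>(NumWorkers), NumSources / NumWorkers);
    const int Remainder = NumSources % NumWorkers;
    for (int i = 0; i < Remainder; ++i)
    {
        ++Counts[static_cast<std::size_t>(i)];
    }
    return Counts;
}
```

With 32 sources and four workers, every entry is eight. Doubling the worker count to eight on a CPU with more headroom would halve the per-worker load to four sources each.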
Robo Recall was also the first title to ship with the new audio mixing and threading system in the Unreal Engine. In Robo Recall, for example, the head-related transfer function took up nearly half of the audio time.
CPU worker thread scaling
Prior to UE 4.19, the number of available worker threads on the task graph was limited and did not take Intel® Hyper-Threading Technology into account. On systems with more than six cores, this left entire cores sitting idle. Creating the right number of worker threads on the task graph (UE’s internal work scheduler) allows content creators to scale visual-enhancing systems such as animation, cloth, destruction, and particles beyond what was possible before.
In UE 4.19, the number of worker threads on the task graph is calculated based on the user’s CPU, up to a current max of 22 per priority level:
    if (NumberOfCoresIncludingHyperthreads > NumberOfCores)
    {
        NumberOfThreads = NumberOfCoresIncludingHyperthreads - 2;
    }
    else
    {
        NumberOfThreads = NumberOfCores - 1;
    }
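For experimentation outside the engine, the same rule can be wrapped in a standalone function together with the cap of 22 mentioned above. This is a sketch, not the engine source: the function name, the constant name, and the lower bound of one thread are assumptions added for the example.

```cpp
#include <algorithm>
#include <cassert>

// Cap per priority level quoted in the text; the constant name is hypothetical.
constexpr int MaxWorkerThreadsPerPriority = 22;

// Applies the rule described above, then clamps the result.
// The lower bound of 1 is an assumption so that a single-core input still
// yields a usable worker count.
int NumberOfWorkerThreads(int NumberOfCores, int NumberOfCoresIncludingHyperthreads)
{
    assert(NumberOfCores >= 1);
    assert(NumberOfCoresIncludingHyperthreads >= NumberOfCores);

    int NumberOfThreads = 0;
    if (NumberOfCoresIncludingHyperthreads > NumberOfCores)
    {
        NumberOfThreads = NumberOfCoresIncludingHyperthreads - 2;
    }
    else
    {
        NumberOfThreads = NumberOfCores - 1;
    }
    return std::max(1, std::min(NumberOfThreads, MaxWorkerThreadsPerPriority));
}
```

For a CPU with 10 physical cores and 20 hardware threads, the function returns 18. A four-core CPU without Hyper-Threading yields 3, and an 18-core, 36-thread part is held at the cap of 22.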
The first step in parallel work is to open the door to the possibility that a game can use all of the available cores. This is fundamental to successful scaling. With the changes in 4.19, content can now do so and take full advantage of enthusiast CPUs through systems such as cloth physics, environment destruction, CPU-based particles, and advanced 3D audio.
Figure 5: Unreal Engine 4.19 now has the opportunity to utilize all available hardware threads.
In the benchmarking example above, a system with an Intel® Core™ i7-6950X processor at 3.00 GHz is at full utilization, tested using a synthetic workload.
Destruction benefits
One benefit from better thread utilization in multicore systems is in destruction. Destruction systems use the task graph to simulate dynamic fracturing of meshes into smaller pieces. A typical destruction workload consists of a few seconds of extensive simulation, followed by a return to the baseline. Better CPUs with more cores can keep the pieces around longer, with more fracturing, which greatly enhances realism.
Rous believes there is more that developers can do with destruction and calls it a good target for improved realism with the proper content. “It’s also easy to scale-up destruction, by fracturing meshes more and removing fractured chunks after a longer length of time on a more powerful CPU,” he said. “Since destruction is done through the physics engine on worker threads, the CPU won’t become the rendering bottleneck until quite a few systems are going at once.”
Figure 6: Destruction systems simulate dynamic fracturing of meshes into small pieces.
Cloth System Optimization
Cloth systems are used to add realism to characters and game environments via a dynamic 3D mesh simulation system that responds to the player, wind, or other environmental factors. Typical cloth applications within a game include player capes or flags.
The more realistic the cloth system, the more immersive the gaming experience. Generally speaking, the more cloth systems enabled, the more realistic the scene.
Developers have long struggled to make cloth systems appear realistic. Without a convincing cloth simulation, characters are restricted to tight clothing, and any effect of wind blowing through clothing is lost. Modeling cloth, however, has been a challenge.
Early attempts at cloth systems
According to Donald House at Texas A&M University, the first important computer graphics model for cloth simulation was presented by Jerry Weil in 1986. House and others presented an entire course on “Cloth and Clothing in Computer Graphics,” and described Weil’s work in detail. Weil developed “a purely geometric method for mimicking the drape of fabric suspended at constraint points,” House wrote. There were two phases in Weil’s simulation process. The first phase geometrically approximates the cloth surface with catenary curves, producing triangles of constraint points. The second phase applies an iterative relaxation process that smooths the surface by interpolating the original catenary intersection points. This static draping model could also represent dynamic behavior by applying the full approximation and relaxation process once, and then successively moving the constraint points slightly and reapplying the relaxation phase.
Around the same time, continuum models emerged that used physically based approaches to cloth behavior modeling. These early models employed continuum representations, modeling cloth as an elastic sheet. The first work in this area is a 1987 master’s thesis by Carl R. Feynman, who superimposed a continuum elastic model on a grid representation. Due to issues with simulation mesh sizes, cloth modeling using continuum techniques has difficulty capturing the complex folding and buckling behavior of real cloth.
Particle models gain traction
Particle models gained relevance in 1992, when David Breen and Donald House developed a non-continuum interacting particle model for cloth drape, which “explicitly represents the micro-mechanical structure of cloth via an interacting particle system,” as House described it. He explained that their model is based on the observation that cloth is “best described as a mechanism of interacting mechanical parts rather than a substance, and derives its macro-scale dynamic properties from the micro-mechanical interaction between threads.” In 1994 it was shown how this model could be used to accurately reproduce the drape of specific materials, and the Breen/House model has been expanded from there. One of the most successful of these models was by Eberhard, Weber, and Strasser in 1996. They used a Lagrangian mechanics reformulation of the basic energy equations suggested in the Breen/House model, resulting in a system of ordinary differential equations from which dynamics could be calculated.
The dynamic mesh simulation system is the current popular model. It responds to the player, wind, or other environmental factors, and results in more realistic features such as player capes or flags.
The UE has undergone multiple upgrades to enhance cloth systems; for example, in version 4.16, APEX Cloth* was replaced with NVIDIA’s NvCloth* solver. This low-level clothing solver is responsible for the particle simulation that runs clothing and allows integrations to be lightweight and very extensible, because developers now have direct access to the data.
More triangles, better realism
In UE 4.19, Intel engineers worked with the UE team to optimize the cloth system further to improve throughput. Cloth simulations are treated like other physics objects and run on the task graph’s worker threads. This allows developers to scale content on multicore CPUs and avoid bottlenecks. With the changes, the number of cloth simulations usable in a scene has increased by approximately 30 percent.
Cloth is simulated in every frame, even if the player is not looking at that particular point; simulation results will determine if the cloth system shows up in a player’s view. Cloth simulation uses the CPU about the same amount from frame to frame, assuming more systems aren’t added. It’s easily predictable and developers can tune the amount they’re using to fit the available headroom.
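Because the per-frame cost is predictable, fitting cloth into the available headroom reduces to simple arithmetic. The helper below is a hypothetical budgeting sketch: MaxClothSystems is not an engine function, and both inputs are values a developer would have to measure in a profiler for their own content and target CPU.

```cpp
#include <cassert>

// Hypothetical budgeting helper. Both inputs are developer-measured values:
// the CPU time left over per frame and the average cost of one cloth system.
// Times are in microseconds to keep the arithmetic in integers.
int MaxClothSystems(int HeadroomMicroseconds, int CostPerSystemMicroseconds)
{
    assert(HeadroomMicroseconds >= 0);
    assert(CostPerSystemMicroseconds > 0);
    return HeadroomMicroseconds / CostPerSystemMicroseconds; // rounds down
}
```

For example, with 4,000 microseconds of headroom and a measured cost of 250 microseconds per system, the budget allows 16 cloth systems. These figures are illustrative only and are not taken from the measurements in this paper.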
Figure 7: Examples of cloth systems in the Content Examples project.
For the purposes of the graphs in this document, the cloth actors used have 8,192 simulated triangles per mesh, and were all within the viewport when the data was captured. All data was captured on an Intel® Core™ i7-7820HK processor.
Figure 8: Different CPU usages between versions of Unreal Engine 4, based on number of cloth systems in the scene.
Figure 9: Difference in frames per second between versions of Unreal Engine 4 based on number of cloth systems in the scene.
Enhanced CPU Particles
Particle systems have been used in computer graphics and video games since the very early days. They’re useful because motion is a central facet of real life, so modeling particles to create explosions, fireballs, cloud systems, and other events is crucial to develop full immersion.
High-quality features available to CPU particles include the following:
- Light emission
- Material parameter control
- Attractor modules
Particles on multicore systems can be enhanced by using CPU systems in tandem with GPU ones. Such a system scales easily; developers can keep adding to the CPU workload until they run out of headroom. Engineers have found that pairing CPU particles with GPU particles can improve realism by adding light casting, allowing light to bounce off objects they run into. Each system has inherent limitations, so pairing them results in a system greater than the sum of its parts.
Figure 10: CPU Particles can easily scale based on available headroom.
Intel® VTune™ Amplifier Support
The Intel VTune Amplifier is an industry-standard tool to determine thread bottlenecks, sync points, and CPU hotspots. In UE 4.19, support for Intel VTune Amplifier ITT markers was added to the engine. This allows users to generate annotated CPU traces that give deep insight into what the engine is doing at all times.
ITT APIs have the following features:
- Control application performance overhead based on the amount of traces that you collect.
- Enable trace collection without recompiling your application.
- Support applications in C/C++ and Fortran environments.
- Support instrumentation for tracing application code.
Users can take advantage of this new functionality by launching Intel VTune Amplifier and running a UE workload through the UI with the -VTune switch. Once inside the workload, simply type Stat Namedevents on the console to begin outputting the ITT markers to the trace.
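The general pattern behind such annotations is a scoped marker that emits a begin event when a block of work starts and an end event when it finishes. The sketch below shows that pattern with a hypothetical in-memory TraceLog standing in for the profiler. It does not call the ITT library; in a real integration the Begin and End hooks would forward to the profiler's task markers (for example, the ITT task begin and end calls in Intel's ittnotify header). All class and function names here are assumptions for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trace sink standing in for a profiler's marker API.
struct TraceLog
{
    std::vector<std::string> Events;
    void Begin(const std::string& Name) { Events.push_back("begin " + Name); }
    void End(const std::string& Name) { Events.push_back("end " + Name); }
};

// RAII marker: emits "begin" on construction and "end" on destruction, so
// nested scopes produce correctly nested regions in the trace.
class ScopedMarker
{
public:
    ScopedMarker(TraceLog& InLog, std::string InName)
        : Log(InLog), Name(std::move(InName))
    {
        Log.Begin(Name);
    }
    ~ScopedMarker() { Log.End(Name); }

    ScopedMarker(const ScopedMarker&) = delete;
    ScopedMarker& operator=(const ScopedMarker&) = delete;

private:
    TraceLog& Log;
    std::string Name;
};

// Example frame with two annotated subsystems nested inside a frame marker.
void SimulateFrame(TraceLog& Log)
{
    ScopedMarker Frame(Log, "Frame");
    {
        ScopedMarker Physics(Log, "Physics");
        // physics work would run here
    }
    {
        ScopedMarker Cloth(Log, "Cloth");
        // cloth work would run here
    }
}
```

Tying the end event to a destructor means a region is closed on every exit path, which keeps the resulting trace well nested even when a function returns early.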
Figure 11: Example of annotated Intel VTune Amplifier trace in Unreal Engine 4.19.
Conclusion
Improvements involved solving technical challenges at every layer: the engine, middleware, the game editor, and the game itself. Rather than working on a title-by-title basis, engine improvements benefit the whole Unreal developer ecosystem. The advances in 4.19 address CPU workload challenges throughout the ecosystem in the following areas:
- More realistic destruction, thanks to more breakpoints per object.
- More particles, leading to better animated objects such as vegetation, cloth, and dust particles.
- More realistic background characters.
- More cloth systems.
- Improved particles (for example, physically interacting with characters, NPCs, and the environment).
As more end users migrate to powerful multicore systems, Intel plans to pursue a roadmap that will continue to take advantage of higher core counts. Any thread-bound systems or bottlenecked operations are squarely in the team’s crosshairs. Developers should be sure to download the latest version of the UE, engage at the Intel Developer Zone, and see for themselves.
Further Resources
Unreal* Engine 4 Optimization Guide