John Stone
Integrated Computer Solutions, Inc.
Introduction
We will examine parallel rendering topics using Direct3D* 12. We will take the results from the paper A Comparison of the Intel® Core™ i5 Processor and Intel® Core™ i7 Processor with Visualizations in OpenGL* and Oculus* VR, extend its code with a Direct3D 12 renderer, and then re-implement its particle system as a Direct3D 12 compute shader. You can find the sample code described in this article at the download link above.
CPU Compute
Our first task is to add a Direct3D 12 renderer to the particle system from the Intel Core i5 processor versus Intel Core i7 processor article mentioned above. The software design there makes this easy to do, since it nicely encapsulates the concept of rendering.
Renderer
Interface
We create a new file named RendererD3D12.cpp and add the corresponding class to the Renderers.h file.
#pragma once

#include "Particles.h"

namespace Particle { namespace Renderers {

// A D3D12 renderer drawing into a window
struct D3D12 : Renderer {
    D3D12();
    virtual ~D3D12();
    void* allocate(size_t) override;
    void* operator()(time, iterator, iterator, bool&) override;

    struct Data;
    Data& _;
};

}}
Our job is to fill in the implementations of the constructor, the destructor, allocate(), and operator(). By examining the code in RendererOpenGL.cpp we can see that the constructor should create all of the GPU resources, allocate() should create a pipelined, persistently mapped vertex upload buffer, and operator() should render a frame. The OpenGL implementation gets the job done in just 279 lines of code, but we will find that Direct3D 12 takes significantly more work to do the same job.
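Before diving in, here is a rough sketch of how those four methods divide the work. This is illustrative only; helper names like RenderFrame and NextWriteRegion are hypothetical, though CreateGeometryBuffer appears in the actual source:

// Illustrative sketch of the division of labor; not the actual implementation.
D3D12::D3D12() : _(*new Data)
{
    // create the window, device, swap chain, PSO, and the rest of the
    // GPU resources (see the initialization list below)
}

D3D12::~D3D12()
{
    // wait for in-flight frames to complete, then release resources
    delete &_;
}

void* D3D12::allocate(size_t bytes)
{
    // create the pipelined, persistently mapped vertex upload buffer
    // and hand its CPU pointer back to the particle system
    return _.CreateGeometryBuffer(bytes);
}

void* D3D12::operator()(time t, iterator begin, iterator end, bool& quit)
{
    _.PollEvents();             // pump the Win32 message loop
    _.RenderFrame(begin, end);  // record, submit, and present one frame
    return _.NextWriteRegion(); // where the caller writes the next frame's vertices
}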
Implementation
Event Loop
Let us start by deciding that we are going to do raw Direct3D 12; that is, we are going to program directly against the Microsoft*-published API without relying on helper libraries. The first thing we need to do is create a window and implement a minimal event loop.
void D3D12::Data::MakeWindow()
{
    // Initialize the window class.
    WNDCLASSEX k = {};
    k.cbSize = sizeof(WNDCLASSEX);
    k.style = CS_HREDRAW | CS_VREDRAW;
    k.lpfnWndProc = DefWindowProc;
    k.hInstance = GetModuleHandle(NULL);
    k.hCursor = LoadCursor(NULL, IDC_ARROW);
    k.lpszClassName = L"RendererD3D12";
    RegisterClassEx(&k);

    // Create the window and store a handle to it.
    win.hwnd = CreateWindow(k.lpszClassName, k.lpszClassName, WS_OVERLAPPEDWINDOW,
        5, 34, win.Width, win.Height, NULL, NULL, GetModuleHandle(NULL), NULL);
}

void D3D12::Data::ShowWindow()
{
    ::ShowWindow(win.hwnd, SW_SHOWDEFAULT);
}

void D3D12::Data::PollEvents()
{
    for (MSG msg = {}; PeekMessage(&msg, NULL, 0, 0, PM_REMOVE); ) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
}
Initialization
Not too bad, just 25 lines of code. Let us talk about what we need to do to turn that window into a rendering surface for Direct3D 12. Here’s a brief list:
// initialize D3D12
_.MakeWindow();
_.CreateViewport();
_.CreateScissorRect();
_.CreateFactory();
_.CreateDevice();
_.CreateCommandQueue();
_.CreateSwapChain();
_.CreateRenderTargets();
_.CreateDepthBuffer();
_.CreateConstantBuffer();
_.CreateFence();
_.ShowWindow();
_.CreateRootSignature();
_.CreatePSO();
_.CreateCommandList();
_.Finish();
Wow, that is 16 steps to initialize Direct3D 12! Each step expands into a function of between 6 and 85 lines of code, for a total of nearly 500 lines of C++. The OpenGL renderer gets the same job done in about 100 lines of C++, which gives us a 5x expansion factor when working in Direct3D 12. Now remember, this is pretty much as basic a renderer as you can get: it issues a single draw call per frame and renders its pixels to a single window. You will find that things get even more complicated in Direct3D 12 as you try to do more. Putting all these facts together suggests that Direct3D 12 programming is not for the faint of heart, a sentiment with which this author wholeheartedly agrees.
There are many tutorials covering this initialization process on the web, so I am not going to repeat them here; you can examine the source code for all the gory details. I will mention that, unlike in OpenGL, in Direct3D 12 you, the programmer, are responsible for creating and managing your own backbuffer/present pipeline. You also need to pipeline updates to all changing data, including constant data (uniform data in OpenGL) as well as the persistently mapped vertex upload buffer. In the Direct3D 12 code this is all encapsulated in the frames variable of the d3d12 structure.
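As a hedged sketch of what that pipelining looks like (the field and function names here are illustrative, not the actual d3d12 structure):

// Minimal sketch of fence-based frame pipelining; illustrative names only.
static constexpr int FrameCount = 2;
struct Frame
{
    ComPtr<ID3D12CommandAllocator> allocator;  // reset only after the GPU is done with it
    char*                          vertices;   // this frame's slice of the upload buffer
    UINT64                         fenceValue; // signaled when the GPU finishes this frame
} frames[FrameCount];

// Before reusing a frame's resources, block until the GPU has consumed them.
void WaitForFrame(ID3D12Fence* fence, HANDLE event, const Frame& f)
{
    if (fence->GetCompletedValue() < f.fenceValue)
    {
        fence->SetEventOnCompletion(f.fenceValue, event);
        WaitForSingleObject(event, INFINITE);
    }
}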
Custom Upload Heap
One thing I did notice while doing this was that using the default D3D12_HEAP_TYPE_UPLOAD heap type for the vertex upload buffer led to extremely slow frame times. This puzzled me for quite a while until I ran across some information indicating a read penalty that occurs when the CPU accesses write-combined memory; further digging into the Direct3D 12 documentation confirms that upload heaps use CPU pages that are optimized for writing, not reading.
We did not have this problem in the OpenGL code since it mapped its vertex upload memory like this:
GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
To get the same effect in Direct3D 12 you need to use a custom heap type (see the D3D12::Data::CreateGeometryBuffer() method):
// Describe the heap properties
D3D12_HEAP_PROPERTIES hp = {};
hp.Type = D3D12_HEAP_TYPE_CUSTOM;
hp.CPUPageProperty = D3D12_CPU_PAGE_PROPERTY_WRITE_BACK;
hp.MemoryPoolPreference = D3D12_MEMORY_POOL_L0;
where the key piece of the puzzle is setting CPUPageProperty to D3D12_CPU_PAGE_PROPERTY_WRITE_BACK.
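Continuing the sketch, the buffer itself can then be created on that heap and mapped once for the life of the application. This is a simplified version of what CreateGeometryBuffer does; the variable names are illustrative:

// Create the vertex upload buffer on the custom write-back heap (hp from above).
D3D12_RESOURCE_DESC rd = {};
rd.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
rd.Width = bufferSize;                      // total bytes for all pipelined frames
rd.Height = 1;
rd.DepthOrArraySize = 1;
rd.MipLevels = 1;
rd.SampleDesc.Count = 1;
rd.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR; // required layout for buffers

ComPtr<ID3D12Resource> buffer;
HRESULT hr = device->CreateCommittedResource(&hp, D3D12_HEAP_FLAG_NONE, &rd,
    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&buffer));

// Map once and keep the pointer; with write-back pages, occasional CPU reads
// of this memory do not incur the write-combine penalty.
void* ptr = nullptr;
D3D12_RANGE noRead = { 0, 0 };              // we will not read the initial contents
hr = buffer->Map(0, &noRead, &ptr);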
Shaders
One nice thing I found when working with Direct3D 12 is that it allows you to keep both the vertex and pixel shaders in a single piece of source code, unlike OpenGL, which requires a separate source per stage. Direct3D 12 can do that because it allows you to specify the entry-point function when compiling each shader:
// Create the pipeline state, which includes compiling and loading shaders.
#if defined(_DEBUG)
UINT f = D3DCOMPILE_DEBUG | D3DCOMPILE_SKIP_OPTIMIZATION;
#else
UINT f = 0;
#endif
f |= D3DCOMPILE_WARNINGS_ARE_ERRORS;

ComPtr<ID3DBlob> errs;

ComPtr<ID3DBlob> vShader;
hr = D3DCompile(shader.c_str(), shader.size(), 0, 0, 0, "vsMain", "vs_5_0",
                f, 0, &vShader, &errs);
if (FAILED(hr)) {
    fputs((const char*)errs->GetBufferPointer(), stderr);
    std::exit(1);
}

ComPtr<ID3DBlob> pShader;
hr = D3DCompile(shader.c_str(), shader.size(), 0, 0, 0, "psMain", "ps_5_0",
                f, 0, &pShader, &errs);
if (FAILED(hr)) {
    fputs((const char*)errs->GetBufferPointer(), stderr);
    std::exit(1);
}
Notice the strings "vsMain" and "psMain" in the above code. These are the main functions for the vertex shader and pixel shader, respectively.
Rendering
Now that the GPU resources are allocated and initialized (including the vertex upload buffer), we can finally turn our attention to actually rendering the particles, and again we see the chubbiness of Direct3D 12. OpenGL gets the job done in just 19 lines of code, while Direct3D 12 bloats that into 105 lines. You can examine this for yourself by checking out the D3D12::operator() method.
Vertex Buffer View
There is one thing that was a little surprising to me. When developing the software using the Direct3D 12 Debug Layers and the WARP software rasterizer (see the next section on GPU compute for details), I was getting an error message complaining that I was overrunning the buffer bounds. This puzzled me for a while until I realized that when mapping the attributes in the vertex buffer, I needed to account for each attribute's offset within the structure when determining its buffer view's length. You can see this in the following code fragment:
// Configure the vertex array pointers
auto start = _.d3d12.vertexBuffer->GetGPUVirtualAddress();
UINT size = UINT(_.memory.stride);
UINT stride = UINT(sizeof(*begin));
D3D12_VERTEX_BUFFER_VIEW vertexBufferViews[] = {
    { start + ((char*)&begin->pos - _.memory.ptr), size - offsetof(item, pos), stride },
    { start + ((char*)&begin->clr - _.memory.ptr), size - offsetof(item, clr), stride },
    { start + ((char*)&begin->vel - _.memory.ptr), size - offsetof(item, vel), stride },
    { start + ((char*)&begin->age - _.memory.ptr), size - offsetof(item, age), stride },
};
Notice how we subtract each attribute's starting offset from the buffer size via the offsetof() macro to avoid this buffer overrun warning.
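To make the arithmetic concrete, here is a self-contained example with made-up sizes showing why each view must shrink by its attribute's offset:

// Hypothetical numbers: a 32-byte vertex holding 10,000 particles gives a
// 320,000-byte buffer. A view for an attribute at byte offset 12 starts 12
// bytes into the buffer, so only 320,000 - 12 bytes remain before the end of
// the resource; claiming the full 320,000 would overrun it by 12 bytes.
#include <cstddef>

struct item { float pos[3]; unsigned clr; float vel[3]; float age; }; // 32 bytes

int main()
{
    const size_t count  = 10000;
    const size_t total  = count * sizeof(item);  // 320,000 bytes
    const size_t offset = offsetof(item, clr);   // 12 bytes
    const size_t legal  = total - offset;        // largest SizeInBytes for this view
    return legal == 319988 ? 0 : 1;
}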
Results
Other than these few gotchas, things went pretty much as described in the Direct3D 12 tutorials.
GPU Compute
Introduction
Now that we have finished rendering the particle system calculated on the CPU, let us consider what it would take to move this to the GPU for the ultimate in parallel programming.
After struggling to get every single field in every single structure used in the Direct3D 12 CPU portion correct and consistent, the first thing I considered was the prospect of two to three times more of the same work for the GPU compute problem. I quickly decided the right thing to do was to look for some kind of helper framework.
MiniEngine*
I thought that there would be a multitude to choose from on the Internet, but my Google searches were only turning up entire game engines (Unity*, Unreal*, and Oryol*), which are really too abstract for what I was doing. I wanted something Direct3D 12-specific, and I was beginning to think that I would actually have to write the thousands of lines of code myself, before finally discovering a small team inside Microsoft that seems to be dedicated to DirectX* 12 and graphics education. I found their YouTube* channel, and from there I found their GitHub* repository.
They have a set of standard how-to samples for Direct3D 12 like most other tutorial sites, but they also have the MiniEngine: A DirectX 12 Engine Starter Kit. Looking through this, it seemed to be exactly what I was looking for: a framework taking the drudgery out of using the Direct3D 12 API, yet small and simple enough to remain straightforward to understand and use.
The code that accompanies this article does not include MiniEngine* itself. Instead, it is a project created by MiniEngine's createNewSolution.bat file.
Installation
To use the accompanying code, you first need to download MiniEngine and install it into the gpu/MiniEngine folder.
Direct3D 12 Debug Layers
Install the Direct3D 12 Debug Layers development tool. It is not included in the SDK installed by Microsoft Visual Studio*; instead, it is an optional Windows* feature. You install it by going to the Windows* 10 Settings application and choosing Apps.
Click on Manage optional features and choose Add a feature.
Choose Graphics Tools, click Install, and then wait for the install to complete.
Go ahead and open the code sample by double-clicking the gpu\MiniEngine\app\app_VS15.sln file. Once Visual Studio* 2017 finishes launching, we need to make two tiny tweaks to the MiniEngine code to align things better with what we want to do.
Customize MiniEngine
The first tweak is to convince MiniEngine to use the software WARP rasterizer in debug mode, since that driver contains a lot of validation and logging code to keep us informed if we are not doing things quite right. To do this, open the Core/Source Files/Graphics/GraphicsCore.cpp file, navigate to line 322, and change it from this:
static const bool bUseWarpDriver = false;
to this:
#ifdef _DEBUG
static const bool bUseWarpDriver = true;
#else
static const bool bUseWarpDriver = false;
#endif
Second, MiniEngine is hard-coded to a few fixed resolutions. We want to run with our window at 800 x 600 to match the size of the window created in the CPU section. To do this, navigate to line 137 and change the code there from this:
switch (eResolution((int)TargetResolution))
{
default:
case k720p:  NativeWidth = 1280; NativeHeight = 720;  break;
case k900p:  NativeWidth = 1600; NativeHeight = 900;  break;
case k1080p: NativeWidth = 1920; NativeHeight = 1080; break;
case k1440p: NativeWidth = 2560; NativeHeight = 1440; break;
case k1800p: NativeWidth = 3200; NativeHeight = 1800; break;
case k2160p: NativeWidth = 3840; NativeHeight = 2160; break;
}
to this:
#if 0
switch (eResolution((int)TargetResolution))
{
default:
case k720p:  NativeWidth = 1280; NativeHeight = 720;  break;
case k900p:  NativeWidth = 1600; NativeHeight = 900;  break;
case k1080p: NativeWidth = 1920; NativeHeight = 1080; break;
case k1440p: NativeWidth = 2560; NativeHeight = 1440; break;
case k1800p: NativeWidth = 3200; NativeHeight = 1800; break;
case k2160p: NativeWidth = 3840; NativeHeight = 2160; break;
}
#else
NativeWidth = g_DisplayWidth;
NativeHeight = g_DisplayHeight;
#endif
After this change, the size of the rendering surface will track the size of the window, and we can control the window size with the g_DisplayWidth and g_DisplayHeight global variables.
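For example, pinning the window to 800 x 600 to match the CPU renderer might look like this sketch. It assumes g_DisplayWidth and g_DisplayHeight are assignable globals in the Graphics namespace declared by GraphicsCore.h, as in the MiniEngine version used here:

// Hypothetical sketch: set the window size before graphics initialization.
#include "GraphicsCore.h"

void ConfigureDisplaySize()
{
    Graphics::g_DisplayWidth  = 800;
    Graphics::g_DisplayHeight = 600;
}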
Sample Code
Now that we have MiniEngine configured the way we like it, let us turn our attention to what we need to do to actually render something using it. MiniEngine encapsulates all of the verbose Direct3D 12 configuration structures in tidy C++ classes and implements a malloc-style heap for Direct3D 12 resources. This, combined with its automatic pipelining of resources, makes it very easy to use. The 500+ lines of rendering code from the CPU section are reduced to about 38 lines of setup code and 31 lines of rendering code (69 lines total). This is a huge improvement!
Setup Code
// configure root signature
graphic.rootSig.Reset(1, 0);
graphic.rootSig[0].InitAsConstantBuffer(0, D3D12_SHADER_VISIBILITY_ALL);
graphic.rootSig.Finalize(L"Graphic",
    D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

// configure the vertex inputs
D3D12_INPUT_ELEMENT_DESC vertElem[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, offsetof(particles::item, pos),
      D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "COLOR",    0, DXGI_FORMAT_R8G8B8A8_UNORM,  0, offsetof(particles::item, clr),
      D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "VELOCITY", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, offsetof(particles::item, vel),
      D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "AGE",      0, DXGI_FORMAT_R32_FLOAT,       0, offsetof(particles::item, age),
      D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
};

// query the MiniEngine formats
DXGI_FORMAT ColorFormat = g_SceneColorBuffer.GetFormat();
DXGI_FORMAT DepthFormat = g_SceneDepthBuffer.GetFormat();

// configure the PSO
graphic.pso.SetRootSignature(graphic.rootSig);
graphic.pso.SetRasterizerState(RasterizerDefault);
graphic.pso.SetBlendState(BlendDisable);
graphic.pso.SetDepthStencilState(DepthStateReadWrite);
graphic.pso.SetInputLayout(_countof(vertElem), vertElem);
graphic.pso.SetPrimitiveTopologyType(D3D12_PRIMITIVE_TOPOLOGY_TYPE_POINT);
graphic.pso.SetRenderTargetFormats(1, &ColorFormat, DepthFormat);
graphic.pso.SetVertexShader(g_pGraphicVS, sizeof(g_pGraphicVS));
graphic.pso.SetPixelShader(g_pGraphicPS, sizeof(g_pGraphicPS));
graphic.pso.Finalize();

// set view and projection matrices
DirectX::XMStoreFloat4x4(&graphic.view,
    DirectX::XMMatrixLookAtLH({ 0.f, -4.5f, 2.f }, { 0.f, 0.f, -0.3f }, { 0.f, 0.f, 1.f }));
DirectX::XMStoreFloat4x4(&graphic.proj,
    DirectX::XMMatrixPerspectiveFovLH(DirectX::XMConvertToRadians(45.f),
        float(g_DisplayWidth) / g_DisplayHeight, 0.01f, 1000.f));
Rendering Code
// render graphics
GraphicsContext& context = GraphicsContext::Begin(L"Scene Render");

// transition
context.TransitionResource(readBuffer, D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
context.TransitionResource(g_SceneColorBuffer, D3D12_RESOURCE_STATE_RENDER_TARGET);
context.TransitionResource(g_SceneDepthBuffer, D3D12_RESOURCE_STATE_DEPTH_WRITE, true);

// configure
context.SetRootSignature(graphic.rootSig);
context.SetViewportAndScissor(0, 0, g_SceneColorBuffer.GetWidth(), g_SceneColorBuffer.GetHeight());
context.SetRenderTarget(g_SceneColorBuffer.GetRTV(), g_SceneDepthBuffer.GetDSV());

// clear
context.ClearColor(g_SceneColorBuffer);
context.ClearDepth(g_SceneDepthBuffer);

// update
struct { DirectX::XMFLOAT4X4 MVP; } vsConstants;
DirectX::XMMATRIX view = DirectX::XMLoadFloat4x4(&graphic.view);
DirectX::XMMATRIX proj = DirectX::XMLoadFloat4x4(&graphic.proj);
DirectX::XMMATRIX mvp = view * proj;
DirectX::XMStoreFloat4x4(&vsConstants.MVP, DirectX::XMMatrixTranspose(mvp));
context.SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);

// draw
context.SetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_POINTLIST);
context.SetVertexBuffer(0, readBuffer.VertexBufferView());
context.SetPipelineState(graphic.pso);
context.Draw(particles::nParticles, 0);

// finish
context.Finish();
CPU
Pipelining and Rendering
Now that we have the 69 lines of rendering code, let us think about what we need to do for GPU compute. My thinking is that we should have two buffers of particle data and render one while the other is being updated for the next frame by GPU compute. You can see this in the code:
static constexpr int Frames = 2;
StructuredBuffer vertexBuffer[Frames];
int current = 0;

and in the rendering code like this:

// advance pipeline
auto& readBuffer = vertexBuffer[current];
current = (current + 1) % Frames;
auto& writeBuffer = vertexBuffer[current];
GPU
Introduction
Now we have to think about how to implement the particle system on the GPU. An important part of the CPU algorithm is a sorting/partitioning step after update() that collects all dead particles together at the end of the pool, making it a fast operation to emit new ones. At first you might think we need to replicate that step on the GPU, which is technically possible via a bitonic sort (MiniEngine actually contains an implementation of this algorithm). On further thought, however, you may realize that this sort is only required for fast looping over the particle pool when emitting. On the GPU this loop is not required; it is replaced by a GPU thread launched to process each particle in the pool in parallel (remember the title of this article is Parallel Processing with DirectX* 3D 12). Knowing this, you may realize that all that is actually needed is for each thread to have access to a global count of particles to emit each frame. Each thread examines its data to see whether its particle is available for emitting, and if it is, it atomically gets-and-decrements the global counter. If it gets a value that is positive, the thread goes ahead and emits the particle for further processing; otherwise the thread does nothing.
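Expressed on the CPU with std::atomic, purely to illustrate the per-thread logic (the names are hypothetical; the actual HLSL appears below):

#include <atomic>

struct Particle { float age; /* pos, clr, vel, ... */ };
void compute(Particle&);     // advance a live particle (as in the shader below)
void initialize(Particle&);  // emit: reset a dead particle (as in the shader below)

std::atomic<int> toEmit;     // set to the frame's emission count each frame

// What each GPU thread conceptually does for its particle.
void processParticle(Particle& p)
{
    if (p.age >= 0)
        compute(p);                   // alive: integrate it forward
    else if (toEmit.fetch_sub(1) > 0) // dead: atomically get-and-decrement
        initialize(p);                // positive result: we won an emission slot
    // a non-positive result means the frame's emission budget is spent
}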
Atomic Counter
If only Direct3D 12 had an atomic counter available and easily accessed by the compute shader …. Hmmm.
The compute shader RWStructuredBuffer data type has an optional hidden counter variable. Examining the MiniEngine source code reveals that it implements this optional feature: each StructuredBuffer exposes its counter resource through GetCounterBuffer(), and CommandContext provides a convenient member function for setting it:

void CommandContext::WriteBuffer(GpuResource& Dest, size_t DestOffset,
                                 const void* BufferData, size_t NumBytes)
This makes the compute C++ rendering code straightforward, as shown:
// render compute
ComputeContext& context = ComputeContext::Begin(L"Scene Compute");

// update counter
context.WriteBuffer(writeBuffer.GetCounterBuffer(), 0, &flow.num2Create, 4);

// compute
context.SetRootSignature(compute.rootSig);
context.SetPipelineState(compute.pso);
context.TransitionResource(readBuffer, D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE);
context.TransitionResource(writeBuffer, D3D12_RESOURCE_STATE_UNORDERED_ACCESS);
context.SetConstant(0, particles::nParticles, 0);
context.SetConstant(0, flow.dt, 1);
context.SetDynamicDescriptor(1, 0, readBuffer.GetSRV());
context.SetDynamicDescriptor(2, 0, writeBuffer.GetUAV());
context.Dispatch((particles::nParticles + 255) / 256, 1, 1);
context.InsertUAVBarrier(writeBuffer);

// finish
context.Finish();
The corresponding compute shader code:
[numthreads(256, 1, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    // initialize random number generator
    rand_seed(DTid.x);

    // get the particle
    Particle p = I[DTid.x];

    // compute the particle if it's alive
    if (p.age >= 0)
        compute(p);
    // otherwise initialize the particle if we got one from the pool
    else if (!(O.DecrementCounter() >> 31))
        initialize(p);

    // write the particle
    O[DTid.x] = p;
}
Results
Just like that, we have a compute version of our particle system in only an additional 20 lines of C++ code.