
Parallel Processing with Direct3D* 12


John Stone

Integrated Computer Solutions, Inc.

Download sample code

Introduction

We will examine parallel rendering topics using Direct3D* 12. We will take the results from the paper, A Comparison of the Intel® Core™ i5 Processor and Intel® Core™ i7 Processor with Visualizations in OpenGL* and Oculus* VR, extend its code with a Direct3D 12 renderer, and then re-implement its particle system as a Direct3D 12 compute shader. You can find the sample code described in this article at the download link above.

CPU Compute

Our first task is to add a Direct3D 12 renderer to the particle system from the Intel Core i5 versus Intel Core i7 article mentioned above. The software design there makes this easy to do, since it cleanly encapsulates the concept of rendering.

Renderer

Interface

We create a new file named RendererD3D12.cpp and add the corresponding class to the Renderers.h file.

#pragma once

#include "Particles.h"

namespace Particle { namespace Renderers {

// A D3D12 renderer drawing into a window
struct D3D12 : Renderer
{
    D3D12();
    virtual ~D3D12();

    void* allocate(size_t) override;
    void* operator()(time, iterator, iterator, bool&) override;

    struct Data; Data& _;
};

}}

Our job is to fill in the implementations of the constructor, the destructor, allocate(), and operator(). By examining the code in RendererOpenGL.cpp we can see that the constructor should create all of the GPU resources, allocate() should create a pipelined, persistently mapped vertex upload buffer, and operator() should render a frame. The OpenGL implementation gets the job done in just 279 lines of code, but we will find that Direct3D 12 takes significantly more work to do the same job.
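To make the division of labor concrete, here is a hedged sketch of how those four methods might be laid out in RendererD3D12.cpp. The members of Data and the new Data idiom are assumptions; only the four signatures come from the header above.

#include "Renderers.h"

namespace Particle { namespace Renderers {

struct D3D12::Data
{
    // window handle, device, queue, swap chain, PSO, fence, etc. live here
    void MakeWindow();
    void PollEvents();
};

D3D12::D3D12() : _(*new Data)
{
    // create the window and all GPU resources up front
}

D3D12::~D3D12()
{
    // wait for the GPU to go idle, then release everything
    delete &_;
}

void* D3D12::allocate(size_t bytes)
{
    // hand back a slice of a pipelined, persistently mapped upload buffer
    return nullptr; // placeholder in this sketch
}

void* D3D12::operator()(time t, iterator begin, iterator end, bool& alive)
{
    // poll window events, record a command list, draw, present
    return nullptr; // placeholder in this sketch
}

}}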

Implementation

Event Loop

Let us start by deciding that we are going to do raw Direct3D 12; that is, we are going to program directly to the Microsoft*-published API without relying on helper libraries. The first thing we need to do is create a window and implement a minimal event loop.

void D3D12::Data::MakeWindow()
{
    // Initialize the window class.
    WNDCLASSEX k = {};
    k.cbSize = sizeof(WNDCLASSEX);
    k.style = CS_HREDRAW | CS_VREDRAW;
    k.lpfnWndProc = DefWindowProc;
    k.hInstance = GetModuleHandle(NULL);
    k.hCursor = LoadCursor(NULL, IDC_ARROW);
    k.lpszClassName = L"RendererD3D12";
    RegisterClassEx(&k);

    // Create the window and store a handle to it.
    win.hwnd = CreateWindow(k.lpszClassName, k.lpszClassName, WS_OVERLAPPEDWINDOW, 5, 34, win.Width, win.Height, NULL, NULL, GetModuleHandle(NULL), NULL);
}
void D3D12::Data::ShowWindow()
{
    ::ShowWindow(win.hwnd, SW_SHOWDEFAULT);
}
void D3D12::Data::PollEvents()
{
    for (MSG msg = {}; PeekMessage(&msg, NULL, 0, 0, PM_REMOVE); ) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
}

Initialization

Not too bad, just 25 lines of code. Let us talk about what we need to do to turn that window into a rendering surface for Direct3D 12. Here’s a brief list:

    // initialize D3D12
    _.MakeWindow();
    _.CreateViewport();
    _.CreateScissorRect();
    _.CreateFactory();
    _.CreateDevice();
    _.CreateCommandQueue();
    _.CreateSwapChain();
    _.CreateRenderTargets();
    _.CreateDepthBuffer();
    _.CreateConstantBuffer();
    _.CreateFence();
    _.ShowWindow();
    _.CreateRootSignature();
    _.CreatePSO();
    _.CreateCommandList();
    _.Finish();

Wow, that is 16 steps to initialize Direct3D 12! Each step expands to a function of between 6 and 85 lines of code, for a total of nearly 500 lines of C++. The OpenGL renderer gets the same job done in about 100 lines of C++, which gives us a 5x expansion when working in Direct3D 12. And remember, this is about as basic a renderer as you can get: it issues a single draw call per frame and renders its pixels to a single window. You will find that things get even more complicated in Direct3D 12 as you try to do more. Putting these facts together suggests that Direct3D 12 programming is not for the faint of heart, with which this author wholeheartedly agrees.
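To give a flavor of what one of these steps looks like, here is a hedged sketch of CreateDevice() and CreateCommandQueue(). The member names (d3d12.device, d3d12.queue) are assumptions; D3D12CreateDevice and CreateCommandQueue are the actual API calls.

void D3D12::Data::CreateDevice()
{
    // Take the default adapter at the minimum feature level we can live with;
    // a production renderer would enumerate adapters via the DXGI factory.
    HRESULT hr = D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&d3d12.device));
    if (FAILED(hr))
        std::exit(1);
}

void D3D12::Data::CreateCommandQueue()
{
    D3D12_COMMAND_QUEUE_DESC qd = {};
    qd.Type = D3D12_COMMAND_LIST_TYPE_DIRECT; // accepts draw, compute, and copy work
    HRESULT hr = d3d12.device->CreateCommandQueue(&qd, IID_PPV_ARGS(&d3d12.queue));
    if (FAILED(hr))
        std::exit(1);
}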

There are many tutorials on this initialization process on the web, so I will not cover it here; you can examine the source code for all the gory details. I will mention that, unlike OpenGL, in Direct3D 12 you, the programmer, are responsible for creating and managing your own backbuffer/present pipeline. You also need to pipeline updates to all changing data, including constant data (uniform data in OpenGL) as well as the persistently mapped vertex upload buffer. All of this is encapsulated in the frames variable of the d3d12 structure.
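The shape of that pipelining, as a hedged sketch (the member layout shown here is an assumption based on the description above, not the sample's exact declaration):

// One copy of everything that changes per frame; the renderer cycles
// through these so the CPU can record frame N+1 while the GPU draws frame N.
struct Frame
{
    ComPtr<ID3D12CommandAllocator> allocator;      // reset once the GPU is done with it
    ComPtr<ID3D12Resource>         renderTarget;   // swap-chain back buffer
    ComPtr<ID3D12Resource>         constantBuffer; // per-frame constants
    char*                          vertices;       // slice of the mapped upload buffer
    UINT64                         fenceValue;     // signaled when the GPU finishes the frame
};
Frame frames[3];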

Custom Upload Heap

One thing I noticed while doing this was that using the default D3D12_HEAP_TYPE_UPLOAD heap type for the vertex upload buffer led to extremely slow frame times. This puzzled me for quite a while, until I ran across some information indicating a read penalty that occurs when accessing write-combined memory. Some further digging into the Direct3D 12 documentation turned up this:

Applications should avoid CPU reads from pointers to resources on UPLOAD heaps, even accidentally. CPU reads will work, but are prohibitively slow on many common GPU architectures, so consider the following …
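For illustration, here is the kind of accidental read the documentation is warning about (a hypothetical fragment; mapped is a pointer returned by Map() on an UPLOAD-heap resource):

// Write-combined memory drains writes quickly, but any CPU read stalls.
// This innocent-looking read-modify-write performs a read:
mapped[i].age += dt;                    // reads mapped[i].age -> prohibitively slow

// The fast pattern keeps the working copy in ordinary memory and only writes:
mapped[i].age = particles[i].age + dt;  // pure write -> fast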

We did not have this problem in the OpenGL code since it mapped its vertex upload memory like this:

    GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

To get the same effect in Direct3D 12 you need to use a custom heap type (see D3D12::Data::CreateGeometryBuffer() method):

// Describe the heap properties
D3D12_HEAP_PROPERTIES hp = {};
hp.Type = D3D12_HEAP_TYPE_CUSTOM;
hp.CPUPageProperty = D3D12_CPU_PAGE_PROPERTY_WRITE_BACK;
hp.MemoryPoolPreference = D3D12_MEMORY_POOL_L0;

where the key piece of the puzzle is the use of D3D12_CPU_PAGE_PROPERTY_WRITE_BACK CPUPageProperty.
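For completeness, a hedged sketch of how those heap properties might feed into the buffer creation and the persistent map (the buffer size and member names are assumptions; the API calls are standard Direct3D 12):

D3D12_RESOURCE_DESC rd = {};
rd.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
rd.Width            = uploadBufferSize;               // total bytes for all pipelined frames
rd.Height           = 1;
rd.DepthOrArraySize = 1;
rd.MipLevels        = 1;
rd.SampleDesc.Count = 1;
rd.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR; // required layout for buffers

HRESULT hr = d3d12.device->CreateCommittedResource(
    &hp, D3D12_HEAP_FLAG_NONE, &rd,
    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
    IID_PPV_ARGS(&d3d12.vertexBuffer));

// Map once and keep the pointer; persistent mapping is legal in D3D12.
char* ptr = nullptr;
d3d12.vertexBuffer->Map(0, nullptr, (void**)&ptr);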

Shaders

One nice thing I found when working with Direct3D 12 is that it lets you keep the vertex and pixel shaders in a single piece of source code, unlike OpenGL, which requires separate shader sources. Direct3D 12 can do this because it lets you specify the entry-point function when compiling each shader:

// Create the pipeline state, which includes compiling and loading shaders.
#if defined(_DEBUG)
UINT f = D3DCOMPILE_DEBUG | D3DCOMPILE_SKIP_OPTIMIZATION;
#else
UINT f = 0;
#endif
f |= D3DCOMPILE_WARNINGS_ARE_ERRORS;
ComPtr<ID3DBlob> errs;
ComPtr<ID3DBlob> vShader;
hr = D3DCompile(shader.c_str(), shader.size(), 0, 0, 0, "vsMain", "vs_5_0", f, 0, &vShader, &errs);
if (FAILED(hr)) {
    fputs((const char*)errs->GetBufferPointer(), stderr);
    std::exit(1);
}
ComPtr<ID3DBlob> pShader;
hr = D3DCompile(shader.c_str(), shader.size(), 0, 0, 0, "psMain", "ps_5_0", f, 0, &pShader, &errs);
if (FAILED(hr)) {
    fputs((const char*)errs->GetBufferPointer(), stderr);
    std::exit(1);
}

Notice the strings "vsMain" and "psMain" in the above code. These are the main functions for the vertex shader and pixel shader, respectively.
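For illustration, a minimal combined shader source in this style (a sketch, not the sample's actual shader, which also carries velocity and age):

// Both entry points live in one source string; D3DCompile picks the one we name.
static const std::string shader = R"hlsl(
cbuffer Constants : register(b0)
{
    float4x4 mvp;
};

struct VSOut
{
    float4 pos : SV_POSITION;
    float4 clr : COLOR;
};

VSOut vsMain(float3 pos : POSITION, float4 clr : COLOR)
{
    VSOut o;
    o.pos = mul(mvp, float4(pos, 1));
    o.clr = clr;
    return o;
}

float4 psMain(VSOut i) : SV_TARGET
{
    return i.clr;
}
)hlsl";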

Rendering

Now that the GPU resources are allocated and initialized (including the vertex upload buffer), we can finally turn our attention to actually rendering the particles, and again we see the verbosity of Direct3D 12: OpenGL gets the job done in just 19 lines of code, while Direct3D 12 bloats that into 105 lines. You can examine this for yourself in the D3D12::operator() method.

Vertex Buffer View

There is one thing that was a little surprising to me. While developing the software using the Direct3D 12 Debug Layers and the WARP software rasterizer (see the GPU compute section below for details), I was getting an error message complaining that I was overrunning the buffer bounds. This puzzled me for a while, until I realized that when mapping the attributes in the vertex buffer, I needed to account for each attribute's offset within the structure when computing the buffer view's length. You can see this in the following code fragment:

// Configure the vertex array pointers
auto start  = _.d3d12.vertexBuffer->GetGPUVirtualAddress();
UINT size   = UINT(_.memory.stride);
UINT stride = UINT(sizeof(*begin));
D3D12_VERTEX_BUFFER_VIEW vertexBufferViews[] = {
    { start + ((char*)&begin->pos - _.memory.ptr), size - offsetof(item,pos), stride },
    { start + ((char*)&begin->clr - _.memory.ptr), size - offsetof(item,clr), stride },
    { start + ((char*)&begin->vel - _.memory.ptr), size - offsetof(item,vel), stride },
    { start + ((char*)&begin->age - _.memory.ptr), size - offsetof(item,age), stride },
};

Notice how we subtract each attribute's starting offset from the buffer size via the offsetof() macro to avoid the buffer overrun warning.
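To make the arithmetic concrete, consider a hypothetical 32-byte item with clr at offset 12 and a pool of 10,000 particles:

// buffer size                = 10000 * 32 = 320,000 bytes
// 'clr' view start           = bufferStart + 12
// bytes reachable from there = 320,000 - 12 = 319,988
//
// Declaring SizeInBytes = 320,000 for the 'clr' view would claim 12 bytes
// past the end of the resource -- exactly the overrun the debug layer flags.
// SizeInBytes = size - offsetof(item, clr) keeps every view inside the buffer.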

Results

Other than these few gotchas, things went pretty much as described in the Direct3D 12 tutorials.

[Figure: Render Direct3D 12]

GPU Compute

Introduction

Now that we have finished rendering the particle system calculated on the CPU, let us consider what it would take to move this to the GPU for the ultimate in parallel programming.

The first thing I considered, after struggling to get every single field in every single structure of the Direct3D 12 CPU renderer correct and consistent, was the prospect of two to three times more of that work for the GPU compute problem. I quickly decided the right thing to do was to look for some kind of helper framework.

MiniEngine*

I thought there would be a multitude to choose from on the Internet, but my Google searches were only turning up entire game engines (Unity*, Unreal*, and Oryol*), which are really too abstract for what I was doing. I wanted something Direct3D 12 specific, and I was beginning to think I would actually have to write the thousands of lines of code myself, before finally discovering a small team inside Microsoft that seems to be dedicated to DirectX* 12 and graphics education. I found their YouTube* channel, and from there their GitHub* repository.

They have a set of standard how-to samples for Direct3D 12 like most other tutorial sites, but they also have the MiniEngine: A DirectX 12 Engine Starter Kit. Looking through it, this seemed to be exactly what I was looking for: a framework that takes the drudgery out of using the Direct3D 12 API, yet is small and simple enough to remain straightforward to understand and use.

The code that accompanies this article does not include MiniEngine* itself. Instead, it is a project created by MiniEngine's createNewSolution.bat file.

Installation

In order to use the accompanying code you first need to download MiniEngine from the GitHub* repository mentioned above.

Install MiniEngine into the gpu/MiniEngine folder like so:

[Figure: MiniEngine folder]

Direct3D 12 Debug Layers

Next, install the Direct3D 12 Debug Layers development tool. It is not included in the SDK installed by Microsoft Visual Studio*; instead, it is an optional Windows* feature. You install it by opening the Windows® 10 Settings application and choosing Apps.

[Figure: Direct3D 12 Debug Layers]

Click on Manage optional features and choose Add a feature.

[Figure: Manage optional features]

Choose Graphics Tools, click Install, and then wait for the install to complete:

[Figure: Add a feature]

[Figure: Manage optional features]

Go ahead and open the code sample by double-clicking the gpu\MiniEngine\app\app_VS15.sln file. Once Visual Studio* 2017 finishes launching, we need to make two small tweaks to the MiniEngine code to better align it with what we want to do.

Customize MiniEngine

The first tweak is to convince MiniEngine to use the software WARP rasterizer in debug mode, since that driver contains a lot of logging code to keep us informed when we are not doing things quite right. To do this, open the Core/Source Files/Graphics/GraphicsCore.cpp file, navigate to line 322, and change it from this:

static const bool bUseWarpDriver = false;

to this:

#ifdef _DEBUG
    static const bool bUseWarpDriver = true;
#else
    static const bool bUseWarpDriver = false;
#endif

Second, MiniEngine is hard-coded to a few fixed resolutions. We want to run our window at 800 x 600 to match the size of the window created in the CPU section. To do this, navigate to line 137 and change the code there from this:

        switch (eResolution((int)TargetResolution))
        {
        default:
        case k720p:
            NativeWidth = 1280;
            NativeHeight = 720;
            break;
        case k900p:
            NativeWidth = 1600;
            NativeHeight = 900;
            break;
        case k1080p:
            NativeWidth = 1920;
            NativeHeight = 1080;
            break;
        case k1440p:
            NativeWidth = 2560;
            NativeHeight = 1440;
            break;
        case k1800p:
            NativeWidth = 3200;
            NativeHeight = 1800;
            break;
        case k2160p:
            NativeWidth = 3840;
            NativeHeight = 2160;
            break;
        }

to this:

#if 0
        switch (eResolution((int)TargetResolution))
        {
        default:
        case k720p:
            NativeWidth = 1280;
            NativeHeight = 720;
            break;
        case k900p:
            NativeWidth = 1600;
            NativeHeight = 900;
            break;
        case k1080p:
            NativeWidth = 1920;
            NativeHeight = 1080;
            break;
        case k1440p:
            NativeWidth = 2560;
            NativeHeight = 1440;
            break;
        case k1800p:
            NativeWidth = 3200;
            NativeHeight = 1800;
            break;
        case k2160p:
            NativeWidth = 3840;
            NativeHeight = 2160;
            break;
        }
#else
        NativeWidth = g_DisplayWidth;
        NativeHeight = g_DisplayHeight;
#endif

After this change the size of the rendering surface will track the size of the window, and we can control the size of the window with the g_DisplayWidth and g_DisplayHeight global variables.
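If your copy of MiniEngine follows the usual layout (an assumption; locate the actual definitions in GraphicsCore.cpp), pinning the window to 800 x 600 is then just a change to the initializers:

// In GraphicsCore.cpp (assumed location of the definitions):
namespace Graphics
{
    uint32_t g_DisplayWidth  = 800;   // was a fixed resolution such as 1920
    uint32_t g_DisplayHeight = 600;   // was a fixed resolution such as 1080
}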

Sample Code

Now that we have MiniEngine configured the way we like it, let us turn our attention to what we need to do to actually render something with it. MiniEngine encapsulates all of the verbose Direct3D 12 configuration structures into tidy C++ classes, and implements a malloc-style heap for Direct3D 12 resources. This, combined with its automatic pipelining of resources, makes it very easy to use. The 500+ lines of rendering code from the CPU section are reduced to about 38 lines of setup code and 31 lines of rendering code (69 lines total). This is a huge improvement!

Setup Code

// configure root signature
graphic.rootSig.Reset(1, 0);
graphic.rootSig[0].InitAsConstantBuffer(0, D3D12_SHADER_VISIBILITY_ALL);
graphic.rootSig.Finalize(L"Graphic", D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

// configure the vertex inputs
D3D12_INPUT_ELEMENT_DESC vertElem[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, offsetof(particles::item, pos), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "COLOR",    0, DXGI_FORMAT_R8G8B8A8_UNORM,  0, offsetof(particles::item, clr), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "VELOCITY", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, offsetof(particles::item, vel), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
    { "AGE",      0, DXGI_FORMAT_R32_FLOAT,       0, offsetof(particles::item, age), D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
};

// query the MiniEngine formats
DXGI_FORMAT ColorFormat = g_SceneColorBuffer.GetFormat();
DXGI_FORMAT DepthFormat = g_SceneDepthBuffer.GetFormat();

// configure the PSO
graphic.pso.SetRootSignature(graphic.rootSig);
graphic.pso.SetRasterizerState(RasterizerDefault);
graphic.pso.SetBlendState(BlendDisable);
graphic.pso.SetDepthStencilState(DepthStateReadWrite);
graphic.pso.SetInputLayout(_countof(vertElem), vertElem);
graphic.pso.SetPrimitiveTopologyType(D3D12_PRIMITIVE_TOPOLOGY_TYPE_POINT);
graphic.pso.SetRenderTargetFormats(1, &ColorFormat, DepthFormat);
graphic.pso.SetVertexShader(g_pGraphicVS, sizeof(g_pGraphicVS));
graphic.pso.SetPixelShader(g_pGraphicPS, sizeof(g_pGraphicPS));
graphic.pso.Finalize();

// set view and projection matrices
DirectX::XMStoreFloat4x4(
    &graphic.view, DirectX::XMMatrixLookAtLH({ 0.f,-4.5f,2.f }, { 0.f,0.f,-0.3f }, { 0.f,0.f,1.f }));
DirectX::XMStoreFloat4x4(&graphic.proj, DirectX::XMMatrixPerspectiveFovLH(DirectX::XMConvertToRadians(45.f), float(g_DisplayWidth)/g_DisplayHeight, 0.01f, 1000.f));

Rendering Code

// render graphics
GraphicsContext& context = GraphicsContext::Begin(L"Scene Render");

// transition
context.TransitionResource(readBuffer, D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
context.TransitionResource(g_SceneColorBuffer, D3D12_RESOURCE_STATE_RENDER_TARGET);
context.TransitionResource(g_SceneDepthBuffer, D3D12_RESOURCE_STATE_DEPTH_WRITE, true);

// configure
context.SetRootSignature(graphic.rootSig);
context.SetViewportAndScissor(0, 0, g_SceneColorBuffer.GetWidth(), g_SceneColorBuffer.GetHeight());
context.SetRenderTarget(g_SceneColorBuffer.GetRTV(), g_SceneDepthBuffer.GetDSV());

// clear
context.ClearColor(g_SceneColorBuffer);
context.ClearDepth(g_SceneDepthBuffer);

// update
struct { DirectX::XMFLOAT4X4 MVP; } vsConstants;
DirectX::XMMATRIX view = DirectX::XMLoadFloat4x4(&graphic.view);
DirectX::XMMATRIX proj = DirectX::XMLoadFloat4x4(&graphic.proj);
DirectX::XMMATRIX mvp = view * proj;
DirectX::XMStoreFloat4x4(&vsConstants.MVP, DirectX::XMMatrixTranspose(mvp));
context.SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);

// draw
context.SetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_POINTLIST);
context.SetVertexBuffer(0, readBuffer.VertexBufferView());
context.SetPipelineState(graphic.pso);
context.Draw(particles::nParticles, 0);

// finish
context.Finish();

CPU

Pipelining and Rendering

Now that we have the 69 lines of rendering code, let us think about what we need to do for GPU compute. My thinking is that we should have two buffers of particle data and render one while the other is being updated for the next frame by GPU compute. You can see this in the code:

static constexpr int    Frames = 2;
StructuredBuffer        vertexBuffer[Frames];
int                     current = 0;

and in the rendering code like this:

    // advance pipeline
    auto& readBuffer = vertexBuffer[current];
    current = (current + 1) % Frames;
    auto& writeBuffer = vertexBuffer[current];
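The resulting ping-pong between the two buffers looks like this (a sketch of the schedule, not code from the sample):

// frame N   : render reads vertexBuffer[0] | compute writes vertexBuffer[1]
// frame N+1 : render reads vertexBuffer[1] | compute writes vertexBuffer[0]
// frame N+2 : render reads vertexBuffer[0] | compute writes vertexBuffer[1]
//
// The resource transitions in the render and compute code keep the GPU from
// reading a buffer that the compute shader is still writing.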

GPU

Introduction

Now we have to think about how to implement the particle system on the GPU. An important part of the CPU algorithm is a sorting/partitioning step after update() that collects all the dead particles at the end of the pool, making it a fast operation to emit new ones. At first you might think we need to replicate that step on the GPU, which is technically possible via a bitonic sort (MiniEngine actually contains an implementation of this algorithm). On further thought, though, you may realize the sort is only required for fast looping over the particle pool when emitting. On the GPU that loop does not exist; it is replaced by one GPU thread per particle, all processing the pool in parallel (remember, the title of this article is Parallel Processing with Direct3D* 12).

Knowing this, all that is actually needed is for each thread to have access to a global count of particles to emit each frame. Each thread examines its own particle to see if it is available for emitting, and if so, atomically gets-and-decrements the global counter. If the thread gets back a non-negative value, it goes ahead and emits the particle for further processing; otherwise it does nothing.

Atomic Counter

If only Direct3D 12 had an atomic counter available and easily accessible from a compute shader … Hmmm.

The compute shader RWStructuredBuffer data type has an optional hidden counter variable. Examining the MiniEngine source code reveals that it implements this optional feature and wraps it in a convenient member function:

void CommandContext::WriteBuffer( GpuResource& Dest, size_t DestOffset, const void* BufferData, size_t NumBytes )

This makes the compute C++ rendering code straightforward, as shown:

// render compute
ComputeContext& context = ComputeContext::Begin(L"Scene Compute");

// update counter
context.WriteBuffer(writeBuffer.GetCounterBuffer(), 0, &flow.num2Create, 4);

// compute
context.SetRootSignature(compute.rootSig);
context.SetPipelineState(compute.pso);
context.TransitionResource(readBuffer, D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE);
context.TransitionResource(writeBuffer, D3D12_RESOURCE_STATE_UNORDERED_ACCESS);
context.SetConstant(0, particles::nParticles, 0); // root constant: particle count at offset 0
context.SetConstant(0, flow.dt,               1); // root constant: frame delta-time at offset 1
context.SetDynamicDescriptor(1, 0, readBuffer.GetSRV());
context.SetDynamicDescriptor(2, 0, writeBuffer.GetUAV());
context.Dispatch((particles::nParticles+255)/256, 1, 1); // round up so every particle gets a thread
context.InsertUAVBarrier(writeBuffer);

// finish
context.Finish();

The corresponding compute shader code:

[numthreads(256,1,1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    // initialize random number generator
    rand_seed(DTid.x);

    // get the particle
    Particle p = I[DTid.x];

    // compute the particle if it's alive
    if (p.age >= 0)
        compute(p);

    // otherwise try to claim a slot from this frame's emit budget:
    // DecrementCounter() returns the decremented value, so a set sign
    // bit (extracted by >>31) means the budget was already exhausted
    else if (!(O.DecrementCounter()>>31))
        initialize(p);

    // write the particle
    O[DTid.x] = p;
}

Results

Just like that, we have a compute version of our particle system in only an additional 20 lines of C++ code.

[Figure: App]

