This blog series is a part of the write-up assignments of my Real-Time Game Rendering class in the Master of Entertainment Arts & Engineering program at University of Utah. The series will focus on C++, Direct3D 11 API and HLSL.

In this post, I’ll talk about how I optimized my current rendering pipeline to minimize graphics API calls and reduce some memory usage.

Render Command

Before, we were storing arrays of pointers to meshes and effects (shaders) in our “buckets” on the application thread and the rendering thread. However, there is a good chance that multiple meshes will share the same effect, or that we will repeatedly draw the same mesh onto the screen, which means that a lot of the pointers in the “buckets” will actually be identical!

We can alleviate this issue by using handle IDs instead. Besides that, we can also encode the information needed into a RENDER COMMAND that represents different kinds of operations, such as binding an effect and then drawing, or clearing the image buffer. For now, I’m just gonna start with only the draw command. Below shows how I structure the render commands.

Starting from the least significant bit, bits 0 – 7 are the mesh handle ID; this limits the total mesh count to 256, but it can easily be widened in the future. Bits 8 – 15 are the z depth of the mesh in the projected space, which I will use to sort the rendering order from closer to farther. Bits 16 – 23 are the effect handle ID, and bits 60 – 63 are the render command type, which allows 16 different commands for now.

UPDATE (01/17):
After talking to my teacher, I decided that I am going to store the index of the mesh transform also inside the render command so I don’t have to sort 3 arrays.

// 64 bits for our render command
typedef uint64_t RENDER_COMMAND;

// render command type
enum class eRenderCommandType : uint8_t {
    Draw = 2
};

enum class eRenderCommandBitMasks : uint64_t {
    // Mesh transform Id has 8 bits, (0 - 7)
    MeshTransformIDInBuckets = 0xFF,

    // Mesh Handle Id has 8 bits, (8 - 15)
    MeshHandleID = 0xFF00,

    // Mesh z depth has 8 bits, (16 - 23)
    MeshZDepth = 0xFF0000,

    // Effect Handle ID has 8 bits, (24 - 31)
    EffectHandleID = 0xFF000000,

    /*
    Some other stuff
    */

    // Render command type has 4 bits, 60 - 63
    RenderCommandType = 0xF000000000000000
};

// The values below are used as bit shifts
enum class eRenderCommandBitShifts : uint8_t {
    // Mesh transform Id has 8 bits, (0 - 7)
    MeshTransformIDInBuckets = 0,

    // Mesh Handle Id has 8 bits, (8 - 15)
    MeshHandleID = 8,

    // Mesh z depth has 8 bits, (16 - 23)
    MeshZDepth = 16,

    // Effect Handle ID has 8 bits, (24 - 31)
    EffectHandleID = 24,

    /*
    Some other stuff
    */

    // Render command type has 4 bits, 60 - 63
    RenderCommandType = 60
};

Now, I can just use simple bit-wise &amp;, >>, and << to encode/decode the render commands. Let’s say I have an effect and a mesh, both with handle ID 1; the encoded command is shown below.

RenderCommandHexView.PNG
Encoding Render Command

Sorting

Now that I’ve successfully encoded the data I need into render commands, it is trivial to just treat them as uint64_t and call sort.

void eae6320::Graphics::SortRenderCommands(RENDER_COMMAND* i_renderCommands, uint8_t i_arraySize) {
    // treat like uint and sort
    std::sort(i_renderCommands, i_renderCommands + i_arraySize);
}

UPDATE (01/17):

The sorting method that I am showing below was the approach that I took before. Now that I have transform indices also encoded inside the render command, I don’t have to do this anymore.

HOWEVER! I am storing the position and orientation data inside my buckets, and they won’t match anymore after the render commands are sorted into a new order.

To solve this problem, I can either also encode the indices of the position and orientation into my commands, or I can sort those two arrays. After giving it some thought, I figured the render commands won’t have bits to spare for that once more data is incorporated, so I decided to sort those two arrays as well.

So I created some template functions and modified my sorting functions to take those into account.

void eae6320::Graphics::SortRenderCommands(RENDER_COMMAND* i_renderCommands, uint8_t i_arraySize) {
    // get the permutation that would sort the render commands
    std::vector<size_t> permutation = eae6320::Math::SortPermutation(i_renderCommands, static_cast<size_t>(i_arraySize), std::less<RENDER_COMMAND>());

    // treat like uint and sort
    std::sort(i_renderCommands, i_renderCommands + i_arraySize);

    EAE6320_ASSERT(s_dataBeingRenderedByRenderThread);
    // Apply the same permutation to the rotation and position arrays at the same time
    eae6320::Math::ApplyPermutationInPlace(s_dataBeingRenderedByRenderThread->meshEffectPairPositionArrayPtr_perFrame.data(),
        s_dataBeingRenderedByRenderThread->meshEffectPairRotationArrayPtr_perFrame.data(), static_cast<size_t>(i_arraySize), permutation);
}

template <typename T, typename comp>
std::vector<size_t> SortPermutation(const T * const i_arrayPtr, size_t i_arraySize, comp i_comp)
{
    std::vector<size_t> p(i_arraySize);
    std::iota(p.begin(), p.end(), 0);
    std::sort(p.begin(), p.end(),
        [&](size_t i, size_t j) { return i_comp(i_arrayPtr[i], i_arrayPtr[j]); });
    return p;
}

template <typename T, typename K>
void ApplyPermutationInPlace(T * i_arrayPtr1, K* i_arrayPtr2, size_t i_arraySize, const std::vector<size_t>& i_permutation)
{
    std::vector<bool> applied(i_arraySize);
    for (size_t i = 0; i < i_arraySize; ++i) {
        if (applied[i]) continue;
        applied[i] = true;
        size_t prev_j = i;
        size_t j = i_permutation[i];
        while (i != j) {
            std::swap(i_arrayPtr1[prev_j], i_arrayPtr1[j]);
            std::swap(i_arrayPtr2[prev_j], i_arrayPtr2[j]);
            applied[j] = true;
            prev_j = j;
            j = i_permutation[j];
        }
    }
}

Fortunately, computing the permutation is dominated by the same O(N log N) sort we already pay for, and applying it in place is only O(N); both only get called once per frame, so it's not too bad.

Results

After sorting the render commands properly, you can see that the rendering system draws the meshes with the same effect first, ordered by their depth, and then draws the meshes with the other effect.

If I rotate the camera to a new position (where the wolf and deer are in the back), you can see that the drawing order also changes, since the depth of each mesh is now different.

FrontView.gif
(Front View Draw Order)
BackView.gif
(Back View Draw Order)

Optimize D3D API Calls

Now that I have grouped the draw calls of the meshes with the same effect together, I can start reducing the API calls to the graphics library: if I’ve already bound effect A in the last draw call, I don’t need to bind it again for the next 5 meshes if they all use effect A as well! Looking at the graphics debugger, we can see that it’s also telling us that we’re making redundant calls!

unneccessaryapicallsgpuwork
(Graphics debugger, GPU work before removing redundant calls)

At first, I was thinking about calling VSGetShader and PSGetShader before setting the shaders to avoid the situation, but then I realized that those ARE ALSO API CALLS. What was I thinking?

The approach that I’m taking right now is storing the previously bound effect and the previously drawn mesh in my Graphics namespace. Now that I have removed the redundant calls, let’s look at the GPU work again. You can see that at draw 83, it is still using the pixel shader that was set in draw 59!

reducedapicallsgpuwork
(Graphics debugger, GPU work after removing redundant calls)

Even Further

If we look at the GPU work again, there are still redundant calls, mainly for binding render states and setting the primitive topology (I’m only using triangle lists here).

RenderStatePrimitiveRedundant.PNG
(GPU work with unnecessary render states binding)

I modified the render state class so that it’s also loaded through an asset handle now and is reference counted! With this, I can easily check whether the render states are actually the same between different effect-binding calls, and just skip the binding if it isn’t necessary! You can see below that those states are no longer bound again if they were already set!

CleanerGPUWork.PNG
(GPU work after cleaning up render states)

Let’s take a look at the timeline and compare what we had before and what we have now.

UnneccessaryAPICALLSTimelineHIGHLIGHT.png
(before)
CleanerGPUTimelineHIGHLIGHT.png
(After)

You can see that we got rid of repeatedly setting vertex/index buffers and input layouts (green boxes), and of setting the shaders along with the render state (red boxes). We could still do more by eliminating the IASetPrimitiveTopology calls, but I’m gonna stop right here for now.