Preface
This blog post should be taken lightly: it isn't a work of research and presentation, just a simple post about the renderer I've been working on.
I have spent some time researching and working on the implementation, but the topics described are complex and full of subtleties, and I don't yet have the experience to grasp all of them.
Many advanced details might be missing; for those, I'd advise referring to the better-constructed bodies of work available online.
That said, any feedback or advice is appreciated.
Also, most of my experience is with OpenGL and Vulkan, so some descriptions may not be accurate for DirectX.
Why and what
As technology moved forward, GPUs were tasked with more complex and heavier workloads, driving the evolution of both software and hardware.
This demand brought low-level graphics APIs (Vulkan, DX12, Metal) onto the stage, which gave more control to the developer but also offloaded much of the heavy lifting the driver used to do to output something to the screen.
That work is now our responsibility, which increases setup work and state tracking, aka verbosity.
For small-scope renderers, where the frame is fairly deterministic, it is acceptable to manage the frame setup manually and submit when needed, but for a bigger-scope renderer it's mandatory to implement some sort of abstraction layer that handles all the boilerplate for you.
Render graphs are a common solution to this problem.
This central system helps reduce repeated code: work is defined as a task that can be reused many times (a RenderPass, for example).
It helps with resource management as well: lifetimes, barriers and layouts.
A resource often lives only for part of a frame, so keeping its memory busy for the whole frame is a waste; we can reuse that memory for another resource later in the frame. This mechanism is called aliasing, and it is really appealing for production-ready renderers, where VRAM is limited and contended.
Aliasing can happen at the resource level and at the memory level. Resource-level aliasing uses virtual resources that all reference the same underlying instanced resource. Memory-level aliasing is possible through the APIs; it's more flexible but requires a more complex implementation.
Render Graph
A render graph is an abstraction layer over raw command list management: we give it a list of tasks and their dependencies, and it encodes them into command lists readable by the GPU.
At its core, it's a directed acyclic graph that represents the renderer's execution state: the nodes are operations the GPU has to execute (copy image, render pass, transfer operations, etc...), while the edges are the resources used by those operations.
It then applies graph theory to extract the information it needs to best optimize the workload for the current frame, such as reordering or trimming unneeded tasks to improve performance.
It can also use the resources' lifetime information to allocate memory smartly during the frame's lifetime.
Info: the main operation once you have a complete graph is traversal: you take the output node and walk backwards, exploring all connected nodes; that gives you the execution list.
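As a rough illustration of that backward walk (hypothetical names, not the renderer's actual code), a depth-first, post-order visit starting at the output node yields a valid execution list, with every task's dependencies placed before it:

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

// deps[i] lists the tasks that task i depends on (through its input
// resources). We walk backwards from the output node; post-order
// visitation emits dependencies before the tasks that use them.
std::vector<int> executionList(int output,
                               const std::vector<std::vector<int>>& deps) {
    std::vector<int> order;
    std::unordered_set<int> visited;
    std::function<void(int)> visit = [&](int node) {
        if (!visited.insert(node).second) return;  // already explored
        for (int dep : deps[node]) visit(dep);     // dependencies first
        order.push_back(node);                     // then the node itself
    };
    visit(output);
    return order;
}
```

Tasks not reachable from the output node are never visited, which is exactly the "trimming unneeded tasks" mentioned above.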
In my current renderer, passes are defined via code, intrinsically ordered by the developer, so no reordering is applied since it's not yet necessary.
This is true for resources as well: since their internal indexes are passed when we define a task, only already-registered resources can be used, so there will never be a reference to a resource registered later, unless we use an invalid index. More about that below.
Components
Resources
There are two types of base resources (textures and buffers), which correspond to what the APIs broadly define as resources.
Those base resources themselves can be classified into two types: Transient and Static.
Static resources are declared externally and fully managed by the ResourceManager. Their lifetime is tied to the scene or the application itself, so the graph can treat them as completely static and ignore them.
Transient resources live only within a single frame and are dynamic (they will be modified during the frame), so the graph has partial ownership.
Their allocation is still handled by the ResourceManager, but the graph handles their lifetime and usage.
ResourceIndex createImage(
    std::string_view name,
    ResourceManager::ImageDescription desc,
    uint8_t swapchainRatio = 0
);
ResourceIndex createBuffer(
    std::string_view name,
    ResourceManager::BufferDescription desc
);
This will register the resources and bind them to a ResourceIndex, which is used to keep track of the resources in the context of the RenderGraph.
They are not allocated at this stage since we are still compiling the graph, so we will just keep the ResourceIndex and the resource description in memory.
swapchainRatio is an additional parameter for dynamic-resolution images (mostly render targets).
It defines the image resolution as the following function:
Let sR = swapchainRatio, res = image resolution;
If sR = 0: res = static value.
If sR > 0: res = sR × swapchainResolution.
If sR < 0: res = (1 / sR) × swapchainResolution.
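A minimal sketch of this rule, with two assumptions on my part: the ratio is read as a signed value (the sR < 0 case implies one, even though the declaration above uses uint8_t), and the negative case divides the swapchain resolution by the magnitude of sR:

```cpp
#include <cstdint>

// Hypothetical helper, not the renderer's actual function.
// sR = 0: fixed size; sR > 0: multiple of the swapchain resolution;
// sR < 0: fraction of the swapchain resolution (e.g. -2 = half res).
uint32_t imageExtent(int8_t sR, uint32_t staticValue,
                     uint32_t swapchainExtent) {
    if (sR == 0) return staticValue;
    if (sR > 0)  return swapchainExtent * static_cast<uint32_t>(sR);
    return swapchainExtent / static_cast<uint32_t>(-sR);
}
```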
Tasks
A task is a generic piece of work that runs within the graph.
As envisioned, it has input and output resources (READ / WRITE) and is self-contained, so any dependency on other tasks exists only through resources.
The tasks are defined as std::functions, allowing us to use lambdas and their concise grammar for simpler task definition, and to be able to capture variables that might be needed during execution.
Capturing variables also reduces the context size.
The performance hit for using lambdas is negligible for now.
using Task = std::function<void(TaskContext&)>;

void addTask(
    std::string_view name,
    TaskType type,
    std::vector<ResourceDependency> inputResources,
    std::vector<ResourceDependency> outputResources,
    Task task
);
TaskType
The task type parameter is groundwork for a future, more efficient handling of tasks.
Modern desktop GPUs give us access to multiple queues, so splitting the work across them will bring some performance improvement. For now we simply keep track of which type of task we are handling.
enum class TaskType {
    CPU,
    Graphic,
    Compute,
    Transfer,
};
This also applies to task encoding: since tasks are self-contained, it's possible to split the encoding process across threads without synchronization problems; we would only need to collect all the secondary buffers and merge them, in order, into the primary buffer. This too is left for the future, when performance problems start to arise.
CPU is defined as a task type because there could be work done on the CPU that needs synchronization with GPU work; one notable example is culling.
ResourceDependency
ResourceDependency is a pair describing a single usage of a resource by a task. This helps with state tracking of the resource and with building the memory barriers accordingly.
using ResourceDependency = std::pair<ResourceIndex, ResourceUsage::Type>;
enum class Type {
    SampledRead,
    ShaderWrite,
    ShaderRead,
    ColorAttachmentWrite,
    DepthStencilRead,
    DepthStencilWrite,
    VertexBuffer,
    IndexBuffer,
    UniformBuffer,
    StorageBufferWrite,
    StorageBufferRead,
    TransferSrc,
    TransferDst,
};
Some usages might be missing or inaccurate; they will change as the renderer evolves and more functionality is added.
RenderGraphBuilder
My current renderer targets static scenes, so there won't be much variance in passes and operations between frames; if the creation workload is slim, the graph can be recompiled whenever needed at runtime.
A separate class has been defined for this operation: RenderGraphBuilder.
The task and resource registration functions are part of the builder.
It additionally defines GraphData build(), which queries the graph to define the barriers between tasks and flattens the graph into a linear vector.
The barriers and layout changes are only partially compiled, because the resources are not allocated yet, so we don't have direct access to the resource handles (vkImage/vkBuffer).
TLDR:
- You initialize the RenderGraphBuilder
- You register resource A
- You register task A1, with usage of A
- You set A1 as output
- You build, and get the data needed to initialize RenderGraph (resources to initialize, vector of tasks, barriers)
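The steps above can be sketched with a toy builder (hypothetical and heavily simplified; the real one also records descriptions, usages and barriers). It also shows why an invalid index is the only way to reference a not-yet-registered resource: createImage hands back the index, so tasks can only name what already exists:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using ResourceIndex = uint32_t;
using Task = std::function<void()>;  // the real one takes a TaskContext&

struct MiniBuilder {
    std::vector<std::string> resources;  // descriptions omitted for brevity
    std::vector<std::pair<std::string, Task>> tasks;

    // Registers a resource and hands back its index.
    ResourceIndex createImage(std::string name) {
        resources.push_back(std::move(name));
        return static_cast<ResourceIndex>(resources.size() - 1);
    }
    // Registers a task with its resource dependencies.
    void addTask(std::string name, std::vector<ResourceIndex> /*in*/,
                 std::vector<ResourceIndex> /*out*/, Task task) {
        tasks.emplace_back(std::move(name), std::move(task));
    }
    // build() would also derive the barriers; here we just flatten the tasks.
    std::vector<std::pair<std::string, Task>> build() { return tasks; }
};
```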
RenderGraph
The RenderGraph is then constructed from the data collected by the RenderGraphBuilder.
During the construction step we allocate all the resources that were previously registered, and we finalize the compilation of barriers.
The RenderGraph is then called every frame: it allocates the command buffer, records the barriers and executes the tasks. The current implementation also handles presentation.
The heavy lifting has already been done by the builder, so this class works in a pretty straightforward way: it just iterates over the vector of tasks and runs the lambdas, passing the context.
struct TaskContext {
    vk::CommandBuffer& commandBuffer;
    std::vector<ResourceIndex>& inputs;
    std::vector<ResourceIndex>& outputs;
    std::unordered_map<ResourceIndex, ImageHandle>& images;
    std::unordered_map<ResourceIndex, BufferHandle>& buffers;
    ResourceManager& resourceManager;
    MaterialManager& materialManager;
};
The current implementation is really simple and naive, but needed for initial flexibility.
In a future version the context will be cleaned up, since it's not good that a task (supposedly a unit of independent work) has direct access to the core systems.
We know which resources are needed by the task, so it's easy to collect them and just pass them.
One notable detail is handling resolution changes.
The graph has full responsibility for the frame, so it listens for any VK_ERROR_OUT_OF_DATE_KHR, reinitializes all the resolution-dependent images, and recompiles all the memory barriers.
Footnote: Now and the future
This system is a work in progress; as the renderer is still young, it's unclear which features will be needed as it evolves.
For the current workload the implementation is good enough, although it needs some code cleaning.
I will describe, though, a few quality-of-life improvements that are common in render graphs and are already planned to be implemented (hopefully) soon:
- Smarter RenderGraphBuilder:
The current implementation just follows what the developer has defined.
That works for a smaller scope, where every task is essential, but as the renderer grows there will be different implementations of the same feature, users who decide which features to enable, and hardware differences that can prevent some features from being enabled.
A feature-flag system will be required: it determines which tasks are filtered out, and the RenderGraphBuilder has to respond accordingly, skipping those tasks and their resources since they are not needed.
- Device-specific features:
As described before, different devices have different capabilities. It would be too much work to keep track of each capability for each task and handle every edge case, so a task should request capabilities from the hardware, and the render graph should be responsible for enabling or disabling such tasks depending on the required capability, providing a fallback if there is one. This is also true for multi-queue versus single-queue devices.
- Configurable resources/tasks:
The current tasks are defined in code, which means every change requires a recompile.
Since the renderer focuses on flexibility and ease of use, it would be better to define them in configuration files (maybe with a scripting language like Lua), so they live outside the core engine and can eventually support hot reload.
- Multithreading / async:
Multithreading is needed in a bigger engine because the frame is really complex, so encoding can take a big chunk of the available CPU time. The current renderer doesn't require that performance yet, but since the tasks are already self-contained pieces of work, implementing it should be somewhat easy, improving performance on less powerful devices.
This applies to queues as well: modern GPUs expose multiple independent queues (Graphics, Transfer, Compute), so tasks can be enqueued on parallel queues, synchronized properly with semaphores, to use all of the GPU's resources more efficiently and improve performance.
- Improved resource management:
Resources currently live for the entirety of a frame, which is fine for simple tasks, but temporal-aware effects have to store information across multiple frames (the resource has to live longer).
We also lose the performance benefits we could get from aliasing, where the same allocation is used by multiple (non-overlapping) resources within a frame.
Both call for a better lifetime-management solution.
Conclusion
Graphics programming is hard: it pushes the limits of high-performance computing with genuinely difficult problems.
It takes the best talent and work of thousands of dedicated, passionate people to get to where we are today.
This is but a scratch on the surface of renderers and the incredible technology behind them, yet I hope it will be helpful for those who are just starting to dive into this incredible world.
Thank you for reading this blog, and feel free to leave any comment or criticism.