Inside 3 Weeks of Vulkan Engine Dev: Render Graphs, Descriptors & Deterministic Frame Pacing

  • Author: Cat Game Research Team
  • Date: October 18, 2025
  • Milestone: M4 Phase 2+ - Advanced Rendering Infrastructure
  • Technical Level: Intermediate to Advanced

In three intense weeks I refactored render timing, integrated VMA into the render graph, and built a test-driven descriptor allocator—here’s what changed, why it matters, and how it shapes the next phase.

Quick context and acronyms

  • VMA = Vulkan Memory Allocator (used for suballocation and budget tracking)
  • TDD = Test-Driven Development
  • VSync = vertical sync (present-mode vsync flag)

This article summarizes work from Oct 6 → Oct 18, 2025: design choices, implementation notes, test strategy, measured outcomes, and the immediate next steps for M4 Phase 2.

Key outcomes in brief: integrated a DAG-based RenderGraph with a VMA-backed allocator, shipped a spec-first descriptor allocator with a focused unit suite, and tightened frame pacing and startup logging. Tests added during M4 include focused coverage (73 assertions for descriptor allocator tests; ~90 assertions across render-graph/VMA tests), contributing to repository totals of ~1,125 assertions across ~160 test cases. Practically, these changes improved startup traceability, reduced frame-pacing variance for profiling, and simplified cross-platform startup.

Motivation & goals

Over the past three weeks, we pushed major updates to our Vulkan-based engine — Void Frontier. From render-graph memory aliasing to descriptor allocator internals, here’s a narrative of what we built and why it matters.

We spent that time turning an experimental renderer into a reproducible subsystem. The issues were practical and tightly coupled: subtle lifetime bugs that caused crashes, unpredictable GPU memory during asset streaming, and ad-hoc descriptor bookkeeping that leaked resources and slowed iteration. Our approach was pragmatic: write a spec, implement the minimal change to satisfy it, and iterate.

1) Render & Timing — clearer startup and deterministic pacing

Summary: Reworked present-mode selection, startup logging, and the frame-limiter so developers and CI see the chosen present mode and VSync state reliably.

What changed

  • Present mode and VSync are logged at device init for reproducible startup traces
  • TimingSystem now reads frames.max_fps (0 = unlimited) and enforces a deterministic frame limiter for capture and profiling
  • Startup logs use the new microsecond-precision Logger for consistent timestamps

Why it matters

  • Deterministic frame pacing reduces jitter during profiling and automated capture; clearer logs speed debugging of present-mode mismatches.

Details

We consolidated presentation and timing configuration so developers see the chosen present mode and VSync state at startup, and the TimingSystem now reads frames.max_fps (treating 0 as unlimited). These changes make frame pacing deterministic for profiling and capture.
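
To make the limiter's behavior concrete, here is a minimal sketch of a sleep-until-deadline pacer keyed off frames.max_fps. The class name and members are illustrative, not the engine's actual TimingSystem API; it assumes a steady clock and treats 0 as unlimited, as the config does.

#include <chrono>
#include <thread>

// Minimal frame-limiter sketch (illustrative, not the engine's TimingSystem).
// `max_fps` mirrors frames.max_fps from config/render.toml; 0 = unlimited.
class FrameLimiter {
public:
    explicit FrameLimiter(double max_fps)
        : period_(max_fps > 0.0 ? std::chrono::duration<double>(1.0 / max_fps)
                                : std::chrono::duration<double>::zero()),
          next_deadline_(std::chrono::steady_clock::now()) {}

    // Call once per frame after rendering: sleep until the next deadline so
    // frame starts stay evenly spaced regardless of per-frame workload.
    void wait() {
        if (period_ == std::chrono::duration<double>::zero()) return;  // unlimited
        next_deadline_ += std::chrono::duration_cast<
            std::chrono::steady_clock::duration>(period_);
        auto now = std::chrono::steady_clock::now();
        if (next_deadline_ < now) next_deadline_ = now;  // don't accumulate debt
        std::this_thread::sleep_until(next_deadline_);
    }

private:
    std::chrono::duration<double> period_;
    std::chrono::steady_clock::time_point next_deadline_;
};

Sleeping to an absolute deadline rather than a relative duration is what keeps pacing deterministic: a slow frame eats into its own budget instead of shifting every subsequent frame.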

What is VSync and why it matters

VSync (vertical sync) controls whether the GPU presents frames synchronized to the display's vertical blank. Its primary purpose is to avoid screen tearing (parts of two or more frames visible within a single refresh), but enabling or disabling it also affects latency, smoothness, and power use.

Key present modes (Vulkan terminology)

  • FIFO (VK_PRESENT_MODE_FIFO_KHR) — the standard vsync mode: no tearing and predictable presentation, but it can add input/display latency when the renderer outpaces the display refresh
  • MAILBOX (VK_PRESENT_MODE_MAILBOX_KHR) — low-latency vsync with buffering: avoids tearing while letting the renderer replace queued frames (good for high-framerate, low-latency workflows)
  • IMMEDIATE (VK_PRESENT_MODE_IMMEDIATE_KHR) — present as soon as possible: lowest latency but allows tearing (a selection sketch follows this list)
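
As an illustration of how a renderer typically picks among these (a sketch under assumptions, not necessarily our exact selection policy), query the surface's supported modes and fall back to FIFO, which the Vulkan spec guarantees:

#include <vulkan/vulkan.h>
#include <algorithm>
#include <vector>

// Sketch: pick a present mode from what the surface supports.
// FIFO is the only mode the spec guarantees, so it is the fallback.
VkPresentModeKHR choose_present_mode(VkPhysicalDevice gpu, VkSurfaceKHR surface, bool vsync) {
    uint32_t count = 0;
    vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, nullptr);
    std::vector<VkPresentModeKHR> modes(count);
    vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, modes.data());

    auto has = [&](VkPresentModeKHR m) {
        return std::find(modes.begin(), modes.end(), m) != modes.end();
    };

    if (!vsync && has(VK_PRESENT_MODE_IMMEDIATE_KHR))
        return VK_PRESENT_MODE_IMMEDIATE_KHR;  // lowest latency, may tear
    if (has(VK_PRESENT_MODE_MAILBOX_KHR))
        return VK_PRESENT_MODE_MAILBOX_KHR;    // no tearing, low latency
    return VK_PRESENT_MODE_FIFO_KHR;           // always available
}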

Trade-offs and practical notes

  • Tearing vs latency: enabling VSync (FIFO/MAILBOX) prevents tearing but may increase perceived input/display latency. IMMEDIATE lowers latency but can show tearing.
  • Frame limiter interaction: frames.max_fps is used when VSync is disabled to control CPU/GPU load and keep captures deterministic.
  • Validation: visually test with a fast-moving scene to detect tearing; use the startup log (present mode + VSync flag) to verify configuration in CI or on dev machines.

How we configure and log it

  • Config: set frames.vsync = true|false and frames.max_fps in config/render.toml (see the snippet below).
  • Logging: at device init the RenderSystem logs the chosen present mode and the VSync boolean so CI logs and developer traces show the active behavior.
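
For reference, the relevant keys might look like this (values are illustrative):

# config/render.toml (illustrative values)
[frames]
vsync = false   # true: FIFO/MAILBOX presentation; false: IMMEDIATE allowed
max_fps = 120   # frame limiter cap when VSync is off; 0 = unlimited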

See also: the fix(vsync) commit c4e4c1b which tightened present-mode selection and startup logging.

2) Memory & RenderGraph — VMA integration and transient aliasing

Summary: Integrated VMA into the RenderGraph to centralize allocation, expose budgets, and enable safe aliasing of transient attachments.

What changed

  • VMA allocator initialized inside RenderSystem::initialize_vma() for explicit lifecycle control
  • RenderGraph records first/last use of resources and enables aliasing for transient attachments where safe
  • Helper allocation paths added for device-local images and staging buffers to reduce boilerplate

Why it matters

  • Centralized allocation reduces fragmentation, enables budget-aware behavior, and makes long-run memory usage auditable.

Details

We integrated VMA (Vulkan Memory Allocator) to provide controlled suballocation, budget enforcement, and incremental defragmentation. The RenderGraph (DAG-based) now records resource lifetimes and enables aliasing for transient attachments. Together these systems reduce memory pressure and make resource lifetimes auditable.

What is VMA and why it matters

VMA (Vulkan Memory Allocator) is a widely used helper library that sits on top of Vulkan's raw memory APIs and provides suballocation, pooling, and allocation strategies that make GPU memory management tractable in real projects.

Why VMA matters

  • Suballocation: VMA lets us carve many small buffers and images from larger device memory allocations, which reduces wasted space and fragmentation compared with creating one allocation per resource.
  • Budgets & tracking: VMA exposes per-heap/device memory usage so the engine can refuse or defer large allocations when budgets are exceeded (a minimal sketch follows this list).
  • Defragmentation: VMA supports moving allocations and defragmenting memory when fragmentation grows, which is critical for long-running sessions and streaming workloads.
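
As a minimal sketch of the budget point above (the 90% threshold is an illustrative policy, not the engine's actual rule), a pre-allocation check against VMA's reported heap budgets might look like:

#include <vk_mem_alloc.h>
#include <vector>

// Sketch: defer a large allocation if it would push a device-local heap
// past ~90% of the budget VMA reports. Conservative: checks every
// device-local heap, not just the one the allocation would land in.
bool can_allocate(VmaAllocator allocator, VkDeviceSize bytes) {
    const VkPhysicalDeviceMemoryProperties* props = nullptr;
    vmaGetMemoryProperties(allocator, &props);

    std::vector<VmaBudget> budgets(props->memoryHeapCount);
    vmaGetHeapBudgets(allocator, budgets.data());

    for (uint32_t i = 0; i < props->memoryHeapCount; ++i) {
        if (!(props->memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT))
            continue;
        if (budgets[i].usage + bytes > budgets[i].budget * 9 / 10)
            return false;  // over budget: caller should defer or stream out
    }
    return true;
}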

How we use it in the engine

  • Initialization: RenderSystem::initialize_vma() creates the allocator once the Vulkan device is available, ensuring correct lifecycle ordering (sketched below).
  • Allocation helpers: we added convenience functions for common patterns (device-local images, staging buffers) so callers don't repeat boilerplate usage and memory-flag combinations.
  • RenderGraph wiring: when a transient resource is declared, the RenderGraph asks VMA for a suitable allocation and records first/last use for safe aliasing.
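
A minimal sketch of those first two steps using VMA's public API (field values are illustrative; the engine's initialize_vma() wraps the same calls with error handling and dynamic function loading):

#define VMA_IMPLEMENTATION   // in exactly one translation unit
#include <vk_mem_alloc.h>

// Create the allocator once instance/device exist (lifecycle ordering).
VmaAllocator create_allocator(VkInstance instance, VkPhysicalDevice gpu, VkDevice device) {
    VmaAllocatorCreateInfo info{};
    info.instance = instance;
    info.physicalDevice = gpu;
    info.device = device;
    info.vulkanApiVersion = VK_API_VERSION_1_2;  // match the created device

    VmaAllocator allocator = VK_NULL_HANDLE;
    vmaCreateAllocator(&info, &allocator);
    return allocator;
}

// Helper sketch: device-local image; VMA picks a suitable memory type.
VkImage create_device_local_image(VmaAllocator allocator,
                                  const VkImageCreateInfo& image_info,
                                  VmaAllocation& out_allocation) {
    VmaAllocationCreateInfo alloc_info{};
    alloc_info.usage = VMA_MEMORY_USAGE_AUTO;  // device-local preferred for images

    VkImage image = VK_NULL_HANDLE;
    vmaCreateImage(allocator, &image_info, &alloc_info, &image, &out_allocation, nullptr);
    return image;
}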

Validation notes

  • Unit tests cover the VMA wrapper initialization and basic allocation/free behavior.
  • For streaming scenarios we created a stress test that allocates and frees transient attachments across many frames and reports peak memory usage and fragmentation counters exported by VMA.
  • Raw numbers (fragmentation before/after a simulated streaming session) are planned as a follow-up microbenchmark.

3) Descriptor System (TDD) — reliable descriptor lifetimes

Summary: Implemented a test-driven DescriptorAllocator with layout caching and resettable transient pools, prioritizing correctness and predictable growth.

What changed

  • DescriptorAllocator with dynamic pool growth and a layout cache to avoid duplicated VkDescriptorSetLayout
  • Transient per-frame pools that can be reset to avoid fragmentation under high allocation churn
  • Unit tests validating initialization, allocation, pool expansion, and cache behavior

Why it matters

  • Predictable descriptor lifecycle prevents leaks and reduces runtime fragmentation, paving the way to bindless descriptor arrays.

Details

We built a spec-first descriptor allocator (TDD — test-driven development) with layout caching and resettable transient pools for per-frame allocations. This design reduces pool fragmentation and sets the stage for bindless descriptor arrays.

Pseudocode test (TDD style)

// Pseudocode: allocate -> bind -> free should not leak.
// `make_mock_layout` and `bind_descriptor_set` stand in for the headless
// helpers in tests/test_mocks.hpp, so this runs without a real GPU.
TEST_CASE("Descriptor allocate-bind-free") {
    DescriptorAllocator alloc;
    VkDescriptorSetLayout layout = make_mock_layout();  // hypothetical mock helper
    auto set = alloc.allocate(layout);
    bind_descriptor_set(set);   // no-op under the mock device
    alloc.free(set);
    REQUIRE(alloc.live_allocations() == 0);  // nothing leaked
}

4) Build, Platform, and Tooling — simpler deploys and fewer platform surprises

Summary: Moved config deployment into scripts, hardened DLL exports, and tightened static linkage to reduce cross-platform linkage issues.

What changed

  • Moved non-code config deployment to ./scripts/build.sh for consistent cross-platform behavior
  • Added ENGINE_EXPORT macros to avoid missing-symbols on Windows DLL boundaries
  • Fixed inline/static singleton linkage to be robust across compilers and DLLs

Why it matters

  • These small build and platform fixes reduce CI flakiness, avoid runtime missing-symbols, and make dev machines behave more like CI.

Details

We simplified config deployment by moving non-code deployment steps into ./scripts/build.sh, fixed Windows DLL/export issues by adding ENGINE_EXPORT macros, and tightened singleton/static linkage to be robust across compilers.
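
For readers unfamiliar with the pattern, the export macro follows the standard Windows DLL idiom (ENGINE_BUILD_DLL here is a hypothetical define name; the engine's actual spelling may differ):

// Sketch of the export/import macro pattern. Define ENGINE_BUILD_DLL
// (hypothetical name) when compiling the engine DLL itself.
#if defined(_WIN32)
  #if defined(ENGINE_BUILD_DLL)
    #define ENGINE_EXPORT __declspec(dllexport)
  #else
    #define ENGINE_EXPORT __declspec(dllimport)
  #endif
#else
  #define ENGINE_EXPORT __attribute__((visibility("default")))
#endif

// Usage: annotate types and functions that cross the DLL boundary.
class ENGINE_EXPORT RenderSystem { /* ... */ };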

5) Docs, Tests, and Phase Status — spec-first and verifiable

Summary: Everything was spec-driven and validated with Catch2 tests; CI runs headless to keep render-dependent tests stable.

What changed

  • Acceptance criteria captured under docs/specs/ and mapped 1:1 to Catch2 tests
  • Tests use tests/test_mocks.hpp so render-dependent logic runs in headless CI
  • Test counts added: descriptor allocator tests (73 assertions, 8 cases); render-graph & VMA tests (~90 assertions). Repository total ~1,125 assertions across ~160 test cases

Why it matters

  • The spec-first workflow yields lightweight, focused tests that reduce regression risk for complex subsystems.

Validation & how we ran it

Run the test suite with the project script to reproduce results:

./scripts/test.sh --filter "*descriptor*" linux-debug --verbose


Part I — Render Graph & Execution Model

This part focuses on the Render Graph core: how we represent passes and resources, automatic barrier insertion, resource lifetime tracking, pass culling, and the executor model used by the RenderSystem.

Architectural overview

Below we unpack the concrete architecture and the trade-offs made while implementing it.

Render Graph Core

Why a render graph?

Render graphs let you declare what a frame needs and let the engine decide when and how to execute it. This separation reduces synchronization bugs, centralizes lifetime logic, and makes pass culling and memory aliasing tractable.

What we implemented is a DAG-based render graph that models passes and their attachments, a compiler that performs topological sorting and inserts the necessary barriers, and a lifetime tracker that records first/last use so transient resources can be aliased when safe. We added pass culling so dead work is removed before execution, and a GraphViz exporter to help visualize and debug complex graphs.

In practice the game registers passes during setup, the RenderSystem compiles the graph at initialization, and an executor callback records command buffers using pipeline and swapchain accessors from the VulkanDevice. Our design favored correctness first — a clear API surface and lightweight RenderPassContext objects for executor code — with room to optimize barrier heuristics in later iterations.
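
To make the registration flow concrete, here is an illustrative sketch in the same spirit as the pseudocode earlier; the names (RenderPassBuilder, create_transient_texture, and so on) are stand-ins, not the engine's exact API — see docs/specs/systems/render/render_graph.md for the authoritative surface.

// Illustrative pass registration (names are stand-ins for the real API).
RenderGraph graph;

auto color = graph.create_transient_texture("color", color_desc);
auto depth = graph.create_transient_texture("depth", depth_desc);

graph.add_pass("gbuffer", [&](RenderPassBuilder& b) {
    b.write(color);
    b.write(depth);
    b.set_executor([](RenderPassContext& ctx) {
        // record commands via ctx: pipelines, swapchain accessors, ...
    });
});

graph.add_pass("post", [&](RenderPassBuilder& b) {
    b.read(color);                 // implies "post" runs after "gbuffer"
    b.write(graph.backbuffer());
    b.set_executor([](RenderPassContext& ctx) { /* tonemap, UI */ });
});

graph.compile();  // topological sort, barrier insertion, first/last-use analysis
graph.execute();  // invoked by RenderSystem each frame

Because dependencies are declared up front, the compiler can drop passes whose outputs are never read (pass culling) and overlap the memory of transient attachments whose lifetimes don't intersect (aliasing).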


Part II — GPU Memory (VMA) Integration

This part explains the VMA integration: the allocator wrapper, staging pool, budget tracking, and how VMA-backed allocations are wired into the RenderGraph for transient and persistent resources.

VMA integration and allocator design

Why VMA?

Vulkan's raw memory APIs are powerful but low-level; VMA gives us a pragmatic layer for suballocation, budget enforcement, and incremental defragmentation — features that materially reduce memory-related bugs in long sessions.

We integrated VMA by wrapping allocator initialization in RenderSystem::initialize_vma() so lifecycle ordering remained explicit: create device, init VMA, create the render graph, then allocate resources. The RenderSystem now exposes get_vma_allocator() and is_vma_initialized() for safe test and subsystem access, and the CMake includes were adjusted so CI builds on Linux and Windows pick up the VMA header. We emphasized RAII-friendly allocations and strict shutdown ordering, and added helper allocation paths for common patterns like device-local images and staging buffers.


Part III — Descriptor Management & Bindless Plans

This part covers descriptor allocation, layout caching, transient descriptor pools, and the roadmap toward bindless descriptor arrays.

Descriptor Allocator: goals & architecture

Descriptors connect CPU-managed resources to shader bindings. Our allocator focuses on two extremes: persistent sets for long-lived resources, and transient pools for high-churn per-frame allocations.

Our descriptor work targeted dynamic pool growth, layout caching to avoid duplicated VkDescriptorSetLayout creation, resettable transient pools for per-frame allocations, and runtime statistics for telemetry. We shipped a descriptor_allocator core that maintains pool bookkeeping and a layout cache, added a DescriptorVulkanFunctions table for dynamic function loading consistent with our VMA approach, and delivered unit tests that exercise initialization, allocation, pool expansion, and cache behavior. These changes reduce runtime fragmentation and provide the foundation for future bindless features.
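
The layout cache is the piece easiest to show in isolation. A minimal sketch (illustrative, not the engine's code; immutable samplers are ignored for brevity): identical binding arrays map to a single VkDescriptorSetLayout.

#include <vulkan/vulkan.h>
#include <array>
#include <map>
#include <vector>

// Sketch of a VkDescriptorSetLayout cache keyed by the binding array.
class LayoutCache {
public:
    VkDescriptorSetLayout get(VkDevice device,
                              const std::vector<VkDescriptorSetLayoutBinding>& bindings) {
        Key key;
        for (const auto& b : bindings)  // pImmutableSamplers ignored for brevity
            key.push_back({b.binding, (uint32_t)b.descriptorType,
                           b.descriptorCount, (uint32_t)b.stageFlags});

        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // hit: no duplicate layout

        VkDescriptorSetLayoutCreateInfo info{VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO};
        info.bindingCount = (uint32_t)bindings.size();
        info.pBindings = bindings.data();

        VkDescriptorSetLayout layout = VK_NULL_HANDLE;
        vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);
        cache_.emplace(std::move(key), layout);
        return layout;
    }

private:
    using Key = std::vector<std::array<uint32_t, 4>>;  // binding, type, count, stages
    std::map<Key, VkDescriptorSetLayout> cache_;
};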


Engine & platform polish

We tightened a few cross-cutting concerns that reduce developer friction and CI surprises:

  • DLL export/import macros: added ENGINE_EXPORT where needed to avoid missing-symbol and RTTI issues on Windows DLL boundaries.
  • Singleton and inline static linkage fixes: reworked inline/static definitions to be robust across compilers and DLL boundaries.
  • Logger system: implemented a microsecond-precision logger with a CRTP-based singleton pattern to replace ad-hoc prints; this made timestamps deterministic in logs and improved test reliability (a minimal sketch follows this list).
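
A minimal sketch of the CRTP singleton shape (illustrative; the engine's Logger adds levels, sinks, and formatting):

#include <chrono>
#include <cstdio>

// CRTP singleton sketch: each Derived gets exactly one instance,
// without virtual dispatch.
template <typename Derived>
class Singleton {
public:
    static Derived& instance() {
        static Derived inst;  // constructed once; thread-safe since C++11
        return inst;
    }
protected:
    Singleton() = default;
};

class Logger : public Singleton<Logger> {
    friend class Singleton<Logger>;
public:
    void info(const char* msg) {
        using namespace std::chrono;
        auto us = duration_cast<microseconds>(steady_clock::now() - start_).count();
        std::printf("[%.8fs] INFO %s\n", us / 1e6, msg);  // microsecond precision
    }
private:
    Logger() : start_(std::chrono::steady_clock::now()) {}
    std::chrono::steady_clock::time_point start_;
};

One caveat worth knowing: a function-local static defined in a header can be instantiated once per module when code is split across DLLs, which is exactly the class of linkage problem the export and inline/static fixes above address.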

Together these changes reduced platform surprises, improved headless CI stability, and made diagnostic output more actionable.

Tests, TDD, and spec-first workflow

Our workflow, in practice:

  1. Draft a narrowly-scoped spec in docs/specs/ that defines acceptance criteria.
  2. Implement Catch2 tests that map to those criteria.
  3. Write the minimal, well-typed code to pass tests and iterate until the suite is green.

Notable test work:

  • Descriptor allocator tests: 73 assertions, 8 test cases.
  • Render graph & VMA tests: 27 + 65 assertions (92 total) across multiple test cases.
  • Integration tests for RenderSystem VMA initialization and render graph plumbing.

Totals were updated in TODO.md and copilot-instructions.md to keep test metrics accurate across the M3→M4 transition (1,125 assertions across 160 test cases at the time of writing).

CI considerations:

  • Tests use tests/test_mocks.hpp to be headless-friendly so render-dependent logic can run in CI without real GPU hardware.
  • ./scripts/test.sh orchestrates building and running tests with proper presets.

Performance and correctness considerations

A few areas to keep an eye on as the work continues:

  • Barrier insertion cost: the render graph's automatic barriers are correct but may over-insert in complex graphs; profiling passes should be added to measure the overhead.
  • Descriptor allocation pressure: transient pools help, but a bindless approach with large descriptor indexing will be necessary for many textures.
  • Memory aliasing correctness: aliasing can reduce memory usage but increases correctness complexity (ensure use-after-free is impossible across frames).

We added tests and logging hooks so these can be iteratively profiled and improved without breaking behavior in CI.

Lessons learned and next steps

Lessons:

  • Spec-first TDD pays off: writing tests and specs first reduced iteration costs when tackling a complex system like descriptors and memory allocators.
  • Keep lifecycle simple: centralizing VMA and render graph under RenderSystem made shutdown-and-startup ordering easier to reason about.
  • Logging is critical: a microsecond logger and clearer messages made diagnosing early-present-mode and vsync issues trivial.

Next steps:

  1. Integrate the descriptor allocator into the RenderGraph so passes can request descriptor sets declaratively (M4 Phase 2).
  2. Implement DescriptorWriter and higher-level helpers for bindless texture arrays.
  3. Add advanced shader pipeline support and finalize shader hot-reload for rapid iteration.
  4. Profile barrier insertion and reduce unnecessary transitions where safe.

Further reading & code pointers

If you want to trace the implementation and learn more from the authoritative specs and implementation notes, start with these documents (paths are relative to repo root):

  • Render Graph design & API — docs/specs/systems/render/render_graph.md (DAG, barriers, lifetime analysis, GraphViz export)
  • Render System integration — docs/specs/systems/render/render_system.md (RenderSystem lifecycle, device/surface integration, M4 integration points)
  • VMA integration & wrapper — docs/specs/systems/render/vma_integration.md (VmaAllocatorWrapper API, staging pool, defragmentation)
  • Descriptor system & bindless design — docs/specs/systems/render/descriptor_system.md (DescriptorAllocator, DescriptorWriter, Bindless arrays)
  • Timing & VSync behavior — docs/specs/systems/timing_system.md (TimingSystem, frame pacing, target FPS handling)

These specs include directory and file pointers for the implementation (look under engine/systems/render/graph/ for the render graph sources, and engine/systems/render/device/ for device code).

Commit highlights (short)

  • 282e88d — refactor(build): Move config file deployment to build script — simplified cross-platform startup
  • 21dba9f — refactor(render): Integrate VMA allocator into RenderGraph — centralized allocations & aliasing
  • c0a1a87 — feat(M4): Implement DescriptorAllocator — test-first allocator with transient pools
  • c4e4c1b — fix(vsync): Correct VSync configuration handling and improve startup logging

Full commit list and diffs are available in the repository history. To inspect locally:

git --no-pager log --oneline --decorate --graph --since="2025-10-06" | sed -n '1,50p'

Metrics & validation notes

  • Descriptor allocator tests: 73 assertions, 8 test cases (unit)
  • RenderGraph & VMA tests: ~90 assertions across unit/integration cases
  • CI: tests run headless via ./scripts/test.sh; all new tests pass in our linux-debug preset during local validation (see CI logs in the run artifacts)

Precise microbenchmarks (frame-jitter reduction, memory-fragmentation deltas) from a synthetic scene are planned as a follow-up.

Impact summary

These changes lay the foundation for advanced shader pipelines, bindless resource systems, and more deterministic frame pacing—critical for both performance and reproducibility. They also reduced CI flakiness and made startup behavior more predictable for devs and automated tests.

Example — VSync logging (before / after)

Before (ad-hoc):

// old: scattered prints during device init
printf("InitDevice...\n");
// sometimes printed vsync state inside swapchain code

After (centralized, deterministic):

// In VulkanDevice::init() after swapchain creation
LOG_INFO("Render API: %s", get_api_name());
LOG_INFO("Present mode: %s", present_mode_to_string(chosen_mode));
LOG_INFO("VSync: %s", config_.frames.vsync ? "true" : "false");

Example log excerpt (startup):

[0.00032034s] [GAME] INFO RenderSystem: Render API: Vulkan
[0.00032034s] [GAME] INFO RenderSystem: Present mode: FIFO
[0.00032034s] [GAME] INFO RenderSystem: VSync: false

Example — Descriptor pool (before / after)

Before (fragile):

// naive: single pool, no growth
vkCreateDescriptorPool(device, &pool_info, nullptr, &pool);
vkAllocateDescriptorSets(device, &alloc_info, &set);

After (managed pools, growth):

// DescriptorAllocator::allocate()
if (!try_allocate_from_current_pool(layout, out_set)) {
    // Current pool exhausted: grow by appending a fresh pool, then retry.
    auto new_pool = create_pool(pool_config_);
    pools_.push_back(new_pool);
    allocate_from_pool(new_pool, layout, out_set);
}


This technical deep dive represents our journey through M4 Phase 1 of Bad Cat: Void Frontier development. We hope it provides value to other graphics programmers tackling similar challenges. The complete source code is available in our repository for study and adaptation.

Happy rendering! 🎮

~p3n
