- Why your BVH choice determines rays-per-second
- How LBVH and HLBVH convert sorting into lightning-fast builds
- Memory layouts and traversal micro-optimizations that cut bandwidth
- Keeping moving parts fast: refit, rebuild, and multi-level BVHs
- A practical BVH build & update checklist you can run today
Ray tracing performance is dominated by two things: how many rays you can trace per second and how much time you spend rebuilding the spatial index that lets those rays skip work. Pick the wrong acceleration strategy and no amount of shader twiddling or denoiser magic will recover the lost throughput; choose the right one and you reclaim entire frames for higher-quality effects.
Dynamic scenes stutter, GPU memory bandwidth spikes, shadow and reflection noise persists — those are the symptoms. Practically, what you see when your BVH strategy is off: long per-frame stalls during BVH rebuilds, degraded rays-per-second because traversal visits many overlapping boxes, and hard-to-debug temporal noise because traversal divergence amplifies intersection variance. These are not academic exercises; they are the runtime failures that ruin 60 Hz targets and make QA teams cranky.
Why your BVH choice determines rays-per-second
The BVH is the single most important acceleration structure for ray tracing: it decides which triangles a ray must test and therefore sets the baseline memory traffic and intersection work per ray. A high-quality BVH reduces node visits per ray, so traversal costs far less than with a cheap-but-poor tree; rays-per-second is effectively the product of traversal efficiency and memory bandwidth behavior.
Builders fall on a spectrum: algorithms that minimize build time (e.g., Morton/LBVH) tend to produce worse SAH cost and therefore more traversal work; the best-SAH methods minimize traversal work but cost more to build. Lauterbach et al. measured LBVH builds as more than an order of magnitude faster than full SAH builds, yet reported traversal slowdowns up to ~85% in pathological cases — a real trade-off you must measure against your frame budget.
Measure what matters: report both the per-frame BVH build time (ms) and rays-per-second for the same scene/seed. If the build steals more than your frame slack (e.g., >4 ms on a 16.6 ms frame budget), you must move to faster builds or background/partial updates. Industry-grade toolchains (Embree / OptiX / Vulkan/DXR) expose builders and update modes precisely so you can tune this trade.
Important: The single-number metric to optimize is effective rays-per-second under the exact workload you will run in production (same ray lengths, distribution, SPP, and dynamic behavior). Architect the BVH to maximize that metric, not to minimize build time in isolation.
How LBVH and HLBVH convert sorting into lightning-fast builds
What LBVH does in plain engineering terms:
- Compute a representative point per primitive (usually the triangle centroid).
- Quantize the coordinates and compute a Morton code for each point.
- Radix-sort the primitives by Morton key (this is the heavy lifting, but embarrassingly parallel on GPU).
- Build a binary/radix tree by grouping consecutive Morton-sorted ranges into nodes — this reduces the build to sorting plus local scans.

The algorithm exposes massive parallelism and maps neatly to radix-sort primitives (Thrust/CUB) and parallel scans.
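The Morton-code step above is a few bit tricks. A host-side C++ sketch, using Karras's well-known 30-bit bit-spreading formulation:

```cpp
#include <cstdint>
#include <algorithm>

// Spread the lower 10 bits of v so each bit lands at every 3rd position.
static uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a centroid with coordinates in [0,1]
// (10 bits per axis, x in the most significant positions).
static uint32_t morton3D(float x, float y, float z) {
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    return (expandBits((uint32_t)x) << 2) |
           (expandBits((uint32_t)y) << 1) |
            expandBits((uint32_t)z);
}
```

The same body drops into the `computeMortonCodes` kernel of the pipeline below; only the per-thread indexing changes.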
HLBVH (and its faster later variants) add two pragmatic layers:
- Use LBVH-style sorting to cheaply form coarse clusters (exploiting temporal/spatial coherence).
- Build a small top-level tree using a binned or greedy SAH on those clusters, then expand the cluster subtrees with LBVH-style local builders.

That hybrid gives you most of SAH's quality at a build cost that is orders of magnitude lower than full SAH on every primitive. NVIDIA's HLBVH work reports real-time rebuilds for millions of triangles (HLBVH under 35 ms for 1M triangles; later work shows sub-11 ms for ~1.75M triangles on a GTX 480 for improved implementations).
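The coarse-cluster step can be sketched in host-side C++: after the radix sort, primitives whose Morton codes share a top-bit prefix fall into one coarse grid cell and form one cluster. The function name and the 30-bit assumption are illustrative, not from the HLBVH paper:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <utility>

// Given primitives already radix-sorted by 30-bit Morton code, each run of
// codes that agree on the top `topBits` bits becomes one cluster. A small
// SAH build then runs over cluster bounds while LBVH finishes the subtrees.
std::vector<std::pair<size_t, size_t>>               // (first, count) per cluster
clustersFromSortedCodes(const std::vector<uint32_t>& sortedCodes, int topBits) {
    std::vector<std::pair<size_t, size_t>> clusters;
    const int shift = 30 - topBits;                  // 30-bit codes assumed
    size_t first = 0;
    for (size_t i = 1; i <= sortedCodes.size(); ++i) {
        if (i == sortedCodes.size() ||
            (sortedCodes[i] >> shift) != (sortedCodes[first] >> shift)) {
            clusters.push_back({first, i - first});  // close the current run
            first = i;
        }
    }
    return clusters;
}
```

On GPU this run detection is a parallel scan over prefix changes rather than a serial loop, but the grouping rule is the same.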
Practical GPU pipeline (high level):
// Pseudocode (CUDA-friendly pipeline)
computeMortonCodes<<<blocks>>>(centroids, codes);
CUB::DeviceRadixSort::SortPairs(scratch, codes, primIndices); // or thrust::sort_by_key
emitLeafAABBs<<<blocks>>>(primIndices, leafAABBs);
buildRadixTreeInPlace<<<blocks>>>(codes, primIndices, internalNodes); // binary radix tree / in-place method
if (useSAH_top) buildSAH_topLevelOnClusters(internalNodes, clusters);
packNodesForTraversal(nodes, memoryLayout);
Key notes: use GPU radix sort (CUB/Thrust or vendor-optimized sort), emit treelets in parallel, and only run a SAH top build over a small number of coarse clusters.
Contrarian insight from the trenches: you rarely need pure SAH for every triangle every frame. HLBVH-style full resorting (no refit) that leverages the cheap sorting step will outperform refit-based pipelines when the deformation is chaotic (fracture/explosion) or when you can amortize the top-level SAH across clusters. The pragmatic answer is hybrid: use LBVH for the leaves and SAH for the coarse topology.
Memory layouts and traversal micro-optimizations that cut bandwidth
The traversal bottleneck is memory bandwidth and divergence. The micro-optimizations you apply to node layout and addressing often yield greater rays-per-second gains than improving intersection kernels.
SoA vs AoS: store node bounding boxes in an SoA format per axis (e.g., arrays minX[], maxX[], minY[], maxY[], minZ[], maxZ[]) so that a warp loading child bounds does coalesced reads. Many GPU and SIMD-optimized CPU runtimes use an AoS-of-SoA hybrid (packing nodes into arrays of float4s) to match cache lines and vector loads. Embree documents node packing that matches SIMD width; GPUs benefit from the same thinking.

N-ary nodes (BVH4/BVH8): increasing the branching factor reduces tree depth and amortizes the cost of one node visit across more children, matching vector instruction widths or warp sizes. Embree takes advantage of 4- and 8-wide nodes for CPU SIMD; on GPU the sweet spot depends on your warp behavior and memory trade-offs (more children → larger nodes → more bandwidth per node visit).
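To make the SoA idea concrete, here is a sketch of a per-node SoA bounds block for a BVH4 plus a scalar slab test against one child. The names and the scalar formulation are illustrative; a real kernel would test the four children across SIMD lanes or threads:

```cpp
#include <algorithm>

// Bounds for the 4 children of one BVH4 node, stored per axis (SoA) so
// lanes loading one axis read a contiguous run of floats.
struct NodeBoundsSoA {
    float minX[4], maxX[4];
    float minY[4], maxY[4];
    float minZ[4], maxZ[4];
};

// Slab test of one ray against child `c`; invDir holds 1/direction per axis.
// Returns true if the ray segment [tMin, tMax] overlaps the child's box.
bool hitChild(const NodeBoundsSoA& n, int c,
              const float org[3], const float invDir[3],
              float tMin, float tMax) {
    float t0x = (n.minX[c] - org[0]) * invDir[0];
    float t1x = (n.maxX[c] - org[0]) * invDir[0];
    float t0y = (n.minY[c] - org[1]) * invDir[1];
    float t1y = (n.maxY[c] - org[1]) * invDir[1];
    float t0z = (n.minZ[c] - org[2]) * invDir[2];
    float t1z = (n.maxZ[c] - org[2]) * invDir[2];
    float tNear = std::max({std::min(t0x, t1x), std::min(t0y, t1y),
                            std::min(t0z, t1z), tMin});
    float tFar  = std::min({std::max(t0x, t1x), std::max(t0y, t1y),
                            std::max(t0z, t1z), tMax});
    return tNear <= tFar;
}
```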
Quantized/packed nodes: local quantization (e.g., store AABB deltas in 16-bit or node-local 8/10-bit grids) reduces memory traffic at the cost of a dequantize step per node. Papers and patents show that careful quantization with conservative rounding yields significant bandwidth savings and smaller dwell per traversal. Liktor & Vaidyanathan introduced a memory layout and addressing scheme that reduces bandwidth for fixed-function traversal; quantized nodes are useful when bandwidth is the bottleneck.
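A minimal sketch of conservative node-local quantization, assuming an 8-bit grid spanning the parent interval. Rounding the min down and the max up keeps the dequantized box a superset of the original, so no intersection can be missed:

```cpp
#include <cstdint>
#include <cmath>
#include <algorithm>

// Quantize a child interval [lo, hi] onto an 8-bit grid spanning the
// parent interval [pLo, pHi]; one of these per axis per child.
struct QInterval { uint8_t qLo, qHi; };

QInterval quantize(float lo, float hi, float pLo, float pHi) {
    float scale = 255.0f / (pHi - pLo);
    float qLo = std::floor((lo - pLo) * scale);   // round min toward -inf
    float qHi = std::ceil ((hi - pLo) * scale);   // round max toward +inf
    return { (uint8_t)std::max(qLo, 0.0f), (uint8_t)std::min(qHi, 255.0f) };
}

// Dequantize back to world units; the result conservatively bounds [lo, hi].
void dequantize(QInterval q, float pLo, float pHi, float& lo, float& hi) {
    float step = (pHi - pLo) / 255.0f;
    lo = pLo + q.qLo * step;
    hi = pLo + q.qHi * step;
}
```

Per child per axis this is 2 bytes instead of 8, at the cost of one fused multiply-add per bound during traversal.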
Pointerless addressing and compact indices: store 32-bit offsets instead of 64-bit pointers; pack leaf flags into sign bits to avoid extra bytes. On GPU you want contiguous arrays and offsets accessible with a single buffer load. This reduces cache pressure and avoids scattered indirect loads.
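Packing the leaf flag into the top bit of a 32-bit offset can look like this sketch; the sign-bit variant mentioned above works the same way with a signed index:

```cpp
#include <cstdint>

// 32-bit child reference: top bit = leaf flag, low 31 bits = array offset.
// A single buffer load plus a mask resolves the target; no 64-bit pointer
// per child, and node/leaf arrays stay contiguous.
constexpr uint32_t kLeafBit = 0x80000000u;

inline uint32_t makeLeafRef(uint32_t primOffset)  { return primOffset | kLeafBit; }
inline uint32_t makeInnerRef(uint32_t nodeOffset) { return nodeOffset; }
inline bool     isLeaf(uint32_t ref)              { return (ref & kLeafBit) != 0; }
inline uint32_t offsetOf(uint32_t ref)            { return ref & ~kLeafBit; }
```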
Stackless traversal and restart trails: short-stack / stackless traversal schemes (e.g., restart-trail) let you reduce per-ray stack memory and bandwidth, which can be crucial for hardware-assisted or wavefront-style traversal strategies. These techniques let you trade a few bits per node for avoiding large stack spills in worst-case traversal.
Warp-cooperative traversal and ray reordering: sort rays or trace in packets that maintain coherence (or perform on-the-fly scheduling where warps cooperate on a treelet). HLBVH implementations and later work use warp-synchronous task queues to keep threads inside warps doing the same node tests, which dramatically reduces divergence and memory churn.
Table — high-level comparison of common memory layouts
| Layout | Typical node size | Pros | Cons |
|---|---|---|---|
| AoS float bounds + ptrs | 48–96 B | simple, easy to implement | poor coalescing on GPU, larger traffic |
| SoA per-axis arrays | 24–40 B | coalesced loads, vector-friendly | more complex build/pack logic |
| BVH4/BVH8 packed SoA | 64–128 B | thinner trees, SIMD-friendly | larger node reads per visit |
| Quantized local-grid | 16–48 B | reduced bandwidth, cache-friendly | dequantize cost, needs conservative rounding |
A concrete C++-style node that works well on GPU (conceptual):
struct BVHNodeSoA {
    // child indices or leaf flags (32-bit offsets)
    uint32_t child0, child1, child2, child3;
    // bounds of the four children, one array per axis (SoA within the node)
    float minX[4], maxX[4];
    float minY[4], maxY[4];
    float minZ[4], maxZ[4];
};
Pack and align nodes so a warp load (128 bytes) fetches either a whole node or the parts you need in one or two cache lines.
Keeping moving parts fast: refit, rebuild, and multi-level BVHs
There are three practical update patterns; choose the one that fits your dynamics and budget:
Refit (fast, cheap topology): recompute node bounds along the existing topology; linear complexity and very cheap, but topology can become poor for large deformations and cause traversal to degenerate. Practical when deformations are small and continuous.
Rebuild (full rebuild): rebuild topology from scratch (LBVH/HLBVH/SAH). Highest quality and most robust for chaotic changes but costs more CPU/GPU time. HLBVH-style rebuilds convert this cost to sorting plus local operations and can be done in real-time for millions of triangles on current hardware when implemented carefully.
Hybrid / multi-level: use a two-level strategy — keep static geometry in a compact BLAS and update only the dynamic geometry BLASes or instances; update the TLAS with instance transforms (cheap) or rebuild only the BLAS for changed objects. This is the DXR/Vulkan model (BLAS/TLAS) and is how modern engines separate concerns: the BLAS holds mesh-level index/vertex data, the TLAS holds scene-level instance transforms.
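For a cheap TLAS update you need world-space bounds per instance leaf; one common approach is to transform the eight corners of the BLAS-space AABB by the instance transform and re-box them. This is a sketch: the 3x4 row-major matrix is an assumption of the example, not a DXR or Vulkan API type:

```cpp
#include <algorithm>

struct AABB { float lo[3], hi[3]; };

// World-space bounds for a TLAS instance leaf, given the instance's
// BLAS-space AABB and an affine transform (rotation/scale + translation).
AABB instanceWorldBounds(const AABB& blasBounds, const float M[3][4]) {
    AABB out = {{ 1e30f,  1e30f,  1e30f}, {-1e30f, -1e30f, -1e30f}};
    for (int corner = 0; corner < 8; ++corner) {
        // pick lo or hi per axis from the corner's bit pattern
        float p[3] = { (corner & 1) ? blasBounds.hi[0] : blasBounds.lo[0],
                       (corner & 2) ? blasBounds.hi[1] : blasBounds.lo[1],
                       (corner & 4) ? blasBounds.hi[2] : blasBounds.lo[2] };
        for (int r = 0; r < 3; ++r) {
            float w = M[r][0]*p[0] + M[r][1]*p[1] + M[r][2]*p[2] + M[r][3];
            out.lo[r] = std::min(out.lo[r], w);
            out.hi[r] = std::max(out.hi[r], w);
        }
    }
    return out;
}
```

Because only the transform changes, this runs per instance per frame without touching any BLAS geometry.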
Practical trade-offs and engine patterns:
- Large static world + moving characters: build SAH BLAS offline for static world; use LBVH/HLBVH for characters or refit if deformation is small.
- Many small dynamic objects: group them into a single dynamic BLAS or cluster them spatially to amortize build cost; HLBVH's compress-sort-decompress and task queues pay off here.
- Chaotic mesh changes (fracture, destruction): prefer full resorting (HLBVH) over refit; refit gives arbitrarily bad trees under large topology changes.
OptiX and other runtimes give explicit knobs: OptiX exposes builders like Lbvh, Trbvh, and Sbvh and a refit property for acceleration structures; Trbvh is often a good trade-off that OptiX itself recommends for many datasets. Use these runtime-provided options when available rather than rolling your own from scratch unless you have very specific constraints.
Practical kernel sketch for a GPU refit pass (node bounds only):
// Each thread refits one node by reducing its children's AABBs.
// NOTE: parents must not run before their children have been refit —
// launch this once per tree level (deepest level first), or gate parents
// with per-node atomic counters (Karras-style single-pass bottom-up).
__global__ void refitKernel(Node* nodes, Leaf* leaves, int nodeCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nodeCount) return;
    if (nodes[i].isLeaf()) {
        nodes[i].bounds = computeLeafBounds(&leaves[nodes[i].leafFirst], nodes[i].leafCount);
    } else {
        AABB b0 = nodes[nodes[i].child0].bounds;
        AABB b1 = nodes[nodes[i].child1].bounds;
        // expand for more children...
        nodes[i].bounds = merge(b0, b1);
    }
}
Refit is linear-time and very cheap compared to full builds, but this kernel alone does not fix topology degenerations.
A practical BVH build & update checklist you can run today
Use this checklist as a deterministic sequence you can run in your engine or prototype. No fluff — measurable steps.
1) Establish measurements (baseline)
- Measure rays-per-second with a representative ray set (primary rays + typical secondary rays), and measure BVH build time (ms) on your target GPU/CPU. Record memory footprint. Use vendor tools (Nsight / RRA / PIX) to capture bandwidth and divergence hotspots.
2) Choose primary strategy (based on dynamics)
- Mostly static, traversal-bound: build SAH offline / precompute, pack nodes for traversal (SoA/BVH4/8), enable spatial-splits if memory allows.
- Highly dynamic (many triangles move each frame): use HLBVH or an optimized LBVH pipeline on GPU with a top-level SAH over clusters.
- Mixed: static objects in BLAS, dynamic objects in separate BLASes; update the TLAS (DXR/Vulkan BLAS/TLAS model).
3) Implement the build pipeline (order of operations)
- Compute centroids → compute Morton codes (k bits per axis) → parallel radix sort (CUB/Thrust) → emit treelets (Karras binary radix tree) → optional SAH top level on clusters → pack nodes in the final traversal layout.
4) Memory layout tuning
- Pack SoA for bounds and 32-bit offsets for indices.
- Align nodes to 128 bytes where possible to match warp load sizing.
- If bandwidth-limited, prototype 16-bit or node-local quantization and measure dequantize overhead against bandwidth savings. Use the Liktor & Vaidyanathan layout for guidance.
5) Update policy
- Use refit for small deformations every frame (cheap).
- Schedule full rebuilds when a refit-quality metric (e.g., measured SAH increase, average bounding-box overlap) exceeds a threshold — do this adaptively (e.g., rebuild every N frames or when SAH cost rises more than X%).
- For many small moving objects, batch rebuilds by cluster and rebuild only dirty clusters (HLBVH-friendly).
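The adaptive rebuild decision above can be as small as a cost-ratio check. A sketch, where the 1.4x threshold and the names are illustrative, not from any particular engine:

```cpp
// Adaptive rebuild policy: refit every frame, but trigger a full rebuild
// once the refitted tree's SAH cost drifts past a threshold relative to
// the cost recorded right after the last full rebuild.
struct RebuildPolicy {
    float baselineSah = 0.0f;   // SAH cost right after the last rebuild
    float threshold   = 1.4f;   // rebuild when cost exceeds baseline * 1.4

    bool shouldRebuild(float currentSah) const {
        return baselineSah > 0.0f && currentSah > baselineSah * threshold;
    }
    void onRebuilt(float freshSah) { baselineSah = freshSah; }
};
```

Track the SAH cost incrementally during refit (sum of node surface areas weighted by child counts) so the check itself stays almost free.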
6) Profiling & tuning loop
- Profile traversal and counts: average node visits per ray, memory loads per ray, and thread divergence hotspots.
- If node visits are high but build time is low, shift toward SAH top-level (HLBVH hybrid).
- If build time dominates and traversal is acceptable, prefer LBVH or incremental rebuilds.
- Re-measure after each tuning pass and iterate.
7) Implementation sanity checks
- Validate conservative bounds after quantization to avoid missing intersections.
- Ensure pointerless offsets and compaction do not introduce misaligned loads on GPU.
- Test correctness for motion blur and instancing paths (they often require special-case builds).
Checklist condensed (copyable runbook)
- Capture representative rays & baseline metrics.
- Decide: static SAH / LBVH / HLBVH based on dynamics.
- Implement centroid → Morton → radix-sort → radix-tree pipeline.
- Pack nodes in SoA, use 32-bit offsets.
- Add quantized node prototype if bandwidth-limited.
- Implement refit + a periodic rebuild policy based on SAH drift.
- Profile node visits, memory traffic, divergence; iterate.
Sources
Fast BVH Construction on GPUs (Lauterbach et al., Eurographics 2009) - Introduces LBVH, shows LBVH is order-of-magnitude faster to build than full SAH but can hurt traversal performance; the paper explains the Morton-code sorting reduction and hybrid LBVH+SAH ideas used by later work.
HLBVH: Hierarchical LBVH Construction for Real-Time Ray Tracing (Pantaleoni & Luebke, HPG 2010) - Defines HLBVH, the compress-sort-decompress approach, and the hybrid strategy that builds SAH top levels over LBVH clusters; contains rebuild-time and throughput figures for dynamic geometry.
Simpler and Faster HLBVH with Work Queues (Garanzha, Pantaleoni, McAllister, HPG 2011) - Practical improvements to HLBVH: task queues, parallel SAH top-level, and GPU-friendly pipeline; includes measured build times for large models on consumer GPUs.
Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees (Tero Karras, HPG 2012) - Presents in-place binary radix tree construction and techniques to build the entire tree in parallel — foundational for production GPU LBVH/HLBVH builders.
NVIDIA OptiX Host API / Builder Documentation - Details builder types (e.g., Lbvh, Trbvh, Sbvh), properties such as refit, and guidance on runtime builder selection and trade-offs.
VK_KHR_acceleration_structure - Vulkan Specification / Reference - Describes the BLAS/TLAS two-level model, build/update commands, and the API-level separation of bottom-level and top-level acceleration structures used in modern engines.
DirectX Raytracing (DXR) Functional Spec / D3D12 Raytracing Acceleration Structure Types (Microsoft) - Official description of TLAS/BLAS, incremental updates, and build semantics for DXR.
Intel® Embree — High Performance Ray Tracing - Production-grade BVH implementations and builder options (Morton/Morton+SAH/SAH), memory-layout and traversal optimizations; useful reference for node layouts, builder trade-offs, and real-world performance guidance.
Bandwidth-Efficient BVH Layout for Incremental Hardware Traversal (Liktor & Vaidyanathan, HPG 2016) - Proposes memory layouts and node addressing schemes that reduce memory bandwidth for hardware traversal through quantization and compact addressing.
Restart Trail for Stackless BVH Traversal (Samuli Laine, HPG 2010) - Describes the restart-trail algorithm for stackless BVH traversal, which reduces stack memory and memory traffic for traversal implementations.
Strong engineering final note: treat the BVH as a tuning knob that you measure against the concrete runtime workload — choose LBVH for aggressive rebuild budget, HLBVH for dynamic-but-quality-sensitive cases, and SAH for static high-quality scenes; lay out nodes to minimize bandwidth and divergence, and instrument continuously so your choices are driven by rays-per-second under real workload rather than by theoretic purity.