Geolm

Posted on Nov 2

onedraw — a GPU-driven 2D renderer

#gpudriven #metal #graphics #sdf2d

Hi, here's a doc I wrote about my open-source Metal renderer. These are the first two parts; next I'll write about rasterization.

Don't hesitate to clone the repo, ask questions, report a bug or contact me. Have a nice day, Geolm.

URL: https://github.com/Geolm/onedraw
Contact: Geolm

Goals and initial architecture

I started the project with the following objectives:

Not triangle-based: shapes are defined using signed distance functions (SDFs).
High quality: anti-aliased edges by default, perfectly smooth curves (no tessellation required), optimized for high-resolution displays.
Fast and GPU-driven: offload as much work as possible to the GPU and minimize draw calls.
Efficient alpha blending: designed to make extensive use of transparency without significant performance cost.

Overview

To achieve high performance, onedraw minimizes unnecessary computations during rasterization. Since it primarily uses signed distance functions (SDFs) to render shapes—and these functions can be relatively expensive—efficient culling is essential.

The screen is divided into 16×16 pixel tiles. A compute shader builds a linked list of draw commands per tile.

When a tile is rasterized, the fragment shader has direct access to the exact set of draw commands that affect that tile—ensuring that only relevant shapes are processed.
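A minimal CPU-side sketch of such a per-tile linked list, written in C. The names (`tile_lists`, `tile_lists_push`, `tiles_across`) are hypothetical and not taken from onedraw; on the GPU the node allocation would be an atomic increment. Because nodes are pushed at the head, this sketch assumes commands are binned from last to first so that a traversal visits them in draw order.

```c
#include <stdbool.h>
#include <stdint.h>

#define TILE_SIZE 16u
#define INVALID_NODE 0xFFFFFFFFu   // marks an empty list or the end of a list

typedef struct {
    uint32_t command_index;  // which draw command affects the tile
    uint32_t next;           // index of the next node, or INVALID_NODE
} tile_node;

typedef struct {
    uint32_t *heads;         // one list head per tile
    tile_node *nodes;        // node pool shared by all tiles
    uint32_t node_count;     // nodes allocated so far (an atomic counter on the GPU)
    uint32_t node_capacity;
} tile_lists;

// number of tiles needed to cover a dimension given in pixels (rounds up)
uint32_t tiles_across(uint32_t pixels)
{
    return (pixels + TILE_SIZE - 1u) / TILE_SIZE;
}

void tile_lists_reset(tile_lists *lists, uint32_t tile_count)
{
    for (uint32_t i = 0; i < tile_count; ++i)
        lists->heads[i] = INVALID_NODE;
    lists->node_count = 0;
}

// pushes a command at the head of a tile's list; returns false when the pool is full
bool tile_lists_push(tile_lists *lists, uint32_t tile_index, uint32_t command_index)
{
    if (lists->node_count >= lists->node_capacity)
        return false;
    uint32_t node = lists->node_count++;
    lists->nodes[node].command_index = command_index;
    lists->nodes[node].next = lists->heads[tile_index];
    lists->heads[tile_index] = node;
    return true;
}
```

The rasterizer side then only has to start at `heads[tile]` and follow `next` until it reaches `INVALID_NODE`.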

GPU-Driven Pipeline

The entire linked-list generation process happens on the GPU.

If a tile contains one or more draw commands, it’s automatically added to a list of tiles to be rendered.

Finally, an indirect draw call is issued to rasterize only those active tiles—eliminating CPU overhead and allowing fully GPU-driven rendering.
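As a rough illustration, the list of active tiles and the indirect arguments could be filled as in the C sketch below. The struct mirrors the four fields of Metal's MTLDrawPrimitivesIndirectArguments; everything else (the function name, one 4-vertex quad instance per tile, the sequential loop standing in for an atomic append on the GPU) is an assumption and not necessarily how onedraw does it.

```c
#include <stdint.h>

#define INVALID_NODE 0xFFFFFFFFu   // hypothetical marker for "this tile has no commands"

// same four fields as Metal's MTLDrawPrimitivesIndirectArguments
typedef struct {
    uint32_t vertex_count;
    uint32_t instance_count;
    uint32_t vertex_start;
    uint32_t base_instance;
} draw_indirect_args;

// collects the tiles whose list is not empty and sizes the draw accordingly
void build_active_tiles(const uint32_t *tile_heads, uint32_t tile_count,
                        uint32_t *active_tiles, draw_indirect_args *args)
{
    args->vertex_count = 4;      // assumption: one triangle-strip quad per tile
    args->instance_count = 0;    // one instance per active tile
    args->vertex_start = 0;
    args->base_instance = 0;

    for (uint32_t tile = 0; tile < tile_count; ++tile)
    {
        if (tile_heads[tile] != INVALID_NODE)
            active_tiles[args->instance_count++] = tile;  // atomic append on the GPU
    }
}
```

Since the GPU writes `instance_count` itself, the CPU never needs to know how many tiles ended up active.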

Binning commands

Binning is the process that generates the per-tile linked lists of draw commands.

It works by performing intersection tests between each draw command and the bounding box of every tile.

Classic intersection methods are used, such as the Separating Axis Theorem (SAT) for oriented shapes and distance-to-center checks for circular ones.
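To make the SAT case concrete, here is a small C sketch of an oriented box tested against a tile's axis-aligned box. The types and the function name are hypothetical; it only illustrates the textbook test (two tile axes plus two box axes), not onedraw's actual shader code.

```c
#include <stdbool.h>

typedef struct { float x, y; } vec2;

typedef struct {
    vec2 center;
    vec2 axis;          // unit vector along the box's local x axis
    vec2 half_extents;  // half sizes along axis and along its perpendicular
} oriented_box;

typedef struct {
    vec2 center;
    vec2 half_extents;
} aabb;

static float absf(float v) { return v < 0.f ? -v : v; }
static float dot(vec2 a, vec2 b) { return a.x * b.x + a.y * b.y; }

// returns false as soon as one of the four candidate axes separates the two boxes
bool oriented_box_intersects_aabb(const oriented_box *box, const aabb *tile)
{
    vec2 u = box->axis;
    vec2 v = { -u.y, u.x };   // perpendicular to u
    vec2 d = { box->center.x - tile->center.x, box->center.y - tile->center.y };
    float hu = box->half_extents.x;
    float hv = box->half_extents.y;
    float ex = tile->half_extents.x;
    float ey = tile->half_extents.y;

    // the tile's own axes: project the oriented box onto x and y
    if (absf(d.x) > ex + hu * absf(u.x) + hv * absf(v.x)) return false;
    if (absf(d.y) > ey + hu * absf(u.y) + hv * absf(v.y)) return false;

    // the box's axes: project the tile onto u and v
    if (absf(dot(d, u)) > hu + ex * absf(u.x) + ey * absf(u.y)) return false;
    if (absf(dot(d, v)) > hv + ex * absf(v.x) + ey * absf(v.y)) return false;

    return true;
}
```

A thin box rotated by 45 degrees is the typical case where this pays off: its axis-aligned bounds can overlap a tile that the box itself never touches.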

Some additional factors are also considered:

anti-aliasing
groups of shapes
smoothmin

Anti-aliasing

The width of the anti-aliasing region (defined by the user in onedraw) must be taken into account during intersection testing.

Consider a disc lying just outside a tile: even if the disc does not mathematically intersect the tile’s bounding box, its anti-aliasing region can still extend into the tile.

To avoid visible seams or straight edges along tile borders, the disc is still added to the tile’s linked list so that edge pixels are correctly shaded.
To do so, we simply grow the bounding box by the anti-aliasing width before testing.
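A possible shape for this test, sketched in C for a disc. It assumes the box being grown is the tile's, and the function and parameter names are made up for the example. Growing a square box is slightly conservative at the corners, which is harmless for culling.

```c
#include <stdbool.h>
#include <stdint.h>

#define TILE_SIZE 16u

static float clampf(float value, float lo, float hi)
{
    return value < lo ? lo : (value > hi ? hi : value);
}

// true when the disc, including its anti-aliased fringe, can touch the tile
bool disc_touches_tile(float cx, float cy, float radius, float aa_width,
                       uint32_t tile_x, uint32_t tile_y)
{
    // tile bounds in pixels, grown by the anti-aliasing width on every side
    float min_x = (float)(tile_x * TILE_SIZE) - aa_width;
    float min_y = (float)(tile_y * TILE_SIZE) - aa_width;
    float max_x = (float)((tile_x + 1u) * TILE_SIZE) + aa_width;
    float max_y = (float)((tile_y + 1u) * TILE_SIZE) + aa_width;

    // closest point of the grown box to the disc center
    float px = clampf(cx, min_x, max_x);
    float py = clampf(cy, min_y, max_y);

    float dx = cx - px;
    float dy = cy - py;
    return dx * dx + dy * dy <= radius * radius;
}
```

For example, a disc of radius 3 centered 4 pixels to the right of tile (0, 0) is rejected with an anti-aliasing width of 0 but accepted with a width of 1.5.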

Group of shapes

onedraw supports combining multiple shapes into a group.

A group behaves like a single shape: for example, it can have a global transparency even if it contains multiple layers of shapes, or it can have an outline that applies to the entire group.

Groups are defined by wrapping shapes between begin group and end group commands.

The end group command triggers the color output (and, if enabled, the outline).

Each begin/end group command is assigned a global bounding box that encompasses all shapes within the group.

This ensures that both the begin and end commands are included whenever the group affects a tile.

If a group has no shapes intersecting the tile, the linked list is adjusted to skip that unused group, avoiding unnecessary processing.

Smoothmin

The smoothmin operator allows multiple shapes to blend smoothly, even when they don’t intersect.

The k-factor (as described on Inigo Quilez’s website) controls how far a shape can blend with another.

During binning, the tile’s bounding box is expanded by this value to ensure that shapes contributing to a smooth blend are not missed.
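For reference, one common form of the operator is the quadratic polynomial smooth minimum from Inigo Quilez's article, shown here in C. Whether onedraw uses this exact variant is an assumption. With this form the result differs from a plain min only when the two distances are within k of each other, which is what bounds how far a blend can reach.

```c
static float minf(float a, float b) { return a < b ? a : b; }
static float maxf(float a, float b) { return a > b ? a : b; }
static float absf(float v) { return v < 0.f ? -v : v; }

// quadratic polynomial smooth minimum of two signed distances, k must be > 0
float smooth_min(float a, float b, float k)
{
    // h is 0 when the distances are at least k apart, 1 when they are equal
    float h = maxf(k - absf(a - b), 0.f) / k;
    // the blend can only pull the result below min(a, b), by at most k / 4
    return minf(a, b) - h * h * k * 0.25f;
}
```

With k = 2, two shapes whose distances are 1 and 5 give exactly 1 (no blend), while two equal distances of 1 give 0.5.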

Performance considerations

Even though binning is performed on the GPU, it’s not free. Let’s look at the numbers:

At 1440p, the screen contains 160 × 90 = 14,400 tiles
onedraw supports up to 65,536 draw commands
In the worst case, that’s around 943 million intersection tests

With such a naïve approach, the binning step could easily cost more than the actual rasterization. 😬
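The arithmetic behind these figures, as a tiny C helper with hypothetical names. It assumes a 2560 × 1440 framebuffer and rounds partial tiles up.

```c
#include <stdint.h>

#define TILE_SIZE 16u

// number of 16x16 tiles covering a framebuffer, partial tiles included
uint64_t tile_count(uint32_t width, uint32_t height)
{
    uint64_t tiles_x = (width + TILE_SIZE - 1u) / TILE_SIZE;
    uint64_t tiles_y = (height + TILE_SIZE - 1u) / TILE_SIZE;
    return tiles_x * tiles_y;
}

// brute force: every command is tested against every tile
uint64_t worst_case_tests(uint32_t width, uint32_t height, uint32_t command_count)
{
    return tile_count(width, height) * command_count;
}
```

For 2560 × 1440 and 65,536 commands this gives 14,400 × 65,536 = 943,718,400 tests.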

Quantized AABB

The first optimization is to store an AABB (axis-aligned bounding box) for each command and use it as a quick pre-test before running more complex intersection math.

The bounding boxes are quantized to the tile resolution (16 pixels), making the test extremely fast and simple.

Additionally, the AABBs are stored in a separate buffer from the draw commands to avoid cache thrashing and improve memory access efficiency.
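A sketch of what the quantization and the pre-test might look like in C. The struct layout, the 16-bit storage and the function names are assumptions made for the example; the point is that once bounds are expressed in tile units, the test is four integer comparisons.

```c
#include <stdbool.h>
#include <stdint.h>

#define TILE_SIZE 16u

// bounds in tile units, both ends inclusive
typedef struct {
    uint16_t min_x, min_y, max_x, max_y;
} quantized_aabb;

// converts a pixel coordinate to a tile coordinate clamped to [0, tile_count - 1]
static uint16_t pixel_to_tile(float pixel, uint32_t tile_count)
{
    if (pixel <= 0.f)
        return 0;
    if (pixel >= (float)(tile_count * TILE_SIZE))
        return (uint16_t)(tile_count - 1u);
    return (uint16_t)((uint32_t)pixel / TILE_SIZE);
}

// returns false when the pixel-space box is entirely off-screen
bool quantize_aabb(float min_x, float min_y, float max_x, float max_y,
                   uint32_t tiles_x, uint32_t tiles_y, quantized_aabb *out)
{
    if (max_x < 0.f || max_y < 0.f ||
        min_x >= (float)(tiles_x * TILE_SIZE) || min_y >= (float)(tiles_y * TILE_SIZE))
        return false;

    out->min_x = pixel_to_tile(min_x, tiles_x);
    out->min_y = pixel_to_tile(min_y, tiles_y);
    out->max_x = pixel_to_tile(max_x, tiles_x);
    out->max_y = pixel_to_tile(max_y, tiles_y);
    return true;
}

// the quick pre-test: integer comparisons only
bool aabb_overlaps_tile(const quantized_aabb *box, uint32_t tile_x, uint32_t tile_y)
{
    return tile_x >= box->min_x && tile_x <= box->max_x &&
           tile_y >= box->min_y && tile_y <= box->max_y;
}
```

Only the commands that pass this pre-test need the more expensive shape-specific intersection math.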

Hierarchical binning

Even with AABB pre-tests, having many commands means each tile still performs a significant amount of work.

To reduce this, we introduce hierarchical binning — pre-building a list of commands for larger screen regions.

A region covers 16 × 16 tiles, and each region keeps a list of the commands that affect it.

When binning individual tiles, we only consider the commands from that region’s list instead of all global commands.
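The tile-to-region mapping is simple integer math. A small C sketch, with hypothetical names and a row-major region index assumed:

```c
#include <stdint.h>

#define TILES_PER_REGION 16u   // a region is 16 x 16 tiles, i.e. 256 x 256 pixels

// number of regions needed to cover a dimension given in tiles (rounds up)
uint32_t regions_across(uint32_t tiles)
{
    return (tiles + TILES_PER_REGION - 1u) / TILES_PER_REGION;
}

// row-major index of the region that contains a given tile
uint32_t region_of_tile(uint32_t tile_x, uint32_t tile_y, uint32_t regions_x)
{
    return (tile_y / TILES_PER_REGION) * regions_x + (tile_x / TILES_PER_REGION);
}
```

At 1440p (160 × 90 tiles) this gives 10 × 6 = 60 regions, so each tile only walks the list of the one region it belongs to.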

This sounds great, but there’s a catch:
When binning commands per region in a compute shader, assigning one thread per region is very slow: there are relatively few regions but potentially many commands. GPUs perform best when running a large number of lightweight threads, not a few heavy ones.

To fix this, we invert the process: each thread processes a single command and adds it to the region lists it intersects.

However, since this happens in parallel, we lose the guaranteed ordering of commands, which is critical in 2D rendering.

Predicate + exclusive scan

To keep the order of commands, we use a classic pattern:

A predicate compute shader evaluates the visibility of each command (one thread per command) and writes the result — 0 or 1 — to a buffer.
An exclusive scan compute shader then processes this buffer to build a compact list of visible command indices.
Finally, another compute shader uses the predicate and the list of indices to write the corresponding commands into the output buffer.

This approach allows us to keep a thread-per-command model while efficiently filtering out invisible commands.

This process is applied to every region.
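The three steps can be sketched on the CPU in C as follows. The function names are hypothetical and the loops are sequential here, whereas each iteration would be its own GPU thread in the real pipeline. The key property is that a visible command's output slot is the number of visible commands before it, so the original order survives.

```c
#include <stddef.h>
#include <stdint.h>

// step 2: out[i] = number of visible commands before command i
// returns the total number of visible commands
uint32_t exclusive_scan(const uint32_t *predicate, uint32_t *out, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; ++i)
    {
        out[i] = sum;
        sum += predicate[i];
    }
    return sum;
}

// step 3: each visible command writes its index at the slot given by the scan
// every iteration is independent, so it maps to one thread per command
void compact_commands(const uint32_t *predicate, const uint32_t *scan,
                      size_t count, uint32_t *visible)
{
    for (size_t i = 0; i < count; ++i)
    {
        if (predicate[i] != 0u)
            visible[scan[i]] = (uint32_t)i;
    }
}
```

With the predicate 1, 0, 1, 1, 0, 1 (the output of step 1), the scan is 0, 1, 1, 2, 3, 3 and the compacted list is commands 0, 2, 3, 5, still in draw order.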

Intrinsics trick

Typically, the exclusive scan pass consists of multiple cascaded compute shaders to produce the final result. In our case, however, since we have a maximum of 65k commands and know the SIMD group size on Apple Silicon, we can perform the entire operation in a single pass using the simd_prefix_exclusive_sum function and threadgroup memory.
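To show why a single pass is enough, here is a CPU emulation in C of one way to structure it. This is a sketch under stated assumptions, not the actual binning shader: it assumes a SIMD group of 32 threads and a single threadgroup of 1024 threads, so the 32 per-group totals fit in exactly one more SIMD-wide prefix sum. `simd_prefix_exclusive_sum_emulated` stands in for Metal's `simd_prefix_exclusive_sum`, and the local arrays play the role of threadgroup memory.

```c
#include <stddef.h>
#include <stdint.h>

#define SIMD_WIDTH 32u                       // assumed SIMD group size
#define THREADS 1024u                        // assumed threadgroup size
#define SIMD_GROUPS (THREADS / SIMD_WIDTH)   // 32, must equal SIMD_WIDTH for step 3

// CPU stand-in for simd_prefix_exclusive_sum: lane i gets the sum of lanes 0..i-1
static void simd_prefix_exclusive_sum_emulated(const uint32_t *lanes, uint32_t *prefix)
{
    uint32_t sum = 0;
    for (uint32_t lane = 0; lane < SIMD_WIDTH; ++lane)
    {
        prefix[lane] = sum;
        sum += lanes[lane];
    }
}

// exclusive scan of count values in one "dispatch", returns the total
uint32_t single_pass_exclusive_scan(const uint32_t *in, uint32_t *out, size_t count)
{
    size_t per_thread = (count + THREADS - 1u) / THREADS;
    uint32_t thread_sum[THREADS], thread_offset[THREADS];
    uint32_t group_total[SIMD_GROUPS], group_offset[SIMD_GROUPS];

    // step 1: each thread sums its own contiguous slice of the input
    for (uint32_t t = 0; t < THREADS; ++t)
    {
        size_t begin = (size_t)t * per_thread, end = begin + per_thread;
        if (begin > count) begin = count;
        if (end > count) end = count;
        thread_sum[t] = 0;
        for (size_t i = begin; i < end; ++i)
            thread_sum[t] += in[i];
    }

    // step 2: prefix sum inside each SIMD group, the last lane publishes the group total
    for (uint32_t g = 0; g < SIMD_GROUPS; ++g)
    {
        uint32_t first = g * SIMD_WIDTH, last = first + SIMD_WIDTH - 1u;
        simd_prefix_exclusive_sum_emulated(&thread_sum[first], &thread_offset[first]);
        group_total[g] = thread_offset[last] + thread_sum[last];
    }

    // step 3: one SIMD group scans the 32 group totals
    simd_prefix_exclusive_sum_emulated(group_total, group_offset);

    // step 4: each thread writes the scan of its slice, starting from its global offset
    for (uint32_t t = 0; t < THREADS; ++t)
    {
        size_t begin = (size_t)t * per_thread, end = begin + per_thread;
        if (begin > count) begin = count;
        if (end > count) end = count;
        uint32_t running = group_offset[t / SIMD_WIDTH] + thread_offset[t];
        for (size_t i = begin; i < end; ++i)
        {
            out[i] = running;
            running += in[i];
        }
    }
    return group_offset[SIMD_GROUPS - 1u] + group_total[SIMD_GROUPS - 1u];
}
```

With 65,536 commands each emulated thread handles 64 values, and the two SIMD-wide prefix sums replace what would otherwise be separate cascaded dispatches.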

You can look at the binning shader.
