DEV Community

Andrey Kolkov

Porting Vello's GPU Tile Rasterizer to Pure Go

When you call dc.DrawCircle(400, 300, 100) in gogpu/gg, what happens under the hood? A tile-based rasterization pipeline — a direct port of Vello's GPU compute pipeline from the linebender team — converts your vector paths into per-pixel coverage values. It runs on both CPU and GPU, and it's written entirely in Pure Go.

This article walks through the internals of tilecompute — a 6,700-line dual-execution port of Vello's original GPU compute rasterizer (circa 2020–2023) into Pure Go, with a CPU reference implementation and GPU compute dispatch via 9 WGSL shaders.

Why Not Just Use Scanline?

Traditional scanline rasterizers process one row of pixels at a time. They work, but they have a fundamental scalability problem: for a 4K canvas (3840x2160), you're iterating over 8.3 million pixels regardless of how simple your shapes are.

Tile-based rasterizers flip this around. They divide the canvas into small tiles (16x16 pixels in our case) and only process tiles that the vector path actually touches. A circle in the corner of a 4K canvas? You process maybe 40 tiles instead of 8 million pixels.

Vello — created by Raph Levien and the linebender team — pioneered this approach using GPU compute shaders (originally as piet-gpu in 2020, renamed to Vello in 2022). We ported its GPU compute pipeline to Pure Go for use as the rasterization core of gogpu/gg.

Note: This article covers the original GPU compute pipeline (16x16 tiles, 2020–2023). We've also ported Vello's newer Sparse Strips algorithm (4x4 tiles, August 2024) — see the Two Rasterizers, Two Targets section below.

The GPU Compute Pipeline

Vello's original rasterizer (2020–2023) uses a tile-based GPU compute approach with 16x16 pixel tiles. The key insight: instead of asking "which pixels does this path cover?", ask "which tiles does each line segment cross, and what's the winding contribution?"

The full pipeline has 9 GPU compute stages (matching our 9 WGSL shaders), plus scene encoding and curve flattening on the CPU:

Vector Paths (cubics, arcs, lines)
    ↓
Scene Encoding (CPU — pack paths, transforms, styles)
    ↓
 1. PathTag Reduce    ─┐
 2. PathTag Scan      ─┘ monoid prefix sums over path structure
    ↓
Flatten (Euler spiral → line segments)
    ↓
 3. Draw Reduce       ─┐
 4. Draw Leaf Scan    ─┘ monoid prefix sums over draw commands
    ↓
 5. Path Count        (DDA walk — which tiles does each segment cross?)
    ↓
 6. Backdrop          (prefix sum — left-to-right winding accumulation)
    ↓
 7. Coarse            (generate Per-Tile Command Lists)
    ↓
 8. Path Tiling       (clip segments to tile boundaries, compute yEdge)
    ↓
 9. Fine              (per-pixel analytic anti-aliased coverage)
    ↓
Per-pixel RGBA output

The first four stages use monoid prefix sums — Vello's core parallelism primitive. A monoid is an associative operation with an identity element; prefix sums over monoids can be computed in O(log n) parallel steps on a GPU, turning what would be sequential parsing into massively parallel work.

Let's walk through the key stages.

Stages 1–4: Monoid Prefix Sums

The first four stages parse the encoded scene in parallel using monoid prefix sums. Each path tag and draw tag is reduced into a monoid (an associative structure), then scanned to produce cumulative values:

// PathMonoid — accumulated state from scanning path tags
type PathMonoid struct {
    TransIx       uint32 // Transform index
    PathSegIx     uint32 // Path segment index
    PathSegOffset uint32 // Offset into path segment data
    StyleIx       uint32 // Style index
    PathIx        uint32 // Path index
}

// DrawMonoid — accumulated state from scanning draw tags
type DrawMonoid struct {
    PathIx      uint32 // Which path this draw belongs to
    ClipIx      uint32 // Clip stack depth
    SceneOffset uint32 // Offset into scene data
    InfoOffset  uint32 // Offset into info buffer
}

On the GPU, reduce + scan runs in O(log n) parallel steps via our 9 WGSL compute shaders dispatched through VelloComputeDispatcher. On the CPU, it's a sequential scan — the data structures are identical, making the CPU code a reference implementation that validates GPU correctness.
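To make the monoid machinery concrete, here is a minimal self-contained sketch (not the actual tilecompute code) of an associative combine plus an exclusive prefix scan. Each tag contributes per-field deltas; the scan gives every element the combined state of everything before it, which is exactly how each path tag learns its offsets. The GPU version replaces the sequential loop with a parallel reduce/scan over the same combine.

```go
package main

import "fmt"

// monoid is a simplified stand-in for PathMonoid: each tag
// contributes per-field deltas, and combine just adds them.
type monoid struct {
    PathSegIx uint32
    PathIx    uint32
}

// combine is the associative monoid operation (field-wise add).
func combine(a, b monoid) monoid {
    return monoid{a.PathSegIx + b.PathSegIx, a.PathIx + b.PathIx}
}

// scan computes an exclusive prefix scan: out[i] holds the combined
// state of all tags before i, which is what element i needs to
// locate its own data. On the GPU this runs in O(log n) steps.
func scan(tags []monoid) []monoid {
    out := make([]monoid, len(tags))
    acc := monoid{} // identity element
    for i, t := range tags {
        out[i] = acc
        acc = combine(acc, t)
    }
    return out
}

func main() {
    // Three tags: a segment, a path boundary, another segment.
    tags := []monoid{{1, 0}, {0, 1}, {1, 0}}
    fmt.Println(scan(tags))
}
```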

Stage 5: Path Count — DDA Tile Walk

The path count stage answers: "which tiles does each line segment cross?"

We use a Digital Differential Analyzer (DDA) — an algorithm that traces a line through a grid, visiting every cell the line passes through. For each tile visited, we record two things:

  1. Segment count — how many line segments cross this tile
  2. Backdrop — the signed winding contribution at the tile's left edge
// pathCountMain — DDA walk through the tile grid
func pathCountMain(bump *BumpAllocators, lines []LineSoup,
    paths []Path, tile []Tile, segCounts []SegmentCount) {

    for lineIx := uint32(0); lineIx < bump.Lines; lineIx++ {
        line := lines[lineIx]
        p0 := vec2FromArray(line.P0)
        p1 := vec2FromArray(line.P1)

        // Sort by Y for consistent DDA walking
        isDown := p1.y >= p0.y
        var xy0, xy1 vec2
        if isDown {
            xy0, xy1 = p0, p1
        } else {
            xy0, xy1 = p1, p0
        }

        // Scale to tile coordinates (pixel / 16)
        s0 := xy0.mul(tileScale)
        s1 := xy1.mul(tileScale)

        // DDA walk: trace the line through tile grid cells,
        // counting X crossings and Y crossings separately
        dx := abs32(s1.x - s0.x)
        dy := s1.y - s0.y
        idxdy := 1.0 / (dx + dy)
        a := dx * idxdy

        // ... walk through tiles, update backdrop and segment counts
    }
}

The backdrop is the critical concept. When a line segment crosses a tile's left edge, it contributes a signed delta to the winding number: -1 for downward segments, +1 for upward. This is how we know whether a pixel is "inside" or "outside" the path without checking every segment.
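As a tiny illustration of that sign convention, here is a hypothetical helper (not the library's API) that returns the winding delta a left-edge crossing contributes, based only on the segment's vertical direction:

```go
package main

import "fmt"

// backdropDelta sketches the sign convention described above:
// a segment crossing a tile's left edge contributes -1 when heading
// down the canvas (y increasing) and +1 when heading up.
// Hypothetical helper, not the actual tilecompute code.
func backdropDelta(y0, y1 float32) int32 {
    if y1 > y0 {
        return -1 // downward segment
    }
    return +1 // upward segment
}

func main() {
    fmt.Println(backdropDelta(0, 10)) // downward: -1
    fmt.Println(backdropDelta(10, 0)) // upward:   +1
}
```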

Stage 6: Backdrop Prefix Sum

This is where tile-based rasterization really shines. Instead of checking every segment for every pixel, we accumulate winding numbers left-to-right across each row of tiles:

// Backdrop prefix sum — left-to-right winding accumulation
// over this path's bounding box of tiles (base = first tile index)
for y := 0; y < bboxH; y++ {
    sum := int32(0)
    for x := 0; x < bboxW; x++ {
        idx := base + y*bboxW + x
        sum += tiles[idx].Backdrop
        tiles[idx].Backdrop = sum
    }
}

After this step, each tile knows the accumulated winding number from all segments to its left. A tile with winding number 1 and no segments crossing it is fully solid — no per-pixel computation needed.
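The classification this enables can be sketched with a hypothetical classifyTile helper (names are illustrative, not the tilecompute API), assuming the NonZero fill rule:

```go
package main

import "fmt"

// classifyTile sketches the per-tile decision after the backdrop
// prefix sum, under the NonZero fill rule: tiles with no crossing
// segments are either fully inside (solid) or fully outside (empty),
// based only on the accumulated winding number.
// Hypothetical helper, not the actual tilecompute code.
func classifyTile(backdrop int32, segCount uint32) string {
    if segCount > 0 {
        return "partial" // needs per-pixel fine rasterization
    }
    if backdrop != 0 {
        return "solid" // fully covered, no per-pixel work
    }
    return "empty" // fully outside, skip entirely
}

func main() {
    fmt.Println(classifyTile(1, 0)) // solid
    fmt.Println(classifyTile(0, 0)) // empty
    fmt.Println(classifyTile(0, 3)) // partial
}
```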

Stage 7: Coarse — Allocation + PTCL Generation

The coarse stage allocates space in a global segment buffer and generates Per-Tile Command Lists. Segment counts are converted into indices using Vello's inverted index trick:

// Convert counts to indices using bitwise NOT
nextSegIx := uint32(0)
for i := range tiles {
    nSegs := tiles[i].SegmentCountOrIx
    if nSegs != 0 {
        tiles[i].SegmentCountOrIx = ^nextSegIx // !seg_ix in Rust
        nextSegIx += nSegs
    }
}

The ^ (bitwise NOT) serves double duty: it marks the tile as "has segments" (reinterpreted as int32, the NOT of any valid index is negative) while still encoding the starting index into the global segment array, recoverable with a second ^.
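A self-contained sketch of the encoding (hypothetical helper names) shows why it round-trips: indices stay below 2^31, so the inverted value always has the sign bit set, and a second NOT restores the original index.

```go
package main

import "fmt"

// Sketch of the inverted-index trick with hypothetical helpers:
// ^ix flags "this tile has an assigned segment index" (negative
// as int32, since valid indices stay below 2^31) while still
// carrying the index itself.
func encodeSegIx(ix uint32) uint32 { return ^ix }

// hasSegments checks the flag by reinterpreting as signed.
func hasSegments(v uint32) bool { return int32(v) < 0 }

// decodeSegIx recovers the index with a second NOT.
func decodeSegIx(v uint32) uint32 { return ^v }

func main() {
    v := encodeSegIx(42)
    fmt.Println(hasSegments(v), decodeSegIx(v)) // true 42
    fmt.Println(hasSegments(0))                 // raw zero count: false
}
```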

Stage 8: Path Tiling

Now we clip each line segment to its tile boundaries and compute the crucial yEdge value:

// PathSegment — a line segment clipped to tile boundaries
type PathSegment struct {
    Point0 [2]float32 // Tile-relative start point
    Point1 [2]float32 // Tile-relative end point
    YEdge  float32    // Y where segment crosses tile left edge (x=0)
}

YEdge tells the fine rasterizer where the segment enters or exits the tile at x=0. If the segment doesn't cross the left edge, YEdge is set to 1e9 (sentinel value). This single float32 captures the geometric relationship needed for coverage computation.
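A simplified sketch of how such a value can be derived (not the exact tilecompute code, which handles more cases): for a segment in tile-relative coordinates, interpolate the y at which it spans x=0, or return the sentinel when it never does.

```go
package main

import "fmt"

const yEdgeSentinel = 1e9

// computeYEdge sketches the yEdge derivation for a segment in
// tile-relative coordinates: if the segment spans x=0, linearly
// interpolate the y at the crossing; otherwise return the sentinel.
// Simplified illustration, not the exact tilecompute code.
func computeYEdge(x0, y0, x1, y1 float32) float32 {
    // Order so x0 <= x1 for the span test.
    if x0 > x1 {
        x0, x1, y0, y1 = x1, x0, y1, y0
    }
    if x0 > 0 || x1 <= 0 || x0 == x1 {
        return yEdgeSentinel // never crosses the left edge
    }
    t := (0 - x0) / (x1 - x0) // parametric position of x=0
    return y0 + t*(y1-y0)
}

func main() {
    fmt.Println(computeYEdge(-1, 0, 1, 8)) // crosses x=0 at y=4
    fmt.Println(computeYEdge(2, 0, 5, 8))  // entirely right of the edge
}
```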

Stage 9: Fine Rasterization

The final stage computes per-pixel anti-aliased coverage within each 16x16 tile:

// fillPath computes per-pixel coverage for a single tile.
// Direct port of fill_path from vello_shaders/src/cpu/fine.rs.
func fillPath(area []float32, segments []PathSegment,
    backdrop int32, evenOdd bool) {

    // Initialize area with backdrop winding number
    for i := range area {
        area[i] = float32(backdrop)
    }

    for _, seg := range segments {
        // For each column in the tile, compute the area
        // contribution of this segment using analytic integration
        // ...
    }

    // Apply fill rule and convert to alpha [0, 1]
    if evenOdd {
        for i := range area {
            area[i] = abs32(area[i] - 2.0*round32(0.5*area[i]))
        }
    } else {
        for i := range area {
            area[i] = min32(abs32(area[i]), 1.0)
        }
    }
}

This is analytic anti-aliasing — we compute exact sub-pixel coverage by integrating line segment contributions, not by supersampling. The result is mathematically precise edges with smooth alpha gradients.
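A toy comparison makes the distinction tangible. For a unit pixel cut by the line y = a·x + b (assuming the line stays inside the pixel over x in [0,1]), the coverage below the line has the exact closed form ∫(a·x + b)dx = a/2 + b, whereas supersampling can only approximate it. This is an illustrative sketch, not tilecompute's integration code:

```go
package main

import "fmt"

// analyticCoverage gives the exact area below y = a*x + b inside
// a unit pixel, assuming 0 <= a*x+b <= 1 over x in [0,1]:
// integral of (a*x + b) dx from 0 to 1 = a/2 + b.
func analyticCoverage(a, b float64) float64 {
    return a/2 + b
}

// supersample estimates the same coverage with an n*n sample grid,
// the approach analytic anti-aliasing avoids.
func supersample(a, b float64, n int) float64 {
    inside := 0
    for j := 0; j < n; j++ {
        for i := 0; i < n; i++ {
            x := (float64(i) + 0.5) / float64(n)
            y := (float64(j) + 0.5) / float64(n)
            if y < a*x+b {
                inside++
            }
        }
    }
    return float64(inside) / float64(n*n)
}

func main() {
    a, b := 0.5, 0.25
    fmt.Println(analyticCoverage(a, b)) // exact: 0.5
    fmt.Println(supersample(a, b, 4))   // estimate from 16 samples
}
```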

Euler Spiral Flattening

But wait — DrawCircle produces cubic Bezier curves, not line segments. How do we get from curves to lines?

Vello uses Euler spiral approximation for adaptive curve flattening. Unlike naive subdivision (which produces too many or too few segments), Euler spirals provide optimal error bounds:

// FlattenFill flattens cubic Beziers using Vello's Euler spiral algorithm
func FlattenFill(cubics []CubicBezier) []LineSoup {
    var lines []LineSoup
    for _, c := range cubics {
        p0 := vec2{c.P0[0], c.P0[1]}
        p1 := vec2{c.P1[0], c.P1[1]}
        p2 := vec2{c.P2[0], c.P2[1]}
        p3 := vec2{c.P3[0], c.P3[1]}
        flattenEulerFill(p0, p1, p2, p3, &lines)
    }
    return lines
}

The algorithm evaluates curvature at each point and subdivides only where the error exceeds a tolerance (0.25 pixels by default). A nearly-straight segment becomes 1 line. A tight curve gets more subdivisions. The result: minimum line segments for maximum visual quality.
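For contrast, here is a sketch of the naive recursive approach the text mentions: de Casteljau splitting until the control points sit within tolerance of the chord. It demonstrates the adaptivity (a straight cubic yields 1 line, a tight curve yields many) but lacks the Euler spiral error-bound guarantees. This is an illustration, not Vello's flattenEulerFill:

```go
package main

import (
    "fmt"
    "math"
)

type pt struct{ x, y float64 }

func lerp(a, b pt, t float64) pt { return pt{a.x + (b.x-a.x)*t, a.y + (b.y-a.y)*t} }

// flatEnough tests how far the control points stray from the chord p0..p3.
func flatEnough(p0, p1, p2, p3 pt, tol float64) bool {
    dist := func(p, a, b pt) float64 {
        dx, dy := b.x-a.x, b.y-a.y
        l := math.Hypot(dx, dy)
        if l == 0 {
            return math.Hypot(p.x-a.x, p.y-a.y)
        }
        return math.Abs((p.x-a.x)*dy-(p.y-a.y)*dx) / l
    }
    return dist(p1, p0, p3) <= tol && dist(p2, p0, p3) <= tol
}

// flatten recursively splits the cubic at t=0.5 (de Casteljau)
// until each piece is flat enough, then emits its chord.
func flatten(p0, p1, p2, p3 pt, tol float64, out *[][2]pt) {
    if flatEnough(p0, p1, p2, p3, tol) {
        *out = append(*out, [2]pt{p0, p3})
        return
    }
    ab, bc, cd := lerp(p0, p1, 0.5), lerp(p1, p2, 0.5), lerp(p2, p3, 0.5)
    abc, bcd := lerp(ab, bc, 0.5), lerp(bc, cd, 0.5)
    mid := lerp(abc, bcd, 0.5)
    flatten(p0, ab, abc, mid, tol, out)
    flatten(mid, bcd, cd, p3, tol, out)
}

func main() {
    var straight, curved [][2]pt
    flatten(pt{0, 0}, pt{1, 0}, pt{2, 0}, pt{3, 0}, 0.25, &straight)
    flatten(pt{0, 0}, pt{0, 100}, pt{100, 100}, pt{100, 0}, 0.25, &curved)
    fmt.Println(len(straight), len(curved)) // straight: 1 line; curved: many
}
```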

Per-Tile Command Lists (PTCL)

The coarse stage (Stage 7) generates Per-Tile Command Lists — each tile gets a stream of commands like "fill with coverage from segment N", "apply color #FF0000", "begin clip", "end clip". This is what makes the pipeline work for multiple overlapping paths (UI with buttons, text, backgrounds) in a single fine rasterization pass:

// PTCL commands
const (
    CmdEnd       = 0
    CmdFill      = 1  // Compute coverage from segments
    CmdSolid     = 3  // Full coverage (no segments needed)
    CmdColor     = 5  // Apply RGBA color with source-over blending
    CmdBeginClip = 10 // Push clip layer
    CmdEndClip   = 11 // Pop and composite clip
)

The fine rasterizer walks each tile's PTCL, executing commands and compositing results with premultiplied alpha — exactly like a GPU would:

func fineRasterizeTile(ptcl *PTCL, segments []PathSegment,
    bgColor [4]float32) [TileWidth * TileHeight][4]float32 {

    const pixelCount = TileWidth * TileHeight
    var rgba [pixelCount][4]float32
    for i := range rgba {
        rgba[i] = bgColor
    }

    offset := uint32(0) // read cursor into this tile's command list
    for {
        tag, nextOffset := ptcl.ReadCmd(offset)
        switch tag {
        case CmdEnd:
            return rgba
        case CmdFill:
            // Compute coverage, store in area buffer
        case CmdColor:
            // Apply color using area as mask, source-over blend
        case CmdBeginClip:
            // Push current pixels onto clip stack
        case CmdEndClip:
            // Pop and composite with clip mask
        }
        offset = nextOffset // advance to the next command
    }
}

Dual Execution: Why Both CPU and GPU?

tilecompute is a dual-execution pipeline: the same 9-stage algorithm runs on both CPU (sequential Go code) and GPU (WGSL compute shaders dispatched via VelloComputeDispatcher). Why maintain both?

1. CPU is the Core. After analyzing 8 enterprise 2D engines (Skia, Cairo, Vello, Blend2D, tiny-skia, piet, Qt RHI, Pathfinder), we found that in none of them is CPU rasterization a mere "backend". It's always the core; GPU is the optional accelerator.

2. Correctness Reference. The CPU implementation serves as a reference for the GPU compute shaders. When GPU and CPU produce different results, we know which one to trust.

3. Universal Availability. Servers, CI/CD, embedded systems — many environments have no GPU. A server generating 10,000 chart images doesn't need GPU acceleration; it needs reliable software rendering.

4. Identical Algorithm, Dual Execution. The CPU code mirrors the GPU pipeline stage-by-stage — same data structures, same logic. The 9 WGSL compute shaders are embedded via //go:embed and compiled into GPU compute pipelines via hal.Device.CreateComputePipeline(). When a GPU is available, VelloComputeDispatcher dispatches all 9 stages in parallel with pass.Dispatch(); when not, the CPU executes them sequentially.

Two Rasterizers, Two Targets

tilecompute is not the only Vello algorithm we've ported. gogpu/gg includes both of Vello's rasterization pipelines — each optimized for a different execution target:

                tilecompute                       SparseStripsRasterizer
Vello era       Original (2020–2023)              New (Issue #670, August 2024)
Source          vello_shaders/src/cpu/            sparse_strips/vello_cpu/
Tile size       16x16 (256 pixels)                4x4 (16 pixels)
Optimized for   GPU compute workgroups            CPU / SIMD registers
Key insight     256 pixels = GPU workgroup size   16 u8 pixels = one 128-bit SSE register
Data flow       Monoid prefix sums → PTCL         Sort by Y,X → strip rendering
Package         internal/gpu/tilecompute/         internal/gpu/sparse_strips.go

Why 16x16 for GPU? GPU compute shaders process tiles in parallel workgroups. 256 pixels per tile matches the typical workgroup size (256 threads), giving each thread exactly one pixel.

Why 4x4 for CPU? SIMD instructions operate on 128-bit registers. 16 pixels of u8 coverage data fit into a single SSE register, enabling vectorized operations across an entire tile at once — the same approach Intel used in Larrabee.

Both rasterizers use analytic anti-aliasing, Euler spiral flattening, and support NonZero/EvenOdd fill rules. The difference is purely in how they partition the canvas and process tiles.
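The two fill rules differ only in how an accumulated winding value maps to coverage. A sketch in float64, mirroring the formulas in fillPath above:

```go
package main

import (
    "fmt"
    "math"
)

// nonZeroAlpha: a pixel is inside when winding != 0;
// alpha saturates at full coverage.
func nonZeroAlpha(w float64) float64 {
    return math.Min(math.Abs(w), 1.0)
}

// evenOddAlpha: a pixel is inside when the winding is odd;
// the formula folds w into [0, 1].
func evenOddAlpha(w float64) float64 {
    return math.Abs(w - 2.0*math.Round(0.5*w))
}

func main() {
    // A point inside two overlapping same-direction loops has winding 2:
    fmt.Println(nonZeroAlpha(2)) // NonZero: still filled -> 1
    fmt.Println(evenOddAlpha(2)) // EvenOdd: hole -> 0
    fmt.Println(evenOddAlpha(1)) // single loop -> 1
}
```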

gogpu/gg Rasterization Engines
    │
    ├── tilecompute (16x16 tiles) — DUAL EXECUTION
    │      ├── CPU: sequential Go (reference + fallback)
    │      └── GPU: 9 WGSL shaders via VelloComputeDispatcher
    │
    └── SparseStripsRasterizer (4×4 tiles) — CPU
           CPU/SIMD-optimized pipeline
           Sort by Y,X + strip rendering

Having both means gogpu/gg selects the optimal algorithm for the target: GPU compute dispatches the 16x16 pipeline via WGSL shaders, CPU rendering defaults to the 4x4 pipeline for SIMD efficiency.

Smart Multi-Engine Selection

Having multiple rasterizers raises a question: who decides which algorithm handles which path?

We analyzed 8 enterprise 2D engines — Skia, Cairo, Blend2D, Vello, Qt, Direct2D, Flutter, SwiftUI — and found that none of them do per-path dynamic algorithm selection. Skia has separate CPU/GPU pipelines but no cross-selection. Vello has a planned vello_api for CPU/GPU choice, not yet built. Direct2D recognizes simple shapes but doesn't switch algorithms.

gogpu/gg is the first 2D graphics library with systematic multi-factor per-path selection across 5 rasterization algorithms:

Path arrives at Context.Fill()
    │
    ├── Shape detection → SDF Accelerator (circles, rrects)
    │     GPU SDF or CPU SDF — smoothstep coverage, highest quality
    │
    ├── Adaptive threshold check
    │     │
    │     ├── Below threshold → AnalyticFiller (scanline)
    │     │     Zero tile overhead, O(width × edges)
    │     │
    │     └── Above threshold → AdaptiveFiller
    │           │
    │           ├── < 10K segments → SparseStrips (4×4 tiles)
    │           │     CPU/SIMD-optimized, lower tile overhead
    │           │
    │           └── > 10K segments + large canvas → TileCompute (16×16 tiles)
    │                 GPU workgroup-ready, 16× fewer tiles
    │
    └── GPU Compute → Vello PTCL pipeline (full scene)
          9-stage GPU compute, massively parallel

The Coregex Analogy

The pattern is inspired by coregex — a Go regex library with 17 strategies and an intelligent selector that picks the optimal engine per pattern. The same idea applies here: analyze the input, choose the engine. Both are Go libraries, and both take a first-of-its-kind multi-engine approach.

            coregex               gogpu/gg
Engines     17 regex strategies   5 rasterization algorithms
Selection   Pattern analysis      7-dimension path analysis
Input       Regex pattern         Vector path
Output      Match result          Pixel coverage

Adaptive Threshold

The key insight: scanline cost grows with width × edge crossings, while tile cost grows with fill area. For large shapes, tiles win at lower complexity. For tiny shapes (< 32px), scanline always wins.

The selection between scanline and tile-based rasterizers uses an adaptive threshold derived from the path's bounding box area:

// threshold = clamp(2048 / sqrt(bboxArea), 32, 256)
//
// 100×100 bbox → 2048/100 ≈ 20 → clamped to 32 (tiles kick in early)
// 500×500 bbox → 2048/500 ≈  4 → clamped to 32 (large shapes → tiles)
//  30×30  bbox → handled earlier: below bboxMinDimension, always scanline
func adaptiveThreshold(bboxArea float64) int {
    if bboxArea <= 0 {
        return maxElementThreshold
    }
    threshold := int(2048.0 / math.Sqrt(bboxArea))
    return clamp(threshold, 32, 256)
}

User Override

Auto-selection is the default, but the user always has final say — the same principle as database query hints or GPU driver force flags:

dc := gg.NewContext(800, 600)

// Auto (default) — intelligent per-path selection
dc.SetRasterizerMode(gg.RasterizerAuto)

// Force scanline for all paths (debugging, isolation)
dc.SetRasterizerMode(gg.RasterizerAnalytic)

// Force 4×4 tiles (benchmarking CPU/SIMD path)
dc.SetRasterizerMode(gg.RasterizerSparseStrips)

// Force 16×16 tiles (benchmarking GPU workgroup path)
dc.SetRasterizerMode(gg.RasterizerTileCompute)

// Force SDF for maximum circle/rrect quality
dc.SetRasterizerMode(gg.RasterizerSDF)

The mode is per-Context — different contexts can use different strategies simultaneously. This makes A/B benchmarking trivial: render the same scene with two contexts, compare output and timing.

Numbers

Metric                  Value
Go source               2,878 LOC
Test code               2,194 LOC
WGSL shaders            1,695 LOC (9 shaders)
Total                   6,767 LOC
Tile size               16x16 pixels
Fill rules              NonZero, EvenOdd
Golden test threshold   0.15% max pixel difference
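The golden-test criterion can be sketched as a simple pixel-diff check (hypothetical helper, not the actual test code): count pixels whose RGBA differs beyond a small tolerance and require the fraction to stay under 0.15%.

```go
package main

import "fmt"

// diffFraction reports the fraction of pixels whose RGBA channels
// differ by more than tol. Hypothetical helper sketching the
// golden-test comparison, not the actual test code.
func diffFraction(a, b [][4]uint8, tol uint8) float64 {
    differing := 0
    for i := range a {
        for c := 0; c < 4; c++ {
            d := int(a[i][c]) - int(b[i][c])
            if d < 0 {
                d = -d
            }
            if d > int(tol) {
                differing++
                break // count each pixel at most once
            }
        }
    }
    return float64(differing) / float64(len(a))
}

func main() {
    cpu := make([][4]uint8, 1000)
    gpu := make([][4]uint8, 1000)
    gpu[0] = [4]uint8{3, 0, 0, 0} // one pixel off by 3 in R
    frac := diffFraction(cpu, gpu, 1)
    fmt.Println(frac, frac <= 0.0015) // 1/1000 differs: within 0.15%
}
```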

The 9 WGSL compute shaders are embedded via //go:embed into VelloComputeDispatcher, compiled into GPU compute pipelines, and dispatched with pass.Dispatch() — the same algorithm running on both CPU (reference/fallback) and GPU (parallel compute).

Part of a Larger Ecosystem

tilecompute is the rasterization core of gogpu/gg, which is part of a 466K+ line Pure Go GPU computing ecosystem:

Project       Description           LOC
gogpu/gg      2D graphics library   186K
gogpu/wgpu    Pure Go WebGPU        110K
gogpu/naga    Shader compiler        61K
gogpu/ui      GUI widget toolkit     61K
gogpu/gogpu   GPU framework          40K

The default stack is zero CGO, Pure Go — from shader compilation to GPU command submission. But gogpu also supports a Rust backend via go-webgpu FFI to wgpu-native for maximum GPU performance when needed:

Backend             Stack                            Build                 Use Case
Pure Go (default)   gogpu/wgpu → Vulkan/Metal/GLES   go build              Zero dependencies, easy cross-compile
Rust FFI (opt-in)   go-webgpu → wgpu-native          go build -tags rust   Maximum GPU performance, production

Both backends use the same gg API — the choice is transparent to application code. gg doesn't know or care which WebGPU implementation is underneath; it talks to hal.Queue.

When GPU acceleration is available, gg uses the registered WebGPU backend (Pure Go or Rust) with support for Vulkan, Metal, DX12, and OpenGL ES. When it's not — tilecompute and the CPU rasterizer handle everything.

Acknowledgments

  • The Vello team and Raph Levien for the tile rasterization pipeline and Euler spiral flattening research
  • The variable naming in tilecompute intentionally matches Vello's Rust originals for easy cross-reference

Links


go get github.com/gogpu/gg@latest
