Andrey Kolkov

Posted on Dec 26

GPU Compute Shaders in Pure Go: gogpu/gg v0.15.0

#go #graphics #gpu #webgpu

Update (December 26, 2025): gogpu/gg v0.15.0 released! GPU compute shaders, vello-style pipeline, significant performance improvements for complex scenes. Part of the 224K LOC Pure Go ecosystem.
https://www.reddit.com/r/golang/comments/1pujj63/go_now_has_its_own_professional_graphics/

Two days ago, we shipped gogpu/gg v0.14.0 with alpha masks and fluent PathBuilder. Yes, two days — we're moving fast. But looking at performance profiles, I saw a problem.

The CPU was the bottleneck.

While our GPU could render millions of pixels in milliseconds, the CPU spent significant time tessellating paths. This is the classic 2D graphics problem: CPU tessellation doesn't scale.

So we moved the entire rasterization pipeline to GPU compute shaders.

Today, gogpu/gg v0.15.0 is here: 2,280 lines of WGSL compute shaders, a vello-style pipeline, and projected speedups of 10-60x for complex scenes (benchmarks still pending). All in Pure Go.

The Performance Challenge

Let me show you what we were up against:

// Drawing 10,000 circles
ctx := gg.NewContext(800, 600)
for i := 0; i < 10000; i++ {
    ctx.DrawCircle(float64(i%100)*8, float64(i/100)*6, 3)
}
ctx.Fill()

The problem with traditional 2D rendering:

CPU tessellates paths to triangles (sequential, single-threaded)
GPU renders triangles (massively parallel)
CPU becomes the bottleneck for complex scenes

This holds for most traditional 2D libraries (Cairo, Skia, even our earlier versions) whenever path geometry is prepared on the CPU.

The Solution: All-GPU Pipeline

What if we never tessellated on CPU?

Traditional: Path → Tessellate (CPU) → Triangles → GPU
     v0.15: Path → Flatten (GPU) → Coarse (GPU) → Fine (GPU)

This is the vello approach. We implemented it in Pure Go with WGSL compute shaders:

1. Flatten Shader — Curves to Segments

Converts Bezier curves to line segments on the GPU:

@compute @workgroup_size(256)
fn flatten_main(
    @builtin(global_invocation_id) id: vec3<u32>
) {
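    // Simplified excerpt: buffer bindings and bounds checks are omitted
    let curve = curves[id.x];

    // Adaptive subdivision based on curvature
    let segments = subdivide_bezier(curve);

    // segment_count and base (this curve's offset into the output)
    // are not shown in this excerpt
    // Write to global buffer (thousands in parallel!)
    for (var i = 0u; i < segment_count; i++) {
        output_segments[base + i] = segments[i];
    }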
}

Key insight: Thousands of GPU threads process curves simultaneously, where a single-threaded CPU tessellator handles one at a time.
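To make the subdivision step concrete, here is a minimal CPU-side sketch in Go that flattens one quadratic Bezier. It is not the gogpu/gg shader: `Point`, `QuadSegmentCount`, and `FlattenQuad` are hypothetical names, and the sketch uses uniform subdivision with a segment count taken from a standard error bound, which may differ from what `subdivide_bezier` does internally.

```go
package main

import (
	"fmt"
	"math"
)

// Point is a 2D point (hypothetical helper type, not part of gogpu/gg).
type Point struct{ X, Y float64 }

// QuadSegmentCount returns how many equal-parameter line segments keep the
// flattening error of the quadratic Bezier (p0, p1, p2) within tol.
// It uses the bound: error <= |p0 - 2*p1 + p2| / (4 * n^2).
func QuadSegmentCount(p0, p1, p2 Point, tol float64) int {
	ddx := p0.X - 2*p1.X + p2.X
	ddy := p0.Y - 2*p1.Y + p2.Y
	n := int(math.Ceil(math.Sqrt(math.Hypot(ddx, ddy) / (4 * tol))))
	if n < 1 {
		n = 1 // a perfectly straight "curve" still needs one segment
	}
	return n
}

// FlattenQuad returns the n+1 polyline vertices approximating the curve.
func FlattenQuad(p0, p1, p2 Point, tol float64) []Point {
	n := QuadSegmentCount(p0, p1, p2, tol)
	pts := make([]Point, 0, n+1)
	for i := 0; i <= n; i++ {
		t := float64(i) / float64(n)
		u := 1 - t
		pts = append(pts, Point{
			X: u*u*p0.X + 2*u*t*p1.X + t*t*p2.X,
			Y: u*u*p0.Y + 2*u*t*p1.Y + t*t*p2.Y,
		})
	}
	return pts
}

func main() {
	curve := FlattenQuad(Point{0, 0}, Point{50, 100}, Point{100, 0}, 0.25)
	line := FlattenQuad(Point{0, 0}, Point{50, 0}, Point{100, 0}, 0.25)
	fmt.Println(len(curve)-1, "segments for the curve,", len(line)-1, "for the straight line")
}
```

Each curve's segment count depends only on its own control points, which is why this step parallelizes so well: on the GPU, every invocation can run the same computation for a different curve.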

2. Coarse Shader — Tile Binning

Maps segments to screen tiles using atomic operations (thread-safe on GPU!):

@compute @workgroup_size(256)
fn coarse_main(
    @builtin(global_invocation_id) id: vec3<u32>
) {
    let segment = segments[id.x];
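    let bounds = segment_bounds(segment);

    // Tile range covered by the bounding box (assumes bounds.min and
    // bounds.max are vec2<f32> pixel coordinates clamped to the viewport)
    let tile_min = vec2<u32>(bounds.min) / TILE_SIZE;
    let tile_max = vec2<u32>(bounds.max) / TILE_SIZE;

    // Find overlapping tiles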
    for (var y = tile_min.y; y <= tile_max.y; y++) {
        for (var x = tile_min.x; x <= tile_max.x; x++) {
            let tile_id = y * tile_width + x;

            // Atomic append (safe across threads!)
            let index = atomicAdd(&tile_counts[tile_id], 1u);
            tile_segments[tile_id][index] = id.x;
        }
    }
}

This is where GPU shines — thousands of threads binning segments in parallel, atomics keeping data consistent.
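The same binning idea can be sketched on the CPU with goroutines and `sync/atomic`. This is an illustration only, not the gogpu/gg implementation: `Segment`, `TileBins`, `tileSize`, and `maxPerTile` are hypothetical, and a fixed per-tile capacity is assumed for simplicity. One difference worth noting: WGSL's `atomicAdd` returns the old value, while Go's `atomic.AddUint32` returns the new one, hence the `- 1`.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

const (
	tileSize   = 16 // pixels per tile side (assumed value)
	maxPerTile = 64 // fixed slot capacity per tile (assumed value)
)

// Segment is a line segment in pixel coordinates (hypothetical type).
type Segment struct{ X0, Y0, X1, Y1 float64 }

// TileBins holds one atomic counter and one fixed-size slot list per tile.
type TileBins struct {
	TilesX, TilesY int
	Counts         []uint32 // number of segments appended to each tile
	Slots          []uint32 // segment indices, maxPerTile entries per tile
}

func NewTileBins(width, height int) *TileBins {
	tx := (width + tileSize - 1) / tileSize
	ty := (height + tileSize - 1) / tileSize
	return &TileBins{
		TilesX: tx, TilesY: ty,
		Counts: make([]uint32, tx*ty),
		Slots:  make([]uint32, tx*ty*maxPerTile),
	}
}

// tileIndex maps a pixel coordinate to a tile index clamped to [0, n-1].
func tileIndex(v float64, n int) int {
	t := int(math.Floor(v / tileSize))
	if t < 0 {
		return 0
	}
	if t >= n {
		return n - 1
	}
	return t
}

// Bin appends every segment to each tile its bounding box overlaps,
// one goroutine per segment, standing in for one GPU invocation each.
func (b *TileBins) Bin(segs []Segment) {
	var wg sync.WaitGroup
	for i := range segs {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s := segs[i]
			x0 := tileIndex(math.Min(s.X0, s.X1), b.TilesX)
			x1 := tileIndex(math.Max(s.X0, s.X1), b.TilesX)
			y0 := tileIndex(math.Min(s.Y0, s.Y1), b.TilesY)
			y1 := tileIndex(math.Max(s.Y0, s.Y1), b.TilesY)
			for y := y0; y <= y1; y++ {
				for x := x0; x <= x1; x++ {
					tile := y*b.TilesX + x
					// Reserve a unique slot; no two goroutines get the same index.
					idx := atomic.AddUint32(&b.Counts[tile], 1) - 1
					if idx < maxPerTile {
						b.Slots[tile*maxPerTile+int(idx)] = uint32(i)
					}
				}
			}
		}(i)
	}
	wg.Wait()
}

func main() {
	bins := NewTileBins(32, 32) // 2x2 tiles
	bins.Bin([]Segment{{1, 1, 5, 5}, {1, 1, 30, 1}, {20, 20, 25, 25}})
	fmt.Println(bins.Counts)
}
```

The counter keeps incrementing even when a tile is full, so a count above `maxPerTile` signals overflow. The order of entries within a tile depends on scheduling, so any consumer has to be order-independent or sort afterwards.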

3. Fine Shader — Per-Pixel Coverage

Computes anti-aliased coverage analytically (no MSAA needed!):

@compute @workgroup_size(8, 8)
fn fine_main(
    @builtin(global_invocation_id) pixel: vec3<u32>
) {
    let tile_id = (pixel.y / TILE_SIZE) * tile_width + (pixel.x / TILE_SIZE);

    var coverage = 0.0;
    for (var i = 0u; i < tile_counts[tile_id]; i++) {
        let seg = segments[tile_segments[tile_id][i]];
        coverage += compute_segment_coverage(seg, pixel.xy);
    }

    // Apply fill rule, write pixel
    // (assumes `output` is a write-only storage texture and `color` is bound elsewhere)
    textureStore(output, pixel.xy, vec4<f32>(color.rgb, saturate(coverage)));
}

Analytic coverage gives smooth anti-aliasing at any scale. No jaggies, no MSAA overhead.
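The "apply fill rule" step is where accumulated signed coverage turns into an alpha value. Below is a small Go sketch of the two common rules. `FillRule`, `NonZero`, `EvenOdd`, and `CoverageToAlpha` are hypothetical names for illustration, and the formulas are the usual ones for area-coverage rasterizers, not necessarily what fine.wgsl contains.

```go
package main

import (
	"fmt"
	"math"
)

// FillRule selects how accumulated signed coverage becomes alpha.
type FillRule int

const (
	NonZero FillRule = iota
	EvenOdd
)

// CoverageToAlpha maps a pixel's accumulated signed coverage to [0, 1].
func CoverageToAlpha(coverage float64, rule FillRule) float64 {
	switch rule {
	case EvenOdd:
		// Fold coverage into a triangle wave with period 2:
		// 0 -> 0, 1 -> 1, 2 -> 0, 3 -> 1, ...
		return math.Abs(coverage - 2*math.Round(coverage/2))
	default:
		// NonZero: any nonzero winding is inside; clamp partial coverage.
		return math.Min(math.Abs(coverage), 1)
	}
}

func main() {
	for _, c := range []float64{0.25, 1, 2, -1} {
		fmt.Printf("coverage %5.2f  nonzero %.2f  evenodd %.2f\n",
			c, CoverageToAlpha(c, NonZero), CoverageToAlpha(c, EvenOdd))
	}
}
```

Fractional values pass through both rules unchanged near an edge, which is what produces the anti-aliased gradient; the rules only differ once paths overlap.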

Expected Performance Gains

Based on our architecture and GPU parallelism theory, we expect significant improvements for complex scenes:

Workload	Expected Behavior
Simple paths (< 100 segments)	CPU may be faster (GPU upload overhead)
Medium complexity (100-1K segments)	GPU starts to win
Complex scenes (1K-10K segments)	GPU significantly faster (10-30x estimated)
Very complex (10K+ segments)	GPU dominates (30-60x estimated)

These are architectural projections. Actual performance varies by hardware, driver, and scene complexity. We're working on comprehensive benchmarks.

The larger the scene, the bigger the advantage. This is parallelism at work.

Hybrid Architecture: Best of Both Worlds

Here's the clever part — we don't always use GPU:

// backend/wgpu/sparse_strips_gpu.go
type HybridFineRasterizer struct {
    mu               sync.RWMutex
    gpu              *GPUFineRasterizer  // nil if unavailable
    cpu              *FineRasterizer     // always available
    segmentThreshold int                 // min segments for GPU
    gpuAvailable     bool
}

func (h *HybridFineRasterizer) Rasterize(
    coarse *CoarseRasterizer,
    segments *SegmentList,
    backdrop []int32,
) {
    // GPU worthwhile only above threshold
    if h.gpuAvailable && segments.Len() >= h.segmentThreshold {
        // GPU path: dispatch compute shaders
        h.gpu.Rasterize(coarse, segments, backdrop, scene.FillNonZero)
    } else {
        // CPU path: software rasterization
        h.cpu.RasterizeSegments(segments, backdrop)
    }
}

Why? For small paths (fewer than 100 segments by default), the fixed cost of uploading data and dispatching compute shaders outweighs the parallel speedup, so the CPU wins.

Result: Automatic selection based on segment count.

Building This Was Hard

Challenge 1: WGSL Compilation

WGSL needs to compile to SPIR-V for Vulkan, MSL for Metal. We use our own gogpu/naga compiler (~23K LOC Pure Go):

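// Simple compilation to SPIR-V
// (the result is named spirvCode so it does not shadow the spirv package)
spirvCode, err := naga.Compile(wgslSource)
if err != nil {
    log.Fatal(err)
}

// With options (plain assignment: both variables already exist)
spirvCode, err = naga.CompileWithOptions(wgslSource, naga.CompileOptions{
    SPIRVVersion: spirv.Version1_3,
    Debug:        true,
    Validate:     true,
})
if err != nil {
    log.Fatal(err)
}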

Supports vertex, fragment, compute shaders. Atomics, barriers, texture sampling. Pure Go.

Challenge 2: GPU Memory Management

Buffer sizing is critical:

// Too small? GPU crash
// Too large? Wasted memory
maxSegments := pathComplexity * 4  // Conservative estimate
segmentBuffer := device.CreateBuffer(maxSegments * 24) // 24 bytes/segment

We use conservative estimates with safety margins. Real-world paths rarely exceed our allocations.

Challenge 3: Atomic Operations

WGSL atomics are strict:

Only atomic<u32> and atomic<i32> supported
Relaxed memory ordering only (no acquire/release variants)
Buffer layout matters

// This works
@group(0) @binding(1) var<storage, read_write> counts: array<atomic<u32>>;
atomicAdd(&counts[i], 1u);

// This doesn't
var<storage, read_write> counts: array<u32>; // Not atomic!

Debugging this was tricky. WGSL validation errors are... cryptic.

What We Shipped

Statistics:

2,280 LOC WGSL shaders (8 shader files)
~20K LOC Go in backend/wgpu/
74% test coverage overall
0 linter issues

Shader Files:

backend/wgpu/shaders/
├── flatten.wgsl     # 589 LOC — Bezier curve flattening
├── coarse.wgsl      # 335 LOC — Tile binning with atomics
├── fine.wgsl        # 290 LOC — Per-pixel coverage
├── blend.wgsl       # 424 LOC — 29 blend modes on GPU
├── composite.wgsl   # 235 LOC — Layer compositing
├── strip.wgsl       # 155 LOC — Sparse strip rendering
├── blit.wgsl        #  43 LOC — Final output blit
└── msdf_text.wgsl   # 209 LOC — MSDF text rendering

Go Implementation:

backend/wgpu/
├── gpu_flatten.go       # 809 LOC — Flatten pipeline
├── gpu_coarse.go        # 698 LOC — Coarse rasterization
├── gpu_fine.go          # 752 LOC — Fine rasterization
├── sparse_strips_gpu.go # 837 LOC — Hybrid CPU/GPU selection
├── renderer.go          # 822 LOC — Main renderer
├── pipeline.go          # 369 LOC — Pipeline orchestration
├── memory.go            # 413 LOC — GPU memory management
└── ... (40+ files total)

Try It Yourself

Installation

go get github.com/gogpu/gg@v0.15.0

Quick Example

package main

import "github.com/gogpu/gg"

func main() {
    ctx := gg.NewContext(512, 512)
    ctx.ClearWithColor(gg.White)

    // 1000 circles — GPU backend handles complex scenes efficiently
    ctx.SetColor(gg.Hex("#e74c3c"))
    for i := 0; i < 1000; i++ {
        ctx.DrawCircle(float64(i%32)*16, float64(i/32)*16, 8)
    }
    ctx.Fill()

    ctx.SavePNG("gpu_circles.png")
}

Backend Selection

The library automatically selects the best backend. GPU is used by default when available:

# Default: GPU enabled (uses Vulkan/Metal when available)
go build ./...

# Disable GPU (force software-only)
go build -tags nogpu ./...

The GoGPU Ecosystem

With v0.15.0, the GoGPU ecosystem now totals:

Project	Description	Version	LOC
gogpu/gg	2D graphics + GPU compute	v0.15.0	~104K
gogpu/gogpu	Graphics framework	v0.8.0	~26K
gogpu/wgpu	Pure Go WebGPU	v0.7.0	~71K
gogpu/naga	WGSL shader compiler	v0.6.0	~23K
gogpu/ui	GUI toolkit (planned)	—	—

Total: ~224K lines of Pure Go. No CGO. No Rust. Just go build.

Comparison with Industry

Feature	gogpu/gg v0.15.0	vello (Rust)	Skia (C++)
GPU Compute	WGSL	WGSL	Metal/Vulkan
Language	Pure Go	Rust	C++
CGO Required	No	No	Yes
Hybrid Mode	Auto	GPU-first	Manual
Platforms	Win/Lin/Mac	Win/Lin/Mac	All

As far as we know, we're the only Pure Go library with GPU compute shaders for 2D graphics.

Entering the Stabilization Phase

With v0.15.0, we're closing the rapid development chapter.

In just three weeks, gogpu/gg went from v0.1.0 to v0.15.0 — from a basic context to GPU compute shaders. We've built the foundation: software rasterizer, GPU backend, text rendering, scene graph, blend modes, and now compute shaders.

What comes next is different. We're shifting focus to:

Stability — Bug fixes, edge cases, cross-platform testing
Performance — Real benchmarks, profiling, optimization
Documentation — API docs, tutorials, examples
Polish — Better error messages, developer experience

We Need Your Help!

This is where community testing becomes critical. We've built the architecture — now we need real-world usage to find the rough edges.

How to contribute:

Test it — Try gogpu/gg in your projects, report issues
Benchmark it — Run performance tests on your hardware
Break it — Find edge cases, corner cases, crashes
Document it — Examples, tutorials, use cases

go get github.com/gogpu/gg@v0.15.0

Found a bug? Open an issue.
Have questions? Join discussions.
Want to contribute? PRs welcome!

Roadmap to v1.0

v0.16.0 — v0.19.0 (Stabilization):

Comprehensive benchmark suite
Performance profiling and optimization
Bug fixes from community feedback
Documentation and examples
GPU gradient shaders
Shadow/blur effects

v1.0.0 (Production Ready):

Stable, documented API
Full test coverage
Performance guarantees
Migration guide
gogpu/ui integration

Acknowledgments

This work was inspired by:

vello — Modern GPU 2D renderer in Rust
piet-gpu — GPU compute experiments
pathfinder — GPU path rendering

Special thanks to the Go community for the continued support and feedback!

DEV Community
