
Andrey Kolkov


GPU Compute Shaders in Pure Go: gogpu/gg v0.15.0

Update (December 26, 2025): gogpu/gg v0.15.0 released! GPU compute shaders, vello-style pipeline, significant performance improvements for complex scenes. Part of the 224K LOC Pure Go ecosystem.
https://www.reddit.com/r/golang/comments/1pujj63/go_now_has_its_own_professional_graphics/


Two days ago, we shipped gogpu/gg v0.14.0 with alpha masks and fluent PathBuilder. Yes, two days — we're moving fast. But looking at performance profiles, I saw a problem.

The CPU was the bottleneck.

While our GPU could render millions of pixels in milliseconds, the CPU spent significant time tessellating paths. This is the classic 2D graphics problem: CPU tessellation doesn't scale.

So we moved the entire rasterization pipeline to GPU compute shaders.

Today, gogpu/gg v0.15.0 is here: 2,280 lines of WGSL compute shaders, vello-style pipeline, dramatic speedups for complex scenes. All in Pure Go.


The Performance Challenge

Let me show you what we were up against:

// Drawing 10,000 circles
ctx := gg.NewContext(800, 600)
for i := 0; i < 10000; i++ {
    ctx.DrawCircle(float64(i%100)*8, float64(i/100)*6, 3)
}
ctx.Fill()

The problem with traditional 2D rendering:

  • CPU tessellates paths to triangles (sequential, single-threaded)
  • GPU renders triangles (massively parallel)
  • CPU becomes the bottleneck for complex scenes

This pattern is common across traditional 2D libraries that prepare paths on the CPU (Cairo, Skia, even our earlier versions).


The Solution: All-GPU Pipeline

What if we never tessellated on CPU?

Traditional: Path → Tessellate (CPU) → Triangles → GPU
     v0.15: Path → Flatten (GPU) → Coarse (GPU) → Fine (GPU)

This is the vello approach. We implemented it in Pure Go with WGSL compute shaders:

1. Flatten Shader — Curves to Segments

Converts Bezier curves to line segments on the GPU:

@compute @workgroup_size(256)
fn flatten_main(
    @builtin(global_invocation_id) id: vec3<u32>
) {
    let curve = curves[id.x];

    // Adaptive subdivision based on curvature
    let segments = subdivide_bezier(curve);

    // Write to global buffer (thousands in parallel!)
    // Note: `base` and `segment_count` are not defined in this simplified excerpt.
    for (var i = 0u; i < segment_count; i++) {
        output_segments[base + i] = segments[i];
    }
}

Key insight: Thousands of GPU threads process curves simultaneously, while a single CPU thread handles one at a time.
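To make the flatten step concrete on the CPU side, here is a minimal Go sketch for a single quadratic Bezier. It is not the algorithm in flatten.wgsl: it assumes uniform subdivision, with the segment count chosen from the standard second-derivative error bound, and the names `QuadBezier`, `segmentCount`, and `Flatten` are hypothetical. On the GPU, each invocation would do this work for one curve.

```go
package main

import (
	"fmt"
	"math"
)

// Point is a 2D point in pixel coordinates.
type Point struct{ X, Y float64 }

// QuadBezier is a quadratic Bezier curve with control points P0, P1, P2.
// (Hypothetical type for illustration, not part of gogpu/gg.)
type QuadBezier struct{ P0, P1, P2 Point }

// segmentCount returns how many equal-parameter line segments keep the
// chord error at or below tol pixels (tol must be > 0). For a quadratic,
// n uniform chords deviate from the curve by at most |P0 - 2*P1 + P2| / (4*n*n).
func segmentCount(c QuadBezier, tol float64) int {
	ddx := c.P0.X - 2*c.P1.X + c.P2.X
	ddy := c.P0.Y - 2*c.P1.Y + c.P2.Y
	n := int(math.Ceil(math.Sqrt(math.Hypot(ddx, ddy) / (4 * tol))))
	if n < 1 {
		n = 1 // a straight "curve" still needs one segment
	}
	return n
}

// eval evaluates the curve at parameter t in [0, 1].
func (c QuadBezier) eval(t float64) Point {
	u := 1 - t
	return Point{
		X: u*u*c.P0.X + 2*u*t*c.P1.X + t*t*c.P2.X,
		Y: u*u*c.P0.Y + 2*u*t*c.P1.Y + t*t*c.P2.Y,
	}
}

// Flatten returns the polyline vertices: n+1 points for n segments.
func Flatten(c QuadBezier, tol float64) []Point {
	n := segmentCount(c, tol)
	pts := make([]Point, 0, n+1)
	for i := 0; i <= n; i++ {
		pts = append(pts, c.eval(float64(i)/float64(n)))
	}
	return pts
}

func main() {
	curve := QuadBezier{Point{0, 0}, Point{50, 100}, Point{100, 0}}
	pts := Flatten(curve, 0.25)
	fmt.Println(len(pts)-1, "segments") // prints "15 segments"
}
```

Flatter curves produce fewer segments and sharper ones produce more, which is why the output size per curve is not known up front.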

2. Coarse Shader — Tile Binning

Maps segments to screen tiles using atomic operations (thread-safe on GPU!):

@compute @workgroup_size(256)
fn coarse_main(
    @builtin(global_invocation_id) id: vec3<u32>
) {
    let segment = segments[id.x];
    let bounds = segment_bounds(segment);

    // Find overlapping tiles
    for (var y = tile_min.y; y <= tile_max.y; y++) {
        for (var x = tile_min.x; x <= tile_max.x; x++) {
            let tile_id = y * tile_width + x;

            // Atomic append (safe across threads!)
            let index = atomicAdd(&tile_counts[tile_id], 1u);
            tile_segments[tile_id][index] = id.x;
        }
    }
}

This is where GPU shines — thousands of threads binning segments in parallel, atomics keeping data consistent.
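The same atomic-append pattern can be tried out in plain Go, with goroutines standing in for GPU invocations. This is a sketch under stated assumptions, not the code in coarse.wgsl: the 16-pixel tile size, the fixed per-tile capacity, and the `TileBins` type are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

const tileSize = 16 // pixels per tile edge (hypothetical value)

// Segment is a line segment in pixel coordinates.
type Segment struct{ X0, Y0, X1, Y1 float64 }

// TileBins stores, per tile, the indices of the segments that touch it.
type TileBins struct {
	tilesX, tilesY int
	capPerTile     uint32
	counts         []uint32 // one atomic counter per tile
	slots          []uint32 // tilesX*tilesY*capPerTile segment indices
}

func NewTileBins(width, height int, capPerTile uint32) *TileBins {
	tx := (width + tileSize - 1) / tileSize
	ty := (height + tileSize - 1) / tileSize
	return &TileBins{
		tilesX: tx, tilesY: ty, capPerTile: capPerTile,
		counts: make([]uint32, tx*ty),
		slots:  make([]uint32, tx*ty*int(capPerTile)),
	}
}

// tileRange maps a pixel interval to an inclusive tile interval, clamped
// to the grid. Off-screen intervals come back empty (lo > hi).
func tileRange(a, b float64, tiles int) (lo, hi int) {
	lo = int(math.Floor(math.Min(a, b) / tileSize))
	hi = int(math.Floor(math.Max(a, b) / tileSize))
	if lo < 0 {
		lo = 0
	}
	if hi > tiles-1 {
		hi = tiles - 1
	}
	return lo, hi
}

// binOne appends segment idx to every tile its bounding box overlaps.
func (b *TileBins) binOne(idx uint32, s Segment) {
	x0, x1 := tileRange(s.X0, s.X1, b.tilesX)
	y0, y1 := tileRange(s.Y0, s.Y1, b.tilesY)
	for y := y0; y <= y1; y++ {
		for x := x0; x <= x1; x++ {
			tile := y*b.tilesX + x
			// AddUint32 returns the new value, so the reserved slot is one less.
			slot := atomic.AddUint32(&b.counts[tile], 1) - 1
			if slot < b.capPerTile { // drop overflow rather than write out of bounds
				b.slots[tile*int(b.capPerTile)+int(slot)] = idx
			}
		}
	}
}

// Bin processes all segments concurrently, one goroutine per segment.
func (b *TileBins) Bin(segs []Segment) {
	var wg sync.WaitGroup
	for i := range segs {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.binOne(uint32(i), segs[i])
		}(i)
	}
	wg.Wait()
}

// Count returns how many segment indices were stored for a tile.
func (b *TileBins) Count(tile int) uint32 {
	n := atomic.LoadUint32(&b.counts[tile])
	if n > b.capPerTile {
		n = b.capPerTile
	}
	return n
}

func main() {
	bins := NewTileBins(64, 64, 8) // 4x4 tiles
	bins.Bin([]Segment{{1, 1, 10, 10}, {1, 1, 20, 1}, {40, 40, 40, 60}})
	fmt.Println(bins.Count(0), bins.Count(1), bins.Count(10), bins.Count(14))
}
```

The counter reserves a unique slot per writer, so no two goroutines write the same element. The order of indices inside a tile is not deterministic, which is fine when the later pass only sums contributions.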

3. Fine Shader — Per-Pixel Coverage

Computes anti-aliased coverage analytically (no MSAA needed!):

@compute @workgroup_size(8, 8)
fn fine_main(
    @builtin(global_invocation_id) pixel: vec3<u32>
) {
    let tile_id = (pixel.y / TILE_SIZE) * tile_width + (pixel.x / TILE_SIZE);

    var coverage = 0.0;
    for (var i = 0u; i < tile_counts[tile_id]; i++) {
        let seg = segments[tile_segments[tile_id][i]];
        coverage += compute_segment_coverage(seg, pixel.xy);
    }

    // Apply fill rule, write pixel
    output[pixel.xy] = vec4<f32>(color.rgb, saturate(coverage));
}

Analytic coverage gives smooth anti-aliased edges at any scale. No jaggies, no MSAA overhead.
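As a small illustration of "analytic coverage" and the fill-rule step, the Go sketch below handles the simplest case, axis-aligned rectangles, where the covered area of a pixel can be computed exactly. The real fine shader accumulates per-segment contributions instead. The `Rect`, `pixelArea`, and `applyFillRule` names are hypothetical, and both rectangles are assumed to have the same winding direction.

```go
package main

import (
	"fmt"
	"math"
)

// Rect is an axis-aligned rectangle in pixel coordinates.
type Rect struct{ X0, Y0, X1, Y1 float64 }

// overlap returns the length of the intersection of [a0, a1] and [b0, b1].
func overlap(a0, a1, b0, b1 float64) float64 {
	return math.Max(0, math.Min(a1, b1)-math.Max(a0, b0))
}

// pixelArea returns the exact area of pixel (px, py) covered by r, in [0, 1].
func pixelArea(r Rect, px, py int) float64 {
	x, y := float64(px), float64(py)
	return overlap(r.X0, r.X1, x, x+1) * overlap(r.Y0, r.Y1, y, y+1)
}

// FillRule selects how accumulated area becomes final coverage.
type FillRule int

const (
	NonZero FillRule = iota
	EvenOdd
)

// applyFillRule maps accumulated signed area to an alpha in [0, 1].
func applyFillRule(area float64, rule FillRule) float64 {
	switch rule {
	case EvenOdd:
		// Distance to the nearest even integer: 0 -> 0, 1 -> 1, 2 -> 0.
		return math.Abs(area - 2*math.Round(area/2))
	default:
		// Non-zero: any accumulated winding counts, clamped to full coverage.
		return math.Min(math.Abs(area), 1)
	}
}

func main() {
	a := Rect{0.5, 0, 3, 1}
	b := Rect{1, 0, 2, 1} // overlaps a on pixel (1, 0)

	edge := pixelArea(a, 0, 0)                   // half-covered edge pixel
	sum := pixelArea(a, 1, 0) + pixelArea(b, 1, 0) // doubly covered pixel

	fmt.Println(edge)                                                 // prints 0.5
	fmt.Println(applyFillRule(sum, NonZero), applyFillRule(sum, EvenOdd)) // prints 1 0
}
```

The edge pixel gets a fractional alpha straight from geometry, which is where the anti-aliasing comes from. The doubly covered pixel shows why the fill rule matters: non-zero keeps it filled, even-odd punches it out.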


Expected Performance Gains

Based on our architecture and GPU parallelism theory, we expect significant improvements for complex scenes:

Workload                             | Expected Behavior
-------------------------------------|---------------------------------------------
Simple paths (< 100 segments)        | CPU may be faster (GPU upload overhead)
Medium complexity (100-1K segments)  | GPU starts to win
Complex scenes (1K-10K segments)     | GPU significantly faster (10-30x estimated)
Very complex (10K+ segments)         | GPU dominates (30-60x estimated)

These are architectural projections. Actual performance varies by hardware, driver, and scene complexity. We're working on comprehensive benchmarks.

The larger the scene, the bigger the advantage. This is parallelism at work.


Hybrid Architecture: Best of Both Worlds

Here's the clever part — we don't always use GPU:

// backend/wgpu/sparse_strips_gpu.go
type HybridFineRasterizer struct {
    mu               sync.RWMutex
    gpu              *GPUFineRasterizer  // nil if unavailable
    cpu              *FineRasterizer     // always available
    segmentThreshold int                 // min segments for GPU
    gpuAvailable     bool
}

func (h *HybridFineRasterizer) Rasterize(
    coarse *CoarseRasterizer,
    segments *SegmentList,
    backdrop []int32,
) {
    // GPU worthwhile only above threshold
    if h.gpuAvailable && segments.Len() >= h.segmentThreshold {
        // GPU path: dispatch compute shaders
        h.gpu.Rasterize(coarse, segments, backdrop, scene.FillNonZero)
    } else {
        // CPU path: software rasterization
        h.cpu.RasterizeSegments(segments, backdrop)
    }
}

Why? For small paths (< 100 segments by default), GPU dispatch overhead outweighs the parallel speedup, so the CPU wins for simple cases.

Result: Automatic selection based on segment count.


Building This Was Hard

Challenge 1: WGSL Compilation

WGSL needs to compile to SPIR-V for Vulkan, MSL for Metal. We use our own gogpu/naga compiler (~23K LOC Pure Go):

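// Simple compilation to SPIR-V
// (the result is named spirvBytes so it does not shadow the spirv package)
spirvBytes, err := naga.Compile(wgslSource)
if err != nil {
    log.Fatal(err)
}

// With options (an alternative to the call above; note = instead of :=)
spirvBytes, err = naga.CompileWithOptions(wgslSource, naga.CompileOptions{
    SPIRVVersion: spirv.Version1_3,
    Debug:        true,
    Validate:     true,
})
if err != nil {
    log.Fatal(err)
}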

Supports vertex, fragment, compute shaders. Atomics, barriers, texture sampling. Pure Go.

Challenge 2: GPU Memory Management

Buffer sizing is critical:

// Too small? GPU crash
// Too large? Wasted memory
maxSegments := pathComplexity * 4  // Conservative estimate
segmentBuffer := device.CreateBuffer(maxSegments * 24) // 24 bytes/segment

We use conservative estimates with safety margins. Real-world paths rarely exceed our allocations.
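One way to make that sizing rule explicit is a small helper that applies the margin, rounds up, and refuses sizes beyond a limit so the caller can fall back to the CPU path. This is a sketch, not the code in memory.go: the 4x factor and 24 bytes per segment come from the snippet above, while the 256-byte rounding, the `maxBytes` parameter, and the function name are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	bytesPerSegment = 24  // from the snippet above
	safetyFactor    = 4   // from the snippet above
	alignment       = 256 // hypothetical rounding granularity
)

// ErrBufferTooLarge signals that the estimate exceeds the caller's limit.
var ErrBufferTooLarge = errors.New("segment buffer exceeds device limit")

// segmentBufferSize returns a conservative byte size for the segment buffer.
// maxBytes is whatever limit the caller reads from its device (assumption).
func segmentBufferSize(pathComplexity int, maxBytes uint64) (uint64, error) {
	if pathComplexity < 1 {
		pathComplexity = 1 // never allocate a zero-sized buffer
	}
	perUnit := uint64(safetyFactor * bytesPerSegment)

	// Check before multiplying so a huge input cannot overflow uint64.
	if uint64(pathComplexity) > maxBytes/perUnit {
		return 0, ErrBufferTooLarge
	}
	size := uint64(pathComplexity) * perUnit

	// Round up to the next multiple of alignment.
	size = (size + alignment - 1) / alignment * alignment
	if size > maxBytes {
		return 0, ErrBufferTooLarge
	}
	return size, nil
}

func main() {
	size, err := segmentBufferSize(100, 1<<20)
	fmt.Println(size, err) // prints "9728 <nil>"
}
```

Returning an error instead of allocating lets a hybrid rasterizer treat "too big for the GPU" the same way it treats "GPU unavailable".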

Challenge 3: Atomic Operations

WGSL atomics are strict:

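  • Only atomic<u32> and atomic<i32> are supported
  • All atomic operations use relaxed memory ordering, so cross-thread ordering needs explicit barriers
  • Atomics must live in workgroup memory or a read_write storage buffer, so buffer layout matters

// This works
@group(0) @binding(1) var<storage, read_write> counts: array<atomic<u32>>;
atomicAdd(&counts[i], 1u);

// This fails validation: atomicAdd needs a pointer to an atomic type
var<storage, read_write> counts: array<u32>; // Not atomic!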

Debugging this was tricky. WGSL validation errors are... cryptic.


What We Shipped

Statistics:

  • 2,280 LOC WGSL shaders (8 shader files)
  • ~20K LOC Go in backend/wgpu/
  • 74% test coverage overall
  • 0 linter issues

Shader Files:

backend/wgpu/shaders/
├── flatten.wgsl     # 589 LOC — Bezier curve flattening
├── coarse.wgsl      # 335 LOC — Tile binning with atomics
├── fine.wgsl        # 290 LOC — Per-pixel coverage
├── blend.wgsl       # 424 LOC — 29 blend modes on GPU
├── composite.wgsl   # 235 LOC — Layer compositing
├── strip.wgsl       # 155 LOC — Sparse strip rendering
├── blit.wgsl        #  43 LOC — Final output blit
└── msdf_text.wgsl   # 209 LOC — MSDF text rendering

Go Implementation:

backend/wgpu/
├── gpu_flatten.go       # 809 LOC — Flatten pipeline
├── gpu_coarse.go        # 698 LOC — Coarse rasterization
├── gpu_fine.go          # 752 LOC — Fine rasterization
├── sparse_strips_gpu.go # 837 LOC — Hybrid CPU/GPU selection
├── renderer.go          # 822 LOC — Main renderer
├── pipeline.go          # 369 LOC — Pipeline orchestration
├── memory.go            # 413 LOC — GPU memory management
└── ... (40+ files total)

Try It Yourself

Installation

go get github.com/gogpu/gg@v0.15.0

Quick Example

package main

import "github.com/gogpu/gg"

func main() {
    ctx := gg.NewContext(512, 512)
    ctx.ClearWithColor(gg.White)

    // 1000 circles — GPU backend handles complex scenes efficiently
    ctx.SetColor(gg.Hex("#e74c3c"))
    for i := 0; i < 1000; i++ {
        ctx.DrawCircle(float64(i%32)*16, float64(i/32)*16, 8)
    }
    ctx.Fill()

    ctx.SavePNG("gpu_circles.png")
}

Backend Selection

The library automatically selects the best backend. GPU is used by default when available:

# Default: GPU enabled (uses Vulkan/Metal when available)
go build ./...

# Disable GPU (force software-only)
go build -tags nogpu ./...

The GoGPU Ecosystem

With v0.15.0, the GoGPU ecosystem now totals:

Project      | Description               | Version | LOC
-------------|---------------------------|---------|-------
gogpu/gg     | 2D graphics + GPU compute | v0.15.0 | ~104K
gogpu/gogpu  | Graphics framework        | v0.8.0  | ~26K
gogpu/wgpu   | Pure Go WebGPU            | v0.7.0  | ~71K
gogpu/naga   | WGSL shader compiler      | v0.6.0  | ~23K
gogpu/ui     | GUI toolkit (planned)     | -       | -

Total: ~224K lines of Pure Go. No CGO. No Rust. Just go build.


Comparison with Industry

Feature       | gogpu/gg v0.15.0 | vello (Rust) | Skia (C++)
--------------|------------------|--------------|--------------
GPU Compute   | WGSL             | WGSL         | Metal/Vulkan
Language      | Pure Go          | Rust         | C++
CGO Required  | No               | No           | Yes
Hybrid Mode   | Auto             | GPU-first    | Manual
Platforms     | Win/Lin/Mac      | Win/Lin/Mac  | All

To our knowledge, we're the only Pure Go library with GPU compute shaders for 2D graphics.


Entering the Stabilization Phase

With v0.15.0, we're closing the rapid development chapter.

In just three weeks, gogpu/gg went from v0.1.0 to v0.15.0 — from a basic context to GPU compute shaders. We've built the foundation: software rasterizer, GPU backend, text rendering, scene graph, blend modes, and now compute shaders.

What comes next is different. We're shifting focus to:

  • Stability — Bug fixes, edge cases, cross-platform testing
  • Performance — Real benchmarks, profiling, optimization
  • Documentation — API docs, tutorials, examples
  • Polish — Better error messages, developer experience

We Need Your Help!

This is where community testing becomes critical. We've built the architecture — now we need real-world usage to find the rough edges.

How to contribute:

  1. Test it — Try gogpu/gg in your projects, report issues
  2. Benchmark it — Run performance tests on your hardware
  3. Break it — Find edge cases, corner cases, crashes
  4. Document it — Examples, tutorials, use cases

go get github.com/gogpu/gg@v0.15.0

Found a bug? Open an issue.
Have questions? Join discussions.
Want to contribute? PRs welcome!


Roadmap to v1.0

v0.16.0 — v0.19.0 (Stabilization):

  • Comprehensive benchmark suite
  • Performance profiling and optimization
  • Bug fixes from community feedback
  • Documentation and examples
  • GPU gradient shaders
  • Shadow/blur effects

v1.0.0 (Production Ready):

  • Stable, documented API
  • Full test coverage
  • Performance guarantees
  • Migration guide
  • gogpu/ui integration

Acknowledgments

This work was inspired by:

Special thanks to the Go community for the continued support and feedback!


Links


From CPU bottleneck to GPU parallelism. From sequential tessellation to massively parallel compute shaders.

This is what Pure Go can do.

go get github.com/gogpu/gg@v0.15.0

Star the repo if you find it useful!


Part of the GoGPU Journey series:

  1. GoGPU: A Pure Go Graphics Library for GPU Programming
  2. From Idea to 100K Lines in Two Weeks
  3. Building a Shader Compiler in Pure Go
  4. Introducing gogpu/gg v0.14.0
  5. GPU Compute Shaders in Pure Go ← You are here
