Update (December 26, 2025): gogpu/gg v0.15.0 released! GPU compute shaders, vello-style pipeline, significant performance improvements for complex scenes. Part of the 224K LOC Pure Go ecosystem.
https://www.reddit.com/r/golang/comments/1pujj63/go_now_has_its_own_professional_graphics/
GPU Compute Shaders in Pure Go: gogpu/gg v0.15.0
Two days ago, we shipped gogpu/gg v0.14.0 with alpha masks and a fluent PathBuilder. Yes, two days; we're moving fast. But looking at performance profiles, I saw a problem.
The CPU was the bottleneck.
While our GPU could render millions of pixels in milliseconds, the CPU spent significant time tessellating paths. This is the classic 2D graphics problem: CPU tessellation doesn't scale.
So we moved the entire rasterization pipeline to GPU compute shaders.
Today, gogpu/gg v0.15.0 is here: 2,280 lines of WGSL compute shaders, a vello-style pipeline, and large projected speedups for complex scenes (comprehensive benchmarks are still in progress). All in Pure Go.
The Performance Challenge
Let me show you what we were up against:
// Drawing 10,000 circles
ctx := gg.NewContext(800, 600)
for i := 0; i < 10000; i++ {
	ctx.DrawCircle(float64(i%100)*8, float64(i/100)*6, 3)
}
ctx.Fill()
The problem with traditional 2D rendering:
- CPU tessellates paths to triangles (sequential, single-threaded)
- GPU renders triangles (massively parallel)
- CPU becomes the bottleneck for complex scenes
This applies to many traditional 2D libraries when they tessellate on the CPU (Cairo, Skia, even our earlier versions).
The Solution: All-GPU Pipeline
What if we never tessellated on CPU?
Traditional: Path → Tessellate (CPU) → Triangles → GPU
v0.15: Path → Flatten (GPU) → Coarse (GPU) → Fine (GPU)
This is the vello approach. We implemented it in Pure Go with WGSL compute shaders:
1. Flatten Shader — Curves to Segments
Converts Bezier curves to line segments on the GPU:
@compute @workgroup_size(256)
fn flatten_main(
    @builtin(global_invocation_id) id: vec3<u32>
) {
    let curve = curves[id.x];
    // Adaptive subdivision based on curvature
    let segments = subdivide_bezier(curve);
    // Write to global buffer (thousands in parallel!)
    // base and segment_count are computed elsewhere (omitted in this excerpt)
    for (var i = 0u; i < segment_count; i++) {
        output_segments[base + i] = segments[i];
    }
}
Key insight: thousands of GPU invocations process curves simultaneously, while a single-threaded CPU tessellator handles one at a time.
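To make the flattening step concrete, here is a minimal CPU-side sketch in Go of what one shader invocation does for a single quadratic Bezier. It is an illustration under assumptions, not the code in flatten.wgsl: the `Point`, `Segment`, `segmentCount`, and `flattenQuad` names are hypothetical, and it picks a uniform subdivision count from the curve's second difference rather than whatever adaptive scheme the shader actually uses.

```go
package main

import (
	"fmt"
	"math"
)

// Point is a 2D point in pixel space (hypothetical type for this sketch).
type Point struct{ X, Y float64 }

// Segment is a straight line between two points.
type Segment struct{ A, B Point }

// segmentCount returns how many uniform pieces keep a quadratic Bezier within
// tol pixels of its polyline. With n uniform pieces, the deviation is at most
// |p0 - 2*p1 + p2| / (4*n*n), so we solve that bound for n.
func segmentCount(p0, p1, p2 Point, tol float64) int {
	ddx := p0.X - 2*p1.X + p2.X
	ddy := p0.Y - 2*p1.Y + p2.Y
	n := int(math.Ceil(math.Sqrt(math.Hypot(ddx, ddy) / (4 * tol))))
	if n < 1 {
		n = 1 // a straight control polygon still needs one segment
	}
	return n
}

// flattenQuad converts one quadratic Bezier into contiguous line segments.
func flattenQuad(p0, p1, p2 Point, tol float64) []Segment {
	n := segmentCount(p0, p1, p2, tol)
	segs := make([]Segment, 0, n)
	prev := p0
	for i := 1; i <= n; i++ {
		t := float64(i) / float64(n)
		u := 1 - t
		next := Point{
			X: u*u*p0.X + 2*u*t*p1.X + t*t*p2.X,
			Y: u*u*p0.Y + 2*u*t*p1.Y + t*t*p2.Y,
		}
		segs = append(segs, Segment{prev, next})
		prev = next
	}
	return segs
}

func main() {
	segs := flattenQuad(Point{0, 0}, Point{50, 100}, Point{100, 0}, 0.25)
	fmt.Println(len(segs), "segments")
}
```

On the GPU, each invocation would run the body of `flattenQuad` for its own curve, which is why the work scales with the number of available invocations rather than with a single loop.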
2. Coarse Shader — Tile Binning
Maps segments to screen tiles using atomic operations (thread-safe on GPU!):
@compute @workgroup_size(256)
fn coarse_main(
    @builtin(global_invocation_id) id: vec3<u32>
) {
    let segment = segments[id.x];
    let bounds = segment_bounds(segment);
    // Find overlapping tiles (tile_min and tile_max are derived from bounds; omitted in this excerpt)
    for (var y = tile_min.y; y <= tile_max.y; y++) {
        for (var x = tile_min.x; x <= tile_max.x; x++) {
            let tile_id = y * tile_width + x;
            // Atomic append (safe across threads!)
            let index = atomicAdd(&tile_counts[tile_id], 1u);
            tile_segments[tile_id][index] = id.x;
        }
    }
}
This is where GPU shines — thousands of threads binning segments in parallel, atomics keeping data consistent.
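The same atomic-append pattern can be sketched on the CPU with goroutines and `sync/atomic`. This is an analogy only, not the shader's implementation: the tile size of 16, the 64-slot cap per tile, and the `Bins`, `tileIndex`, and `BinSegments` names are all assumptions made for this example. One difference worth knowing: WGSL's `atomicAdd` returns the old value, while Go's `atomic.AddUint32` returns the new one, hence the `- 1`.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

const (
	tileSize   = 16 // pixels per tile side (assumed for this sketch)
	maxPerTile = 64 // slots reserved per tile (assumed for this sketch)
)

// Segment is a line segment in pixel coordinates.
type Segment struct{ X0, Y0, X1, Y1 float64 }

// Bins holds one atomic counter and a fixed block of slots per tile.
type Bins struct {
	TilesX, TilesY int
	Counts         []uint32
	Slots          []uint32 // tile t owns Slots[t*maxPerTile : (t+1)*maxPerTile]
}

// tileIndex maps a pixel coordinate to a tile coordinate clamped to [0, limit).
func tileIndex(v float64, limit int) int {
	t := int(math.Floor(v / tileSize))
	if t < 0 {
		return 0
	}
	if t >= limit {
		return limit - 1
	}
	return t
}

// BinSegments records each segment's index in every tile its bounding box
// overlaps. One goroutine per segment stands in for one GPU invocation.
// width and height are assumed to be positive.
func BinSegments(segs []Segment, width, height int) *Bins {
	b := &Bins{
		TilesX: (width + tileSize - 1) / tileSize,
		TilesY: (height + tileSize - 1) / tileSize,
	}
	b.Counts = make([]uint32, b.TilesX*b.TilesY)
	b.Slots = make([]uint32, len(b.Counts)*maxPerTile)

	var wg sync.WaitGroup
	for i, s := range segs {
		wg.Add(1)
		go func(id uint32, s Segment) {
			defer wg.Done()
			x0 := tileIndex(math.Min(s.X0, s.X1), b.TilesX)
			x1 := tileIndex(math.Max(s.X0, s.X1), b.TilesX)
			y0 := tileIndex(math.Min(s.Y0, s.Y1), b.TilesY)
			y1 := tileIndex(math.Max(s.Y0, s.Y1), b.TilesY)
			for y := y0; y <= y1; y++ {
				for x := x0; x <= x1; x++ {
					tile := y*b.TilesX + x
					// Reserve a unique slot: AddUint32 returns the new count.
					idx := atomic.AddUint32(&b.Counts[tile], 1) - 1
					if idx < maxPerTile { // drop the write if the tile is full
						b.Slots[tile*maxPerTile+int(idx)] = id
					}
				}
			}
		}(uint32(i), s)
	}
	wg.Wait()
	return b
}

func main() {
	segs := []Segment{{0, 0, 40, 0}, {20, 5, 20, 30}}
	b := BinSegments(segs, 64, 32)
	fmt.Println(b.Counts)
}
```

The order of indices inside a tile depends on scheduling, so anything that consumes the bins should not rely on it. The capacity check matters too: without it, a crowded tile would write past its block of slots.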
3. Fine Shader — Per-Pixel Coverage
Computes anti-aliased coverage analytically (no MSAA needed!):
@compute @workgroup_size(8, 8)
fn fine_main(
    @builtin(global_invocation_id) pixel: vec3<u32>
) {
    let tile_id = (pixel.y / TILE_SIZE) * tile_width + (pixel.x / TILE_SIZE);
    var coverage = 0.0;
    for (var i = 0u; i < tile_counts[tile_id]; i++) {
        let seg = segments[tile_segments[tile_id][i]];
        coverage += compute_segment_coverage(seg, pixel.xy);
    }
    // Apply fill rule, write pixel
    output[pixel.xy] = vec4<f32>(color.rgb, saturate(coverage));
}
Perfect anti-aliasing at any scale. No jaggies, no MSAA overhead.
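The excerpt leaves "apply fill rule" as a comment, so here is a small Go sketch of how accumulated signed coverage is commonly turned into alpha in area-based rasterizers. Whether fine.wgsl uses exactly these formulas is an assumption, and the function names `nonZeroAlpha` and `evenOddAlpha` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// nonZeroAlpha treats any non-zero winding as inside: take the magnitude of
// the accumulated coverage and clamp it to 1.
func nonZeroAlpha(cov float64) float64 {
	return math.Min(math.Abs(cov), 1)
}

// evenOddAlpha folds the accumulated coverage into a triangle wave with
// period 2, so odd windings are inside and even windings are outside.
func evenOddAlpha(cov float64) float64 {
	return math.Abs(cov - 2*math.Round(0.5*cov))
}

func main() {
	for _, c := range []float64{0.5, 1, 1.5, 2, -1} {
		fmt.Printf("coverage %4.1f  non-zero %.2f  even-odd %.2f\n",
			c, nonZeroAlpha(c), evenOddAlpha(c))
	}
}
```

The two rules agree for coverage between 0 and 1 and diverge once shapes overlap: a pixel covered twice is fully opaque under non-zero and fully transparent under even-odd.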
Expected Performance Gains
Based on our architecture and GPU parallelism theory, we expect significant improvements for complex scenes:
| Workload | Expected Behavior |
|---|---|
| Simple paths (< 100 segments) | CPU may be faster (GPU upload overhead) |
| Medium complexity (100-1K segments) | GPU starts to win |
| Complex scenes (1K-10K segments) | GPU significantly faster (10-30x estimated) |
| Very complex (10K+ segments) | GPU dominates (30-60x estimated) |
These are architectural projections. Actual performance varies by hardware, driver, and scene complexity. We're working on comprehensive benchmarks.
The larger the scene, the bigger the advantage. This is parallelism at work.
Hybrid Architecture: Best of Both Worlds
Here's the clever part — we don't always use GPU:
// backend/wgpu/sparse_strips_gpu.go
type HybridFineRasterizer struct {
	mu               sync.RWMutex
	gpu              *GPUFineRasterizer // nil if unavailable
	cpu              *FineRasterizer    // always available
	segmentThreshold int                // min segments for GPU
	gpuAvailable     bool
}

func (h *HybridFineRasterizer) Rasterize(
	coarse *CoarseRasterizer,
	segments *SegmentList,
	backdrop []int32,
) {
	// GPU worthwhile only above threshold
	if h.gpuAvailable && segments.Len() >= h.segmentThreshold {
		// GPU path: dispatch compute shaders
		h.gpu.Rasterize(coarse, segments, backdrop, scene.FillNonZero)
	} else {
		// CPU path: software rasterization
		h.cpu.RasterizeSegments(segments, backdrop)
	}
}
Why? For small paths (fewer than 100 segments by default), GPU dispatch overhead outweighs the parallel speedup, so the CPU wins for simple cases.
Result: Automatic selection based on segment count.
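One way to reason about where such a threshold should sit is a simple linear cost model: the CPU pays a per-segment cost, while the GPU pays a fixed dispatch cost plus a smaller per-segment cost. The sketch below is a back-of-the-envelope aid, not how gogpu/gg picks its default. The function name `crossoverSegments` is hypothetical, and the numbers in `main` are invented so that the result lands on 100; they are not measurements.

```go
package main

import (
	"fmt"
	"math"
)

// crossoverSegments returns the smallest segment count at which the GPU path
// is no slower than the CPU path under a linear cost model:
//
//	cpuTime(n) = n * cpuPerSeg
//	gpuTime(n) = fixedOverhead + n * gpuPerSeg
//
// It returns -1 when the GPU never catches up.
func crossoverSegments(fixedOverhead, cpuPerSeg, gpuPerSeg float64) int {
	if cpuPerSeg <= gpuPerSeg {
		return -1
	}
	return int(math.Ceil(fixedOverhead / (cpuPerSeg - gpuPerSeg)))
}

func main() {
	// Illustrative numbers only (nanoseconds), chosen to give a threshold of 100.
	fmt.Println(crossoverSegments(50000, 520, 20))
}
```

In practice the three inputs would come from profiling on the target hardware, which is why a configurable threshold is more useful than a hard-coded one.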
Building This Was Hard
Challenge 1: WGSL Compilation
WGSL must be compiled to SPIR-V for Vulkan and to MSL for Metal. We use our own gogpu/naga compiler (~23K LOC Pure Go):
// Simple compilation to SPIR-V
spirvBytes, err := naga.Compile(wgslSource)
if err != nil {
	log.Fatal(err)
}
// With options (the result is not named spirv, which would shadow the spirv package)
spirvBytes, err = naga.CompileWithOptions(wgslSource, naga.CompileOptions{
	SPIRVVersion: spirv.Version1_3,
	Debug:        true,
	Validate:     true,
})
Supports vertex, fragment, compute shaders. Atomics, barriers, texture sampling. Pure Go.
Challenge 2: GPU Memory Management
Buffer sizing is critical:
// Too small? GPU crash
// Too large? Wasted memory
maxSegments := pathComplexity * 4 // Conservative estimate
segmentBuffer := device.CreateBuffer(maxSegments * 24) // 24 bytes/segment
We use conservative estimates with safety margins. Real-world paths rarely exceed our allocations.
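Here is a minimal sketch of that sizing rule with an explicit upper bound, so an oversized request fails early on the CPU instead of on the device. The 4x multiplier and 24 bytes per segment come from the snippet above; the `segmentBufferSize` helper and its `maxBytes` parameter are hypothetical and stand in for whatever limit the device reports (WebGPU's default `maxStorageBufferBindingSize` is 128 MiB, used here only as an example value).

```go
package main

import (
	"errors"
	"fmt"
)

const (
	segmentBytes       = 24 // bytes per segment, as stated above
	segmentsPerElement = 4  // the "pathComplexity * 4" estimate from above
)

var errBufferTooLarge = errors.New("segment buffer exceeds device limit")

// segmentBufferSize returns the byte size of a segment buffer for the given
// path complexity, or an error if that size would exceed maxBytes.
func segmentBufferSize(pathComplexity int, maxBytes uint64) (uint64, error) {
	if pathComplexity < 0 {
		return 0, errors.New("negative path complexity")
	}
	const bytesPerElement = segmentsPerElement * segmentBytes
	// Compare before multiplying so the check itself cannot overflow.
	if uint64(pathComplexity) > maxBytes/bytesPerElement {
		return 0, errBufferTooLarge
	}
	return uint64(pathComplexity) * bytesPerElement, nil
}

func main() {
	size, err := segmentBufferSize(10000, 128<<20)
	fmt.Println(size, err)
}
```

A caller that gets `errBufferTooLarge` could split the scene into batches or fall back to the CPU path rather than attempt the allocation.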
Challenge 3: Atomic Operations
WGSL atomics are strict:
- Only atomic<u32> and atomic<i32> are supported
- Specific memory orders required
- Buffer layout matters
// This works
@group(0) @binding(1) var<storage, read_write> counts: array<atomic<u32>>;
atomicAdd(&counts[i], 1u);
// This doesn't
var<storage, read_write> counts: array<u32>; // Not atomic!
Debugging this was tricky. WGSL validation errors are... cryptic.
What We Shipped
Statistics:
- 2,280 LOC WGSL shaders (8 shader files)
- ~20K LOC Go in backend/wgpu/
- 74% test coverage overall
- 0 linter issues
Shader Files:
backend/wgpu/shaders/
├── flatten.wgsl # 589 LOC — Bezier curve flattening
├── coarse.wgsl # 335 LOC — Tile binning with atomics
├── fine.wgsl # 290 LOC — Per-pixel coverage
├── blend.wgsl # 424 LOC — 29 blend modes on GPU
├── composite.wgsl # 235 LOC — Layer compositing
├── strip.wgsl # 155 LOC — Sparse strip rendering
├── blit.wgsl # 43 LOC — Final output blit
└── msdf_text.wgsl # 209 LOC — MSDF text rendering
Go Implementation:
backend/wgpu/
├── gpu_flatten.go # 809 LOC — Flatten pipeline
├── gpu_coarse.go # 698 LOC — Coarse rasterization
├── gpu_fine.go # 752 LOC — Fine rasterization
├── sparse_strips_gpu.go # 837 LOC — Hybrid CPU/GPU selection
├── renderer.go # 822 LOC — Main renderer
├── pipeline.go # 369 LOC — Pipeline orchestration
├── memory.go # 413 LOC — GPU memory management
└── ... (40+ files total)
Try It Yourself
Installation
go get github.com/gogpu/gg@v0.15.0
Quick Example
package main

import "github.com/gogpu/gg"

func main() {
	ctx := gg.NewContext(512, 512)
	ctx.ClearWithColor(gg.White)
	// 1000 circles: the GPU backend handles complex scenes efficiently
	ctx.SetColor(gg.Hex("#e74c3c"))
	for i := 0; i < 1000; i++ {
		ctx.DrawCircle(float64(i%32)*16, float64(i/32)*16, 8)
	}
	ctx.Fill()
	ctx.SavePNG("gpu_circles.png")
}
Backend Selection
The library automatically selects the best backend. GPU is used by default when available:
# Default: GPU enabled (uses Vulkan/Metal when available)
go build ./...
# Disable GPU (force software-only)
go build -tags nogpu ./...
The GoGPU Ecosystem
With v0.15.0, the GoGPU ecosystem now totals:
| Project | Description | Version | LOC |
|---|---|---|---|
| gogpu/gg | 2D graphics + GPU compute | v0.15.0 | ~104K |
| gogpu/gogpu | Graphics framework | v0.8.0 | ~26K |
| gogpu/wgpu | Pure Go WebGPU | v0.7.0 | ~71K |
| gogpu/naga | WGSL shader compiler | v0.6.0 | ~23K |
| gogpu/ui | GUI toolkit (planned) | — | — |
Total: ~224K lines of Pure Go. No CGO. No Rust. Just go build.
Comparison with Industry
| Feature | gogpu/gg v0.15.0 | vello (Rust) | Skia (C++) |
|---|---|---|---|
| GPU Compute | WGSL | WGSL | Metal/Vulkan |
| Language | Pure Go | Rust | C++ |
| CGO Required | No | No | Yes |
| Hybrid Mode | Auto | GPU-first | Manual |
| Platforms | Win/Lin/Mac | Win/Lin/Mac | All |
As far as we know, we're the only Pure Go library with GPU compute shaders for 2D graphics.
Entering the Stabilization Phase
With v0.15.0, we're closing the rapid development chapter.
In just three weeks, gogpu/gg went from v0.1.0 to v0.15.0 — from a basic context to GPU compute shaders. We've built the foundation: software rasterizer, GPU backend, text rendering, scene graph, blend modes, and now compute shaders.
What comes next is different. We're shifting focus to:
- Stability — Bug fixes, edge cases, cross-platform testing
- Performance — Real benchmarks, profiling, optimization
- Documentation — API docs, tutorials, examples
- Polish — Better error messages, developer experience
We Need Your Help!
This is where community testing becomes critical. We've built the architecture — now we need real-world usage to find the rough edges.
How to contribute:
- Test it — Try gogpu/gg in your projects, report issues
- Benchmark it — Run performance tests on your hardware
- Break it — Find edge cases, corner cases, crashes
- Document it — Examples, tutorials, use cases
go get github.com/gogpu/gg@v0.15.0
Found a bug? Open an issue.
Have questions? Join discussions.
Want to contribute? PRs welcome!
Roadmap to v1.0
v0.16.0 — v0.19.0 (Stabilization):
- Comprehensive benchmark suite
- Performance profiling and optimization
- Bug fixes from community feedback
- Documentation and examples
- GPU gradient shaders
- Shadow/blur effects
v1.0.0 (Production Ready):
- Stable, documented API
- Full test coverage
- Performance guarantees
- Migration guide
- gogpu/ui integration
Acknowledgments
This work was inspired by:
- vello — Modern GPU 2D renderer in Rust
- piet-gpu — GPU compute experiments
- pathfinder — GPU path rendering
Special thanks to the Go community for the continued support and feedback!
Links
- GitHub: https://github.com/gogpu/gg
- Release: v0.15.0
- GoGPU Organization: https://github.com/gogpu
- Discussions: Join the conversation
From CPU bottleneck to GPU parallelism. From sequential tessellation to massively parallel compute shaders.
This is what Pure Go can do.
go get github.com/gogpu/gg@v0.15.0
Star the repo if you find it useful!
Part of the GoGPU Journey series:
- GoGPU: A Pure Go Graphics Library for GPU Programming
- From Idea to 100K Lines in Two Weeks
- Building a Shader Compiler in Pure Go
- Introducing gogpu/gg v0.14.0
- GPU Compute Shaders in Pure Go ← You are here