TL;DR: Born v0.8.0 replaces go-webgpu (Rust FFI + shared libraries) with gogpu/wgpu — pure Go WebGPU. No .dll. No .so. No runtime downloads. go build now gives you a GPU-accelerated ML binary. We also fixed 5 critical GPU bugs and validated on real model training. Next up: DeepSeek V4 inference support.
The Last Dependency
Five months ago I skipped my birthday to release Born. A few weeks later we made training 18x faster with lazy GPU evaluation. The framework was growing. Contributors were showing up. Real people were using it.
But there was a problem I couldn't ignore anymore.
Every time someone wanted to use GPU acceleration, the conversation went like this:
"How do I run the GPU examples?"
"Download the wgpu-native .dll for your platform, put it in your PATH..."
"...I thought you said pure Go?"
They were right. Born's CPU path was pure Go. But the GPU backend used go-webgpu — Go bindings to Rust's wgpu-native via FFI. You needed a platform-specific shared library at runtime. On Windows, a .dll. On Linux, a .so. On macOS, a .dylib.
For a framework whose tagline is "single binary deployment", that was embarrassing.
So we fixed it.
Why Not Earlier?
Fair question. gogpu/wgpu existed for months before v0.8.0. Why did we ship 29 releases on go-webgpu first?
Because that was the plan.
go-webgpu wraps Rust's wgpu-native — a battle-tested GPU abstraction used by Firefox and dozens of production projects. When you're building a new ML framework from scratch, you don't want to debug your GPU backend and your tensor math at the same time. If training produces wrong gradients, is the bug in your autodiff engine or in your WebGPU implementation? With Rust wgpu-native underneath, we knew: the GPU layer works. Any bug is ours.
So we built Born v0.1 through v0.7 on a proven foundation. Tensor ops, autodiff, attention, Flash Attention, speculative decoding, ONNX import, GGUF loading — all validated against a GPU backend we could trust. By v0.7.16, Born had 1,394 tests, 3 external contributors, and real model training working.
Meanwhile, gogpu/wgpu was maturing through its own path — powering gogpu/gg (2D graphics library with GPU compute shaders), running real rendering workloads, stabilizing the Core API across Vulkan, Metal, DX12, and GLES.
When both sides were proven, the migration became simple: we knew Born's code was correct, and we knew gogpu/wgpu's Core API was stable. Any bug found during migration was specifically a wgpu Go integration issue — easy to isolate, easy to fix.
That's exactly what happened. Five bugs, all in resource lifecycle. All fixed in days, not weeks.
Validate on a proven foundation first. Swap the foundation second. That's not how you move fast. It's how you move right.
The Migration
Born v0.8.0 replaces go-webgpu with gogpu/wgpu — a pure Go WebGPU implementation from our own GoGPU ecosystem.
- github.com/go-webgpu/webgpu v0.4.1
+ github.com/gogpu/wgpu v0.26.8
One line in go.mod. 27 files changed. 1,830 additions, 1,518 deletions.
What changed:
| | go-webgpu (before) | gogpu/wgpu (after) |
|---|---|---|
| Implementation | Rust wgpu-native via FFI | Pure Go |
| CGO | None (goffi) | None |
| Runtime .dll/.so | Required | None |
| Build | go build + download .dll | go build. Period. |
| Vulkan/Metal/DX12 | Via Rust | Via Go |
| WGSL shaders | Unchanged | Unchanged |
| Control | External project | Our project |
That last row matters. gogpu/wgpu isn't some random dependency — it's our project. When Born needs a WebGPU API change, we change it upstream. Both sides of the interface are under our control.
Five Bugs Nobody Told Us About
Swapping the GPU backend is like replacing a car engine while driving. Everything looks the same from the outside, but internally the timing, resource lifecycle, and synchronization are completely different.
We found five critical bugs during migration:
1. PipelineLayout Freed Too Early
Vulkan requires compute pipeline layouts to stay alive during SetBindGroup(). go-webgpu's internal reference counting kept them alive. gogpu/wgpu doesn't — you own your resources.
We fixed this by storing PipelineLayout alongside the pipeline in our cache.
2. Lazy Ops and the Destroy Queue
Born uses lazy evaluation — GPU ops chain without CPU sync. But when a tensor gets garbage-collected mid-chain, its buffer goes to the destroy queue. If the pending operations haven't submitted yet, the buffer is destroyed before the GPU reads it.
Fix: immediate submit for lazy ops. Every operation submits its command encoder before returning.
3. Buffer Copy Race
copyGPUBuffer (used by Data() to read results back to CPU) was queuing the copy but not submitting. The next operation might overwrite the source buffer before the copy executed.
Fix: immediate submit after copy.
4. GC vs GPU
Go's garbage collector doesn't know about GPU resources. A runtime.SetFinalizer on a tensor could fire while the GPU was still computing with that tensor's buffer.
Fix: runtime.KeepAlive() guards around every GPU operation that uses the tensor.
5. Device Cleanup Order
When destroying the GPU device, all pending work must complete first. Without Poll(PollWait) before resource destruction, Vulkan validation layers scream.
Fix: explicit Poll(PollWait) in Release() to ensure GPU idle.
None of these bugs existed with go-webgpu. They're all about resource lifecycle differences between Rust's ownership model (where wgpu-native tracks everything for you) and Go's GC-based model (where you track it yourself).
After fixing all five, we ran all GPU tests and a 20-epoch model training with zero crashes.
What You Get
True Single Binary
go build -o myapp ./cmd/myapp
# That's it. Ship the binary. GPU works.
No .dll downloads. No LD_LIBRARY_PATH. No platform-specific install steps. The binary works on any machine with a Vulkan-capable GPU.
Same API, Same Shaders
If you have existing Born code with GPU, nothing changes:
import (
    "github.com/born-ml/born/autodiff"
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/backend/webgpu"
)

// CPU-only (always worked)
backend := autodiff.New(cpu.New())

// GPU-accelerated (now pure Go!)
if webgpu.IsAvailable() {
    gpu, _ := webgpu.New()
    backend = autodiff.New(gpu)
    defer gpu.Release()
}
WGSL shaders are unchanged. The Backend interface (52 methods) is unchanged. Your code just works — minus the .dll.
Validated on Real Training
We didn't just run unit tests. We trained a real Hierarchical Reasoning Model (HRM) for 20 epochs on GPU. Zero crashes. Correct gradients. Same accuracy as go-webgpu.
The Numbers
| Metric | Value |
|---|---|
| Go source | ~47K LOC |
| Tests | ~34K LOC, 1,394 test functions |
| ONNX operators | 49 |
| Backend methods | 52 |
| GPU tests | 105 |
| Contributors | 4 (@kolkov, @gmohmad, @bennibbelink, @jsully1720) |
| Releases | 30 |
| Stars | 80 (organic, no marketing) |
Community
v0.8.0 isn't just about the migration. Since v0.7.0, three external contributors have landed real code:
- @jsully1720 — ONNX Equal operator
- @bennibbelink — Erf, Sign/Abs, Clamp ops (3 PRs, all full vertical slices: backend → CPU → GPU → autodiff → tests)
- @gmohmad — LayerNorm, BatchMatMul broadcasting, Squeeze fix, 9 new ONNX ops, inplace mutation bug fix (5 PRs)
These aren't drive-by typo fixes. These are production-quality contributions from people who studied the codebase and followed the patterns. If you're considering contributing, look at what they did — that's the bar.
What's Next: DeepSeek V4 Inference
With the GPU backend stable and pure Go, we can focus on what matters: running real models.
DeepSeek released V4 on April 24, 2026 — two models:
- V4-Pro: 1.6 trillion params, 49B active
- V4-Flash: 284B total, 13B active — fits on a consumer GPU
V4-Flash with 13B active parameters is Born's sweet spot. It's the most capable open model that fits on a single 24GB GPU. API pricing is tied to chip availability ($1.74/M tokens, bottleneck pricing) — users want local inference alternatives.
We started researching V4 architecture before it launched — back in early April, when only the Engram paper and V3.2 sparse attention existed. We predicted V4 would combine MoE + Engram + manifold-constrained residuals + compressed sparse attention. On April 24th, the tech report confirmed all four. Two weeks head start on architecture analysis. (We do this kind of research openly — see Discussion #60 for our Recurrent-Depth Transformer analysis.)
Here's the full component breakdown:
| Component | What | Why It Matters |
|---|---|---|
| MoE Routing | Top-16 sparse expert selection | Also unlocks Mixtral, DBRX |
| MXFP4 Dequantization | FP4 expert weights with block scaling | V4's native format — not INT4 GPTQ |
| Engram | O(1) hash-lookup factual memory | Unique to DeepSeek, DRAM-resident |
| Three-Pool Attention | SWA + C4 + C128 compression | 1M context with <10% throughput drop |
| Hyper-Connections (mHC) | 4D manifold-constrained residual | Every transformer layer uses this |
| MTP Drafting | Integrated speculative decoding | ~2.5 tokens accepted per step |
| KV Cache Tiering | CPU-GPU cache with LRU eviction | 128K+ context on 24GB consumer GPU |
| PD-Disaggregation | Prefill/Decode split serving | Production throughput scaling |
Total estimate: 22-30 weeks. It's a lot. But MoE routing alone unlocks V4, Mixtral, and BAR (Allen AI's modular post-training). Each component is independently valuable.
The GoGPU Ecosystem
Born's GPU backend is powered by the GoGPU ecosystem — pure Go GPU infrastructure:
| Project | What | LOC |
|---|---|---|
| gogpu/gg | 2D graphics with GPU compute shaders | ~222K |
| gogpu/naga | Shader compiler (WGSL → SPIR-V, MSL, HLSL, GLSL, DXIL) | ~199K |
| gogpu/wgpu | Pure Go WebGPU (Vulkan, Metal, DX12, GLES, Software) | ~156K |
| gogpu/gogpu | Graphics framework + windowing | ~52K |
Combined with Born's ~81K LOC, that's 710K+ lines of pure Go GPU code. No CGO. No Rust. Just go build.
Try It
git clone https://github.com/born-ml/born
cd born
go build ./...
go test ./... -short
Run the examples:
cd examples/mnist && go run . # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run . # CNN: 98.18% accuracy
cd examples/mnist-gpu && go run . # GPU-accelerated inference
GPU examples now work with go run — no .dll download step.
Build This With Us
Born is at an inflection point. GPU is stable. The architecture is proven. The roadmap to DeepSeek V4 is clear.
We're not looking for passive users. We're looking for people who want to help build one of the best ML frameworks in the world. In Go.
How you can make a difference:
File issues. Found a bug? A missing operator? An edge case that breaks your model? Every issue makes Born more production-ready. Our three external contributors started exactly this way.
Send PRs. Missing tensor ops (TopK, Scatter — needed for MoE), CPU optimizations (the inner loops are naive — lots of low-hanging fruit), new ONNX operators, quantization infrastructure. Look at what @bennibbelink and @gmohmad have done — full vertical slices, production quality. That's the standard.
Bring breakthrough ideas. The hardest problems ahead — MoE routing, FP4 dequantization, compressed sparse attention, CPU-GPU cache tiering — are open research questions in Go. If you have insights on how to make these work efficiently in pure Go, we want to hear them.
Challenge our assumptions. Tell us what we're doing wrong. Tell us what's missing. The best frameworks are shaped by people who care enough to argue.
Found a bug? Open an issue
Have a big idea? Feature Requests & Roadmap Discussion
Questions? Getting Started & FAQ
Ready to code? Contributing Guide
Links
| Resource | Link |
|---|---|
| GitHub | github.com/born-ml/born |
| v0.8.0 Release | Release Notes |
| Documentation | pkg.go.dev/github.com/born-ml/born |
| Roadmap | ROADMAP.md |
| Changelog | CHANGELOG.md |
| GoGPU | github.com/gogpu |
Five months ago, Born was a birthday project with zero stars. Today it's a pure Go ML framework with GPU acceleration, 4 contributors, 49 ONNX operators, and a roadmap to run DeepSeek V4.
No .dll. No .so. No excuses. Models are born production-ready.
go build. Ship. Done.
Star us on GitHub: github.com/born-ml/born