TL;DR: Born v0.8.0 replaces go-webgpu (Rust FFI + shared libraries) with gogpu/wgpu — pure Go WebGPU. No .dll. No .so. No runtime downloads. go build now gives you a GPU-accelerated ML binary. We also fixed 5 critical GPU bugs and validated on real model training. Next up: DeepSeek V4 inference support.
The Last Dependency
Five months ago I skipped my birthday to release Born. A few weeks later we made training 18x faster with lazy GPU evaluation. The framework was growing. Contributors were showing up. Real people were using it.
But there was a problem I couldn't ignore anymore.
Every time someone wanted to use GPU acceleration, the conversation went like this:
"How do I run the GPU examples?"
"Download the wgpu-native .dll for your platform, put it in your PATH..."
"...I thought you said pure Go?"
They were right. Born's CPU path was pure Go. But the GPU backend used go-webgpu — Go bindings to Rust's wgpu-native via FFI. You needed a platform-specific shared library at runtime. On Windows, a .dll. On Linux, a .so. On macOS, a .dylib.
For a framework whose tagline is "single binary deployment", that was embarrassing.
So we fixed it.
Why Not Earlier?
Fair question. gogpu/wgpu existed for months before v0.8.0. Why did we ship 29 releases on go-webgpu first?
Because that was the plan.
go-webgpu wraps Rust's wgpu-native — a battle-tested GPU abstraction used by Firefox and dozens of production projects. When you're building a new ML framework from scratch, you don't want to debug your GPU backend and your tensor math at the same time. If training produces wrong gradients, is the bug in your autodiff engine or in your WebGPU implementation? With Rust wgpu-native underneath, we knew: the GPU layer works. Any bug is ours.
So we built Born v0.1 through v0.7 on a proven foundation. Tensor ops, autodiff, attention, Flash Attention, speculative decoding, ONNX import, GGUF loading — all validated against a GPU backend we could trust. By v0.7.16, Born had 1,394 tests, 3 external contributors, and real model training working.
Meanwhile, gogpu/wgpu was maturing through its own path — powering gogpu/gg (2D graphics library with GPU compute shaders), running real rendering workloads, stabilizing the Core API across Vulkan, Metal, DX12, and GLES.
When both sides were proven, the migration became simple: we knew Born's code was correct, and we knew gogpu/wgpu's Core API was stable. Any bug found during migration was specifically a wgpu Go integration issue — easy to isolate, easy to fix.
That's exactly what happened. Five bugs, all in resource lifecycle. All fixed in days, not weeks.
Validate on a proven foundation first. Swap the foundation second. That's not how you move fast. It's how you move right.
The Migration
Born v0.8.0 replaces go-webgpu with gogpu/wgpu — a pure Go WebGPU implementation from our own GoGPU ecosystem.
- github.com/go-webgpu/webgpu v0.4.1
+ github.com/gogpu/wgpu v0.26.8
One line in go.mod. 27 files changed. 1,830 additions, 1,518 deletions.
What changed:
| | go-webgpu (before) | gogpu/wgpu (after) |
|---|---|---|
| Implementation | Rust wgpu-native via FFI | Pure Go |
| CGO | None (goffi) | None |
| Runtime .dll/.so | Required | None |
| Build | go build + download .dll | go build. Period. |
| Vulkan/Metal/DX12 | Via Rust | Via Go |
| WGSL shaders | Unchanged | Unchanged |
| Control | External project | Our project |
That last row matters. gogpu/wgpu isn't some random dependency — it's our project. When Born needs a WebGPU API change, we change it upstream. Both sides of the interface are under our control.
Five Bugs Nobody Told Us About
Swapping the GPU backend is like replacing a car engine while driving. Everything looks the same from the outside, but internally the timing, resource lifecycle, and synchronization are completely different.
We found five critical bugs during migration:
1. PipelineLayout Freed Too Early
Vulkan requires compute pipeline layouts to stay alive during SetBindGroup(). go-webgpu's internal reference counting kept them alive. gogpu/wgpu doesn't — you own your resources.
We fixed this by storing PipelineLayout alongside the pipeline in our cache.
2. Lazy Ops and the Destroy Queue
Born uses lazy evaluation — GPU ops chain without CPU sync. But when a tensor gets garbage-collected mid-chain, its buffer goes to the destroy queue. If the pending operations haven't submitted yet, the buffer is destroyed before the GPU reads it.
Fix: immediate submit for lazy ops. Every operation submits its command encoder before returning.
3. Buffer Copy Race
copyGPUBuffer (used by Data() to read results back to CPU) was queuing the copy but not submitting. The next operation might overwrite the source buffer before the copy executed.
Fix: immediate submit after copy.
4. GC vs GPU
Go's garbage collector doesn't know about GPU resources. A runtime.SetFinalizer on a tensor could fire while the GPU was still computing with that tensor's buffer.
Fix: runtime.KeepAlive() guards around every GPU operation that uses the tensor.
5. Device Cleanup Order
When destroying the GPU device, all pending work must complete first. Without Poll(PollWait) before resource destruction, Vulkan validation layers scream.
Fix: explicit Poll(PollWait) in Release() to ensure GPU idle.
None of these bugs existed with go-webgpu. They're all about resource lifecycle differences between Rust's ownership model (where wgpu-native tracks everything for you) and Go's GC-based model (where you track it yourself).
After fixing all five, we ran all GPU tests and a 20-epoch model training with zero crashes.
What You Get
True Single Binary
go build -o myapp ./cmd/myapp
# That's it. Ship the binary. GPU works.
No .dll downloads. No LD_LIBRARY_PATH. No platform-specific install steps. The binary works on any machine with a Vulkan-capable GPU.
Same API, Same Shaders
If you have existing Born code with GPU, nothing changes:
import (
    "github.com/born-ml/born/autodiff"
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/backend/webgpu"
)

// CPU-only (always worked)
backend := autodiff.New(cpu.New())

// GPU-accelerated (now pure Go!)
if webgpu.IsAvailable() {
    gpu, _ := webgpu.New()
    backend = autodiff.New(gpu)
    defer gpu.Release()
}
WGSL shaders are unchanged. The Backend interface (52 methods) is unchanged. Your code just works — minus the .dll.
Validated on Real Training
We didn't just run unit tests. We trained a real Hierarchical Reasoning Model (HRM) for 20 epochs on GPU. Zero crashes. Correct gradients. Same accuracy as go-webgpu.
The Numbers
| Metric | Value |
|---|---|
| Go source | ~47K LOC |
| Tests | ~34K LOC, 1,394 test functions |
| ONNX operators | 49 |
| Backend methods | 52 |
| GPU tests | 105 |
| Contributors | 4 (@kolkov, @gmohmad, @bennibbelink, @jsully1720) |
| Releases | 30 |
| Stars | 80 (organic, no marketing) |
Community
v0.8.0 isn't just about the migration. Since v0.7.0, three external contributors have landed real code:
- @jsully1720 — ONNX Equal operator
- @bennibbelink — Erf, Sign/Abs, Clamp ops (3 PRs, all full vertical slices: backend → CPU → GPU → autodiff → tests)
- @gmohmad — LayerNorm, BatchMatMul broadcasting, Squeeze fix, 9 new ONNX ops, inplace mutation bug fix (5 PRs)
These aren't drive-by typo fixes. These are production-quality contributions from people who studied the codebase and followed the patterns. If you're considering contributing, look at what they did — that's the bar.
What's Next: DeepSeek V4 Inference
With the GPU backend stable and pure Go, we can focus on what matters: running real models.
DeepSeek released V4 on April 24, 2026 — two models:
- V4-Pro: 1.6 trillion params, 49B active
- V4-Flash: 284B total, 13B active — fits on a consumer GPU
V4-Flash with 13B active parameters is Born's sweet spot. It's the most capable open model that fits on a single 24GB GPU. API pricing is tied to chip availability ($1.74/M tokens, bottleneck pricing) — users want local inference alternatives.
We started researching V4 architecture before it launched — back in early April, when only the Engram paper and V3.2 sparse attention existed. We predicted V4 would combine MoE + Engram + manifold-constrained residuals + compressed sparse attention. On April 24th, the tech report confirmed all four. Two weeks head start on architecture analysis. (We do this kind of research openly — see Discussion #60 for our Recurrent-Depth Transformer analysis.)
Here's the full component breakdown:
| Component | What | Why It Matters |
|---|---|---|
| MoE Routing | Top-16 sparse expert selection | Also unlocks Mixtral, DBRX |
| MXFP4 Dequantization | FP4 expert weights with block scaling | V4's native format — not INT4 GPTQ |
| Engram | O(1) hash-lookup factual memory | Unique to DeepSeek, DRAM-resident |
| Three-Pool Attention | SWA + C4 + C128 compression | 1M context with <10% throughput drop |
| Hyper-Connections (mHC) | 4D manifold-constrained residual | Every transformer layer uses this |
| MTP Drafting | Integrated speculative decoding | ~2.5 tokens accepted per step |
| KV Cache Tiering | CPU-GPU cache with LRU eviction | 128K+ context on 24GB consumer GPU |
| PD-Disaggregation | Prefill/Decode split serving | Production throughput scaling |
Total estimate: 22-30 weeks. It's a lot. But MoE routing alone unlocks V4, Mixtral, and BAR (Allen AI's modular post-training). Each component is independently valuable.
The GoGPU Ecosystem
Born's GPU backend is powered by the GoGPU ecosystem — pure Go GPU infrastructure:
| Project | What | LOC |
|---|---|---|
| gogpu/gg | 2D graphics with GPU compute shaders | ~222K |
| gogpu/naga | Shader compiler (WGSL → SPIR-V, MSL, HLSL, GLSL, DXIL) | ~199K |
| gogpu/wgpu | Pure Go WebGPU (Vulkan, Metal, DX12, GLES, Software) | ~156K |
| gogpu/gogpu | Graphics framework + windowing | ~52K |
Combined with Born's ~81K LOC, that's 710K+ lines of pure Go GPU code. No CGO. No Rust. Just go build.
Try It
git clone https://github.com/born-ml/born
cd born
go build ./...
go test ./... -short
Run the examples:
cd examples/mnist && go run . # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run . # CNN: 98.18% accuracy
cd examples/mnist-gpu && go run . # GPU-accelerated inference
GPU examples now work with go run — no .dll download step.
Build This With Us
Born is at an inflection point. GPU is stable. The architecture is proven. The roadmap to DeepSeek V4 is clear.
We're not looking for passive users. We're looking for people who want to help build one of the best ML frameworks in the world. In Go.
How you can make a difference:
File issues. Found a bug? A missing operator? An edge case that breaks your model? Every issue makes Born more production-ready. Our three external contributors started exactly this way.
Send PRs. Missing tensor ops (TopK, Scatter — needed for MoE), CPU optimizations (the inner loops are naive — lots of low-hanging fruit), new ONNX operators, quantization infrastructure. Look at what @bennibbelink and @gmohmad have done — full vertical slices, production quality. That's the standard.
Bring breakthrough ideas. The hardest problems ahead — MoE routing, FP4 dequantization, compressed sparse attention, CPU-GPU cache tiering — are open research questions in Go. If you have insights on how to make these work efficiently in pure Go, we want to hear them.
Challenge our assumptions. Tell us what we're doing wrong. Tell us what's missing. The best frameworks are shaped by people who care enough to argue.
Found a bug? Open an issue
Have a big idea? Feature Requests & Roadmap Discussion
Questions? Getting Started & FAQ
Ready to code? Contributing Guide
Links
| Resource | Link |
|---|---|
| GitHub | github.com/born-ml/born |
| v0.8.0 Release | Release Notes |
| Documentation | pkg.go.dev/github.com/born-ml/born |
| Roadmap | ROADMAP.md |
| Changelog | CHANGELOG.md |
| GoGPU | github.com/gogpu |
Five months ago, Born was a birthday project with zero stars. Today it's a pure Go ML framework with GPU acceleration, 4 contributors, 49 ONNX operators, and a roadmap to run DeepSeek V4.
No .dll. No .so. No excuses. Models are born production-ready.
go build. Ship. Done.
Star us on GitHub: github.com/born-ml/born