DEV Community

Andrey Kolkov
Born ML v0.8.0: We Killed Our Last .dll — Pure Go GPU Is Here

TL;DR: Born v0.8.0 replaces go-webgpu (Rust FFI + shared libraries) with gogpu/wgpu — pure Go WebGPU. No .dll. No .so. No runtime downloads. go build now gives you a GPU-accelerated ML binary. We also fixed 5 critical GPU bugs and validated on real model training. Next up: DeepSeek V4 inference support.


The Last Dependency

Five months ago I skipped my birthday to release Born. A few weeks later we made training 18x faster with lazy GPU evaluation. The framework was growing. Contributors were showing up. Real people were using it.

But there was a problem I couldn't ignore anymore.

Every time someone wanted to use GPU acceleration, the conversation went like this:

"How do I run the GPU examples?"

"Download wgpu-native .dll for your platform, put it in your PATH..."

"...I thought you said pure Go?"

They were right. Born's CPU path was pure Go. But the GPU backend used go-webgpu — Go bindings to Rust's wgpu-native via FFI. You needed a platform-specific shared library at runtime. On Windows, a .dll. On Linux, a .so. On macOS, a .dylib.

For a framework whose tagline is "single binary deployment", that was embarrassing.

So we fixed it.


Why Not Earlier?

Fair question. gogpu/wgpu existed for months before v0.8.0. Why did we ship 29 releases on go-webgpu first?

Because that was the plan.

go-webgpu wraps Rust's wgpu-native — a battle-tested GPU abstraction used by Firefox and dozens of production projects. When you're building a new ML framework from scratch, you don't want to debug your GPU backend and your tensor math at the same time. If training produces wrong gradients, is the bug in your autodiff engine or in your WebGPU implementation? With Rust wgpu-native underneath, we knew: the GPU layer works. Any bug is ours.

So we built Born v0.1 through v0.7 on a proven foundation. Tensor ops, autodiff, attention, Flash Attention, speculative decoding, ONNX import, GGUF loading — all validated against a GPU backend we could trust. By v0.7.16, Born had 1,394 tests, 3 external contributors, and real model training working.

Meanwhile, gogpu/wgpu was maturing along its own path — powering gogpu/gg (a 2D graphics library with GPU compute shaders), running real rendering workloads, and stabilizing its Core API across Vulkan, Metal, DX12, and GLES.

When both sides were proven, the migration became simple: we knew Born's code was correct, and we knew gogpu/wgpu's Core API was stable. Any bug found during migration was specifically a wgpu Go integration issue — easy to isolate, easy to fix.

That's exactly what happened. Five bugs, all in resource lifecycle. All fixed in days, not weeks.

Validate on a proven foundation first. Swap the foundation second. That's not how you move fast; it's how you move right.


The Migration

Born v0.8.0 replaces go-webgpu with gogpu/wgpu — a pure Go WebGPU implementation from our own GoGPU ecosystem.

```diff
- github.com/go-webgpu/webgpu v0.4.1
+ github.com/gogpu/wgpu v0.26.8
```

One line in go.mod. 27 files changed. 1,830 additions, 1,518 deletions.

What changed:

| | go-webgpu (before) | gogpu/wgpu (after) |
|---|---|---|
| Implementation | Rust wgpu-native via FFI | Pure Go |
| CGO | None (goffi) | None |
| Runtime .dll/.so | Required | None |
| Build | go build + download .dll | go build. Period. |
| Vulkan/Metal/DX12 | Via Rust | Via Go |
| WGSL shaders | Unchanged | Unchanged |
| Control | External project | Our project |

That last row matters. gogpu/wgpu isn't some random dependency — it's our project. When Born needs a WebGPU API change, we change it upstream. Both sides of the interface are under our control.


Five Bugs Nobody Told Us About

Swapping the GPU backend is like replacing a car engine while driving. Everything looks the same from the outside, but internally the timing, resource lifecycle, and synchronization are completely different.

We found five critical bugs during migration:

1. PipelineLayout Freed Too Early

Vulkan requires compute pipeline layouts to stay alive during SetBindGroup(). go-webgpu's internal reference counting kept them alive. gogpu/wgpu doesn't — you own your resources.

We fixed this by storing PipelineLayout alongside the pipeline in our cache.
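The shape of that fix can be sketched with stand-in types — the struct and field names below are illustrative, not Born's or gogpu/wgpu's real API:

```go
package main

import "fmt"

// Hypothetical stand-ins for GPU handle types.
type PipelineLayout struct{ id int }
type ComputePipeline struct{ id int }

// cacheEntry retains the layout alongside the pipeline. Without the
// layout field, nothing would keep the layout alive, and Vulkan could
// see a freed layout during SetBindGroup().
type cacheEntry struct {
	pipeline *ComputePipeline
	layout   *PipelineLayout
}

type pipelineCache struct {
	entries map[string]cacheEntry
}

// get returns a cached pipeline, building (and caching) it on a miss.
func (c *pipelineCache) get(key string, build func() (*ComputePipeline, *PipelineLayout)) *ComputePipeline {
	if e, ok := c.entries[key]; ok {
		return e.pipeline
	}
	p, l := build()
	c.entries[key] = cacheEntry{pipeline: p, layout: l}
	return p
}

func main() {
	cache := &pipelineCache{entries: map[string]cacheEntry{}}
	builds := 0
	build := func() (*ComputePipeline, *PipelineLayout) {
		builds++
		return &ComputePipeline{id: builds}, &PipelineLayout{id: builds}
	}
	cache.get("matmul", build)
	cache.get("matmul", build) // cache hit: layout stays alive, no rebuild
	fmt.Println("builds:", builds) // prints builds: 1
}
```

The point is ownership: once you are the one tracking resource lifetimes, the cache entry is the natural place to pin everything the pipeline depends on.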

2. Lazy Ops and the Destroy Queue

Born uses lazy evaluation — GPU ops chain without CPU sync. But when a tensor gets garbage-collected mid-chain, its buffer goes to the destroy queue. If the pending operations haven't submitted yet, the buffer is destroyed before the GPU reads it.

Fix: immediate submit for lazy ops. Every operation submits its command encoder before returning.

3. Buffer Copy Race

copyGPUBuffer (used by Data() to read results back to CPU) was queuing the copy but not submitting. The next operation might overwrite the source buffer before the copy executed.

Fix: immediate submit after copy.
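The submit-before-return pattern behind both fixes looks roughly like this, with stand-in types (`queue`, `encoder`, and `dispatch` are illustrative, not the real wgpu API):

```go
package main

import "fmt"

// queue stands in for the GPU queue; it records what was submitted.
type queue struct{ submitted []string }

func (q *queue) submit(cmds string) { q.submitted = append(q.submitted, cmds) }

// encoder stands in for a command encoder.
type encoder struct{ ops []string }

func (e *encoder) record(op string) { e.ops = append(e.ops, op) }

// dispatch records an op and submits its encoder before returning.
// The bug was returning without the submit: a buffer finalized later
// could then be destroyed before the GPU ever consumed it.
func dispatch(q *queue, op string) {
	enc := &encoder{}
	enc.record(op)
	q.submit(fmt.Sprint(enc.ops)) // submit eagerly -- the fix
}

func main() {
	q := &queue{}
	dispatch(q, "matmul")
	dispatch(q, "relu")
	fmt.Println(len(q.submitted)) // prints 2: each op submitted eagerly
}
```

The trade-off is more queue submissions in exchange for a hard guarantee: by the time any Go-side handle can be collected, the GPU already holds the work that reads its buffer.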

4. GC vs GPU

Go's garbage collector doesn't know about GPU resources. A runtime.SetFinalizer on a tensor could fire while the GPU was still computing with that tensor's buffer.

Fix: runtime.KeepAlive() guards around every GPU operation that uses the tensor.

5. Device Cleanup Order

When destroying the GPU device, all pending work must complete first. Without Poll(PollWait) before resource destruction, Vulkan validation layers scream.

Fix: explicit Poll(PollWait) in Release() to ensure GPU idle.
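A sketch of that shutdown ordering, with stand-in `device` and poll types (names are illustrative; the real gogpu/wgpu API may differ):

```go
package main

import "fmt"

// Illustrative poll mode; pollWait blocks until submitted work completes.
type pollMode int

const pollWait pollMode = iota

type device struct {
	pending int // queued GPU work items
	buffers int // live GPU buffers
}

// poll with pollWait drains all in-flight work before returning.
func (d *device) poll(m pollMode) {
	if m == pollWait {
		d.pending = 0
	}
}

// release destroys resources only after the GPU is idle. Destroying
// buffers while work is pending is what made Vulkan validation scream.
func (d *device) release() error {
	d.poll(pollWait) // ensure GPU idle first
	if d.pending != 0 {
		return fmt.Errorf("destroying with %d ops in flight", d.pending)
	}
	d.buffers = 0
	return nil
}

func main() {
	d := &device{pending: 3, buffers: 8}
	if err := d.release(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("clean shutdown") // prints clean shutdown
}
```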

None of these bugs existed with go-webgpu. They're all about resource lifecycle differences between Rust's ownership model (where wgpu-native tracks everything for you) and Go's GC-based model (where you track it yourself).

After fixing all five, we ran all GPU tests and a 20-epoch model training with zero crashes.


What You Get

True Single Binary

```shell
go build -o myapp ./cmd/myapp
# That's it. Ship the binary. GPU works.
```

No .dll downloads. No LD_LIBRARY_PATH. No platform-specific install steps. The binary works on any machine with a Vulkan-capable GPU.

Same API, Same Shaders

If you have existing Born code with GPU, nothing changes:

```go
import (
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/autodiff"
)

// CPU-only (always worked)
backend := autodiff.New(cpu.New())
```
```go
import "github.com/born-ml/born/backend/webgpu"

// GPU-accelerated (now pure Go!)
if webgpu.IsAvailable() {
    gpu, _ := webgpu.New()
    backend := autodiff.New(gpu)
    defer gpu.Release()
}
```

WGSL shaders are unchanged. The Backend interface (52 methods) is unchanged. Your code just works — minus the .dll.

Validated on Real Training

We didn't just run unit tests. We trained a real Hierarchical Reasoning Model (HRM) for 20 epochs on GPU. Zero crashes. Correct gradients. Same accuracy as go-webgpu.


The Numbers

| Metric | Value |
|---|---|
| Go source | ~47K LOC |
| Tests | ~34K LOC, 1,394 test functions |
| ONNX operators | 49 |
| Backend methods | 52 |
| GPU tests | 105 |
| Contributors | 4 (@kolkov, @gmohmad, @bennibbelink, @jsully1720) |
| Releases | 30 |
| Stars | 80 (organic, no marketing) |

Community

v0.8.0 isn't just about the migration. Since v0.7.0, three external contributors have landed real code:

  • @jsully1720 — ONNX Equal operator
  • @bennibbelink — Erf, Sign/Abs, Clamp ops (3 PRs, all full vertical slices: backend → CPU → GPU → autodiff → tests)
  • @gmohmad — LayerNorm, BatchMatMul broadcasting, Squeeze fix, 9 new ONNX ops, inplace mutation bug fix (5 PRs)

These aren't drive-by typo fixes. These are production-quality contributions from people who studied the codebase and followed the patterns. If you're considering contributing, look at what they did — that's the bar.


What's Next: DeepSeek V4 Inference

With the GPU backend stable and pure Go, we can focus on what matters: running real models.

DeepSeek released V4 on April 24, 2026 — two models:

  • V4-Pro: 1.6 trillion params, 49B active
  • V4-Flash: 284B total, 13B active — fits on a consumer GPU

V4-Flash with 13B active parameters is Born's sweet spot. It's the most capable open model that fits on a single 24GB GPU. API pricing is tied to chip availability ($1.74/M tokens, bottleneck pricing) — users want local inference alternatives.

We started researching V4 architecture before it launched — back in early April, when only the Engram paper and V3.2 sparse attention existed. We predicted V4 would combine MoE + Engram + manifold-constrained residuals + compressed sparse attention. On April 24th, the tech report confirmed all four. Two weeks head start on architecture analysis. (We do this kind of research openly — see Discussion #60 for our Recurrent-Depth Transformer analysis.)

Here's the full component breakdown:

| Component | What | Why It Matters |
|---|---|---|
| MoE Routing | Top-16 sparse expert selection | Also unlocks Mixtral, DBRX |
| MXFP4 Dequantization | FP4 expert weights with block scaling | V4's native format — not INT4 GPTQ |
| Engram | O(1) hash-lookup factual memory | Unique to DeepSeek, DRAM-resident |
| Three-Pool Attention | SWA + C4 + C128 compression | 1M context with <10% throughput drop |
| Hyper-Connections (mHC) | 4D manifold-constrained residual | Every transformer layer uses this |
| MTP Drafting | Integrated speculative decoding | ~2.5 tokens accepted per step |
| KV Cache Tiering | CPU-GPU cache with LRU eviction | 128K+ context on 24GB consumer GPU |
| PD-Disaggregation | Prefill/Decode split serving | Production throughput scaling |

Total estimate: 22-30 weeks. It's a lot. But MoE routing alone unlocks V4, Mixtral, and BAR (Allen AI's modular post-training). Each component is independently valuable.
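To make the first row concrete: top-k expert routing boils down to "pick the k highest router logits, then softmax over the survivors". A generic sketch in pure Go — not DeepSeek's or Born's actual routing code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// topKExperts picks the k highest-scoring experts from router logits
// and normalizes their scores into routing weights with a softmax over
// the selected subset only.
func topKExperts(logits []float64, k int) (idx []int, weights []float64) {
	idx = make([]int, len(logits))
	for i := range idx {
		idx[i] = i
	}
	// Sort expert indices by descending logit, keep the first k.
	sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })
	idx = idx[:k]

	// Softmax over the selected experts.
	var sum float64
	weights = make([]float64, k)
	for i, e := range idx {
		weights[i] = math.Exp(logits[e])
		sum += weights[i]
	}
	for i := range weights {
		weights[i] /= sum
	}
	return idx, weights
}

func main() {
	logits := []float64{0.1, 2.0, -1.0, 1.5, 0.3, 1.9}
	idx, w := topKExperts(logits, 3)
	fmt.Println(idx) // prints [1 5 3]: highest-logit experts first
	fmt.Printf("%.2f\n", w)
}
```

The hard part in a real framework isn't this selection — it's the gather/scatter tensor ops that route tokens to expert weights on the GPU, which is exactly why TopK and Scatter are on the contribution wishlist.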


The GoGPU Ecosystem

Born's GPU backend is powered by the GoGPU ecosystem — pure Go GPU infrastructure:

| Project | What | LOC |
|---|---|---|
| gogpu/gg | 2D graphics with GPU compute shaders | ~222K |
| gogpu/naga | Shader compiler (WGSL → SPIR-V, MSL, HLSL, GLSL, DXIL) | ~199K |
| gogpu/wgpu | Pure Go WebGPU (Vulkan, Metal, DX12, GLES, Software) | ~156K |
| gogpu/gogpu | Graphics framework + windowing | ~52K |

Combined with Born's ~81K LOC, that's 710K+ lines of pure Go GPU code. No CGO. No Rust. Just go build.


Try It

```shell
git clone https://github.com/born-ml/born
cd born
go build ./...
go test ./... -short
```

Run the examples:

```shell
cd examples/mnist && go run .       # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run .   # CNN: 98.18% accuracy
cd examples/mnist-gpu && go run .   # GPU-accelerated inference
```

GPU examples now work with go run — no .dll download step.


Build This With Us

Born is at an inflection point. GPU is stable. The architecture is proven. The roadmap to DeepSeek V4 is clear.

We're not looking for passive users. We're looking for people who want to help build one of the best ML frameworks in the world. In Go.

How you can make a difference:

  • File issues. Found a bug? A missing operator? An edge case that breaks your model? Every issue makes Born more production-ready. Our three external contributors started exactly this way.

  • Send PRs. Missing tensor ops (TopK, Scatter — needed for MoE), CPU optimizations (the inner loops are naive — lots of low-hanging fruit), new ONNX operators, quantization infrastructure. Look at what @bennibbelink and @gmohmad have done — full vertical slices, production quality. That's the standard.

  • Bring breakthrough ideas. The hardest problems ahead — MoE routing, FP4 dequantization, compressed sparse attention, CPU-GPU cache tiering — are open research questions in Go. If you have insights on how to make these work efficiently in pure Go, we want to hear them.

  • Challenge our assumptions. Tell us what we're doing wrong. Tell us what's missing. The best frameworks are shaped by people who care enough to argue.

Found a bug? Open an issue
Have a big idea? Feature Requests & Roadmap Discussion
Questions? Getting Started & FAQ
Ready to code? Contributing Guide


Links

| Resource | Link |
|---|---|
| GitHub | github.com/born-ml/born |
| v0.8.0 Release | Release Notes |
| Documentation | pkg.go.dev/github.com/born-ml/born |
| Roadmap | ROADMAP.md |
| Changelog | CHANGELOG.md |
| GoGPU | github.com/gogpu |

Five months ago, Born was a birthday project with zero stars. Today it's a pure Go ML framework with GPU acceleration, 4 contributors, 49 ONNX operators, and a roadmap to run DeepSeek V4.

No .dll. No .so. No excuses. Models are born production-ready.

go build. Ship. Done.

Star us on GitHub: github.com/born-ml/born
