TL;DR: Born v0.6.0 ships ONNX model import and lazy GPU evaluation. Training went from ~90 seconds per step to under 5. No CGO. Pure Go. Let me show you what changed.
Three weeks ago, I skipped my birthday to release Born - a pure Go ML framework. The response was better than I expected. 74 people read the post. Some starred the repo. Some asked hard questions.
One star caught my attention: Jan Pfeifer, creator of GoMLX (1.2k stars) and GoNB (the Go Jupyter kernel, 959 stars). A Google Research engineer who's been building Go ML tools for years.
When someone who knows the space that well takes notice, you pay attention. It also tells me: the Go ML community is watching. They want this to work.
So I got back to coding.
The Problem: GPU Thrashing
Born 0.5.x had GPU acceleration. 123x speedup on MatMul. All the shaders were there.
But training was painfully slow.
Why? Every single tensor operation was doing this:
GPU compute → GPU-to-CPU copy → CPU work → CPU-to-GPU copy → next op
~200 GPU command submissions per forward pass. ~200 sync points. The GPU spent more time waiting than computing.
I measured it. A single training step: ~90 seconds.
That's not a framework. That's a coffee break generator.
The Fix: Lazy Everything
The solution was conceptually simple: stop copying data until someone actually needs it.
// Before: every operation immediately syncs
result := a.Add(b) // GPU compute → CPU copy → done
x := result.Mul(c) // GPU compute → CPU copy → done
// After: chain stays on GPU
result := a.Add(b) // GPU compute → keep on GPU
x := result.Mul(c) // GPU compute → keep on GPU
data := x.Data() // NOW copy to CPU (user asked for it)
This pattern is called "lazy evaluation." JAX is built around it. PyTorch does it in its lazy-tensor backends. Now Born does it too.
Implementation details:
- `LazyGPUData` holds a reference to the GPU buffer
- Operations chain without CPU round-trips
- `runtime.SetFinalizer` handles GPU memory cleanup
- Explicit `FlushCommands()` when you need sync
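Conceptually, the lazy tensor is just a handle to a device buffer plus a realize-on-read path. Here is a minimal sketch of that shape - illustrative types and names, not Born's actual API:

```go
import "runtime"

// Backend abstracts the pieces of the GPU runtime this sketch needs
// (hypothetical interface; Born's real types differ).
type Backend interface {
	FlushCommands()            // submit all queued GPU work in one batch
	Read(buf Buffer) []float32 // single GPU→CPU copy
	Release(buf Buffer)        // free the device buffer
}

// Buffer is an opaque handle to device memory.
type Buffer uintptr

// LazyGPUData keeps the result of an op chain on the GPU and only copies
// it back when Data() is called.
type LazyGPUData struct {
	backend  Backend
	buf      Buffer
	realized []float32
}

func newLazy(b Backend, buf Buffer) *LazyGPUData {
	t := &LazyGPUData{backend: b, buf: buf}
	// GC-driven cleanup: release the device buffer once the tensor is unreachable.
	runtime.SetFinalizer(t, func(t *LazyGPUData) { t.backend.Release(t.buf) })
	return t
}

// Data forces realization: one flush for the whole chain, one copy back.
func (t *LazyGPUData) Data() []float32 {
	if t.realized == nil {
		t.backend.FlushCommands()
		t.realized = t.backend.Read(t.buf)
	}
	return t.realized
}
```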
The result?
| Metric | Before | After |
|---|---|---|
| Training step | ~90s | <5s |
| GPU submits per chain | ~200 | 1-2 |
| Memory leaks | Some | Fixed |
~18x faster. Same hardware. Same model. Just smarter data movement.
ONNX Import: Train Anywhere, Deploy in Go
The other big feature in v0.6.0: ONNX model import.
ONNX is the interchange format for ML models. Train in PyTorch, export to .onnx, load anywhere. Now "anywhere" includes Born.
import "github.com/born-ml/born/onnx"
// Load PyTorch model exported as ONNX
model, err := onnx.Load("resnet50.onnx", backend)
if err != nil {
	log.Fatal(err)
}
// Run inference
output := model.Forward(input)
What's supported:
- 30+ ONNX operators (MatMul, Conv2D, ReLU, Softmax, Gather, Reshape...)
- Protobuf parsing
- Weight extraction
- Computation graph reconstruction
- Extensible operator registry
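That last item, the operator registry, is the extension point for ONNX node types Born doesn't cover yet. A minimal sketch of the pattern (illustrative names; Born's actual registration API may differ):

```go
import "fmt"

// Tensor is a placeholder type for this sketch.
type Tensor struct {
	Shape []int
	Data  []float32
}

// OpFunc builds an operator's output from its inputs and ONNX node attributes.
type OpFunc func(inputs []*Tensor, attrs map[string]any) (*Tensor, error)

var registry = map[string]OpFunc{}

// RegisterOp adds (or overrides) support for an ONNX node type.
func RegisterOp(opType string, fn OpFunc) { registry[opType] = fn }

// applyNode dispatches a graph node to its registered implementation.
func applyNode(opType string, inputs []*Tensor, attrs map[string]any) (*Tensor, error) {
	fn, ok := registry[opType]
	if !ok {
		return nil, fmt.Errorf("onnx: unsupported operator %q", opType)
	}
	return fn(inputs, attrs)
}
```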
What this means: you can train models in Python (because let's be real, that's where the research happens), then deploy them as single Go binaries. No Python runtime. No ONNX Runtime. Just Go.
The Full v0.6.0 Changelog
Beyond lazy evaluation and ONNX:
Raw Tensor Operations (50+ new ops):
- Argmax, TopK
- Type conversions (Float32 ↔ Int32 ↔ Bool)
- Advanced indexing (Gather, Scatter)
- NumPy-style broadcasting
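For reference, the broadcasting rule itself is small: align shapes from the right, and each pair of dimensions must either match or contain a 1. A standalone sketch of the shape resolution (not Born's internal code):

```go
import "fmt"

// broadcastShape resolves the output shape of an elementwise op under
// NumPy-style broadcasting, e.g. (3,1,5) with (4,5) gives (3,4,5).
func broadcastShape(a, b []int) ([]int, error) {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	out := make([]int, n)
	for i := 1; i <= n; i++ { // walk dimensions right-to-left
		da, db := 1, 1
		if i <= len(a) {
			da = a[len(a)-i]
		}
		if i <= len(b) {
			db = b[len(b)-i]
		}
		switch {
		case da == db, db == 1:
			out[n-i] = da
		case da == 1:
			out[n-i] = db
		default:
			return nil, fmt.Errorf("shapes %v and %v are not broadcastable", a, b)
		}
	}
	return out, nil
}
```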
GPU-to-GPU Copy:
- `CopyBufferToBuffer` for direct GPU memory transfer
- No more GPU→CPU→GPU round-trips in lazy chains
Bug Fixes:
- Fixed GPU memory leak when lazy tensors go out of scope
- Fixed typed accessors bypassing lazy realization
- Fixed Where and Sum operations missing lazy mode support
Tests:
- 15+ new ONNX tests
- Lazy mode chain tests
- Command batching tests
What's Next: v0.7.0 Roadmap
We're not slowing down. v0.7.0 (targeting January 2026) focuses on inference optimization:
Flash Attention 2
- Tiled attention algorithm
- O(N) memory instead of O(N²)
- 128K+ context support
- 2x speedup over standard attention
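For intuition, the core trick is the online softmax: stream over keys in tiles, keep a running max and running denominator, and rescale the partial output whenever a new max appears - so the full N×N score matrix never has to exist. A single-query sketch of that update (illustrative only, not the planned kernel):

```go
import "math"

// attendTiled computes softmax(q·Kᵀ/√d)·V for one query vector, streaming
// over keys/values tile by tile with the online-softmax update.
func attendTiled(q []float32, K, V [][]float32, tile int) []float32 {
	d := len(q)
	scale := 1.0 / math.Sqrt(float64(d))

	m := math.Inf(-1)         // running max of scores
	l := 0.0                  // running softmax denominator
	acc := make([]float64, d) // running (unnormalized) output

	for start := 0; start < len(K); start += tile {
		end := start + tile
		if end > len(K) {
			end = len(K)
		}
		for i := start; i < end; i++ {
			s := 0.0 // score = q·K[i] / √d
			for j := 0; j < d; j++ {
				s += float64(q[j]) * float64(K[i][j])
			}
			s *= scale

			newM := math.Max(m, s)
			corr := math.Exp(m - newM) // rescale previous state if the max moved
			p := math.Exp(s - newM)
			l = l*corr + p
			for j := 0; j < d; j++ {
				acc[j] = acc[j]*corr + p*float64(V[i][j])
			}
			m = newM
		}
	}

	out := make([]float32, d)
	for j := range out {
		out[j] = float32(acc[j] / l)
	}
	return out
}
```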
Speculative Decoding
- Draft model generates K tokens
- Target model verifies in parallel
- 2-4x inference speedup
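The control flow is simple even in the greedy case: the draft proposes, the target checks, and everything up to the first disagreement is kept. A sketch with a hypothetical Model interface (the probabilistic accept/reject scheme used in practice is a refinement of this):

```go
// Model is a hypothetical interface for this sketch: Next returns the
// greedy next token for a prefix.
type Model interface {
	Next(prefix []int) int
}

// speculativeStep: the small draft model proposes k tokens, the large target
// model verifies them; tokens are accepted until the first mismatch, which is
// replaced by the target's own prediction.
func speculativeStep(target, draft Model, prefix []int, k int) []int {
	// 1. Draft proposes k tokens autoregressively (cheap).
	proposed := make([]int, 0, k)
	ctx := append([]int(nil), prefix...)
	for i := 0; i < k; i++ {
		t := draft.Next(ctx)
		proposed = append(proposed, t)
		ctx = append(ctx, t)
	}

	// 2. Target verifies. In a real implementation all k positions are
	//    scored in one batched forward pass - that's where the speedup is.
	out := append([]int(nil), prefix...)
	for _, t := range proposed {
		want := target.Next(out)
		out = append(out, want)
		if want != t {
			return out // first mismatch: keep the target's token and stop
		}
	}
	return out
}
```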
GGUF Import
- Load llama.cpp quantized models directly
- Q4_K_M, Q5_K_M, Q6_K support
- Access to thousands of pre-quantized models
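Quantized GGUF weights are stored in fixed-size blocks, each carrying a scale plus packed low-bit values. The K-quants listed above are more elaborate, but the simpler Q4_0 layout shows the idea: 32 weights per block, one scale, 16 bytes of packed 4-bit values, each weight reconstructed as (nibble - 8) * scale. A sketch of the dequantization (the scale is shown as an already-converted float32; on disk it's a float16):

```go
// dequantQ4_0 expands one Q4_0 block: 16 packed bytes → 32 weights.
// Low nibbles fill the first half of the block, high nibbles the second half.
func dequantQ4_0(scale float32, packed [16]byte) [32]float32 {
	var out [32]float32
	for i, b := range packed {
		lo := int(b&0x0F) - 8
		hi := int(b>>4) - 8
		out[i] = float32(lo) * scale
		out[i+16] = float32(hi) * scale
	}
	return out
}
```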
The goal: make Born competitive with vLLM and llama.cpp for inference, while staying pure Go.
Why This Matters
Go has excellent tooling for production systems. Kubernetes, Docker, Terraform, Prometheus - all Go. Single binary deployment. Easy cross-compilation. Strong typing.
But ML? You had to leave Go, write Python, then figure out how to glue it all together.
Born changes that. Train in Python if you want. But deploy in Go. Single binary. No Python interpreter. No conda environments. No "it works on my machine."
// Your entire ML inference pipeline
model := born.Load("model.born")
http.HandleFunc("/predict", func(w http.ResponseWriter, r *http.Request) {
	input := parseRequest(r)
	output := model.Predict(input)
	json.NewEncoder(w).Encode(output)
})
log.Fatal(http.ListenAndServe(":8080", nil))
That's it. `go build`. Deploy anywhere Go runs.
Try It Yourself
git clone https://github.com/born-ml/born
cd born
make build
make test
Or jump straight to examples:
cd examples/mnist && go run . # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run . # CNN: 98.18% accuracy
Get Involved
Born is open source and we need your help to make it better.
Found a bug? Open an issue - we fix them fast.
Have an idea? Start a discussion.
Want to contribute? PRs are welcome:
- Contributing Guide
- Beginner-friendly issues are labeled as good first issues
Just want to follow progress? Star the repo.
Links
| Resource | Link |
|---|---|
| GitHub | github.com/born-ml/born |
| Documentation | pkg.go.dev/github.com/born-ml/born |
| Roadmap | ROADMAP.md |
| Changelog | CHANGELOG.md |
| First Article | I Skipped My Birthday to Give Go Its First Real ML Framework |
The Go ML ecosystem is growing. Whether you're a Go developer curious about ML, or an ML engineer tired of deployment hell - give Born a try. Report bugs. Suggest features. Help us build something great.
Let's make Go a first-class citizen in the ML world.
Star us on GitHub: github.com/born-ml/born