TL;DR: Born v0.6.0 ships ONNX model import and lazy GPU evaluation. Training went from ~90 seconds per step to under 5. No CGO. Pure Go. Let me show you what changed.
Three weeks ago, I skipped my birthday to release Born - a pure Go ML framework. The response was better than I expected. 74 people read the post. Some starred the repo. Some asked hard questions.
One star caught my attention: Jan Pfeifer, creator of GoMLX (1.2k stars) and GoNB (the Go Jupyter kernel, 959 stars). A Google Research engineer who's been building Go ML tools for years.
When someone who knows the space that well takes notice, you pay attention. It also tells me: the Go ML community is watching. They want this to work.
So I got back to coding.
The Problem: GPU Thrashing
Born 0.5.x had GPU acceleration. 123x speedup on MatMul. All the shaders were there.
But training was painfully slow.
Why? Every single tensor operation was doing this:
GPU compute → GPU-to-CPU copy → CPU work → CPU-to-GPU copy → next op
~200 GPU command submissions per forward pass. ~200 sync points. The GPU spent more time waiting than computing.
I measured it. A single training step: ~90 seconds.
That's not a framework. That's a coffee break generator.
The Fix: Lazy Everything
The solution was conceptually simple: stop copying data until someone actually needs it.
// Before: every operation immediately syncs
result := a.Add(b) // GPU compute → CPU copy → done
x := result.Mul(c) // GPU compute → CPU copy → done
// After: chain stays on GPU
result := a.Add(b) // GPU compute → keep on GPU
x := result.Mul(c) // GPU compute → keep on GPU
data := x.Data() // NOW copy to CPU (user asked for it)
This pattern is called "lazy evaluation." JAX is built around it. PyTorch does it in its lazy-tensor backends. Now Born does it too.
Implementation details:
- `LazyGPUData` holds a reference to the GPU buffer
- Operations chain without CPU round-trips
- `runtime.SetFinalizer` handles GPU memory cleanup
- Explicit `FlushCommands()` when you need sync
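Conceptually, the lazy tensor is just a handle to a device buffer plus a realize-on-read path. Here is a minimal sketch of that shape - illustrative types and names, not Born's actual API:

```go
import "runtime"

// Backend abstracts the pieces of the GPU runtime this sketch needs
// (hypothetical interface; Born's real types differ).
type Backend interface {
	FlushCommands()            // submit all queued GPU work in one batch
	Read(buf Buffer) []float32 // single GPU→CPU copy
	Release(buf Buffer)        // free the device buffer
}

// Buffer is an opaque handle to device memory.
type Buffer uintptr

// LazyGPUData keeps the result of an op chain on the GPU and only copies
// it back when Data() is called.
type LazyGPUData struct {
	backend  Backend
	buf      Buffer
	realized []float32
}

func newLazy(b Backend, buf Buffer) *LazyGPUData {
	t := &LazyGPUData{backend: b, buf: buf}
	// GC-driven cleanup: release the device buffer once the tensor is unreachable.
	runtime.SetFinalizer(t, func(t *LazyGPUData) { t.backend.Release(t.buf) })
	return t
}

// Data forces realization: one flush for the whole chain, one copy back.
func (t *LazyGPUData) Data() []float32 {
	if t.realized == nil {
		t.backend.FlushCommands()
		t.realized = t.backend.Read(t.buf)
	}
	return t.realized
}
```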
The result?
| Metric | Before | After |
|---|---|---|
| Training step | ~90s | <5s |
| GPU submits per chain | ~200 | 1-2 |
| Memory leaks | Some | Fixed |
~18x faster. Same hardware. Same model. Just smarter data movement.
ONNX Import: Train Anywhere, Deploy in Go
The other big feature in v0.6.0: ONNX model import.
ONNX is the interchange format for ML models. Train in PyTorch, export to .onnx, load anywhere. Now "anywhere" includes Born.
import "github.com/born-ml/born/onnx"
// Load PyTorch model exported as ONNX
model, err := onnx.Load("resnet50.onnx", backend)
if err != nil {
	log.Fatal(err)
}
// Run inference
output := model.Forward(input)
What's supported:
- 30+ ONNX operators (MatMul, Conv2D, ReLU, Softmax, Gather, Reshape...)
- Protobuf parsing
- Weight extraction
- Computation graph reconstruction
- Extensible operator registry
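That last item, the operator registry, is the extension point for ONNX node types Born doesn't cover yet. A minimal sketch of the pattern (illustrative names; Born's actual registration API may differ):

```go
import "fmt"

// Tensor is a placeholder type for this sketch.
type Tensor struct {
	Shape []int
	Data  []float32
}

// OpFunc builds an operator's output from its inputs and ONNX node attributes.
type OpFunc func(inputs []*Tensor, attrs map[string]any) (*Tensor, error)

var registry = map[string]OpFunc{}

// RegisterOp adds (or overrides) support for an ONNX node type.
func RegisterOp(opType string, fn OpFunc) { registry[opType] = fn }

// applyNode dispatches a graph node to its registered implementation.
func applyNode(opType string, inputs []*Tensor, attrs map[string]any) (*Tensor, error) {
	fn, ok := registry[opType]
	if !ok {
		return nil, fmt.Errorf("onnx: unsupported operator %q", opType)
	}
	return fn(inputs, attrs)
}
```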
What this means: you can train models in Python (because let's be real, that's where the research happens), then deploy them as single Go binaries. No Python runtime. No ONNX Runtime. Just Go.
The Full v0.6.0 Changelog
Beyond lazy evaluation and ONNX:
Raw Tensor Operations (50+ new ops):
- Argmax, TopK
- Type conversions (Float32 ↔ Int32 ↔ Bool)
- Advanced indexing (Gather, Scatter)
- NumPy-style broadcasting
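For reference, the broadcasting rule itself is small: align shapes from the right, and each pair of dimensions must either match or contain a 1. A standalone sketch of the shape resolution (not Born's internal code):

```go
import "fmt"

// broadcastShape resolves the output shape of an elementwise op under
// NumPy-style broadcasting, e.g. (3,1,5) with (4,5) gives (3,4,5).
func broadcastShape(a, b []int) ([]int, error) {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	out := make([]int, n)
	for i := 1; i <= n; i++ { // walk dimensions right-to-left
		da, db := 1, 1
		if i <= len(a) {
			da = a[len(a)-i]
		}
		if i <= len(b) {
			db = b[len(b)-i]
		}
		switch {
		case da == db, db == 1:
			out[n-i] = da
		case da == 1:
			out[n-i] = db
		default:
			return nil, fmt.Errorf("shapes %v and %v are not broadcastable", a, b)
		}
	}
	return out, nil
}
```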
GPU-to-GPU Copy:
- `CopyBufferToBuffer` for direct GPU memory transfer
- No more GPU→CPU→GPU round-trips in lazy chains
Bug Fixes:
- Fixed GPU memory leak when lazy tensors go out of scope
- Fixed typed accessors bypassing lazy realization
- Fixed Where and Sum operations missing lazy mode support
Tests:
- 15+ new ONNX tests
- Lazy mode chain tests
- Command batching tests
What's Next: v0.7.0 Roadmap
We're not slowing down. v0.7.0 (targeting January 2026) focuses on inference optimization:
Flash Attention 2
- Tiled attention algorithm
- O(N) memory instead of O(N²)
- 128K+ context support
- 2x speedup over standard attention
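For intuition, the core trick is the online softmax: stream over keys in tiles, keep a running max and running denominator, and rescale the partial output whenever a new max appears - so the full N×N score matrix never has to exist. A single-query sketch of that update (illustrative only, not the planned kernel):

```go
import "math"

// attendTiled computes softmax(q·Kᵀ/√d)·V for one query vector, streaming
// over keys/values tile by tile with the online-softmax update.
func attendTiled(q []float32, K, V [][]float32, tile int) []float32 {
	d := len(q)
	scale := 1.0 / math.Sqrt(float64(d))

	m := math.Inf(-1)         // running max of scores
	l := 0.0                  // running softmax denominator
	acc := make([]float64, d) // running (unnormalized) output

	for start := 0; start < len(K); start += tile {
		end := start + tile
		if end > len(K) {
			end = len(K)
		}
		for i := start; i < end; i++ {
			s := 0.0 // score = q·K[i] / √d
			for j := 0; j < d; j++ {
				s += float64(q[j]) * float64(K[i][j])
			}
			s *= scale

			newM := math.Max(m, s)
			corr := math.Exp(m - newM) // rescale previous state if the max moved
			p := math.Exp(s - newM)
			l = l*corr + p
			for j := 0; j < d; j++ {
				acc[j] = acc[j]*corr + p*float64(V[i][j])
			}
			m = newM
		}
	}

	out := make([]float32, d)
	for j := range out {
		out[j] = float32(acc[j] / l)
	}
	return out
}
```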
Speculative Decoding
- Draft model generates K tokens
- Target model verifies in parallel
- 2-4x inference speedup
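The control flow is simple even in the greedy case: the draft proposes, the target checks, and everything up to the first disagreement is kept. A sketch with a hypothetical Model interface (the probabilistic accept/reject scheme used in practice is a refinement of this):

```go
// Model is a hypothetical interface for this sketch: Next returns the
// greedy next token for a prefix.
type Model interface {
	Next(prefix []int) int
}

// speculativeStep: the small draft model proposes k tokens, the large target
// model verifies them; tokens are accepted until the first mismatch, which is
// replaced by the target's own prediction.
func speculativeStep(target, draft Model, prefix []int, k int) []int {
	// 1. Draft proposes k tokens autoregressively (cheap).
	proposed := make([]int, 0, k)
	ctx := append([]int(nil), prefix...)
	for i := 0; i < k; i++ {
		t := draft.Next(ctx)
		proposed = append(proposed, t)
		ctx = append(ctx, t)
	}

	// 2. Target verifies. In a real implementation all k positions are
	//    scored in one batched forward pass - that's where the speedup is.
	out := append([]int(nil), prefix...)
	for _, t := range proposed {
		want := target.Next(out)
		out = append(out, want)
		if want != t {
			return out // first mismatch: keep the target's token and stop
		}
	}
	return out
}
```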
GGUF Import
- Load llama.cpp quantized models directly
- Q4_K_M, Q5_K_M, Q6_K support
- Access to thousands of pre-quantized models
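Quantized GGUF weights are stored in fixed-size blocks, each carrying a scale plus packed low-bit values. The K-quants listed above are more elaborate, but the simpler Q4_0 layout shows the idea: 32 weights per block, one scale, 16 bytes of packed 4-bit values, each weight reconstructed as (nibble - 8) * scale. A sketch of the dequantization (the scale is shown as an already-converted float32; on disk it's a float16):

```go
// dequantQ4_0 expands one Q4_0 block: 16 packed bytes → 32 weights.
// Low nibbles fill the first half of the block, high nibbles the second half.
func dequantQ4_0(scale float32, packed [16]byte) [32]float32 {
	var out [32]float32
	for i, b := range packed {
		lo := int(b&0x0F) - 8
		hi := int(b>>4) - 8
		out[i] = float32(lo) * scale
		out[i+16] = float32(hi) * scale
	}
	return out
}
```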
The goal: make Born competitive with vLLM and llama.cpp for inference, while staying pure Go.
Why This Matters
Go has excellent tooling for production systems. Kubernetes, Docker, Terraform, Prometheus - all Go. Single binary deployment. Easy cross-compilation. Strong typing.
But ML? You had to leave Go, write Python, then figure out how to glue it all together.
Born changes that. Train in Python if you want. But deploy in Go. Single binary. No Python interpreter. No conda environments. No "it works on my machine."
// Your entire ML inference pipeline
model := born.Load("model.born")
http.HandleFunc("/predict", func(w http.ResponseWriter, r *http.Request) {
	input := parseRequest(r)
	output := model.Predict(input)
	json.NewEncoder(w).Encode(output)
})
log.Fatal(http.ListenAndServe(":8080", nil))
That's it. `go build`. Deploy anywhere Go runs.
Try It Yourself
git clone https://github.com/born-ml/born
cd born
make build
make test
Or jump straight to examples:
cd examples/mnist && go run . # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run . # CNN: 98.18% accuracy
Get Involved
Born is open source and we need your help to make it better.
Found a bug? Open an issue - we fix them fast.
Have an idea? Start a discussion.
Want to contribute? PRs are welcome:
- Contributing Guide
- Beginner-friendly issues are labeled as good first issues
Just want to follow progress? Star the repo.
Links
| Resource | Link |
|---|---|
| GitHub | github.com/born-ml/born |
| Documentation | pkg.go.dev/github.com/born-ml/born |
| Roadmap | ROADMAP.md |
| Changelog | CHANGELOG.md |
| First Article | I Skipped My Birthday to Give Go Its First Real ML Framework |
The Go ML ecosystem is growing. Whether you're a Go developer curious about ML, or an ML engineer tired of deployment hell - give Born a try. Report bugs. Suggest features. Help us build something great.
Let's make Go a first-class citizen in the ML world.
Star us on GitHub: github.com/born-ml/born