Andrey Kolkov
Born ML v0.6.0: From 90 Seconds to 5 - How We Made Go ML Training Actually Fast

TL;DR: Born v0.6.0 ships ONNX model import and lazy GPU evaluation. Training went from ~90 seconds per step to under 5. No CGO. Pure Go. Let me show you what changed.


Three weeks ago, I skipped my birthday to release Born - a pure Go ML framework. The response was better than I expected. 74 people read the post. Some starred the repo. Some asked hard questions.

One star caught my attention: Jan Pfeifer, creator of GoMLX (1.2k stars) and GoNB (the Go Jupyter kernel, 959 stars). A Google Research engineer who's been building Go ML tools for years.

When someone who knows the space that well takes notice, you pay attention. It also tells me: the Go ML community is watching. They want this to work.

So I got back to coding.


The Problem: GPU Thrashing

Born 0.5.x had GPU acceleration. 123x speedup on MatMul. All the shaders were there.

But training was painfully slow.

Why? Every single tensor operation was doing this:

GPU compute → GPU-to-CPU copy → CPU work → CPU-to-GPU copy → next op

~200 GPU command submissions per forward pass. ~200 sync points. The GPU spent more time waiting than computing.

I measured it. A single training step: ~90 seconds.

That's not a framework. That's a coffee break generator.


The Fix: Lazy Everything

The solution was conceptually simple: stop copying data until someone actually needs it.

// Before: every operation immediately syncs
result := a.Add(b)  // GPU compute → CPU copy → done
x := result.Mul(c)  // GPU compute → CPU copy → done

// After: chain stays on GPU
result := a.Add(b)  // GPU compute → keep on GPU
x := result.Mul(c)  // GPU compute → keep on GPU
data := x.Data()    // NOW copy to CPU (user asked for it)

This pattern is called "lazy evaluation." JAX does it through tracing. PyTorch's LazyTensor/XLA path does it. Now Born does it too.

Implementation details:

  • LazyGPUData holds a reference to a GPU buffer
  • Operations chain without CPU round-trips
  • runtime.SetFinalizer handles GPU memory cleanup
  • Explicit FlushCommands() when you need sync
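To make the mechanics concrete, here is a minimal sketch of the idea behind those bullet points: each op returns a deferred computation, chains build up without syncing, and `Data()` is the single realization point. The names (`lazyTensor`, `newLazy`) are illustrative stand-ins, not Born's real API, and the closure stands in for queued GPU commands.

```go
package main

import (
	"fmt"
	"runtime"
)

// lazyTensor defers its work until Data() is called, mirroring
// how LazyGPUData keeps results on the GPU between ops.
type lazyTensor struct {
	compute func() []float32 // deferred work (stands in for GPU commands)
	cached  []float32        // realized values, nil until Data() runs
}

func newLazy(f func() []float32) *lazyTensor {
	t := &lazyTensor{compute: f}
	// In Born, a finalizer like this releases the GPU buffer when a
	// tensor becomes unreachable without ever being realized.
	runtime.SetFinalizer(t, func(t *lazyTensor) { t.cached = nil })
	return t
}

// Data is the single sync point: realize once, then reuse the cache.
func (t *lazyTensor) Data() []float32 {
	if t.cached == nil {
		t.cached = t.compute()
	}
	return t.cached
}

// Add chains without realizing either operand.
func (t *lazyTensor) Add(o *lazyTensor) *lazyTensor {
	return newLazy(func() []float32 {
		a, b := t.Data(), o.Data()
		out := make([]float32, len(a))
		for i := range a {
			out[i] = a[i] + b[i]
		}
		return out
	})
}

func main() {
	a := newLazy(func() []float32 { return []float32{1, 2} })
	b := newLazy(func() []float32 { return []float32{3, 4} })
	c := a.Add(b).Add(b)  // no work happens yet
	fmt.Println(c.Data()) // [7 10]
}
```

The key property: no matter how long the chain gets, compute happens once, at the end, which is exactly why the GPU submit count drops from ~200 to 1-2.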

The result?

| Metric | Before | After |
| --- | --- | --- |
| Training step | ~90s | <5s |
| GPU submits per chain | ~200 | 1-2 |
| Memory leaks | Some | Fixed |

~18x faster. Same hardware. Same model. Just smarter data movement.


ONNX Import: Train Anywhere, Deploy in Go

The other big feature in v0.6.0: ONNX model import.

ONNX is the interchange format for ML models. Train in PyTorch, export to .onnx, load anywhere. Now "anywhere" includes Born.

import "github.com/born-ml/born/onnx"

// Load PyTorch model exported as ONNX
model, err := onnx.Load("resnet50.onnx", backend)
if err != nil {
    log.Fatal(err)
}

// Run inference
output := model.Forward(input)

What's supported:

  • 30+ ONNX operators (MatMul, Conv2D, ReLU, Softmax, Gather, Reshape...)
  • Protobuf parsing
  • Weight extraction
  • Computation graph reconstruction
  • Extensible operator registry
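The "extensible operator registry" item is the pattern that lets users plug in ONNX ops the importer doesn't ship with. A minimal sketch of that pattern, with illustrative names (`tensor`, `Register`, `Dispatch`) rather than Born's actual API:

```go
package main

import "fmt"

type tensor []float32

// opFunc is one ONNX operator: inputs in, output out.
type opFunc func(inputs []tensor) tensor

var registry = map[string]opFunc{}

// Register adds (or overrides) an operator by its ONNX node type name.
func Register(name string, f opFunc) { registry[name] = f }

func init() {
	Register("Relu", func(in []tensor) tensor {
		out := make(tensor, len(in[0]))
		for i, v := range in[0] {
			if v > 0 {
				out[i] = v
			}
		}
		return out
	})
}

// Dispatch looks up a node's op type and applies it, failing loudly
// on anything the registry doesn't know about.
func Dispatch(name string, inputs []tensor) (tensor, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unsupported ONNX op %q", name)
	}
	return f(inputs), nil
}

func main() {
	out, _ := Dispatch("Relu", []tensor{{-1, 2, -3, 4}})
	fmt.Println(out) // [0 2 0 4]
}
```

The graph loader then becomes a walk over ONNX nodes calling `Dispatch`, and supporting a new operator is one `Register` call instead of a framework change.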

What this means: you can train models in Python (because let's be real, that's where the research happens), then deploy them as single Go binaries. No Python runtime. No ONNX Runtime. Just Go.


The Full v0.6.0 Changelog

Beyond lazy evaluation and ONNX:

Raw Tensor Operations (50+ new ops):

  • Argmax, TopK
  • Type conversions (Float32 ↔ Int32 ↔ Bool)
  • Advanced indexing (Gather, Scatter)
  • NumPy-style broadcasting
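For readers new to broadcasting, the NumPy rule the list refers to is: align shapes from the right; each dimension pair must be equal, or one of them must be 1 (which stretches). A small illustrative helper (not Born's API) that computes the broadcast result shape:

```go
package main

import "fmt"

// broadcastShape returns the NumPy-style broadcast of two shapes,
// or ok=false if they are incompatible.
func broadcastShape(a, b []int) (shape []int, ok bool) {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	out := make([]int, n)
	for i := 1; i <= n; i++ { // walk dimensions right-to-left
		da, db := 1, 1 // missing leading dims count as 1
		if i <= len(a) {
			da = a[len(a)-i]
		}
		if i <= len(b) {
			db = b[len(b)-i]
		}
		switch {
		case da == db, db == 1:
			out[n-i] = da
		case da == 1:
			out[n-i] = db
		default:
			return nil, false // neither equal nor stretchable
		}
	}
	return out, true
}

func main() {
	s, ok := broadcastShape([]int{8, 1, 6}, []int{7, 1})
	fmt.Println(s, ok) // [8 7 6] true
}
```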

GPU-to-GPU Copy:

  • CopyBufferToBuffer for direct GPU memory transfer
  • No more GPU→CPU→GPU round-trips in lazy chains

Bug Fixes:

  • Fixed GPU memory leak when lazy tensors go out of scope
  • Fixed typed accessors bypassing lazy realization
  • Fixed Where and Sum operations missing lazy mode support

Tests:

  • 15+ new ONNX tests
  • Lazy mode chain tests
  • Command batching tests

What's Next: v0.7.0 Roadmap

We're not slowing down. v0.7.0 (targeting January 2026) focuses on inference optimization:

Flash Attention 2

  • Tiled attention algorithm
  • O(N) memory instead of O(N²)
  • 128K+ context support
  • 2x speedup over standard attention
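The O(N) memory claim comes from the online-softmax trick at the heart of Flash Attention: stream over keys/values in blocks, keeping only a running max, a running normalizer, and an accumulator, so the full N×N score matrix is never materialized. A plain-Go sketch for a single query row (illustrative, not Born's planned shader):

```go
package main

import (
	"fmt"
	"math"
)

// attendRow computes softmax(q·Kᵀ)·V for one query, visiting keys in
// blocks and rescaling the running state whenever the max changes.
func attendRow(q []float32, ks, vs [][]float32, block int) []float32 {
	d := len(q)
	m := math.Inf(-1)        // running max of scores
	l := 0.0                 // running softmax denominator
	acc := make([]float64, d) // running weighted sum of values

	for start := 0; start < len(ks); start += block {
		end := start + block
		if end > len(ks) {
			end = len(ks)
		}
		for i := start; i < end; i++ {
			s := 0.0 // score = q · k_i
			for j := 0; j < d; j++ {
				s += float64(q[j]) * float64(ks[i][j])
			}
			// Online softmax update: rescale old state to the new max.
			newM := math.Max(m, s)
			scale := math.Exp(m - newM)
			w := math.Exp(s - newM)
			l = l*scale + w
			for j := 0; j < d; j++ {
				acc[j] = acc[j]*scale + w*float64(vs[i][j])
			}
			m = newM
		}
	}
	out := make([]float32, d)
	for j := 0; j < d; j++ {
		out[j] = float32(acc[j] / l)
	}
	return out
}

func main() {
	q := []float32{1, 0}
	ks := [][]float32{{1, 0}, {0, 1}}
	vs := [][]float32{{10, 0}, {0, 10}}
	fmt.Println(attendRow(q, ks, vs, 1)) // ≈ [7.31 2.69]
}
```

Memory per row is O(d) regardless of sequence length, which is what makes 128K+ contexts feasible.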

Speculative Decoding

  • Draft model generates K tokens
  • Target model verifies in parallel
  • 2-4x inference speedup
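The draft/verify loop above can be sketched in a few lines. This is the greedy-acceptance variant for clarity (real implementations accept draft tokens probabilistically against the target's distribution, and batch the verification into one forward pass); `draftNext`/`targetNext` are hypothetical stand-ins, not Born APIs.

```go
package main

import "fmt"

// speculate runs one round of speculative decoding: the draft model
// proposes k tokens, the target keeps the longest agreeing prefix and
// substitutes its own token at the first mismatch.
func speculate(ctx []int, k int,
	draftNext func([]int) int,
	targetNext func([]int) int) []int {

	// 1. Draft proposes k tokens autoregressively (cheap model).
	draft := append([]int(nil), ctx...)
	proposed := make([]int, 0, k)
	for i := 0; i < k; i++ {
		t := draftNext(draft)
		proposed = append(proposed, t)
		draft = append(draft, t)
	}

	// 2. Target verifies; in a real system these checks happen in
	//    parallel in a single forward pass over all k positions.
	out := append([]int(nil), ctx...)
	for _, t := range proposed {
		want := targetNext(out)
		out = append(out, want) // target's token is always kept
		if want != t {
			break // mismatch: discard the rest of the draft
		}
	}
	return out
}

func main() {
	// Toy models: draft predicts n+1; target agrees except after 3.
	draft := func(s []int) int { return s[len(s)-1] + 1 }
	target := func(s []int) int {
		if s[len(s)-1] == 3 {
			return 100
		}
		return s[len(s)-1] + 1
	}
	fmt.Println(speculate([]int{1}, 4, draft, target)) // [1 2 3 100]
}
```

The speedup comes from the accepted prefix: every agreeing draft token costs one cheap draft step instead of one expensive target step.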

GGUF Import

  • Load llama.cpp quantized models directly
  • Q4_K_M, Q5_K_M, Q6_K support
  • Access to thousands of pre-quantized models

The goal: make Born competitive with vLLM and llama.cpp for inference, while staying pure Go.


Why This Matters

Go has excellent tooling for production systems. Kubernetes, Docker, Terraform, Prometheus - all Go. Single binary deployment. Easy cross-compilation. Strong typing.

But ML? You had to leave Go, write Python, then figure out how to glue it all together.

Born changes that. Train in Python if you want. But deploy in Go. Single binary. No Python interpreter. No conda environments. No "it works on my machine."

// Your entire ML inference pipeline
model := born.Load("model.born")
http.HandleFunc("/predict", func(w http.ResponseWriter, r *http.Request) {
    input := parseRequest(r)
    output := model.Predict(input)
    json.NewEncoder(w).Encode(output)
})
log.Fatal(http.ListenAndServe(":8080", nil))

That's it. go build. Deploy anywhere Go runs.


Try It Yourself

git clone https://github.com/born-ml/born
cd born
make build
make test

Or jump straight to examples:

cd examples/mnist && go run .      # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run .  # CNN: 98.18% accuracy

Get Involved

Born is open source and we need your help to make it better.

Found a bug? Open an issue - we fix them fast.

Have an idea? Start a discussion.

Want to contribute? PRs are welcome.

Just want to follow progress? Star the repo.


The Go ML ecosystem is growing. Whether you're a Go developer curious about ML, or an ML engineer tired of deployment hell - give Born a try. Report bugs. Suggest features. Help us build something great.

Let's make Go a first-class citizen in the ML world.

Star us on GitHub: github.com/born-ml/born
