DEV Community

fab2s

Why I built a Rust deep learning framework (and what I got wrong twice first)

The Python script that made me give up had more boilerplate for freezing and re-composing submodules than it did for the actual model. I'd already pivoted three times, and the next pivot was going to cost another rewrite. That's the day I decided to build this in Rust.

I should say upfront: before this, I had never trained a deep learning model. The path to here was unusual. A theoretical physics degree (with a PhD grant I turned down), then a long detour through documentary film and independent cinema, then self-taught software engineering and twenty years architecting scalable data systems through the startup-scaling era. Pattern recognition across domains is the thing I trust most in my own thinking. A wide-focus lens.

What I'm building flodl for is research called FBRL, Feedback Recursive Loops. It started as a hobby to explore the field. For now the shape is what matters. Mixing modalities. Feedback loops: images read, classified, and reproduced to force honest attention. Composition that goes letter, then word, then line, then paragraph. Each level frozen and used as an oracle for the level above. The vision was always nested, always partially-frozen, always graph-shaped. That shape is what broke Python for me.

The Python dead end

I started with sound and vision mixed together. That failed. I reframed to a foveal approach: the model reads letters by attention, and at each step it also tries to reproduce what it read. The reproduction forces more abstract latent representations. The letter model is the most developed part of this work so far.

Composition was always the next step. Read a letter. Then read a word that reuses the frozen letter reader. Then read a line that reuses both. Each level adds capability while the frozen levels below stay reliable.

Before I even tried to write the composition code, the Python script for the letter model alone had exploded in complexity. Every architectural pivot, and there were many, added more boilerplate than it removed. Per-op dispatch overhead was biting on top of that, especially in the recurrent attention loop where I was making hundreds of small kernel calls per training step.

What I actually wanted was a way to describe the network. Not procedurally assemble it from instances and module hierarchies. Describe it. What's its structure? What's tagged? What's frozen? What loads from what checkpoint? Looking at the kind of object I needed to express, the answer was obvious. This is a graph, not a script.
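To make "describe, don't assemble" concrete, here is a minimal sketch of a network-as-data description. This is a hypothetical illustration, not flodl's FlowBuilder API: the struct names, the `frozen` flag, and the checkpoint field are all assumptions made up for this example.

```rust
// Hypothetical sketch — NOT the real FlowBuilder API. It shows the shape of
// a network described as inspectable data rather than assembled procedurally.

#[derive(Debug, Clone)]
struct NodeSpec {
    name: &'static str,
    frozen: bool,                     // train this node, or hold it fixed
    checkpoint: Option<&'static str>, // optionally load weights from a prior run
}

#[derive(Debug, Default)]
struct GraphSpec {
    nodes: Vec<NodeSpec>,
    edges: Vec<(&'static str, &'static str)>, // data flow: from -> to
}

impl GraphSpec {
    fn node(mut self, name: &'static str, frozen: bool, checkpoint: Option<&'static str>) -> Self {
        self.nodes.push(NodeSpec { name, frozen, checkpoint });
        self
    }
    fn edge(mut self, from: &'static str, to: &'static str) -> Self {
        self.edges.push((from, to));
        self
    }
    fn trainable(&self) -> Vec<&'static str> {
        self.nodes.iter().filter(|n| !n.frozen).map(|n| n.name).collect()
    }
}

fn main() {
    // The composition shape: a frozen letter reader feeding a trainable word reader.
    let g = GraphSpec::default()
        .node("letter_reader", true, Some("letter.ckpt"))
        .node("word_reader", false, None)
        .edge("letter_reader", "word_reader");

    // A pivot becomes an edit to this description, not a rewrite of a script.
    assert_eq!(g.trainable(), vec!["word_reader"]);
    println!("trainable: {:?}", g.trainable());
}
```

The payoff of this shape is that questions like "what is frozen?" or "what loads from where?" become queries over data instead of a read of procedural construction code.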

I had a prior on this. A few years back I'd built a graph library for a data-processing project, and I knew what the graph-shaped headspace felt like. The itch was familiar.

Two false starts

Python failed me first, Go second. The project was called goDL. It taught me more than it shipped.

The thing that killed it was the GC trap. Garbage collection and GPU memory ownership do not compose. You end up with tensors the garbage collector thinks are dead but that the GPU is still using, or the inverse, tensors the GPU is done with but that the GC won't clean for another generation. You can layer manual lifetime management on top, but at that point you've reinvented Rust ownership in a language that's actively fighting you about it.
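What Rust gives instead is deterministic destruction: a value's `Drop` runs at a known point, the end of its scope, not whenever a collector gets around to it. A minimal sketch with a simulated allocator (in a real framework the `Drop` impl would return device memory, e.g. via `cudaFree` behind the FFI; `DeviceBuffer` and the counter are inventions for this example):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Simulated count of live device allocations. A real framework would be
// calling into the CUDA allocator here; this just makes the timing visible.
static LIVE: AtomicUsize = AtomicUsize::new(0);

#[allow(dead_code)]
struct DeviceBuffer {
    bytes: usize,
}

impl DeviceBuffer {
    fn alloc(bytes: usize) -> Self {
        LIVE.fetch_add(1, Ordering::SeqCst);
        DeviceBuffer { bytes }
    }
}

impl Drop for DeviceBuffer {
    // Runs at a statically known point: end of scope.
    // No waiting for a GC generation, no finalizer nondeterminism.
    fn drop(&mut self) {
        LIVE.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _t = DeviceBuffer::alloc(4 * 1024 * 1024);
        assert_eq!(LIVE.load(Ordering::SeqCst), 1);
    } // _t dropped here; the "device memory" is released immediately
    assert_eq!(LIVE.load(Ordering::SeqCst), 0);
    println!("all buffers freed deterministically");
}
```

This is exactly the property the GC trap denies you: the lifetime of the GPU allocation is the lifetime of the owning value, checked at compile time.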

So: Rust. Then flodl.

The libtorch FFI bet

Several Rust deep learning frameworks exist already. Pure-Rust GPU paths are real and the people building them are doing serious work. None of them, when I started, gave me what I wanted.

The bet I made for flodl was libtorch FFI through a thin C++ shim. It is not pure Rust. It does not run on every backend. It inherits libtorch's memory footprint. What I get in exchange is CUDA parity today. NCCL today. Tensor Cores today. Mixed precision, CUDA Graphs, fused multi-tensor optimizers. Not in six months. Now.

I came to programming through the startup-scaling era, where the daily question was how to architect systems that hold up at volume. Shipping production-grade systems is what I do know. The deep learning math I'm still learning as I build. When I chose libtorch FFI, it was the shipping instinct talking: stand on a battle-tested C++ library, and you get production-grade performance today rather than hoping a pure-Rust kernel path catches up over the next few release cycles.

That bet pays off in measurable ways. There's a benchmark suite comparing flodl to PyTorch across ten architectures; more on that later. For now the point is just this: libtorch FFI was a deliberate choice with known costs, not a shortcut.

What flodl looks like today

flodl has, today:

  • Tensor and autograd backed by libtorch. 100+ tensor operations, 90+ differentiable.
  • nn modules at rough PyTorch parity: activations, losses, optimizers (SGD, Adam, AdamW, RMSprop, RAdam, NAdam, with fused CUDA kernels), conv (1d/2d/3d, transposed), recurrent (GRU and LSTM cells and full sequences), attention, normalization (batch, layer, group, instance, RMS), pooling, dropout variants, embedding.
  • A declarative graph DSL called FlowBuilder, with visualization.
  • Hierarchical graph composition with selective freeze and partial checkpoint loading. The thing the FBRL composition shape needs.
  • Transparent multi-GPU training on heterogeneous hardware. One training loop, one or N GPUs.
  • Production niceties: mixed precision, CUDA Graphs, fused optimizers, async data prefetch.
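One item in that list deserves a concrete picture: partial checkpoint loading. A common way to implement it (and the convention sketched here is an assumption, not flodl's actual format) is a flat parameter map keyed by hierarchical names, where a composed graph pulls only the entries whose names belong to one sub-graph:

```rust
use std::collections::HashMap;

// Hypothetical illustration: select one sub-graph's parameters out of a
// full checkpoint by name prefix. The "module.param" naming convention
// and the function itself are assumptions made for this sketch.
fn load_subset(
    checkpoint: &HashMap<String, Vec<f32>>,
    prefix: &str,
) -> HashMap<String, Vec<f32>> {
    checkpoint
        .iter()
        .filter(|(name, _)| name.starts_with(prefix))
        .map(|(name, w)| (name.clone(), w.clone()))
        .collect()
}

fn main() {
    let mut ckpt = HashMap::new();
    ckpt.insert("letter_reader.embed.weight".to_string(), vec![0.1, 0.2]);
    ckpt.insert("letter_reader.attn.weight".to_string(), vec![0.3]);
    ckpt.insert("word_reader.head.weight".to_string(), vec![0.4]);

    // The word-level model reuses the frozen letter reader:
    // load only the letter reader's weights from the earlier run.
    let letter = load_subset(&ckpt, "letter_reader.");
    assert_eq!(letter.len(), 2);
    assert!(letter.contains_key("letter_reader.embed.weight"));
    println!("loaded {} letter-reader tensors", letter.len());
}
```

Paired with a freeze flag on the same sub-graph, this is the mechanism the letter-then-word-then-line composition needs: lower levels keep their trained weights and stay fixed while the level above trains.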

Saying that as a list does not do the work justice. The point is that flodl has crossed the line from "can I express what I want" to "does the framework hold up under real workloads I care about." It does.

flodl is two months old. The velocity comes from AI collaboration on implementation. I'm the architect and the decision-maker. The bets, the API shape, the priorities, the truth-discipline this series will hold itself to: those are mine. Many of the lines of code: not. The pace is what AI partnership makes possible, and I'll be explicit about that throughout the series.

What hooked me

I started flodl because FBRL needed it. Building it pulled me into questions I didn't expect: ergonomics, performance, distributed training, convergence under heterogeneous compute. That is the rest of this series: walking through what got built and why.

For now the simplest thing I can say about why flodl exists is the thing that has stayed true through three Python rewrites, one failed Go attempt, and the Rust work that became flodl:

With flodl I don't rewrite when I pivot. I add or remove a graph member.

Next post: FlowBuilder, and what a declarative graph DSL actually looks like in Rust.

flodl: flodl.dev · github.com/fab2s/flodl · @flodl_dev
