I built a local-first AI toolkit in pure Rust — here's what I learned

#llm #rust #machinelearning #opensource

I got tired of the same cycle every time I wanted to run a local LLM:

pip install breaking my entire environment
2GB+ Python dependencies just to get a single inference
300ms+ cold starts before generating a single token
Ollama as a required daemon just to chat with a model

So I built GwenLand — a local-first AI developer toolkit
written entirely in pure Rust. No Python runtime. No Ollama.
No setup drama.

Fair warning: this is early-stage and experimental.
But the benchmarks are real.

The Specs

Metric	GwenLand	Typical Python stack
Binary size	~10MB stripped	2GB+ environment
Cold start	~9ms	300ms+
Dequant throughput	~9.8 GiB/s avg	depends on llama.cpp
Python required	❌	✅
Single binary	✅	❌

Why Rust?

Honest answer: I wanted to see if it was possible.

The common assumption is that serious ML tooling needs Python —
for ecosystem, for flexibility, for "that's just how it's done."
I disagree. Your local machine deserves better than
Python overhead.

Rust gave me:

Predictable memory without a GC
SIMD intrinsics without dropping into C
A single stripped binary I can put on a USB drive
Compile-time memory- and type-safety guarantees around my dequant code (the math itself still gets checked by tests)

The Hardest Part: GGUF Dequantization

This is where most Rust ML projects give up and call into
llama.cpp. I didn't want that dependency.

I wrote GGQR-CF-mmap — a custom dequantization kernel using:

mmap for model loading that doesn't have to fit in RAM (the OS pages data in from the SSD on demand)
AVX2 SIMD (__m256d) for parallel f64 dequant
MADV_SEQUENTIAL hint for sequential SSD reads
64KB chunk size (cache-friendly; a multiple of the 64-byte cache line)
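
To make the AVX2 item concrete, here is a minimal sketch of an f64 dequant path built on __m256d. It is not the GGQR-CF-mmap kernel: the data layout (a slice of i8 values with one f64 scale) and the function names dequant_scalar, dequant_avx2, and dequant are assumptions chosen for illustration, and the mmap and MADV_SEQUENTIAL parts are left out because they need platform bindings beyond std.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference path: out[i] = scale * q[i].
/// (Hypothetical layout: one f64 scale per slice of i8 values.)
fn dequant_scalar(q: &[i8], scale: f64, out: &mut [f64]) {
    assert_eq!(q.len(), out.len());
    for (o, &v) in out.iter_mut().zip(q) {
        *o = scale * f64::from(v);
    }
}

/// AVX2 path: widens 4 x i8 to 4 x f64 and scales them in one __m256d.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_avx2(q: &[i8], scale: f64, out: &mut [f64]) {
    assert_eq!(q.len(), out.len());
    let vscale = _mm256_set1_pd(scale);
    let lanes = q.len() / 4 * 4;
    for i in (0..lanes).step_by(4) {
        // Pack four i8 values into the low 32 bits of an XMM register.
        let word =
            i32::from_le_bytes([q[i] as u8, q[i + 1] as u8, q[i + 2] as u8, q[i + 3] as u8]);
        let ints = _mm_cvtepi8_epi32(_mm_cvtsi32_si128(word)); // sign-extend to 4 x i32
        let vals = _mm256_mul_pd(_mm256_cvtepi32_pd(ints), vscale); // 4 x f64, scaled
        // SAFETY: i + 4 <= lanes <= out.len(), so the unaligned store stays in bounds.
        unsafe { _mm256_storeu_pd(out.as_mut_ptr().add(i), vals) };
    }
    // Leftover elements (fewer than 4) take the scalar path.
    dequant_scalar(&q[lanes..], scale, &mut out[lanes..]);
}

/// Picks the AVX2 path when the CPU supports it, otherwise the scalar one.
fn dequant(q: &[i8], scale: f64, out: &mut [f64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            unsafe { dequant_avx2(q, scale, out) };
            return;
        }
    }
    dequant_scalar(q, scale, out);
}
```

Each output is a single multiply, so the scalar and AVX2 paths should produce bit-identical results. That property is what makes a cross-path checksum comparison meaningful.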

Benchmark results on my machine (i3, no GPU):
full_dequant: 4.3 GiB/s (+433% vs baseline)
parallel: 9.7 GiB/s (+198%)
peak f64 AVX2: 11.18 GiB/s
avg: 9.82 GiB/s

For reference, llama.cpp hits ~5.0–9.0 GiB/s on the same machine.
We're in the same ballpark. Pure Rust.

Correctness check: the output checksum (sum=340913024) is identical across all code paths.
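
A cross-path check like this can be sketched in a few lines. The version below is an assumption about the shape of such a test, not the project's benchmark code: dequant_full, dequant_parallel, checksum, and paths_agree are hypothetical names, and the checksum adds raw f64 bit patterns with wrapping arithmetic so the result does not depend on summation order (plain floating-point addition is not associative, so a chunked sum could differ in the last bits).

```rust
use std::thread;

/// 64KB of quantized input per chunk, matching the chunk size in the post.
const CHUNK: usize = 64 * 1024;

/// Single-threaded path over the whole buffer.
fn dequant_full(q: &[i8], scale: f64, out: &mut [f64]) {
    for (o, &v) in out.iter_mut().zip(q) {
        *o = scale * f64::from(v);
    }
}

/// Chunked path: one scoped thread per chunk.
/// (A real kernel would more likely use a fixed-size thread pool.)
fn dequant_parallel(q: &[i8], scale: f64, out: &mut [f64]) {
    thread::scope(|s| {
        for (qc, oc) in q.chunks(CHUNK).zip(out.chunks_mut(CHUNK)) {
            s.spawn(move || dequant_full(qc, scale, oc));
        }
    });
}

/// Order-independent checksum over the exact bit patterns of the output.
fn checksum(out: &[f64]) -> u64 {
    out.iter().fold(0u64, |acc, x| acc.wrapping_add(x.to_bits()))
}

/// Runs both paths on the same input and compares their checksums.
fn paths_agree(q: &[i8], scale: f64) -> bool {
    let mut a = vec![0.0; q.len()];
    let mut b = vec![0.0; q.len()];
    dequant_full(q, scale, &mut a);
    dequant_parallel(q, scale, &mut b);
    checksum(&a) == checksum(&b)
}
```

A matching checksum shows the code paths agree with each other. It does not show that they agree with a reference implementation, so comparing against a known-good dequantizer is still worth doing separately.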

The Closure vs Trait Tradeoff

This one's in the comments of runner.rs and I'll be honest
about it here too.

My model dispatch looks like this:

match kind {
    ModelKind::LLaMA3   => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Mistral  => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Qwen     => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Phi3     => run_quantized_loop(cfg, loaded, ...),
}

They all call the same function. Why not a trait?

Because candle-transformers model types don't share a common
trait for the forward pass. Boxing a closure that owns the model
is simpler than defining a new dispatch trait just for this.
It works. It's honest. It's in the comments.
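
For readers who haven't seen the pattern, here is a minimal sketch of boxed-closure dispatch. Everything in it is a stand-in: LlamaModel, QwenModel, Forward, make_forward, and run_loop are hypothetical names, the "models" just do arithmetic, and none of this is the candle-transformers API or the actual contents of runner.rs. The point is only that two types with different forward signatures and no shared trait can feed one generation loop.

```rust
/// Stand-in model types with no common trait and different method signatures.
struct LlamaModel {
    seen: usize, // mutable state, like a KV cache would be
}

struct QwenModel {
    scale: f32,
}

impl LlamaModel {
    fn forward(&mut self, tokens: &[u32]) -> f32 {
        let out = (tokens.len() + self.seen) as f32;
        self.seen += 1;
        out
    }
}

impl QwenModel {
    fn forward_pass(&self, tokens: &[u32], pos: usize) -> f32 {
        (tokens.iter().sum::<u32>() as f32 + pos as f32) * self.scale
    }
}

enum ModelKind {
    Llama,
    Qwen,
}

/// One uniform call shape for the loop; each closure owns its model.
type Forward = Box<dyn FnMut(&[u32], usize) -> f32>;

/// Each arm adapts a different model API to the same closure type.
fn make_forward(kind: ModelKind) -> Forward {
    match kind {
        ModelKind::Llama => {
            let mut model = LlamaModel { seen: 0 };
            Box::new(move |tokens: &[u32], _pos: usize| model.forward(tokens))
        }
        ModelKind::Qwen => {
            let model = QwenModel { scale: 2.0 };
            Box::new(move |tokens: &[u32], pos: usize| model.forward_pass(tokens, pos))
        }
    }
}

/// The shared loop never learns which model it is driving.
fn run_loop(mut forward: Forward, prompt: &[u32], steps: usize) -> Vec<f32> {
    (0..steps).map(|pos| forward(prompt, pos)).collect()
}
```

The cost is one indirect call per step, which is negligible next to a real forward pass. If the set of models grows or the loop needs more than one operation per model, a small dispatch trait may become the cleaner option.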

What's Coming Next

The current inference backend is candle-transformers.
It works, but model coverage is limited.

Next milestone: replace it with mistral.rs as the inference
engine — which supports Qwen3, LLaMA3, Gemma, Phi, and more
out of the box. candle stays for LoRA training. GGQR stays
for dequant. Best of three.

Full pipeline will be:
GGUF file
↓ GGQR-CF-mmap (dequant, ~9.8 GiB/s)
↓ mistral.rs (inference, multi-arch)
↓ candle (LoRA training)

The Philosophy

"Your machine. Your models. Your rules."

Local AI shouldn't require a PhD in DevOps to set up.
It shouldn't need Python, Ollama, CUDA drama, or a
beefy internet connection. It should be one binary,
one command, and it runs.

GwenLand is my attempt at that.