I Built a Local-First AI Toolkit in Pure Rust — Here's What I Learned
I got tired of the same cycle every time I wanted to run a local LLM:
- `pip install` breaking my entire environment
- 2GB+ Python dependencies just to get a single inference
- 300ms+ cold starts before generating a single token
- Ollama as a required daemon just to chat with a model
So I built GwenLand — a local-first AI developer toolkit
written entirely in pure Rust. No Python runtime. No Ollama.
No setup drama.
Fair warning: this is early-stage and experimental.
But the benchmarks are real.
The Specs
| Metric | GwenLand | Typical Python stack |
|---|---|---|
| Binary size | ~10MB stripped | 2GB+ environment |
| Cold start | ~9ms | 300ms+ |
| Dequant throughput | ~9.8 GiB/s avg | depends on llama.cpp |
| Python required | ❌ | ✅ |
| Single binary | ✅ | ❌ |
Why Rust?
Honest answer: I wanted to see if it was possible.
The common assumption is that serious ML tooling needs Python —
for ecosystem, for flexibility, for "that's just how it's done."
I disagree. Your local machine deserves better than
Python overhead.
Rust gave me:
- Predictable memory without a GC
- SIMD intrinsics without dropping into C
- A single stripped binary I can put on a USB drive
- Compile-time type and memory-safety checks around my dequant math (numerical correctness still needs tests)
The Hardest Part: GGUF Dequantization
This is where most Rust ML projects give up and call into
llama.cpp. I didn't want that dependency.
I wrote GGQR-CF-mmap — a custom dequantization kernel using:
- `mmap` for OOM-safe model loading (graceful SSD fallback)
- AVX2 SIMD (`__m256d`) for parallel f64 dequant
- `MADV_SEQUENTIAL` hint for sequential SSD reads
- 64KB chunk size (cache-line optimized)
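To make the AVX2 bullet concrete, here is a minimal sketch of widening 8-bit quants to f64 through `__m256d`, with a scalar fallback chosen at runtime. It is not the GGQR-CF-mmap kernel: `BlockQ8`, `dequant_scalar`, `dequant_avx2`, and `dequant_block` are hypothetical names, and the block layout is a simplified Q8_0-style one that assumes an f32 scale (real GGUF Q8_0 stores f16) so the example stays std-only. The `mmap` and `MADV_SEQUENTIAL` parts are left out because they need an external crate such as `memmap2` or `libc`.

```rust
/// Hypothetical, simplified Q8_0-style block: 32 signed quants sharing one scale.
/// Assumption: the scale is f32 here; real GGUF Q8_0 stores it as f16.
pub struct BlockQ8 {
    pub scale: f32,
    pub quants: [i8; 32],
}

/// Portable reference path: out[i] = quants[i] * scale, computed in f64.
pub fn dequant_scalar(block: &BlockQ8, out: &mut [f64; 32]) {
    let scale = block.scale as f64;
    for (o, &q) in out.iter_mut().zip(block.quants.iter()) {
        *o = q as f64 * scale;
    }
}

/// AVX2 path: four quants per iteration, widened i8 -> i32 -> f64.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_avx2(block: &BlockQ8, out: &mut [f64; 32]) {
    use std::arch::x86_64::*;
    unsafe {
        let scale = _mm256_set1_pd(block.scale as f64);
        for (lane, four) in block.quants.chunks_exact(4).enumerate() {
            // Pack four i8 values into the low 32 bits of an XMM register.
            let word = i32::from_le_bytes([
                four[0] as u8,
                four[1] as u8,
                four[2] as u8,
                four[3] as u8,
            ]);
            let packed = _mm_cvtsi32_si128(word);
            let ints = _mm_cvtepi8_epi32(packed); // sign-extend to 4 x i32
            let vals = _mm256_cvtepi32_pd(ints); // convert to 4 x f64
            let scaled = _mm256_mul_pd(vals, scale);
            // Unaligned store into the matching four output slots.
            _mm256_storeu_pd(out.as_mut_ptr().add(lane * 4), scaled);
        }
    }
}

/// Runtime dispatch: use AVX2 when the CPU reports it, otherwise fall back.
pub fn dequant_block(block: &BlockQ8, out: &mut [f64; 32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above.
            unsafe { dequant_avx2(block, out) };
            return;
        }
    }
    dequant_scalar(block, out);
}
```

Both paths perform the same IEEE multiply per element, so `dequant_block` and `dequant_scalar` should produce bit-identical output for any given block.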
Benchmark results on my machine (i3, no GPU):
full_dequant: 4.3 GiB/s (+433% vs baseline)
parallel: 9.7 GiB/s (+198%)
peak f64 AVX2: 11.18 GiB/s
avg: 9.82 GiB/s
For reference, llama.cpp hits ~5.0–9.0 GiB/s on the same machine.
We're in the same ballpark. Pure Rust.
Correctness check: sum=340913024 identical across all code paths.
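A cross-path check like this can be sketched as follows. This is an illustration under assumptions, not the checksum GwenLand uses: `dequant_scalar`, `dequant_parallel`, and `checksum` are hypothetical names, and the checksum here is a wrapping sum of raw f64 bit patterns, chosen because it does not depend on the order in which threads finish.

```rust
use std::thread;

/// Hypothetical scalar path: each quant times one shared scale.
fn dequant_scalar(quants: &[i8], scale: f64, out: &mut [f64]) {
    for (o, &q) in out.iter_mut().zip(quants) {
        *o = q as f64 * scale;
    }
}

/// Hypothetical parallel path: split the input across scoped worker threads.
fn dequant_parallel(quants: &[i8], scale: f64, out: &mut [f64]) {
    assert_eq!(quants.len(), out.len());
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Ceiling division; max(1) keeps chunks() from panicking on empty input.
    let per_worker = ((quants.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let pairs = quants.chunks(per_worker).zip(out.chunks_mut(per_worker));
        for (qc, oc) in pairs {
            // Each thread owns a disjoint output slice, so no locking is needed.
            s.spawn(move || dequant_scalar(qc, scale, oc));
        }
    });
}

/// Order-independent checksum: wrapping sum of the raw f64 bit patterns.
fn checksum(values: &[f64]) -> u64 {
    values
        .iter()
        .fold(0u64, |acc, v| acc.wrapping_add(v.to_bits()))
}
```

Running both paths over the same input and comparing `checksum` results catches any chunk-boundary or ordering bug in the parallel path without holding two full outputs in memory for a diff.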
The Closure vs Trait Tradeoff
This one's in the comments of runner.rs and I'll be honest
about it here too.
My model dispatch looks like this:
match kind {
    ModelKind::LLaMA3 => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Mistral => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Qwen => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Phi3 => run_quantized_loop(cfg, loaded, ...),
}
They all call the same function. Why not a trait?
Because candle-transformers model types don't share a common
trait for the forward pass. Boxing a closure that owns the model
is simpler than defining a new dispatch trait just for this.
It works. It's honest. It's in the comments.
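For readers who haven't seen the pattern, here is a minimal, self-contained sketch of boxing a closure that owns the model. None of this is the real `runner.rs` or candle-transformers code: `ToyLlama`, `ToyMistral`, `ForwardFn`, `make_forward`, and this simplified `run_quantized_loop` signature are all hypothetical stand-ins. The two toy models deliberately expose differently named forward methods to mimic types with no shared trait.

```rust
// Everything below is a hypothetical stand-in, not candle-transformers code.
const VOCAB: usize = 4;

pub enum ModelKind {
    LLaMA3,
    Mistral,
}

/// Toy models with different method names and signatures (no common trait).
struct ToyLlama;
impl ToyLlama {
    fn forward(&mut self, tokens: &[u32], _pos: usize) -> Vec<f32> {
        one_hot((last(tokens) + 1) % VOCAB)
    }
}

struct ToyMistral;
impl ToyMistral {
    fn forward_step(&mut self, tokens: &[u32]) -> Vec<f32> {
        one_hot((last(tokens) + 2) % VOCAB)
    }
}

fn last(tokens: &[u32]) -> usize {
    tokens.last().copied().unwrap_or(0) as usize
}

fn one_hot(index: usize) -> Vec<f32> {
    let mut logits = vec![0.0; VOCAB];
    logits[index] = 1.0;
    logits
}

/// One uniform call shape for the generation loop, whatever the model type.
pub type ForwardFn = Box<dyn FnMut(&[u32], usize) -> Vec<f32>>;

/// Each arm moves its model into a closure that adapts it to `ForwardFn`.
pub fn make_forward(kind: ModelKind) -> ForwardFn {
    match kind {
        ModelKind::LLaMA3 => {
            let mut model = ToyLlama;
            Box::new(move |tokens: &[u32], pos: usize| model.forward(tokens, pos))
        }
        ModelKind::Mistral => {
            let mut model = ToyMistral;
            Box::new(move |tokens: &[u32], _pos: usize| model.forward_step(tokens))
        }
    }
}

/// Greedy decoding loop that only ever sees the boxed closure.
pub fn run_quantized_loop(mut forward: ForwardFn, prompt: &[u32], steps: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..steps {
        let pos = tokens.len().saturating_sub(1);
        let logits = forward(&tokens, pos);
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap_or(0);
        tokens.push(next);
    }
    tokens
}
```

The cost is one heap allocation and a dynamic call per forward pass, which is likely negligible next to the forward pass itself. A dedicated trait would mostly start paying off once the loop needs more than one operation per model (for example, cache resets).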
What's Coming Next
The current inference backend is candle-transformers.
It works, but model coverage is limited.
Next milestone: replace it with mistral.rs as the inference
engine — which supports Qwen3, LLaMA3, Gemma, Phi, and more
out of the box. candle stays for LoRA training. GGQR stays
for dequant. Best of three.
Full pipeline will be:
GGUF file
↓ GGQR-CF-mmap (dequant, ~9.8 GiB/s)
↓ mistral.rs (inference, multi-arch)
↓ candle (LoRA training)
The Philosophy
"Your machine. Your models. Your rules."
Local AI shouldn't require a PhD in DevOps to set up.
It shouldn't need Python, Ollama, CUDA drama, or a
beefy internet connection. It should be one binary,
one command, and it runs.
GwenLand is my attempt at that.
Links
- GitHub: https://github.com/JinXSuper/gwenland
- Built with: Rust, candle, GGQR-CF-mmap (custom)
- License: MIT + Commons Clause
Feedback welcome, especially the brutal kind.
This is experimental and I want to get the architecture
right before locking in the API.
