DEV Community

JinX Super

I Built a Local-First AI Toolkit in Pure Rust — Here's What I Learned

I got tired of the same cycle every time I wanted to run a local LLM:

  • pip install breaking my entire environment
  • 2GB+ of Python dependencies just to run a single inference
  • 300ms+ cold starts before generating a single token
  • Ollama as a required daemon just to chat with a model

So I built GwenLand — a local-first AI developer toolkit
written entirely in pure Rust. No Python runtime. No Ollama.
No setup drama.

Fair warning: this is early-stage and experimental.
But the benchmarks are real.


The Specs

Metric              GwenLand          Typical Python stack
Binary size         ~10MB stripped    2GB+ environment
Cold start          ~9ms              300ms+
Dequant throughput  ~9.8 GiB/s avg    depends on llama.cpp
Python required     No                Yes
Single binary       Yes               No

Why Rust?

Honest answer: I wanted to see if it was possible.

The common assumption is that serious ML tooling needs Python —
for ecosystem, for flexibility, for "that's just how it's done."
I disagree. Your local machine deserves better than
Python overhead.

Rust gave me:

  • Predictable memory without a GC
  • SIMD intrinsics without dropping into C
  • A single stripped binary I can put on a USB drive
  • Compile-time type and memory-safety checks around my dequant code (the math itself still gets verified at runtime, see the checksum below)

The Hardest Part: GGUF Dequantization

This is where most Rust ML projects give up and call into
llama.cpp. I didn't want that dependency.

I wrote GGQR-CF-mmap — a custom dequantization kernel using:

  • mmap for OOM-safe model loading (pages are read from the SSD on demand and can be evicted under memory pressure)
  • AVX2 SIMD (__m256d) for parallel f64 dequant
  • MADV_SEQUENTIAL hint for sequential SSD reads
  • 64KB chunk size (a multiple of both the 64-byte cache line and the 4KB page)
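
To make the AVX2 bullet concrete, here's a minimal sketch of the idea. It is not GwenLand's actual kernel. It assumes a simplified, hypothetical block layout (one f32 scale plus 32 i8 quants; real GGUF quant types differ, and Q8_0 for example stores its scale as f16), and the names Block, dequant_scalar, dequant_avx2 and dequant are made up for illustration. The mmap and MADV_SEQUENTIAL parts are left out because they need a crate such as libc or memmap2.

```rust
/// Hypothetical simplified block: one scale plus 32 signed 8-bit quants.
/// (Assumption for illustration; not a real GGUF quant layout.)
pub const BLOCK_LEN: usize = 32;

pub struct Block {
    pub scale: f32,
    pub quants: [i8; BLOCK_LEN],
}

/// Portable scalar path: out[i] = quants[i] * scale, widened to f64.
pub fn dequant_scalar(block: &Block, out: &mut [f64; BLOCK_LEN]) {
    let scale = block.scale as f64;
    for (o, &q) in out.iter_mut().zip(block.quants.iter()) {
        *o = q as f64 * scale;
    }
}

/// SIMD path: four f64 lanes per __m256d register.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_avx2(block: &Block, out: &mut [f64; BLOCK_LEN]) {
    use std::arch::x86_64::*;
    unsafe {
        let scale = _mm256_set1_pd(block.scale as f64);
        let q = &block.quants;
        for i in (0..BLOCK_LEN).step_by(4) {
            // Pack four i8 quants into the low 32 bits of an XMM register.
            let packed = i32::from_le_bytes([
                q[i] as u8,
                q[i + 1] as u8,
                q[i + 2] as u8,
                q[i + 3] as u8,
            ]);
            let bytes = _mm_cvtsi32_si128(packed);
            let ints = _mm_cvtepi8_epi32(bytes); // sign-extend i8 -> i32
            let vals = _mm256_cvtepi32_pd(ints); // i32 -> f64, exact
            _mm256_storeu_pd(out.as_mut_ptr().add(i), _mm256_mul_pd(vals, scale));
        }
    }
}

/// Picks the AVX2 path when the CPU supports it, else the scalar path.
pub fn dequant(block: &Block, out: &mut [f64; BLOCK_LEN]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: guarded by the runtime feature check above.
            unsafe { dequant_avx2(block, out) };
            return;
        }
    }
    dequant_scalar(block, out);
}
```

The runtime check means the same binary still runs on CPUs without AVX2. Both paths do the same exact widening followed by one IEEE multiply, so their outputs match bit for bit.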

Benchmark results on my machine (i3, no GPU):
full_dequant: 4.3 GiB/s (+433% vs baseline)
parallel: 9.7 GiB/s (+198%)
peak f64 AVX2: 11.18 GiB/s
avg: 9.82 GiB/s
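
For anyone who wants to reproduce numbers in this shape, here's a rough sketch of the arithmetic behind "GiB/s" and "+N%". The helper names (gib_per_sec, percent_gain, measure) are hypothetical, and this is an assumption about how such figures are usually derived, not the harness that produced the results above.

```rust
use std::time::{Duration, Instant};

/// Bytes processed per second, in GiB/s (1 GiB = 2^30 bytes).
pub fn gib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / elapsed.as_secs_f64() / (1u64 << 30) as f64
}

/// Relative gain over a baseline as a percentage.
/// A gain of +433% means the measured value is 5.33x the baseline.
pub fn percent_gain(baseline: f64, measured: f64) -> f64 {
    (measured / baseline - 1.0) * 100.0
}

/// Times a single call of `work` that processes `bytes` bytes of input.
/// A real benchmark would repeat the call and discard warm-up runs.
pub fn measure<F: FnMut()>(bytes: u64, mut work: F) -> f64 {
    let start = Instant::now();
    work();
    gib_per_sec(bytes, start.elapsed())
}
```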

For reference, llama.cpp hits ~5.0–9.0 GiB/s on the same machine.
We're in the same ballpark. Pure Rust.

Correctness check: the checksum (sum=340913024) is identical across all code paths.
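
A check like this can be sketched as follows. The sketch assumes an integer checksum (which does not depend on summation order, unlike a float sum) and a trivial stand-in dequant step. How GwenLand actually computes its sum isn't spelled out here, so treat checksum, dequant_sequential and dequant_parallel as hypothetical names.

```rust
use std::thread;

/// Elements per work unit, mirroring the 64KB chunk size mentioned above.
pub const CHUNK: usize = 64 * 1024;

/// Stand-in dequant step: widen each i8 quant and apply a scale.
fn dequant_chunk(quants: &[i8], scale: f64, out: &mut [f64]) {
    for (o, &q) in out.iter_mut().zip(quants) {
        *o = q as f64 * scale;
    }
}

/// Order-independent checksum: sum of the values rounded to integers.
pub fn checksum(values: &[f64]) -> i64 {
    values.iter().map(|v| v.round() as i64).sum()
}

/// Code path 1: one pass over the whole buffer.
pub fn dequant_sequential(quants: &[i8], scale: f64) -> Vec<f64> {
    let mut out = vec![0.0; quants.len()];
    dequant_chunk(quants, scale, &mut out);
    out
}

/// Code path 2: one scoped thread per chunk.
/// (Keeps the sketch short; a real kernel would use a fixed-size pool.)
pub fn dequant_parallel(quants: &[i8], scale: f64) -> Vec<f64> {
    let mut out = vec![0.0; quants.len()];
    thread::scope(|s| {
        for (src, dst) in quants.chunks(CHUNK).zip(out.chunks_mut(CHUNK)) {
            s.spawn(move || dequant_chunk(src, scale, dst));
        }
    });
    out
}
```

Comparing checksum(&dequant_sequential(..)) with checksum(&dequant_parallel(..)) on the same input is a cheap way to catch chunk-boundary bugs before looking at throughput.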


The Closure vs Trait Tradeoff

This one's in the comments of runner.rs and I'll be honest
about it here too.

My model dispatch looks like this:

match kind {
    ModelKind::LLaMA3   => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Mistral  => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Qwen     => run_quantized_loop(cfg, loaded, ...),
    ModelKind::Phi3     => run_quantized_loop(cfg, loaded, ...),
}

They all call the same function. Why not a trait?

Because candle-transformers model types don't share a common
trait for the forward pass. Boxing a closure that owns the model
is simpler than defining a new dispatch trait just for this.
It works. It's honest. It's in the comments.
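
Here's a stripped-down sketch of that tradeoff, under stated assumptions: LlamaLike and QwenLike are stand-ins for model structs with unrelated forward methods (no candle types involved), and ForwardFn, load and run_loop are hypothetical names, not the contents of runner.rs.

```rust
// Stand-in model types with deliberately different forward-pass methods,
// mimicking structs that share no common trait.
pub struct LlamaLike {
    bias: u32,
}

pub struct QwenLike {
    offset: u32,
}

impl LlamaLike {
    pub fn forward(&mut self, token: u32) -> u32 {
        token + self.bias
    }
}

impl QwenLike {
    pub fn forward_step(&mut self, token: u32, pos: usize) -> u32 {
        token + self.offset + pos as u32
    }
}

pub enum ModelKind {
    Llama,
    Qwen,
}

/// Type-erased forward pass: (token, position) -> next token.
pub type ForwardFn = Box<dyn FnMut(u32, usize) -> u32>;

/// Each arm builds its own model and hides it inside a boxed closure.
pub fn load(kind: ModelKind) -> ForwardFn {
    match kind {
        ModelKind::Llama => {
            let mut model = LlamaLike { bias: 1 };
            Box::new(move |tok: u32, _pos: usize| model.forward(tok))
        }
        ModelKind::Qwen => {
            let mut model = QwenLike { offset: 10 };
            Box::new(move |tok: u32, pos: usize| model.forward_step(tok, pos))
        }
    }
}

/// One generation loop shared by every architecture.
pub fn run_loop(mut forward: ForwardFn, prompt_token: u32, steps: usize) -> Vec<u32> {
    let mut tok = prompt_token;
    (0..steps)
        .map(|pos| {
            tok = forward(tok, pos);
            tok
        })
        .collect()
}
```

The loop only sees a ForwardFn, so adding an architecture means adding one match arm rather than a new trait plus an impl per model. The cost is one dynamic call per step, which is small next to a forward pass.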


What's Coming Next

The current inference backend is candle-transformers.
It works, but model coverage is limited.

Next milestone: replace it with mistral.rs as the inference
engine — which supports Qwen3, LLaMA3, Gemma, Phi, and more
out of the box. candle stays for LoRA training. GGQR stays
for dequant. Best of three.

Full pipeline will be:
GGUF file
↓ GGQR-CF-mmap (dequant, ~9.8 GiB/s)
↓ mistral.rs (inference, multi-arch)
↓ candle (LoRA training)


The Philosophy

"Your machine. Your models. Your rules."

Local AI shouldn't require a PhD in DevOps to set up.
It shouldn't need Python, Ollama, CUDA drama, or a
beefy internet connection. It should be one binary,
one command, and it runs.

GwenLand is my attempt at that.


Links

Feedback welcome, especially the brutal kind.
This is experimental and I want to get the architecture
right before locking in the API.

Preview/Proof:
Built in Benchmark
