<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JinX Super</title>
    <description>The latest articles on DEV Community by JinX Super (@jinxsuper).</description>
    <link>https://dev.to/jinxsuper</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972082%2F83262462-63df-48d3-b3e2-9f39627bcd08.png</url>
      <title>DEV Community: JinX Super</title>
      <link>https://dev.to/jinxsuper</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jinxsuper"/>
    <language>en</language>
    <item>
      <title>I built a local-first AI toolkit in pure Rust — here's what I learned</title>
      <dc:creator>JinX Super</dc:creator>
      <pubDate>Sun, 07 Jun 2026 05:26:51 +0000</pubDate>
      <link>https://dev.to/jinxsuper/i-built-a-local-first-ai-toolkit-in-pure-rust-heres-what-i-learned-5efg</link>
      <guid>https://dev.to/jinxsuper/i-built-a-local-first-ai-toolkit-in-pure-rust-heres-what-i-learned-5efg</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Local-First AI Toolkit in Pure Rust — Here's What I Learned
&lt;/h1&gt;

&lt;p&gt;I got tired of the same cycle every time I wanted to run a local LLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pip install&lt;/code&gt; breaking my entire environment&lt;/li&gt;
&lt;li&gt;2GB+ of Python dependencies just to run a single inference&lt;/li&gt;
&lt;li&gt;300ms+ cold starts before generating a single token&lt;/li&gt;
&lt;li&gt;Ollama as a required daemon just to chat with a model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built &lt;strong&gt;GwenLand&lt;/strong&gt; — a local-first AI developer toolkit &lt;br&gt;
written entirely in pure Rust. No Python runtime. No Ollama. &lt;br&gt;
No setup drama.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fair warning: this is early-stage and experimental. &lt;br&gt;
But the benchmarks are real.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  The Specs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;GwenLand&lt;/th&gt;
&lt;th&gt;Typical Python stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;~10MB stripped&lt;/td&gt;
&lt;td&gt;2GB+ environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start&lt;/td&gt;
&lt;td&gt;~9ms&lt;/td&gt;
&lt;td&gt;300ms+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dequant throughput&lt;/td&gt;
&lt;td&gt;~9.8 GiB/s avg&lt;/td&gt;
&lt;td&gt;depends on llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python required&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Why Rust?
&lt;/h2&gt;

&lt;p&gt;Honest answer: I wanted to see if it was possible.&lt;/p&gt;

&lt;p&gt;The common assumption is that serious ML tooling needs Python — &lt;br&gt;
for ecosystem, for flexibility, for "that's just how it's done." &lt;br&gt;
I disagree. Your local machine deserves better than &lt;br&gt;
Python overhead.&lt;/p&gt;

&lt;p&gt;Rust gave me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable memory without a GC&lt;/li&gt;
&lt;li&gt;SIMD intrinsics without dropping into C&lt;/li&gt;
&lt;li&gt;A single stripped binary I can put on a USB drive&lt;/li&gt;
&lt;li&gt;Compile-time memory- and type-safety checks around my dequant code (the math itself still needs tests)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Hardest Part: GGUF Dequantization
&lt;/h2&gt;

&lt;p&gt;This is where most Rust ML projects give up and call into &lt;br&gt;
llama.cpp. I didn't want that dependency.&lt;/p&gt;

&lt;p&gt;I wrote &lt;strong&gt;GGQR-CF-mmap&lt;/strong&gt; — a custom dequantization kernel using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mmap&lt;/code&gt; for model loading that avoids out-of-memory failures (the OS pages weights in from the SSD on demand)&lt;/li&gt;
&lt;li&gt;AVX2 SIMD (&lt;code&gt;__m256d&lt;/code&gt;) for parallel f64 dequant&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MADV_SEQUENTIAL&lt;/code&gt; hint for sequential SSD reads&lt;/li&gt;
&lt;li&gt;64 KB chunk size (a whole number of 64-byte cache lines)&lt;/li&gt;
&lt;/ul&gt;
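
&lt;p&gt;To make the SIMD and chunking bullets concrete, here is a minimal sketch, not the actual GGQR-CF-mmap kernel. It assumes a simplified, hypothetical layout (plain &lt;code&gt;i8&lt;/code&gt; weights with a single &lt;code&gt;f64&lt;/code&gt; scale), which is not a real GGUF quant format. The names &lt;code&gt;dequant&lt;/code&gt;, &lt;code&gt;dequant_chunk&lt;/code&gt;, &lt;code&gt;dequant_avx2&lt;/code&gt;, and &lt;code&gt;dequant_scalar&lt;/code&gt; are made up for illustration. The &lt;code&gt;mmap&lt;/code&gt; and &lt;code&gt;MADV_SEQUENTIAL&lt;/code&gt; parts are left out because they need a crate or raw libc calls, so the input is an ordinary slice.&lt;/p&gt;

<![CDATA[
```rust
// Assumption: 64 * 1024 quantized bytes per pass, matching the 64 KB figure.
const CHUNK: usize = 64 * 1024;

// Portable fallback: widen each i8 weight to f64 and apply the scale.
fn dequant_scalar(q: &[i8], scale: f64, out: &mut [f64]) {
    for (o, &v) in out.iter_mut().zip(q) {
        *o = f64::from(v) * scale;
    }
}

// AVX2 path: four weights per iteration, held in one __m256d register.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_avx2(q: &[i8], scale: f64, out: &mut [f64]) {
    use std::arch::x86_64::*;
    let s = _mm256_set1_pd(scale);
    let lanes = q.len() / 4 * 4;
    for i in (0..lanes).step_by(4) {
        // Pack four i8 values into the low 32 bits of an SSE register.
        let packed = i32::from_le_bytes([
            q[i] as u8,
            q[i + 1] as u8,
            q[i + 2] as u8,
            q[i + 3] as u8,
        ]);
        let ints = _mm_cvtepi8_epi32(_mm_cvtsi32_si128(packed)); // sign-extend to 4 x i32
        let vals = _mm256_mul_pd(_mm256_cvtepi32_pd(ints), s); // 4 x f64, scaled
        // SAFETY: i + 4 <= lanes <= out.len(), checked by the caller's assert.
        unsafe { _mm256_storeu_pd(out.as_mut_ptr().add(i), vals) };
    }
    // Leftover elements (fewer than four) go through the scalar path.
    dequant_scalar(&q[lanes..], scale, &mut out[lanes..]);
}

#[cfg(target_arch = "x86_64")]
fn dequant_chunk(q: &[i8], scale: f64, out: &mut [f64]) {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 support was verified at runtime on the line above.
        unsafe { dequant_avx2(q, scale, out) }
    } else {
        dequant_scalar(q, scale, out)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn dequant_chunk(q: &[i8], scale: f64, out: &mut [f64]) {
    dequant_scalar(q, scale, out)
}

// Walk the input one chunk at a time so each pass touches a bounded window.
fn dequant(q: &[i8], scale: f64, out: &mut [f64]) {
    assert_eq!(q.len(), out.len());
    for (qc, oc) in q.chunks(CHUNK).zip(out.chunks_mut(CHUNK)) {
        dequant_chunk(qc, scale, oc);
    }
}
```
]]>

&lt;p&gt;Calling &lt;code&gt;dequant(q, scale, out)&lt;/code&gt; with equal-length slices fills &lt;code&gt;out&lt;/code&gt;. On CPUs without AVX2 it takes the scalar path, so the two paths can be compared element by element.&lt;/p&gt;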

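&lt;p&gt;Benchmark results on my machine (i3, no GPU):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;full_dequant&lt;/td&gt;
&lt;td&gt;4.3 GiB/s&lt;/td&gt;
&lt;td&gt;+433% vs baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;parallel&lt;/td&gt;
&lt;td&gt;9.7 GiB/s&lt;/td&gt;
&lt;td&gt;+198%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;peak f64 AVX2&lt;/td&gt;
&lt;td&gt;11.18 GiB/s&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;avg&lt;/td&gt;
&lt;td&gt;9.82 GiB/s&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;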

&lt;p&gt;For reference, llama.cpp hits ~5.0–9.0 GiB/s on the same machine.&lt;br&gt;
&lt;strong&gt;We're in the same ballpark. Pure Rust.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Correctness check: &lt;code&gt;sum=340913024&lt;/code&gt; is identical across all code paths.&lt;/p&gt;
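
&lt;p&gt;A cross-path check of this kind can be sketched as follows. This is an illustration under assumptions, not the code behind the number above: the checksum here is a wrapping sum of raw &lt;code&gt;f64&lt;/code&gt; bit patterns (chosen because it does not depend on summation order), the kernel is a hypothetical &lt;code&gt;i8&lt;/code&gt;-times-scale stand-in, and &lt;code&gt;dequant_into&lt;/code&gt;, &lt;code&gt;checksum&lt;/code&gt;, &lt;code&gt;full_path&lt;/code&gt;, and &lt;code&gt;parallel_path&lt;/code&gt; are invented names.&lt;/p&gt;

<![CDATA[
```rust
use std::thread;

// Hypothetical stand-in for a dequant kernel: i8 weight times a scale.
fn dequant_into(q: &[i8], scale: f64, out: &mut [f64]) {
    for (o, &v) in out.iter_mut().zip(q) {
        *o = f64::from(v) * scale;
    }
}

// Order-independent checksum: wrapping sum of the raw f64 bit patterns.
fn checksum(values: &[f64]) -> u64 {
    values
        .iter()
        .fold(0u64, |acc, v| acc.wrapping_add(v.to_bits()))
}

// Single-threaded reference path.
fn full_path(q: &[i8], scale: f64) -> Vec<f64> {
    let mut out = vec![0.0; q.len()];
    dequant_into(q, scale, &mut out);
    out
}

// Parallel path: split input and output into matching disjoint slices,
// one scoped thread per slice.
fn parallel_path(q: &[i8], scale: f64, workers: usize) -> Vec<f64> {
    let mut out = vec![0.0; q.len()];
    let workers = workers.max(1);
    let per_worker = ((q.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        for (qc, oc) in q.chunks(per_worker).zip(out.chunks_mut(per_worker)) {
            s.spawn(move || dequant_into(qc, scale, oc));
        }
    });
    out
}
```
]]>

&lt;p&gt;Comparing &lt;code&gt;checksum(full_path(..))&lt;/code&gt; against &lt;code&gt;checksum(parallel_path(..))&lt;/code&gt; catches a slice that was skipped or written twice. A plain floating-point sum would also work when every path adds the values in the same order.&lt;/p&gt;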


&lt;h2&gt;
  
  
  The Closure vs Trait Tradeoff
&lt;/h2&gt;

&lt;p&gt;This one's in the comments of &lt;code&gt;runner.rs&lt;/code&gt; and I'll be honest &lt;br&gt;
about it here too.&lt;/p&gt;

&lt;p&gt;My model dispatch looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;ModelKind&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;LLaMA3&lt;/span&gt;   &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;run_quantized_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nn"&gt;ModelKind&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Mistral&lt;/span&gt;  &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;run_quantized_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nn"&gt;ModelKind&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Qwen&lt;/span&gt;     &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;run_quantized_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nn"&gt;ModelKind&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Phi3&lt;/span&gt;     &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;run_quantized_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They all call the same function. Why not a trait?&lt;/p&gt;

&lt;p&gt;Because &lt;code&gt;candle-transformers&lt;/code&gt; model types don't share a common &lt;br&gt;
trait for the forward pass. Boxing a closure that owns the model &lt;br&gt;
is simpler than defining a new dispatch trait just for this. &lt;br&gt;
It works. It's honest. It's in the comments.&lt;/p&gt;
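
&lt;p&gt;The boxed-closure idea can be sketched roughly like this. Everything except the &lt;code&gt;ModelKind&lt;/code&gt; variant names is hypothetical: &lt;code&gt;LlamaLike&lt;/code&gt;, &lt;code&gt;MistralLike&lt;/code&gt;, &lt;code&gt;ForwardFn&lt;/code&gt;, &lt;code&gt;make_forward&lt;/code&gt;, and &lt;code&gt;run_loop_sketch&lt;/code&gt; are stand-ins with invented signatures, and no &lt;code&gt;candle-transformers&lt;/code&gt; types are used.&lt;/p&gt;

<![CDATA[
```rust
// Hypothetical stand-ins: two model types with no shared trait and
// different forward signatures.
struct LlamaLike {
    bias: f32,
}

struct MistralLike {
    window: usize,
}

impl LlamaLike {
    fn forward(&mut self, tokens: &[u32], pos: usize) -> Vec<f32> {
        vec![self.bias + tokens.len() as f32, pos as f32]
    }
}

impl MistralLike {
    fn forward_windowed(&mut self, tokens: &[u32]) -> Vec<f32> {
        let seen = tokens.len().min(self.window);
        vec![0.0, seen as f32, 1.0]
    }
}

enum ModelKind {
    LLaMA3,
    Mistral,
}

// One closure type hides the concrete model behind a uniform call shape.
type ForwardFn = Box<dyn FnMut(&[u32], usize) -> Vec<f32>>;

// Each arm builds its model and moves it into a closure that owns it.
fn make_forward(kind: ModelKind) -> ForwardFn {
    match kind {
        ModelKind::LLaMA3 => {
            let mut m = LlamaLike { bias: 0.5 };
            Box::new(move |toks: &[u32], pos: usize| m.forward(toks, pos))
        }
        ModelKind::Mistral => {
            let mut m = MistralLike { window: 8 };
            Box::new(move |toks: &[u32], _pos: usize| m.forward_windowed(toks))
        }
    }
}

// Greedy loop that only sees the boxed closure, never the model type.
fn run_loop_sketch(mut forward: ForwardFn, prompt: &[u32], steps: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for pos in 0..steps {
        let logits = forward(tokens.as_slice(), pos);
        // Argmax over the logits picks the next token id.
        let mut best = 0;
        for (i, l) in logits.iter().enumerate() {
            if *l > logits[best] {
                best = i;
            }
        }
        tokens.push(best as u32);
    }
    tokens
}
```
]]>

&lt;p&gt;In this shape, supporting another model is one more arm in &lt;code&gt;make_forward&lt;/code&gt; and the loop stays untouched. The price is one dynamic call per decoding step, which is small next to a forward pass.&lt;/p&gt;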




&lt;h2&gt;
  
  
  What's Coming Next
&lt;/h2&gt;

&lt;p&gt;The current inference backend is &lt;code&gt;candle-transformers&lt;/code&gt;. &lt;br&gt;
It works, but model coverage is limited.&lt;/p&gt;

&lt;p&gt;Next milestone: replace it with &lt;strong&gt;mistral.rs&lt;/strong&gt; as the inference &lt;br&gt;
engine — which supports Qwen3, LLaMA3, Gemma, Phi, and more &lt;br&gt;
out of the box. candle stays for LoRA training. GGQR stays &lt;br&gt;
for dequant. Best of three.&lt;/p&gt;

&lt;p&gt;Full pipeline will be:&lt;br&gt;
GGUF file&lt;br&gt;
↓ GGQR-CF-mmap (dequant, ~9.8 GiB/s)&lt;br&gt;
↓ mistral.rs (inference, multi-arch)&lt;br&gt;
↓ candle (LoRA training)&lt;/p&gt;




&lt;h2&gt;
  
  
  The Philosophy
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Your machine. Your models. Your rules."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Local AI shouldn't require a PhD in DevOps to set up. &lt;br&gt;
It shouldn't need Python, Ollama, CUDA drama, or a &lt;br&gt;
beefy internet connection. It should be one binary, &lt;br&gt;
one command, and it runs.&lt;/p&gt;

&lt;p&gt;GwenLand is my attempt at that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/JinXSuper/gwenland" rel="noopener noreferrer"&gt;https://github.com/JinXSuper/gwenland&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Built with: Rust, candle, GGQR-CF-mmap (custom)&lt;/li&gt;
&lt;li&gt;License: MIT + Commons Clause&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback is welcome, especially the brutal kind. &lt;br&gt;
This is experimental and I want to get the architecture &lt;br&gt;
right before locking in the API.&lt;/p&gt;

&lt;p&gt;Preview/Proof:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf3puc0f3lw85eqg68xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf3puc0f3lw85eqg68xf.png" alt="Built in Benchmark" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
