<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jason Duke</title>
    <description>The latest articles on DEV Community by Jason Duke (@jason_duke).</description>
    <link>https://dev.to/jason_duke</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3843120%2F3b529d0b-d16b-439e-8113-6942a406d27f.jpg</url>
      <title>DEV Community: Jason Duke</title>
      <link>https://dev.to/jason_duke</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jason_duke"/>
    <language>en</language>
    <item>
      <title>Stop Paying Frontier Prices for Tasks a Local Model Handles Fine</title>
      <dc:creator>Jason Duke</dc:creator>
      <pubDate>Wed, 08 Apr 2026 09:50:47 +0000</pubDate>
      <link>https://dev.to/jason_duke/stop-paying-frontier-prices-for-tasks-a-local-model-handles-fine-544o</link>
      <guid>https://dev.to/jason_duke/stop-paying-frontier-prices-for-tasks-a-local-model-handles-fine-544o</guid>
      <description>&lt;p&gt;Small open-weight models got good. Qwen 9B, Llama 8B, Gemma 4B handle 80% of production LLM workloads (extraction, classification, summarisation, tagging) with output quality indistinguishable from frontier APIs.&lt;/p&gt;

&lt;p&gt;The remaining 20% genuinely needs the big model. But nobody routes. Every request hits the same endpoint. You are paying $3-15 per million tokens for work a free local model handles just as well.&lt;/p&gt;

&lt;h2&gt;The cost arithmetic&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Cost per 1M tokens&lt;/th&gt;
&lt;th&gt;Typical tasks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local 9B (Ollama/vLLM)&lt;/td&gt;
&lt;td&gt;~$0.005&lt;/td&gt;
&lt;td&gt;Extraction, classification, summarisation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local 27B (vLLM, quantised)&lt;/td&gt;
&lt;td&gt;~$0.02&lt;/td&gt;
&lt;td&gt;Reasoning, code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud API (Gemini Flash)&lt;/td&gt;
&lt;td&gt;$0.15-$0.60&lt;/td&gt;
&lt;td&gt;Overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontier API (Claude, GPT-4)&lt;/td&gt;
&lt;td&gt;$3.00-$15.00&lt;/td&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Route 80% of traffic from the frontier tier to a local 9B and your blended cost drops from ~$10 to roughly $2 per million tokens (0.8 × $0.005 + 0.2 × $10 ≈ $2.00). The fifth of traffic that still pays frontier rates dominates the blended figure.&lt;/p&gt;

&lt;h2&gt;How Kronaxis Router works&lt;/h2&gt;

&lt;p&gt;Single Go binary. Sits between your app and your model backends. Every request passes through a lightweight rule-based classifier (no LLM call, under 1ms) that assigns a task category:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured extraction:&lt;/strong&gt; JSON schema, constrained output -&amp;gt; cheap model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classification:&lt;/strong&gt; single-label, yes/no, sentiment -&amp;gt; cheap model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarisation:&lt;/strong&gt; condensation, bullet points -&amp;gt; cheap model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning:&lt;/strong&gt; "analyse", "compare", multi-step -&amp;gt; capable model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation:&lt;/strong&gt; language specs, complex constraints -&amp;gt; capable model&lt;/li&gt;
&lt;/ul&gt;
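&lt;p&gt;The gist of a classifier like this fits in a few lines of Go. The keyword triggers below are illustrative guesses, not the router's actual rule set; the point is the shape: surface-feature matching with a conservative fallback.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// classify assigns a task category from surface features of the prompt.
// Keyword lists here are made up for illustration. Ambiguous prompts
// deliberately fall through to "reasoning", the most capable tier,
// mirroring the conservative default described above.
func classify(prompt string) string {
	p := strings.ToLower(prompt)
	switch {
	case strings.Contains(p, "json") || strings.Contains(p, "extract"):
		return "extraction" // cheap model
	case strings.Contains(p, "classify") || strings.Contains(p, "sentiment"):
		return "classification" // cheap model
	case strings.Contains(p, "summarise") || strings.Contains(p, "summarize"):
		return "summarisation" // cheap model
	case strings.Contains(p, "write a function") || strings.Contains(p, "implement"):
		return "codegen" // capable model
	default:
		return "reasoning" // capable model: the conservative fallback
	}
}

func main() {
	fmt.Println(classify("Extract the invoice fields as JSON")) // extraction
	fmt.Println(classify("Compare these two proposals"))        // reasoning
}
```

&lt;p&gt;No LLM call anywhere in that path, which is why the classification stays under a millisecond.&lt;/p&gt;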

&lt;p&gt;The classifier is deliberately conservative. Ambiguous cases route to the more capable model. Evaluated against 25 labelled prompts, it classified all 25 correctly.&lt;/p&gt;

&lt;h2&gt;The quality safety net&lt;/h2&gt;

&lt;p&gt;Blindly routing to a cheap model is a bad idea. The router samples 5% of cheap-model responses and validates them against a reference model, keeping a sliding window of results per task category. If quality drops below threshold, that category auto-promotes to the next tier.&lt;/p&gt;

&lt;p&gt;Savings by default. Automatic safety net.&lt;/p&gt;
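&lt;p&gt;A minimal sketch of the sampling-plus-sliding-window idea. The window size, 5% sample rate, and 90% threshold are illustrative values, not the router's actual defaults:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
)

// qualityGate tracks a sliding window of validation outcomes for one
// task category and decides when to promote it to a more capable tier.
type qualityGate struct {
	window  []bool // recent validation outcomes, true = passed
	size    int
	minPass float64
}

// shouldSample picks ~5% of cheap-model responses for validation
// against a reference model.
func (g *qualityGate) shouldSample() bool {
	return rand.Float64() < 0.05
}

func (g *qualityGate) record(passed bool) {
	g.window = append(g.window, passed)
	if len(g.window) > g.size {
		g.window = g.window[1:] // slide the window forward
	}
}

// promote reports whether the category should move up a tier.
func (g *qualityGate) promote() bool {
	if len(g.window) < g.size {
		return false // not enough evidence yet
	}
	passed := 0
	for _, ok := range g.window {
		if ok {
			passed++
		}
	}
	return float64(passed)/float64(len(g.window)) < g.minPass
}

func main() {
	g := &qualityGate{size: 20, minPass: 0.9}
	for i := 0; i < 20; i++ {
		g.record(i%4 != 0) // 75% pass rate, below the 90% threshold
	}
	fmt.Println(g.promote()) // true
}
```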

&lt;h2&gt;Architecture&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client App  --&amp;gt;  Kronaxis Router  --&amp;gt;  Backend A (local 9B, Ollama/vLLM)
                      |           --&amp;gt;  Backend B (local 27B, vLLM)
                      |           --&amp;gt;  Backend C (Gemini Flash)
                      |
                  Classifier (rule-based, &amp;lt;1ms)
                  Cache Layer (SHA-256, temp=0 only)
                  Budget Enforcer (downgrade on limit)
                  Quality Validator (5% sampling)
                  Batch Router (50% off on 7 providers)
                  Metrics Collector (Prometheus)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Why Go&lt;/h3&gt;

&lt;p&gt;Single static binary. No Python runtime, no Node, no containers required. 2MB memory under full load. 22,770 req/s throughput. The router will never be the bottleneck when LLM inference takes 500ms-30s.&lt;/p&gt;

&lt;h3&gt;Backend failover&lt;/h3&gt;

&lt;p&gt;Three consecutive failures mark a backend DOWN; a single success recovers it. When a request fails, the router tries the next backend in the chain, so a local vLLM crash gracefully overflows to cloud without client-side changes.&lt;/p&gt;
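&lt;p&gt;The health-tracking state machine is small enough to sketch directly, using the thresholds from the text (the backend names are placeholders):&lt;/p&gt;

```go
package main

import "fmt"

// backend tracks health: three consecutive failures mark it DOWN,
// a single success recovers it.
type backend struct {
	name     string
	failures int
	down     bool
}

func (b *backend) onResult(ok bool) {
	if ok {
		b.failures = 0
		b.down = false // one success recovers the backend
		return
	}
	b.failures++
	if b.failures >= 3 {
		b.down = true
	}
}

// pick returns the first healthy backend in the chain.
func pick(chain []*backend) *backend {
	for _, b := range chain {
		if !b.down {
			return b
		}
	}
	return nil
}

func main() {
	local := &backend{name: "local-9b"}
	cloud := &backend{name: "gemini-flash"}
	chain := []*backend{local, cloud}

	for i := 0; i < 3; i++ {
		local.onResult(false) // simulate the local vLLM crashing
	}
	fmt.Println(pick(chain).name) // traffic overflows to the cloud backend
}
```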

&lt;h3&gt;LoRA adapter routing&lt;/h3&gt;

&lt;p&gt;If your vLLM instance serves multiple LoRA adapters, the router rewrites the &lt;code&gt;model&lt;/code&gt; field to the correct adapter based on request metadata. The client sends a standard OpenAI-compatible request and never needs to know which adapters exist.&lt;/p&gt;
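&lt;p&gt;A sketch of the rewrite, assuming routing on a &lt;code&gt;metadata&lt;/code&gt; key. The &lt;code&gt;tenant&lt;/code&gt; key and the adapter names are invented for illustration; only the &lt;code&gt;model&lt;/code&gt; field is standard OpenAI schema:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest is the subset of an OpenAI-compatible request the
// router needs to inspect and rewrite.
type chatRequest struct {
	Model    string            `json:"model"`
	Metadata map[string]string `json:"metadata,omitempty"`
}

// adapterFor maps request metadata to a served LoRA adapter name.
// The "tenant" key and adapter names are purely illustrative.
func adapterFor(req *chatRequest) string {
	if req.Metadata["tenant"] == "legal" {
		return "llama-8b-legal-lora"
	}
	return "llama-8b-base"
}

func main() {
	raw := []byte(`{"model":"default","metadata":{"tenant":"legal"}}`)
	var req chatRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		panic(err)
	}
	req.Model = adapterFor(&req) // rewrite before forwarding to vLLM
	out, _ := json.Marshal(req)
	fmt.Println(string(out))
}
```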

&lt;h3&gt;Batch API routing&lt;/h3&gt;

&lt;p&gt;Seven providers offer 50% off on batch API requests. The router handles this transparently: tag a request as &lt;code&gt;bulk&lt;/code&gt; priority and it auto-submits to the provider's batch endpoint. For overnight jobs, this halves your cloud costs on top of the routing savings.&lt;/p&gt;
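&lt;p&gt;Conceptually the routing decision is a priority check; the endpoint URLs below are placeholders, not any real provider's paths:&lt;/p&gt;

```go
package main

import "fmt"

// endpointFor picks the synchronous or batch endpoint based on request
// priority. Requests tagged "bulk" go to the batch endpoint, where
// providers with a batch discount charge half the synchronous price.
// The URLs here are illustrative placeholders.
func endpointFor(priority string) (url string, priceMultiplier float64) {
	if priority == "bulk" {
		return "https://api.example.com/v1/batches", 0.5
	}
	return "https://api.example.com/v1/chat/completions", 1.0
}

func main() {
	url, mult := endpointFor("bulk")
	fmt.Println(url, mult) // batch endpoint at half price
}
```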

&lt;h3&gt;Response caching&lt;/h3&gt;

&lt;p&gt;Deterministic requests (same prompt, temperature 0) are served from an in-memory SHA-256-keyed cache. 30% hit rate on extraction workloads in our production traffic.&lt;/p&gt;
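&lt;p&gt;The key derivation can be sketched as follows; which request fields feed the hash is an assumption (model and prompt here), but the temperature-0 gate is the point:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a deterministic key for a request. Only
// temperature-0 requests are cacheable: with sampling enabled, a cached
// response would not be representative of a fresh one. Hashing model
// and prompt together is an illustrative choice of key fields.
func cacheKey(model, prompt string, temperature float64) (string, bool) {
	if temperature != 0 {
		return "", false // non-deterministic: never cache
	}
	h := sha256.Sum256([]byte(model + "\x00" + prompt))
	return hex.EncodeToString(h[:]), true
}

func main() {
	k1, ok1 := cacheKey("local-9b", "Extract fields as JSON", 0)
	k2, _ := cacheKey("local-9b", "Extract fields as JSON", 0)
	_, ok3 := cacheKey("local-9b", "Extract fields as JSON", 0.7)
	fmt.Println(ok1, k1 == k2, ok3) // true true false
}
```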

&lt;h3&gt;Budget enforcement&lt;/h3&gt;

&lt;p&gt;Set a daily dollar limit per service. When hit, the router downgrades to a cheaper model instead of returning errors. Your pipeline keeps running.&lt;/p&gt;
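&lt;p&gt;The downgrade-instead-of-error behaviour amounts to a tier check on each request; tier names below are illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// pickTier downgrades instead of erroring once daily spend hits the
// limit, so the pipeline keeps running. Tier names are illustrative.
func pickTier(spentToday, dailyLimit float64, requested string) string {
	if spentToday < dailyLimit {
		return requested // under budget: honour the request
	}
	// Over budget: step down a tier rather than return an error.
	if requested == "frontier" {
		return "local-27b"
	}
	return "local-9b"
}

func main() {
	fmt.Println(pickTier(4.20, 10.0, "frontier")) // under budget: frontier
	fmt.Println(pickTier(10.0, 10.0, "frontier")) // at the limit: local-27b
}
```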

&lt;h2&gt;How this compares to alternatives&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Kronaxis Router&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;OpenRouter&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;Martian&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost-based routing&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Some&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;ML-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality validation&lt;/td&gt;
&lt;td&gt;Closed loop&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Implicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch API (50% off)&lt;/td&gt;
&lt;td&gt;7 providers&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response caching&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget enforcement&lt;/td&gt;
&lt;td&gt;Downgrade&lt;/td&gt;
&lt;td&gt;Alerts&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Alerts&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;2MB&lt;/td&gt;
&lt;td&gt;300MB+&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;22K req/s&lt;/td&gt;
&lt;td&gt;~2K req/s&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider count&lt;/td&gt;
&lt;td&gt;4 types&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;200+&lt;/td&gt;
&lt;td&gt;15+&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free/$150+&lt;/td&gt;
&lt;td&gt;Margin&lt;/td&gt;
&lt;td&gt;$99+/mo&lt;/td&gt;
&lt;td&gt;Usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Licence&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;td&gt;Closed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM is a universal gateway. OpenRouter is zero-setup SaaS. Portkey is observability. Martian is ML routing. Kronaxis Router is a cost optimiser. Different tools for different problems.&lt;/p&gt;

&lt;h2&gt;Getting started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/Kronaxis/kronaxis-router/main/install.sh | bash

&lt;span class="c"&gt;# Auto-detect local models and API keys, generate config&lt;/span&gt;
kronaxis-router init

&lt;span class="c"&gt;# Start&lt;/span&gt;
kronaxis-router
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also available: &lt;code&gt;brew install kronaxis/tap/kronaxis-router&lt;/code&gt;, &lt;code&gt;go install&lt;/code&gt;, Docker, deb/rpm.&lt;/p&gt;

&lt;p&gt;For Claude Code and Cursor: &lt;code&gt;kronaxis-router init --claude&lt;/code&gt; or &lt;code&gt;kronaxis-router init --cursor&lt;/code&gt; configures the built-in MCP server for conversational management of backends, costs, and rules.&lt;/p&gt;

&lt;p&gt;81 tests. Apache 2.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Kronaxis/kronaxis-router" rel="noopener noreferrer"&gt;github.com/Kronaxis/kronaxis-router&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full blog post:&lt;/strong&gt; &lt;a href="https://kronaxis.co.uk/blog/llm-routing-cost-savings" rel="noopener noreferrer"&gt;kronaxis.co.uk/blog/llm-routing-cost-savings&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>llm</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
