Small open-weight models got good. Qwen 9B, Llama 8B, and Gemma 4B handle roughly 80% of production LLM workloads (extraction, classification, summarisation, tagging) with output quality indistinguishable from frontier APIs.
The remaining 20% genuinely needs the big model. But almost nobody routes: every request hits the same endpoint, so you are paying $3-15 per million tokens for work that a free local model does identically.
The cost arithmetic
| Backend | Cost per 1M tokens | Typical tasks |
|---|---|---|
| Local 9B (Ollama/vLLM) | ~$0.005 | Extraction, classification, summarisation |
| Local 27B (vLLM, quantised) | ~$0.02 | Reasoning, code generation |
| Cloud API (Gemini Flash) | $0.15-0.60 | Overflow |
| Frontier API (Claude, GPT-4) | $3-15.00 | Complex reasoning |
Route 80% of traffic from the frontier tier to a local 9B, send most of the remainder to a cheap cloud tier, and your blended cost drops from ~$10 to ~$0.50 per million tokens.
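To see where the ~$0.50 figure comes from, here is the blended-cost arithmetic as a runnable sketch. The 80/15/5 split is an illustrative assumption (most of the non-local remainder landing on the cheap cloud tier), not measured traffic:

```go
package main

import "fmt"

// blendedCost returns the traffic-weighted price per 1M tokens.
func blendedCost(shares, prices []float64) float64 {
	total := 0.0
	for i, s := range shares {
		total += s * prices[i]
	}
	return total
}

func main() {
	// Illustrative split (an assumption, not measured data): 80% local 9B,
	// 15% Gemini Flash at a mid-range $0.50, 5% frontier at $10.
	shares := []float64{0.80, 0.15, 0.05}
	prices := []float64{0.005, 0.50, 10.0}
	fmt.Printf("blended: $%.3f per 1M tokens\n", blendedCost(shares, prices))
	// → blended: $0.579 per 1M tokens
}
```

Push more of the remainder onto frontier and the blended cost climbs quickly, which is why conservative-but-mostly-cheap routing dominates the savings.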
How Kronaxis Router works
Single Go binary. Sits between your app and your model backends. Every request passes through a lightweight rule-based classifier (no LLM call, under 1ms) that assigns a task category:
- Structured extraction: JSON schema, constrained output -> cheap model
- Classification: single-label, yes/no, sentiment -> cheap model
- Summarisation: condensation, bullet points -> cheap model
- Reasoning: "analyse", "compare", multi-step -> capable model
- Code generation: language specs, complex constraints -> capable model
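The rules above can be sketched as a keyword classifier in a few lines of Go. Function names and the keyword list are hypothetical; the real rule set is richer, but the shape (cheap on a match, capable by default) is the point:

```go
package main

import (
	"fmt"
	"strings"
)

// classify is an illustrative rule-based classifier: keyword matches
// route to the cheap tier, anything ambiguous falls through to the
// capable model. No LLM call is involved.
func classify(prompt string) string {
	p := strings.ToLower(prompt)
	switch {
	case strings.Contains(p, "json") || strings.Contains(p, "extract"):
		return "cheap" // structured extraction
	case strings.Contains(p, "classify") || strings.Contains(p, "sentiment"):
		return "cheap" // single-label classification
	case strings.Contains(p, "summarise") || strings.Contains(p, "summarize"):
		return "cheap" // summarisation
	default:
		return "capable" // conservative default for ambiguous prompts
	}
}

func main() {
	fmt.Println(classify("Extract the invoice fields as JSON")) // cheap
	fmt.Println(classify("Analyse the trade-offs here"))        // capable
}
```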
The classifier is deliberately conservative: ambiguous cases route to the more capable model. Evaluated against a set of 25 labelled prompts, it classified every one correctly.
The quality safety net
Blindly routing everything to a cheap model is a bad idea. The router samples 5% of cheap-model responses and validates them against a reference model, tracking results in a sliding window per task category. If a category's quality drops below threshold, that category auto-promotes to the next tier.
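A minimal sketch of that sample-and-promote loop, with illustrative struct and method names (the router's actual internals may differ):

```go
package main

import (
	"fmt"
	"math/rand"
)

// validator tracks pass/fail validation results for one task category.
type validator struct {
	window    []bool  // sliding window of recent validation outcomes
	size      int     // window length
	threshold float64 // minimum pass rate before auto-promotion
}

// shouldSample picks ~5% of cheap-model responses for validation.
func (v *validator) shouldSample() bool {
	return rand.Float64() < 0.05
}

// record appends an outcome, discarding the oldest past the window size.
func (v *validator) record(pass bool) {
	v.window = append(v.window, pass)
	if len(v.window) > v.size {
		v.window = v.window[1:]
	}
}

// promote reports whether the category should move to the next tier.
func (v *validator) promote() bool {
	if len(v.window) < v.size {
		return false // not enough samples yet
	}
	passes := 0
	for _, ok := range v.window {
		if ok {
			passes++
		}
	}
	return float64(passes)/float64(len(v.window)) < v.threshold
}

func main() {
	v := &validator{size: 20, threshold: 0.9}
	for i := 0; i < 20; i++ {
		v.record(i%2 == 0) // alternate pass/fail: 50% pass rate
	}
	fmt.Println("promote:", v.promote()) // 0.5 < 0.9, so promote
}
```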
Savings by default. Automatic safety net.
Architecture
Client App --> Kronaxis Router --> Backend A (local 9B, Ollama/vLLM)
                                |--> Backend B (local 27B, vLLM)
                                |--> Backend C (Gemini Flash)

Inside the router:
- Classifier (rule-based, <1ms)
- Cache Layer (SHA-256, temp=0 only)
- Budget Enforcer (downgrade on limit)
- Quality Validator (5% sampling)
- Batch Router (50% off on 7 providers)
- Metrics Collector (Prometheus)
Why Go
Single static binary: no Python runtime, no Node, no containers required. Roughly 2MB of memory under full load and 22,770 req/s of throughput in our benchmarks. The router will never be the bottleneck when LLM inference itself takes 500ms-30s.
Backend failover
Three consecutive failures mark a backend DOWN; a single success recovers it. When a request fails, the router tries the next backend in the chain, so a local vLLM crash overflows gracefully to the cloud with no client-side changes.
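The failover rule above is a tiny state machine. A sketch, with illustrative names:

```go
package main

import "fmt"

// backendHealth implements the rule: three consecutive failures mark a
// backend DOWN, one success recovers it.
type backendHealth struct {
	failures int  // consecutive failure count
	down     bool // current health state
}

func (b *backendHealth) onFailure() {
	b.failures++
	if b.failures >= 3 {
		b.down = true
	}
}

func (b *backendHealth) onSuccess() {
	b.failures = 0
	b.down = false // a single success recovers the backend
}

func main() {
	b := &backendHealth{}
	b.onFailure()
	b.onFailure()
	b.onFailure()
	fmt.Println("down:", b.down) // true after 3 consecutive failures
	b.onSuccess()
	fmt.Println("down:", b.down) // false: one success recovers
}
```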
LoRA adapter routing
If your vLLM instance serves multiple LoRA adapters, the router rewrites the model field to the correct adapter based on request metadata. The client sends a standard OpenAI-compatible request and never needs to know which adapters exist.
Batch API routing
Seven providers offer 50% off on batch API requests. The router handles this transparently: tag a request as bulk priority and it auto-submits to the provider's batch endpoint. For overnight jobs, this halves your cloud costs on top of the routing savings.
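The dispatch decision itself is trivial; what matters is that the client only sets a priority tag. A sketch, with illustrative endpoint paths (not any specific provider's API):

```go
package main

import "fmt"

// endpointFor sketches the batch-routing rule: bulk-tagged requests go
// to the provider's discounted async batch endpoint, everything else
// takes the normal synchronous path. Paths are illustrative.
func endpointFor(priority string) string {
	if priority == "bulk" {
		return "/v1/batches" // async batch endpoint, ~50% cheaper
	}
	return "/v1/chat/completions" // normal synchronous path
}

func main() {
	fmt.Println(endpointFor("bulk"))        // /v1/batches
	fmt.Println(endpointFor("interactive")) // /v1/chat/completions
}
```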
Response caching
Deterministic requests (same prompt, temperature 0) are served from an in-memory SHA-256-keyed cache. We see a 30% hit rate on extraction workloads in our production traffic.
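The key derivation can be sketched like this, assuming the key covers model plus prompt (the real cache may hash more of the request):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey sketches the deterministic-request cache: only temperature-0
// requests get a key, derived from SHA-256 over model + prompt.
func cacheKey(model, prompt string, temperature float64) (string, bool) {
	if temperature != 0 {
		return "", false // sampling is non-deterministic: never cache
	}
	// NUL separator prevents ("ab","c") colliding with ("a","bc").
	sum := sha256.Sum256([]byte(model + "\x00" + prompt))
	return hex.EncodeToString(sum[:]), true
}

func main() {
	k, ok := cacheKey("local-9b", "Extract the fields as JSON.", 0)
	fmt.Println(ok, k[:12]) // cacheable; first 12 hex chars of the key
}
```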
Budget enforcement
Set a daily dollar limit per service. When hit, the router downgrades to a cheaper model instead of returning errors. Your pipeline keeps running.
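Downgrade-instead-of-error is the whole trick. A sketch of the decision, with hypothetical tier names:

```go
package main

import "fmt"

// pickTier sketches budget enforcement: under budget, honour the
// classifier's choice; over budget, drop one tier instead of erroring.
// Tier names here are hypothetical, not the router's config keys.
func pickTier(requested string, spentUSD, limitUSD float64) string {
	if spentUSD < limitUSD {
		return requested
	}
	downgrade := map[string]string{
		"frontier":  "cloud",
		"cloud":     "local-27b",
		"local-27b": "local-9b",
	}
	if next, ok := downgrade[requested]; ok {
		return next
	}
	return requested // already at the cheapest tier: keep serving
}

func main() {
	fmt.Println(pickTier("frontier", 12.40, 10.00)) // over budget → cloud
	fmt.Println(pickTier("frontier", 5.00, 10.00))  // under budget → frontier
}
```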
How this compares to alternatives
| Feature | Kronaxis Router | LiteLLM | OpenRouter | Portkey | Martian |
|---|---|---|---|---|---|
| Self-hosted | Yes | Yes | No | No | No |
| Cost-based routing | Automatic | Manual | Some | Manual | ML-based |
| Quality validation | Closed loop | No | No | No | Implicit |
| Batch API (50% off) | 7 providers | No | No | No | No |
| Response caching | Built in | No | No | No | No |
| Budget enforcement | Downgrade | Alerts | No | Alerts | No |
| LoRA routing | Yes | No | No | No | No |
| Memory | 2MB | 300MB+ | SaaS | SaaS | SaaS |
| Throughput | 22K req/s | ~2K req/s | N/A | N/A | N/A |
| Provider count | 4 types | 100+ | 200+ | 15+ | 100+ |
| Price | Free | Free/$150+ | Margin | $99+/mo | Usage |
| Licence | Apache 2.0 | MIT | Closed | Closed | Closed |
LiteLLM is a universal gateway. OpenRouter is zero-setup SaaS. Portkey is observability. Martian is ML routing. Kronaxis Router is a cost optimiser. Different tools for different problems.
Getting started
# Install
curl -fsSL https://raw.githubusercontent.com/Kronaxis/kronaxis-router/main/install.sh | bash
# Auto-detect local models and API keys, generate config
kronaxis-router init
# Start
kronaxis-router
Also available: brew install kronaxis/tap/kronaxis-router, go install, Docker, and deb/rpm packages.
For Claude Code and Cursor: kronaxis-router init --claude or kronaxis-router init --cursor configures the built-in MCP server for conversational management of backends, costs, and rules.
81 tests. Apache 2.0.
GitHub: github.com/Kronaxis/kronaxis-router
Full blog post: kronaxis.co.uk/blog/llm-routing-cost-savings