DEV Community

Lakshmi Sravya Vedantham

I Put a 5MB Rust Binary Between My Code and Every LLM API — It Cut My Bill by 40%

Every developer using LLMs faces the same three problems:

  1. Cost blindness — you cannot answer "how much did I spend today?"
  2. No failover — when OpenAI goes down, your app goes down
  3. Wasted money — identical prompts hit the API over and over instead of being cached

I built llmux to fix all three with zero code changes.

What is llmux?

A single Rust binary (~5MB) that sits between your code and every LLM API. It handles failover, caching, rate limiting, and cost tracking automatically.

Your code (any language)
        |
   http://localhost:4000
        |
     ┌───────┐
     │ llmux │  ← single binary, ~5MB
     └──┬────┘
        │
   ┌────┼────┬────────┐
   ▼    ▼    ▼        ▼
OpenAI Claude Gemini  Ollama

Zero Code Changes

You change one environment variable. That is it.

# Before
export OPENAI_BASE_URL=https://api.openai.com

# After
export OPENAI_BASE_URL=http://localhost:4000/v1

Your existing code — Python, TypeScript, Go, whatever — keeps working exactly the same. llmux intercepts the calls and adds superpowers.
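
To make "zero code changes" concrete, here is a minimal Python sketch of a client that only ever reads OPENAI_BASE_URL. It is illustrative (not llmux's code or any SDK), and it assumes the OpenAI-compatible /chat/completions request shape that llmux proxies:

```python
import json
import os
import urllib.request

# Hypothetical minimal client: the only thing that changed is the value of
# OPENAI_BASE_URL, which now points at llmux instead of api.openai.com.
base_url = os.environ.get("OPENAI_BASE_URL", "http://localhost:4000/v1")
endpoint = base_url.rstrip("/") + "/chat/completions"
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    endpoint,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send this request through the gateway.
print(req.full_url)
```

The commented-out urlopen call is the only network step; everything else is ordinary, unchanged client code.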

Quick Start

git clone https://github.com/LakshmiSravyaVedantham/llmux.git
cd llmux
cargo build --release
cp config.example.toml config.toml
# Edit config.toml with your API keys
./target/release/llmux start

That is it. Your gateway is running.

What You Get

  • Multi-provider proxy: routes to OpenAI, Anthropic, Google, Mistral, Ollama
  • Automatic failover: provider down? Requests route to the next one automatically
  • Response caching: identical prompts return cached responses, saving money instantly
  • Token budgets: set daily spend caps, with warn mode or hard block
  • Cost tracking: real-time cost estimation per model, per provider
  • TUI dashboard: live terminal dashboard showing spend, cache hits, request log
  • Request logging: every call logged to embedded SQLite; query it anytime
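
Because the log is plain SQLite, "query anytime" really is a one-liner. Here is a Python sketch; note that the table and column names below are assumptions made up for this example, not llmux's actual schema:

```python
import sqlite3

# Illustration of querying a per-request log like the one llmux keeps in
# embedded SQLite. CAUTION: the table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (ts TEXT, provider TEXT, model TEXT, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?, ?)",
    [
        ("14:23:01", "openai", "gpt-4", 0.045),
        ("14:22:58", "openai", "gpt-4", 0.0),  # cache hit: zero cost
        ("14:22:55", "anthropic", "claude-sonnet", 0.012),
    ],
)
# "How much did I spend today, per provider?"
rows = conn.execute(
    "SELECT provider, ROUND(SUM(cost_usd), 4) AS spend"
    " FROM requests GROUP BY provider ORDER BY provider"
).fetchall()
print(rows)  # [('anthropic', 0.012), ('openai', 0.045)]
```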

The TUI Dashboard

Run llmux dash to see live stats:

┌─ llmux dashboard (q to quit) ──────────────────────┐
│ Requests today:  142                                │
│ Cache hits:       38 (26.8%)                        │
│ Input tokens:    284,000                            │
│ Output tokens:   71,000                             │
│ Spend today:     $3.4200                            │
└─────────────────────────────────────────────────────┘
┌─ Recent Requests ───────────────────────────────────┐
│ Time     Provider  Model          Status  Cost      │
│ 14:23:01 openai    gpt-4          200     $0.0450   │
│ 14:22:58 openai    gpt-4          200     $0.0000   │
│ 14:22:55 anthropic claude-sonnet  200     $0.0120   │
└─────────────────────────────────────────────────────┘

The second request shows $0.00 — that is a cache hit. Same prompt, zero cost.

How Failover Works

Configure providers with priorities:

[[providers]]
name = "openai"
api_key = "${OPENAI_API_KEY}"
priority = 1
base_url = "https://api.openai.com"

[[providers]]
name = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
priority = 2
base_url = "https://api.anthropic.com"

OpenAI returns a 5xx? llmux marks it unhealthy and routes to Anthropic. Connection refused? Same thing. Your app never sees the error.
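
In rough Python terms (the real implementation is in the Rust binary; the provider dicts and send() stub here are illustrative), the routing loop described above looks like this:

```python
# Sketch of priority-ordered failover: try providers in priority order,
# skipping ones already marked unhealthy.
def route(providers, send):
    for p in sorted(providers, key=lambda p: p["priority"]):
        if not p.get("healthy", True):
            continue  # previously marked down, skip it
        try:
            status, body = send(p)
        except ConnectionError:
            p["healthy"] = False  # connection refused: mark unhealthy
            continue
        if status >= 500:
            p["healthy"] = False  # 5xx: mark unhealthy, try the next one
            continue
        return p["name"], body
    raise RuntimeError("all providers failed")

providers = [
    {"name": "openai", "priority": 1},
    {"name": "anthropic", "priority": 2},
]

def fake_send(p):
    # Simulate OpenAI returning a 5xx so the request fails over.
    return (503, "") if p["name"] == "openai" else (200, "ok")

print(route(providers, fake_send))  # ('anthropic', 'ok')
```

The caller only ever sees the successful response; the 503 is absorbed by the loop.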

The Cache Math

If 25% of your LLM calls are identical prompts (common in dev workflows, CI, repeated queries), and you spend $100/month:

  • Without llmux: $100/month
  • With llmux: $75/month (the 25% of calls served from cache cost $0)
  • With aggressive caching (1-hour TTL): roughly $60/month
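
The arithmetic above is just spend times (1 - hit rate). Note the 40% hit rate in the aggressive case is my inference from the $60 figure, not a number llmux reports:

```python
# Effective spend = monthly_spend * (1 - cache_hit_rate).
# The 0.40 hit rate is inferred from the $60/month figure above.
monthly_spend = 100.0
for hit_rate in (0.25, 0.40):
    effective = monthly_spend * (1 - hit_rate)
    print(f"{hit_rate:.0%} cache hits -> ${effective:.0f}/month")
```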

The cache is an in-memory LRU with a configurable TTL. Keys are SHA-256 hashes of the provider plus the request body, and cached responses never leave your machine.
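
A minimal sketch of that key derivation, assuming nothing beyond "SHA-256 of provider + request body" (the function name is mine, not llmux's API):

```python
import hashlib

# Hypothetical cache-key helper: SHA-256 over the provider name plus the
# raw request body, as described above.
def cache_key(provider: str, body: bytes) -> str:
    return hashlib.sha256(provider.encode() + body).hexdigest()

body = b'{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'
k1 = cache_key("openai", body)
k2 = cache_key("openai", body)
k3 = cache_key("anthropic", body)
print(k1 == k2, k1 != k3)  # True True: identical requests share one key
```

Because the provider is part of the key, the same prompt sent to two different providers is cached separately.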

Budget Protection

[budget]
daily_limit_usd = 5.00
action = "block"

Set action = "warn" to log warnings, or "block" to return HTTP 429 when the limit is hit. Never wake up to a surprise LLM bill again.
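
The warn-versus-block behavior can be sketched in a few lines of Python. This is illustrative only: the "warn" and "block" values match the TOML config above, but the function is a hypothetical stand-in for llmux's internal check:

```python
# Hypothetical budget check mirroring the config above.
def check_budget(spent_usd, limit_usd, action):
    if spent_usd < limit_usd:
        return 200, None  # under budget: pass the request through
    if action == "warn":
        return 200, "warning: daily budget exceeded"  # log, but allow
    return 429, "budget exceeded"  # hard block with HTTP 429

print(check_budget(4.20, 5.00, "block"))  # (200, None)
print(check_budget(5.10, 5.00, "warn"))   # (200, 'warning: daily budget exceeded')
print(check_budget(5.10, 5.00, "block"))  # (429, 'budget exceeded')
```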

Why Rust?

  • Single binary — no runtime, no dependencies, no Docker required
  • ~15MB RAM at steady state
  • <2ms proxy overhead per request
  • Thread-safe — handles concurrent requests with zero data races

The entire gateway is ~1,600 lines of Rust across 7 modules.

What is Next

llmux is the first in a trilogy:

  1. llmux (this project) — gateway, caching, cost tracking
  2. llm-lens — observability and tracing for AI agents, builds on llmux request capture
  3. llm-guard — runtime safety monitor, detects loops, hallucinations, budget overruns

Each one is standalone but they compose into a full AI agent infrastructure stack.

Try It

git clone https://github.com/LakshmiSravyaVedantham/llmux.git
cd llmux && cargo build --release

Star the repo if this is useful: github.com/LakshmiSravyaVedantham/llmux


llmux is MIT licensed and open source. Built with Rust, axum, tokio, ratatui, and rusqlite.
