DEV Community

Lakshmi Sravya Vedantham

I Put a 5MB Rust Binary Between My Code and Every LLM API — It Cut My Bill by 40%

Every developer using LLMs faces the same three problems:

  1. Cost blindness — you cannot answer "how much did I spend today?"
  2. No failover — when OpenAI goes down, your app goes down
  3. Wasted money — identical prompts hit the API over and over instead of being cached

I built llmux to fix all three with zero code changes.

What is llmux?

A single Rust binary (~5MB) that sits between your code and every LLM API. It handles failover, caching, rate limiting, and cost tracking automatically.

Your code (any language)
        |
   http://localhost:4000
        |
     ┌───────┐
     │ llmux │  ← single binary, ~5MB
     └──┬────┘
        │
   ┌────┼────┬────────┐
   ▼    ▼    ▼        ▼
OpenAI Claude Gemini  Ollama

Zero Code Changes

You change one environment variable. That is it.

# Before
export OPENAI_BASE_URL=https://api.openai.com

# After
export OPENAI_BASE_URL=http://localhost:4000/v1

Your existing code — Python, TypeScript, Go, whatever — keeps working exactly the same. llmux intercepts the calls and adds superpowers.
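
To make "zero code changes" concrete, here is a minimal Python sketch of a client that only ever reads OPENAI_BASE_URL. It is illustrative (not llmux's code or any SDK), and it assumes the OpenAI-compatible /chat/completions request shape that llmux proxies:

```python
import json
import os
import urllib.request

# Hypothetical minimal client: the only thing that changed is the value of
# OPENAI_BASE_URL, which now points at llmux instead of api.openai.com.
base_url = os.environ.get("OPENAI_BASE_URL", "http://localhost:4000/v1")
endpoint = base_url.rstrip("/") + "/chat/completions"
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    endpoint,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send this request through the gateway.
print(req.full_url)
```

The commented-out urlopen call is the only network step; everything else is ordinary, unchanged client code.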

Quick Start

git clone https://github.com/LakshmiSravyaVedantham/llmux.git
cd llmux
cargo build --release
cp config.example.toml config.toml
# Edit config.toml with your API keys
./target/release/llmux start

That is it. Your gateway is running.

What You Get

  • Multi-provider proxy: routes to OpenAI, Anthropic, Google, Mistral, Ollama
  • Automatic failover: provider down? Requests route to the next one automatically
  • Response caching: identical prompts return cached responses, saving money instantly
  • Token budgets: set daily spend caps, with warn mode or hard block
  • Cost tracking: real-time cost estimation per model, per provider
  • TUI dashboard: live terminal dashboard showing spend, cache hits, request log
  • Request logging: every call logged to embedded SQLite; query it anytime
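
Because the log is plain SQLite, "query anytime" really is a one-liner. Here is a Python sketch; note that the table and column names below are assumptions made up for this example, not llmux's actual schema:

```python
import sqlite3

# Illustration of querying a per-request log like the one llmux keeps in
# embedded SQLite. CAUTION: the table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (ts TEXT, provider TEXT, model TEXT, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?, ?)",
    [
        ("14:23:01", "openai", "gpt-4", 0.045),
        ("14:22:58", "openai", "gpt-4", 0.0),  # cache hit: zero cost
        ("14:22:55", "anthropic", "claude-sonnet", 0.012),
    ],
)
# "How much did I spend today, per provider?"
rows = conn.execute(
    "SELECT provider, ROUND(SUM(cost_usd), 4) AS spend"
    " FROM requests GROUP BY provider ORDER BY provider"
).fetchall()
print(rows)  # [('anthropic', 0.012), ('openai', 0.045)]
```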

The TUI Dashboard

Run llmux dash to see live stats:

┌─ llmux dashboard (q to quit) ──────────────────────┐
│ Requests today:  142                                │
│ Cache hits:       38 (26.8%)                        │
│ Input tokens:    284,000                            │
│ Output tokens:   71,000                             │
│ Spend today:     $3.4200                            │
└─────────────────────────────────────────────────────┘
┌─ Recent Requests ───────────────────────────────────┐
│ Time     Provider  Model          Status  Cost      │
│ 14:23:01 openai    gpt-4          200     $0.0450   │
│ 14:22:58 openai    gpt-4          200     $0.0000   │
│ 14:22:55 anthropic claude-sonnet  200     $0.0120   │
└─────────────────────────────────────────────────────┘

The second request shows $0.00 — that is a cache hit. Same prompt, zero cost.

How Failover Works

Configure providers with priorities:

[[providers]]
name = "openai"
api_key = "${OPENAI_API_KEY}"
priority = 1
base_url = "https://api.openai.com"

[[providers]]
name = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
priority = 2
base_url = "https://api.anthropic.com"

OpenAI returns a 5xx? llmux marks it unhealthy and routes to Anthropic. Connection refused? Same thing. Your app never sees the error.
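
In rough Python terms (the real implementation is in the Rust binary; the provider dicts and send() stub here are illustrative), the routing loop described above looks like this:

```python
# Sketch of priority-ordered failover: try providers in priority order,
# skipping ones already marked unhealthy.
def route(providers, send):
    for p in sorted(providers, key=lambda p: p["priority"]):
        if not p.get("healthy", True):
            continue  # previously marked down, skip it
        try:
            status, body = send(p)
        except ConnectionError:
            p["healthy"] = False  # connection refused: mark unhealthy
            continue
        if status >= 500:
            p["healthy"] = False  # 5xx: mark unhealthy, try the next one
            continue
        return p["name"], body
    raise RuntimeError("all providers failed")

providers = [
    {"name": "openai", "priority": 1},
    {"name": "anthropic", "priority": 2},
]

def fake_send(p):
    # Simulate OpenAI returning a 5xx so the request fails over.
    return (503, "") if p["name"] == "openai" else (200, "ok")

print(route(providers, fake_send))  # ('anthropic', 'ok')
```

The caller only ever sees the successful response; the 503 is absorbed by the loop.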

The Cache Math

If 25% of your LLM calls are identical prompts (common in dev workflows, CI, repeated queries), and you spend $100/month:

  • Without llmux: $100/month
  • With llmux: $75/month (the 25% of calls served from cache cost $0)
  • With aggressive caching (1-hour TTL): roughly $60/month
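
The arithmetic above is just spend times (1 - hit rate). Note the 40% hit rate in the aggressive case is my inference from the $60 figure, not a number llmux reports:

```python
# Effective spend = monthly_spend * (1 - cache_hit_rate).
# The 0.40 hit rate is inferred from the $60/month figure above.
monthly_spend = 100.0
for hit_rate in (0.25, 0.40):
    effective = monthly_spend * (1 - hit_rate)
    print(f"{hit_rate:.0%} cache hits -> ${effective:.0f}/month")
```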

The cache is an in-memory LRU with a configurable TTL. Keys are SHA-256 hashes of the provider plus the request body, and cached responses never leave your machine.
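
A minimal sketch of that key derivation, assuming nothing beyond "SHA-256 of provider + request body" (the function name is mine, not llmux's API):

```python
import hashlib

# Hypothetical cache-key helper: SHA-256 over the provider name plus the
# raw request body, as described above.
def cache_key(provider: str, body: bytes) -> str:
    return hashlib.sha256(provider.encode() + body).hexdigest()

body = b'{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}'
k1 = cache_key("openai", body)
k2 = cache_key("openai", body)
k3 = cache_key("anthropic", body)
print(k1 == k2, k1 != k3)  # True True: identical requests share one key
```

Because the provider is part of the key, the same prompt sent to two different providers is cached separately.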

Budget Protection

[budget]
daily_limit_usd = 5.00
action = "block"

Set action = "warn" to log warnings, or "block" to return HTTP 429 when the limit is hit. Never wake up to a surprise LLM bill again.
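
The warn-versus-block behavior can be sketched in a few lines of Python. This is illustrative only: the "warn" and "block" values match the TOML config above, but the function is a hypothetical stand-in for llmux's internal check:

```python
# Hypothetical budget check mirroring the config above.
def check_budget(spent_usd, limit_usd, action):
    if spent_usd < limit_usd:
        return 200, None  # under budget: pass the request through
    if action == "warn":
        return 200, "warning: daily budget exceeded"  # log, but allow
    return 429, "budget exceeded"  # hard block with HTTP 429

print(check_budget(4.20, 5.00, "block"))  # (200, None)
print(check_budget(5.10, 5.00, "warn"))   # (200, 'warning: daily budget exceeded')
print(check_budget(5.10, 5.00, "block"))  # (429, 'budget exceeded')
```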

Why Rust?

  • Single binary — no runtime, no dependencies, no Docker required
  • ~15MB RAM at steady state
  • <2ms proxy overhead per request
  • Thread-safe — handles concurrent requests with zero data races

The entire gateway is ~1,600 lines of Rust across 7 modules.

What is Next

llmux is the first in a trilogy:

  1. llmux (this project) — gateway, caching, cost tracking
  2. llm-lens — observability and tracing for AI agents, builds on llmux request capture
  3. llm-guard — runtime safety monitor, detects loops, hallucinations, budget overruns

Each one is standalone but they compose into a full AI agent infrastructure stack.

Try It

git clone https://github.com/LakshmiSravyaVedantham/llmux.git
cd llmux && cargo build --release

Star the repo if this is useful: github.com/LakshmiSravyaVedantham/llmux


llmux is MIT licensed and open source. Built with Rust, axum, tokio, ratatui, and rusqlite.
