Six months ago I was spending $400/month on OpenAI API calls for a side project with exactly zero paying users. Today that same project runs on $11/month — mostly electricity. Here's the exact stack and the decisions that got me there.
Why "Just Use GPT-4" Is a Trap for Builders
The seduction is real: paste your API key, make one client.chat.completions.create() call, ship. Works great until you actually get traction, or until you realize you've been burning tokens on tasks that don't need frontier model intelligence.
The mistake is treating AI cost as a fixed percentage of revenue. It's not. It's an engineering problem.
The math is brutal on long-context tasks. Summarizing a 10,000-token document with GPT-4o costs ~$0.025. Run that 500 times a day and you're at $375/month on one feature. Run the same task with Mistral 7B locally: the compute is free after hardware, or roughly $0.0003/request on a cheap VPS.
That 97% number isn't marketing. It's what happens when you stop treating every LLM call like it needs the smartest model in the room.
The Two-Tier Model Strategy
The biggest lever is model routing — sending tasks to the cheapest model that can handle them reliably.
Tier 1: Local models via Ollama handle 80% of requests:
- Classification, tagging, intent detection → llama3.2:3b
- Summarization, extraction, reformatting → mistral:7b
- Code completion for known patterns → qwen2.5-coder:7b
- Embedding generation → nomic-embed-text
Tier 2: API calls (Claude/GPT-4) handle the 20% that actually need it:
- Complex reasoning chains
- Novel code generation
- Tasks requiring up-to-date knowledge
- Anything customer-facing where quality directly impacts conversion
Setting this up with Ollama takes about 20 minutes:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral:7b
ollama pull llama3.2:3b
ollama pull nomic-embed-text
The OpenAI-compatible API endpoint (localhost:11434/v1) means zero code changes if you're already using the OpenAI SDK — just swap the base URL and model name.
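As a rough sketch, if you're already on the official OpenAI Python SDK, the swap looks like this (the model name and prompt are placeholders):

from openai import OpenAI

# Point the existing SDK at the local Ollama server; Ollama ignores the API key,
# but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "Summarize this in two sentences: ..."}],
)
print(resp.choices[0].message.content)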
Token Reduction That Actually Moves the Needle
Most developers optimize the wrong thing. They agonize over prompt wording when the real gains come from structural changes.
Strip your system prompts ruthlessly. A system prompt that says "You are a helpful AI assistant that specializes in..." is wasting 15-30 tokens on every single call. Rewrite it as instructions only: "Extract the company name, role, and pain point from this message. Return JSON."
Use structured output instead of parsing prose. Asking an LLM to "explain the sentiment and give me a score" then regex-parsing the answer is expensive and fragile. Use response_format: { type: "json_object" } or grammar-constrained generation via llama.cpp. You get smaller, more predictable outputs.
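Roughly, with the same OpenAI-style client as above (the keys and example text are illustrative; recent Ollama versions accept json_object mode on the /v1 endpoint too):

import json

review_text = "The onboarding was confusing but support was great."

# Ask for JSON directly instead of prose you'd have to regex apart.
resp = client.chat.completions.create(
    model="mistral:7b",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return JSON with keys 'sentiment' and 'score' (0-1)."},
        {"role": "user", "content": review_text},
    ],
)
result = json.loads(resp.choices[0].message.content)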
Cache aggressively at the semantic level. Not just exact-match caching — semantic caching. If 40% of your users ask functionally identical questions with different phrasing, you're paying for the same generation 40% of the time. A simple vector similarity check against recent responses with a 0.92 threshold cuts repeat inference costs dramatically. Redis + your local embedding model handles this for free.
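A minimal sketch of the idea, kept in an in-memory list instead of Redis for brevity; embeddings come from nomic-embed-text through the same client, and generate_answer stands in for whatever LLM call you'd normally make:

import numpy as np

SIM_THRESHOLD = 0.92
cache = []  # list of (normalized embedding, response) tuples

def embed(text: str) -> np.ndarray:
    # Local embedding via Ollama's OpenAI-compatible /v1/embeddings endpoint.
    resp = client.embeddings.create(model="nomic-embed-text", input=text)
    v = np.array(resp.data[0].embedding)
    return v / np.linalg.norm(v)

def cached_or_generate(query: str) -> str:
    q = embed(query)
    for vec, answer in cache:
        # Cosine similarity, since both vectors are normalized.
        if float(np.dot(q, vec)) >= SIM_THRESHOLD:
            return answer  # semantically identical question already answered
    answer = generate_answer(query)  # placeholder for your normal LLM call
    cache.append((q, answer))
    return answer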
Context window hygiene. Every token in your context costs money on API calls. Audit what you're actually sending. I found a bug in my code that was including a full 3,000-token RAG retrieval even when the query matched a cached answer. That alone was 18% of my bill.
Deploying on a $6/Month VPS
You don't need a GPU for 7B models. A CPU-only VPS with 16GB RAM runs mistral:7b at 8-12 tokens/second — slow for interactive chat, fast enough for async background processing.
For a hobby project or internal tool, this setup works:
- $6/month — 2 vCPU / 8GB RAM: llama3.2:3b, fast, 15+ tok/s
- $12/month — 4 vCPU / 16GB RAM: mistral:7b, usable for most tasks
- $24/month — 8 vCPU / 32GB RAM: mixtral or a heavily quantized llama3:70b
Hetzner and Contabo are the go-to options here. Oracle Cloud Free Tier gives you 4 ARM cores and 24GB RAM permanently free — Mistral 7B runs fine on it.
Run Ollama as a systemd service, expose it on Tailscale (not public internet), and you have a private inference endpoint your whole stack can hit.
[Service]
ExecStart=/usr/bin/ollama serve
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=multi-user.target
The Hybrid Routing Pattern in Practice
Here's the actual routing logic I use, simplified:
FAST_TASKS = {"classify", "extract", "tag", "embed", "summarize"}
SMART_TASKS = {"generate", "reason", "plan", "debug"}
def route_request(task_type: str, complexity_score: float):
if task_type in FAST_TASKS and complexity_score < 0.6:
return {"base_url": "http://ollama-host:11434/v1", "model": "mistral:7b"}
return {"base_url": "https://api.anthropic.com", "model": "claude-haiku-4-5"}
complexity_score is just a heuristic — input length, presence of technical jargon, whether it's a multi-step task. You don't need a model to compute it; a few rules get you 90% of the value.
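Something like this is enough to start; the length cutoff, keyword list, and weights below are made up for illustration, so tune them against your own logs:

def complexity_score(text: str) -> float:
    # Crude rule-based heuristic: long inputs, technical jargon, and
    # multi-step requests all push the task toward the smart tier.
    score = 0.0
    if len(text) > 2000:
        score += 0.3
    if any(kw in text.lower() for kw in ("stack trace", "refactor", "architecture")):
        score += 0.3
    if text.count("\n- ") > 3 or "step by step" in text.lower():
        score += 0.3
    return min(score, 1.0)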
The key insight: Claude Haiku and GPT-4o-mini are your fallback tier, not your default. They're still 10-50x cheaper than frontier models and should handle the tasks that local models can't.
Measuring What Actually Matters
You can't optimize what you don't measure. Add this to every LLM call from day one:
{
    "model": model_name,
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    "latency_ms": elapsed,
    "task_type": task_type,
    "local": is_local,
}
Log it to a SQLite file, query it weekly. Within two weeks you'll see exactly which features are burning budget and whether the local model quality is acceptable for each task type.
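A minimal version of that logging, assuming the dict above and an arbitrary llm_costs.db filename:

import sqlite3

db = sqlite3.connect("llm_costs.db")
db.execute("""CREATE TABLE IF NOT EXISTS calls (
    ts DEFAULT CURRENT_TIMESTAMP, model TEXT, input_tokens INT,
    output_tokens INT, latency_ms REAL, task_type TEXT, local INT)""")

def log_call(entry: dict) -> None:
    # entry is the per-call dict shown above.
    db.execute(
        "INSERT INTO calls (model, input_tokens, output_tokens, latency_ms, task_type, local) "
        "VALUES (:model, :input_tokens, :output_tokens, :latency_ms, :task_type, :local)",
        entry,
    )
    db.commit()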
Quality monitoring matters too. For every 1,000 local model calls, sample 50 and run them through your frontier model as a judge. If agreement drops below 85%, that task type needs to move up a tier.
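A sketch of that spot check; the judge callable here is whatever frontier-model prompt you trust to return pass/fail on a local answer:

import random

def agreement_rate(recent_calls: list[dict], judge, sample_size: int = 50) -> float:
    # Sample local-model calls and count how many the frontier judge accepts.
    sample = random.sample(recent_calls, min(sample_size, len(recent_calls)))
    ok = sum(judge(c["prompt"], c["local_answer"]) for c in sample)
    return ok / len(sample)

# If agreement_rate(...) drops below 0.85, promote that task type to the API tier.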
The full breakdown — hardware configs, quantization settings, the routing framework with benchmarks, and cost tracking templates — is something I've been putting together for months.
I compiled everything into a practical guide: Local LLMs: Réduisez vos coûts IA de 97%