LLM API bills can spiral fast once you're in production. Here are eight concrete techniques that actually move the needle, ranked roughly by impact.
1. Cache repeated prompts
If your app sends the same system prompt or common queries repeatedly, you're paying for the same computation over and over. Even a simple in-memory cache keyed on the exact prompt text can eliminate a meaningful chunk of spend — in our own usage data, repeated identical prompts accounted for a noticeable share of total cost.
2. Use cheaper models for simpler tasks
Not every request needs your most capable model. Classification, simple extraction, and short-form responses often work fine on smaller, cheaper models (GPT-4o-mini, Claude Haiku, Gemini Flash). Reserve the expensive models for tasks that actually need the reasoning power.
3. Trim your system prompts
Long system prompts get sent with every single request. If yours has grown organically over months of tweaks, audit it — every redundant sentence is a recurring cost multiplied by your request volume.
4. Set hard output limits
Use max_tokens aggressively. Open-ended generation tasks can produce far more output than you need, and you pay per token either way.
5. Batch requests where possible
Some providers offer batch APIs at a discount (often 50% off) for non-real-time workloads. If you're processing things asynchronously — summarizing a backlog, generating reports — batch APIs are free money left on the table if you're not using them.
6. Monitor for anomalies, not just totals
A monthly total doesn't tell you when something went wrong. A buggy retry loop or an unexpected usage spike can burn through a budget in hours. Daily-level monitoring with alerting on deviations from your normal spend catches this before it becomes a surprise bill.
7. A/B test before committing
Before switching your whole app to a "cheaper" model, actually measure it. Sometimes a cheaper model needs more retries or longer prompts to get usable output, which erases the savings. Compare cost AND output quality side by side on real traffic.
8. Know your per-feature cost breakdown
If you can't answer "which feature in my app costs the most," you can't prioritize optimization. Tagging requests by feature or use case (even just in your logs) turns a vague cost problem into a concrete, fixable one.
I built LLMWatch after running into most of these problems myself — it's a proxy that logs cost/latency per request, flags repeated prompts you could cache, and warns you when spend spikes. Free tier covers 1,000 requests/month if you want to see your own breakdown.
Top comments (0)