After a few surprise invoices from OpenAI and Anthropic, I spent a weekend figuring out where the money actually goes when you call an LLM API. Here are the five levers that moved my bill the most, roughly in order of impact.
1. Cache your static prefix
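Every chat turn re-sends the same system prompt. OpenAI, Anthropic, and Google all let you cache that prefix and bill repeat reads at a steep discount: Anthropic charges about 10% of the normal input rate for cache reads, and the exact discount elsewhere varies by provider and model. On a chatbot with a 2,000-token system message this alone cut my input cost by ~80%.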
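To see where a number like that can come from, here is a rough back-of-the-envelope sketch. The 10% read rate is the figure quoted above; the 250 fresh tokens per turn and the function name are illustrative assumptions of mine, and any one-time cache-write surcharge is ignored.

```python
def cached_input_saving(prefix_tokens: int, dynamic_tokens: int,
                        cache_read_rate: float = 0.10) -> float:
    """Fraction of input cost saved per call once the prefix is cached.

    Costs are measured in "tokens at the normal input rate", so no real
    prices are needed. Ignores any one-time cache-write surcharge.
    """
    uncached = prefix_tokens + dynamic_tokens
    cached = prefix_tokens * cache_read_rate + dynamic_tokens
    return 1 - cached / uncached


# 2,000-token system prompt; 250 tokens of fresh user input is an assumed figure.
saving = cached_input_saving(prefix_tokens=2000, dynamic_tokens=250)
print(f"steady-state input saving: {saving:.0%}")
```

With those assumed numbers the saving works out to 80%. The longer the per-turn user input is relative to the cached prefix, the smaller the saving gets.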
2. Output is ~5× input — cap it
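On most frontier models, output tokens cost about five times as much as input tokens, though the exact ratio varies by model. The reason is that output is generated autoregressively, one token at a time, while the input is processed in parallel in a single pass. Setting max_tokens aggressively and prompting for terse answers is the single easiest win.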
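A minimal sketch of the per-call arithmetic, with input and output priced separately. The $3 and $15 per million tokens are placeholder prices I picked to match the 5× ratio, not a quote from any vendor's price list, and the token counts are made up.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float = 3.00,     # placeholder price, USD
              output_price_per_mtok: float = 15.00    # placeholder: 5x input
              ) -> float:
    """USD cost of one call, with input and output priced separately."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Same 1,000-token prompt, answered verbosely versus tersely (assumed sizes).
verbose = call_cost(input_tokens=1000, output_tokens=800)
terse = call_cost(input_tokens=1000, output_tokens=200)
print(f"verbose: ${verbose:.4f}  terse: ${terse:.4f}")
```

In the verbose case the 800 output tokens account for most of the cost even though the prompt is longer. Keep in mind that max_tokens is a hard cutoff that truncates the answer rather than shortening it, so it works best paired with a prompt that actually asks for brevity.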
3. Route by difficulty
Don't send "extract this email" to a flagship. A two-tier setup — a cheap model (Haiku / Flash-Lite / Nano) for the easy 80%, a flagship for the hard 20% — saved me 60–85% versus running everything through the big model.
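Roughly, a two-tier router and its blended cost could be sketched like this. The task labels, the model names, and the 10:1 price ratio are all hypothetical placeholders; a real router would likely need a better difficulty signal than a fixed set of task types.

```python
EASY_TASKS = {"extract", "classify", "reformat"}  # hypothetical task labels


def pick_model(task_type: str) -> str:
    """Send known-easy task types to the cheap tier, everything else to the flagship."""
    # Model names are placeholders, not real model IDs.
    return "cheap-model" if task_type in EASY_TASKS else "flagship-model"


def blended_saving(easy_share: float, cheap_price_ratio: float) -> float:
    """Fraction saved versus sending every request to the flagship.

    cheap_price_ratio is the cheap model's price divided by the flagship's.
    Assumes both tiers see the same token counts per request.
    """
    blended = easy_share * cheap_price_ratio + (1 - easy_share)
    return 1 - blended


print(pick_model("extract"), pick_model("multi-step-reasoning"))
print(f"saving: {blended_saving(easy_share=0.8, cheap_price_ratio=0.10):.0%}")
```

With an 80/20 split and a cheap tier at an assumed one tenth of the flagship price, the saving comes out to 72%, which sits inside the 60–85% range above. The result is sensitive to both inputs, so it is worth plugging in your own traffic mix.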
4. Batch what can wait
Nightly summaries, eval runs, enrichment: anything that can wait up to 24 hours for a result gets a flat 50% discount through the batch APIs that the major vendors offer.
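As a sketch of how that split might look in a cost plan, the snippet below sends every job that can wait out the batch window to the discounted tier. The 50% and 24h constants come from the text above; the Job fields, job names, and dollar amounts are made up for illustration.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50      # the flat discount quoted above
BATCH_WINDOW_HOURS = 24.0  # the turnaround window quoted above


@dataclass
class Job:
    name: str
    realtime_cost_usd: float  # cost at normal, non-batch rates
    max_wait_hours: float     # how long the result can wait


def plan(jobs: list[Job]) -> tuple[float, list[str]]:
    """Return (total cost in USD, names of the jobs routed to batch)."""
    total = 0.0
    batched = []
    for job in jobs:
        if job.max_wait_hours >= BATCH_WINDOW_HOURS:
            total += job.realtime_cost_usd * (1 - BATCH_DISCOUNT)
            batched.append(job.name)
        else:
            total += job.realtime_cost_usd
    return total, batched


# Hypothetical workload: one latency-sensitive job, two that can wait.
jobs = [
    Job("chat replies", 10.0, 0.0),
    Job("nightly summaries", 40.0, 24.0),
    Job("eval runs", 50.0, 48.0),
]
total, batched = plan(jobs)
print(f"total: ${total:.2f}, batched: {batched}")
```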
5. Watch the tokenizer
Non-English text (Cyrillic, CJK) often tokenizes into 2–4× as many tokens as the equivalent English, so the same content costs 2–4× more. If you serve a multilingual audience this is a real multiplier, and some models (Gemini, DeepSeek) handle it better than the GPT-4 family.
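One way to measure that multiplier for your own traffic is to count tokens for a translated sample and its English equivalent with the same tokenizer. The sketch below takes the tokenizer as a parameter; the byte_proxy function is a crude stand-in of my own so the example runs without dependencies, and it should be swapped for the real tokenizer or token-counting endpoint of the model you use.

```python
from typing import Callable


def token_multiplier(sample: str, english_baseline: str,
                     count_tokens: Callable[[str], int]) -> float:
    """How many times more tokens `sample` uses than its English equivalent."""
    return count_tokens(sample) / count_tokens(english_baseline)


def byte_proxy(text: str) -> int:
    """CRUDE STAND-IN, not a real tokenizer: one 'token' per 4 UTF-8 bytes."""
    n_bytes = len(text.encode("utf-8"))
    return max(1, -(-n_bytes // 4))  # ceiling division


# Replace byte_proxy with your model's real token counter before trusting this.
ratio = token_multiplier("Привет, как дела?", "Hello, how are you?", byte_proxy)
print(f"token multiplier (stand-in tokenizer): {ratio:.2f}x")
```

Since cost scales linearly with token count, the measured multiplier can be applied directly to a per-call cost estimate for that language.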
Estimating before you ship
The thing that helped most was estimating cost before writing the code. I've been using a free calculator that takes a prompt + model and shows per-call and at-scale cost with input/output priced separately: gpt-cost.com. It also has per-model pages (e.g. Claude Opus 4.8) and a deeper write-up on the cheapest LLM by workload that informed a lot of the above.
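If you would rather start with a scratch script, the underlying arithmetic is simple enough to sketch. This is not how the calculator above works internally, just my assumption of what any such estimate boils down to; the prices and traffic numbers are placeholders.

```python
def estimate(calls_per_day: int, input_tokens: int, output_tokens: int,
             input_price_per_mtok: float, output_price_per_mtok: float,
             days: int = 30) -> dict:
    """Per-call and at-scale cost in USD, keeping input and output separate."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    per_call = input_cost + output_cost
    return {
        "input_per_call": input_cost,
        "output_per_call": output_cost,
        "per_call": per_call,
        "monthly": per_call * calls_per_day * days,
    }


# Placeholder workload and prices; substitute your model's real rates.
est = estimate(calls_per_day=10_000, input_tokens=1200, output_tokens=300,
               input_price_per_mtok=3.00, output_price_per_mtok=15.00)
for key, value in est.items():
    print(f"{key}: ${value:,.4f}")
```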
What levers am I missing? Curious what's worked for others at scale.