How I cut my LLM API bill by ~60% (5 levers that actually work)

Markys Lindred — Tue, 30 Jun 2026 01:30:26 +0000

After a few surprise invoices from OpenAI and Anthropic, I spent a weekend figuring out where the money actually goes when you call an LLM API. Here are the five levers that moved my bill the most, roughly in order of impact.

1. Cache your static prefix

Every chat turn re-sends the same system prompt. All three major providers let you cache that prefix and bill repeat reads at ~10% of the input rate. On a chatbot with a 2,000-token system message this alone cut my input cost ~80%.

2. Output is ~5× input — cap it

On every frontier model, output tokens cost about five times input tokens. Generation is autoregressive; input is processed in parallel. Setting max_tokens aggressively and prompting for terse answers is the single easiest win.

3. Route by difficulty

Don't send "extract this email" to a flagship. A two-tier setup — a cheap model (Haiku / Flash-Lite / Nano) for the easy 80%, a flagship for the hard 20% — saved me 60–85% versus running everything through the big model.

4. Batch what can wait

Nightly summaries, eval runs, enrichment — anything that can tolerate 24h — gets a flat 50% discount via the Batch API on every major vendor.

5. Watch the tokenizer

Non-English text (Cyrillic, CJK) tokenizes 2–4× worse than English, so it costs 2–4× more. If you serve a multilingual audience this is a real multiplier, and some models (Gemini, DeepSeek) handle it better than the GPT-4 family.

Estimating before you ship

The thing that helped most was estimating cost before writing the code. I've been using a free calculator that takes a prompt + model and shows per-call and at-scale cost with input/output priced separately: gpt-cost.com. It also has per-model pages (e.g. Claude Opus 4.8) and a deeper write-up on the cheapest LLM by workload that informed a lot of the above.

What levers am I missing? Curious what's worked for others at scale.ai, llm, webdev, cost