I was building a document Q&A feature for my SaaS.
Estimated 100,000 LLM requests per month.
Picked GPT-4o without thinking.
Then I actually ran the numbers.
Here's what I found.
The Setup
Typical request profile for a document Q&A backend:
- Input tokens per request: 1,500 (system prompt + retrieved context)
- Output tokens per request: 500 (answer)
- Volume: 100,000 requests/month
Simple calculation. Turns out not so simple on the wallet.
The Cost Table
| Model | Input $/1M | Output $/1M | Monthly Cost |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $875 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $1,200 |
| Mistral Large | $2.00 | $6.00 | $600 |
| Llama 3.1 70B (Together AI) | $0.88 | $0.88 | $220 |
| GPT-4o mini | $0.15 | $0.60 | $52 |
| Claude 3.5 Haiku | $0.80 | $4.00 | $320 |
| Gemini 1.5 Flash | $0.075 | $0.30 | $26 |
$875/month vs $26/month for the same 100K requests.
That's a 33× price gap between GPT-4o and Gemini Flash.
What I Actually Did
I didn't just blindly switch to the cheapest model.
I ran a tiered approach:
Routing layer (GPT-4o mini) → classifies the query complexity → $52/month
Simple queries (Gemini Flash) → factual lookups, short answers → $26/month
Complex queries (GPT-4o) → reasoning, synthesis, long-form → $175/month
Total: ~$253/month instead of $875.
Same quality. 71% cheaper.
The Hidden Cost: Context Bloat
Most tutorials show you per-token pricing.
Nobody talks about context window bloat.
As your conversation history grows, your input tokens explode:
- Turn 1: 1,500 tokens input
- Turn 5: 6,000+ tokens input (full history)
- Turn 10: 12,000+ tokens input
At GPT-4o pricing, a 10-turn conversation costs 8× more than a single request.
Solutions: summarize history after turn 3, use semantic compression, or cache repeated context.
Batch API: The 50% Discount Nobody Uses
OpenAI's Batch API gives you 50% off for non-realtime workloads.
Same models. Same quality. Just async (results in ~24h).
Use cases that work perfectly with batch:
- Document indexing pipelines
- Nightly report generation
- Bulk content classification
- Offline data enrichment
If your use case tolerates async, you're leaving half your budget on the table.
Prompt Caching: 75–90% Off Repeated Context
Anthropic's prompt caching lets you cache your system prompt + static context.
Cache hit cost: ~10% of normal input price.
For document Q&A with a fixed system prompt (say 2,000 tokens),
caching saves you 90% on that chunk every request.
At 100K requests/month, that's meaningful.
The Calculator I Used
I was doing all this math in spreadsheets until I found
APICalculators.com —
a free browser-based LLM cost calculator.
You plug in your token averages and monthly volume,
it shows you the breakdown across all major providers instantly.
No signup, runs locally.
Useful for sanity-checking before you commit to a model in production.
The Decision Framework
Under $50/month budget:
Gemini Flash or GPT-4o mini. Full stop.
$50–$200/month, quality matters:
Claude 3.5 Haiku or Mistral Small.
Good reasoning, fraction of flagship cost.
$200–$500/month, complex reasoning needed:
GPT-4o mini for routing + GPT-4o for hard queries only.
Model routing cuts cost 60–70%.
Over $500/month:
Audit your prompts first.
Most overspend comes from bloated system prompts, not model choice.
TL;DR
- GPT-4o is 33× more expensive than Gemini Flash at the same volume
- Model routing (cheap router + expensive worker) cuts costs 60–70%
- Batch API = 50% discount for async workloads
- Prompt caching = 75–90% off repeated context
- Use a cost calculator before picking a model in prod
What's your current LLM spend? Have you tried model routing?
Top comments (0)