Last year I deployed a GPT-4o powered support chatbot for a small SaaS. Traffic was modest — maybe 500 active users. I checked the OpenAI bill at the end of the month.
$400.
The app was supposed to cost $40/month. I'd done the mental math: "1,000 API calls a day, maybe 500 tokens each, GPT-4o is $10/M output... that's like $50/month, fine."
What I forgot: my system prompt was 600 tokens. It was sent on every single call. At 8,000 daily calls (users were chatty), that's 4.8M extra input tokens per day just from the system prompt. At $2.50/M, that's $12/day — $360/month — from a prompt I copy-pasted and never measured.
That's when I built the LLM API Cost Calculator — a free tool that covers 18 models, 7 currencies, a live token counter, and a Compare tab that ranks every model by cost for your exact workload. Let me show you how to use it properly, and walk through the math so you never have a surprise bill month.
The Mental Model Most Developers Get Wrong
Before touching any calculator, you need to understand one thing: output tokens cost 3–5× more than input tokens, and your architecture determines which one dominates your bill.
Take Claude Sonnet 4:
- Input: $3.00 per million tokens
- Output: $15.00 per million tokens
That's a 5× ratio. Now look at two very different workloads:
Sentiment classification:
- Input: 300 tokens (review + system prompt)
- Output: 10 tokens ("positive" / "negative")
- 5,000 calls/day
Input cost: 300 × 5,000 / 1,000,000 × $3.00 = $4.50/day
Output cost: 10 × 5,000 / 1,000,000 × $15.00 = $0.75/day
→ Input dominates. Optimize your system prompt.
Support chatbot:
- Input: 800 tokens (history + system prompt + user message)
- Output: 400 tokens (detailed response)
- 1,000 calls/day
Input cost: 800 × 1,000 / 1,000,000 × $3.00 = $2.40/day
Output cost: 400 × 1,000 / 1,000,000 × $15.00 = $6.00/day
→ Output dominates. Cap response length with max_tokens.
Same model, radically different cost structure. The cost calculator's breakdown bar shows you exactly this split so you know which side to optimize.
LLM Pricing Snapshot — June 2026
Here's where the major models sit right now. All prices are per million tokens (input / output):
| Model | Input | Output | Best For |
|---|---|---|---|
| Mistral Small 3 | $0.10 | $0.30 | Cheapest input overall |
| Llama 4 Scout | $0.11 | $0.34 | High-volume classification |
| GPT-4o mini | $0.15 | $0.60 | Budget general-purpose |
| Gemini 2.5 Flash | $0.15 | $0.60 | Cheap + thinking mode |
| DeepSeek V3 | $0.27 | $1.10 | Coding, open-source value |
| DeepSeek R1 | $0.55 | $2.19 | Reasoning tasks |
| Llama 3.3 70B | $0.59 | $0.79 | Balanced open-source |
| o4-mini | $1.10 | $4.40 | OpenAI reasoning, budget |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M context window |
| o3 | $2.00 | $8.00 | Advanced reasoning |
| Mistral Large 2 | $2.00 | $6.00 | EU data residency |
| GPT-4o | $2.50 | $10.00 | Flagship OpenAI |
| Claude Sonnet 4 | $3.00 | $15.00 | Coding, instruction-following |
| Grok 3 Mini | $0.30 | $0.50 | Small gap in/out pricing |
| Grok 3 | $3.00 | $15.00 | xAI flagship |
| Claude Haiku 3.5 | $0.80 | $4.00 | Budget Anthropic |
| Claude Opus 4 | $15.00 | $75.00 | Most expensive overall |
The price gap is brutal: Claude Opus 4 output costs 250× more than Llama 4 Scout output. For most real workloads, that premium is unjustifiable.
4 Real Scenarios With Real Numbers
Rather than abstract pricing, let me run 4 actual workloads through the calculator. You can replicate all of these yourself — just open the Compare Models tab and enter these numbers.
Scenario 1 — Startup Chatbot (500 users/day)
800 input / 400 output / 1,000 calls/day
| Model | Monthly Cost |
|---|---|
| Llama 4 Scout | ~$7 |
| GPT-4o mini | ~$7 |
| Claude Haiku 3.5 | ~$40 |
| GPT-4o | ~$120 |
| Claude Sonnet 4 | ~$144 |
| Claude Opus 4 | ~$720 |
If your chatbot handles general questions, GPT-4o mini vs GPT-4o is literally $113/month saved — $1,356/year — for quality most users won't notice.
Scenario 2 — RAG Document Search (Enterprise)
3,000 input (chunked docs) / 500 output / 500 calls/day
| Model | Monthly Cost |
|---|---|
| DeepSeek V3 | ~$25 |
| Gemini 2.5 Flash | ~$36 |
| Gemini 2.5 Pro | ~$112 |
| Claude Sonnet 4 | ~$189 |
DeepSeek V3 at $25/month vs Gemini 2.5 Pro at $112/month for RAG. Unless you need Gemini's 1M context window for very large documents, that's a 78% cost reduction for identical architecture.
Scenario 3 — Code Review in CI/CD
2,000 input (diff + context) / 800 output / 200 calls/day
| Model | Monthly Cost |
|---|---|
| DeepSeek V3 | ~$14 |
| GPT-4o | ~$54 |
| Claude Sonnet 4 | ~$72 |
This is a case where quality might justify cost. Claude Sonnet 4 genuinely outperforms on nuanced code review. But $72 vs $14/month is a real conversation — benchmark 100 real diffs before committing.
Scenario 4 — High-Volume Classification (5,000 calls/day)
300 input / 50 output / 5,000 calls/day
| Model | Monthly Cost |
|---|---|
| Mistral Small 3 | ~$4.50 |
| Llama 4 Scout | ~$5.40 |
| GPT-4o mini | ~$6.75 |
| Claude Haiku 3.5 | ~$12 |
For simple classification, you're choosing between $4.50 and $12/month. Mistral Small 3 wins unless you specifically need Anthropic's API capabilities.
The System Prompt Tax (What Killed My Budget)
Here's the thing I got wrong — and it's the most common mistake I see in production AI apps.
A 400-token system prompt at 10,000 calls/day:
400 tokens × 10,000 calls = 4,000,000 input tokens/day
At GPT-4o ($2.50/M): $10/day = $300/month
$300/month just from your system prompt. Before a single user message.
What to do about it:
- Measure first. Paste your system prompt into the Token Counter tab and see the exact count.
- Compress it. You can often cut a 400-token system prompt to 200 tokens without losing behavior. Use the AI Prompt Optimizer to reduce token footprint without losing quality — that alone saves $150/month in this example.
- Cache it. OpenAI's Prompt Caching gives you 50% off cached input tokens{:target="_blank"}{:rel="noopener"} for prompts over 1,024 tokens. Anthropic has similar caching. If your system prompt is fixed across calls — and it usually is — you could cut input costs by 40–50% overnight.
Conversation History: The Silent Cost Multiplier
Here's the second thing developers consistently underestimate:
Turn 1: 800 input tokens sent
Turn 2: 800 + 300 (turn 1 response) = 1,100 tokens sent
Turn 3: 1,100 + 300 = 1,400 tokens sent
Turn 4: 1,400 + 300 = 1,700 tokens sent
Turn 5: 1,700 + 300 = 2,000 tokens sent
A 5-turn conversation that starts at 800 tokens averages 1,400 tokens per call — not 800. Your cost estimate needs to reflect the average turn depth, not just the first message.
Strategies:
- Sliding window: Only keep the last N turns in context
- Summarization: After turn 5, summarize history into 200 tokens and continue
- Topic detection: Reset context when the topic changes
The agentic workflows guide covers context management for multi-step agents in detail — the same principles apply to chatbot history.
Agentic Loops: Where Estimates Fall Apart Completely
If you're building AI agents — tools that make multiple API calls in a loop — standard cost estimation breaks down fast.
An agent that makes 5 tool calls per user request, each with a growing context:
Call 1: 2,000 tokens in, 500 out (tool invocation)
Call 2: 2,500 tokens in, 500 out (tool result added)
Call 3: 3,000 tokens in, 500 out
Call 4: 3,500 tokens in, 500 out
Call 5: 4,000 tokens in, 800 out (final response)
Total: 15,000 input + 2,800 output per "1 user request"
At Claude Sonnet 4: $0.045 input + $0.042 output = $0.087 per user request.
Looks small. At 1,000 agent tasks/day: $87/day = $2,610/month.
Use the AI Agent preset in the calculator (8,000 in / 2,000 out / 100 calls) as a starting point, but measure your actual loop depth before estimating production costs.
Sharing Estimates With Your Team
One feature I find genuinely useful: the Share Estimate button encodes everything into the URL:
?m=claude-sonnet-4&in=800&out=400&d=1000&c=PKR
Open that URL and all settings restore automatically. Useful for:
- Sending a cost comparison to a teammate: "Here's Gemini vs DeepSeek for our RAG workload — click the link and switch models"
- Client proposals: pre-fill their expected volume so they can explore numbers themselves
- Budget reviews: bookmark the URL with your current production numbers
Nothing sensitive in the URL — just model name, token counts, call volume, and currency.
Quick Checklist Before You Pick a Model
Before committing to any LLM in production:
- [ ] Measure your actual system prompt token count (Token Counter tab)
- [ ] Estimate real output length from 5+ sample responses — don't guess
- [ ] Account for conversation history growth (average turn depth, not turn 1)
- [ ] Run the Compare Models tab with your actual numbers
- [ ] Check if cheaper models pass a 100-sample quality benchmark for your task
- [ ] Verify if your provider offers caching discounts for your prompt pattern
- [ ] Model the worst-case volume (peak traffic, not average)
- [ ] Switch to the cheapest model during development
Try It
The LLM API Cost Calculator is free, no signup, runs entirely in your browser. 18 models, 7 currencies, token counter, shareable URLs, downloadable CSV reports.
If you're building something where costs matter — and they always do eventually — spending 5 minutes with the Compare tab before picking a model is the highest ROI activity in your planning process.
What's your current monthly API bill? And which model surprised you most with its cost in production? Drop it in the comments — I'm genuinely curious what architectures people are running.
Top comments (0)