Muhammed ali Ceylan

Posted on Jun 10 • Originally published at apicalculators.com

I Compared GPT-4o vs Claude vs Mistral API Costs for My SaaS — The Numbers Shocked Me

#ai #programming #showdev #webdev

I was building a document Q&A feature for my SaaS.
Estimated 100,000 LLM requests per month.
Picked GPT-4o without thinking.
Then I actually ran the numbers.

Here's what I found.

The Setup

Typical request profile for a document Q&A backend:

Input tokens per request: 1,500 (system prompt + retrieved context)
Output tokens per request: 500 (answer)
Volume: 100,000 requests/month

Simple calculation. Turns out not so simple on the wallet.

The Cost Table

Model	Input $/1M	Output $/1M	Monthly Cost
GPT-4o	$2.50	$10.00	$875
Claude 3.5 Sonnet	$3.00	$15.00	$1,200
Mistral Large	$2.00	$6.00	$600
Llama 3.1 70B (Together AI)	$0.88	$0.88	$220
GPT-4o mini	$0.15	$0.60	$52
Claude 3.5 Haiku	$0.80	$4.00	$320
Gemini 1.5 Flash	$0.075	$0.30	$26

$875/month vs $26/month for the same 100K requests.
That's a 33× price gap between GPT-4o and Gemini Flash.

What I Actually Did

I didn't just blindly switch to the cheapest model.
I ran a tiered approach:

Routing layer (GPT-4o mini) → classifies the query complexity → $52/month

Simple queries (Gemini Flash) → factual lookups, short answers → $26/month

Complex queries (GPT-4o) → reasoning, synthesis, long-form → $175/month

Total: ~$253/month instead of $875.
Same quality. 71% cheaper.

The Hidden Cost: Context Bloat

Most tutorials show you per-token pricing.
Nobody talks about context window bloat.

As your conversation history grows, your input tokens explode:

Turn 1: 1,500 tokens input
Turn 5: 6,000+ tokens input (full history)
Turn 10: 12,000+ tokens input

At GPT-4o pricing, a 10-turn conversation costs 8× more than a single request.
Solutions: summarize history after turn 3, use semantic compression, or cache repeated context.

Batch API: The 50% Discount Nobody Uses

OpenAI's Batch API gives you 50% off for non-realtime workloads.
Same models. Same quality. Just async (results in ~24h).

Use cases that work perfectly with batch:

Document indexing pipelines
Nightly report generation
Bulk content classification
Offline data enrichment

If your use case tolerates async, you're leaving half your budget on the table.

Prompt Caching: 75–90% Off Repeated Context

Anthropic's prompt caching lets you cache your system prompt + static context.
Cache hit cost: ~10% of normal input price.

For document Q&A with a fixed system prompt (say 2,000 tokens),
caching saves you 90% on that chunk every request.
At 100K requests/month, that's meaningful.

The Calculator I Used

I was doing all this math in spreadsheets until I found
APICalculators.com —
a free browser-based LLM cost calculator.

You plug in your token averages and monthly volume,
it shows you the breakdown across all major providers instantly.
No signup, runs locally.

Useful for sanity-checking before you commit to a model in production.

The Decision Framework

Under $50/month budget:

Gemini Flash or GPT-4o mini. Full stop.

$50–$200/month, quality matters:

Claude 3.5 Haiku or Mistral Small.
Good reasoning, fraction of flagship cost.

$200–$500/month, complex reasoning needed:

GPT-4o mini for routing + GPT-4o for hard queries only.
Model routing cuts cost 60–70%.

Over $500/month:

Audit your prompts first.
Most overspend comes from bloated system prompts, not model choice.

TL;DR

GPT-4o is 33× more expensive than Gemini Flash at the same volume
Model routing (cheap router + expensive worker) cuts costs 60–70%
Batch API = 50% discount for async workloads
Prompt caching = 75–90% off repeated context
Use a cost calculator before picking a model in prod

What's your current LLM spend? Have you tried model routing?

DEV Community