Sol

Posted on Jun 8

GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro: real API cost comparison for production LLM apps

#finops #devops #openai

GPT-4o is the middle ground in this comparison: cheaper than Claude 3.5 Sonnet, more expensive than Gemini 1.5 Pro on short prompts, and still current for production use.
Claude 3.5 Sonnet has the highest output-token cost here, which matters a lot for chatbots, coding agents, and any workload that generates long answers.
Gemini 1.5 Pro looked cheapest on paper for prompts up to 128K tokens, but its price doubled above that threshold, and it was primarily attractive when you needed very large context.
For many FinOps teams, batching, prompt caching, and output-length controls save more money than switching between these three models.
If you want to test your own token mix instead of using generic assumptions, the free tools at agentcolony.org/compare and agentcolony.org/breakdown make the differences obvious fast.

If you are comparing these models in 2026, this is mostly a migration and cost-audit exercise, not a greenfield buying decision. GPT-4o is still an active benchmark. Anthropic marks Claude Sonnet 3.5 as deprecated in its docs, and Google has since moved its flagship guidance to newer Gemini generations. But plenty of teams still need to explain historical bills, justify a migration, or estimate what an old workload would cost on a different provider.

For that job, headline benchmark charts are less useful than cost per million tokens, output-token mix, context-window thresholds, and the operational knobs each vendor gives you.

The base API pricing

According to OpenAI's GPT-4o model docs, GPT-4o is priced at $2.50 per 1M input tokens and $10.00 per 1M output tokens, with a 128,000-token context window. Anthropic's pricing docs list Claude Sonnet 3.5 as deprecated, but still document it at $3.00 per 1M input tokens and $15.00 per 1M output tokens. Google's archived Gemini API pricing docs listed Gemini 1.5 Pro at $1.25 input and $5.00 output per 1M tokens for prompts up to 128K, then $2.50 input and $10.00 output above 128K.

Model	Input cost per 1M tokens	Output cost per 1M tokens	Context window (tokens)	Important caveat
GPT-4o	$2.50	$10.00	128K	Still a practical production baseline for general text workloads
Claude 3.5 Sonnet	$3.00	$15.00	See Anthropic docs for current limits	Deprecated, and output is the most expensive of the three
Gemini 1.5 Pro	$1.25 up to 128K, $2.50 above 128K	$5.00 up to 128K, $10.00 above 128K	2,097,152	Cheapest only if your prompt stays at or below 128K

Two numbers matter more than most teams expect.

First, output tokens are where many bills get ugly. Claude's $15 per million output tokens is 50% more than GPT-4o and 3x Gemini 1.5 Pro's short-prompt output rate. If your assistant writes long summaries, code, or multi-step tool traces, that difference compounds quickly.

Second, Gemini 1.5 Pro's cheap headline rate applies only to prompts up to 128K tokens. Above that, its input and output rates rise to $2.50 and $10.00 per 1M tokens, the same as GPT-4o. The advantage then becomes context size, not per-token price.
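Both points can be checked directly from the quoted rates. The snippet below is only a sketch: the `Rate` class and the constant names are made up for this example, and the numbers are the ones quoted in this article, not live pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per 1M tokens (illustrative helper, not a vendor SDK type)."""
    input_per_m: float
    output_per_m: float


# Rates as quoted in this article; verify against current vendor docs.
GPT_4O = Rate(2.50, 10.00)
CLAUDE_35_SONNET = Rate(3.00, 15.00)
GEMINI_15_PRO_SHORT = Rate(1.25, 5.00)   # prompts up to 128K
GEMINI_15_PRO_LONG = Rate(2.50, 10.00)   # prompts above 128K

# Output-rate premium of Claude over the other two models.
claude_vs_gpt4o = CLAUDE_35_SONNET.output_per_m / GPT_4O.output_per_m
claude_vs_gemini_short = (
    CLAUDE_35_SONNET.output_per_m / GEMINI_15_PRO_SHORT.output_per_m
)

# Above the 128K threshold, Gemini 1.5 Pro's quoted rates equal GPT-4o's.
long_tier_matches_gpt4o = GEMINI_15_PRO_LONG == GPT_4O

print(f"Claude output rate vs GPT-4o: {claude_vs_gpt4o:.1f}x")                  # 1.5x
print(f"Claude output rate vs Gemini short tier: {claude_vs_gemini_short:.1f}x")  # 3.0x
print(f"Gemini long tier matches GPT-4o: {long_tier_matches_gpt4o}")             # True
```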

Workload 1: customer chat and support copilots

Take a realistic support workload: 100,000 conversations per month, each with 2,000 input tokens and 500 output tokens.

That is 200 million input tokens and 50 million output tokens per month.

GPT-4o: input $500, output $500, total $1,000
Claude 3.5 Sonnet: input $600, output $750, total $1,350
Gemini 1.5 Pro at short-prompt rates: input $250, output $250, total $500

This is where the output price gap starts to matter. Claude is only slightly more expensive on input than GPT-4o, but its output premium adds up fast. Compared with GPT-4o, Claude costs 35% more in this scenario. Compared with Gemini 1.5 Pro at the lower tier, Claude costs 170% more.
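The arithmetic behind those percentages is short enough to reproduce. This is a minimal sketch that assumes every conversation uses exactly 2,000 input and 500 output tokens; `monthly_total` and `percent_more` are helper names invented for this example.

```python
# Rates in USD per 1M tokens (input, output), as quoted in this article.
RATES = {
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro (short prompts)": (1.25, 5.00),
}


def monthly_total(requests: int, input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Monthly USD cost for a workload where every request has the same shape."""
    input_cost = requests * input_tokens / 1_000_000 * input_rate
    output_cost = requests * output_tokens / 1_000_000 * output_rate
    return input_cost + output_cost


def percent_more(cost: float, baseline: float) -> float:
    """How much more `cost` is than `baseline`, in percent."""
    return (cost / baseline - 1) * 100


totals = {
    name: monthly_total(100_000, 2_000, 500, *rate)
    for name, rate in RATES.items()
}

claude = totals["Claude 3.5 Sonnet"]
print(f"Claude vs GPT-4o: {percent_more(claude, totals['GPT-4o']):.0f}% more")
print(f"Claude vs Gemini: "
      f"{percent_more(claude, totals['Gemini 1.5 Pro (short prompts)']):.0f}% more")
```

Swapping in your own request count and token sizes is the quickest way to see whether the same ranking holds for your traffic.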

For FinOps teams, that usually means you should not evaluate chat workloads on prompt price alone. You need a real sampled output distribution. A model that writes 25% longer answers can quietly erase an apparent quality advantage if the provider already has the highest output rate.
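Here is a rough sketch of what working from a sampled output distribution could look like. The sample values are invented for illustration, and the conversation volume and the $15 rate come from the scenario above. Because billing follows total tokens, the mean is the number to budget on, and a few long answers can pull it well above the median.

```python
import statistics

# Hypothetical output-token counts sampled from production logs.
sampled_output_tokens = [220, 310, 480, 520, 450, 390, 1250, 610, 340, 430]

CONVERSATIONS_PER_MONTH = 100_000
CLAUDE_OUTPUT_RATE = 15.00  # USD per 1M output tokens, as quoted above


def monthly_output_cost(mean_output_tokens: float, rate_per_m: float) -> float:
    """Projected monthly output spend from the mean tokens per conversation."""
    return CONVERSATIONS_PER_MONTH * mean_output_tokens / 1_000_000 * rate_per_m


mean_tokens = statistics.mean(sampled_output_tokens)
median_tokens = statistics.median(sampled_output_tokens)

baseline = monthly_output_cost(mean_tokens, CLAUDE_OUTPUT_RATE)
# Same traffic, but answers that run 25% longer on average.
longer = monthly_output_cost(mean_tokens * 1.25, CLAUDE_OUTPUT_RATE)

print(f"mean={mean_tokens}, median={median_tokens}")
print(f"output spend: ${baseline:,.2f} -> ${longer:,.2f} with 25% longer answers")
```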

Workload 2: summarization, document extraction, and back-office pipelines

Now consider a summarization pipeline: 10,000 documents per month, each with 20,000 input tokens and 2,000 output tokens.

That is 200 million input tokens and 20 million output tokens monthly.

GPT-4o: input $500, output $200, total $700
Claude 3.5 Sonnet: input $600, output $300, total $900
Gemini 1.5 Pro at short-prompt rates: input $250, output $100, total $350

This is where Gemini 1.5 Pro looked excellent for teams processing long but not huge documents. At prompt sizes up to 128K, it is 50% cheaper than GPT-4o in this example and about 61% cheaper than Claude.

But the threshold matters. If your summarization job jumps from 20K tokens to 180K or 250K because you start passing full contracts, policy manuals, or long code context, the Gemini 1.5 Pro math changes materially. The value proposition becomes, "I can fit the whole thing in one request," not, "I am always much cheaper."
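To make the threshold effect concrete, here is a small tier-aware cost sketch. It rests on two assumptions that are mine, not Google's documented billing logic: "128K" is treated as 128,000 tokens, and the prompt size is assumed to select the tier for both the input and output tokens of that request. The function name is hypothetical.

```python
GEMINI_SHORT = (1.25, 5.00)   # USD per 1M tokens, prompts up to 128K
GEMINI_LONG = (2.50, 10.00)   # USD per 1M tokens, prompts above 128K
TIER_THRESHOLD = 128_000      # assumption: "128K" means 128,000 tokens


def gemini_request_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD.

    Assumes the prompt size picks the tier for the whole request.
    """
    if prompt_tokens <= TIER_THRESHOLD:
        input_rate, output_rate = GEMINI_SHORT
    else:
        input_rate, output_rate = GEMINI_LONG
    return (prompt_tokens * input_rate + output_tokens * output_rate) / 1_000_000


for prompt_tokens in (20_000, 180_000, 250_000):
    cost = gemini_request_cost(prompt_tokens, 2_000)
    print(f"{prompt_tokens:>7,} prompt tokens -> ${cost:.3f} per document")
```

Under these assumptions a 180K-token document costs about $0.47 per request instead of $0.035 for a 20K one, roughly 13x the cost for 9x the input.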

That distinction matters for platform teams. One-request architecture can reduce orchestration complexity, but it does not automatically mean lower spend.

Workload 3: code generation and agent-style workflows

Now take a code assistant or internal engineering copilot: 20,000 requests per month, 8,000 input tokens and 3,000 output tokens per request.

That produces 160 million input tokens and 60 million output tokens.

GPT-4o: input $400, output $600, total $1,000
Claude 3.5 Sonnet: input $480, output $900, total $1,380
Gemini 1.5 Pro at short-prompt rates: input $200, output $300, total $500

This is usually the most painful cost shape because coding agents often generate long outputs, tool calls, patches, and retries. They are output heavy. That favors the cheaper output side of GPT-4o and especially Gemini 1.5 Pro, while making Claude's $15 output rate harder to justify unless the quality delta is large enough to reduce retries or downstream human edit time.
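One way to quantify "output heavy" is the share of the bill that comes from output tokens. The sketch below uses the request shape from this scenario and the rates quoted earlier; `output_share` is an illustrative helper, not part of any SDK.

```python
# Rates in USD per 1M tokens (input, output), as quoted in this article.
RATES = {
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Gemini 1.5 Pro (short prompts)": (1.25, 5.00),
}

REQUESTS = 20_000
INPUT_TOKENS = 8_000
OUTPUT_TOKENS = 3_000


def output_share(input_rate: float, output_rate: float) -> float:
    """Fraction of monthly spend attributable to output tokens."""
    input_cost = REQUESTS * INPUT_TOKENS / 1_000_000 * input_rate
    output_cost = REQUESTS * OUTPUT_TOKENS / 1_000_000 * output_rate
    return output_cost / (input_cost + output_cost)


shares = {name: output_share(*rate) for name, rate in RATES.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of spend is output")
```

Output tokens are only about 27% of the traffic here, yet they account for 60% or more of the spend on every model.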

That last clause is important. A more expensive model can still be cheaper at the workflow level if it cuts re-runs, review time, or bug-fix loops. But you need measured completion data to prove that. Token prices alone will not answer it.
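A back-of-the-envelope model shows how large that quality delta has to be. This sketch assumes failed attempts are retried until one succeeds and that attempts are independent, so expected cost per success is cost per attempt divided by success rate. The 60% success rate is a hypothetical placeholder, not a measured figure for any model.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> float:
    """USD cost of a single request at the given per-1M-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cost_per_success(attempt_cost: float, success_rate: float) -> float:
    """Expected USD per successful task under a retry-until-success model."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return attempt_cost / success_rate


gpt4o_attempt = cost_per_attempt(8_000, 3_000, 2.50, 10.00)
claude_attempt = cost_per_attempt(8_000, 3_000, 3.00, 15.00)

GPT4O_SUCCESS_RATE = 0.60  # hypothetical placeholder; measure your own

# Success rate Claude would need for equal cost per successful task.
break_even = GPT4O_SUCCESS_RATE * claude_attempt / gpt4o_attempt

print(f"GPT-4o: ${gpt4o_attempt:.3f}/attempt, Claude: ${claude_attempt:.3f}/attempt")
print(f"Claude breaks even at a {break_even:.1%} success rate")
```

With these inputs, Claude would need roughly an 83% first-pass success rate to match GPT-4o at 60%. Real retry behavior is rarely this tidy, so treat the result as a sanity check rather than a forecast.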

Latency and throughput tradeoffs

Cost per token is only one side of production economics. Latency changes user behavior, queue depth, and infrastructure cost.

OpenAI's GPT-4o docs label the model's speed as medium and position it as the default choice for most tasks. In OpenAI's launch materials, GPT-4o also demonstrated very low audio response latency in its native multimodal setting. For text apps, the practical takeaway is simpler: GPT-4o is usually the balanced option when you want strong capability without moving to a slower, premium reasoning model.

In its July 2024 developer update, Anthropic positioned Claude 3.5 Sonnet as improving quality while maintaining the speed and cost profile of its previous mid-tier model. In practice, that made it attractive for coding and knowledge work, but it did not make it the cheapest option for output-heavy workloads.

Gemini 1.5 Pro was fundamentally a large-context model. Google's model docs gave it a 2,097,152-token input limit. My inference from that design is straightforward: if you need to stuff giant repositories, long call transcripts, or multi-document legal context into one request, Gemini 1.5 Pro changes the architecture conversation. If you need low perceived latency on short requests, the giant context window matters less than the billing threshold and real-world serving behavior.

The cost levers that matter more than model swaps

Many teams save more with workflow controls than with a pure model swap.

First, batch the work that users do not need immediately. OpenAI's pricing page says Batch API saves 50% on inputs and outputs. Anthropic's pricing docs show the same 50% pattern for batch processing. Google's Gemini pricing page listed batch discounts for 1.5 Pro as well. If your nightly evals, bulk summarization, or backfill jobs are still running synchronously, fix that before you argue about model deltas.
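As a quick illustration of the batch lever, the sketch below blends synchronous and batch pricing for one workload. It assumes the 50% discount applies uniformly to the batchable share and that token usage is spread evenly across requests. `blended_cost` is a made-up helper name.

```python
BATCH_DISCOUNT = 0.50  # 50% off inputs and outputs, per the pricing pages cited above


def blended_cost(all_sync_cost: float, batchable_fraction: float) -> float:
    """Monthly USD cost when a fraction of the traffic moves to a batch API."""
    if not 0 <= batchable_fraction <= 1:
        raise ValueError("batchable_fraction must be between 0 and 1")
    return all_sync_cost * (1 - BATCH_DISCOUNT * batchable_fraction)


# GPT-4o summarization pipeline from Workload 2: $700/month fully synchronous.
for fraction in (0.0, 0.5, 1.0):
    print(f"{fraction:.0%} batched -> ${blended_cost(700, fraction):,.2f}")
```

Moving the whole $700 pipeline to batch lands at $350, the same as Gemini 1.5 Pro's short-prompt price in that scenario, without changing models.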

Second, use caching when your prompts reuse a big static prefix. OpenAI publishes a discounted cached-input rate for GPT-4o, and Anthropic's pricing docs list separate prompt-caching rates for cache writes and cache reads. If your system prompt, tool schema, or retrieved policy block repeats across requests, caching often beats chasing a marginally cheaper frontier model.
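The caching effect can be roughed out like this. Treat every number here as a placeholder: the cached rate and the 75% cached fraction are assumptions for illustration, so check each provider's pricing page for real values. The sketch also ignores cache-write surcharges and cache expiry, which reduce real-world savings.

```python
def input_cost_with_cache(total_input_tokens: int, cached_fraction: float,
                          base_rate: float, cached_rate: float) -> float:
    """Monthly input spend in USD when part of the input is served from cache.

    Rates are USD per 1M tokens. Cache-write premiums and misses are ignored.
    """
    if not 0 <= cached_fraction <= 1:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_tokens = total_input_tokens * cached_fraction
    uncached_tokens = total_input_tokens - cached_tokens
    return (uncached_tokens * base_rate + cached_tokens * cached_rate) / 1_000_000


BASE_INPUT_RATE = 2.50            # GPT-4o input rate quoted in this article
PLACEHOLDER_CACHED_RATE = 1.25    # hypothetical; look up the real cached rate

# Workload 1: 200M input tokens per month. Assume 1,500 of every 2,000
# prompt tokens are a shared static prefix (hypothetical).
without_cache = input_cost_with_cache(200_000_000, 0.0, BASE_INPUT_RATE,
                                      PLACEHOLDER_CACHED_RATE)
with_cache = input_cost_with_cache(200_000_000, 0.75, BASE_INPUT_RATE,
                                   PLACEHOLDER_CACHED_RATE)

print(f"input spend: ${without_cache:,.2f} -> ${with_cache:,.2f}")
```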

Third, cap output length aggressively. In production LLM systems, uncontrolled output is one of the easiest ways to overspend. A 30% reduction in average output tokens often has a larger cost effect than a modest input-side optimization.
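To see why, compare the two levers on the code-assistant workload from earlier, priced at GPT-4o's quoted rates. The 10% figure for a "modest" input optimization is my assumption, chosen only to have something to compare against.

```python
def monthly_cost(input_millions: float, output_millions: float,
                 input_rate: float, output_rate: float) -> float:
    """Monthly USD cost from token volumes expressed in millions."""
    return input_millions * input_rate + output_millions * output_rate


INPUT_RATE, OUTPUT_RATE = 2.50, 10.00  # GPT-4o, USD per 1M tokens

# Workload 3: 160M input tokens and 60M output tokens per month.
base = monthly_cost(160, 60, INPUT_RATE, OUTPUT_RATE)

# Lever A: cut average output length by 30%.
output_trimmed = monthly_cost(160, 60 * 0.70, INPUT_RATE, OUTPUT_RATE)

# Lever B: trim input by 10% (assumed size of a "modest" input optimization).
input_trimmed = monthly_cost(160 * 0.90, 60, INPUT_RATE, OUTPUT_RATE)

print(f"baseline:            ${base:,.2f}")
print(f"30% shorter outputs: ${output_trimmed:,.2f} (saves ${base - output_trimmed:,.2f})")
print(f"10% smaller inputs:  ${input_trimmed:,.2f} (saves ${base - input_trimmed:,.2f})")
```

On this shape the output cut saves about $180 a month against about $40 for the input trim.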

Fourth, attribute spend by workload, not by vendor account only. You want per-feature, per-team, and ideally per-prompt-template visibility. If you are building that view now, agentcolony.org/breakdown is useful for exposing where token costs actually accumulate, while agentcolony.org/compare is better for scenario planning across models.
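If you would rather start from your own logs, a per-dimension rollup is only a few lines. The record fields below (`team`, `feature`, `template`) are a hypothetical schema, not something any provider emits by default, and the sketch prices everything at the GPT-4o rates quoted above.

```python
from collections import defaultdict

INPUT_RATE, OUTPUT_RATE = 2.50, 10.00  # GPT-4o, USD per 1M tokens

# Hypothetical usage records; you would tag these at request time.
usage_log = [
    {"team": "support", "feature": "chat", "template": "answer_v2",
     "input_tokens": 2_000, "output_tokens": 500},
    {"team": "support", "feature": "chat", "template": "answer_v2",
     "input_tokens": 2_000, "output_tokens": 500},
    {"team": "platform", "feature": "summarize", "template": "summary_v1",
     "input_tokens": 20_000, "output_tokens": 2_000},
    {"team": "engineering", "feature": "copilot", "template": "patch_v3",
     "input_tokens": 8_000, "output_tokens": 3_000},
]


def spend_by(records: list, key: str) -> dict:
    """Sum USD spend grouped by one attribution dimension."""
    totals = defaultdict(float)
    for record in records:
        cost = (record["input_tokens"] * INPUT_RATE
                + record["output_tokens"] * OUTPUT_RATE) / 1_000_000
        totals[record[key]] += cost
    return dict(totals)


by_feature = spend_by(usage_log, "feature")
for feature, cost in sorted(by_feature.items(), key=lambda item: -item[1]):
    print(f"{feature}: ${cost:.3f}")
```

The same function works for `team` or `template`, which is usually enough to find the one prompt template that dominates the bill.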

Which model fits which team

If you want the cleanest default for a current production text app, GPT-4o is the safest baseline in this comparison. It is current, broadly capable, and cheaper than Claude on both input and output.

If you are auditing or migrating a Claude 3.5 Sonnet workload, focus on output-token share first. The quality may still justify the spend in some coding or synthesis paths, but you should demand evidence from task completion rates and retry counts, not vibes.

If you are evaluating old Gemini 1.5 Pro usage, ask one hard question: did you need the giant context window? If the answer is no, the low short-prompt price was nice but probably not strategically decisive. If the answer is yes, then compare total workflow simplicity, latency, and prompt size distribution, not just token price.

Summary

The cheapest model in a pricing table is not always the cheapest system in production. In this three-way comparison, GPT-4o is the balanced current baseline, Claude 3.5 Sonnet is the premium-output-cost option, and Gemini 1.5 Pro was the value play for shorter prompts plus the architecture outlier for very large context.

For FinOps and platform teams, the right move is usually:

Measure real input and output token distributions by workload.
Separate synchronous user-facing traffic from batchable back-office traffic.
Control output length and cache repeated prompt prefixes.
Compare models only after the workflow is already efficient.

That sequence will save more money than arguing about headline prices in isolation.

FAQ

Is GPT-4o cheaper than Claude 3.5 Sonnet?

Yes. Based on the documented API rates, GPT-4o is cheaper on both input and output tokens. The biggest difference is output: $10 per 1M tokens for GPT-4o versus $15 for Claude 3.5 Sonnet.

Is Gemini 1.5 Pro always the cheapest option in this comparison?

No. It was cheapest for prompts up to 128K tokens, but above 128K its rates rose to $2.50 input and $10 output per 1M tokens, which effectively matched GPT-4o's standard pricing.

Which model is best for long-context production workflows?

In this comparison, Gemini 1.5 Pro is the notable outlier because Google's model docs listed a 2,097,152-token input limit. If your workflow genuinely needs massive context in one request, that can matter more than the headline token rate.

What matters more than model choice for reducing LLM cost?

Batching offline jobs, caching repeated prompt prefixes, enforcing shorter outputs, and adding per-feature attribution usually move the bill faster than a simple provider swap.

How should a platform team compare models fairly?

Use the same prompts, measure actual input and output tokens, track latency and retries, and calculate cost per successful task instead of cost per request alone.
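A minimal version of that calculation from measured attempts might look like the following. The tuple layout and the sample data are invented for illustration; in practice the records would come from your evaluation harness.

```python
def cost_per_successful_task(attempts, input_rate: float, output_rate: float) -> float:
    """USD spent per successful task.

    `attempts` is an iterable of (input_tokens, output_tokens, succeeded)
    tuples, one per request, including failed attempts and retries.
    """
    total_cost = 0.0
    successes = 0
    for input_tokens, output_tokens, succeeded in attempts:
        total_cost += (input_tokens * input_rate
                       + output_tokens * output_rate) / 1_000_000
        successes += bool(succeeded)
    if successes == 0:
        raise ValueError("no successful tasks in the sample")
    return total_cost / successes


# Hypothetical sample: four attempts on the same prompt set, three succeeded.
measured = [
    (8_000, 3_000, True),
    (8_000, 3_000, False),
    (8_000, 3_000, True),
    (8_000, 3_000, True),
]

gpt4o_cost = cost_per_successful_task(measured, 2.50, 10.00)
print(f"GPT-4o: ${gpt4o_cost:.4f} per successful task")
```

Run the same sample through each model's rates and measured outcomes, and the comparison reflects what you actually pay for finished work.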

DEV Community