I kept picking the wrong model.
Not because I didn't know the benchmarks. Because the benchmarks don't tell you what a model actually costs when you're running it daily, or whether it holds up across a 3-hour agent session, or whether it can fit your whole codebase without truncating half of it.
Five frontier models shipped in early 2026. All of them are good. None of them is good at everything. DeepSeek V3.2 costs $1 per million input tokens. GPT-5.4 costs $2.50 for the same volume. That is a 2.5x spread between two frontier models, and the cheaper one is not always worse for the job at hand.
Here is what I learned after running them all.
## The price gap nobody talks about
The headline numbers first, because they set the frame.
| Model | Input / Output (per 1M tokens) | Context Window |
|---|---|---|
| Claude Opus 4.7 | $5 / $25 | 1M tokens |
| GPT-5.4 | $2.50 / $15 | 256K tokens |
| Kimi K2.6 | $3 / $15 | 512K tokens |
| Gemini 3.1 Pro | $2 / $12 | 2M tokens |
| DeepSeek V3.2 | $1 / $4 | 128K tokens |
The price gap is real. DeepSeek V3.2 costs a fifth of what Opus 4.7 costs per input token. Context windows vary by 16x from smallest to largest. DeepSeek's 128K window handles a medium codebase. Gemini's 2M window fits an entire monorepo.
These gaps are not footnotes. For the right workloads, they are the whole decision.
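To compare list prices against your own traffic mix, a small calculator is enough. This is a minimal sketch: the prices are the per-million-token figures from the table above, and the 100K-in / 10K-out example split is an assumption for illustration, not a measured workload.

```python
# Per-1M-token list prices quoted in this post: (input, output).
PRICES = {
    "opus-4.7":       (5.00, 25.00),
    "gpt-5.4":        (2.50, 15.00),
    "kimi-k2.6":      (3.00, 15.00),
    "gemini-3.1-pro": (2.00, 12.00),
    "deepseek-v3.2":  (1.00, 4.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at list prices."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Example: 100K tokens in, 10K out comes to roughly $0.14 on DeepSeek
# and $0.75 on Opus -- the 5x input gap, compounded by output pricing.
```

Swap in your real input/output ratio before drawing conclusions; output-heavy workloads shift the ranking because output tokens cost 4-5x more than input across the board.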
## Coding: where the separation actually shows up
The standard benchmark is SWE-Bench, real GitHub issues where the model writes a fix that passes the test suite. Good benchmark. It skews toward clean, well-specified problems.
CursorBench runs a different evaluation. Real prompts from Cursor users. Messy, underspecified, half-broken codebases. The kind of problems actual developers bring to an AI every day.
Opus 4.7 leads CursorBench at 70%. GPT-5.4 posts 68% on SWE-Bench, which keeps it close on that benchmark. On clean, well-defined problems the two are nearly even. On messy problems, the gap widens.
What makes Opus 4.7 different on hard coding tasks is self-correction. Most models generate code, declare it done, and move on. Opus 4.7 reviews what it just wrote, spots the type error or logic gap, and fixes it in the same pass. One fewer debugging loop per session adds up across a week of engineering work. I noticed it first on a nasty legacy codebase with no tests and inconsistent patterns: Opus 4.7 held the thread across multiple refactoring steps where others started drifting.
Gemini 3.1 Pro scores 63% on SWE-Bench and is a solid coding model when the task requires pulling context from a large codebase. The 2M window means it can read the whole thing. Where it falls behind is long reasoning chains, where the model has to hold many steps of logic without dropping any.
DeepSeek V3.2 at 52% on SWE-Bench is surprisingly capable on standard implementation tasks for its price. Clear prompt, unambiguous problem, it delivers. It does not belong on hard, ambiguous work, and it mostly knows that.
## Long documents: two different dimensions
Context window size and document reasoning quality are separate things. A huge window is useless if the model loses the plot. Strong reasoning is limited if the document doesn't fit.
Gemini 3.1 Pro's 2M context is genuinely useful for real workloads: a large monorepo, a full set of legal contracts, a year of financial filings. Nothing gets truncated. If the task is "read everything and extract what matters," Gemini is the right tool.
Opus 4.7's edge is accuracy over what it reads. On dense source material, it produces 21% fewer errors than its predecessor. That gap shows up most clearly in legal and financial work where a wrong clause or misread number has consequences. You can fit more raw text into Gemini, but Opus 4.7 does more with the text it reads.
A practical combination for large, high-stakes documents: Gemini 3.1 Pro for the initial pass across the full document, Opus 4.7 for the sections that require careful reasoning. Full picture from Gemini, accuracy from Opus on the parts that matter.
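The two-pass split can be sketched as a short pipeline. Everything here is illustrative: `call_model` stands in for whatever API client you use, and the prompts and model IDs are placeholders, not a real SDK.

```python
def analyze_contract(full_text: str, call_model) -> dict:
    """Two passes: the large-window model triages, the accurate model reviews."""
    # Pass 1: Gemini reads the entire document and flags risky sections.
    flagged = call_model(
        "gemini-3.1-pro",
        "List the clauses that carry risk, one per line:\n\n" + full_text,
    ).splitlines()

    # Pass 2: Opus reasons carefully over only the flagged text.
    return {
        clause: call_model(
            "opus-4.7",
            "Assess this clause and cite the exact language at issue:\n\n" + clause,
        )
        for clause in flagged
        if clause.strip()
    }
```

The design choice worth noting: the expensive, accurate model never sees the full document, only the shortlist, so the cost of the careful pass scales with what matters rather than with document size.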
## Multi-step agents: where the real separation is
Agent tasks are where the gap between models becomes undeniable. A model that is great at one-shot prompts can fall apart when it has to run for 20 steps, use tools, and keep track of what it already did.
The failure mode looks the same across models: the agent starts losing coherence around step 10 to 15. It forgets what it already checked. It tries an approach it already tried. It produces a "done" message when the task is half-finished.
Opus 4.7 stays coherent across hours of work. It has the lowest tool error rate of the group. When a tool call returns an unexpected result, it adjusts rather than proceeding on a false assumption. The practical payoff: you can set Opus 4.7 on a multi-hour task, walk away, and come back to actual results.
GPT-5.4 is strong on short chains: 3 to 5 well-defined steps, executed fast. It is the fastest model in this group, which matters for interactive workflows where you are watching and course-correcting in real time. At the long end, reliability drops compared to Opus 4.7.
DeepSeek V3.2 is the right call for lightweight agent work at volume. Bulk tagging, classification pipelines, structured extraction from well-formatted documents. Running 10M tokens through DeepSeek instead of Opus saves about $61 per batch.
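The repeat-an-approach failure described earlier can be partly guarded against in the harness itself, whichever model is driving. A hypothetical sketch: fingerprint each tool call and refuse exact repeats.

```python
import hashlib
import json

class StepGuard:
    """Refuses tool calls the agent has already made with identical arguments."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def allow(self, tool: str, args: dict) -> bool:
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} hash the same.
        key = hashlib.sha256(
            json.dumps([tool, args], sort_keys=True).encode()
        ).hexdigest()
        if key in self.seen:
            return False   # exact repeat: surface it rather than burn a step
        self.seen.add(key)
        return True
```

A guard like this does not fix model coherence, but it turns a silent retry loop into a visible signal you can act on, for example by forcing a summary of progress back into the context.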
## What it actually costs per real workload
Headline prices only tell half the story. The actual cost depends on what you are running.
Daily coding sessions (roughly 200K tokens each):
| Model | Cost per Session |
|---|---|
| DeepSeek V3.2 | $0.26 |
| Gemini 3.1 Pro | $0.75 |
| Kimi K2.6 | $0.90 |
| GPT-5.4 | $1.60 |
| Opus 4.7 | $1.75 |
For coding sessions, DeepSeek is nearly 7x cheaper than Opus 4.7. GPT-5.4 and Opus are actually close in per-session cost — GPT-5.4 wins on speed, Opus wins on hard problems.
High-volume automation (10M tokens per month):
| Model | Monthly Cost |
|---|---|
| DeepSeek V3.2 | $14 |
| Gemini 3.1 Pro | $35 |
| Kimi K2.6 | $39 |
| Opus 4.7 | $75 |
| GPT-5.4 | $78 |
At bulk volumes, DeepSeek is in a different price category. $14 versus $78 for the same token volume is a fundamentally different operating cost. Gemini 3.1 Pro at $35/month is the surprise here: 2M context at less than half the price of Opus.
## The default pair for most builders
Opus 4.7 handles the tasks where quality decides the outcome: hard coding, debugging legacy code, long agent runs, precise document analysis. DeepSeek V3.2 handles the tasks where volume and cost decide the outcome: bulk automation, classification, templated generation, anything with a clear spec.
Those two together cover 90% of what most builders actually need.
The other three have specific edges worth knowing. Gemini 3.1 Pro for any workload that needs a 2M context window at a competitive price. GPT-5.4 for fast interactive work on clean codebases. Kimi K2.6 for Chinese-language documents at a competitive price.
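The routing logic above fits in a few lines. A minimal sketch: the task labels and the context-size threshold are assumptions to be replaced with whatever metadata your own pipeline already carries.

```python
def pick_model(task: str, context_tokens: int = 0, chinese: bool = False) -> str:
    """Route a task to a model, per the default-pair-plus-overrides scheme."""
    if chinese:
        return "kimi-k2.6"       # Chinese-language documents
    if context_tokens > 1_000_000:
        return "gemini-3.1-pro"  # above 1M tokens, only the 2M window fits
    if task in {"hard-coding", "debugging", "agent-run", "document-analysis"}:
        return "opus-4.7"        # quality decides the outcome
    if task == "interactive-coding":
        return "gpt-5.4"         # fastest for watch-and-correct loops
    return "deepseek-v3.2"       # volume and cost decide the outcome
```

The order matters: hard constraints (language, context size) are checked before preferences, so a task that physically cannot fit a smaller window never reaches the quality-versus-cost decision.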
The question is never "which model is best." It is "which model is right for this task." Get that right and you spend less, finish faster, and fix fewer mistakes on the other side.
Full breakdown with all the benchmark tables and cost scenarios is here: buildthisnow.com/blog/models/2026-04-21-opus47-vs-frontier