The demo was perfect. A reasoning model, given a gnarly algorithmic problem, spent forty-five seconds thinking — you could watch it second-guess itself, backtrack, reconsider — then produced a solution that was genuinely elegant. The kind of solution a senior engineer would be proud of.
So naturally, people started routing everything through it. Database queries. Boilerplate CRUD endpoints. A function to format a date string.
This is where things go wrong.
The Seduction of "It Just Thinks Harder"
Reasoning models — the ones doing extended chain-of-thought before answering — are legitimately impressive. They've meaningfully closed the gap on hard math, multi-step logic problems, and competitive coding benchmarks. That's real. I'm not going to pretend the capability jump didn't happen.
But somewhere along the way, the framing shifted from "use this when the problem is actually hard" to "use this for everything, because more thinking = better output."
It doesn't. Not reliably. And the cost — in latency, in tokens, in actual dollars — is not trivial.
I've seen teams burn through budget routing customer support classification, simple code completion, and FAQ lookups through reasoning-class models. The outputs weren't meaningfully better. They were just slower and more expensive. Sometimes worse, because the model would over-think a trivially simple request and introduce unnecessary complexity.
Faster, cheaper models handle the boring 90% of production workloads just fine. Often better, because they're snappier and more predictable.
What Reasoning Models Are Actually Good At
Let me be specific, because vague claims don't help anyone.
They're good at:
Problems with non-obvious structure. If you're not sure what kind of problem you're dealing with — if the problem itself requires unpacking before solving — reasoning models shine. Debugging a concurrency issue across a distributed system where the bug only surfaces under specific timing conditions? Yeah, throw compute at that.
Long-context reasoning with real stakes. Synthesizing a 50-page technical spec and flagging internal contradictions. Reviewing a contract for edge cases. Analyzing a complex codebase change for unintended side effects. These are tasks where the cost of a wrong answer is high and the additional latency is tolerable.
Novel math or algorithmic derivation. If you need something proved or derived from first principles, not just retrieved, reasoning models have a meaningful edge.
They're not particularly better at:
- Summarization
- Code generation for standard patterns (CRUD, REST APIs, component scaffolding)
- Classification and routing tasks
- Reformatting or transforming data
- Writing documentation
- Anything where "pattern matching against training data" is basically all you need
This isn't an insult to reasoning models. It's just that most software work isn't novel. Most of it is applying known patterns to a specific context. Fast models are excellent at that.
The Latency Problem Nobody Talks About
Here's a thing I've noticed: developers who build with AI tools professionally think very differently about latency than developers who mostly demo AI tools.
In a demo, 15 seconds of "thinking" looks impressive. It signals depth. It builds anticipation.
In a production system where a user is waiting on a response, 15 seconds is a UX disaster. And in an agentic workflow where the model needs to make five sequential decisions? You're looking at over a minute for something that a faster model would've knocked out in eight seconds.
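The arithmetic behind that claim is worth making explicit. A minimal back-of-envelope sketch, assuming per-call latencies of 15 s for a reasoning model and 1.6 s for a fast one (illustrative numbers, not vendor benchmarks):

```python
def workflow_latency(per_call_s: float, steps: int) -> float:
    """Total wall-clock time for an agentic loop of sequential model calls.
    Sequential means the latencies add up; they cannot be parallelized."""
    return per_call_s * steps

# Five sequential decisions, two model tiers:
print(workflow_latency(15.0, 5))  # 75.0 -> over a minute with a reasoning model
print(workflow_latency(1.6, 5))   # 8.0  -> eight seconds with a fast model
```

The point isn't the exact numbers — it's that sequential agent steps multiply whatever per-call latency you chose, so tier choice compounds.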
For most applications — and I mean this — latency is a first-class constraint, not an afterthought. Choosing the right model isn't just about capability ceiling, it's about the capability-per-millisecond and capability-per-dollar curves that actually fit your use case.
The teams I've seen succeed with AI in production are the ones who use reasoning models surgically — for the genuinely hard sub-problems — and route everything else to something leaner.
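That surgical routing can be as simple as a dispatcher in front of your model calls. A minimal sketch — the tier names, the keyword heuristic, and the `estimate_difficulty` function are all hypothetical placeholders; in practice you'd use a small classifier or a rubric, not substring matching:

```python
# Hypothetical router: send genuinely hard tasks to a reasoning-class
# model, everything else to a fast, cheap one. Tier names are illustrative.

HARD_TASK_MARKERS = {"prove", "deadlock", "race condition", "derive", "contradiction"}

def estimate_difficulty(task: str) -> str:
    """Crude heuristic: flag tasks that mention genuinely hard work.
    Replace with a real classifier before trusting it in production."""
    text = task.lower()
    return "hard" if any(m in text for m in HARD_TASK_MARKERS) else "routine"

def route(task: str) -> str:
    """Pick a model tier for a task."""
    return "reasoning-large" if estimate_difficulty(task) == "hard" else "fast-small"

print(route("Scaffold a CRUD endpoint for the users table"))     # fast-small
print(route("Find the deadlock in this lock acquisition order"))  # reasoning-large
```

The dispatcher itself is boring on purpose: the leverage is in keeping the routine 90% of traffic off the expensive path.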
The Benchmark Trap
Benchmarks are doing some real damage here.
When a new reasoning model drops and it posts state-of-the-art on ARC-AGI or MATH or some coding olympiad leaderboard, it's genuinely exciting. Those benchmarks represent hard problems that matter for AI capability research.
But your production use case isn't a benchmark. It's a specific, grounded task with specific constraints. And the correlation between "wins on olympiad-style coding problems" and "is the best model for my particular application" is... weak. Often near zero.
I've run comparisons where a significantly cheaper, faster model outperformed a reasoning-class model on the actual task we were building for — not because the reasoning model was bad, but because the task rewarded speed and predictability over depth.
Evaluate on your task. Not on a leaderboard.
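"Evaluate on your task" can be a fifty-line harness, not a platform. A minimal sketch, assuming a `call_model` stub standing in for your real API client — the stub and the scoring are placeholders; the shape (your own cases, accuracy *and* latency per model) is the point:

```python
import time

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real API call; swap in your client here."""
    return prompt.upper()  # placeholder behavior so the harness runs

def evaluate(model: str, cases: list[tuple[str, str]]) -> dict:
    """Score a model on your own (prompt, expected) pairs, tracking
    accuracy and wall-clock latency -- the two axes that matter here."""
    correct = 0
    start = time.perf_counter()
    for prompt, expected in cases:
        if call_model(model, prompt) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(cases), "latency_s": elapsed}

cases = [("abc", "ABC"), ("def", "DEF")]
print(evaluate("fast-small", cases))
```

Run the same cases through each candidate tier and compare the dicts. Leaderboard numbers never enter the picture.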
So What's the Right Mental Model?
Think about it like compute allocation. You wouldn't spin up a 64-core machine to run a cron job that checks disk space every hour. You match resources to requirements.
For AI, that means:
- Start with the cheapest model that's plausibly capable. If it works, ship it. If it doesn't, step up.
- Identify the 10% of your workload that's actually hard. Novel problems, high-stakes decisions, complex multi-step reasoning. That's where reasoning models earn their cost.
- Measure latency as a product constraint, not just an engineering metric. Users feel it. Design for it.
- Re-evaluate as models improve. A fast model from a year ago isn't the same as a fast model today. The capability gap between tiers is compressing.
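The "start cheap, step up" item is a pattern you can encode directly: an escalation ladder that tries the cheapest tier and climbs only when a validation check fails. A minimal sketch — the tier names, the `call_model` stub, and the `looks_valid` check are all hypothetical; a real validator would be schema checks, unit tests, or a scoring rubric:

```python
# Hypothetical escalation ladder: cheapest tier first, step up on failure.

TIERS = ["fast-small", "mid", "reasoning-large"]

def call_model(model: str, prompt: str) -> str:
    """Stub for a real API call. Pretends only the largest tier
    handles prompts flagged as hard."""
    if "hard" in prompt and model != "reasoning-large":
        return ""
    return f"answer from {model}"

def looks_valid(output: str) -> bool:
    """Cheap acceptance check -- e.g. schema validation or a test suite."""
    return bool(output)

def solve(prompt: str) -> tuple[str, str]:
    """Walk up the tiers until an answer passes validation."""
    output = ""
    for tier in TIERS:
        output = call_model(tier, prompt)
        if looks_valid(output):
            return tier, output
    return TIERS[-1], output

print(solve("format this date"))      # ('fast-small', 'answer from fast-small')
print(solve("hard concurrency bug"))  # ('reasoning-large', 'answer from reasoning-large')
```

The design choice worth noting: escalation only pays off when the validation step is much cheaper than the next tier up. If you can't check an answer cheaply, route by predicted difficulty up front instead.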
The interesting thing is: as reasoning capabilities get distilled into smaller, faster models — which is actively happening — this whole conversation will look different in another 18 months. But right now, in March 2026, the tier distinctions are real and the cost differences are significant.
Reasoning models represent a genuine leap. I'm not downplaying that. But the lesson from the last year of watching teams actually build with these things isn't "use the most powerful model you can afford."
It's: know what problem you're actually solving.
Most of the time, it's not that hard. Ship accordingly.