There's a move happening in engineering teams right now that nobody's calling out loudly enough. Someone gets access to o3, or Claude's extended thinking, or Gemini's Deep Think — and suddenly it becomes the default for everything. Debugging a one-liner? Throw it at the reasoning model. Writing a unit test? Reasoning model. Summarizing a Slack thread? Reasoning model.
It feels right. More thinking equals better answers, right?
Wrong. And it's costing people real money and real time.
What "Reasoning" Actually Means
Quick level-set. Reasoning models don't just generate a response — they run an internal chain of thought before they answer. You're paying for and waiting on that deliberation whether you need it or not. o3 on a hard math competition problem? That thinking is doing real work. o3 asked whether you should use map() or a for loop? It's essentially arguing with itself about something that has an obvious answer.
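To make "paying for that deliberation" concrete, here's a minimal cost sketch. Reasoning ("thinking") tokens are generally billed at the output-token rate even though they never appear in the reply you see. The prices and token counts below are purely illustrative assumptions, not any provider's real pricing:

```python
def request_cost(prompt_tokens, visible_output_tokens, reasoning_tokens,
                 input_price_per_m, output_price_per_m):
    """Dollar cost of one request. Hidden reasoning tokens are billed
    like output tokens, so they land on your bill either way."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000

# Illustrative: $2/M input, $10/M output, same short question both times.
cheap = request_cost(200, 50, 0, 2, 10)       # standard model, no deliberation
pricey = request_cost(200, 50, 3000, 2, 10)   # 3k thinking tokens you never see
```

Same question, same visible answer length; the second request costs over thirty times as much because of tokens the user never reads.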
The model category was built for a specific class of problem: tasks where the path to the answer is genuinely non-obvious. Multi-step proofs. Complex code that requires holding a lot of state in mind simultaneously. Long-horizon planning where getting step 3 wrong invalidates steps 4 through 10.
For everything else, you've got a Ferrari in a school zone.
Where They Actually Shine
I'll give credit where it's due — I've seen reasoning models do things that genuinely impressed me.
Debugging a gnarly concurrency issue in a Rust service where the race condition only appeared under specific scheduler timing. A standard model would give me the textbook answer about mutex guards. The reasoning model walked through the actual execution order, identified that my lock acquisition sequence was different in one branch, and caught it. That kind of problem — where you need to mentally simulate a system — is exactly what these models were built for.
Similarly: mathematical derivations, algorithm design from scratch (not "implement quicksort," but "design a scheduling algorithm with these four constraints that interact"), and security audits where you need to reason about what an attacker could do rather than what the code does. These are legitimately hard problems where 30 seconds of model thinking time is a bargain.
The pattern: reasoning models excel when the search space of possible answers is large, intermediate steps matter, and getting it wrong early cascades badly.
Where They're Actively Making Things Worse
Here's what doesn't get said enough — for certain tasks, reasoning models don't just underperform. They produce worse outputs.
Simple factual retrieval? The extended reasoning becomes a liability. The model overthinks its uncertainty, hedges more, sometimes talks itself out of a correct initial answer. I've watched a reasoning model spend an embarrassing number of tokens second-guessing whether Python's list.sort() is stable (it is, definitively, and a standard model would just say so).
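For the record, the stability question has a two-line demonstration. Python's `list.sort()` is documented as stable: elements that compare equal keep their original relative order.

```python
pairs = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]
pairs.sort(key=lambda p: p[0])  # sort by letter only
# Stable: within each letter, original order is preserved.
# -> [("a", 1), ("a", 2), ("b", 2), ("b", 1)]
```

No deliberation required; it's a documented guarantee, not a judgment call.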
Creative work is another casualty. Writing, brainstorming, generating five different angles on a problem — reasoning models converge on the "correct" answer when you actually want breadth and weirdness. The deliberation filters out the interesting edges. You end up with the most defensible response, which is usually the most boring one.
Fast iteration loops are where the latency really bites. If you're doing a back-and-forth debugging session — trying something, seeing the result, adjusting — a 15-second response time turns a 5-minute session into a 30-minute slog. The quality gain per turn is marginal. The friction is very real.
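The arithmetic behind that slowdown is worth spelling out. The numbers below are illustrative assumptions chosen to roughly match the 5-to-30-minute figure, which implies a loop of many rapid turns:

```python
def session_minutes(turns, human_seconds_per_turn, model_latency_seconds):
    """Wall-clock length of an interactive try-see-adjust loop."""
    return turns * (human_seconds_per_turn + model_latency_seconds) / 60

# Assumed: 100 rapid turns, ~2s of human time per turn.
snappy = session_minutes(100, 2, 1)     # ~1s responses
sluggish = session_minutes(100, 2, 15)  # ~15s responses
```

The per-turn latency dominates the session once responses stop feeling instant, because human time per turn stays constant while wait time balloons.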
The Cost Nobody's Doing the Math On
Let's be concrete about money for a second.
Reasoning model API costs run roughly 3–10x higher than their standard counterparts, depending on provider and tier. Latency is often 5–20x worse at the p95. For a product with any meaningful usage volume, that math adds up fast — and it doesn't add up in your favor.
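At fleet scale, that multiplier turns into a line item. A back-of-envelope sketch with hypothetical traffic and per-request costs (the 5x multiplier is just the midpoint of the 3-10x range above):

```python
def monthly_api_cost(requests_per_day, cost_per_request, days=30):
    """Fleet-level spend: per-request cost times volume."""
    return requests_per_day * cost_per_request * days

# Hypothetical product: 50k requests/day at $0.002 each on a standard model.
standard_bill = monthly_api_cost(50_000, 0.002)
# Same traffic through a reasoning model at a 5x per-request multiplier.
reasoning_bill = monthly_api_cost(50_000, 0.002 * 5)
```

A $3,000/month bill becoming $15,000/month is an easy conversation to have before the switch and a hard one after.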
One team I know switched their entire coding assistant backend to a reasoning model after it benchmarked well on their eval set. Their API costs tripled. Response times made the product feel broken to users accustomed to near-instant replies. They switched back within three weeks, a little sheepishly.
The benchmark problem is subtle but important: reasoning models were evaluated on hard problems specifically designed to stress their strengths. If your actual workload looks like those benchmarks, great — you'll see the gains. But most production AI features don't. Most of the time you're doing retrieval, summarization, light formatting, classification, simple code generation. Standard models handle all of that well, at a fraction of the cost.
A Framework That Actually Works
After getting this wrong enough times, here's how I think about model selection now:
Reach for a reasoning model when:
- The problem requires chaining multiple logical steps where early errors cascade
- You need to satisfy constraints that interact with each other in non-obvious ways
- You'd write out a long scratchpad yourself if solving it by hand
- Getting it wrong is expensive — security analysis, correctness-critical code, anything financial
Stick with a standard model when:
- You're iterating quickly and latency is noticeable
- The task is mostly retrieval, summarization, or generation from a clear template
- You need creative breadth over a single correct answer
- Users will hit this repeatedly and cost at scale matters
A rule of thumb that's served me well: if you'd expect a smart colleague to answer in under a minute without needing to think out loud, you don't need a reasoning model.
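The checklist above can be sketched as a routing function. Everything here is hypothetical: the field names and the policy are my own illustration of the framework, not a real API.

```python
def pick_model(task):
    """Hypothetical router mirroring the checklist above.
    `task` is a dict of booleans; names are illustrative."""
    needs_reasoning = (
        task.get("multi_step_chain", False)          # early errors cascade
        or task.get("interacting_constraints", False)  # constraints interact
        or task.get("high_cost_of_error", False)     # security, money, correctness
    )
    # A tight interactive loop makes latency the dominant cost.
    latency_sensitive = task.get("interactive_loop", False)
    if needs_reasoning and not latency_sensitive:
        return "reasoning-model"
    return "standard-model"
```

Encoding the decision this way also makes the default explicit: you have to argue a task *into* the reasoning tier, not out of it.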
The Seduction Problem
"Reasoning" is a great marketing term. It sounds like it means smarter, more thoughtful, more capable. And in the narrow domain these models were designed for, it does mean all of those things. But the term has been stretched to mean "better than standard in all cases," which is just not true.
It's the same trap we fall into with every new capability in tech. Microservices became the default architecture even for four-person startups with a single Postgres database. GraphQL got slapped on every API regardless of whether the query flexibility justified the overhead. Now reasoning models are becoming the default inference choice regardless of whether the problem actually warrants the tradeoff.
The models will tell you what they're good at, if you pay attention to where they're slow and where they fumble. That latency and cost are the reasoning at work: the feature on hard problems. Stop paying for it on easy ones.
Top comments (1)
Solid framework, but one counterpoint on creative work — reasoning models with explicit "generate 5 divergent options" prompting can actually outperform standard models on breadth. The convergence problem you describe might be more about prompting patterns than an inherent model limitation?