Francesco Sardone

Posted on May 18

Model Sizing for Coding Agents: Bigger Is Not Always Better

#ai #productivity #programming #discuss

AI coding agents need capable models, and that part is obvious by now. What is less obvious is that model capability is only half the problem. The other half is model fit.

A model that works beautifully for one coding task can be wasteful for another. A frontier reasoning model may be the right choice for architecture, migration planning, or debugging a subtle production issue. The same model may be unnecessary for formatting code, generating a small helper function, renaming symbols, summarizing diffs, or applying a known pattern.

So the real question is not only:

"Are you using the best model?"

It is also:

"Are you using the right model for this task?"

That is the part I think we should talk about more.

Overview
The Hidden Problem: Model Fit
Why One Model Is Not Enough
Task Classes: What Kind of Coding Work Is This
Model Profiles: How Much Intelligence Should the Task Get
Why the Best Model Depends on the Harness
Routing, Escalation, and Cascades
Why This Matters in Real Repositories
Conclusion

Overview

A lot of discussion around AI coding still treats model choice as a leaderboard problem. The assumption is simple: if one model is strongest overall, then using it everywhere should produce the best results. That is true sometimes, but it is also incomplete.

In practice, coding work is not one task. It is a mix of small edits, local reasoning, repository navigation, test interpretation, dependency awareness, code generation, code review, refactoring, migration planning, debugging, and long-horizon agentic work. These are not all equally difficult, and they do not all benefit from the same model size.

That means model selection is not just a capability decision. It is a systems design decision.

The largest or newest model may be the safest default for hard tasks, but it is not automatically the best default for every task. Larger models usually cost more. They can be slower. They may produce longer outputs than needed. They may burn tokens on reasoning that the task does not require. OpenAI’s own latency guidance makes this point directly: model size is one of the main factors influencing inference speed, and smaller models are usually faster and cheaper; used correctly, they can even outperform larger models for some workloads. (OpenAI Developers)

So the useful framing changes. It becomes less about "which model is best?" and more about this:

What is the smallest model that can reliably complete this coding task at the required quality level?
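One way to make that question concrete is to treat it as a selection over measured candidates. The sketch below is a minimal illustration, not a recommended implementation: the `Candidate` type, the model labels, and every number in it are hypothetical, and in practice the pass rates would come from your own evaluation set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str             # hypothetical model label
    cost_per_task: float  # average cost per task, arbitrary units (made up)
    pass_rate: float      # fraction of evaluation tasks passed (made up)


def smallest_reliable(candidates: Sequence[Candidate],
                      required_pass_rate: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the quality bar, or None."""
    eligible = [c for c in candidates if c.pass_rate >= required_pass_rate]
    return min(eligible, key=lambda c: c.cost_per_task, default=None)


# Illustrative numbers only; they are not measurements of any real model.
candidates = [
    Candidate("small", 0.01, 0.82),
    Candidate("mid", 0.05, 0.93),
    Candidate("frontier", 0.40, 0.97),
]

choice = smallest_reliable(candidates, required_pass_rate=0.90)
print(choice.name if choice else "no candidate meets the bar")
```

The quality bar comes first and cost only breaks ties among models that clear it, which is the ordering the question implies.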

The Hidden Problem: Model Fit

When an AI coding workflow gets expensive or slow, we often blame token volume. Sometimes that is fair. Context bloat is real. Large repository prompts are expensive. Tool loops can multiply cost quickly.

But sometimes the issue is more basic: the wrong model is doing the wrong job.

A frontier model may be excellent at difficult multi-step coding work, but that does not mean it should handle every operation in the pipeline. Many coding-agent workflows contain tasks that are more operational than intellectual: classify the request, summarize a file, identify relevant paths, generate a test command, rewrite a small function, apply a style rule, or explain a diff. These tasks still require accuracy, but they often do not require the maximum available reasoning budget.

This is where model sizing becomes important.

A model can be "too small" for a task, causing bad patches, shallow reasoning, or missed edge cases. But a model can also be "too large" for a task, creating unnecessary cost and latency without materially improving the outcome. The second failure mode is quieter, because the answer may still be good. It just may be inefficient.

In other words, the problem is not always model capability. Sometimes it is mis-sized capability. That distinction matters: once you move from "use the best model" to "size the model to the task," the design space gets much more interesting.

Why One Model Is Not Enough

There is a quiet assumption in many coding-agent setups that one model should carry the whole workflow. Pick the strongest model you can afford, wire it into the agent, and let it handle everything.

That sounds clean, but real coding workflows do not behave that way.

A small bug fix is not the same as a cross-package refactor. A failing unit test is not the same as a flaky integration test. A code review comment is not the same as a schema migration. A repository-wide modernization is not the same as adding one missing null check.

The model market also no longer behaves like a single ladder where every better model is simply "more of the same." Providers now expose families of models with different cost, latency, context, and reasoning tradeoffs. OpenAI, for example, explicitly presents frontier models for complex reasoning and coding while also pointing to smaller models for lower-latency, lower-cost workloads. (OpenAI Developers) Anthropic’s pricing page similarly shows large price differences across Claude model tiers, with higher-end models costing materially more per token than smaller ones. (Claude Platform) Google’s Gemini pricing also separates Pro and Flash-style options with different cost profiles. (Google AI for Developers)

So a single fixed model quickly starts to look less like a simplification and more like an accidental constraint.

The better framing is that a coding agent should not have one model, but a model portfolio. Once you accept that, at least two dimensions become worth choosing deliberately: task difficulty, meaning how much reasoning the work needs, and execution role, meaning where in the workflow the model is being used.

Those two decisions matter much more than they usually get credit for.

Task Classes: What Kind of Coding Work Is This

The first question is task class. Not all coding tasks deserve the same model budget, and the easiest way to waste money is to treat them as if they do.

Mechanical Tasks

Mechanical tasks are low-ambiguity operations. They include formatting, renaming, applying simple conventions, translating obvious patterns, generating boilerplate, extracting structured data from code, summarizing a small diff, or making a localized edit with clear instructions.

These tasks can often be handled by smaller or faster models, especially when the repository context is clean and the expected output is easy to validate.

The important point is not that these tasks are trivial. The important point is that they are bounded. The model does not need to discover a new architecture. It needs to execute a known operation correctly.

Local Reasoning Tasks

Local reasoning tasks require understanding a function, file, module, or narrow dependency chain. Examples include fixing a failing unit test, adding a small feature inside an existing pattern, explaining a bug, or modifying behavior in one bounded area of the codebase.

These tasks often benefit from a mid-sized model. The model needs enough reasoning to understand cause and effect, but it may not need the most expensive frontier model if the context is well-scoped and tests are available.

Repository Reasoning Tasks

Repository reasoning tasks require understanding conventions, cross-file relationships, hidden assumptions, or interactions between systems. Examples include changing a public API, updating a persistence layer, modifying authentication behavior, or refactoring a shared package.

This is where stronger models become more valuable. The difficulty is not just writing code. The difficulty is preserving behavior across a larger surface area.

Long-Horizon Agentic Tasks

Long-horizon tasks require planning, tool use, repeated verification, and recovery from mistakes. Examples include migrating a framework version, resolving a complex GitHub issue, implementing a feature across multiple packages, or debugging failures after several failed attempts.

This is the class where the strongest models are most defensible. Benchmarks such as SWE-bench Verified focus on real software issues and evaluate whether generated patches pass associated tests, which makes them more relevant to this kind of work than simple code-generation benchmarks. SWE-bench Verified is a human-filtered subset of 500 tasks, and the SWE-bench site also distinguishes between different harnesses and agent setups, which is important because the model is only one part of the result. (SWE-bench)

The important point is not that one class is better. The important point is that they have different model requirements.

A mechanical task optimizes for speed and cost. A long-horizon task optimizes for reliability and recovery. Those are not the same objective.
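The four classes above could be encoded so that an agent can tag work before choosing a model. This is only a sketch under stated assumptions: the `classify_task` function, its three input signals, and the file-count thresholds are hypothetical heuristics, not something measured or taken from a real agent.

```python
from enum import Enum


class TaskClass(Enum):
    MECHANICAL = "mechanical"
    LOCAL_REASONING = "local reasoning"
    REPOSITORY_REASONING = "repository reasoning"
    LONG_HORIZON = "long-horizon agentic"


def classify_task(files_touched: int,
                  needs_multi_step_plan: bool,
                  instructions_are_explicit: bool) -> TaskClass:
    """Hypothetical heuristic; the thresholds are illustrative, not tuned."""
    if needs_multi_step_plan:
        # Planning, tool use, and recovery from mistakes dominate.
        return TaskClass.LONG_HORIZON
    if files_touched > 3:
        # Behavior must be preserved across a larger surface area.
        return TaskClass.REPOSITORY_REASONING
    if instructions_are_explicit and files_touched <= 1:
        # A bounded, known operation on a single file.
        return TaskClass.MECHANICAL
    return TaskClass.LOCAL_REASONING


print(classify_task(1, False, True).value)   # a rename in one file
print(classify_task(8, False, True).value)   # a shared-package refactor
```

A real classifier would likely use richer signals, or a small model, but the output type is the useful part: every later decision can branch on the class rather than on the raw request.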

Model Profiles: How Much Intelligence Should the Task Get

The second question is model profile. Once you understand the task class, you can decide how much model you actually need.

Small and Fast

Small models are useful when the task is bounded, the expected output is short, and failure can be cheaply detected. They are good candidates for classification, extraction, simple transformations, file summaries, quick explanations, and narrow edits.

This profile is especially valuable inside coding agents because many agent steps are not the final patch. They are supporting operations: deciding what to inspect, summarizing what changed, checking whether a file is relevant, or generating a small intermediate artifact.

Using a frontier model for every supporting operation can make the whole agent feel expensive before it has even started solving the real problem.

Mid-Sized and Balanced

Mid-sized models are often the best default for day-to-day coding. They can usually handle common implementation tasks, local debugging, test-driven edits, and straightforward refactors while keeping cost and latency under control.

This profile matters because most coding work is not at the extreme frontier. It is not always a research-level debugging problem. It is often a known engineering task inside a known codebase.

That is where balanced models can be very effective: enough capability to reason, not so much cost that every iteration becomes expensive.

Frontier and Expensive

Frontier models should be reserved for tasks where failure is expensive, ambiguity is high, or reasoning depth matters. They are useful for architecture, difficult debugging, security-sensitive review, large refactors, migrations, and agentic loops where a bad early decision can waste many later steps.

The key is not to avoid frontier models. The key is to spend them deliberately.

A strong model is most valuable when it is applied to the part of the workflow where intelligence is actually scarce. If it is used everywhere, it becomes less like a precision instrument and more like a very expensive default setting.

Why the Best Model Depends on the Harness

This is probably the most important point in the whole article. The right model depends not only on the task, but also on the harness, because different coding-agent environments do not consume model intelligence the same way.

Some harnesses front-load a lot of context. Some use retrieval. Some rely heavily on shell feedback. Some run tests aggressively. Some ask the model to plan and execute. Some split work across sub-agents. Some keep a human tightly in the loop. Others run closer to batch mode.

That changes model requirements.

A smaller model with excellent context, strong tools, tight tests, and clear instructions may outperform a larger model operating with poor context and weak feedback. This is also why benchmarks should be read carefully. SWE-bench, for example, is not only a model benchmark; results depend on the agent scaffold, tools, retries, and evaluation setup. The SWE-bench leaderboard explicitly separates benchmark variants and notes the harness used for some comparisons. (SWE-bench)

The same applies in production. A model that is too weak for open-ended coding may be perfectly reliable when used as a sub-agent for search, summarization, or patch verification. A frontier model that performs well in an interactive IDE may be too expensive for a high-volume automated review pipeline. A model that is excellent for code generation may not be the best choice for cheap classification or test-log summarization.

So when people ask what the ideal coding model is, I think the honest answer is this:

There is no ideal model independent of how the model is consumed.

There is only the best fit for the task, the repository, the harness, the validation loop, the latency target, and the cost budget.

That is why model choice should be a routing decision, not a static preference.

Routing, Escalation, and Cascades

A practical way to think about model optimization is fairly simple.

Use a small model when the task is bounded, reversible, cheap to validate, or part of the agent’s internal workflow.

Use a mid-sized model when the task requires local reasoning, but the scope is still clear and tests or review can catch most mistakes.

Use a frontier model when the task is ambiguous, cross-cutting, high-risk, long-horizon, or expensive to get wrong.
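Those three rules can be written down as a small function. The sketch below is one possible reading, with hypothetical names (`Task`, `pick_profile`) and boolean flags that something upstream would have to supply. It is deliberately conservative: any risk signal wins, and the small profile requires the task to be both bounded and cheap to validate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    bounded: bool
    cheap_to_validate: bool
    ambiguous: bool = False
    cross_cutting: bool = False
    high_risk: bool = False
    long_horizon: bool = False


def pick_profile(task: Task) -> str:
    """Return 'small', 'mid', or 'frontier' for a task (illustrative rules)."""
    # Risk signals are checked first so they override everything else.
    if (task.ambiguous or task.cross_cutting
            or task.high_risk or task.long_horizon):
        return "frontier"
    # Conservative assumption: require both properties for the small profile.
    if task.bounded and task.cheap_to_validate:
        return "small"
    return "mid"


print(pick_profile(Task(bounded=True, cheap_to_validate=True)))
print(pick_profile(Task(bounded=True, cheap_to_validate=False)))
print(pick_profile(Task(bounded=True, cheap_to_validate=True, high_risk=True)))
```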

But the more interesting pattern is not manual selection. It is routing.

Model routing means choosing the model dynamically based on the task. Model escalation means starting with a cheaper model and moving up only when needed. A model cascade means trying one model first, validating the result, and escalating if confidence or correctness is insufficient.

This is not just a theoretical idea. The research direction is well established. FrugalGPT proposed prompt adaptation, LLM approximation, and LLM cascades as ways to reduce inference cost, and reported that cascades could match the performance of the best individual LLM with up to 98% cost reduction in their experiments. (arXiv) RouteLLM similarly frames model choice as a cost-performance routing problem, dynamically selecting between stronger and weaker models at inference time. (OpenReview) More recent routing work continues in the same direction: select models based on task requirements, cost, latency, and expected quality rather than treating one model as universally optimal. (arXiv)

For coding agents, this suggests a useful architecture:

Do not ask one model to do every job. Build a workflow that knows when to spend intelligence.

That can look like this:

A small model classifies the task;
A small or mid-sized model retrieves and summarizes relevant files;
A mid-sized model attempts the patch;
Tests, linters, or static analysis validate the result;
A frontier model is invoked only when the task is hard, the patch fails, or the risk is high.

That is a very different cost profile from sending every step directly to the most expensive model.

It is also a better engineering posture. The goal is not to be cheap for its own sake. The goal is to reserve expensive reasoning for the moments where it changes the outcome.
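The attempt, validate, escalate part of that workflow can be sketched in a few lines. Everything here is a stand-in: the two "models" are plain functions returning canned patches, and the validator is a string check standing in for tests, linters, or static analysis. It is not the FrugalGPT or RouteLLM method, only the general cascade shape.

```python
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[str], str]       # task description -> proposed patch
Validator = Callable[[str], bool]  # patch -> did it pass validation?


def run_cascade(task: str,
                tiers: Sequence[Tuple[str, Model]],
                validate: Validator) -> Tuple[Optional[str], Optional[str], int]:
    """Try tiers from cheapest to most expensive; stop at the first valid patch.

    Returns (tier name, patch, attempts), or (None, None, attempts) if every
    tier fails validation.
    """
    attempts = 0
    for name, model in tiers:
        attempts += 1
        patch = model(task)
        if validate(patch):
            return name, patch, attempts
    return None, None, attempts


# Stand-in models: canned outputs instead of real API calls.
def mid_model(task: str) -> str:
    return "return user.name"


def frontier_model(task: str) -> str:
    return "return user.name if user else None"


# Stand-in validator: pretend the test suite requires a null check.
def passes_tests(patch: str) -> bool:
    return "if user" in patch


tiers = [("mid", mid_model), ("frontier", frontier_model)]
winner, patch, attempts = run_cascade("add missing null check", tiers, passes_tests)
print(winner, attempts)
```

The cascade is only as good as its validator. If validation is weak, a cheap but wrong patch is accepted early, so the savings depend on having tests or checks that actually catch failures.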

Why This Matters in Real Repositories

This matters because coding-agent cost is not only a token problem. It is a workflow problem.

An agent that uses a frontier model to summarize every file may not need better summarization. It may need cheaper summarization. An agent that sends every lint fix to the largest model may not need more intelligence. It may need a mechanical edit path. An agent that uses the same model for task classification, repository search, code generation, test repair, and architectural reasoning may not need one better model. It may need a model-sizing strategy.

The economics become especially important once agents move from demos to real usage. Coding agents are iterative. They inspect files, call tools, run tests, read failures, revise patches, and sometimes repeat the loop many times. That means small inefficiencies compound quickly.

Prompt caching helps, and it should be part of the conversation. OpenAI states that prompt caching can reduce latency by up to 80% and input token costs by up to 90% for repeated prompt prefixes. (OpenAI Developers) But caching does not replace model sizing. Caching makes repeated context cheaper. Model sizing asks whether the expensive model needed to see that context in the first place.
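A little arithmetic shows why the two levers are separate. The prices, token counts, and cache hit fraction below are invented for illustration and do not correspond to any provider's actual pricing; only the 90% discount figure echoes the upper bound quoted above.

```python
def input_cost(tokens: int,
               price_per_million: float,
               cached_fraction: float = 0.0,
               cache_discount: float = 0.0) -> float:
    """Input cost when `cached_fraction` of the tokens get `cache_discount` off."""
    effective_tokens = tokens * (1 - cached_fraction * cache_discount)
    return effective_tokens * price_per_million / 1_000_000


TOKENS = 100_000        # hypothetical context size per call
FRONTIER_PRICE = 10.0   # hypothetical price per million input tokens
SMALL_PRICE = 1.0       # hypothetical price per million input tokens

# Frontier model with 80% of the prompt cached at a 90% discount.
frontier_cached = input_cost(TOKENS, FRONTIER_PRICE,
                             cached_fraction=0.8, cache_discount=0.9)

# Small model with no caching at all.
small_uncached = input_cost(TOKENS, SMALL_PRICE)

print(f"frontier, cached: {frontier_cached:.2f}")
print(f"small, uncached:  {small_uncached:.2f}")
```

Under these made-up numbers the uncached small model is still cheaper than the heavily cached frontier model. With different prices the comparison could flip, which is the reason to compute it rather than assume it.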

The real optimization is layered:

better context, better routing, better validation, better caching, and better escalation.

That is why model selection deserves more engineering attention. Not just which model scores highest, but which model should do which part of the workflow, under which conditions, with which fallback path.

If we care about coding-agent performance, we should care about model capability. But we should also care about cost per successful task, latency per iteration, failure recovery, validation quality, and how often expensive reasoning is invoked unnecessarily.

That is not procurement detail. It is part of the runtime architecture of the agent.
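Two of those metrics are easy to compute from run logs. The sketch below assumes a hypothetical `Run` record with a cost, a success flag, and an escalation flag; the sample values are invented.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Run:
    cost: float       # total spend for the task, arbitrary units
    succeeded: bool   # did the final patch pass validation?
    escalated: bool   # was the expensive model invoked?


def cost_per_successful_task(runs: Sequence[Run]) -> float:
    """Total spend, including failed runs, divided by the number of successes."""
    successes = sum(1 for r in runs if r.succeeded)
    total = sum(r.cost for r in runs)
    return total / successes if successes else float("inf")


def escalation_rate(runs: Sequence[Run]) -> float:
    """Fraction of runs that invoked the expensive model."""
    return sum(1 for r in runs if r.escalated) / len(runs) if runs else 0.0


# Invented sample: three cheap runs (one failed) and one escalated run.
runs = [
    Run(0.02, True, False),
    Run(0.02, True, False),
    Run(0.50, True, True),
    Run(0.02, False, False),
]

print(f"cost per successful task: {cost_per_successful_task(runs):.3f}")
print(f"escalation rate: {escalation_rate(runs):.2f}")
```

Failed runs count toward the numerator on purpose: a cheap model that fails often is not cheap once the retries and escalations are included.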

Conclusion

A good AI coding workflow is not only about using a powerful model. It is about fit: fit to the task, fit to the repository, fit to the harness, and fit to the way the work is validated.

That is why I do not think coding agents should default to one model everywhere. Some tasks need the strongest available model. Some need a balanced model. Some need a small, fast model with good instructions and tight validation. Some need a cascade where the system starts cheap and escalates only when the task proves difficult.

The important shift is this:

Stop thinking of model choice as a leaderboard decision. Start thinking of it as a sizing problem.

Once you do that, cost and performance stop looking like opposing forces. They become part of the same engineering problem.

So if you are building coding-agent workflows, that naturally leads to a more useful question:

Are you using the best model, or are you using the right model at the right moment?

Top comments (2)

Harjot Singh • May 31

"Bigger is not always better" is exactly right for coding agents, and the reason is subtle: a coding agent's success depends way more on the harness (good context, tight tool definitions, a working test/feedback loop) than on raw model size. A mid-size model with an excellent harness routinely beats a frontier model fumbling in a bad one. People keep upgrading the model when they should be upgrading the scaffolding around it.

The corollary that follows: pick the smallest model that reliably clears the task's bar, then invest the savings into a better harness and verification - that combination beats "just use the biggest model" on cost AND often on quality. This is the whole thesis behind how Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) is built - right-sized models per step + a strong harness, which is why a build lands ~$3 flat without sacrificing the output. Really sharp post, this is the nuance the "always use the best model" crowd misses. In your testing, where was the sweet spot - did a mid-tier model with good scaffolding actually match a bigger one, or did size still win on the hardest tasks? That boundary is the whole sizing decision.

Francesco Sardone • Jun 21

Just picking up your comment now; I lost it in the tornado of notifications I had.
Thank you kindly, your comment adds real value to this write-up.
