Jamie Cole

Your Agent's Model Is Not the Bottleneck

People keep asking me which model to use. Wrong question.

Not "wrong" as in mildly misguided. Wrong as in the model is almost never what's actually breaking your agent. I've been building Claude-based agents for a while now, and almost every failure I've seen is an architecture problem. Model failures and architecture failures look identical from the outside: bad output, loops with no obvious exit condition. But the cause tends to be in the scaffolding.

The data backs this up.

The APEX-Agents benchmark (one of the more rigorous agentic evals available) shows a 24% pass@1 rate across tasks. Most of those failures happen at the orchestration layer, well before any model reasoning limit becomes relevant.

Vercel's case is even more striking. They had an agent running with 15 tools available. Accuracy sat at 80%. They cut it down to 2 tools. Accuracy jumped to 100%. Same model throughout. They didn't swap Claude for GPT or anything else. They cleaned up the tool surface and the agent went from failing one in five tasks to not failing at all.


So what's actually going wrong in most agent setups?

The biggest culprit is tool descriptions. This is the thing most developers underestimate badly. If your tool description says "searches for information and returns results," the model has to guess what it does, when to call it, what the output looks like, and what counts as a failure response. It guesses wrong. Then it retries. Then the output gets weird, and you end up blaming the model. Fixing the description usually fixes the behaviour.
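To make the contrast concrete, here's a sketch of a vague tool definition next to a specific one, written in the general shape of the Anthropic tool-use schema. The tool names, wording, and domain are hypothetical, purely for illustration.

```python
# Hypothetical tool definitions. The vague one forces the model to guess;
# the specific one answers what it does, what it doesn't, what it returns,
# and what a failure looks like.

vague_tool = {
    "name": "search",
    "description": "Searches for information and returns results.",
}

specific_tool = {
    "name": "search_orders",
    "description": (
        "Looks up customer orders by order ID or customer email in the "
        "orders database. Does NOT search products or support tickets. "
        "Returns a JSON list of orders; an empty list means no match and "
        "is not an error. Fails with a 'bad_input' error only when the "
        "query is neither a valid order ID nor a valid email."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "An order ID or a customer email address.",
            }
        },
        "required": ["query"],
    },
}
```

Notice that the second description spends most of its length on boundaries and failure semantics, not on what the tool does. That's the part the model can't guess.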

Beyond tool descriptions, tool count matters more than people expect. If you have eight tools that could plausibly handle a task, the model makes strange choices about which to call and when. Decision fatigue baked into the prompt is the best description I have for it. Two tools with clear, non-overlapping responsibilities outperform six vague ones consistently.

Error handling is the next major failure point. Most scaffolding I've seen either retries immediately with no state change (loop territory) or surfaces the raw error to the model and hopes it figures something out. Neither works reliably. You need to catch errors and classify them. Not everything is worth retrying. Put a ceiling on how many times you'll retry any single operation. And track what failed in state, not just whether the overall run succeeded, because the model needs that signal to try a different approach.
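The classify-then-cap pattern above can be sketched in a few lines. `ToolError`, the error kinds, and the state shape here are all assumptions, not any real library's API:

```python
# Sketch: classify errors, cap retries, and record every failure in state
# so the model can see what was already attempted.

class ToolError(Exception):
    """Hypothetical error type carrying a machine-readable kind."""
    def __init__(self, kind, message=""):
        super().__init__(message or kind)
        self.kind = kind

RETRYABLE = {"timeout", "rate_limit"}  # transient failures worth another attempt
MAX_RETRIES = 3                        # ceiling on any single operation

def run_tool(tool, args, state):
    """Call a tool; log each failed attempt to state["failures"]."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return tool(**args)
        except ToolError as err:
            state["failures"].append(
                {"tool": tool.__name__, "error": err.kind, "attempt": attempt}
            )
            if err.kind not in RETRYABLE:
                break  # bad input, auth, etc. will fail the same way again
    return None  # caller (and the model) can read why from the failure log
```

The key detail is that permanent errors break out immediately while transient ones retry up to the ceiling, and either way the failure log survives in state.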

Then there's state design. I've watched agents spiral for 40+ tool calls because the state object was a flat dict with no history. The model kept trying slight variations of the same broken approach because nothing in its context told it that approach was exhausted. Adding a proper failure log to state fixed it completely. No model change required.
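One way to turn that failure log into a signal the model can act on is to summarise which approaches are exhausted and inject that into context. This is a hypothetical helper over an assumed `failures` list of `{"tool", "args"}` dicts, not a pattern from any specific framework:

```python
# Sketch: surface repeatedly failed tool/args combos so the model stops
# retrying slight variations of a dead approach.
from collections import Counter
import json

def exhausted_approaches(failures, limit=2):
    """List tool+args combos that failed `limit` or more times."""
    counts = Counter(
        (f["tool"], json.dumps(f["args"], sort_keys=True)) for f in failures
    )
    return [
        f"{tool}({args}); failed {n}x, do not retry"
        for (tool, args), n in counts.items()
        if n >= limit
    ]
```

A line like `search({"q": "a"}); failed 3x, do not retry` in the prompt is exactly the "this approach is exhausted" signal the flat dict was missing.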


What do you actually do with this?

Start with tool descriptions. Write them like you're explaining the function to someone on their first day. What does it do? What doesn't it do? What should the caller do with what it returns? What breaks it? Two sentences is almost always too short.
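Those four questions can even be turned into a crude lint. The keyword heuristics and length threshold below are arbitrary assumptions, just to show the idea of checking descriptions mechanically rather than by eyeballing:

```python
# Sketch: flag the questions a tool description leaves unanswered.
# Keywords and the word-count threshold are rough, illustrative choices.

def description_gaps(desc):
    checks = {
        "what it returns": ("return",),
        "what it does not do": ("not ", "only "),
        "failure behaviour": ("error", "fail", "empty", "raise"),
    }
    gaps = []
    lowered = desc.lower()
    for question, keywords in checks.items():
        if not any(k in lowered for k in keywords):
            gaps.append(question)
    if len(desc.split()) < 30:  # "two sentences is almost always too short"
        gaps.append("probably too short")
    return gaps
```

Run it on "Searches for information and returns results." and it flags everything except the return value, which is about right.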

Then audit tool count. If you have tools with overlapping responsibilities, cut or merge them. The Vercel numbers are extreme but the direction is right. Fewer, clearer tools consistently beat many vague ones.

For error handling, the most important thing is making errors visible to the agent in a useful form. Raw stack traces don't help. If the error message tells the model what was attempted and why it failed, the next step has a chance of being different from the last one.
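A minimal sketch of that translation step: wrap the raw exception in a message that states what was attempted, why it failed, and what not to do next. The exact wording and function name are made up; the shape is what matters.

```python
# Sketch: convert a raw exception into an error message the model can act on,
# instead of surfacing a stack trace.

def error_for_model(tool_name, args, err):
    return (
        f"Tool `{tool_name}` was called with {args!r} and failed: "
        f"{err.__class__.__name__}: {err}. "
        "Do not repeat the identical call; change the input or pick another tool."
    )
```

Compare that to a bare traceback: the model now knows the attempt, the cause, and that repetition is off the table.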

State design takes the longest to get right. Most scaffolding logs the final result but nothing in between. If the agent can see its own recent history, including what it tried and what came back, it gets enough signal to change approach on its own. That alone removes a whole category of looping behaviour.
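Rendering that recent history into the prompt can be as simple as the sketch below. It assumes the scaffold keeps an append-only `state["steps"]` list with a `result` or `error` per step; both the state shape and the format are illustrative.

```python
# Sketch: render the last n steps (tool, args, outcome) as prompt lines so
# the agent can see what it tried and what came back.

def recent_history(state, n=5):
    lines = []
    for step in state["steps"][-n:]:
        outcome = step["result"] if "result" in step else f"FAILED: {step['error']}"
        lines.append(f"- {step['tool']}({step['args']}) -> {outcome}")
    return "\n".join(lines)
```

Five lines of history in context is usually enough for the model to notice "I already tried that and it failed" without any change to the model itself.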

If you've done all of that and you're still getting bad results, look at the model. But most people never get there.


Before you ship anything, run your setup through the free production readiness checker: genesisclawbot.github.io/claude-agents-guide/checker.html. It takes two minutes and flags the most common gaps before they cost you.

For the full checklist (50 items covering everything that's bitten me in production), it's $9: clawgenesis.gumroad.com/l/iajhd

If you want the longer guide with implementation patterns and worked examples, the Claude Agents Guide is £25: clawgenesis.gumroad.com/l/bngjov

