A Flask Creator Says Anthropic's Newest Models Got Worse at Using Tools

#anthropic #tooluse #agents #opus48

Anthropic's newest AI models are inventing extra, made-up fields when they call external tools, according to an essay published July 4, 2026 by Armin Ronacher, the creator of the Flask and Jinja web frameworks and Sentry's founder. On long multi-step coding sessions, Opus 4.8 and Sonnet 5 fabricate plausible-sounding but nonexistent parameters roughly one in five times, a regression not seen in Anthropic's own older models or in OpenAI's competing Codex models. Ronacher's diagnosis: forcing the schema to be followed exactly drops the failure rate to zero, and he argues the root cause is how Anthropic trains its models inside its own forgiving coding tool.

Key facts

About 20% of tool calls fail on long, multi-step agent sessions; on a single fresh question, the failure rate is 0%.
Turning on "strict mode," which forces outputs to match a tool's schema exactly, drops the failure rate to 0%; removing the model's "thinking" blocks from the conversation history roughly halves it.
Published 2026-07-04 by Armin Ronacher (creator of Flask, Jinja, and Sentry) at lucumr.pocoo.org, discussed on Hacker News.
At least three independent tool builders confirmed the same pattern on their own systems; OpenAI's Codex models showed no equivalent regression except one.

When an AI model uses a tool - reading a file, editing code, querying a database - it does not just describe in English what it wants to do. It emits a structured "function call": a small piece of data with named fields, like editing a document by specifying oldText, newText, and a line range. This is one of the building blocks behind AI agents, and it only works if the model fills in the fields the tool actually expects, nothing more and nothing less.

Ronacher noticed that on tools built around one particular shape - a nested list of edits, rather than one flat edit per call - Anthropic's current flagship models, Opus 4.8 and Sonnet 5, started attaching extra fields the tool never asked for. The invented names differ every single time: requireUnique one run, oldText2 the next, then matchCase, then notes. Older Anthropic models, Opus 4.5 and Haiku, did not do this on the same tools. The failure clusters specifically at the highest-uncertainty moment in a tool call - right after the model has just finished writing out a long, escaped string of edited text and has to decide what comes next.

The scale of the problem depends heavily on context. Ask the model one isolated question with a tool available and it gets the call right every time - a 0% failure rate. But let an agent run for many turns, accumulating a long history of prior edits and outputs, and roughly 1 in 5 tool calls comes back with a phantom field attached. Two fixes narrow the gap. "Strict mode" - a server-side setting that mechanically forces the model's output to conform to the tool's declared schema, similar to a form that will not let you submit a field that is not on it - eliminates the problem entirely. Stripping the model's internal "thinking" text out of the conversation history before the next turn cuts the failure rate roughly in half, suggesting the model is, in some sense, talking itself into inventing fields as a session gets longer.

Ronacher's hypothesis for why this happens points at Anthropic's own product: Claude Code, the company's flagship coding agent. He argues that because Claude Code's own internal tool-call format is flat (not nested) and because Claude Code quietly repairs sloppy or malformed tool calls behind the scenes rather than rejecting them, a model trained heavily inside that environment is never punished for getting a tool call slightly wrong. It learns a strong habit for its home turf's tolerant, flat format, and then, when a different application hands it a stricter, nested schema, it falls back on guessing plausible-sounding extra fields rather than sticking to exactly what was declared. As Ronacher put it, commenting on Hacker News under his handle the_mitsuhiko: "Train a model in a forgiving environment and other runtimes end up inheriting its habits."

That the same regression does not show up in OpenAI's Codex models (with one exception) strengthens the case that this is specific to how Anthropic trains and serves its models, rather than a universal difficulty every AI lab faces with complex tool schemas. At least three separate developers building their own agent runtimes, independent of Ronacher, reported hitting the identical pattern - phantom fields, appearing only in long sessions, on nested-edit tools - which makes coincidence an unlikely explanation.

The honest caveat is that this remains Ronacher's well-argued hypothesis rather than something Anthropic has confirmed; the company has not publicly responded. It is also not a problem every developer can simply engineer around: Anthropic's strict mode currently limits how complex a tool's schema is allowed to be, and that limit blocks strict mode from being used on exactly the nested, multi-edit tools where the failures concentrate. That leaves builders of complex coding agents - the same territory covered in our earlier look at whether coding agents can be trusted - stuck choosing between a tool shape their agent needs and a safety net that does not yet support it. The broader pattern, of a model inventing confident-sounding detail that isn't real, is also the subject of our explainer on hallucination.

Originally published on Ground Truth, where every claim is checked against the primary source.