- Book: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture a classification pipeline that has run cleanly for nine months. A new Sonnet release lands, and suddenly a noticeable slice of rows comes back empty. Same prompt, same inputs, same temperature. The diff: the model now likes to preface its JSON with "Sure, here's the JSON you asked for:". The downstream regex peels off everything from the first `{` it finds, trips on stray braces in the surrounding text, and the wrapper quietly returns `{}` for every row whose free text contains a stray brace.
The fix is not a better regex. The fix is deleting the regex.
Anthropic now documents native Structured Outputs on the Claude Developer Platform, joining the tool-use-with-strict-schema pattern that has been the durable answer for two years. If your prompts still ask the model to "respond with JSON only please," you are one model version away from a 3 a.m. page. Here's the anti-pattern, why it breaks, and the pattern that survives every version bump.
What brittle JSON prompting actually looks like
This is the pattern that ships in tutorials and bites in production:
import json
import re
import anthropic
client = anthropic.Anthropic()
PROMPT = """You are a support-ticket classifier.
Return ONLY valid JSON in this exact format:
{"category": "...", "priority": "...", "needs_escalation": true|false}
Do not include any other text. Do not wrap in markdown.
The category must be one of: billing, bug, feature, other.
The priority must be one of: low, medium, high.
Ticket: {ticket_text}
"""
def classify(ticket: str) -> dict:
r = client.messages.create(
model="claude-sonnet-4-7",
max_tokens=200,
messages=[
{
"role": "user",
"content": PROMPT.replace("{ticket_text}", ticket),
}
],
)
text = r.content[0].text
# Strip markdown fences if the model added them.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
# Find the first {...} block. The model sometimes wraps in prose.
m = re.search(r"\{.*\}", text, re.DOTALL)
return json.loads(m.group(0)) if m else {}
It works. It works on most inputs. It works until one of these happens:
- A model upgrade adjusts the assistant's preface style and your prefix-strip misses a new pattern.
- A user types a `}` in a ticket body and your "first `{...}`" regex captures the wrong span.
- The model emits `priority: "Medium"` (capitalized) and your downstream code doesn't normalize.
- The model decides one ticket is ambiguous and emits two JSON objects separated by `\n\n`.
- The model truncates at `max_tokens=200` mid-string and your `json.loads` raises.
Every one of these is a real production incident. The common cause: you're asking a probabilistic system to behave like a parser, and patching the cracks in the wrapper code. The wrapper is the bug.
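The wrong-span failure in particular takes two lines to reproduce with the exact regex from the wrapper above; the input string here is illustrative:
import json
import re

# A stray brace anywhere after the JSON drags the greedy match past the object.
text = 'Sure! {"category": "bug"} Anything else? :-}'
m = re.search(r"\{.*\}", text, re.DOTALL)
print(m.group(0))       # {"category": "bug"} Anything else? :-}
json.loads(m.group(0))  # raises json.JSONDecodeError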
Why "respond with JSON only" gets weaker over time
Each model version adjusts its instruction-following distribution. A subtle behavior, like whether to put a one-line preamble before the JSON when the user asks for explanation-free output, is noise across versions. Anthropic doesn't promise stability on it, because what they tune is the broader instruction-following.
The Pydantic AI team, which has dealt with every output mode under the sun, documents five output modes precisely because "ask the model nicely" is the weakest one. Their prompted mode (inject the schema in the prompt and hope) is the fallback for models with no better option. Their tool mode (the default) uses the model's tool-calling pathway because it's grammar-constrained on the API side.
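For a feel of what tool mode looks like in practice, here's a sketch against Pydantic AI, reusing the Classification model defined in the next section. I'm assuming the current `Agent` / `output_type` spelling; older releases wrote these as `result_type` and `result.data`, so check your installed version.
from pydantic_ai import Agent

# Sketch only: assumes pydantic-ai's current API (output_type / result.output).
# Classification is the Pydantic model defined in the next section.
agent = Agent(
    "anthropic:claude-sonnet-4-7",  # model name reused from this post's examples
    output_type=Classification,     # delivered through the tool-calling pathway
)

result = agent.run_sync("Ticket: I was charged twice for invoice #4411.")
print(result.output)  # a validated Classification instance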
Anthropic's structured outputs docs are even more direct: with `output_format` and a JSON schema, the platform constrains generation at the token level so the response validates against the schema in the normal path. No regex. No preamble-stripping. No "the model added a trailing comma again" Slack thread.
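A minimal sketch of that path with the schema written out inline. The parameter shape follows the structured-outputs docs at the time of writing; the feature may sit behind a beta namespace or header depending on your SDK version, so verify against the current docs before relying on this:
import json

# Sketch only: check the exact output_format shape and any beta flag
# against Anthropic's current structured-outputs documentation.
schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "feature", "other"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "needs_escalation": {"type": "boolean"},
    },
    "required": ["category", "priority", "needs_escalation"],
    "additionalProperties": False,
}

r = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=400,
    output_format={"type": "json_schema", "schema": schema},
    messages=[{"role": "user", "content": ticket}],
)
data = json.loads(r.content[0].text)  # constrained decoding: this parses by construction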
The durable pattern: tool use with a JSON schema
Two durable ways to anchor your output in 2026: native structured outputs (newer, simpler API) and tool use with `tool_choice` forcing a specific tool (older, broadly supported, works on every Claude version that ships tools).
Tool use is what I reach for first because it works across providers. The same shape ports to OpenAI's tool-calling and to Pydantic AI's tool mode without rewriting application code. Here's the same classifier, rewritten:
import anthropic
from pydantic import BaseModel, Field, ValidationError
from typing import Literal
client = anthropic.Anthropic()
class Classification(BaseModel):
category: Literal["billing", "bug", "feature", "other"]
priority: Literal["low", "medium", "high"]
needs_escalation: bool
reason: str = Field(
description="One sentence explaining the classification."
)
CLASSIFY_TOOL = {
"name": "record_classification",
"description": (
"Record the classification for a single support ticket. "
"Call this exactly once per ticket."
),
"input_schema": Classification.model_json_schema(),
}
SYSTEM = (
"You classify support tickets. For every ticket, call "
"`record_classification` exactly once with the structured fields."
)
def classify(ticket: str) -> Classification:
r = client.messages.create(
model="claude-sonnet-4-7",
max_tokens=400,
system=SYSTEM,
tools=[CLASSIFY_TOOL],
tool_choice={"type": "tool", "name": "record_classification"},
messages=[{"role": "user", "content": ticket}],
)
for block in r.content:
if block.type == "tool_use" and block.name == "record_classification":
try:
return Classification.model_validate(block.input)
except ValidationError as e:
# Validation should not fail; the schema constrains the model.
# If it does, log and re-raise. Your schema is the contract.
raise RuntimeError(
f"Schema drift on classify: {e}"
) from e
raise RuntimeError("No tool call in response.")
What changed and why each change matters:
- `tool_choice={"type": "tool", "name": "record_classification"}` is the linchpin. Anthropic's API guarantees the model will call this specific tool. No assistant-text preamble path. No "I'll think about this first" detour. The output is a structured `tool_use` block, not free text.
- `Classification.model_json_schema()` is the source of truth. The Pydantic model defines both the schema and the runtime validator: the schema goes to the API, the validator runs on the response, and they cannot drift because they're generated from the same class. Adding a field is one edit, not three.
- `Literal[...]` for enums. The schema Pydantic produces includes `enum: ["billing", "bug", ...]` (see the fragment after this list). The model will not return `"Billing"` or `"bugs"`; the constrained generation path emits a token sequence from the enum. You don't need a normalization step.
- `max_tokens=400`, not 200. Tool inputs carry a small structural overhead, and tight budgets were exactly where the original brittle version truncated and produced parser-incompatible JSON. With tool use, truncation surfaces as a clean API error you can catch and retry, not a malformed string.
- No regex anywhere. The wrapper code is six lines: call the model, find the tool block, validate. There is nothing to update on a model bump.
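For reference, this is roughly what those Literal fields become in the generated schema (Pydantic v2; exact keys vary by version):
print(Classification.model_json_schema()["properties"]["category"])
# Roughly: {'enum': ['billing', 'bug', 'feature', 'other'],
#           'title': 'Category', 'type': 'string'}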
What survives a version bump and what doesn't
Run the brittle version and the durable version against the last several Claude releases on the same eval set, and the error shape is the part that matters.
The brittle version's failure rate tends to drift across versions, sometimes single-digit, sometimes well into double digits. Every version, the failures take a different shape: sometimes empty objects, sometimes capitalization mismatches, sometimes truncation. Each shape needs its own monkey-patch in the wrapper. The wrapper accretes over time and nobody on the team is confident about deleting any of it.
The durable version's shape failure rate stays effectively flat across the same versions. Not because the model is perfect. Because the failure mode is now the API rejecting an invalid schema, which is a 400 you can see in your error budget, not a silent {} that bleeds into your warehouse.
The lesson is not "tool use is more accurate." The lesson is tool use moves the failure mode from silent to loud. A regression in classification quality still happens. That's a different problem you address with evals. But you can no longer be wrong about whether the data shape was valid, and you no longer need a wrapper that pretends to parse arbitrary text.
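If you want to run that comparison yourself, the harness is small. A sketch, assuming both classify versions from this post are parameterized by model name and TICKETS is a fixed eval set you already have:
def shape_failures(classify, model: str, tickets: list[str]) -> int:
    # Count shape failures only; classification *quality* is a separate eval.
    failures = 0
    for t in tickets:
        try:
            result = classify(t, model=model)
            if not result:          # the brittle version's silent-{} path
                failures += 1
        except Exception:           # the durable version fails loudly instead
            failures += 1
    return failures

for model in ["claude-sonnet-4-7"]:  # plus whatever older versions you can still pin
    print(model,
          shape_failures(classify_brittle, model, TICKETS),
          shape_failures(classify_durable, model, TICKETS))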
The migration path for an existing prompt
You probably have a few hundred prompts asking for JSON. You don't need to migrate all of them. Start with the ones that match any of these:
- It's a classifier or extractor (single-shot structured task).
- The downstream consumer is code, not a human.
- The data lands in a database, queue, or another API.
- The current implementation has a regex, a `try/except json.JSONDecodeError`, or a "strip markdown fences" helper.
For each one:
- Define a Pydantic model (or a JSON schema directly) with `Literal` enums and `Field` descriptions. The descriptions are part of the prompt the model sees; write them as instructions.
- Wrap it as a tool with a clear `description` ("Call this exactly once per ...").
- Force the call with `tool_choice`.
- Delete the regex, the markdown stripper, the JSON repair library, and the "I'll just retry on parse error" loop.
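The first three steps collapse into a helper you can reuse across prompts; a sketch with hypothetical names:
import anthropic
from pydantic import BaseModel

def structured_call(
    client: anthropic.Anthropic,
    model_cls: type[BaseModel],
    tool_name: str,
    system: str,
    user: str,
    model: str = "claude-sonnet-4-7",
    max_tokens: int = 400,
) -> BaseModel:
    # One forced tool call, validated against the same schema that was sent.
    tool = {
        "name": tool_name,
        "description": "Call this exactly once with the structured fields.",
        "input_schema": model_cls.model_json_schema(),
    }
    r = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        tools=[tool],
        tool_choice={"type": "tool", "name": tool_name},
        messages=[{"role": "user", "content": user}],
    )
    for block in r.content:
        if block.type == "tool_use" and block.name == tool_name:
            return model_cls.model_validate(block.input)
    raise RuntimeError("No tool call in response.")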
Code review the diff for what you removed, not just what you added. The wrapper code disappearing is the actual win.
When you still need free-text output
Not every output is structured. Drafting an email, summarizing a meeting, writing a code review: these are inherently free-text, and forcing them into a JSON shape is an anti-pattern in the other direction. The right move there is a hybrid response: a tool call that records structured metadata (sentiment, topics, recommended action), and a free-text field inside the tool input that holds the body. The structured envelope still gives you parseability; the body still reads like prose.
class EmailDraft(BaseModel):
subject: str
body: str = Field(description="Plain text, no markdown.")
tone: Literal["formal", "neutral", "warm"]
follow_up_in_days: int | None = None
One tool call, one validated object, one durable contract. The only thing the next model bump can do is improve it.
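Wiring it up is the same forced-tool-call pattern as the classifier; the tool name and prompts here are illustrative:
DRAFT_TOOL = {
    "name": "record_email_draft",  # hypothetical tool name
    "description": "Record the drafted email. Call this exactly once.",
    "input_schema": EmailDraft.model_json_schema(),
}

r = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=800,
    system="You draft customer-support emails.",
    tools=[DRAFT_TOOL],
    tool_choice={"type": "tool", "name": "record_email_draft"},
    messages=[{"role": "user", "content": "Apologize for the double charge."}],
)
draft = next(
    EmailDraft.model_validate(b.input)
    for b in r.content
    if b.type == "tool_use" and b.name == "record_email_draft"
)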
If this was useful
The Prompt Engineering Pocket Guide has a chapter on durable structured outputs that covers the version-bump failure modes, the tool-use pattern across Anthropic and OpenAI, and the eval setup that tells you when prompt drift is a quality regression versus a shape regression. If your prompts are starting to feel load-bearing, it's the short read.