A fourteen-word prompt worked in the playground. Passed the demo. Then it hit production, hallucinated a refund policy, and cost forty-seven thousand dollars.
The fix? More instructions. More specificity. More temperature tweaking. The universal cope of prompt engineering.
Here's the problem: you're treating a language model like a function call. It's a distributed system. And the gap between "a good prompt" and "a reliable AI system" is where production silently fails.
Prompt engineering optimizes a sentence. Prompt architecture designs the system around it.
This post covers five patterns that close that gap — with code you can ship today.
## Pattern 1: Routing
One mega-prompt trying to handle every input will eventually contradict itself. Instead, classify first, then route to specialized prompts.
```python
from anthropic import Anthropic

client = Anthropic()

ROUTES = {
    "refund": "You are a refund specialist. Follow the refund policy strictly...",
    "technical": "You are a technical support agent. Troubleshoot step by step...",
    "general": "You are a helpful assistant. Answer concisely...",
}


def classify(user_input: str) -> str:
    """Classify input into a route using a small, fast model."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=20,
        messages=[{"role": "user", "content": user_input}],
        system="Classify this input as exactly one of: refund, technical, general. Reply with only the category.",
    )
    return response.content[0].text.strip().lower()


def handle(user_input: str) -> str:
    """Route input to a specialized prompt."""
    route = classify(user_input)
    system_prompt = ROUTES.get(route, ROUTES["general"])
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_input}],
        system=system_prompt,
    )
    return response.content[0].text
```
The classifier call costs a fraction of a cent. Each specialized prompt is fifty words instead of five hundred. No conflicting instructions. Accuracy per task skyrockets.
When it breaks: If 20%+ of inputs could belong to multiple routes, the classifier becomes the bottleneck. That's where fallbacks come in.
## Pattern 2: Fallback Chains
Your primary model will fail. Hallucinations, JSON parse errors, API timeouts. In production, that's not an exception — that's Tuesday.
A fallback chain retries with error context, then drops to a different model:
```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# `client` is the Anthropic client from Pattern 1
MODELS = ["claude-sonnet-4-6", "claude-haiku-4-5-20251001"]


def call_with_fallback(system: str, user_input: str, schema: dict) -> dict:
    """Try the primary model, retry with error context, then fall back."""
    last_error = None
    for model in MODELS:
        for attempt in range(2):  # max 2 tries per model
            prompt = user_input
            if last_error:
                prompt += f"\n\n[PREVIOUS ERROR: {last_error}. Please correct.]"
            response = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
                system=system,
            )
            try:
                result = json.loads(response.content[0].text)
                validate(result, schema)  # raises ValidationError on failure
                return result
            except (json.JSONDecodeError, ValidationError) as e:
                last_error = str(e)
                continue
    raise RuntimeError(f"All models failed. Last error: {last_error}")
```
The key is error injection — you don't retry blindly. You tell the model what went wrong. "Your output was missing the price field. Here is the schema again." That contextual feedback makes the retry succeed 80% of the time.
## Pattern 3: Structured Output
Most developers ask the model to return JSON and then pray. Brittle regex to extract it from markdown fences. Handling the case where the model wraps it in an explanation. Sound familiar?
Structured output means the response is guaranteed to match a schema. Use Instructor with Pydantic:
```python
import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field

client = instructor.from_anthropic(Anthropic())


class ProductExtraction(BaseModel):
    name: str
    price: float = Field(gt=0, description="Price in USD, must be positive")
    category: str
    in_stock: bool


product = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[{"role": "user", "content": "Extract: The Nike Air Max 90 costs $129.99 and is currently available"}],
    response_model=ProductExtraction,
)

print(product.name)      # "Nike Air Max 90"
print(product.price)     # 129.99
print(product.in_stock)  # True
```
No string parsing. No JSON extraction. A typed Python object with validation built in. When it fails, your fallback chain knows exactly what went wrong — missing field, wrong type, value out of range.
Claude's native tool use and OpenAI's function calling both enforce output schemas at the API level. Instructor wraps either one.
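For teams that want schema enforcement without an extra dependency, here is a minimal sketch of the tool-use route. The tool name `record_product` and the schema fields are illustrative, not a fixed API:

```python
# Schema enforcement via Claude's native tool use, without Instructor.
# Forcing tool_choice makes the model emit arguments conforming to the
# JSON Schema in `input_schema`.
extraction_tool = {
    "name": "record_product",
    "description": "Record the extracted product details.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number", "exclusiveMinimum": 0},
            "category": {"type": "string"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["name", "price", "category", "in_stock"],
    },
}


def extract_product(client, text: str) -> dict:
    """Return the forced tool call's arguments, which match input_schema."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "record_product"},
        messages=[{"role": "user", "content": text}],
    )
    # The forced tool call arrives as a tool_use content block
    return next(b.input for b in response.content if b.type == "tool_use")
```

The tradeoff: you parse the tool block yourself instead of getting a typed Pydantic object back.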
## Pattern 4: Validation Loops
Structured output catches type errors. It doesn't catch logical errors.
A product price of -$12.00? Valid float. Makes no sense. A date of February 30th? Passes the type check. Doesn't exist. A summary that contradicts the source document? Perfectly formatted. Completely wrong.
Add a second pass:
```python
from datetime import date

from pydantic import BaseModel, field_validator


class OrderExtraction(BaseModel):
    product: str
    price: float
    quantity: int
    order_date: date

    @field_validator("price")
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError(f"Price must be positive, got {v}")
        return v

    @field_validator("quantity")
    @classmethod
    def quantity_must_be_reasonable(cls, v):
        if v > 10_000:
            raise ValueError(f"Quantity {v} exceeds maximum order size")
        return v


def extract_with_validation(text: str, max_retries: int = 3) -> OrderExtraction:
    """Extract and validate, retrying with error feedback."""
    errors = []
    for _ in range(max_retries):
        prompt = text
        if errors:
            # inject the last error so the retry isn't blind
            prompt += f"\n\n[PREVIOUS ERROR: {errors[-1]}. Please correct.]"
        try:
            # `client` is the instructor-wrapped client from Pattern 3
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=256,
                messages=[{"role": "user", "content": prompt}],
                response_model=OrderExtraction,
            )
        except Exception as e:
            errors.append(str(e))
    raise RuntimeError(f"Validation failed after {max_retries} attempts: {errors}")
```
For deterministic rules, use Pydantic validators. For semantic checks ("Does this summary accurately reflect the source?"), use a second LLM call as the validator.
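A semantic second pass might look like this; the judge prompt wording and the PASS/FAIL convention are illustrative, not a fixed recipe:

```python
def build_judge_prompt(source: str, summary: str) -> str:
    """Build the prompt for a yes/no semantic check (wording is illustrative)."""
    return (
        "Does the summary accurately reflect the source? "
        "Reply with exactly PASS or FAIL, then one sentence of reasoning.\n\n"
        f"<source>\n{source}\n</source>\n\n<summary>\n{summary}\n</summary>"
    )


def semantic_check(client, source: str, summary: str) -> bool:
    """Second-pass LLM validation: True if the summary passes the judge."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # a cheap model is fine for judging
        max_tokens=100,
        messages=[{"role": "user", "content": build_judge_prompt(source, summary)}],
    )
    return response.content[0].text.strip().upper().startswith("PASS")
```

Keeping the judge on a cheap model means the second pass adds little cost relative to the generation call it guards.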
The $47,000 hallucination? A validation loop catches it in two API calls, not forty-seven thousand dollars later.
## Pattern 5: Cost Tracking & Observability
An architecture you can't measure is an architecture you can't improve. Every LLM call should log four things:
```python
import logging
import time
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("llm_ops")


@dataclass
class LLMCallMetrics:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    passed_validation: bool
    route: Optional[str] = None
    retry_count: int = 0

    @property
    def cost_usd(self) -> float:
        """Estimate cost based on model pricing."""
        rates = {
            "claude-sonnet-4-6": (3.0, 15.0),  # $ per 1M tokens (in, out)
            "claude-haiku-4-5-20251001": (0.80, 4.0),
        }
        in_rate, out_rate = rates.get(self.model, (3.0, 15.0))
        return (self.input_tokens * in_rate + self.output_tokens * out_rate) / 1_000_000


def tracked_call(model: str, messages: list, **kwargs) -> tuple:
    """Wrap any LLM call with metrics tracking."""
    start = time.perf_counter()
    response = client.messages.create(model=model, messages=messages, **kwargs)
    elapsed = (time.perf_counter() - start) * 1000
    metrics = LLMCallMetrics(
        model=model,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        latency_ms=elapsed,
        passed_validation=True,  # caller updates on failure
    )
    logger.info(
        f"{model} | ${metrics.cost_usd:.4f} | {elapsed:.0f}ms | "
        f"in={metrics.input_tokens} out={metrics.output_tokens}"
    )
    return response, metrics
```
I built a pipeline last month. Estimated two cents per request. Actual cost: eleven cents. The fallback chain was triggering on 30% of inputs because the router was miscategorizing ambiguous queries. Without tracking, you'd never find that — just slowly losing money.
For hosted tracing, Langfuse, LangSmith, or Braintrust are all solid. But a structured logger writing to a database works until you're past 10K requests/day.
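A minimal sketch of that database option, using stdlib `sqlite3`. The table and column names are illustrative; `metrics` is an `LLMCallMetrics` instance from Pattern 5:

```python
import sqlite3


def init_metrics_db(path: str = "llm_metrics.db") -> sqlite3.Connection:
    """Create the metrics table if it doesn't exist."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS llm_calls (
            ts REAL, model TEXT, route TEXT,
            input_tokens INTEGER, output_tokens INTEGER,
            latency_ms REAL, cost_usd REAL, passed_validation INTEGER
        )
    """)
    return conn


def log_metrics(conn: sqlite3.Connection, metrics, ts: float) -> None:
    """Persist one metrics row so cost/latency can be queried with SQL."""
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (ts, metrics.model, metrics.route, metrics.input_tokens,
         metrics.output_tokens, metrics.latency_ms, metrics.cost_usd,
         int(metrics.passed_validation)),
    )
    conn.commit()
```

A `GROUP BY route` over this table is how you spot problems like a fallback chain firing on 30% of inputs.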
## How the Patterns Compose
Here's the full picture:
```
User Input
    ↓
[Router] — classifies intent (Haiku, ~$0.001)
    ↓
[Specialized Prompt] — focused system prompt per route
    ↓
[Structured Output] — Pydantic model guarantees schema
    ↓
[Validation Loop] — business rules + semantic checks
    ↓
[Metrics Logger] — tokens, latency, cost, pass/fail
    ↓
Response (or fallback → retry with error context)
```
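Assuming the stage functions from the earlier patterns are passed in as callables, the flow can be sketched as a single pipeline function:

```python
def pipeline(user_input, classify_fn, generate_fn, validate_fn, log_fn):
    """Compose router, generation, validation, and logging.
    Each stage is injected as a callable so the pieces stay testable."""
    route = classify_fn(user_input)           # Pattern 1: routing
    result = generate_fn(route, user_input)   # Patterns 2-3: generation + fallback
    passed = validate_fn(result)              # Pattern 4: validation loop
    log_fn(route=route, passed=passed)        # Pattern 5: observability
    if not passed:
        raise ValueError(f"Validation failed for route {route!r}")
    return result
```

Dependency injection here is deliberate: you can unit-test the wiring with stub functions and no API key.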
Total implementation: ~200 lines of Python. No LangChain. No LlamaIndex. No framework lock-in.
## Bonus: Reasoning Architectures
Prompt architecture also applies to how you structure the model's reasoning:
| Pattern | How it works | Cost | Best for |
|---|---|---|---|
| Chain of Thought | Linear "think step by step" | 1x | 90% of use cases |
| Tree of Thought | Multiple paths in parallel, evaluate best branch | 3-5x | Planning, code generation |
| Graph of Thought | Paths merge and share context (non-linear) | 5-10x | Research-grade only |
Chain of Thought handles most problems. Tree of Thought is for when correctness is paramount and the search space is large. Graph of Thought? Don't touch it unless you enjoy debugging non-deterministic reasoning.
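To make the Tree of Thought row concrete, here is a minimal, model-agnostic sketch of the branch-and-evaluate loop; `propose` and `score` are assumed callables that would wrap LLM calls in practice:

```python
def tree_of_thought(question, propose, score, branches=3, depth=2):
    """Explore multiple reasoning paths; keep only the best-scoring branches.
    propose(question, path, n) -> list of candidate next thoughts
    score(question, path) -> numeric quality of a partial path
    """
    frontier = [("", 0.0)]  # (reasoning path so far, score)
    for _ in range(depth):
        candidates = []
        for path, _ in frontier:
            for thought in propose(question, path, branches):
                new_path = path + "\n" + thought
                candidates.append((new_path, score(question, new_path)))
        # prune: keep only the top-scoring branches for the next level
        frontier = sorted(candidates, key=lambda c: c[1], reverse=True)[:branches]
    return frontier[0][0]
```

The 3-5x cost multiplier in the table falls out directly: every level scores `branches` candidates per surviving path.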
## Key Takeaways
- Decompose with routing — classify first, then route to focused prompts. Kill the mega-prompt.
- Design for failure — fallback chains with error injection make retries succeed 80% of the time.
- Demand typed output — Instructor + Pydantic = no more JSON prayer.
- Validate separately — type checks ≠ logic checks. Add a second pass.
- Measure everything — four numbers per call: input tokens, output tokens, latency, validation status.
- Skip the frameworks — understand the five patterns, build them yourself. 200 lines of Python beats a 50K-line framework you can't debug.
The question isn't whether your prompt works today. It's what happens when the input distribution shifts, the model updates, or the traffic doubles.
The prompt stays the same. The architecture is what survives.
Watch the full video breakdown: Prompt Engineering Is Dead. Prompt Architecture Is What Matters.
The Machine Pulse covers the technology that's rewriting the rules — how AI actually works under the hood, what's hype vs. what's real, and what it means for your career and your future.
Follow @themachinepulse for weekly deep dives into AI, emerging tech, and the future of work.