DEV Community: toshanthi-stack

Why Prompts Fail in Production (and the 4 Failure Vectors)

toshanthi-stack — Sun, 14 Jun 2026 03:04:12 +0000

Originally published on AI School — free AI & ML courses, no signup. This is lesson 1 of the free course Prompt Patterns That Survive Production.

The playground-to-production gap is real, consistent, and almost always fixable — once you know which four vectors are doing the damage.

The Playground Is a Lie

Every developer who has shipped an LLM-powered feature has been surprised in the same way. The prompt worked perfectly in the playground. The first fifty test users were fine. Then something went wrong — a weird response, a parsing error, an output that violated the format contract — and the investigation revealed that the prompt that seemed solid was actually fragile the whole time.

This is not bad luck. It is a predictable structural property of how prompts interact with LLMs. The playground hides the failure modes that matter most. You feed it the inputs you thought of. Real users feed it the inputs you didn't.

The Four Production Failure Vectors

Production prompt failures cluster into four categories. Understanding which vector is causing a failure is the first step to fixing it.

1. Input Distribution Shift

In the playground, you control what goes in. In production, users bring inputs that are longer, shorter, multilingual, adversarially formatted, semantically ambiguous, or just weird in ways you didn't anticipate. A classification prompt that works for the ten example categories you tested will silently miscategorize edge-case inputs that don't fit any bucket. A summarization prompt that works for well-structured documents will produce garbage on bullet-point lists or tables.

The failure is not the prompt — it's the assumption that the prompt was tested on a representative sample of the real distribution. It almost never was.

2. Context Contamination

In a multi-turn system, each turn appends to the context. By turn fifteen, the context contains earlier instructions, earlier outputs, user corrections, and possibly conflicting signals. A prompt that performs perfectly on turn one will degrade measurably by turn ten as the model's attention divides across a growing context that dilutes the behavioral instructions you set at the start. This is not a bug in any particular model — it is a property of transformer attention at length, and it applies to all current LLMs.

3. Model Updates

Hosted model providers update their models on schedules that do not align with your deployment calendar. A model update can change the default output format, modify how the model interprets ambiguous instructions, alter refusal thresholds, or change verbosity. A prompt that pinned to implicit model behavior — "it always returns JSON" without being told to — will break silently when that behavior changes. The teams that get burned are the ones whose prompts relied on undocumented model behavior rather than explicit constraints.

4. Adversarial and Unexpected User Creativity

Real users try things you didn't design for. They ask the system questions outside its scope. They try to override the system prompt. They input data in formats the prompt doesn't handle — code when you expected prose, tables when you expected paragraphs, emojis in every field. These inputs don't have to be malicious to be damaging. Even well-intentioned users routinely produce inputs that fall into the gaps your prompt didn't cover.

Playground Assumption	Production Reality
Inputs resemble my test cases	Inputs span a long tail you didn't test
First turn context is all there is	Conversation history contaminates later turns
Model behavior is stable	Providers update models without notice
Users follow the intended flow	Users explore, probe, and break the flow
Output parsing works	Format violations break downstream systems

The Engineering Mindset

The shift from "craft a good prompt" to "engineer a production prompt" is a mindset change, not just a skill change. Production prompts are software. They have contracts (the expected input/output format), failure modes (things that break them), regressions (changes that make them worse), and a lifecycle (they need to be versioned, tested, and monitored).

This framing matters because it changes the questions you ask:

Craft mindset: "Does this produce a good output for my test case?"
Engineering mindset: "What is the worst input I could receive, and what does my prompt do with it?"
Craft mindset: "Does this work?"
Engineering mindset: "How will I know when this stops working?"

✅ The Red-Team Rule: Before shipping any prompt, spend fifteen minutes trying to break it. Give it the worst inputs you can think of. If it fails gracefully, ship it. If it fails badly, fix the failure mode first. Every edge case you discover before production is one you don't investigate at 2 AM after a user complaint.

What "Surviving Production" Actually Means

A prompt survives production when it meets four criteria:

Output is parseable. Downstream code that depends on the output can process it without exception handling for format surprises.
Behavior is predictable under variance. The output stays within the intended behavioral envelope across the input distribution — not just for the happy path.
Failures are catchable. When the prompt does fail, the failure is detectable before the user sees a broken experience.
Changes can be made safely. When the prompt needs updating, you can make the change without unknowingly breaking something that was working.

None of these properties come for free. Each one requires deliberate design choices — the patterns and practices the full course covers.

What the Full Course Covers

The remaining lessons build from specific patterns to the full production discipline:

The five patterns that consistently survive production, with before/after examples
How to architect a system prompt with layers that maintain their guarantees
Output format enforcement — the techniques that parsers can rely on
Few-shot design at scale, including dynamic injection
The five failure categories and how to diagnose each
Versioning, regression testing, and eval pipelines
The 25-point pre-deploy checklist and the maturity model

I write these as part of AI School, a free learning platform (2,300+ courses, no signup). If this was useful, the full Prompt Patterns That Survive Production course is free there — and the cost side is covered in Token Optimization.

What Does the Claude API Actually Cost? (June 2026)

toshanthi-stack — Sun, 14 Jun 2026 03:02:27 +0000

Originally published on AI School — free AI & ML courses, no signup. Full guide: What Does the Claude API Actually Cost?

Per-token prices are public, but your bill is determined by three multipliers most teams ignore: caching, batching, and model routing. Here is the real math, with four fully worked scenarios.

Prices verified June 2026 — always confirm at anthropic.com/pricing.

The List Prices (June 2026)

Claude is billed per million tokens (MTok), with separate rates for input (what you send) and output (what the model generates). A token is roughly ¾ of an English word.

Model	Input / MTok	Output / MTok	Context	Sweet spot
Claude Opus 4.8	$5.00	$25.00	1M tokens	Agents, hard reasoning, long-horizon coding
Claude Sonnet 4.6	$3.00	$15.00	1M tokens	Most production workloads
Claude Haiku 4.5	$1.00	$5.00	200K tokens	Classification, routing, real-time chat

Two structural facts shape everything below:

Output costs 5× input. An app that generates long answers pays mostly for output; an app that reads long documents pays mostly for input. Know which one you are.
Input is re-billed every call. In a 20-turn conversation, turn 20 re-sends (and re-pays for) everything from turns 1–19 — unless you cache it.

The Three Multipliers

1. Prompt caching: reads at 0.1×, writes at 1.25×

Any stable prefix of your request (system prompt, documents, conversation history) can be cached. Cached tokens are re-read at 10% of the input price; writing them to cache costs a one-time 25% premium (or 2× for the 1-hour cache lifetime instead of the default 5 minutes).

Model	Base input	Cache write (5-min)	Cache read
Opus 4.8	$5.00	$6.25	$0.50
Sonnet 4.6	$3.00	$3.75	$0.30
Haiku 4.5	$1.00	$1.25	$0.10

Break-even is fast: with the 5-minute cache, the second request already saves money (1.25× + 0.1× = 1.35× vs 2× uncached).

⚠️ The silent minimum-size gotcha. Prefixes below a model-specific minimum silently refuse to cache — no error, you just pay full price forever. The minimum is 4,096 tokens on Opus 4.8 and Haiku 4.5 and 2,048 on Sonnet 4.6. A tidy 3,000-token system prompt on Haiku never caches. Check usage.cache_read_input_tokens in responses: if it stays 0, your "cached" prompt isn't.

2. Batch API: everything at 50% off

Jobs that can wait up to an hour (most finish faster) can run through the Message Batches API at half price on all tokens — and batching stacks with caching.

3. Model routing: a 5× lever before you optimize anything

Haiku input is 5× cheaper than Opus, output 5× cheaper. The standard production pattern is a cascade: Haiku handles the easy 80%, escalates the hard 20% to Sonnet or Opus. Optimize cost per successful task, not cost per token — a cheap model that fails and retries can out-spend an expensive one that succeeds first try.

Scenario 1 — Support Chatbot (Haiku 4.5)

Assumptions: 100,000 messages/month; 5,000-token system prompt (instructions + few-shot examples — deliberately above Haiku's 4,096 caching minimum); average 1,500 tokens of conversation history + 100-token user message per call; 300-token replies.

	Per message	Per month (100K msgs)
No caching: 6,600 in × $1 + 300 out × $5	$0.0081	$810
System prompt cached: 5,000 read × $0.10 + 1,600 in × $1 + 300 out × $5	$0.0036	$360

One cache_control breakpoint cuts the bill by 56%. Caching the conversation history too (the standard multi-turn pattern) pushes savings further on longer chats.

Scenario 2 — RAG Document Q&A (Sonnet 4.6)

Assumptions: a 50,000-token document loaded into context; users ask 20 questions per document session; 500-token questions, 800-token answers.

	Cost per 20-question session
No caching: every question re-sends the document at $3/MTok	$3.27
Document cached: one $0.19 write, then 19 reads at $0.30/MTok	$0.74

That is 77% off, and the cached version also responds faster — the model doesn't reprocess 50K tokens per question. At 1,000 document sessions a month, caching is the difference between $3,270 and $740.

Scenario 3 — Autonomous Coding Agent (Opus 4.8)

Agents are where costs explode, because the context is re-sent on every tool call. Assumptions: one task = 40 model calls; context grows from 20K to 150K tokens across the run (average 85K per call); ~500 output tokens per call.

	Per task	50 tasks/day
No caching: 40 calls × ~85K input at $5/MTok + 20K output	$17.50	$875/day
Incremental caching: each call re-reads the prefix at $0.50/MTok, only the ~3K new tokens pay the write premium	≈$2.95	≈$148/day

~83% off. For agentic workloads, prompt caching is not an optimization — it's the difference between a viable product and an impossible one. (Anthropic's own agent products rely on exactly this pattern.)

Scenario 4 — Nightly Classification Job (Haiku + Batch)

Assumptions: 100,000 records classified overnight; 400 input + 10 output tokens each.

	Per night	Per year
Real-time API: 40M in × $1 + 1M out × $5	$45.00	$16,425
Batch API (50% off everything)	$22.50	$8,213

If those records share a cacheable instruction prefix, batch + caching stack — many classification jobs land under $15/night.

Estimate Your Own Workload

Token counting is free — you can price a workload before spending anything:

# pip install anthropic
import anthropic

client = anthropic.Anthropic()

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=MY_SYSTEM_PROMPT,
    messages=[{"role": "user", "content": SAMPLE_REQUEST}],
)

IN_PRICE, OUT_PRICE = 3.00, 15.00   # Sonnet 4.6, $/MTok
est_output = 600                     # your average reply length

per_call = (count.input_tokens * IN_PRICE + est_output * OUT_PRICE) / 1_000_000
print(f"{count.input_tokens} input tokens -> ${per_call:.4f} per call")
print(f"At 10K calls/day: ${per_call * 10_000:.2f}/day")

Then check the usage object on real responses — input_tokens, output_tokens, cache_read_input_tokens — to verify your assumptions against reality.

The Checklist

Know your shape: input-heavy (RAG, agents) or output-heavy (generation)? Optimize the expensive side first.
Cache anything stable over the minimum size (4,096 tokens on Opus/Haiku, 2,048 on Sonnet) that's reused within 5 minutes — and verify with cache_read_input_tokens.
Batch anything that can wait an hour — flat 50% off, stacks with caching.
Route by difficulty: Haiku first, escalate on failure. Measure cost per successful task.
Cap output: set max_tokens deliberately and prompt for concise answers — output is the 5×-priced direction.
Re-price quarterly: model prices and caching mechanics change; the math here is June 2026.

Sources: Anthropic pricing · Prompt caching docs · Batch API docs. All scenario math uses list prices as of June 4, 2026; assumptions are stated inline so you can re-run them with your own numbers.

I write these as part of AI School, a free learning platform (no signup, no paywall). The cost-control techniques above are covered in depth in the free Token Optimization course — context engineering, output control, and cost governance.