<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: toshanthi-stack</title>
    <description>The latest articles on DEV Community by toshanthi-stack (@toshanthistack).</description>
    <link>https://dev.to/toshanthistack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3983338%2Fca80db77-4730-4453-a49b-b49dd274967c.png</url>
      <title>DEV Community: toshanthi-stack</title>
      <link>https://dev.to/toshanthistack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/toshanthistack"/>
    <language>en</language>
    <item>
      <title>Why Prompts Fail in Production (and the 4 Failure Vectors)</title>
      <dc:creator>toshanthi-stack</dc:creator>
      <pubDate>Sun, 14 Jun 2026 03:04:12 +0000</pubDate>
      <link>https://dev.to/toshanthistack/why-prompts-fail-in-production-and-the-4-failure-vectors-3adg</link>
      <guid>https://dev.to/toshanthistack/why-prompts-fail-in-production-and-the-4-failure-vectors-3adg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://lillytechsystems.com/ai-school/" rel="noopener noreferrer"&gt;AI School&lt;/a&gt; — free AI &amp;amp; ML courses, no signup. This is lesson 1 of the free course &lt;a href="https://lillytechsystems.com/ai-school/prompt-patterns-production/" rel="noopener noreferrer"&gt;Prompt Patterns That Survive Production&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The playground-to-production gap is real, consistent, and almost always fixable — once you know which four vectors are doing the damage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Playground Is a Lie
&lt;/h2&gt;

&lt;p&gt;Every developer who has shipped an LLM-powered feature has been surprised in the same way. The prompt worked perfectly in the playground. The first fifty test users were fine. Then something went wrong — a weird response, a parsing error, an output that violated the format contract — and the investigation revealed that the prompt that seemed solid was actually fragile the whole time.&lt;/p&gt;

&lt;p&gt;This is not bad luck. It is a predictable structural property of how prompts interact with LLMs. The playground hides the failure modes that matter most. You feed it the inputs you thought of. Real users feed it the inputs you didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Production Failure Vectors
&lt;/h2&gt;

&lt;p&gt;Production prompt failures cluster into four categories. Understanding which vector is causing a failure is the first step to fixing it.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Input Distribution Shift
&lt;/h3&gt;

&lt;p&gt;In the playground, you control what goes in. In production, users bring inputs that are longer, shorter, multilingual, adversarially formatted, semantically ambiguous, or just weird in ways you didn't anticipate. A classification prompt that works for the ten example categories you tested will silently miscategorize edge-case inputs that don't fit any bucket. A summarization prompt that works for well-structured documents will produce garbage on bullet-point lists or tables.&lt;/p&gt;

&lt;p&gt;The failure is not the prompt — it's the assumption that the prompt was tested on a representative sample of the real distribution. It almost never was.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context Contamination
&lt;/h3&gt;

&lt;p&gt;In a multi-turn system, each turn appends to the context. By turn fifteen, the context contains earlier instructions, earlier outputs, user corrections, and possibly conflicting signals. A prompt that performs perfectly on turn one will degrade measurably by turn ten as the model's attention divides across a growing context that dilutes the behavioral instructions you set at the start. This is not a bug in any particular model — it is a property of transformer attention at length, and it applies to all current LLMs.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Model Updates
&lt;/h3&gt;

&lt;p&gt;Hosted model providers update their models on schedules that do not align with your deployment calendar. A model update can change the default output format, modify how the model interprets ambiguous instructions, alter refusal thresholds, or change verbosity. A prompt that pinned to implicit model behavior — "it always returns JSON" without being told to — will break silently when that behavior changes. The teams that get burned are the ones whose prompts relied on undocumented model behavior rather than explicit constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Adversarial and Unexpected User Creativity
&lt;/h3&gt;

&lt;p&gt;Real users try things you didn't design for. They ask the system questions outside its scope. They try to override the system prompt. They input data in formats the prompt doesn't handle — code when you expected prose, tables when you expected paragraphs, emojis in every field. These inputs don't have to be malicious to be damaging. Even well-intentioned users routinely produce inputs that fall into the gaps your prompt didn't cover.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Playground Assumption&lt;/th&gt;
&lt;th&gt;Production Reality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inputs resemble my test cases&lt;/td&gt;
&lt;td&gt;Inputs span a long tail you didn't test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First turn context is all there is&lt;/td&gt;
&lt;td&gt;Conversation history contaminates later turns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model behavior is stable&lt;/td&gt;
&lt;td&gt;Providers update models without notice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Users follow the intended flow&lt;/td&gt;
&lt;td&gt;Users explore, probe, and break the flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output parsing works&lt;/td&gt;
&lt;td&gt;Format violations break downstream systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Engineering Mindset
&lt;/h2&gt;

&lt;p&gt;The shift from "craft a good prompt" to "engineer a production prompt" is a mindset change, not just a skill change. Production prompts are software. They have contracts (the expected input/output format), failure modes (things that break them), regressions (changes that make them worse), and a lifecycle (they need to be versioned, tested, and monitored).&lt;/p&gt;

&lt;p&gt;This framing matters because it changes the questions you ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Craft mindset:&lt;/strong&gt; "Does this produce a good output for my test case?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering mindset:&lt;/strong&gt; "What is the worst input I could receive, and what does my prompt do with it?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Craft mindset:&lt;/strong&gt; "Does this work?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering mindset:&lt;/strong&gt; "How will I know when this stops working?"&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;The Red-Team Rule:&lt;/strong&gt; Before shipping any prompt, spend fifteen minutes trying to break it. Give it the worst inputs you can think of. If it fails gracefully, ship it. If it fails badly, fix the failure mode first. Every edge case you discover before production is one you don't investigate at 2 AM after a user complaint.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What "Surviving Production" Actually Means
&lt;/h2&gt;

&lt;p&gt;A prompt survives production when it meets four criteria:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Output is parseable.&lt;/strong&gt; Downstream code that depends on the output can process it without exception handling for format surprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavior is predictable under variance.&lt;/strong&gt; The output stays within the intended behavioral envelope across the input distribution — not just for the happy path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failures are catchable.&lt;/strong&gt; When the prompt does fail, the failure is detectable before the user sees a broken experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changes can be made safely.&lt;/strong&gt; When the prompt needs updating, you can make the change without unknowingly breaking something that was working.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these properties come for free. Each one requires deliberate design choices — the patterns and practices the full course covers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Full Course Covers
&lt;/h2&gt;

&lt;p&gt;The remaining lessons build from specific patterns to the full production discipline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The five patterns that consistently survive production, with before/after examples&lt;/li&gt;
&lt;li&gt;How to architect a system prompt with layers that maintain their guarantees&lt;/li&gt;
&lt;li&gt;Output format enforcement — the techniques that parsers can rely on&lt;/li&gt;
&lt;li&gt;Few-shot design at scale, including dynamic injection&lt;/li&gt;
&lt;li&gt;The five failure categories and how to diagnose each&lt;/li&gt;
&lt;li&gt;Versioning, regression testing, and eval pipelines&lt;/li&gt;
&lt;li&gt;The 25-point pre-deploy checklist and the maturity model&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I write these as part of &lt;a href="https://lillytechsystems.com/ai-school/" rel="noopener noreferrer"&gt;AI School&lt;/a&gt;, a free learning platform (2,300+ courses, no signup). If this was useful, the full &lt;a href="https://lillytechsystems.com/ai-school/prompt-patterns-production/" rel="noopener noreferrer"&gt;Prompt Patterns That Survive Production&lt;/a&gt; course is free there — and the cost side is covered in &lt;a href="https://lillytechsystems.com/ai-school/token-optimization/" rel="noopener noreferrer"&gt;Token Optimization&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Does the Claude API Actually Cost? (June 2026)</title>
      <dc:creator>toshanthi-stack</dc:creator>
      <pubDate>Sun, 14 Jun 2026 03:02:27 +0000</pubDate>
      <link>https://dev.to/toshanthistack/what-does-the-claude-api-actually-cost-june-2026-9mh</link>
      <guid>https://dev.to/toshanthistack/what-does-the-claude-api-actually-cost-june-2026-9mh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://lillytechsystems.com/ai-school/" rel="noopener noreferrer"&gt;AI School&lt;/a&gt; — free AI &amp;amp; ML courses, no signup. Full guide: &lt;a href="https://lillytechsystems.com/ai-school/guides/claude-api-costs.html" rel="noopener noreferrer"&gt;What Does the Claude API Actually Cost?&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Per-token prices are public, but your bill is determined by three multipliers most teams ignore: caching, batching, and model routing. Here is the real math, with four fully worked scenarios.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prices verified June 2026 — always confirm at &lt;a href="https://www.anthropic.com/pricing" rel="noopener noreferrer"&gt;anthropic.com/pricing&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The List Prices (June 2026)
&lt;/h2&gt;

&lt;p&gt;Claude is billed per million tokens (MTok), with separate rates for input (what you send) and output (what the model generates). A token is roughly ¾ of an English word.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input / MTok&lt;/th&gt;
&lt;th&gt;Output / MTok&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Sweet spot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;Agents, hard reasoning, long-horizon coding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;Most production workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Haiku 4.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Classification, routing, real-time chat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two structural facts shape everything below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output costs 5× input.&lt;/strong&gt; An app that generates long answers pays mostly for output; an app that reads long documents pays mostly for input. Know which one you are.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input is re-billed every call.&lt;/strong&gt; In a 20-turn conversation, turn 20 re-sends (and re-pays for) everything from turns 1–19 — unless you cache it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Three Multipliers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Prompt caching: reads at 0.1×, writes at 1.25×
&lt;/h3&gt;

&lt;p&gt;Any stable prefix of your request (system prompt, documents, conversation history) can be cached. Cached tokens are re-read at &lt;strong&gt;10% of the input price&lt;/strong&gt;; writing them to cache costs a one-time 25% premium (or 2× for the 1-hour cache lifetime instead of the default 5 minutes).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Base input&lt;/th&gt;
&lt;th&gt;Cache write (5-min)&lt;/th&gt;
&lt;th&gt;Cache read&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.8&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$6.25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$3.75&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Break-even is fast: with the 5-minute cache, the &lt;em&gt;second&lt;/em&gt; request already saves money (1.25× + 0.1× = 1.35× vs 2× uncached).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The silent minimum-size gotcha.&lt;/strong&gt; Prefixes below a model-specific minimum &lt;em&gt;silently refuse to cache&lt;/em&gt; — no error, you just pay full price forever. The minimum is &lt;strong&gt;4,096 tokens on Opus 4.8 and Haiku 4.5&lt;/strong&gt; and 2,048 on Sonnet 4.6. A tidy 3,000-token system prompt on Haiku never caches. Check &lt;code&gt;usage.cache_read_input_tokens&lt;/code&gt; in responses: if it stays 0, your "cached" prompt isn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Batch API: everything at 50% off
&lt;/h3&gt;

&lt;p&gt;Jobs that can wait up to an hour (most finish faster) can run through the &lt;a href="https://platform.claude.com/docs/en/build-with-claude/batch-processing" rel="noopener noreferrer"&gt;Message Batches API&lt;/a&gt; at &lt;strong&gt;half price on all tokens&lt;/strong&gt; — and batching stacks with caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Model routing: a 5× lever before you optimize anything
&lt;/h3&gt;

&lt;p&gt;Haiku input is 5× cheaper than Opus, output 5× cheaper. The standard production pattern is a &lt;em&gt;cascade&lt;/em&gt;: Haiku handles the easy 80%, escalates the hard 20% to Sonnet or Opus. Optimize &lt;strong&gt;cost per successful task&lt;/strong&gt;, not cost per token — a cheap model that fails and retries can out-spend an expensive one that succeeds first try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 1 — Support Chatbot (Haiku 4.5)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Assumptions: 100,000 messages/month; 5,000-token system prompt (instructions + few-shot examples — deliberately above Haiku's 4,096 caching minimum); average 1,500 tokens of conversation history + 100-token user message per call; 300-token replies.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Per message&lt;/th&gt;
&lt;th&gt;Per month (100K msgs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No caching&lt;/strong&gt;: 6,600 in × $1 + 300 out × $5&lt;/td&gt;
&lt;td&gt;$0.0081&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$810&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;System prompt cached&lt;/strong&gt;: 5,000 read × $0.10 + 1,600 in × $1 + 300 out × $5&lt;/td&gt;
&lt;td&gt;$0.0036&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$360&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One &lt;code&gt;cache_control&lt;/code&gt; breakpoint cuts the bill by &lt;strong&gt;56%&lt;/strong&gt;. Caching the conversation history too (the standard multi-turn pattern) pushes savings further on longer chats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 2 — RAG Document Q&amp;amp;A (Sonnet 4.6)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Assumptions: a 50,000-token document loaded into context; users ask 20 questions per document session; 500-token questions, 800-token answers.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Cost per 20-question session&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No caching&lt;/strong&gt;: every question re-sends the document at $3/MTok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3.27&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Document cached&lt;/strong&gt;: one $0.19 write, then 19 reads at $0.30/MTok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.74&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is &lt;strong&gt;77% off&lt;/strong&gt;, and the cached version also responds faster — the model doesn't reprocess 50K tokens per question. At 1,000 document sessions a month, caching is the difference between $3,270 and $740.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 3 — Autonomous Coding Agent (Opus 4.8)
&lt;/h2&gt;

&lt;p&gt;Agents are where costs explode, because the context is re-sent on &lt;em&gt;every tool call&lt;/em&gt;. &lt;em&gt;Assumptions: one task = 40 model calls; context grows from 20K to 150K tokens across the run (average 85K per call); ~500 output tokens per call.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Per task&lt;/th&gt;
&lt;th&gt;50 tasks/day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No caching&lt;/strong&gt;: 40 calls × ~85K input at $5/MTok + 20K output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$17.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$875/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Incremental caching&lt;/strong&gt;: each call re-reads the prefix at $0.50/MTok, only the ~3K new tokens pay the write premium&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≈$2.95&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;≈$148/day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;~83% off.&lt;/strong&gt; For agentic workloads, prompt caching is not an optimization — it's the difference between a viable product and an impossible one. (Anthropic's own agent products rely on exactly this pattern.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 4 — Nightly Classification Job (Haiku + Batch)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Assumptions: 100,000 records classified overnight; 400 input + 10 output tokens each.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Per night&lt;/th&gt;
&lt;th&gt;Per year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Real-time API&lt;/strong&gt;: 40M in × $1 + 1M out × $5&lt;/td&gt;
&lt;td&gt;$45.00&lt;/td&gt;
&lt;td&gt;$16,425&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Batch API&lt;/strong&gt; (50% off everything)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$22.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$8,213&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If those records share a cacheable instruction prefix, batch + caching stack — many classification jobs land under $15/night.&lt;/p&gt;

&lt;h2&gt;
  
  
  Estimate Your Own Workload
&lt;/h2&gt;

&lt;p&gt;Token counting is free — you can price a workload before spending anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install anthropic
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MY_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SAMPLE_REQUEST&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;IN_PRICE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OUT_PRICE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;15.00&lt;/span&gt;   &lt;span class="c1"&gt;# Sonnet 4.6, $/MTok
&lt;/span&gt;&lt;span class="n"&gt;est_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;                     &lt;span class="c1"&gt;# your average reply length
&lt;/span&gt;
&lt;span class="n"&gt;per_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;IN_PRICE&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;est_output&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;OUT_PRICE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; input tokens -&amp;gt; $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;per_call&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; per call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At 10K calls/day: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;per_call&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then check the &lt;code&gt;usage&lt;/code&gt; object on real responses — &lt;code&gt;input_tokens&lt;/code&gt;, &lt;code&gt;output_tokens&lt;/code&gt;, &lt;code&gt;cache_read_input_tokens&lt;/code&gt; — to verify your assumptions against reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Know your shape:&lt;/strong&gt; input-heavy (RAG, agents) or output-heavy (generation)? Optimize the expensive side first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache anything stable&lt;/strong&gt; over the minimum size (4,096 tokens on Opus/Haiku, 2,048 on Sonnet) that's reused within 5 minutes — and verify with &lt;code&gt;cache_read_input_tokens&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch anything that can wait an hour&lt;/strong&gt; — flat 50% off, stacks with caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route by difficulty:&lt;/strong&gt; Haiku first, escalate on failure. Measure cost per successful task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cap output:&lt;/strong&gt; set &lt;code&gt;max_tokens&lt;/code&gt; deliberately and prompt for concise answers — output is the 5×-priced direction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-price quarterly:&lt;/strong&gt; model prices and caching mechanics change; the math here is June 2026.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href="https://www.anthropic.com/pricing" rel="noopener noreferrer"&gt;Anthropic pricing&lt;/a&gt; · &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Prompt caching docs&lt;/a&gt; · &lt;a href="https://platform.claude.com/docs/en/build-with-claude/batch-processing" rel="noopener noreferrer"&gt;Batch API docs&lt;/a&gt;. All scenario math uses list prices as of June 4, 2026; assumptions are stated inline so you can re-run them with your own numbers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write these as part of &lt;a href="https://lillytechsystems.com/ai-school/" rel="noopener noreferrer"&gt;AI School&lt;/a&gt;, a free learning platform (no signup, no paywall). The cost-control techniques above are covered in depth in the free &lt;a href="https://lillytechsystems.com/ai-school/token-optimization/" rel="noopener noreferrer"&gt;Token Optimization course&lt;/a&gt; — context engineering, output control, and cost governance.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
