<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manish Ramavat</title>
    <description>The latest articles on DEV Community by Manish Ramavat (@manishramavat).</description>
    <link>https://dev.to/manishramavat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3922912%2F151bbdbd-e293-447a-8ae2-e163bffa726f.jpg</url>
      <title>DEV Community: Manish Ramavat</title>
      <link>https://dev.to/manishramavat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manishramavat"/>
    <language>en</language>
    <item>
      <title>How to Save 90% on Claude API Input Costs With Prompt Caching (2026)</title>
      <dc:creator>Manish Ramavat</dc:creator>
      <pubDate>Sun, 10 May 2026 16:03:12 +0000</pubDate>
      <link>https://dev.to/manishramavat/how-to-save-90-on-claude-api-input-costs-with-prompt-caching-2026-4l78</link>
      <guid>https://dev.to/manishramavat/how-to-save-90-on-claude-api-input-costs-with-prompt-caching-2026-4l78</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you're calling the Claude API with a large system prompt, every request reprocesses the same tokens from scratch. Production AI systems — agents, RAG pipelines, customer-facing assistants — routinely carry 10K–30K token system prompts (tool definitions, reference docs, few-shot examples). At $3/MTok across hundreds of thousands of daily requests, redundant prefix processing can easily run $500–$3,000+/day. That's pure waste for context the model has already seen.&lt;/p&gt;

&lt;p&gt;Anthropic's prompt caching solves this. You mark a stable prefix as cacheable, pay a small one-time write surcharge (1.25×), and every subsequent request reads that prefix at &lt;strong&gt;10% of the standard price&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I ran a controlled experiment to measure the real-world savings. Here are the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Prompt Caching Works
&lt;/h2&gt;

&lt;p&gt;The mechanism is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You attach &lt;code&gt;cache_control: {"type": "ephemeral"}&lt;/code&gt; to a content block in your request&lt;/li&gt;
&lt;li&gt;The API caches everything up to and including that block (the "prefix")&lt;/li&gt;
&lt;li&gt;On the next request with a byte-for-byte identical prefix, the model reads from cache instead of reprocessing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pricing (Claude Sonnet 4.5):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Price / MTok&lt;/th&gt;
&lt;th&gt;Relative to Base&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard input&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write&lt;/td&gt;
&lt;td&gt;$3.75&lt;/td&gt;
&lt;td&gt;1.25×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;0.1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Constraints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimum prefix: 1,024 tokens (model-dependent)&lt;/li&gt;
&lt;li&gt;TTL: 5 minutes, refreshed on each hit&lt;/li&gt;
&lt;li&gt;Max 4 explicit breakpoints per request&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt; must be passed as an array of content blocks (not a plain string)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;p&gt;Three API calls. Same system prompt (~2,158 tokens). Same user question. The only variables are the caching configuration and, for Call 3, the warm cache left behind by Call 2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;cache_control&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Baseline — all tokens at standard rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;cache_control&lt;/code&gt; on system block&lt;/td&gt;
&lt;td&gt;Cache WRITE (1.25× on prefix)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Same as Call 2&lt;/td&gt;
&lt;td&gt;Cache READ (0.1× on prefix)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Baseline (no caching):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# plain string — caching not possible
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With explicit cache breakpoint:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One structural change: &lt;code&gt;system&lt;/code&gt; becomes a list of content blocks. That's the only code difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;&lt;code&gt;input_tokens&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;cache_creation&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;cache_read&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Total Input&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (baseline)&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (write)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;2,158&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (read)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2,158&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The API usage fields tell the full story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;input_tokens&lt;/code&gt;&lt;/strong&gt; = non-cached tail (the user message — 22 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_creation_input_tokens&lt;/code&gt;&lt;/strong&gt; = prefix written to cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_read_input_tokens&lt;/code&gt;&lt;/strong&gt; = prefix served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Call 3 reads 2,158 tokens from cache at $0.30/MTok instead of $3.00/MTok.&lt;/p&gt;
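
&lt;p&gt;To confirm which path a given request took, read these fields straight off the SDK response. Here's a minimal sketch (it assumes the &lt;code&gt;anthropic&lt;/code&gt; Python SDK's &lt;code&gt;usage&lt;/code&gt; object; the helper name is mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def log_cache_usage(response):
    """Print the per-request token breakdown reported by the Messages API."""
    usage = response.usage
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    print(f"non-cached input: {usage.input_tokens}")
    print(f"cache write:      {written}")
    print(f"cache read:       {read}")
    if read &amp;gt; 0:
        print("cache HIT: prefix billed at the 0.1x read rate")
    elif written &amp;gt; 0:
        print("cache WRITE: prefix billed at the 1.25x write rate")
    else:
        print("no caching: all input billed at the standard rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;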

&lt;h2&gt;
  
  
  Cost Analysis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;Actual Input Cost&lt;/th&gt;
&lt;th&gt;Baseline Cost&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (baseline)&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (write)&lt;/td&gt;
&lt;td&gt;$0.008159&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;+24.7% (write surcharge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (read)&lt;/td&gt;
&lt;td&gt;$0.000713&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−89.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The write costs 25% more than baseline. The read costs 89% less. Break-even: &lt;strong&gt;2 requests&lt;/strong&gt;.&lt;/p&gt;
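
&lt;p&gt;The arithmetic is easy to check against the pricing table. A quick sketch using the same token counts (2,158-token cached prefix plus a 22-token user message):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-token input prices for Claude Sonnet 4.5 (from the pricing table above)
BASE, WRITE, READ = 3.00e-6, 3.75e-6, 0.30e-6
prefix, tail = 2_158, 22

baseline = (prefix + tail) * BASE        # every request without caching  (~$0.006540)
first    = prefix * WRITE + tail * BASE  # request 1: cache write         (~$0.008159)
later    = prefix * READ  + tail * BASE  # request 2+: cache reads        (~$0.000713)

print((first + later) / (2 * baseline))  # ~0.68: two cached requests already cost ~32% less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;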

&lt;h2&gt;
  
  
  Production Projection
&lt;/h2&gt;

&lt;p&gt;At 10,000 requests/day with a 5-minute TTL, I conservatively assume one cache write per TTL window (288/day), even though the TTL refreshes on every hit, so steady traffic would trigger even fewer writes. The remaining 9,712 requests pay cache-read pricing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;With Caching&lt;/th&gt;
&lt;th&gt;Without&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;$54.28&lt;/td&gt;
&lt;td&gt;$110.40&lt;/td&gt;
&lt;td&gt;$56.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly (30d)&lt;/td&gt;
&lt;td&gt;$1,628&lt;/td&gt;
&lt;td&gt;$3,312&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,684&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Savings&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
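
&lt;p&gt;The projection is reproducible in a few lines. Note the assumptions it bakes in: every request uses the full 300 output tokens, the prefix is written once per 5-minute window, and traffic is spread evenly across the day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Daily projection for a 2,158-token cached prefix at 10,000 requests/day
BASE, WRITE, READ, OUT = 3.00e-6, 3.75e-6, 0.30e-6, 15.00e-6
requests, prefix, tail, output = 10_000, 2_158, 22, 300

writes = 24 * 60 // 5          # 288 five-minute TTL windows per day
reads  = requests - writes     # 9,712 requests served from cache

with_cache = (writes * prefix * WRITE
              + reads * prefix * READ
              + requests * (tail * BASE + output * OUT))
without = requests * ((prefix + tail) * BASE + output * OUT)

print(f"with caching:    ${with_cache:,.2f}/day")          # ~$54.28
print(f"without caching: ${without:,.2f}/day")              # ~$110.40
print(f"savings:         {1 - with_cache / without:.1%}")   # ~50.8%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;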

&lt;p&gt;This is with a ~2,158-token system prompt. For agent-style workloads with 10K–30K token system prompts (tool definitions, reference docs, few-shot examples), the non-cached tail becomes a tiny fraction of each request and the occasional write surcharge amortizes over thousands of reads, so total input savings approach 85–89%.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Pitfall Worth Documenting
&lt;/h2&gt;

&lt;p&gt;My first implementation used &lt;strong&gt;top-level automatic caching&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Fails silently with varying user messages
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# breakpoint at last block
&lt;/span&gt;    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;  &lt;span class="c1"&gt;# varies per request
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call triggered a cache &lt;strong&gt;write&lt;/strong&gt; — never a read. The API returned &lt;code&gt;cache_creation_input_tokens &amp;gt; 0&lt;/code&gt; on every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Top-level &lt;code&gt;cache_control&lt;/code&gt; places the breakpoint at the &lt;em&gt;last cacheable block&lt;/em&gt;, which includes the user message. Different messages produce different prefixes, so the cache key never matches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use explicit &lt;code&gt;cache_control&lt;/code&gt; on the system prompt block. The cached prefix then covers only the stable system prompt, and varying user messages sit after the breakpoint.&lt;/p&gt;

&lt;p&gt;This is not documented prominently in Anthropic's guides, but it's the critical distinction between "caching that works" and "caching that silently charges you 25% more on every call."&lt;/p&gt;

&lt;h2&gt;
  
  
  When Prompt Caching Makes Sense
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Expected Input Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static system prompt (&amp;gt;1K tokens) across requests&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-turn conversations (growing message history)&lt;/td&gt;
&lt;td&gt;70–85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG with stable reference documents&lt;/td&gt;
&lt;td&gt;80–90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent loops with large tool catalogues&lt;/td&gt;
&lt;td&gt;60–80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Verify system prompt exceeds the model's minimum (1,024 tokens for Sonnet 4.5)&lt;/li&gt;
&lt;li&gt;Restructure &lt;code&gt;system&lt;/code&gt; from a plain string to a list of content blocks&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;"cache_control": {"type": "ephemeral"}&lt;/code&gt; on the last stable block&lt;/li&gt;
&lt;li&gt;Place static content before dynamic content in the prompt&lt;/li&gt;
&lt;li&gt;Confirm cache reads by checking &lt;code&gt;cache_read_input_tokens &amp;gt; 0&lt;/code&gt; in responses&lt;/li&gt;
&lt;li&gt;Ensure request frequency stays within the 5-minute TTL window&lt;/li&gt;
&lt;/ol&gt;
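
&lt;p&gt;Putting items 2–5 together, here's a minimal wrapper I'd use in practice (a sketch only; the function name and defaults are mine, and the client comes from the official &lt;code&gt;anthropic&lt;/code&gt; SDK):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def cached_create(system_prompt: str, question: str, **kwargs):
    """Call the Messages API with the stable system prompt marked as a cache breakpoint."""
    response = client.messages.create(
        model=kwargs.pop("model", "claude-sonnet-4-5"),
        max_tokens=kwargs.pop("max_tokens", 300),
        system=[{
            "type": "text",
            "text": system_prompt,                   # static content first (item 4)
            "cache_control": {"type": "ephemeral"},  # breakpoint on the stable block (items 2-3)
        }],
        messages=[{"role": "user", "content": question}],  # dynamic tail sits after the breakpoint
        **kwargs,
    )
    usage = response.usage
    # Item 5: from the second call onwards a hit shows up as cache_read_input_tokens &amp;gt; 0
    print(f"cache_read={getattr(usage, 'cache_read_input_tokens', 0)} "
          f"cache_write={getattr(usage, 'cache_creation_input_tokens', 0)}")
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;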

&lt;h2&gt;
  
  
  Full Experiment
&lt;/h2&gt;

&lt;p&gt;Reproducible notebook with all code: &lt;strong&gt;&lt;a href="https://www.kaggle.com/code/manishramavat/how-prompt-caching-cuts-claude-api-costs-by-90" rel="noopener noreferrer"&gt;Kaggle →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic Prompt Caching Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/about-claude/pricing" rel="noopener noreferrer"&gt;Claude API Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/cookbook/misc-prompt-caching" rel="noopener noreferrer"&gt;Prompt Caching Cookbook (Official)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Article #2 in the LLM Engineering Experiments series. Previous: &lt;a href="https://dev.to/manishramavat/how-to-choose-the-right-prompt-engineering-pattern-and-why-simpler-is-usually-better-446b"&gt;How to Choose the Right Prompt Engineering Pattern&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Choose the Right Prompt Engineering Pattern (And Why Simpler Is Usually Better)</title>
      <dc:creator>Manish Ramavat</dc:creator>
      <pubDate>Sun, 10 May 2026 15:04:58 +0000</pubDate>
      <link>https://dev.to/manishramavat/how-to-choose-the-right-prompt-engineering-pattern-and-why-simpler-is-usually-better-446b</link>
      <guid>https://dev.to/manishramavat/how-to-choose-the-right-prompt-engineering-pattern-and-why-simpler-is-usually-better-446b</guid>
      <description>&lt;p&gt;I spent the past weekend running a head-to-head experiment: five popular prompt engineering patterns, one real model (Claude Sonnet 4.5), fifty real movie reviews. The goal was simple — find out which technique actually delivers the best results.&lt;/p&gt;

&lt;p&gt;The result? &lt;strong&gt;The simplest approach won.&lt;/strong&gt; And the most sophisticated one — Chain-of-Thought — didn't just underperform. It &lt;em&gt;actively made things worse.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Patterns I Tested
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Zero-Shot
&lt;/h3&gt;

&lt;p&gt;Direct instruction, no examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify this movie review as 'positive' or 'negative'. Reply with only one word.

Review: "The food was terrible but the service was great."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Few-Shot (k=3)
&lt;/h3&gt;

&lt;p&gt;Three input-output examples before the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify movie reviews as 'positive' or 'negative'. Reply with only one word.

Review: "a masterpiece of modern cinema" → positive
Review: "boring and pointless" → negative
Review: "absolutely loved every minute" → positive

Review: "The food was terrible but the service was great." →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Chain-of-Thought (CoT)
&lt;/h3&gt;

&lt;p&gt;Ask the model to reason step by step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify this movie review as 'positive' or 'negative'.
Think step by step about the sentiment words and overall tone,
then give your final answer on the last line as just one word.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Role Prompting
&lt;/h3&gt;

&lt;p&gt;Assign the model a persona.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert sentiment analyst with 20 years of experience
in film criticism and NLP.
Classify this movie review as 'positive' or 'negative'. Reply with only one word.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Structured Output
&lt;/h3&gt;

&lt;p&gt;Force the model to respond in JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Analyze&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;movie&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;review&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;respond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ONLY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;valid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"negative"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;I ran all 5 patterns against 50 real movie reviews from the SST-2 dataset using Claude Sonnet 4.5. Each review was classified as positive or negative, and I measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; — did it get the right answer?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — how long did it take?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost&lt;/strong&gt; — how many tokens were consumed?&lt;/li&gt;
&lt;/ul&gt;
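
&lt;p&gt;For reference, the measurement loop is only a few lines. This is a simplified sketch, not the exact notebook code: it assumes the &lt;code&gt;anthropic&lt;/code&gt; Python SDK and a list of &lt;code&gt;(review, label)&lt;/code&gt; pairs, and it shows only the Zero-Shot template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import anthropic

client = anthropic.Anthropic()

ZERO_SHOT = ("Classify this movie review as 'positive' or 'negative'. "
             "Reply with only one word.\n\nReview: \"{review}\"")

def evaluate(prompt_template, samples):
    """Run one pattern over (review, label) pairs and return accuracy, latency, tokens."""
    correct, total_latency, total_tokens = 0, 0.0, 0
    for review, label in samples:
        start = time.time()
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            messages=[{"role": "user",
                       "content": prompt_template.format(review=review)}],
        )
        total_latency += time.time() - start
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        # Take the last word of the reply so CoT-style answers are parsed too
        prediction = response.content[0].text.strip().split()[-1].lower().strip(".'\"")
        correct += prediction == label
    n = len(samples)
    return correct / n, total_latency / n, total_tokens / n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;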

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Avg Tokens&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zero-Shot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.58s&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-Shot (k=3)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.78s&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;1.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Role Prompting&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.83s&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured Output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.06s&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;1.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chain-of-Thought&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.23s&lt;/td&gt;
&lt;td&gt;228&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprising Finding
&lt;/h2&gt;

&lt;p&gt;Four out of five patterns hit &lt;strong&gt;98% accuracy&lt;/strong&gt;. The model is simply good enough at binary sentiment that Zero-Shot, Few-Shot, Role Prompting, and Structured Output all achieve nearly the same result.&lt;/p&gt;

&lt;p&gt;But Chain-of-Thought &lt;strong&gt;collapsed to 64%&lt;/strong&gt; — barely better than guessing.&lt;/p&gt;

&lt;p&gt;Here's a real example. For the review &lt;em&gt;"an utterly compelling 'torture' story"&lt;/em&gt; (label: positive), Zero-Shot immediately returned "positive." But Chain-of-Thought went down a rabbit hole:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The word 'torture' has negative connotations... however 'compelling' is positive... the quotes around 'torture' suggest it may be used figuratively... but the overall sentiment is ambiguous..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And got it wrong.&lt;/p&gt;

&lt;p&gt;Why? Because asking the model to "think step by step" about something it already knows how to do &lt;em&gt;introduces confusion&lt;/em&gt;. The reasoning process picks up on ambiguity that doesn't exist when the model just answers directly. It overthinks.&lt;/p&gt;

&lt;p&gt;And it costs 4.6x more tokens for that worse result.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for You
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Don't reach for complex patterns by default
&lt;/h3&gt;

&lt;p&gt;If your model is capable enough for the task, Zero-Shot might be all you need. In my test, the simplest approach was the cheapest, fastest, AND tied for most accurate. There was literally no reason to use anything fancier.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Chain-of-Thought can actively hurt on "simple" tasks
&lt;/h3&gt;

&lt;p&gt;CoT is designed for multi-step reasoning (math, logic, planning). When you apply it to tasks the model already handles well in one shot, you're adding noise, not signal. In my test, it cut accuracy by 34 percentage points while costing nearly 5x more. That's the worst possible trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Fancier patterns cost more without accuracy gains
&lt;/h3&gt;

&lt;p&gt;Few-Shot used 1.7x the tokens. Role Prompting used 1.4x. Structured Output used 1.3x. All hit the same 98% as Zero-Shot. If you're running thousands of classifications per day in production, that cost difference adds up — for literally zero accuracy benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Match your pattern to your problem
&lt;/h3&gt;

&lt;p&gt;Stop asking "which pattern is best?" and start asking "how hard is this task for this model?" If the answer is "not very" — and for a frontier model on binary classification, it usually isn't — just use Zero-Shot and move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Decision Guide
&lt;/h2&gt;

&lt;p&gt;Based on these results and broader published research:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Recommended Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model is capable enough for the task&lt;/td&gt;
&lt;td&gt;Zero-Shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model needs calibration on output format&lt;/td&gt;
&lt;td&gt;Few-Shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step math or logic problems&lt;/td&gt;
&lt;td&gt;Chain-of-Thought&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need specific tone or perspective&lt;/td&gt;
&lt;td&gt;Role Prompting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need machine-parseable output&lt;/td&gt;
&lt;td&gt;Structured Output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-stakes decisions needing max accuracy&lt;/td&gt;
&lt;td&gt;Self-Consistency (multiple CoT runs + voting)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
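
&lt;p&gt;Self-Consistency is the one pattern in that table I didn't benchmark here. It's simply several Chain-of-Thought samples plus a majority vote; here's a minimal sketch (assumes the &lt;code&gt;anthropic&lt;/code&gt; SDK and a CoT prompt that asks for the final answer alone on the last line):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

def self_consistency(client, cot_prompt, runs=5):
    """Sample several Chain-of-Thought answers and return the majority vote."""
    votes = []
    for _ in range(runs):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            temperature=1.0,  # sampling diversity is the whole point
            messages=[{"role": "user", "content": cot_prompt}],
        )
        # The CoT prompt puts the one-word answer on the last line
        votes.append(response.content[0].text.strip().splitlines()[-1].strip().lower())
    return Counter(votes).most_common(1)[0][0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;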

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the simplest pattern that works. Add complexity only when the data proves you need it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zero-Shot was the fastest (1.58s), cheapest (50 tokens), and tied for most accurate (98%). Every other pattern either matched it at higher cost, or actively hurt performance.&lt;/p&gt;

&lt;p&gt;The biggest mistake I see is reaching for Chain-of-Thought or complex prompting strategies when a direct instruction would have been faster, cheaper, and more reliable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full experiment code available on &lt;a href="https://www.kaggle.com/code/manishramavat/how-to-choose-the-right-prompt-engineering-pattern/notebook" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;. Model: Claude Sonnet 4.5. Dataset: 50 SST-2 sentiment samples (28 positive, 22 negative).&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS.&lt;/li&gt;
&lt;li&gt;Wang, X., et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR.&lt;/li&gt;
&lt;li&gt;Brown, T., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS.&lt;/li&gt;
&lt;li&gt;Anthropic. (2024). "Prompt Engineering for Claude." Anthropic Documentation.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
