<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christina Norman</title>
    <description>The latest articles on DEV Community by Christina Norman (@kitaekatt).</description>
    <link>https://dev.to/kitaekatt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784242%2F386d6a01-213a-401e-84b5-a48a39d244b9.png</url>
      <title>DEV Community: Christina Norman</title>
      <link>https://dev.to/kitaekatt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kitaekatt"/>
    <language>en</language>
    <item>
      <title>The Claude Code Information Hierarchy</title>
      <dc:creator>Christina Norman</dc:creator>
      <pubDate>Sat, 21 Feb 2026 19:58:07 +0000</pubDate>
      <link>https://dev.to/kitaekatt/the-claude-code-information-hierarchy-n7m</link>
      <guid>https://dev.to/kitaekatt/the-claude-code-information-hierarchy-n7m</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a href="https://kitaekatt.github.io/claude-code-information-hierarchy/" rel="noopener noreferrer"&gt;Printable version&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every piece of information you give Claude Code can be evaluated on three axes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;🟡 &lt;strong&gt;Recall&lt;/strong&gt; — Likelihood Claude will load this information into context at all. &lt;em&gt;Claude can't apply information it doesn't recall.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔵 &lt;strong&gt;Attention&lt;/strong&gt; — Likelihood Claude will skillfully apply the information once it's loaded. &lt;em&gt;If Claude ignores information, it might as well not recall it.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🟢 &lt;strong&gt;Context Load&lt;/strong&gt; — How much token budget this information consumes. &lt;em&gt;Higher context load means higher cost, worse attention, slower speed.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No single tier of the hierarchy wins on all three dimensions. That's why you need the hierarchy rather than putting everything in one place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hierarchy
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;🟡 Recall&lt;/th&gt;
&lt;th&gt;🔵 Attention&lt;/th&gt;
&lt;th&gt;🟢 Context Load&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟥 &lt;strong&gt;Root CLAUDE.md&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Highest — always loaded&lt;/td&gt;
&lt;td&gt;Degrades with context size&lt;/td&gt;
&lt;td&gt;High — always consuming tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 &lt;strong&gt;Sub-dir CLAUDE.md&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;High — loaded on file access&lt;/td&gt;
&lt;td&gt;Higher — less noise than root&lt;/td&gt;
&lt;td&gt;Medium — only when relevant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔷 &lt;strong&gt;Skills&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Lower — but user can manually invoke&lt;/td&gt;
&lt;td&gt;Highest — fresh, focused context&lt;/td&gt;
&lt;td&gt;Lowest — on-demand only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Loading Triggers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;When It's Loaded&lt;/th&gt;
&lt;th&gt;Who Triggers It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟥 Root CLAUDE.md&lt;/td&gt;
&lt;td&gt;Always&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 Sub-dir CLAUDE.md&lt;/td&gt;
&lt;td&gt;Claude reads a file in that directory or below&lt;/td&gt;
&lt;td&gt;Claude's file access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔷 Skill (claude-invoked)&lt;/td&gt;
&lt;td&gt;Claude decides the skill is valuable based on its description&lt;/td&gt;
&lt;td&gt;Claude's judgment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔷 Skill (user-invoked)&lt;/td&gt;
&lt;td&gt;User runs a command (e.g., &lt;code&gt;/review-work-before-submit&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Progressive Disclosure (Skills)
&lt;/h2&gt;

&lt;p&gt;Skills can contain reference documents that are conditionally loaded into context, adding further efficiency within the most efficient tier:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5py6lsisjsqq7qkwiku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5py6lsisjsqq7qkwiku.png" alt="Progressive disclosure layers — Layer 1: Skill Description (cheap, always visible for routing), Layer 2: Skill Content (moderate, loaded on invocation), Layer 3: Referenced Documents (expensive, loaded on-demand)" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Decision Flow
&lt;/h2&gt;

&lt;p&gt;When adding new information, ask:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6rqs57q3kefjd3rizy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6rqs57q3kefjd3rizy4.png" alt="Decision flow: Does Claude need this on most interactions? YES → Root CLAUDE.md. NO → Does Claude need this to work with most files in this directory? YES → Sub-dir CLAUDE.md. NO → Skill" width="665" height="589"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  CLAUDE.md Maintenance
&lt;/h2&gt;

&lt;p&gt;When CLAUDE.md grows too large, audit each item: "Is this always needed?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuovhuy42hiz8es29upkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuovhuy42hiz8es29upkf.png" alt="CLAUDE.md maintenance flow: CLAUDE.md grows too large → audit each item → NO directory → Sub-dir CLAUDE.md, NO workflow → Skill, YES → Keep in root" width="659" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sub-dir CLAUDE.md files are loaded less often, so they are a lower priority to optimize.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Skill Maintenance
&lt;/h2&gt;

&lt;p&gt;When a skill grows too large, audit each item: "Is this always needed when the skill is invoked?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4gtliprfhvfbxt00q67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4gtliprfhvfbxt00q67.png" alt="Skill maintenance flow: Skill grows too large → audit each item → NO → Referenced document, YES → Keep in skill" width="662" height="440"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Tradeoff Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;🟡 Recall&lt;/th&gt;
&lt;th&gt;🔵 Attention&lt;/th&gt;
&lt;th&gt;🟢 Context Load&lt;/th&gt;
&lt;th&gt;Design Effort&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟥 &lt;strong&gt;Root CLAUDE.md&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Highest — always in context&lt;/td&gt;
&lt;td&gt;Degrades with context size&lt;/td&gt;
&lt;td&gt;High — always loaded&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 &lt;strong&gt;Sub-dir CLAUDE.md&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;High — loaded on file access&lt;/td&gt;
&lt;td&gt;Higher — less noise&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔷 &lt;strong&gt;Skills&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Lower — but user can manually invoke&lt;/td&gt;
&lt;td&gt;Highest — fresh, focused&lt;/td&gt;
&lt;td&gt;Lowest — on-demand&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Mastering Cache Hits in Claude Code</title>
      <dc:creator>Christina Norman</dc:creator>
      <pubDate>Sat, 21 Feb 2026 18:54:04 +0000</pubDate>
      <link>https://dev.to/kitaekatt/mastering-cache-hits-in-claude-code-5648</link>
      <guid>https://dev.to/kitaekatt/mastering-cache-hits-in-claude-code-5648</guid>
      <description>&lt;p&gt;Understanding how caching works behind the scenes so you can reduce costs and get faster responses — even though you never touch the API directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What Are Cache Hits and Why Should I Care?&lt;/li&gt;
&lt;li&gt;Anatomy of an API Call&lt;/li&gt;
&lt;li&gt;Cache Hits and Misses Explained&lt;/li&gt;
&lt;li&gt;What Breaks the Cache&lt;/li&gt;
&lt;li&gt;Cache Lifetime and the TTL Timer&lt;/li&gt;
&lt;li&gt;Structuring Your Work for Better Caching&lt;/li&gt;
&lt;li&gt;Caching Anti-Patterns&lt;/li&gt;
&lt;li&gt;API-Level Details (For When You Need Them)&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Are Cache Hits and Why Should I Care?
&lt;/h2&gt;

&lt;p&gt;Every time Claude Code sends a message on your behalf, it makes an API call to Anthropic. That API call includes everything Claude needs to respond: the system prompt, any tool definitions, your &lt;code&gt;CLAUDE.md&lt;/code&gt; files, and your entire conversation history. On a long session with a big codebase loaded, this can easily be 50,000–200,000+ tokens of input.&lt;/p&gt;

&lt;p&gt;Without caching, Anthropic's servers have to fully process all of those tokens from scratch on every single message — even though 99% of them are identical to what was sent 30 seconds ago. That's expensive and slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  What caching does
&lt;/h3&gt;

&lt;p&gt;Caching saves the computational work that the server already did on the unchanged portion of your input. Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without caching&lt;/strong&gt;: You send a 100-page document plus a question. The server reads the entire 100 pages, then answers your question. You send a different question about the same document. The server reads the entire 100 pages again, then answers. Every question costs the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With caching&lt;/strong&gt;: You send a 100-page document plus a question. The server reads the 100 pages, saves its understanding of them, then answers. You send a different question. The server loads its saved understanding of the 100 pages (fast and cheap), then only processes your new question. Every follow-up question is dramatically faster and cheaper.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What it costs
&lt;/h3&gt;

&lt;p&gt;The token costs below are API prices, but they're directly analogous to how your Claude plan's usage limit is consumed. Cache hits use less of your usage allowance than cache misses. Optimizing your cache hit rate stretches your plan further.&lt;/p&gt;

&lt;p&gt;Using Claude Sonnet 4.5 as an example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you're paying for&lt;/th&gt;
&lt;th&gt;Cost per million tokens&lt;/th&gt;
&lt;th&gt;Relative to base&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Regular input tokens (no caching)&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write (5-min TTL)&lt;/td&gt;
&lt;td&gt;$3.75&lt;/td&gt;
&lt;td&gt;1.25x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write (1-hour TTL)&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;2x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read (hitting the cache)&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;0.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output tokens&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;(not affected by caching)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key number: &lt;strong&gt;a cache hit is 10x cheaper than uncached input.&lt;/strong&gt; It's also 12.5x cheaper than a cache write on the 5-minute TTL, or 20x cheaper than a 1-hour TTL write. You break even after just 2 messages — message 1 costs $0.375 on 100K tokens (the cache write), message 2 costs $0.03 (the cache read), for a cumulative $0.405. Without caching, those same 2 messages would cost 2 × $0.30 = $0.60. After that, every additional message saves ~$0.27 on a 100K-token context.&lt;/p&gt;

&lt;p&gt;Over a full 20-turn session with 100K tokens of stable context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without caching&lt;/strong&gt;: 20 × $0.30 = &lt;strong&gt;$6.00&lt;/strong&gt; in input costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With caching&lt;/strong&gt;: $0.375 (first write) + 19 × $0.03 = &lt;strong&gt;$0.945&lt;/strong&gt; in input costs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savings: 84%&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
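&lt;p&gt;The same arithmetic generalizes to any context size and session length. A minimal sketch using the Sonnet 4.5 prices from the table above (idealized: the whole context is treated as a single stable cached prefix, and per-turn message tokens are ignored):&lt;/p&gt;

```python
# Sonnet 4.5 input prices per million tokens, from the table above.
BASE = 3.00        # uncached input
WRITE_5MIN = 3.75  # cache write, 5-minute TTL
READ = 0.30        # cache read

def input_cost(context_tokens, turns, cached=True):
    """Input-side cost of a session that resends one stable context each turn."""
    millions = context_tokens / 1_000_000
    if not cached:
        return turns * millions * BASE
    # First turn writes the cache; every later turn reads it.
    return millions * WRITE_5MIN + (turns - 1) * millions * READ

uncached = input_cost(100_000, 20, cached=False)    # $6.00
cached = input_cost(100_000, 20)                    # $0.945
print(round(1 - cached / uncached, 2))              # 0.84 -- the 84% savings
```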

&lt;p&gt;Cached tokens are also served significantly faster — latency on time-to-first-token improves substantially, though the exact improvement varies by platform and workload. Some platforms have reported up to 85% TTFT reduction in optimal conditions.&lt;/p&gt;

&lt;p&gt;As a Claude Code user, you don't directly control caching — Claude Code handles the API calls for you and automatically places cache breakpoints on your behalf. But understanding how caching works lets you structure your projects and conversations in ways that naturally lead to more cache hits, lower costs, and snappier responses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on model support:&lt;/strong&gt; Prompt caching is generally available on Claude Opus 4.5, Opus 4.1, Opus 4, Sonnet 4.5, Sonnet 4, Haiku 4.5, Haiku 3.5, and Haiku 3. It's also supported on Opus 4.6 and Sonnet 4.6 — the official documentation hasn't been updated to reflect this yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caches are isolated per model.&lt;/strong&gt; Switching models mid-session (e.g. via &lt;code&gt;/model&lt;/code&gt;) means a full cache miss. You cannot build a cache with a cheaper model and read it with a more expensive one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cache hit rate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache_hit_rate = cache_read_tokens / (cache_read_tokens + cache_write_tokens + input_tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A well-cached active session should have a hit rate around 90%.&lt;/p&gt;
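&lt;p&gt;As a function (argument names mirror the token categories in the formula; the numbers are illustrative):&lt;/p&gt;

```python
def cache_hit_rate(cache_read, cache_write, uncached_input):
    """Fraction of input tokens served from cache, per the formula above."""
    total = cache_read + cache_write + uncached_input
    if total == 0:
        return 0.0
    return cache_read / total

# Illustrative session totals: mostly reads, some writes, little uncached input.
print(round(cache_hit_rate(900_000, 60_000, 40_000), 2))    # 0.9
```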

&lt;h3&gt;
  
  
  cache-kit plugin
&lt;/h3&gt;

&lt;p&gt;Understanding cache performance isn't easy out of the box — Claude Code doesn't surface cache hit rates or token breakdowns anywhere in its UI. I built a plugin to fix that. The cache-kit plugin for Claude Code provides a &lt;code&gt;/cache-report&lt;/code&gt; skill that reads your local session transcripts and generates a formatted cache performance summary: hit rate, token breakdown by TTL, and per-request stats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marketplace:&lt;/strong&gt; &lt;code&gt;git@github.com:kitaekatt/plugins-kit.git&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Plugin name:&lt;/strong&gt; &lt;code&gt;cache-kit&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0nnivvem7uoyjwenheq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0nnivvem7uoyjwenheq.png" alt="cache-kit /cache-report output showing 89% hit rate on a 112-request session" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Anatomy of an API Call
&lt;/h2&gt;

&lt;p&gt;To understand caching, it helps to know what Claude Code actually sends to the API on your behalf. Every API call has these parts, assembled in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool definitions&lt;/strong&gt;&lt;br&gt;
All the tools Claude Code can use (bash, file edit, search, etc.)&lt;br&gt;
These are the same on every call within a session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompt&lt;/strong&gt;&lt;br&gt;
Claude Code's instructions, your &lt;code&gt;CLAUDE.md&lt;/code&gt; content, project context.&lt;br&gt;
This is mostly the same on every call within a session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Messages (your conversation)&lt;/strong&gt;&lt;br&gt;
Every user message and assistant response in the current session, in chronological order. This grows by ~2 messages each turn (your new message + Claude's response from last turn).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Caching works on this content &lt;strong&gt;from the top down&lt;/strong&gt;. The server looks at the input starting from the tool definitions and moving forward, and checks how far it can go before it hits something that's different from what it cached last time.&lt;/p&gt;

&lt;p&gt;The portion that matches the cache is called the &lt;strong&gt;cached prefix&lt;/strong&gt; — "prefix" just means "the beginning part." Everything from the start of the input up to the point where something changed gets served from cache. Everything after that point gets processed fresh.&lt;/p&gt;
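&lt;p&gt;Prefix matching can be sketched in a few lines. This is a toy model for intuition only; the real cache operates on processed content blocks server-side, not raw strings:&lt;/p&gt;

```python
def cached_prefix_len(previous_blocks, current_blocks):
    """Length of the leading run of blocks that match the prior request.

    Everything up to the first difference can be served from cache;
    everything after it is processed fresh, even if identical.
    """
    n = 0
    for old, new in zip(previous_blocks, current_blocks):
        if old != new:
            break
        n += 1
    return n

prev = ["tools", "system", "turn1_q", "turn1_a"]
curr = ["tools", "system", "turn1_q", "turn1_a", "turn2_q"]
print(cached_prefix_len(prev, curr))     # 4 -- appending preserves the prefix

# A change early in the input invalidates everything after it:
edited = ["tools", "SYSTEM_CHANGED", "turn1_q", "turn1_a"]
print(cached_prefix_len(prev, edited))   # 1 -- only "tools" is still cached
```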
&lt;h3&gt;
  
  
  How Claude Code places cache breakpoints
&lt;/h3&gt;

&lt;p&gt;Claude Code automatically places &lt;code&gt;cache_control&lt;/code&gt; breakpoints on the last content block of user and assistant messages. Thinking blocks are explicitly excluded from cache breakpoints. This is all automatic — no user configuration needed or possible (aside from disabling caching entirely via environment variables).&lt;/p&gt;


&lt;h2&gt;
  
  
  Cache Hits and Misses Explained
&lt;/h2&gt;

&lt;p&gt;This is the core concept, so let's walk through a concrete example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: A 5-turn Claude Code session&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're working in Claude Code on a project. Your session has tool definitions (~5K tokens), a system prompt with &lt;code&gt;CLAUDE.md&lt;/code&gt; (~3K tokens), and you're going back and forth with Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1 — You ask your first question&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What's sent to the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Tools: 5K tokens] [System: 3K tokens] [Your message: 50 tokens]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache miss (nothing cached yet)&lt;/li&gt;
&lt;li&gt;→ Server processes all 8,050 tokens fresh&lt;/li&gt;
&lt;li&gt;→ Saves the processed result to cache for next time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Turn 2 — You ask a follow-up question&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What's sent to the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Tools: 5K] [System: 3K] [Turn 1 Q: 50] [Turn 1 Answer: 500] [Your new Q: 60]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache HIT on [Tools + System + Turn 1 Q] = 8,050 tokens&lt;/li&gt;
&lt;li&gt;Cache miss on [Turn 1 Answer + Your new Q] = 560 tokens&lt;/li&gt;
&lt;li&gt;→ 93% of tokens served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Turn 3 — Another follow-up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What's sent to the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Tools: 5K] [System: 3K] [Turn 1 Q+A] [Turn 2 Q+A] [Your new Q: 45]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache HIT on [Tools + System + Turn 1 + Turn 2 Q] = ~8,610 tokens&lt;/li&gt;
&lt;li&gt;Cache miss on [Turn 2 Answer + Your new Q] = ~545 tokens&lt;/li&gt;
&lt;li&gt;→ 94% of tokens served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By turn 5, you might have 95%+ of tokens served from cache. By turn 20 with a large codebase loaded, it could be 50K+ tokens of cached history and only a few hundred tokens of new content per turn.&lt;/p&gt;
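&lt;p&gt;You can simulate this growth directly. A toy model of the walkthrough above (uniform 50-token questions and 500-token answers for simplicity, so the percentages differ slightly from the turn-by-turn numbers):&lt;/p&gt;

```python
def cached_fraction_by_turn(fixed_prefix, q_tokens, a_tokens, turns):
    """Per-turn fraction of input tokens served from cache.

    Toy model: each turn resends everything, the whole prior request is
    cached, and only the last answer plus the new question are fresh.
    """
    cached = 0
    fractions = []
    for turn in range(1, turns + 1):
        if turn == 1:
            fresh = fixed_prefix + q_tokens      # cold cache: everything is fresh
        else:
            fresh = a_tokens + q_tokens          # last answer + your new question
        total = cached + fresh
        fractions.append(cached / total)
        cached = total                           # this request is cached for next turn
    return fractions

for turn, frac in enumerate(cached_fraction_by_turn(8000, 50, 500, 5), start=1):
    print(f"turn {turn}: {frac:.0%} cached")
```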

&lt;h3&gt;
  
  
  The key insight
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;You don't get 100% cache hits or 100% cache misses on a message.&lt;/strong&gt; Each message is a mix. The unchanged beginning of the conversation is a cache hit. The new content at the end is a cache miss. As your conversation grows, the ratio of cached-to-fresh tokens improves dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the "prefix" concept matters
&lt;/h3&gt;

&lt;p&gt;Caching only works on the &lt;strong&gt;beginning&lt;/strong&gt; of the input, moving forward. It can't skip over a changed section and cache something after it. If you imagine the input as a timeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[cached   ] [cached   ] [cached   ] [CHANGED  ] [not cached  ] [not cached  ]
                                         ↑
                                  Cache stops here.
                            Everything after this is processed fresh,
                            even if it's identical to last time.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In normal Claude Code usage, the prefix grows cleanly — each turn appends new content at the end, and everything before it stays cached. The prefix only breaks mid-conversation if you do something that alters earlier content, like using &lt;code&gt;/rewind&lt;/code&gt; (which removes messages from the history) or toggling a feature flag (which changes the system prompt). These are the situations where understanding prefix matching helps you diagnose unexpected cache misses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Breaks the Cache
&lt;/h2&gt;

&lt;p&gt;The cache requires &lt;strong&gt;exact byte-for-byte matching&lt;/strong&gt; from the beginning of the input forward. Anything that changes the content — even in ways that seem trivial — breaks the cache from that point onward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Things that invalidate the cache
&lt;/h3&gt;

&lt;p&gt;In the context of Claude Code, the most common cache-breakers are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session-level changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starting a new conversation (no cache carries over between sessions)&lt;/li&gt;
&lt;li&gt;Resuming a session (&lt;code&gt;--continue&lt;/code&gt;, &lt;code&gt;--resume&lt;/code&gt;, &lt;code&gt;--fork-session&lt;/code&gt;) — the system prompt is regenerated, tool definitions are reassembled, and the cache TTL has almost certainly expired while you were away. In theory, if you resumed within the TTL window (e.g. you accidentally closed a terminal and immediately reopened), you could get a cache hit since the prefix is identical — but in practice this rarely happens.&lt;/li&gt;
&lt;li&gt;Being idle longer than the cache retention period (5 minutes for Pro/API, 1 hour for Max)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Content changes that cascade:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Toggling features like web search or extended thinking on/off&lt;/li&gt;
&lt;li&gt;Switching models via &lt;code&gt;/model&lt;/code&gt; (caches are per-model)&lt;/li&gt;
&lt;li&gt;Toggling fast mode (changes the model configuration)&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;/rewind&lt;/code&gt; (see Caching Anti-Patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The invalidation hierarchy
&lt;/h3&gt;

&lt;p&gt;Changes cascade downward through this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool definitions  →  If these change, EVERYTHING below loses its cache
       ↓
System prompt     →  If this changes, all messages lose their cache
       ↓
Messages          →  Changes here only affect messages from the change onward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a normal active Claude Code session, tool definitions and the system prompt don't change between turns — they're assembled once at session start. This means the cache stays valid and you get consistent cache hits on the entire prefix up to your latest new content. The invalidation hierarchy only matters if you change your &lt;code&gt;CLAUDE.md&lt;/code&gt;, system prompt, or tools and then restart — at which point the system prompt is regenerated, tool definitions are reassembled, and the cache is cold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv1lctw6zukvko2zmlxs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv1lctw6zukvko2zmlxs.jpg" alt="Cache invalidation matrix showing which changes break the tools, system, and messages cache layers" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Things that don't break the cache
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Adding a new message at the end of the conversation (the prefix is preserved)&lt;/li&gt;
&lt;li&gt;Changing only your latest question (the prefix up to the new content is preserved)&lt;/li&gt;
&lt;li&gt;Claude giving a different-length response (output is never cached, only input)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Cache Lifetime and the TTL Timer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How the timer works
&lt;/h3&gt;

&lt;p&gt;Cached content expires after a set period of inactivity. Every time a cache hit occurs, the timer resets. So as long as you're actively working, your cache stays warm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key word is "inactivity."&lt;/strong&gt; If you send a message at 2:00, another at 2:03, and another at 2:07 — the cache stays warm the whole time because each hit resets the clock. But if you send a message at 2:00 and then nothing until the TTL expires — the cache is gone and your next message rebuilds it from scratch.&lt;/p&gt;
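&lt;p&gt;A toy model makes the "inactivity" point concrete (times in minutes; this mirrors the rule described above, not actual server behavior):&lt;/p&gt;

```python
def cache_warm_timeline(message_times_min, ttl_min):
    """For each message after the first, is the cache still warm on arrival?

    Each cache hit resets the TTL clock, so only the gap since the
    previous message matters, not total session length.
    """
    warm = []
    for prev, cur in zip(message_times_min, message_times_min[1:]):
        expired = cur - prev > ttl_min
        warm.append(not expired)
    return warm

# Messages at 2:00, 2:03, 2:07 with a 5-minute TTL: warm the whole time.
print(cache_warm_timeline([0, 3, 7], 5))    # [True, True]
# One 6-minute gap goes cold on a 5-minute TTL, but not on a 1-hour TTL.
print(cache_warm_timeline([0, 6], 5))       # [False]
print(cache_warm_timeline([0, 6], 60))      # [True]
```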

&lt;h3&gt;
  
  
  Which TTL you get
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Max plan subscribers get 1-hour TTL.&lt;/strong&gt; This is based on Claude Code source code inspection as of February 2026 — the &lt;code&gt;P31&lt;/code&gt; function grants 1h TTL to Max subscribers (not on overage), controlled by a server-side feature flag (&lt;code&gt;tengu_prompt_cache_1h_config&lt;/code&gt;). All cache writes use the 1h TTL automatically. Note that this is an implementation detail that could change without notice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro plan and API-key users get 5-minute TTL.&lt;/strong&gt; This is the default, and it means you need to stay more active to keep the cache warm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the TTL matters: a cost example
&lt;/h3&gt;

&lt;p&gt;Consider a session with 100K tokens of cached context:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 1&lt;/strong&gt; — You take 6 minutes to compose a thoughtful message (Pro plan, 5-minute TTL):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache has expired. Your message triggers a full cache rebuild.&lt;/li&gt;
&lt;li&gt;Cost on the cached content: &lt;strong&gt;1.25x&lt;/strong&gt; (cache write)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example 2&lt;/strong&gt; — After 3 minutes you send "do nothing, just keeping cache warm", then 3 minutes later send your real message:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The keepalive message hits the cache (0.1x) and resets the timer.&lt;/li&gt;
&lt;li&gt;Your real message also hits the cache (0.1x).&lt;/li&gt;
&lt;li&gt;Cost on the cached content: &lt;strong&gt;0.2x&lt;/strong&gt; (two cache reads)&lt;/li&gt;
&lt;li&gt;Plus a small output cost for the throwaway response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example 2 is about &lt;strong&gt;6x cheaper&lt;/strong&gt; on the cached content, even accounting for the throwaway message. But it's a clunky user experience — and it illustrates exactly why the 1-hour TTL on Max plans is valuable. Max users don't need keepalive hacks; they have a full hour of breathing room.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max users&lt;/strong&gt;: You have a 1-hour window. Take your time composing messages, reviewing output, or stepping away briefly. Your cache will be there when you get back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro/API users&lt;/strong&gt;: Stay within the 5-minute window during active work. If you know you'll be away longer, accept that the next message will rebuild the cache (slightly slower and more expensive), then caching resumes normally.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Structuring Your Work for Better Caching
&lt;/h2&gt;

&lt;p&gt;Now that you understand what the cache is, how it expires, and what breaks it, here are the habits that keep your cache hit rate high.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch related work into one session
&lt;/h3&gt;

&lt;p&gt;If you know your work is going to involve reading significant data into context, it's better to have one longer conversation about that data than many small conversations where you re-read the same data each time. Loading the same context once and asking multiple questions is dramatically cheaper than loading it separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading a 100K-token file and asking 5 questions in one session:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Turn 1: 100K cache write (1.25x) + question&lt;/li&gt;
&lt;li&gt;Turns 2–5: 100K cache read (0.1x) each + question&lt;/li&gt;
&lt;li&gt;Total file cost: 1.25x + 4 × 0.1x = &lt;strong&gt;1.65x&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Loading the same file in 5 separate sessions, 1 question each:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each session: 100K cache write (1.25x) + question, cache never reused&lt;/li&gt;
&lt;li&gt;Total file cost: 5 × 1.25x = &lt;strong&gt;6.25x&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The five-session approach is &lt;strong&gt;~3.8x more expensive&lt;/strong&gt; on the file alone, because you pay the cache write cost every time and never get a cache hit. The cache you built in each session is thrown away unused.&lt;/p&gt;
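&lt;p&gt;The comparison generalizes to any split of questions across sessions. A minimal sketch in relative cost multiples (idealized: each session pays one cache write for the file, and every in-session follow-up is a full cache hit):&lt;/p&gt;

```python
WRITE = 1.25    # cache write, multiple of the base input price (5-min TTL)
READ = 0.10     # cache read, multiple of the base input price

def file_cost_multiple(questions, sessions):
    """Relative input cost of reprocessing one large file across sessions."""
    per_session = questions // sessions
    return sessions * (WRITE + (per_session - 1) * READ)

print(round(file_cost_multiple(5, 1), 2))   # 1.65 -- one session, five questions
print(file_cost_multiple(5, 5))             # 6.25 -- five sessions, one each
```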

&lt;h3&gt;
  
  
  Fork sessions for parallel work
&lt;/h3&gt;

&lt;p&gt;Claude Code's &lt;code&gt;--fork-session&lt;/code&gt; feature lets you create multiple independent sessions that all share the same conversation history up to the fork point. Because the forked sessions share an identical prefix, they all benefit from the same cache.&lt;/p&gt;

&lt;p&gt;The workflow: build a base session with shared context (load files, establish the problem), then fork it into parallel investigations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build a shared context session&lt;/span&gt;
claude
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Read src/db/queries.ts, src/api/routes.ts, src/middleware/auth.ts,
  src/services/payment.ts, and the last 200 lines of logs/production.log

&lt;span class="c"&gt;# Name it, then fork into parallel investigations in separate terminals&lt;/span&gt;
/rename bug-investigation-base

&lt;span class="c"&gt;# Terminal 1:&lt;/span&gt;
claude &lt;span class="nt"&gt;--resume&lt;/span&gt; bug-investigation-base &lt;span class="nt"&gt;--fork-session&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"Investigate the database timeout errors in the production logs."&lt;/span&gt;

&lt;span class="c"&gt;# Terminal 2:&lt;/span&gt;
claude &lt;span class="nt"&gt;--resume&lt;/span&gt; bug-investigation-base &lt;span class="nt"&gt;--fork-session&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"Analyze whether token expiry handling could cause the 401 errors."&lt;/span&gt;

&lt;span class="c"&gt;# Terminal 3:&lt;/span&gt;
claude &lt;span class="nt"&gt;--resume&lt;/span&gt; bug-investigation-base &lt;span class="nt"&gt;--fork-session&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"Check if the API route error handling is swallowing transaction failures."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The base session pays the cache write once. Each fork gets cache hits on the shared context: 1.25x + 3 × 0.1x = &lt;strong&gt;1.55x&lt;/strong&gt; total. Three independent sessions each loading the same files would cost 3 × 1.25x = &lt;strong&gt;3.75x&lt;/strong&gt; — about 2.4x more expensive and slower.&lt;/p&gt;
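&lt;p&gt;The savings grow with the number of forks. A small sketch using the same cost model (one 1.25x cache write in the base session plus a 0.1x cache read per fork, versus a full write per independent session):&lt;/p&gt;

```python
WRITE, READ = 1.25, 0.10  # cost multipliers relative to uncached input tokens

def forked_cost(n_forks):
    """One cache write in the base session, then a cache read per fork."""
    return WRITE + n_forks * READ

def independent_cost(n_sessions):
    """Every independent session pays the full cache write on the shared files."""
    return n_sessions * WRITE

for n in (3, 5, 10):
    print(f"{n} workstreams: forked {forked_cost(n):.2f}x "
          f"vs independent {independent_cost(n):.2f}x")
```

For three forks this reproduces the 1.55x and 3.75x figures above; at ten parallel workstreams the gap widens to 2.25x versus 12.50x.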

&lt;p&gt;Each fork gets its own independent conversation history from the fork point onward. The base session remains unchanged and can be forked again. For clean separation between parallel workstreams, prefer &lt;code&gt;--fork-session&lt;/code&gt; over resuming the same session in multiple terminals (which interleaves messages in the same session file).&lt;/p&gt;




&lt;h2&gt;
  
  
  Caching Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;These are the things that silently destroy your cache or waste tokens. Avoid them when possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't let the cache TTL expire during active work
&lt;/h3&gt;

&lt;p&gt;The cache expires after a period of inactivity — 5 minutes for Pro/API users, 1 hour for Max users. Every time you let the timer expire, your next message pays the full cache write cost again. Keep the conversation moving during active work. If you're reviewing Claude's output and need more time, you're fine as long as you respond before the timer runs out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't preemptively load large context
&lt;/h3&gt;

&lt;p&gt;There's no caching advantage to loading files early in a conversation "just in case." You pay the same cache write cost regardless of when you load them, and loading them early means you pay cache-read costs on those tokens for every remaining turn — even turns where the files are irrelevant to what you're asking about. Load context when you actually need it, so you're not paying cache-read costs on tokens that aren't contributing to the current task.&lt;/p&gt;
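&lt;p&gt;A hypothetical worked example makes the difference concrete (the numbers are assumed for illustration: a 20-turn session where a large file is only relevant to the final 5 turns):&lt;/p&gt;

```python
WRITE, READ = 1.25, 0.10  # cost multipliers relative to uncached input tokens

TURNS = 20
NEEDED_FOR = 5  # the file only matters for the final 5 turns

# Load at turn 1 "just in case": one write, then a cache read on every later turn
early = WRITE + (TURNS - 1) * READ

# Load at turn 16, when it's first needed: one write, reads on the last 4 turns only
late = WRITE + (NEEDED_FOR - 1) * READ

print(f"loaded early: {early:.2f}x")  # 3.15x
print(f"loaded late:  {late:.2f}x")   # 1.65x
```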

&lt;h3&gt;
  
  
  Don't switch models mid-session
&lt;/h3&gt;

&lt;p&gt;Caches are isolated per model. If you switch from Sonnet to Opus via &lt;code&gt;/model&lt;/code&gt; mid-session, the entire cache built on Sonnet is useless — Opus starts from a cold cache and pays the full write cost again. If you need to use a different model for a specific task, consider doing it in a separate session rather than switching back and forth within one session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't use /rewind
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/rewind&lt;/code&gt; command removes messages from the end of your conversation history. Even rewinding by a single turn changes the prefix — the cached version includes the message you just removed, but your new request doesn't. The server can't match the prefix it stored, so you get a cache miss on everything after the rewind point.&lt;/p&gt;

&lt;p&gt;This matters most when you've built up significant context. If you loaded files, had a 30-turn conversation, and then rewind back to just after the file loads, you lose the cache on all 30 turns of conversation. Your next message rebuilds the cache from the file-load point forward. If you want to "start fresh" with the same loaded context, forking the session from the right point is far more cache-friendly than rewinding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't restart after CLAUDE.md or tool changes
&lt;/h3&gt;

&lt;p&gt;Claude Code reads your &lt;code&gt;CLAUDE.md&lt;/code&gt; files and assembles tool definitions at session start. If you edit &lt;code&gt;CLAUDE.md&lt;/code&gt;, install a new MCP server, or add a plugin while a session is running, your current session is unaffected — it's still using the versions from when it started. It's tempting to restart immediately so Claude picks up the changes, but restarting means a full cache rebuild. If you're mid-flow on an expensive session with a lot of context loaded, the change can wait until you naturally finish the current task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't install MCP servers or plugins mid-session
&lt;/h3&gt;

&lt;p&gt;Installing an MCP server or a plugin changes your tool definitions, which sit at the very top of the invalidation hierarchy — meaning everything below (system prompt and all messages) loses its cache on the next session start. This isn't a problem for your current session (it won't pick up the new tools until restart), but it means your next session will start with a cold cache that includes the new tool definitions. If you're planning to install several tools, batch them together so you only pay one cache rebuild rather than several.&lt;/p&gt;




&lt;h2&gt;
  
  
  API-Level Details (For When You Need Them)
&lt;/h2&gt;

&lt;p&gt;This section covers details that go beyond day-to-day Claude Code usage. It's here if you're building custom tooling, parsing session data, or just want to understand the internals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment variables for caching control
&lt;/h3&gt;

&lt;p&gt;Claude Code's caching behavior is controlled entirely via environment variables — there are no settings.json options.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Env Var&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DISABLE_PROMPT_CACHING&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable all caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DISABLE_PROMPT_CACHING_HAIKU&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable caching for Haiku model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DISABLE_PROMPT_CACHING_SONNET&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable caching for Sonnet model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DISABLE_PROMPT_CACHING_OPUS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable caching for Opus model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ENABLE_PROMPT_CACHING_1H_BEDROCK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enable 1h TTL on Bedrock&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLAUDE_CODE_FORCE_GLOBAL_CACHE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Force global system prompt caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ANTHROPIC_LOG=debug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SDK debug logging (shows HTTP request/response details)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  JSONL session transcripts
&lt;/h3&gt;

&lt;p&gt;Every Claude Code session is stored as a JSONL file under &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/&lt;/code&gt;. Each assistant message entry contains a full usage object with cache metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cache_creation_input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13099&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cache_read_input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;17976&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cache_creation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ephemeral_5m_input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ephemeral_1h_input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13099&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can parse these files directly to build custom cache analytics. This is how the cache-kit plugin's &lt;code&gt;/cache-report&lt;/code&gt; skill generates its reports — it reads the JSONL for the current session, aggregates the usage objects, and computes hit rates and token breakdowns.&lt;/p&gt;
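&lt;p&gt;A minimal sketch of that kind of aggregation, assuming the entry layout shown above (the &lt;code&gt;message.usage&lt;/code&gt; path and the directory layout reflect the current transcript format, which may change between Claude Code versions):&lt;/p&gt;

```python
import json
from pathlib import Path

def cache_stats(transcript):
    """Aggregate cache token metrics from a Claude Code session JSONL file."""
    reads = writes = uncached = 0
    for line in Path(transcript).expanduser().read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        # Usage objects appear on assistant message entries; skip everything else
        message = entry.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not usage:
            continue
        reads += usage.get("cache_read_input_tokens", 0)
        writes += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)
    total = reads + writes + uncached
    return {
        "cache_read_tokens": reads,
        "cache_write_tokens": writes,
        "uncached_tokens": uncached,
        "hit_rate": reads / total if total else 0.0,
    }
```

&lt;p&gt;Point it at any transcript under &lt;code&gt;~/.claude/projects/&amp;lt;project&amp;gt;/&lt;/code&gt; to get per-session token totals and a cache hit rate.&lt;/p&gt;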

&lt;h3&gt;
  
  
  Extended thinking and caching
&lt;/h3&gt;

&lt;p&gt;When Claude uses extended thinking in Claude Code, the thinking blocks appear in the conversation history sent on subsequent turns. These thinking blocks get cached automatically as part of the normal conversation prefix — they're treated like any other content in the message history.&lt;/p&gt;

&lt;p&gt;The nuance is that thinking blocks are explicitly excluded from being &lt;em&gt;cache breakpoints&lt;/em&gt; (the markers that tell the server "you can cache up to here"). This is handled automatically and doesn't affect your cache hit rate in practice. The thinking content still gets cached as part of the prefix when a later block has a breakpoint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;Cache hits aren't everything, and there's nothing wrong with the occasional cache miss. Do whatever works for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Anthropic Official Documentation&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Prompt Caching — Claude API Docs&lt;/a&gt;&lt;/strong&gt; — The authoritative reference covering implementation, pricing, invalidation rules, and breakpoint strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/prompt-caching" rel="noopener noreferrer"&gt;Prompt Caching with Claude — Anthropic Blog&lt;/a&gt;&lt;/strong&gt; — The announcement post covering use cases, cost/latency benefits, and early customer results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb" rel="noopener noreferrer"&gt;Prompt Caching Cookbook — Anthropic GitHub&lt;/a&gt;&lt;/strong&gt; — Jupyter notebook with hands-on examples comparing non-cached, cache-write, and cache-hit API calls with timing and cost analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.anthropic.com/news/token-saving-updates" rel="noopener noreferrer"&gt;Token-Saving Updates on the Anthropic API — Anthropic Blog&lt;/a&gt;&lt;/strong&gt; — Covers cache-aware rate limits, simplified prompt caching, and token-efficient tool use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code/costs" rel="noopener noreferrer"&gt;Manage Costs Effectively — Claude Code Docs&lt;/a&gt;&lt;/strong&gt; — Official guidance on Claude Code cost management, token optimization, and how Claude Code automatically applies prompt caching.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Platform-Specific Guides&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html" rel="noopener noreferrer"&gt;Prompt Caching on Amazon Bedrock — AWS Documentation&lt;/a&gt;&lt;/strong&gt; — Bedrock-specific implementation details, CloudWatch monitoring for cache metrics, and TTL behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/effectively-use-prompt-caching-on-amazon-bedrock/" rel="noopener noreferrer"&gt;Effectively Use Prompt Caching on Amazon Bedrock — AWS Blog&lt;/a&gt;&lt;/strong&gt; — Walkthrough of monitoring cache hit rates with CloudWatch dashboards and cost estimation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generativeai/docs/partner-models/claude/prompt-caching" rel="noopener noreferrer"&gt;Prompt Caching on Vertex AI — Google Cloud Documentation&lt;/a&gt;&lt;/strong&gt; — Google Cloud-specific caching behavior, TTL options, and pricing for Anthropic models on Vertex AI.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Community Articles and Deep Dives&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://medium.com/tr-labs-ml-engineering-blog/prompt-caching-the-secret-to-60-cost-reduction-in-llm-applications-6c792a0ac29b" rel="noopener noreferrer"&gt;Prompt Caching: The Secret to 60% Cost Reduction — Thomson Reuters Labs&lt;/a&gt;&lt;/strong&gt; — Practical guide covering cache warming patterns, parallel request handling, and real-world cost analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://ngrok.com/blog/prompt-caching/" rel="noopener noreferrer"&gt;Prompt Caching: 10x Cheaper LLM Tokens, But How? — ngrok Blog&lt;/a&gt;&lt;/strong&gt; — Technical deep-dive into what happens at the infrastructure level, with latency benchmarks comparing Anthropic and OpenAI caching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://promptbuilder.cc/blog/prompt-caching-token-economics-2025" rel="noopener noreferrer"&gt;Prompt Caching Guide 2025 — Prompt Builder&lt;/a&gt;&lt;/strong&gt; — Cross-provider comparison of caching strategies across Anthropic, OpenAI, and Google.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models" rel="noopener noreferrer"&gt;Prompt Caching with OpenAI, Anthropic, and Google Models — PromptHub&lt;/a&gt;&lt;/strong&gt; — Side-by-side comparison of caching features, pricing, and best practices across major providers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;cache-kit Plugin for Claude Code&lt;/strong&gt; — Provides a &lt;code&gt;/cache-report&lt;/code&gt; skill for viewing per-session cache performance stats directly in Claude Code. &lt;code&gt;git@github.com:kitaekatt/plugins-kit.git&lt;/code&gt; (plugin: &lt;code&gt;cache-kit&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Last updated: February 2026. Pricing, model support, and feature details are subject to change — always verify against the &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;official Anthropic documentation&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
