<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yegor Shustyk</title>
    <description>The latest articles on DEV Community by Yegor Shustyk (@shustyk).</description>
    <link>https://dev.to/shustyk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1374355%2Fe44a4148-4cd9-416b-ad36-df96e9fc15b1.jpg</url>
      <title>DEV Community: Yegor Shustyk</title>
      <link>https://dev.to/shustyk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shustyk"/>
    <language>en</language>
    <item>
      <title>Prompt Caching Cut My Claude Bill by 70% — Here's the Exact Setup</title>
      <dc:creator>Yegor Shustyk</dc:creator>
      <pubDate>Sat, 23 May 2026 16:57:45 +0000</pubDate>
      <link>https://dev.to/shustyk/prompt-caching-cut-my-claude-bill-by-70-heres-the-exact-setup-2535</link>
      <guid>https://dev.to/shustyk/prompt-caching-cut-my-claude-bill-by-70-heres-the-exact-setup-2535</guid>
      <description>&lt;p&gt;I run a Claude-powered Telegram bot in production. Last 14 days: &lt;strong&gt;905 API calls, $7.62 total&lt;/strong&gt;. That's $0.0084 per call against a system prompt that's about 6,000 tokens. Without prompt caching, the same workload would have cost me roughly $25.&lt;/p&gt;

&lt;p&gt;The Anthropic docs cover prompt caching at the spec level, but practical guidance on wiring it into a real Node app that makes hundreds of calls per day is scattered. Here's the exact setup I'm running in production, plus the five gotchas that each cost me a day to figure out.&lt;/p&gt;




&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;A typical Claude call from my bot looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt:&lt;/strong&gt; ~6,000 tokens. Big block of instructions: tone, response shape, framework lenses, formatting rules, language guidance, anti-pattern checklist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-call dynamic context:&lt;/strong&gt; ~500-2,000 tokens. User's memory card, recent entries, current message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reply:&lt;/strong&gt; 200-800 tokens out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without caching, every call pays full price on those 6,000 system tokens. With ~900 calls in two weeks, that's 5.4M tokens just in &lt;strong&gt;system prompt repetition&lt;/strong&gt;. At Claude Sonnet 4.5 input pricing ($3/MTok), that's $16+ on text the model has already seen.&lt;/p&gt;

&lt;p&gt;The fix takes about 10 lines.&lt;/p&gt;




&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;Anthropic's API accepts the &lt;code&gt;system&lt;/code&gt; field as either a string (simple case) or an array of typed blocks (cache-aware case). To enable caching, you split the system field into static and dynamic pieces, and mark the static block with &lt;code&gt;cache_control: { type: "ephemeral" }&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's the helper I use across every call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// claude.js&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;withPromptCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;staticPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dynamicSuffix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;staticPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cache_control&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dynamicSuffix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dynamicSuffix&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the call site:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;askClaude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nf"&gt;withPromptCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;FREE_MESSAGE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userContextSuffix&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;recentExchanges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;callType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;free&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;FREE_MESSAGE_PROMPT&lt;/code&gt; is the big static block. &lt;code&gt;userContextSuffix&lt;/code&gt; is the small per-user dynamic part (memory card, recent entries). The dynamic part stays uncached — that's intentional and the right tradeoff.&lt;/p&gt;

&lt;p&gt;Inside &lt;code&gt;askClaude&lt;/code&gt;, the body sent to Anthropic is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// ← the array from withPromptCache()&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;contextMessages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it for setup. Now the interesting part: tracking whether it actually works.&lt;/p&gt;




&lt;h3&gt;
  
  
  Reading the three token counters
&lt;/h3&gt;

&lt;p&gt;When caching is active, Anthropic returns three input counters instead of one. You have to track all three or you'll never know if caching is doing anything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input_tokens&lt;/span&gt;                &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cacheCreated&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache_creation_input_tokens&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cacheRead&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cache_read_input_tokens&lt;/span&gt;     &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output_tokens&lt;/span&gt;               &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;// Cost math: cache-creation is +25% on top of normal price,&lt;/span&gt;
&lt;span class="c1"&gt;// cache-read is 90% off normal price. Both prices are per-token rates ($3/MTok = 3e-6).&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;        &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;INPUT_PRICE&lt;/span&gt;
           &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;cacheCreated&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;INPUT_PRICE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;
           &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;cacheRead&lt;/span&gt;    &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;INPUT_PRICE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;
           &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;       &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;OUTPUT_PRICE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you want to see in your logs: &lt;strong&gt;&lt;code&gt;cacheRead&lt;/code&gt; should dwarf &lt;code&gt;cacheCreated&lt;/code&gt;&lt;/strong&gt;. The first call in a cache window writes (1.25×); every subsequent call within ~5 minutes reads (0.10×). If &lt;code&gt;cacheCreated&lt;/code&gt; equals your static prompt size on every call, the cache is never hitting.&lt;/p&gt;

&lt;p&gt;I write all three counters to a &lt;code&gt;token_usage&lt;/code&gt; table per call, so &lt;code&gt;/admin&lt;/code&gt; can show effective spend and hit-rate over time.&lt;/p&gt;
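&lt;p&gt;As a rough illustration of what such a hit-rate view can compute, here is a minimal sketch. It is written in Python for brevity rather than taken from the bot's Node code, and the row layout (&lt;code&gt;input&lt;/code&gt;, &lt;code&gt;cache_created&lt;/code&gt;, &lt;code&gt;cache_read&lt;/code&gt;) and the sample numbers are assumptions, not the real &lt;code&gt;token_usage&lt;/code&gt; schema.&lt;/p&gt;

```python
# Hypothetical per-call rows, shaped like the three usage counters above.
rows = [
    {"input": 900, "cache_created": 6000, "cache_read": 0},   # cold call: writes the cache
    {"input": 1200, "cache_created": 0, "cache_read": 6000},  # warm call: reads it
    {"input": 700, "cache_created": 0, "cache_read": 6000},   # warm call: reads it
]


def cache_hit_rate(rows):
    """Share of cacheable input tokens that were served from the cache."""
    read = sum(r["cache_read"] for r in rows)
    created = sum(r["cache_created"] for r in rows)
    cacheable = read + created
    return read / cacheable if cacheable else 0.0


def effective_input_multiplier(rows):
    """Billed input cost relative to sending every input token at full price."""
    full = sum(r["input"] + r["cache_created"] + r["cache_read"] for r in rows)
    billed = sum(
        r["input"] + 1.25 * r["cache_created"] + 0.10 * r["cache_read"]
        for r in rows
    )
    return billed / full if full else 1.0


print(f"hit rate: {cache_hit_rate(rows):.0%}")
print(f"input cost vs. uncached: {effective_input_multiplier(rows):.2f}x")
```

&lt;p&gt;A multiplier above 1.0 over a representative window would suggest that writes are expiring unread and the wrapper is costing more than it saves.&lt;/p&gt;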




&lt;h3&gt;
  
  
  The 5 gotchas
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Minimum token threshold (silent failure mode)
&lt;/h4&gt;

&lt;p&gt;Anthropic requires your cached block to be &lt;strong&gt;at least 1024 tokens&lt;/strong&gt; for Sonnet/Opus, &lt;strong&gt;2048 for Haiku&lt;/strong&gt;. Below that, the &lt;code&gt;cache_control&lt;/code&gt; field is silently ignored. No error, no warning. You'll just see &lt;code&gt;cacheRead: 0&lt;/code&gt; forever and wonder why.&lt;/p&gt;

&lt;p&gt;If you're caching a small system prompt, you have two options: pad it with relevant context until it crosses the threshold, or accept that caching doesn't apply at your scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. The 5-minute TTL
&lt;/h4&gt;

&lt;p&gt;The cache is ephemeral with a ~5-minute TTL, and each cache hit resets that timer. This matters more than people realize when planning where to apply caching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Active chat sessions&lt;/strong&gt; (user-bot back-and-forth) — every turn within the session hits the cache. Huge win.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron loops&lt;/strong&gt; (e.g. nightly job that hits Claude per user) — if your loop processes one user every 10 seconds, the cache stays warm across the whole loop. Also a win.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparse one-off calls&lt;/strong&gt; (one insight request per day per user) — these always miss. You'll pay the 1.25× cache-creation penalty for nothing. &lt;strong&gt;Skip caching here.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Separate static from dynamic at the right line
&lt;/h4&gt;

&lt;p&gt;Putting the wrong content in the cached block invalidates the cache constantly. The rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cached block = bytes that are &lt;strong&gt;identical&lt;/strong&gt; across the call pattern you're optimizing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For my bot, that means the cached block contains the system prompt and &lt;em&gt;nothing else&lt;/em&gt;. The user's memory card, recent entries, and current question all go into the &lt;strong&gt;dynamic suffix&lt;/strong&gt; (unmarked, uncached). If I put the memory card into the cached block, every user would invalidate the cache for every other user.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Cache key is content, not order
&lt;/h4&gt;

&lt;p&gt;The cache key is a hash of the cached block's exact content. Even one whitespace change kills the cache. This bites you if you do something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BAD: interpolating per-user data changes the cached bytes, so each distinct value gets its own cache entry&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;BASE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nUser language: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;user.language&lt;/code&gt; interpolation makes the "static" block per-user. Either move it to the dynamic suffix, or accept multiple cache entries (one per language).&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Cache costs +25% the first time
&lt;/h4&gt;

&lt;p&gt;The first call after a cache miss pays 1.25× normal input price to &lt;em&gt;write&lt;/em&gt; the cache. If your traffic is too sparse to amortize this across enough reads, you're losing money.&lt;/p&gt;

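&lt;p&gt;Rough rule, using the multipliers above: one write plus one read bills 1.25 + 0.10 = &lt;strong&gt;1.35×&lt;/strong&gt; the block's normal price, against 2× for two uncached calls, so a single hit inside the TTL already covers the write premium. You lose money only when writes expire unread. If that describes most of your calls, just send the system field as a plain string and skip the wrapper.&lt;/p&gt;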




&lt;h3&gt;
  
  
  Real numbers from the project
&lt;/h3&gt;

&lt;p&gt;Last 14 days, broken down by call type (sorted by spend):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;free (chat)                 109 calls  $1.92
memory_card_midweek          41 calls  $1.43
evening                      46 calls  $0.76
morning_ack                  46 calls  $0.55
morning                     159 calls  $0.55
evening_opener              193 calls  $0.53
memory_card                  19 calls  $0.49
weekly_summary               19 calls  $0.34
...
─────────────────────────────────────
TOTAL                       905 calls  $7.62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average &lt;strong&gt;$0.0084 per call&lt;/strong&gt; on Sonnet 4.5 at ~6k input + 500 output. Without caching, this would land at roughly $0.025/call — about 3× more. Across 905 calls, that's the difference between $7 and $25 for the same work.&lt;/p&gt;
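&lt;p&gt;To sanity-check figures like these against your own traffic, a small calculator helps. This sketch (Python, illustrative) assumes $3 per million input tokens, as quoted earlier, and $15 per million output tokens, which is an assumption not stated above. The token counts are the round numbers from this section, and the function names are made up for the example.&lt;/p&gt;

```python
INPUT_PRICE = 3.00 / 1_000_000    # USD per input token (quoted earlier in the post)
OUTPUT_PRICE = 15.00 / 1_000_000  # USD per output token (assumed list price)


def uncached_cost(input_tokens, output_tokens):
    """Cost of one call when every input token is billed at full price."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


def cached_cost(static_tokens, dynamic_tokens, output_tokens):
    """Cost of one call whose static prefix is read from a warm cache (0.10x)."""
    return (
        static_tokens * INPUT_PRICE * 0.10
        + dynamic_tokens * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )


cold = uncached_cost(6_000, 500)
warm = cached_cost(6_000, 0, 500)
print(f"uncached: ${cold:.4f} per call")
print(f"warm cache: ${warm:.4f} per call")
print(f"saving on a warm call: {1 - warm / cold:.0%}")
```

&lt;p&gt;Real averages will differ from this idealised case because call types vary in prompt and reply length, and because some calls pay the 1.25× write price.&lt;/p&gt;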

&lt;p&gt;The win compounds as the bot scales. Doubling users doesn't double the cost — most additional traffic hits warm caches.&lt;/p&gt;




&lt;h3&gt;
  
  
  When NOT to use prompt caching
&lt;/h3&gt;

&lt;p&gt;I want to be specific because the docs gloss over this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sparse, one-off calls&lt;/strong&gt; where the cache usually expires before the next call arrives. You pay the 1.25× write premium and never collect a read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-user prompts&lt;/strong&gt; where the "static" block is actually per-user. You'll write a fresh cache for every user; pay the penalty, get no benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Below the token threshold&lt;/strong&gt; (1024 Sonnet / 2048 Haiku). Caching silently doesn't apply. Don't bother wrapping in &lt;code&gt;withPromptCache&lt;/code&gt; — just save the indirection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During development&lt;/strong&gt; when you're iterating on the prompt. Every prompt edit invalidates the cache, so the savings show up only after the prompt is stable.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What this powers
&lt;/h3&gt;

&lt;p&gt;This caching setup runs a Telegram-based self-reflection bot called Wise Insights — daily morning and evening check-ins, weekly summaries, memory layer that learns user patterns over time. It's live at &lt;a href="https://wise.synergize.digital/" rel="noopener noreferrer"&gt;wise.synergize.digital&lt;/a&gt; if you want to see what's running on top of all this token plumbing.&lt;/p&gt;

&lt;p&gt;Happy to share more of the architecture (Supabase + grammy + node-cron, plus how I handle the memory layer without vector embeddings) if there's interest — drop questions in the comments.&lt;/p&gt;

&lt;p&gt;The main lesson: &lt;strong&gt;prompt caching is one of those features that looks like a 10% optimization and turns out to be a 70% one&lt;/strong&gt;, but only if your traffic pattern fits. Measure the three counters, watch the hit rate, and don't wrap it where it won't help.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>webdev</category>
      <category>telegram</category>
    </item>
    <item>
      <title>How I Built a Multi-System Astrology Bot in Python (And What Meta Banned Me For)</title>
      <dc:creator>Yegor Shustyk</dc:creator>
      <pubDate>Sat, 23 May 2026 16:50:11 +0000</pubDate>
      <link>https://dev.to/shustyk/how-i-built-a-multi-system-astrology-bot-in-python-and-what-meta-banned-me-for-1j4e</link>
      <guid>https://dev.to/shustyk/how-i-built-a-multi-system-astrology-bot-in-python-and-what-meta-banned-me-for-1j4e</guid>
      <description>




&lt;p&gt;Every horoscope app reduces you to 1 of 12 sun signs. Real astrologers don't work like that — they cross-reference Western astrology, Vedic (Jyotish), Chinese Ba Zi, numerology, Human Design, and more. So I built a Telegram bot that does the same: one daily forecast synthesized from 13 systems, based on your full birth date.&lt;/p&gt;

&lt;p&gt;It's been live for ~1 month. Small still — 83 users — but I want to share the parts that actually taught me something.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Why Combining 13 Systems Is a Data Problem, Not an Astrology Problem
&lt;/h2&gt;

&lt;p&gt;Each system is a separate calculator. Western astrology needs ecliptic longitudes (I use Skyfield + NASA ephemeris). Vedic needs tithi (lunar day, 1-30) and nakshatra (27 lunar mansions). Ba Zi needs solar-term boundaries to assign the day-pillar element. Numerology needs digit-reductions with master-number exceptions (11, 22, 33 don't reduce before arithmetic).&lt;/p&gt;

&lt;p&gt;Each one is finicky in its own way. Combine them and you get an interesting failure mode: &lt;strong&gt;latent bugs that wait for the calendar.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My favourite: a lunar-day translation table had 5 entries, but &lt;code&gt;_tithi_group(30)&lt;/code&gt; returned index 5 (Amavasya / new moon). The bug sat dormant for weeks. Then a new moon arrived:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;day_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_TITHI_DAY_LABEL&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;group_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# IndexError: list index out of range
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Content generation crashed for all three languages. The bot's startup also called &lt;code&gt;ensure_content(today)&lt;/code&gt;, so it entered a crash-loop. I learned two things that day:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latent bugs wait for the calendar.&lt;/strong&gt; Any code path that runs only on specific astronomical events needs explicit tests at those boundary conditions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startup hooks shouldn't crash the process.&lt;/strong&gt; Wrap them in &lt;code&gt;try/except&lt;/code&gt; so the bot stays alive and the admin can still introspect via diagnostic commands.&lt;/li&gt;
&lt;/ol&gt;
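&lt;p&gt;The first lesson can be turned into a cheap exhaustive test, since there are only 30 lunar days. A minimal sketch follows. &lt;code&gt;tithi_group&lt;/code&gt; and &lt;code&gt;TITHI_DAY_LABEL&lt;/code&gt; are hypothetical stand-ins for the bot's private helpers, and the grouping rule and label text are invented for illustration.&lt;/p&gt;

```python
# Hypothetical label table: six entries, so index 5 (new moon) exists.
TITHI_DAY_LABEL = {
    "en": [
        "waxing, early",
        "waxing, late",
        "full moon",
        "waning, early",
        "waning, late",
        "new moon",
    ],
}


def tithi_group(tithi):
    """Map a lunar day (1-30) to an index into the label table (assumed rule)."""
    if tithi in range(1, 8):
        return 0
    if tithi in range(8, 15):
        return 1
    if tithi == 15:
        return 2  # full moon
    if tithi in range(16, 23):
        return 3
    if tithi in range(23, 30):
        return 4
    if tithi == 30:
        return 5  # new moon: the case that only shows up once per lunar cycle
    raise ValueError(f"tithi out of range: {tithi}")


def check_every_tithi_has_a_label():
    """Fail loudly if any lunar day maps outside any language's label list."""
    for lang, labels in TITHI_DAY_LABEL.items():
        for tithi in range(1, 31):
            idx = tithi_group(tithi)
            assert idx in range(len(labels)), (
                f"{lang}: tithi {tithi} maps to missing index {idx}"
            )


check_every_tithi_has_a_label()
```

&lt;p&gt;With a five-entry table, this check would fail at tithi 30 in CI instead of waiting for the next new moon.&lt;/p&gt;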

&lt;h2&gt;
  
  
  LLM Cost Architecture: One Sentinel Saved 99% of the Bill
&lt;/h2&gt;

&lt;p&gt;The bot rewrites raw template output into warm conversational language using Gemini. Daily, monthly, yearly forecasts. With per-user rewriting, costs scale linearly with users — bad.&lt;/p&gt;

&lt;p&gt;But the &lt;strong&gt;general&lt;/strong&gt; forecast (the morning broadcast everyone receives) is identical for every user. So I use a sentinel pattern: &lt;code&gt;user_id=0&lt;/code&gt; means "shared cache row". The first user to trigger the daily LLM rewrite warms the cache; everyone else reads from it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LLMOutputCache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a 5-line idea, but it cut my LLM bill from "uncomfortable" to "barely noticeable." Pre-warm cron at 03:00 UTC fills the cache before anyone wakes up.&lt;/p&gt;
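&lt;p&gt;The read-or-warm behaviour described above can be sketched without a database. In this illustration an in-memory dict stands in for the &lt;code&gt;LLMOutputCache&lt;/code&gt; table, and &lt;code&gt;get_or_rewrite&lt;/code&gt; and &lt;code&gt;fake_rewrite&lt;/code&gt; are hypothetical names, not the bot's actual functions.&lt;/p&gt;

```python
SHARED_USER_ID = 0  # sentinel: one cache row shared by every user

_cache = {}  # hypothetical stand-in for the LLMOutputCache table


def get_or_rewrite(date, lang, content_type, rewrite):
    """Return the shared rewrite, calling the LLM only on the first request."""
    key = (SHARED_USER_ID, content_type, date, lang)
    if key not in _cache:
        _cache[key] = rewrite()  # the first caller pays for the LLM call
    return _cache[key]


calls = []


def fake_rewrite():
    """Stub for the LLM call; records each invocation."""
    calls.append(1)
    return "warm forecast text"


for _ in range(83):  # 83 users request the same general forecast
    get_or_rewrite("2026-05-23", "en", "daily", fake_rewrite)

print(len(calls))  # number of times the LLM stub ran
```

&lt;p&gt;The key deliberately omits the real user id, which is what makes the row shareable. Per-user content would need the user id back in the key.&lt;/p&gt;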

&lt;h2&gt;
  
  
  The Hallucination Guard
&lt;/h2&gt;

&lt;p&gt;Gemini is happy to invent astrological facts that aren't in your seed. The seed mentions the Moon; the rewrite confidently introduces Venus. For an astrology bot, that's a catastrophe — users trust the output.&lt;/p&gt;

&lt;p&gt;My guard tokenises both texts and rejects the rewrite if any &lt;strong&gt;new planet name&lt;/strong&gt; appears in the output that wasn't in the input. Sign names are tolerated (LLM often adds "the Scorpio Moon" as natural metaphor — that's fine), but actual planet additions = reject and fall back to Groq, then to plain template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;new_planets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_astro_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rewritten&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
            &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;_extract_astro_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new_planets&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;=&lt;/span&gt; &lt;span class="n"&gt;_PLANET_TOKENS&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;new_planets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination guard fired: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_planets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# fall back
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 2-3% of Gemini outputs trigger it. The bot silently falls back; the user never sees garbage.&lt;/p&gt;
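&lt;p&gt;For readers who want to try the idea, here is a self-contained sketch of such a guard. The token sets and the regex tokeniser are assumptions, since the bot's real &lt;code&gt;_extract_astro_tokens&lt;/code&gt; and &lt;code&gt;_PLANET_TOKENS&lt;/code&gt; are not shown, and the sign list is truncated on purpose.&lt;/p&gt;

```python
import re

# Illustrative token sets, not the bot's real ones.
PLANET_TOKENS = {"sun", "moon", "mercury", "venus", "mars", "jupiter", "saturn"}
SIGN_TOKENS = {"aries", "taurus", "leo", "scorpio"}


def extract_astro_tokens(text):
    """Lower-cased astrological terms (planets and signs) found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words.intersection(PLANET_TOKENS | SIGN_TOKENS)


def new_planets(original, rewritten):
    """Planet names present in the rewrite but absent from the seed text."""
    added = extract_astro_tokens(rewritten) - extract_astro_tokens(original)
    return added.intersection(PLANET_TOKENS)


seed = "The Moon favours quiet work today."
ok = "With the Scorpio Moon overhead, quiet work pays off."
bad = "The Moon and Venus both favour quiet work."

print(new_planets(seed, ok))   # a new sign name alone is tolerated
print(new_planets(seed, bad))  # a new planet name means reject and fall back
```

&lt;p&gt;A plain word-level match like this is cheap but language-specific. A multilingual bot would need one token set per language.&lt;/p&gt;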

&lt;h2&gt;
  
  
  Auto-posting: Single Source of Truth
&lt;/h2&gt;

&lt;p&gt;I publish the same daily forecast to a Telegram channel, Instagram (a carousel of 4 PNG slides), and Threads. Three different formats, three different APIs, one piece of source content.&lt;/p&gt;

&lt;p&gt;Key insight: &lt;strong&gt;share the cached LLM rewrite across surfaces.&lt;/strong&gt; The IG caption pulls from &lt;code&gt;llm_output_cache&lt;/code&gt; for &lt;code&gt;user_id=0&lt;/code&gt;. Threads' main post pulls from the same cache and crops at the nearest sentence boundary under 500 chars. Zero extra LLM cost; one truth.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;main_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CONTENT_TYPE_DAILY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;main_text&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;main_text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;main_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
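&lt;p&gt;One possible variant of the cropping step, as a standalone function. This sketch differs from the snippet above in two ways that are my assumptions about the intent: it picks the latest boundary across all three separators instead of the first separator that qualifies, and it hard-cuts at the limit when no boundary is found, so the result never exceeds 500 characters. The name &lt;code&gt;crop_at_sentence&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def crop_at_sentence(text, limit=500, min_len=200):
    """Trim text to at most `limit` characters.

    Ends on the latest sentence boundary at or after `min_len`;
    falls back to a hard cut when no such boundary exists.
    """
    if limit >= len(text):
        return text
    head = text[:limit]
    # rfind returns -1 for a missing separator, so max() picks the latest hit.
    cut = max(head.rfind(sep) for sep in (". ", "! ", "? "))
    if cut >= min_len:
        return head[: cut + 1].rstrip()
    return head.rstrip()


sample = "A" * 250 + ". " + "B" * 100 + "! " + "C" * 300
cropped = crop_at_sentence(sample)
print(len(cropped), cropped[-1])
```

&lt;p&gt;In the sample, the period sits at index 250 and the exclamation mark at index 352, so taking the maximum keeps the second sentence as well.&lt;/p&gt;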



&lt;p&gt;The IG slide renderer uses a separate Gemini call with &lt;code&gt;response_mime_type=application/json&lt;/code&gt; to keep text within tight character budgets (slides have visual constraints the PNG renderer must respect). One LLM call per language per day, cached 24h in Redis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta Ban (Or: What I Did Wrong)
&lt;/h2&gt;

&lt;p&gt;Here's the part I'd undo. I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-post engagement-bait&lt;/strong&gt; on every Threads/IG post: "leave a reaction, share with someone" — identical wording every day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily 5-post self-reply chains&lt;/strong&gt; (main post + numerology reply + Ba Zi reply + Jyotish reply + CTA-with-link reply).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-perfect timing&lt;/strong&gt;: 04:02 UTC ±0 every single day.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a textbook spam signal. The combination — automated bot posting, identical engagement-bait, daily self-reply chains with outbound links — is exactly what Meta's integrity systems are designed to penalise.&lt;/p&gt;

&lt;p&gt;The English account was disabled outright: "We've reviewed your account and found that it doesn't follow our Community Standards on account integrity." The Russian one survived but was shadow-restricted (posts publish via API but the account vanishes from search/profiles).&lt;/p&gt;

&lt;p&gt;The de-spam was straightforward in code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dropped per-post engagement-bait, kept only a soft "link in bio" CTA&lt;/li&gt;
&lt;li&gt;Cut the 5-post chain to a single forecast post&lt;/li&gt;
&lt;li&gt;Added &lt;code&gt;jitter=14400&lt;/code&gt; seconds (±4h) to the cron so the post lands at varying times each day
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;send_threads_post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trigger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ±4h — fires anywhere in 06:00-14:00 UTC daily
&lt;/span&gt;    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threads_post&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
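
&lt;p&gt;To make the window concrete, here is a standard-library sketch of a symmetric jitter, matching the ±4h reading in the comment above. It is not the scheduler's implementation, and &lt;code&gt;fire_window&lt;/code&gt; and &lt;code&gt;jittered_fire_time&lt;/code&gt; are hypothetical helper names. Some scheduler versions may apply jitter as a forward-only delay instead, so it's worth checking how the one you run behaves.&lt;/p&gt;

```python
import random
from datetime import date, datetime, timedelta, timezone

JITTER_SECONDS = 14400  # 4 hours, the value used in the cron job


def _base_time(day, hour, minute):
    """Nominal fire time on `day`, expressed in UTC."""
    return datetime(day.year, day.month, day.day, hour, minute, tzinfo=timezone.utc)


def fire_window(day, hour=10, minute=0, jitter=JITTER_SECONDS):
    """Return the (earliest, latest) possible fire times for `day`."""
    base = _base_time(day, hour, minute)
    delta = timedelta(seconds=jitter)
    return base - delta, base + delta


def jittered_fire_time(day, hour=10, minute=0, jitter=JITTER_SECONDS, rng=random):
    """Return the nominal time shifted by a uniform random offset
    drawn from [-jitter, +jitter] seconds (assumed symmetric)."""
    offset = rng.uniform(-jitter, jitter)
    return _base_time(day, hour, minute) + timedelta(seconds=offset)
```

&lt;p&gt;Under that symmetric assumption, &lt;code&gt;fire_window(date(2026, 5, 23))&lt;/code&gt; gives 06:00 and 14:00 UTC as the bounds, and every value from &lt;code&gt;jittered_fire_time&lt;/code&gt; falls between them.&lt;/p&gt;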



&lt;p&gt;The harder lesson: &lt;strong&gt;automated social posting on Meta platforms is fragile by design.&lt;/strong&gt; Meta does not want pure-broadcast bot accounts. A new account you create and immediately hook to a cron will get banned again, the same way. If social presence matters to a project, the human-run path is the only durable one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Numbers After 1 Month
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;83 users&lt;/li&gt;
&lt;li&gt;DAU/MAU ratio: ~9% (healthy benchmarks are 20%+ — retention is my real problem)&lt;/li&gt;
&lt;li&gt;Profile completion rate: 73.5% (onboarding works)&lt;/li&gt;
&lt;li&gt;Most-used feature: monthly forecast (high re-engagement, 7 users opened it 25 times in a week)&lt;/li&gt;
&lt;li&gt;Least-used feature: invite/referral (1 invite in 30 days; it turns out that shipping a referral mechanism in code is worth nothing if it isn't surfaced in the UI)&lt;/li&gt;
&lt;li&gt;Paid conversions: 0 (haven't pushed monetisation yet)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd Tell Past-Me
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Distribution is harder than the product.&lt;/strong&gt; I shipped the bot in 3 weeks. Getting people to use it is the actual work, and it's an entirely different skill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boring infrastructure decisions compound.&lt;/strong&gt; Sentinel cache, hallucination guard, dockerised stack with admin diagnostic commands — none of these are cool. All of them have saved hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't optimise for channels that hate you.&lt;/strong&gt; Meta's auto-poster ban is a feature, not a bug. Build for the channels where your behaviour is welcome.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bot is live and free: &lt;strong&gt;&lt;a href="https://t.me/CosmoCast_bot" rel="noopener noreferrer"&gt;t.me/CosmoCast_bot&lt;/a&gt;&lt;/strong&gt; — send your birth date, get the forecast.&lt;/p&gt;

&lt;p&gt;Happy to answer anything in the comments about the LLM cost architecture, the hallucination guard, the auto-poster setup, or the Meta-ban post-mortem.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
      <category>indiehackers</category>
    </item>
  </channel>
</rss>
