DEV Community: Yegor Shustyk

Prompt Caching Cut My Claude Bill by 70% — Here's the Exact Setup

Yegor Shustyk — Sat, 23 May 2026 16:57:45 +0000

I run a Claude-powered Telegram bot in production. Last 14 days: 905 API calls, $7.62 total. That's $0.0084 per call against a system prompt that's about 6,000 tokens. Without prompt caching, the same workload would have cost me roughly $25.

The Anthropic docs cover prompt caching at the spec level, but practical guidance on wiring it into a real Node app that makes hundreds of calls per day is scattered. Here's the exact setup running in production, plus the five gotchas that each cost me a day to figure out.

The problem

A typical Claude call from my bot looks like this:

System prompt: ~6,000 tokens. Big block of instructions: tone, response shape, framework lenses, formatting rules, language guidance, anti-pattern checklist.
Per-call dynamic context: ~500-2,000 tokens. User's memory card, recent entries, current message.
Reply: 200-800 tokens out.

Without caching, every call pays full price on those 6,000 system tokens. With ~900 calls in two weeks, that's 5.4M tokens just in system prompt repetition. At Claude Sonnet 4.5 input pricing ($3/MTok), that's $16+ on text the model has already seen.

The fix takes about 10 lines.

The setup

Anthropic's API accepts the system field as either a string (simple case) or an array of typed blocks (cache-aware case). To enable caching, you split the system field into static and dynamic pieces, and mark the static block with cache_control: { type: "ephemeral" }.

Here's the helper I use across every call:

// claude.js
export function withPromptCache(staticPrompt, dynamicSuffix = '') {
  const blocks = [
    { type: 'text', text: staticPrompt, cache_control: { type: 'ephemeral' } },
  ]
  if (dynamicSuffix) blocks.push({ type: 'text', text: dynamicSuffix })
  return blocks
}

And the call site:

const reply = await askClaude(
  withPromptCache(FREE_MESSAGE_PROMPT, userContextSuffix),
  userMessage,
  recentExchanges,
  { userId: user.id, callType: 'free', maxTokens: 2048 }
)

FREE_MESSAGE_PROMPT is the big static block. userContextSuffix is the small per-user dynamic part (memory card, recent entries). The dynamic part stays uncached — that's intentional and the right tradeoff.

Inside askClaude, the body sent to Anthropic is:

const body = JSON.stringify({
  model: 'claude-sonnet-4-5',
  max_tokens: 2048,
  system: systemPrompt,  // ← the array from withPromptCache()
  messages: [...contextMessages, { role: 'user', content: userMessage }],
})

That's it for setup. Now the interesting part: tracking whether it actually works.

Reading the three token counters

When caching is active, Anthropic returns three input counters instead of one. You have to track all three or you'll never know if caching is doing anything.

const usage         = data.usage
const input         = usage.input_tokens                ?? 0
const cacheCreated  = usage.cache_creation_input_tokens ?? 0
const cacheRead     = usage.cache_read_input_tokens     ?? 0
const output        = usage.output_tokens               ?? 0

// Cost math: cache writes are billed at 1.25x the normal input price,
// cache reads at 0.10x (90% off). INPUT_PRICE and OUTPUT_PRICE are per token.
const cost = input        * INPUT_PRICE
           + cacheCreated * INPUT_PRICE * 1.25
           + cacheRead    * INPUT_PRICE * 0.10
           + output       * OUTPUT_PRICE

What you want to see in your logs: cacheRead should dwarf cacheCreated. The first call in a cache window writes (1.25×), every subsequent call within ~5 minutes reads (0.10×). If cacheCreated is always equal to your static prompt size, the cache is never hitting.

I write all three counters to a token_usage table per call, so /admin can show effective spend and hit-rate over time.
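To make "hit rate" concrete, here is a minimal sketch of the aggregation, written in Python for brevity (the arithmetic ports directly to Node). The row shape, the function name summarize_usage, and the per-token prices are assumptions for illustration, not the bot's actual schema.

```python
# Hypothetical per-token prices in USD; substitute your model's real rates.
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000


def summarize_usage(rows):
    """Aggregate per-call usage rows into cost and cache hit rate.

    Each row is a dict holding the four counters from the API response.
    """
    uncached = sum(r["input_tokens"] for r in rows)
    created = sum(r["cache_creation_input_tokens"] for r in rows)
    read = sum(r["cache_read_input_tokens"] for r in rows)
    output = sum(r["output_tokens"] for r in rows)

    cost = (uncached * INPUT_PRICE
            + created * INPUT_PRICE * 1.25
            + read * INPUT_PRICE * 0.10
            + output * OUTPUT_PRICE)
    # What the same traffic would cost with every input token at full price.
    cost_without_cache = ((uncached + created + read) * INPUT_PRICE
                          + output * OUTPUT_PRICE)
    cacheable = created + read
    hit_rate = read / cacheable if cacheable else 0.0
    return {"cost": cost,
            "cost_without_cache": cost_without_cache,
            "hit_rate": hit_rate}


rows = [
    # First call in the window writes the cache.
    {"input_tokens": 1000, "cache_creation_input_tokens": 6000,
     "cache_read_input_tokens": 0, "output_tokens": 500},
    # Later calls inside the TTL read it.
    {"input_tokens": 1000, "cache_creation_input_tokens": 0,
     "cache_read_input_tokens": 6000, "output_tokens": 500},
    {"input_tokens": 1000, "cache_creation_input_tokens": 0,
     "cache_read_input_tokens": 6000, "output_tokens": 500},
]
summary = summarize_usage(rows)
print(summary)
```

A hit rate that stays near zero while cache_creation_input_tokens keeps matching the static prompt size is the "never hitting" signature described above.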

The 5 gotchas

1. Minimum token threshold (silent failure mode)

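At the time of writing, Anthropic requires the cached prefix to be at least 1024 tokens on Sonnet/Opus and 2048 on Haiku. The minimum varies by model version, so check the current docs for the model you call. Below the minimum, the cache_control field is silently ignored. No error, no warning. You'll just see cacheRead: 0 forever and wonder why.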

If you're caching a small system prompt, you have two options: pad it with relevant context until it crosses the threshold, or accept that caching doesn't apply at your scale.

2. The 5-minute TTL

The default cache lifetime is about 5 minutes, and each cache hit refreshes it. This matters more than people realize when planning where to apply caching:

Active chat sessions (user-bot back-and-forth) — every turn within the session hits the cache. Huge win.
Cron loops (e.g. nightly job that hits Claude per user) — if your loop processes one user every 10 seconds, the cache stays warm across the whole loop. Also a win.
Sparse one-off calls (one insight request per day per user) — these always miss. You'll pay the 1.25× cache-creation penalty for nothing. Skip caching here.
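The three traffic patterns above can be checked with a small simulation before touching production. This is a sketch in Python under one stated assumption: the cache entry expires ttl_seconds after its last use, so every read refreshes it. The function name simulate_cache and the sample schedules are made up for illustration.

```python
def simulate_cache(call_times, ttl_seconds=300):
    """Count cache writes and reads for calls at the given times (seconds).

    Assumes the entry expires ttl_seconds after its last use, i.e. every
    read refreshes the TTL.
    """
    writes = reads = 0
    expires_at = None
    for t in sorted(call_times):
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still warm
        else:
            writes += 1  # cold start or expired entry
        expires_at = t + ttl_seconds
    return writes, reads


# Cron loop: 50 users, one call every 10 seconds.
cron = simulate_cache([i * 10 for i in range(50)])
# Sparse traffic: one call per day for a week.
sparse = simulate_cache([day * 86_400 for day in range(7)])
print(cron, sparse)
```

Feeding it real call timestamps from your logs, grouped by call type, gives a rough forecast of which call types are worth wrapping.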

3. Separate static from dynamic at the right line

Putting the wrong content in the cached block invalidates the cache constantly. The rule:

Cached block = bytes that are identical across the call pattern you're optimizing.

For my bot, that means the cached block contains the system prompt and nothing else. The user's memory card, recent entries, and current question all go into the dynamic suffix (unmarked, uncached). If I put the memory card into the cached block, every user would invalidate the cache for every other user.

4. The cache key is the exact content

The cache key is derived from the exact content of the prompt up to and including the cached block. Even a one-character whitespace change produces a miss. This bites you if you do something like:

// BAD: interpolating user.language gives every language its own cache key
const prompt = `${BASE_PROMPT}\n\nUser language: ${user.language}`

The user.language interpolation makes the "static" block per-user. Either move it to the dynamic suffix, or accept multiple cache entries (one per language).
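One cheap way to catch this is to log a fingerprint of the static block next to the token counters. The sketch below is in Python and hashes locally. It says nothing about how Anthropic computes its internal key. It only tells you whether the bytes you send are identical from call to call. The name prompt_fingerprint and the stand-in prompt text are assumptions.

```python
import hashlib


def prompt_fingerprint(static_prompt: str) -> str:
    """Short SHA-256 fingerprint of the exact bytes sent as the cached block."""
    return hashlib.sha256(static_prompt.encode("utf-8")).hexdigest()[:12]


BASE_PROMPT = "You are a reflective journaling assistant."  # stand-in text

# A truly static block yields one fingerprint no matter how often it is sent.
stable = {prompt_fingerprint(BASE_PROMPT) for _ in range(3)}
# Interpolating the language yields one fingerprint per language.
per_language = {
    prompt_fingerprint(f"{BASE_PROMPT}\n\nUser language: {lang}")
    for lang in ("en", "ru", "de")
}
print(len(stable), len(per_language))
```

If the number of distinct fingerprints in your logs grows with the number of users, the "static" block is not static.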

5. Cache costs +25% the first time

The first call after a cache miss pays 1.25× normal input price to write the cache. If no read follows within the TTL, that extra 25% is wasted.

Rough rule: a write costs an extra 0.25× and each read saves 0.90×, so on the multipliers alone a single cache hit per write already pays for itself. If most writes get no hits at all, just send the system field as a plain string and skip the wrapper.
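The break-even arithmetic for the cached block alone can be sketched as follows, in Python and using only the 1.25× and 0.10× multipliers quoted above. The function name relative_cost is illustrative, and the model ignores the uncached suffix and output tokens, which cost the same either way.

```python
WRITE_MULTIPLIER = 1.25  # cache write, relative to normal input price
READ_MULTIPLIER = 0.10   # cache read, relative to normal input price


def relative_cost(reads_per_write: float) -> float:
    """Cost of the cached block with caching, divided by its cost without.

    One window = 1 write followed by `reads_per_write` reads.
    Values below 1.0 mean caching is cheaper.
    """
    calls = 1 + reads_per_write
    with_cache = WRITE_MULTIPLIER + reads_per_write * READ_MULTIPLIER
    return with_cache / calls


for reads in (0, 1, 3, 10):
    print(reads, round(relative_cost(reads), 3))
```

With these multipliers the ratio is above 1.0 only when a write gets no reads at all. More reads per write push it toward 0.10.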

Real numbers from the project

Last 14 days, broken down by call type (sorted by spend):

free (chat)                 109 calls  $1.92
memory_card_midweek          41 calls  $1.43
evening                      46 calls  $0.76
morning_ack                  46 calls  $0.55
morning                     159 calls  $0.55
evening_opener              193 calls  $0.53
memory_card                  19 calls  $0.49
weekly_summary               19 calls  $0.34
...
─────────────────────────────────────
TOTAL                       905 calls  $7.62

Average $0.0084 per call on Sonnet 4.5 at ~6k input + 500 output. Without caching, this would land at roughly $0.025/call — about 3× more. Across 905 calls, that's the difference between $7 and $25 for the same work.

The win compounds as the bot scales. Doubling users doesn't double the cost — most additional traffic hits warm caches.

When NOT to use prompt caching

I want to be specific because the docs gloss over this:

Sparse, one-off calls where a write is rarely followed by a read inside the 5-minute window. You pay the 1.25× write premium and never collect the read savings.
Per-user prompts where the "static" block is actually per-user. You'll write a fresh cache for every user; pay the penalty, get no benefit.
Below the token threshold (1024 Sonnet / 2048 Haiku). Caching silently doesn't apply. Don't bother wrapping in withPromptCache — just save the indirection.
During development when you're iterating on the prompt. Every prompt edit invalidates the cache, so the savings show up only after the prompt is stable.

What this powers

This caching setup runs a Telegram-based self-reflection bot called Wise Insights — daily morning and evening check-ins, weekly summaries, memory layer that learns user patterns over time. It's live at wise.synergize.digital if you want to see what's running on top of all this token plumbing.

Happy to share more of the architecture (Supabase + grammy + node-cron, plus how I handle the memory layer without vector embeddings) if there's interest — drop questions in the comments.

The main lesson: prompt caching is one of those features that looks like a 10% optimization and turns out to be a 70% one, but only if your traffic pattern fits. Measure the three counters, watch the hit rate, and don't wrap it where it won't help.

How I Built a Multi-System Astrology Bot in Python (And What Meta Banned Me For)

Yegor Shustyk — Sat, 23 May 2026 16:50:11 +0000

Every horoscope app reduces you to 1 of 12 sun signs. Real astrologers don't work like that — they cross-reference Western astrology, Vedic (Jyotish), Chinese Ba Zi, numerology, Human Design, and more. So I built a Telegram bot that does the same: one daily forecast synthesized from 13 systems, based on your full birth date.

It's been live for ~1 month. Small still — 83 users — but I want to share the parts that actually taught me something.

The Architecture: Why Combining 13 Systems Is a Data Problem, Not an Astrology Problem

Each system is a separate calculator. Western astrology needs ecliptic longitudes (I use Skyfield + NASA ephemeris). Vedic needs tithi (lunar day, 1-30) and nakshatra (27 lunar mansions). Ba Zi needs solar-term boundaries to assign the day-pillar element. Numerology needs digit-reductions with master-number exceptions (11, 22, 33 don't reduce before arithmetic).
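As an illustration of the master-number rule, here is a minimal Python sketch of digit reduction. It follows one common numerology convention and is not necessarily how the bot's calculator is written. The names MASTER_NUMBERS and reduce_number are assumptions.

```python
MASTER_NUMBERS = {11, 22, 33}


def reduce_number(n: int) -> int:
    """Sum the digits of n repeatedly until one digit or a master number remains."""
    while n > 9 and n not in MASTER_NUMBERS:
        n = sum(int(d) for d in str(n))
    return n


print(reduce_number(1990))  # 1+9+9+0 = 19 -> 1+9 = 10 -> 1+0 = 1
print(reduce_number(29))    # 2+9 = 11, a master number, so it stops here
print(reduce_number(48))    # 4+8 = 12 -> 1+2 = 3
```

The check sits inside the loop condition so that a master number reached mid-reduction is kept rather than collapsed to a single digit.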

Each one is finicky in its own way. Combine them and you get an interesting failure mode: latent bugs that wait for the calendar.

My favourite: a lunar-day translation table had 5 entries, but _tithi_group(30) returned index 5 (Amavasya / new moon). The bug sat dormant for weeks. Then a new moon arrived:

day_label = _TITHI_DAY_LABEL[lang][group_idx]
# IndexError: list index out of range

Content generation crashed for all three languages. The bot's startup also called ensure_content(today), so it entered a crash-loop. I learned two things that day:

Latent bugs wait for the calendar. Any code path that runs only on specific astronomical events needs explicit tests at those boundary conditions.
Startup hooks shouldn't crash the process. Wrap them in try/except so the bot stays alive and the admin can still introspect via diagnostic commands.
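Both lessons can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the label table, the tithi_group mapping, and the run_startup_hook helper are written for illustration and are synchronous for simplicity, so they are not the bot's real code.

```python
import logging

log = logging.getLogger("bot.startup")

# Hypothetical label table with one slot per group, including the new moon.
TITHI_DAY_LABELS = ["g0", "g1", "g2", "g3", "g4", "new-moon"]


def tithi_group(tithi: int) -> int:
    """Map a lunar day (1-30) to a label index; the new moon gets its own slot."""
    if not 1 <= tithi <= 30:
        raise ValueError(f"tithi out of range: {tithi}")
    return 5 if tithi == 30 else (tithi - 1) % 5


def check_all_tithis_have_labels() -> None:
    """Boundary test: every possible lunar day must resolve to a label."""
    for tithi in range(1, 31):
        idx = tithi_group(tithi)
        assert 0 <= idx < len(TITHI_DAY_LABELS), f"no label for tithi {tithi}"


def run_startup_hook(hook, *args) -> bool:
    """Run a startup hook; log and swallow failures so the process stays up."""
    try:
        hook(*args)
        return True
    except Exception:
        log.exception("startup hook %s failed", getattr(hook, "__name__", hook))
        return False


def broken_hook(day):
    """Simulates the new-moon crash."""
    raise IndexError("list index out of range")


check_all_tithis_have_labels()
started = run_startup_hook(broken_hook, "2026-05-16")
print(started)
```

Looping over the full 1-30 range costs nothing and would have caught the missing sixth label weeks before the new moon did.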

LLM Cost Architecture: One Sentinel Saved 99% of the Bill

The bot rewrites raw template output into warm conversational language using Gemini. Daily, monthly, yearly forecasts. With per-user rewriting, costs scale linearly with users — bad.

But the general forecast (the morning broadcast everyone receives) is identical for every user. So I use a sentinel pattern: user_id=0 means "shared cache row". The first user to trigger the daily LLM rewrite warms the cache; everyone else reads from it.

async def get_cached(session, user_id, date, lang, content_type):
    row = await session.get(LLMOutputCache,
                            (user_id, content_type, 0, date, lang))
    return row.text if row else None

This is a 5-line idea, but it cut my LLM bill from "uncomfortable" to "barely noticeable." Pre-warm cron at 03:00 UTC fills the cache before anyone wakes up.

The Hallucination Guard

Gemini is happy to invent astrological facts that aren't in your seed. The seed mentions the Moon; the rewrite confidently introduces Venus. For an astrology bot, that's a catastrophe — users trust the output.

My guard tokenises both texts and rejects the rewrite if any new planet name appears in the output that wasn't in the input. Sign names are tolerated (LLM often adds "the Scorpio Moon" as natural metaphor — that's fine), but actual planet additions = reject and fall back to Groq, then to plain template.

new_planets = _extract_astro_tokens(rewritten) \
            - _extract_astro_tokens(original)
new_planets &= _PLANET_TOKENS
if new_planets:
    log.warning("hallucination guard fired: %s", new_planets)
    return None  # fall back

About 2-3% of Gemini outputs trigger it. The bot silently falls back; the user never sees garbage.

Auto-posting: Single Source of Truth

I publish the same daily forecast to Telegram channel, Instagram (carousel of 4 PNG slides), and Threads. Three different formats, three different APIs, one piece of source content.

Key insight: share the cached LLM rewrite across surfaces. The IG caption pulls from llm_output_cache for user_id=0. Threads' main post pulls from the same cache and crops at the nearest sentence boundary under 500 chars. Zero extra LLM cost; one truth.

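main_text = await get_cached(session, 0, today, lang, CONTENT_TYPE_DAILY)
if main_text and len(main_text) > 500:
    head = main_text[:500]
    # Latest sentence boundary inside the 500-char window (-1 if none).
    idx = max(head.rfind(sep) for sep in (". ", "! ", "? "))
    # Crop there if that keeps at least 200 chars; otherwise hard-crop at 500.
    main_text = head[:idx + 1].rstrip() if idx >= 200 else head.rstrip()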

The IG slide renderer uses a separate Gemini call with response_mime_type=application/json to enforce tight character budgets (the slides have visual constraints that the PNG renderer must respect). That's one LLM call per language per day, cached for 24h in Redis.

The Meta Ban (Or: What I Did Wrong)

Here's the part I'd undo. I had:

Per-post engagement-bait on every Threads/IG post: "leave a reaction, share with someone" — identical wording every day.
Daily 5-post self-reply chains (main post + numerology reply + Ba Zi reply + Jyotish reply + CTA-with-link reply).
Machine-perfect timing: 04:02 UTC ±0 every single day.

Each of these is a textbook spam signal. The combination — automated bot posting, identical engagement-bait, daily self-reply chains with outbound links — is exactly what Meta's integrity systems are designed to penalise.

The English account was disabled outright: "We've reviewed your account and found that it doesn't follow our Community Standards on account integrity." The Russian one survived but was shadow-restricted (posts publish via API but the account vanishes from search/profiles).

The de-spam was straightforward in code:

Dropped per-post engagement-bait, kept only a soft "link in bio" CTA
Cut the 5-post chain to a single forecast post
Added jitter=14400 seconds (±4h) to the cron so the post lands at varying times each day

scheduler.add_job(
    send_threads_post,
    trigger="cron",
    hour=10, minute=0,
    jitter=14400,  # ±4h — fires anywhere in 06:00-14:00 UTC daily
    id="threads_post",
)

The harder lesson: automated social posting on Meta platforms is fragile by design. Meta does not want pure-broadcast bot accounts. A new account you create and immediately hook to a cron will get banned again, the same way. If social presence matters to a project, the human-run path is the only durable one.

Honest Numbers After 1 Month

83 users
DAU/MAU ratio: ~9% (healthy benchmarks are 20%+ — retention is my real problem)
Profile completion rate: 73.5% (onboarding works)
Most-used feature: monthly forecast (high re-engagement, 7 users opened it 25 times in a week)
Least-used feature: invite/referral (1 invite in 30 days — turns out shipping a referral mechanism in code is nothing if it's not surfaced in the UI)
Paid conversions: 0 (haven't pushed monetisation yet)

What I'd Tell Past-Me

Distribution is harder than the product. I shipped the bot in 3 weeks. Getting people to use it is the actual work, and it's an entirely different skill.
Boring infrastructure decisions compound. Sentinel cache, hallucination guard, dockerised stack with admin diagnostic commands — none of these are cool. All of them have saved hours.
Don't optimise for channels that hate you. Meta's auto-poster ban is a feature, not a bug. Build for the channels where your behaviour is welcome.

The bot is live and free: t.me/CosmoCast_bot — send your birth date, get the forecast.

Happy to answer anything in the comments about the LLM cost architecture, the hallucination guard, the auto-poster setup, or the Meta-ban post-mortem.