Every tutorial about building AI chatbots reaches for the same starter pack: vector database, embeddings model, similarity search, RAG. I did too. Then I ran the numbers on prompt caching and threw most of it out.
Here's what happened.
The setup
I'm building a customer support bot for a B2C SaaS product. It hooks into Crisp (live chat), reads incoming customer messages, looks up answers in a knowledge base, and replies — escalating to a human when it can't help.
The stack:
- Bun server (one file, ~300 lines)
- Claude Sonnet 4.6 for the LLM
- Supabase pgvector for vector search (initially)
- OpenAI text-embedding-3-small for query embeddings
The knowledge base: 16 markdown articles covering account, billing, and technical topics. ~250 tokens per article, ~4,000 tokens total.
The "obvious" architecture: RAG
Standard playbook:
- Chunk each article into ~500-token pieces
- Embed each chunk with OpenAI
- Store in Supabase pgvector
- On each customer message: embed the message, do a similarity search, retrieve top 3 chunks, inject into the LLM prompt
Per request:
- 1 OpenAI embedding call (~200 ms)
- 1 Supabase RPC (~150 ms)
- 1 Claude API call (~800 ms)
- ~1,150 ms before the customer sees a single character
It worked. But it kept feeling over-engineered for what is fundamentally 4,000 tokens of static text.
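For reference, the retrieval step in RAG mode looked roughly like this. This is a sketch rather than my exact code: the env var names are illustrative, match_documents is the RPC name from Supabase's pgvector guide, and the threshold value is just an example.

```ts
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(Bun.env.SUPABASE_URL!, Bun.env.SUPABASE_SERVICE_KEY!);

// Embed the customer message, then pull the 3 most similar KB chunks from pgvector.
async function searchKB(query: string): Promise<string> {
  // ~200 ms: one round-trip to OpenAI for the query embedding
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // ~150 ms: one Supabase RPC doing a cosine-similarity search over the chunks
  const { data: chunks, error } = await supabase.rpc("match_documents", {
    query_embedding: embedding.data[0].embedding,
    match_threshold: 0.3, // the knob that needs tuning
    match_count: 3,
  });
  if (error) throw error;

  return (chunks ?? []).map((c: { content: string }) => c.content).join("\n\n");
}
```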
The thing I kept circling back to
Anthropic's prompt caching. The deal:
- Pay 1.25× the normal input rate to write a prefix into cache
- Pay 0.1× (90% off) for every subsequent read within 5 minutes
- The TTL refreshes on every read — so an active session keeps the cache warm indefinitely
Minimum cacheable block on Sonnet 4.6: 1,024 tokens. My KB is 4,000+. Comfortably above.
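To make those multipliers concrete, assuming Sonnet's standard $3-per-million-token input rate: writing a 4,000-token prefix into cache costs about $0.015 once, each cached read of it costs about $0.0012, and resending it uncached would cost about $0.012 per call.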
What if I just stuffed the whole KB into the system prompt and cached it?
Conventional wisdom says no — too many tokens per call. But the math with caching is different than the math without.
Running the numbers
Scenario: 10 customers per day, 10 messages each. 100 messages across 10 sessions.
RAG (the original setup):
- 1,380 input tokens per call (system + retrieved chunks + history + user msg)
- 0% cache hit rate — the system prompt is below the 1,024 cache minimum
- ~$17/month total
All-in-context with prompt caching:
- 4,330 input tokens per call (system + entire KB + history + user msg)
- ~90% cache hit rate after the first message of each session
- ~$18/month total
About a buck different. The interesting part isn't the cost — it's everything else.
The trade-off table
| | RAG | All-in-context + cache |
|---|---|---|
| Cost (100 msgs/day) | $17/mo | $18/mo |
| First-token latency (cache hit) | ~1,150 ms | ~700 ms |
| Lines of code | ~250 | ~150 |
| External services | Supabase + OpenAI + Anthropic | Anthropic only |
| Retrieval failures | Possible (threshold tuning hell) | Impossible — KB always visible |
| Cross-article reasoning | Limited to top-K chunks | Sees the whole KB |
| Scales to 100 articles | Yes | Yes |
| Scales to 10,000 articles | Yes | No — context window limit |
For my use case — small KB, conversational chat, frequent sessions — all-in-context is faster, simpler, and handles harder cross-article questions better. The cost difference is in the noise.
The implementation
I made it switchable via one env var so I could A/B compare:
```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const KB_MODE = Bun.env.KB_MODE === "inline" ? "inline" : "rag";
const PROMPT_FILE = KB_MODE === "inline" ? "CRISP-inline.md" : "CRISP-rag.md";
const SYSTEM_PROMPT = (await Bun.file(PROMPT_FILE).text()).trim();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [{
    type: "text",
    text: SYSTEM_PROMPT,
    cache_control: { type: "ephemeral" }, // ← the magic line
  }],
  messages, // the accumulated conversation history for this session
});
```
CRISP-rag.md is just the persona/rules (~600 tokens — too small to cache).
CRISP-inline.md is persona + entire KB baked in (~2,500 tokens — caches happily).
In inline mode, searchKB() is never called. No embedding round-trip, no Supabase query. The KB is sitting in the system prompt, cached on Anthropic's side, ready to be reused for every subsequent message.
The cache logging proves it:
```
[claude] response in 1432ms (in: 47, out: 92, cache_create: 2447, cache_read: 0)  ← 1st msg
[claude] response in 712ms (in: 47, out: 88, cache_create: 0, cache_read: 2447)   ← cached
[claude] response in 689ms (in: 47, out: 95, cache_create: 0, cache_read: 2447)   ← cached
```
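That log line is just the per-call usage block the API returns. A minimal sketch (elapsedMs is whatever timer you wrap around the call):

```ts
// response.usage reports cache activity for every call.
const { input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens } =
  response.usage;

console.log(
  `[claude] response in ${elapsedMs}ms ` +
  `(in: ${input_tokens}, out: ${output_tokens}, ` +
  `cache_create: ${cache_creation_input_tokens ?? 0}, cache_read: ${cache_read_input_tokens ?? 0})`
);
```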
After the first message, every reply is ~50% faster and 90% cheaper on input.
What I learned
RAG isn't free. It adds two API hops, an embedding model, a vector database, chunking logic, threshold tuning, and an entire class of "the right chunk wasn't retrieved" failure modes. It's the right answer above some KB size — but that size is way bigger than most tutorials assume.
Prompt caching changes the break-even point. Without caching, stuffing 4,000 tokens into every request is wasteful. With caching, it's nearly free after the first call. The 1,024-token minimum is the only real gate.
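Another way to see it: at the 0.1× read rate, a cached 4,000-token KB costs the same per message as ~400 uncached tokens, which is less than a single 500-token retrieved chunk in the RAG setup.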
A rough heuristic:
| KB size | Recommendation |
|---|---|
| < 50k tokens | Start with all-in-context + caching. You probably don't need a vector DB. |
| 50k–200k tokens | Hybrid — cache a "core" set of always-relevant content, RAG the long tail. |
| > 200k tokens | RAG is mandatory (context window limit). |
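For the hybrid row, the shape would be something like this: cache the stable core as the first system block and let RAG append the long-tail chunks after it, uncached. A sketch only, where CORE_PROMPT and userMessage are placeholders and searchKB() is the retrieval helper from the RAG setup:

```ts
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    // Stable prefix: persona + always-relevant core content, cached
    { type: "text", text: CORE_PROMPT, cache_control: { type: "ephemeral" } },
    // Retrieved long-tail chunks vary per query, so they sit after the
    // cache breakpoint and are billed as normal input
    { type: "text", text: await searchKB(userMessage) },
  ],
  messages,
});
```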
For the typical "I have a few dozen markdown files" scenario, you almost certainly don't need a vector database.
Caveats
This isn't a universal "RAG is dead" take. RAG still wins when:
- KB is genuinely large (thousands of articles)
- KB updates constantly (every edit invalidates every cache)
- Different customers need different KB subsets (caching is org-scoped, not user-scoped)
- You need precise per-chunk attribution
But for a small product with a small KB? Reach for prompt caching first. It's a one-line change (cache_control: { type: "ephemeral" }) with measurable wins, and you can always add RAG later when your KB grows into it.
The most useful thing I did was build a switch. Don't make this a religious choice — measure both for your specific traffic shape and let the numbers decide.