Every tutorial about building AI chatbots reaches for the same starter pack: vector database, embeddings model, similarity search, RAG. I did too. Then I ran the numbers on prompt caching and threw most of it out.
Here's what happened.
The setup
I'm building a customer support bot for a B2C SaaS product. It hooks into Crisp (live chat), reads incoming customer messages, looks up answers in a knowledge base, and replies — escalating to a human when it can't help.
The stack:
- Bun server (one file, ~300 lines)
- Claude Sonnet 4.6 for the LLM
- Supabase pgvector for vector search (initially)
- OpenAI text-embedding-3-small for query embeddings
The knowledge base: 16 markdown articles covering account, billing, and technical topics. ~250 tokens per article, ~4,000 tokens total.
The "obvious" architecture: RAG
Standard playbook:
- Chunk each article into ~500-token pieces
- Embed each chunk with OpenAI
- Store in Supabase pgvector
- On each customer message: embed the message, do a similarity search, retrieve top 3 chunks, inject into the LLM prompt
Per request:
- 1 OpenAI embedding call (~200 ms)
- 1 Supabase RPC (~150 ms)
- 1 Claude API call (~800 ms)
- ~1,150 ms before the customer sees a single character
It worked. But it kept feeling over-engineered for what is fundamentally 4,000 tokens of static text.
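For reference, the retrieval step in RAG mode looked roughly like this. This is a sketch rather than my exact code: the env var names are illustrative, match_documents is the RPC name from Supabase's pgvector guide, and the threshold value is just an example.

```ts
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(Bun.env.SUPABASE_URL!, Bun.env.SUPABASE_SERVICE_KEY!);

// Embed the customer message, then pull the 3 most similar KB chunks from pgvector.
async function searchKB(query: string): Promise<string> {
  // ~200 ms: one round-trip to OpenAI for the query embedding
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // ~150 ms: one Supabase RPC doing a cosine-similarity search over the chunks
  const { data: chunks, error } = await supabase.rpc("match_documents", {
    query_embedding: embedding.data[0].embedding,
    match_threshold: 0.3, // the knob that needs tuning
    match_count: 3,
  });
  if (error) throw error;

  return (chunks ?? []).map((c: { content: string }) => c.content).join("\n\n");
}
```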
The thing I kept circling back to
Anthropic's prompt caching. The deal:
- Pay 1.25× the normal input rate to write a prefix into cache
- Pay 0.1× (90% off) for every subsequent read within 5 minutes
- The TTL refreshes on every read — so an active session keeps the cache warm indefinitely
Minimum cacheable block on Sonnet 4.6: 1,024 tokens. My KB is 4,000+. Comfortably above.
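To make those multipliers concrete, assuming Sonnet's standard $3-per-million-token input rate: writing a 4,000-token prefix into cache costs about $0.015 once, each cached read of it costs about $0.0012, and resending it uncached would cost about $0.012 per call.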
What if I just stuffed the whole KB into the system prompt and cached it?
Conventional wisdom says no — too many tokens per call. But the math with caching is different than the math without.
Running the numbers
Scenario: 10 customers per day, 10 messages each. 100 messages across 10 sessions.
RAG (the original setup):
- 1,380 input tokens per call (system + retrieved chunks + history + user msg)
- 0% cache hit rate — the system prompt is below the 1,024 cache minimum
- ~$17/month total
All-in-context with prompt caching:
- 4,330 input tokens per call (system + entire KB + history + user msg)
- ~90% cache hit rate after the first message of each session
- ~$18/month total
About a buck different. The interesting part isn't the cost — it's everything else.
The trade-off table
| | RAG | All-in-context + cache |
|---|---|---|
| Cost (100 msgs/day) | $17/mo | $18/mo |
| First-token latency (cache hit) | ~1,150 ms | ~700 ms |
| Lines of code | ~250 | ~150 |
| External services | Supabase + OpenAI + Anthropic | Anthropic only |
| Retrieval failures | Possible (threshold tuning hell) | Impossible — KB always visible |
| Cross-article reasoning | Limited to top-K chunks | Sees the whole KB |
| Scales to 100 articles | Yes | Yes |
| Scales to 10,000 articles | Yes | No — context window limit |
For my use case — small KB, conversational chat, frequent sessions — all-in-context is faster, simpler, and handles harder cross-article questions better. The cost difference is in the noise.
The implementation
I made it switchable via one env var so I could A/B compare:
```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const KB_MODE = Bun.env.KB_MODE === "inline" ? "inline" : "rag";
const PROMPT_FILE = KB_MODE === "inline" ? "CRISP-inline.md" : "CRISP-rag.md";
const SYSTEM_PROMPT = (await Bun.file(PROMPT_FILE).text()).trim();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [{
    type: "text",
    text: SYSTEM_PROMPT,
    cache_control: { type: "ephemeral" }, // ← the magic line
  }],
  messages, // the accumulated conversation history for this session
});
```
CRISP-rag.md is just the persona/rules (~600 tokens — too small to cache).
CRISP-inline.md is persona + entire KB baked in (~2,500 tokens — caches happily).
In inline mode, searchKB() is never called. No embedding round-trip, no Supabase query. The KB is sitting in the system prompt, cached on Anthropic's side, ready to be reused for every subsequent message.
The cache logging proves it:
```
[claude] response in 1432ms (in: 47, out: 92, cache_create: 2447, cache_read: 0)  ← 1st msg
[claude] response in 712ms (in: 47, out: 88, cache_create: 0, cache_read: 2447)   ← cached
[claude] response in 689ms (in: 47, out: 95, cache_create: 0, cache_read: 2447)   ← cached
```
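That log line is just the per-call usage block the API returns. A minimal sketch (elapsedMs is whatever timer you wrap around the call):

```ts
// response.usage reports cache activity for every call.
const { input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens } =
  response.usage;

console.log(
  `[claude] response in ${elapsedMs}ms ` +
  `(in: ${input_tokens}, out: ${output_tokens}, ` +
  `cache_create: ${cache_creation_input_tokens ?? 0}, cache_read: ${cache_read_input_tokens ?? 0})`
);
```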
After the first message, every reply is ~50% faster and 90% cheaper on input.
What I learned
RAG isn't free. It adds two API hops, an embedding model, a vector database, chunking logic, threshold tuning, and an entire class of "the right chunk wasn't retrieved" failure modes. It's the right answer above some KB size — but that size is way bigger than most tutorials assume.
Prompt caching changes the break-even point. Without caching, stuffing 4,000 tokens into every request is wasteful. With caching, it's nearly free after the first call. The 1,024-token minimum is the only real gate.
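Another way to see it: at the 0.1× read rate, a cached 4,000-token KB costs the same per message as ~400 uncached tokens, which is less than a single 500-token retrieved chunk in the RAG setup.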
A rough heuristic:
| KB size | Recommendation |
|---|---|
| < 50k tokens | Start with all-in-context + caching. You probably don't need a vector DB. |
| 50k–200k tokens | Hybrid — cache a "core" set of always-relevant content, RAG the long tail. |
| > 200k tokens | RAG is mandatory (context window limit). |
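For the hybrid row, the shape would be something like this: cache the stable core as the first system block and let RAG append the long-tail chunks after it, uncached. A sketch only, where CORE_PROMPT and userMessage are placeholders and searchKB() is the retrieval helper from the RAG setup:

```ts
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    // Stable prefix: persona + always-relevant core content, cached
    { type: "text", text: CORE_PROMPT, cache_control: { type: "ephemeral" } },
    // Retrieved long-tail chunks vary per query, so they sit after the
    // cache breakpoint and are billed as normal input
    { type: "text", text: await searchKB(userMessage) },
  ],
  messages,
});
```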
For the typical "I have a few dozen markdown files" scenario, you almost certainly don't need a vector database.
Caveats
This isn't a universal "RAG is dead" take. RAG still wins when:
- KB is genuinely large (thousands of articles)
- KB updates constantly (every edit invalidates every cache)
- Different customers need different KB subsets (caching is org-scoped, not user-scoped)
- You need precise per-chunk attribution
But for a small product with a small KB? Reach for prompt caching first. It's a one-line change (cache_control: { type: "ephemeral" }) with measurable wins, and you can always add RAG later when your KB grows into it.
The most useful thing I did was build a switch. Don't make this a religious choice — measure both for your specific traffic shape and let the numbers decide.