When NOT to use RAG (lessons from building a Claude-powered support bot)

Every tutorial about building AI chatbots reaches for the same starter pack: vector database, embeddings model, similarity search, RAG. I did too. Then I ran the numbers on prompt caching and threw most of it out.

Here's what happened.

The setup

I'm building a customer support bot for a B2C SaaS product. It hooks into Crisp (live chat), reads incoming customer messages, looks up answers in a knowledge base, and replies — escalating to a human when it can't help.

The stack:

  • Bun server (one file, ~300 lines)
  • Claude Sonnet 4.6 for the LLM
  • Supabase pgvector for vector search (initially)
  • OpenAI text-embedding-3-small for query embeddings

The knowledge base: 16 markdown articles covering account, billing, and technical topics. ~250 tokens per article, ~4,000 tokens total.
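The overall server shape is tiny. A minimal sketch of it, assuming Crisp is configured to POST message events to a /webhook route — the route, payload shape, and handleCustomerMessage() here are illustrative placeholders, not the real Crisp schema:

```ts
// Minimal Bun server sketch: receive a Crisp webhook, answer or escalate.
// The route and payload handling are placeholders for the sketch.
async function handleCustomerMessage(event: unknown) {
  // look up an answer in the KB, reply through the Crisp API, or hand off to a human
}

Bun.serve({
  port: 3000,
  async fetch(req) {
    const url = new URL(req.url);
    if (req.method === "POST" && url.pathname === "/webhook") {
      await handleCustomerMessage(await req.json());
      return new Response("ok");
    }
    return new Response("not found", { status: 404 });
  },
});
```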

The "obvious" architecture: RAG

Standard playbook:

  1. Chunk each article into ~500-token pieces
  2. Embed each chunk with OpenAI
  3. Store in Supabase pgvector
  4. On each customer message: embed the message, do a similarity search, retrieve top 3 chunks, inject into the LLM prompt

Per request:

  • 1 OpenAI embedding call (~200 ms)
  • 1 Supabase RPC (~150 ms)
  • 1 Claude API call (~800 ms)
  • ~1,150 ms before the customer sees a single character
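Concretely, the per-message retrieval path (step 4) looks roughly like this sketch. The Supabase RPC name and its row shape are hypothetical — they stand in for whatever pgvector similarity function you define in Postgres:

```ts
// Sketch of the RAG retrieval path: embed the query, similarity-search pgvector,
// return the top chunks as prompt context. RPC name/columns are hypothetical.
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(Bun.env.SUPABASE_URL!, Bun.env.SUPABASE_KEY!);

async function searchKB(query: string): Promise<string> {
  // 1. Embed the customer message (~200 ms round-trip)
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // 2. Similarity search in pgvector (~150 ms round-trip)
  const { data, error } = await supabase.rpc("match_kb_chunks", {
    query_embedding: emb.data[0].embedding,
    match_count: 3,
  });
  if (error || !data?.length) return ""; // nothing relevant → the bot escalates

  // 3. The retrieved chunks get injected into the LLM prompt
  return data.map((row: { content: string }) => row.content).join("\n\n---\n\n");
}
```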

It worked. But it kept feeling over-engineered for what is fundamentally 4,000 tokens of static text.

The thing I kept circling back to

Anthropic's prompt caching. The deal:

  • Pay 1.25× the normal input rate to write a prefix into cache
  • Pay 0.1× (90% off) for every subsequent read within 5 minutes
  • The TTL refreshes on every read — so an active session keeps the cache warm indefinitely

Minimum cacheable block on Sonnet 4.6: 1,024 tokens. My KB is 4,000+. Comfortably above.

What if I just stuffed the whole KB into the system prompt and cached it?

Conventional wisdom says no — too many tokens per call. But the math with caching is different from the math without.
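"Stuffing the whole KB in" is not sophisticated. One way to do it is to concatenate the articles into the system prompt at startup; a sketch, assuming the articles sit in a kb/ directory and the persona/rules in persona.md (both hypothetical paths):

```ts
// Build an all-in-context system prompt by concatenating every KB article.
// Directory and file names here are assumptions for the sketch.
import { readdir } from "node:fs/promises";

const files = (await readdir("kb")).filter((f) => f.endsWith(".md")).sort();
const articles = await Promise.all(files.map((f) => Bun.file(`kb/${f}`).text()));

const SYSTEM_PROMPT = [
  (await Bun.file("persona.md").text()).trim(), // tone, rules, escalation policy
  "# Knowledge base",
  ...articles,
].join("\n\n");
```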

Running the numbers

Scenario: 10 customers per day, 10 messages each. 100 messages across 10 sessions.

RAG (the original setup):

  • 1,380 input tokens per call (system + retrieved chunks + history + user msg)
  • 0% cache hit rate — the system prompt is below the 1,024 cache minimum
  • ~$17/month total

All-in-context with prompt caching:

  • 4,330 input tokens per call (system + entire KB + history + user msg)
  • ~90% cache hit rate after the first message of each session
  • ~$18/month total
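If you want to sanity-check figures like these for your own traffic, the session-level input math is simple enough to sketch. The 1.25× / 0.1× multipliers are the caching rates quoted above; the base price and token counts are parameters you plug in yourself:

```ts
// Back-of-envelope cost of one session's input tokens in all-in-context mode.
// basePricePerTok is the model's input price per token (check current pricing);
// 1.25x is the cache-write rate, 0.1x the cache-read rate.
function cachedSessionInputCost(opts: {
  basePricePerTok: number; // e.g. the per-MTok input price divided by 1e6
  cachedTokens: number;    // system prompt + KB, behind cache_control
  dynamicTokens: number;   // history + user message, billed at the full rate
  messages: number;        // messages in the session (cache stays warm between them)
}): number {
  const { basePricePerTok: p, cachedTokens: K, dynamicTokens: D, messages: N } = opts;
  const firstCall = K * 1.25 * p + D * p; // cache write
  const laterCall = K * 0.1 * p + D * p;  // cache reads
  return firstCall + (N - 1) * laterCall;
}

// RAG mode has no cacheable prefix, so each call is simply
// (retrievedTokens + dynamicTokens) * basePricePerTok.
```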

Either way you slice it, the two land about a buck apart per month. The interesting part isn't the cost; it's everything else.

The trade-off table

|  | RAG | All-in-context + cache |
| --- | --- | --- |
| Cost (100 msgs/day) | $17/mo | $18/mo |
| First-token latency (cache hit) | ~1,150 ms | ~700 ms |
| Lines of code | ~250 | ~150 |
| External services | Supabase + OpenAI + Anthropic | Anthropic only |
| Retrieval failures | Possible (threshold tuning hell) | Impossible — KB always visible |
| Cross-article reasoning | Limited to top-K chunks | Sees the whole KB |
| Scales to 100 articles | Yes | Yes |
| Scales to 10,000 articles | Yes | No — context window limit |

For my use case — small KB, conversational chat, frequent sessions — all-in-context is faster, simpler, and handles harder cross-article questions better. The cost difference is in the noise.

The implementation

I made it switchable via one env var so I could A/B compare:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // picks up ANTHROPIC_API_KEY from the environment

// One env var flips the bot between the two architectures
const KB_MODE = Bun.env.KB_MODE === "inline" ? "inline" : "rag";
const PROMPT_FILE = KB_MODE === "inline" ? "CRISP-inline.md" : "CRISP-rag.md";

const SYSTEM_PROMPT = (await Bun.file(PROMPT_FILE).text()).trim();

// `messages` is the running conversation history for the current Crisp session
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [{
    type: "text",
    text: SYSTEM_PROMPT,
    cache_control: { type: "ephemeral" }, // ← the magic line
  }],
  messages,
});
```

CRISP-rag.md is just the persona/rules (~600 tokens — too small to cache).
CRISP-inline.md is persona + entire KB baked in (~2,500 tokens — caches happily).

In inline mode, searchKB() is never called. No embedding round-trip, no Supabase query. The KB is sitting in the system prompt, cached on Anthropic's side, ready to be reused for every subsequent message.
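The branch itself is only a few lines. A sketch — customerMessage and the messages history array stand for whatever your webhook handler already has in hand:

```ts
// Per-message branch: RAG mode fetches chunks and rides them along with the
// customer message; inline mode sends the message as-is, because the KB is
// already in the cached system prompt.
let context = "";
if (KB_MODE === "rag") {
  context = await searchKB(customerMessage); // embedding + pgvector round-trip
}

messages.push({
  role: "user",
  content: context
    ? `Relevant articles:\n\n${context}\n\n---\n\nCustomer: ${customerMessage}`
    : customerMessage,
});
```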

The cache logging proves it:

```
[claude] response in 1432ms (in: 47, out: 92, cache_create: 2447, cache_read: 0)    ← 1st msg
[claude] response in 712ms  (in: 47, out: 88, cache_create: 0,    cache_read: 2447) ← cached
[claude] response in 689ms  (in: 47, out: 95, cache_create: 0,    cache_read: 2447) ← cached
```

After the first message, every reply is ~50% faster and 90% cheaper on input.
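If you want the same logging, everything in those lines comes straight off the response's usage block. A sketch — elapsedMs is whatever timer you wrap the API call in:

```ts
// The Messages API reports cache activity per call; cache_creation_input_tokens
// and cache_read_input_tokens are the fields behind cache_create/cache_read above.
const u = response.usage;
console.log(
  `[claude] response in ${elapsedMs}ms ` +
    `(in: ${u.input_tokens}, out: ${u.output_tokens}, ` +
    `cache_create: ${u.cache_creation_input_tokens ?? 0}, ` +
    `cache_read: ${u.cache_read_input_tokens ?? 0})`
);
```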

What I learned

RAG isn't free. It adds two API hops, an embedding model, a vector database, chunking logic, threshold tuning, and an entire class of "the right chunk wasn't retrieved" failure modes. It's the right answer above some KB size — but that size is way bigger than most tutorials assume.

Prompt caching changes the break-even point. Without caching, stuffing 4,000 tokens into every request is wasteful. With caching, it's nearly free after the first call. The 1,024-token minimum is the only real gate.

A rough heuristic:

| KB size | Recommendation |
| --- | --- |
| < 50k tokens | Start with all-in-context + caching. You probably don't need a vector DB. |
| 50k–200k tokens | Hybrid — cache a "core" set of always-relevant content, RAG the long tail. |
| > 200k tokens | RAG is mandatory (context window limit). |

For the typical "I have a few dozen markdown files" scenario, you almost certainly don't need a vector database.

Caveats

This isn't a universal "RAG is dead" take. RAG still wins when:

  • KB is genuinely large (thousands of articles)
  • KB updates constantly (every edit invalidates every cache)
  • Different customers need different KB subsets (caching is org-scoped, not user-scoped)
  • You need precise per-chunk attribution

But for a small product with a small KB? Reach for prompt caching first. It's a one-line change (cache_control: { type: "ephemeral" }) with measurable wins, and you can always add RAG later when your KB grows into it.

The most useful thing I did was build a switch. Don't make this a religious choice — measure both for your specific traffic shape and let the numbers decide.
