A plain-English guide to caching in AI apps — no background needed.
The problem, in one breath
When lots of people use an AI app, they keep asking the same questions — the same ones over and over, sometimes worded a little differently. And every single time the AI answers, it costs real money and makes the person wait a few seconds.
So we want a system that remembers answers it has already given, and hands them back instantly instead of bothering the AI every single time.
The best way to picture this whole system is a help desk. Let me introduce the people and tools at this help desk one by one — and next to each one, I'll put its real tech name in brackets so you always know what's what.
Part 1: Meet the team (the services we use)
1. The expert in the back room (the AI model / "LLM", like GPT, Claude, or Gemini)
This is the genius who can answer almost anything, but is slow and expensive. Every time you ask the expert something, it costs money and takes a few seconds. So the golden rule of the whole help desk is: only bother the expert for questions we haven't answered before.
2. The notebook of answers (the cache)
This is a notebook where the help desk writes down answers it has already figured out. Next time the same question comes in, a clerk just reads the answer from the notebook instead of waking the expert. Reading the notebook is instant and free. There are actually two notebooks:
The word-for-word notebook (the "exact cache", usually Redis or Valkey)
Super fast. If someone asks a question typed exactly the same as before, this notebook finds it in a blink.
The same-meaning notebook (the "semantic cache", e.g. Redis LangCache, RedisVL, or GPTCache)
Smarter. It catches questions that mean the same thing even if the words are different: "how do I reverse a string" vs "how do I flip a string."
3. The meaning-reader (the embedding model)
For the same-meaning notebook to work, the help desk needs a way to tell when two questions mean the same thing. The meaning-reader takes any question and turns it into a kind of "meaning fingerprint" (its vector embedding, in technical terms). Two questions that mean the same thing get almost identical fingerprints, even if the words differ. That's the whole trick behind matching reworded questions. (You don't need to know how it makes the fingerprint — just that it does.)
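To make the fingerprint idea a little more concrete, here is a minimal sketch of how two fingerprints can be compared. The three short vectors are made up by hand purely for illustration; a real embedding model returns hundreds or thousands of numbers per question. The function name `cosine_similarity` is our own choice.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as "no match"
    return dot / (norm_a * norm_b)

# Hand-made toy fingerprints (NOT real model output).
reverse_string = [0.9, 0.1, 0.3]    # "how do I reverse a string"
flip_string = [0.85, 0.15, 0.35]    # "how do I flip a string"
order_status = [0.1, 0.9, 0.2]      # "where is my order"

print(cosine_similarity(reverse_string, flip_string))   # close to 1: same meaning
print(cosine_similarity(reverse_string, order_status))  # much lower: different meaning
```

The two reworded questions score very close to 1, while the unrelated one scores far lower. That gap is what the same-meaning notebook relies on.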
4. The smart table of contents (the vector store / index, e.g. Redis Search, pgvector, Qdrant, Pinecone)
Once the notebook has a lot of pages, flipping through all of them every time would be slow. So the help desk keeps a smart table of contents that, given a new question's fingerprint, jumps straight to the few pages that are likely matches instead of reading every page. This is what keeps "same-meaning" lookups fast even with millions of saved answers.
5. The front-desk clerk (the router / gateway, e.g. Portkey, Helicone, Cloudflare AI Gateway)
This is the person at the front who receives every question and decides what to do with it: check the notebooks first, and only if there's no match, decide which expert to send it to (a cheaper junior expert for easy questions, the senior expert for hard ones). The clerk is the traffic director.
6. The label on each page (the "scope" / tenant tag)
Every answer written in the notebook gets a label saying who's allowed to read it. Some answers are labeled "anyone" (general questions). Some are labeled "this person only" (questions about someone's private stuff). This label is how we make sure we never give one person's personal answer to someone else.
7. The expiring sticky-notes (TTL / session memory)
Some notes are only useful for a short while — like the back-and-forth of one ongoing conversation. The help desk writes those on sticky-notes that automatically fall off after a while, so they don't pile up forever.
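A sticky-note that falls off by itself can be sketched in a few lines. This is an in-memory toy, assuming a single process; the class name `StickyNotes` and the injectable `clock` are our own inventions to keep the example testable.

```python
import time

class StickyNotes:
    """A tiny in-memory store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so the behaviour is easy to test
        self._notes = {}     # key -> (value, expires_at)

    def put(self, key, value):
        self._notes[key] = (value, self.clock() + self.ttl_seconds)

    def get(self, key):
        entry = self._notes.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._notes[key]  # the sticky-note has fallen off
            return None
        return value
```

In Redis you would normally not write this yourself: setting an expiry on the key (for example `SET key value EX 900`) gives the same effect.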
8. The expert's own quick-skim discount (provider "prefix caching", built into OpenAI, Anthropic, Gemini)
Even when we do call the expert, the expert gives a small discount for the part of the question it just read a moment ago, so it doesn't fully re-read the same long background twice in one conversation. It's a nice little saving — but note: the expert still writes a fresh answer every time. This discount is not the same as our notebook, which skips the expert entirely. It's also short-lived — these provider discounts usually expire within minutes of inactivity, while your own notebook can keep answers for as long as you choose. (More on this difference below.)
Part 2: How they all work together
Now let's walk a real question through the help desk and watch the team play their parts.
A question arrives.
You ask: "How do I reverse a string in Python?" The front-desk clerk (router) catches it first.
Check the fast notebook.
The clerk peeks at the word-for-word notebook (exact cache, Redis). Has this exact question been asked before? If yes → hand back the saved answer instantly. Done, the expert is never disturbed.
Check the smart notebook.
If the exact wording isn't found, the clerk asks the meaning-reader (embedding model) to make a fingerprint of the question, then uses the smart table of contents (vector store) to look in the same-meaning notebook (semantic cache). Is there a saved answer that means the same thing? If it's a close enough match → hand it back. Still no expert needed.
Only now, wake the expert.
If neither notebook has it, this really is a new question. The clerk decides which expert to use (easy → cheaper model, hard → top model) and the expert (the LLM) writes a fresh answer.
Write it down for next time.
The new answer goes into the notebooks, with a label (scope tag) saying who can reuse it. This is the important bit: we save the answer after the expert gives it. The first person "pays" for it; everyone after gets it free from the notebook.
The neat part: a different person asks the same thing
Later, a totally different user types: "what's the way to flip a string in python?" Different words, same meaning. The clerk makes a fingerprint, the smart table of contents finds the page the first user created, and the meaning matches closely enough, so this new person gets the answer straight from the notebook: no expert, no wait. That's the "serve a new user from the cache" idea; it's just the same-meaning notebook doing its job.
How we decide what to save (and what NOT to share)
Before writing an answer in the shared "anyone" notebook, the help desk asks one question: "Is this answer the same for everyone, or only for this person?"
- It first checks a free clue: did answering it require the person's private stuff? ("Where is my order?" needed to look up their order → personal. "What is a closure?" needed nothing personal → general.)
- A quick glance at words like "my / this / I'm getting" adds another hint.
- Only for the genuinely unclear cases does it ask a small, cheap judge (a small LLM or classifier). It asks only for those cases, not every question, because running a judge on everything would cost as much as it saves.
- When still unsure → don't share. Worst case we ask the expert again; that's far better than handing someone a wrong answer.
General answers get the "anyone" label and go in the shared notebook. Personal answers get a "this person only" label, so they're kept just for that user and never shown to others.
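The checklist above can be read as a small decision function. This is one possible reading, not a definitive rule set: the word list, the `used_private_data` flag, the `"user:<id>"` label format, and the optional `judge` callback are all assumptions made for the sketch.

```python
import re

# A rough "sounds personal" hint; the word list is illustrative only.
PERSONAL_WORDS = re.compile(r"\b(my|mine|me|i|this)\b", re.IGNORECASE)

def choose_label(question: str, used_private_data: bool, user_id: str, judge=None) -> str:
    """Return "anyone" or "user:<id>" for a freshly answered question."""
    private_label = f"user:{user_id}"
    # Clue 1 (free): answering needed the person's private data.
    if used_private_data:
        return private_label
    # Clue 2 (cheap): no personal-sounding words, so treat it as general.
    if not PERSONAL_WORDS.search(question):
        return "anyone"
    # Unclear case: ask the small judge, if one is configured.
    if judge is not None and judge(question) == "general":
        return "anyone"
    # Still unsure: don't share.
    return private_label

print(choose_label("Where is my order?", True, "abhi"))   # user:abhi
print(choose_label("What is a closure?", False, "abhi"))  # anyone
```

Note the last line of the function: every path that is not clearly general falls through to the private label, which matches the "when still unsure, don't share" rule.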
Part 3: What happens when millions of people show up
This is where people panic — "won't the notebook become impossibly huge?" Here's why it stays manageable, in plain terms.
There aren't millions of different questions.
Even with millions of users, they keep asking the same popular questions over and over. So the shared notebook grows with the number of different questions (smallish), not the number of users (huge). More users mostly means the same pages get read more often — which is fine.
Many clerks, not one.
One clerk flipping through one giant notebook would be a bottleneck, so you hire lots of clerks, each holding a slice of the notebook (this splitting is called sharding — e.g. Redis Cluster). Busy? Add more clerks. The system is just many identical helpers working in parallel.
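Deciding which clerk holds which page usually comes down to hashing the key. Here is a minimal sketch of that idea; the function name `shard_for` and the simple modulo scheme are assumptions for illustration, not how any particular product does it. (Redis Cluster, for instance, maps keys to 16384 hash slots using CRC16 rather than to shards directly.)

```python
import hashlib

def shard_for(question_key: str, num_shards: int) -> int:
    """Map a cache key to a shard number in the range [0, num_shards)."""
    # Use a stable hash: Python's built-in hash() for strings changes between
    # processes, so two servers would disagree about where a page lives.
    digest = hashlib.sha256(question_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

key = "how do i reverse a string in python"
print(shard_for(key, 8))  # always the same shard for the same key
```

One known weakness of plain modulo: changing the number of shards moves most keys to a different shard. Slot-based or consistent-hashing schemes exist to soften that.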
The smart table of contents keeps lookups fast.
As covered above, you never read all million pages — the index jumps you to the likely matches.
Throw away stale notes.
Pages nobody has used in a long time get erased to make room, so the notebook stays full of useful answers, not clutter. Personal sticky-notes expire on their own.
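"Erase what nobody has used lately" is commonly done with a least-recently-used (LRU) policy. The guide does not name a specific policy, so treat this as one reasonable option; `LruNotebook` is a made-up name for the sketch.

```python
from collections import OrderedDict

class LruNotebook:
    """Keeps at most max_pages answers, erasing the least recently used first."""

    def __init__(self, max_pages: int):
        self.max_pages = max_pages
        self._pages = OrderedDict()  # least recently used first, most recent last

    def get(self, question):
        if question not in self._pages:
            return None
        self._pages.move_to_end(question)  # reading a page counts as using it
        return self._pages[question]

    def put(self, question, answer):
        self._pages[question] = answer
        self._pages.move_to_end(question)
        while len(self._pages) > self.max_pages:
            self._pages.popitem(last=False)  # erase the least recently used page
```

Redis offers a comparable built-in behaviour through its `maxmemory-policy` setting (for example `allkeys-lru`), so in practice this is usually configuration rather than code.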
When something goes viral (engineers call this a "cache stampede").
If 10,000 people suddenly ask the same brand-new question at once, you don't want all 10,000 waking the expert. So the first one goes to the expert, the answer gets written down, and the other 9,999 wait a heartbeat and read the freshly-written page. One expert call instead of ten thousand.
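The "first one asks, the rest wait" trick is often called single-flight or request coalescing. Below is a minimal sketch using threads inside one process; the class name `SingleFlight` and the `ask_expert` callback are hypothetical. Across many servers you would need a shared lock or a gateway that coalesces requests, which this sketch does not cover.

```python
import threading

class SingleFlight:
    """Let one caller per key run the expensive call; the others wait and reuse it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Event for the call currently in progress
        self._results = {}    # key -> finished answer (stands in for the cache)

    def get(self, key, ask_expert):
        with self._lock:
            if key in self._results:
                return self._results[key]
            event = self._in_flight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._in_flight[key] = event
        if leader:
            try:
                answer = ask_expert(key)      # the one real expert call
                with self._lock:
                    self._results[key] = answer
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()                   # wake everyone who was waiting
            return answer
        event.wait()                          # "wait a heartbeat"
        with self._lock:
            if key in self._results:
                return self._results[key]
        # The leader failed before saving an answer, so try again ourselves.
        return self.get(key, ask_expert)
```

If the leader's call raises an error, the waiting callers wake up, find no saved answer, and one of them becomes the new leader.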
The punchline.
The expensive expert only ever sees the genuinely new questions. All the repeats — which is most of the traffic — come from the notebook in a blink. So as you grow from a thousand users to fifty million, your AI bill grows much slower than your user count, because the notebook soaks up all the repeats.
Part 4: The big picture (this is the "HLD", or high-level design)
"HLD" just means the map seen from high up: which parts exist and who talks to whom, without the tiny details. Here's our help desk as a map. Follow the arrows: a question travels from top to bottom and stops the moment an answer is found.
The whole point of the map: the expert at the bottom is only reached when both notebooks come up empty. Most questions never get that far; they're answered straight from a notebook near the top.
Part 5: The fine print (this is the "LLD", or low-level design)
"LLD" means zooming all the way in: what one saved answer actually looks like, and the exact steps of a lookup. Still in plain words.
What one page in the notebook actually holds
Every saved answer is one "page," and each page carries a few things:
ONE SAVED PAGE
- the question → "how do I reverse a string in python"
- the answer → "...the steps the expert gave..."
- the meaning-fingerprint → a long row of numbers (used to match similar questions)
- the label → "anyone" (or "Abhi only")
- expires on → a date; after this, the page is erased
That's it. A question, its answer, a fingerprint for same-meaning matching, a label for who's allowed to read it, and an expiry date so old pages don't pile up.
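As a rough sketch, one page could be modelled like this. The field names, the `"user:<id>"` label format, the sample answer text, and the 30-day expiry are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SavedPage:
    question: str
    answer: str
    fingerprint: list[float]  # the embedding vector
    label: str                # "anyone" or e.g. "user:abhi"
    expires_at: datetime

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at

    def readable_by(self, user_id: str) -> bool:
        return self.label == "anyone" or self.label == f"user:{user_id}"

page = SavedPage(
    question="how do I reverse a string in python",
    answer="Use slicing: text[::-1]",
    fingerprint=[0.9, 0.1, 0.3],  # toy numbers, not real model output
    label="anyone",
    expires_at=datetime.now(timezone.utc) + timedelta(days=30),
)
print(page.readable_by("abhi"))  # True, because the label is "anyone"
```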
The exact steps when a question arrives
1. Tidy the question. Make small wording cleanups (lowercase, trim spaces) so tiny differences don't cause misses. Some teams also drop filler words like "the", "a", or "please" (called stop-words) so the exact notebook matches a little more cleverly without even needing the meaning-reader.
2. Try the fast notebook first. Look up the cleaned-up question word-for-word (exact cache). If it's there → hand it back. (This step is so cheap we always do it first.)
3. Make a fingerprint. If step 2 missed, ask the meaning-reader (embedding model) to turn the question into its fingerprint.
4. Search the smart table of contents. Use the fingerprint to find the closest saved page (vector store), but only among pages whose label this person is allowed to read.
5. Apply the "close enough" dial. The search returns a closeness score. If it clears our threshold → hand back that page's answer. If not → treat it as new.
6. Wake the expert, then write it down. On a true miss, the expert answers, and we save a new page with the right label and an expiry date.
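The steps above can be sketched end to end in plain Python. Treat this as a toy under stated assumptions: `embed` and `ask_expert` are hypothetical callbacks you would supply, a linear scan stands in for the vector index, the threshold of 0.88 is just a value inside the commonly quoted range, and expiry dates are left out to keep it short. Stripping punctuation during tidying and applying labels to the exact lookup are our additions.

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "please"}

def normalize(question: str) -> str:
    """Step 1: lowercase, drop punctuation and extra spaces, remove stop-words."""
    words = re.findall(r"[a-z0-9']+", question.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class HelpDesk:
    def __init__(self, embed, ask_expert, threshold=0.88):
        self.embed = embed            # hypothetical: text -> list of floats
        self.ask_expert = ask_expert  # hypothetical: text -> answer string
        self.threshold = threshold
        self.exact = {}               # (label, tidied question) -> answer
        self.pages = []               # dicts with fingerprint, answer, label

    def answer(self, question, user_id, label="anyone"):
        """Return (answer, source); label is used only if a new page is saved."""
        key = normalize(question)
        allowed = {"anyone", f"user:{user_id}"}
        # Step 2: exact lookup, limited to labels this user may read.
        for lab in allowed:
            if (lab, key) in self.exact:
                return self.exact[(lab, key)], "exact"
        # Step 3: make the fingerprint.
        vec = self.embed(key)
        # Step 4: find the closest page this user is allowed to read.
        best, best_score = None, -1.0
        for page in self.pages:
            if page["label"] in allowed:
                score = cosine(vec, page["fingerprint"])
                if score > best_score:
                    best, best_score = page, score
        # Step 5: apply the "close enough" dial.
        if best is not None and best_score >= self.threshold:
            return best["answer"], "semantic"
        # Step 6: true miss, so wake the expert and write the answer down.
        text = self.ask_expert(question)
        self.exact[(label, key)] = text
        self.pages.append({"fingerprint": vec, "answer": text, "label": label})
        return text, "expert"
```

The second return value ("exact", "semantic", or "expert") is there only so you can see which notebook served the answer; it is also a handy thing to log when measuring hit rate.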
The "close enough" dial (the similarity threshold)
When matching by meaning, we need to decide how close is close enough to count as "the same question." That's a single dial. (In technical terms, closeness is measured as cosine similarity from 0 to 1, and a threshold around 0.85–0.90 is a common sweet spot with a model like OpenAI's text-embedding-3-small — the right number shifts with whichever embedding model you use.) The dial works like this:
- Turn it too loose → you hand back answers to questions that only looked similar (wrong answers).
- Turn it too strict → you miss real matches and wake the expert needlessly (wasted money).
- The fix: set it sensibly per topic — relaxed for simple definitions, strict for anything where a wrong answer is costly — and when a match only barely passes, double-check it instead of trusting it.
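Here is a minimal sketch of a per-topic dial with a double-check for matches that only barely pass. The topic names, the 0.95 value, the 0.03 "borderline" margin, and the `double_check` callback are all assumptions for illustration; only the 0.85 to 0.90 range comes from the text above.

```python
THRESHOLDS = {"definitions": 0.85, "billing": 0.95}  # illustrative values only
DEFAULT_THRESHOLD = 0.90
BORDERLINE_MARGIN = 0.03  # how far above the threshold still counts as "barely"

def decide(score: float, topic: str, double_check=None) -> str:
    """Return "hit" or "miss" for a semantic match with the given similarity score."""
    threshold = THRESHOLDS.get(topic, DEFAULT_THRESHOLD)
    if score < threshold:
        return "miss"
    if score < threshold + BORDERLINE_MARGIN:
        # Barely passed: trust it only if a second check confirms the match.
        if double_check is None or not double_check():
            return "miss"
    return "hit"

print(decide(0.95, "definitions"))  # hit: comfortably above the relaxed dial
print(decide(0.90, "billing"))      # miss: below the strict dial for costly topics
```

The `double_check` callback could be anything you trust more than the raw score, such as a small model asked whether the two questions really are the same. When no check is available, a borderline match is treated as a miss, which errs on the side of asking the expert.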
How labels keep people separate
The label is what makes "share with everyone" safe. A general answer gets the label "anyone," so it sits in the shared part of the notebook. A personal answer gets "this person only," so when someone else searches, the table of contents simply never shows them that page. No clever real-time decision — the safety comes from the label we wrote at save time.
Part 6: What can go wrong (and how we avoid it)
Three honest failure cases, and the simple guard for each:
- An out-of-date answer. The world changed but the notebook still has the old answer. Guard: every page has an expiry date, and we erase pages when the underlying facts change.
- The wrong person sees a personal answer. Guard: the label on each page — personal pages are never shown to others.
- A loose match gives a wrong answer. Guard: the "close enough" dial, plus double-checking borderline matches and defaulting to "ask the expert" when unsure.
And a friendly build order if you ever make this: start with the word-for-word notebook (easiest, big wins), add the same-meaning notebook next, then the labels for safety, and only worry about the many-clerks scaling once you actually have lots of users.
A quick before-and-after
Picture an app handling 100,000 questions a month, each costing about $0.01 to answer with the model — roughly $1,000 / month.
Add the notebook (the exact + same-meaning cache), and say it catches half the traffic:
- 50,000 questions answered straight from the notebook
- about $500 / month saved on model calls
- those answers come back in under 50 ms instead of 3–10 seconds — roughly a 99% drop in wait time
The first person to ask still pays the full cost; everyone after rides for free. (Tune the cache well on FAQ-style traffic and the hit rate — and the savings — climb higher.)
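The arithmetic behind those numbers, as a small sketch you can rerun with your own figures. It works in whole cents to avoid rounding surprises, and it deliberately ignores the cost of running the cache and the embedding model, which is an assumption you should revisit for a real estimate.

```python
def monthly_savings(questions: int, cost_cents: int, hit_rate_percent: int) -> dict:
    """Estimate model-call savings for a given cache hit rate."""
    hits = questions * hit_rate_percent // 100
    baseline_cents = questions * cost_cents
    saved_cents = hits * cost_cents
    return {
        "hits": hits,
        "baseline_usd": baseline_cents / 100,
        "saved_usd": saved_cents / 100,
        "remaining_usd": (baseline_cents - saved_cents) / 100,
    }

def wait_drop_percent(cached_ms: float, model_ms: float) -> float:
    """Percentage reduction in wait time when a cached answer replaces a model call."""
    return round(100 * (1 - cached_ms / model_ms), 1)

print(monthly_savings(100_000, 1, 50))
print(wait_drop_percent(50, 3_000), wait_drop_percent(50, 10_000))
```

With the figures from this section, that gives 50,000 hits, a $1,000 baseline, and $500 saved; the wait-time drop works out to about 98.3% against a 3-second call and 99.5% against a 10-second call.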
Quick cheat-sheet: analogy → real service
| At the help desk… | …is really | Example tools |
|---|---|---|
| The expert in the back room | The AI model (LLM) | GPT, Claude, Gemini |
| The notebook of answers | The cache | Redis / Valkey |
| word-for-word notebook | Exact cache | Redis, Valkey |
| same-meaning notebook | Semantic cache | Redis LangCache, RedisVL, GPTCache |
| The meaning-reader | The embedding model | OpenAI / other embedding models |
| The smart table of contents | Vector store / index | Redis Search, pgvector, Qdrant, Pinecone |
| The front-desk clerk | Router / gateway | Portkey, Helicone, Cloudflare AI Gateway |
| The label on each page | Scope / tenant tag | (your own design) |
| Expiring sticky-notes | TTL / session memory | Redis with TTL |
| The small judge | Small LLM / classifier | a cheap model |
| Many clerks with notebook slices | Sharding | Redis Cluster |
| The expert's quick-skim discount | Provider prefix caching | OpenAI, Anthropic, Gemini |
| The "close enough" dial | Similarity threshold | Cosine similarity (~0.85–0.90) |
| Tidying the question | Normalization / stop-words | Lowercase, trim, stop-word removal |
| Handling the viral rush once | Cache-stampede protection | Request coalescing / single-flight |
| The all-in-one bundle | Managed AI cache stack | Redis for AI (LangCache, RedisVL, Agent Memory) |
The whole thing in four sentences
People keep asking an AI app the same questions over and over, and calling the AI every time is slow and costly. So we keep a notebook (cache) behind the scenes that remembers past answers and hands them back instantly without waking the AI. We save each answer after the AI gives it, with a label that decides who's allowed to reuse it. And because people keep asking the same popular questions, this notebook stays small and fast even with millions of users — so the AI bill grows far slower than the crowd.
A note on the numbers: the figures in this guide are representative ranges drawn from vendor benchmarks and industry case studies (2024–2026) — e.g. Redis / LangCache, Anthropic and OpenAI docs, and published semantic-cache write-ups. Real results vary by workload, traffic pattern, and how carefully you tune the cache.
