Mahak Faheem

Redis Caching in RAG: Normalized Queries, Semantic Traps & What Actually Worked

When I first added Redis caching to my RAG API, the motivation was simple: latency was creeping up, costs were rising and many questions looked repetitive.
Caching felt like the obvious win.
But once I went beyond the happy path, I realized caching in RAG isn’t about Redis at all. It’s about what you choose to cache and how safely you decide two queries are “the same”.

This post walks through:

  • why Redis caching works for RAG
  • what a normalized query really means
  • why semantic caching is tempting but dangerous
  • and how a proper normalization layer keeps correctness intact

Why Redis Caching Makes Sense in RAG

RAG pipelines are expensive because they repeatedly do the same things:

  • embedding generation
  • vector retrieval
  • context assembly
  • LLM inference

For many user questions, especially in internal tools:
the answer doesn’t change between requests

Redis gives you:

  • sub-millisecond reads
  • TTL-based eviction
  • simple operational model
  • predictable cost

So the first version of my cache looked like this:

cache_key = hash(user_query)


You already know why this doesn't work.

Text Equality Is Not Intent Equality

These queries are clearly the same:

  • "Explain docker networking"
  • "Can you explain Docker networking?"
  • "docker networking explained"

But Redis treats them as different keys.
That’s when the idea of a normalized query enters the picture.

What Is a Normalized Query (Really)?

A normalized query is about stripping away presentation noise while preserving intent.

The goal:

  • improve cache hit rate
  • without returning wrong answers

Safe normalizations:

  • lowercasing
  • trimming whitespace
  • removing punctuation
  • collapsing filler phrases

Dangerous normalizations:

  • removing numbers
  • collapsing versions
  • replacing domain terms
  • synonym substitution
  • semantic guessing

In RAG, wrong cache hits are worse than cache misses.

An Example Normalization Function

import re

FILLER_PHRASES = ["can you", "please", "tell me", "explain"]

def normalize_query(query: str) -> str:
    q = query.lower().strip()

    # Strip filler phrases as whole words only, so words that merely
    # contain them (e.g. "explained") are not mangled.
    for phrase in FILLER_PHRASES:
        q = re.sub(rf"\b{re.escape(phrase)}\b", "", q)

    q = re.sub(r"[^\w\s]", "", q)  # drop punctuation
    q = re.sub(r"\s+", " ", q)     # collapse whitespace

    return q.strip()

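A quick sanity check on two of the earlier example queries:

queries = [
    "Explain docker networking",
    "Can you explain Docker networking?",
]

# Both collapse to the same normalized key.
print({normalize_query(q) for q in queries})  # {'docker networking'}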

This intentionally avoids:

  • NLP stopword lists
  • embeddings
  • synonym expansion

Boring. Predictable. Correct.

A Better Cache Key

Text alone is still not enough.
A correct cache key must capture how the answer was produced, not just the question.

cache_key = hash(
    model_name +
    normalized_query +
    retrieval_config
)

This prevents:

  • reusing answers across models
  • mixing retrieval strategies
  • silent correctness bugs
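For concreteness, here's a minimal sketch of composing that key. The model name and retrieval settings are placeholders, not anything specific to my stack:

import hashlib

def build_cache_key(model_name: str, normalized_query: str, retrieval_config: str) -> str:
    # Everything that shaped the answer goes into the key, joined unambiguously.
    raw = "|".join([model_name, normalized_query, retrieval_config])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Hypothetical values: swap in whatever model and retrieval settings your pipeline uses.
key = build_cache_key(
    "gpt-4o",
    normalize_query("Explain docker networking"),
    "top_k=5,index=docs-v2",
)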

Where Semantic Caching Tempted Me (& Why It’s Risky)

At some point, I considered:
"What if I reuse answers for similar questions?"
This is semantic caching.
Example:

"How does Redis caching work in RAG?"
"Explain caching strategy for RAG systems"

They feel similar.
But semantic similarity is probabilistic, not deterministic.

The risks:

  • incorrect reuse
  • subtle hallucinations
  • hard-to-debug failures
  • broken trust

For production RAG, that’s dangerous.

Where Semantic Caching Can Work (Carefully)

Semantic caching is acceptable when:

  • questions are FAQs
  • answers are generic
  • correctness tolerance is high
  • fallback to exact cache exists

The safe pattern is two-tier caching (sketched below):

  • Exact cache (normalized query)
  • Semantic cache (optional, guarded)
  • Retrieval fallback

Never semantic-cache authoritative answers.
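Here's a rough sketch of that two-tier lookup. The embedding function, similarity threshold, and in-memory semantic store are stand-ins; a real system would back the semantic tier with a vector store and tune the threshold conservatively:

import numpy as np

SIMILARITY_THRESHOLD = 0.95  # assumed: deliberately strict

def two_tier_lookup(query, exact_cache, semantic_entries, embed, answer_from_retrieval):
    # exact_cache: dict mapping normalized query -> answer
    # semantic_entries: list of (embedding, answer) pairs
    # embed: callable mapping text -> a 1-D numpy vector (supplied by the caller)
    key = normalize_query(query)

    # Tier 1: exact match on the normalized query.
    if key in exact_cache:
        return exact_cache[key]

    # Tier 2: guarded semantic match, only above a strict similarity threshold.
    q_vec = embed(key)
    for vec, answer in semantic_entries:
        sim = float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))
        if sim >= SIMILARITY_THRESHOLD:
            return answer

    # Tier 3: fall back to full retrieval + generation, then populate the exact tier.
    answer = answer_from_retrieval(query)
    exact_cache[key] = answer
    return answer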

The Normalization Layer (The Missing Piece)

The biggest realization for me was this:
Normalization is not a function; it’s a layer.

Especially when RAG involves:

  • SQL / Athena
  • APIs
  • logs
  • metrics

In those cases, the “query” isn’t text anymore. It’s intent + constraints. Instead of caching raw SQL, normalize the logical query shape:
{
  "source": "athena",
  "table": "deployments",
  "metrics": ["count"],
  "filters": {
    "status": "FAILED",
    "time_range": "LAST_7_DAYS"
  }
}


Then hash a canonical form.

This makes caching:

  • deterministic
  • debuggable
  • correct
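A minimal version of that canonicalization: json.dumps with sorted keys and compact separators gives a stable byte string for equal intents, regardless of key order:

import hashlib
import json

def intent_cache_key(intent: dict) -> str:
    # Canonical JSON so that equal intents always hash to the same key.
    canonical = json.dumps(intent, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key = intent_cache_key({
    "source": "athena",
    "table": "deployments",
    "metrics": ["count"],
    "filters": {"status": "FAILED", "time_range": "LAST_7_DAYS"},
})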

What Actually Worked in Practice

My final setup looked like this:

  • Redis for fast cache
  • conservative text normalization
  • intent-level normalization for structured queries
  • no semantic caching for critical paths
  • TTL aligned with data freshness (see the sketch below)
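The Redis and TTL pieces fit together as a simple read-through pattern; here's a minimal sketch (host, port, and TTL are placeholders):

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 3600  # align with how fresh the underlying data needs to be

def cached_answer(cache_key: str, compute_answer):
    # Read-through: return the cached answer if present, else compute and store it with a TTL.
    hit = r.get(cache_key)
    if hit is not None:
        return json.loads(hit)

    answer = compute_answer()
    r.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(answer))
    return answer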

Results:

  • ~40% cost reduction
  • lower latency
  • zero correctness regressions
  • predictable behavior

Most importantly, I trusted my system again.

Takeaways

  • Redis caching is easy — correct caching is not
  • Normalize form, not meaning
  • Over-normalization silently breaks RAG
  • Semantic caching should be optional, not default
  • Structured queries need intent-level normalization
  • Determinism beats cleverness

Final Thoughts

Caching in RAG isn’t about saving tokens.
It’s about engineering discipline.

If we get normalization right, Redis becomes a superpower.
If we don’t, caching becomes a liability.

Thanks for reading.
Mahak

p.s. This is a deceptively hard problem, and there’s no one-size-fits-all solution. Different RAG setups demand different normalization strategies depending on how context is retrieved, structured & validated. In my own project, this exact approach didn’t work out of the box; the real implementation was far more constrained & nuanced. What I’ve shared here is the idea and way of thinking that helped me reason about the problem, not a drop-in solution. Production-grade systems inevitably require careful, system-specific trade-offs.
