Seenivasa Ramadurai

Semantic Caching in RAG Systems & AI Agents

What Is Caching?

Caching is the practice of storing the result of an expensive operation so that future requests for the same result can be served instantly without repeating the work.

The concept is foundational in computing. A web browser caches images so pages load faster. A database caches query results so it does not re-read the disk. A CDN caches static files close to the user. In every case the principle is the same: compute once, reuse many times.

Traditional caches work on exact matches. The key is the exact input string or request. “What is the capital of France?” and “What’s the capital of France?” are different keys, so the cache misses on the second even though the answer is identical. This works fine for static web assets, but it falls apart the moment users express the same intent in different words.

What Is Semantic Caching?

Semantic caching replaces the exact string key with a meaning based key. Instead of asking “is this input identical to a stored input?”, it asks “is this input similar enough in meaning to a stored input?”

It does this using embeddings. Every query is converted into a dense numerical vector: a point in high-dimensional space where semantically similar sentences sit close together. The cache stores these vectors alongside their responses. When a new query arrives, its vector is compared to the stored vectors using cosine similarity. If the closest match is above a threshold (e.g. 0.92), the cached response is returned.

The result: a user who asks “How many sick days do I get?” and another who asks “What is our sick leave allowance?” both get the same cached answer because the questions mean the same thing.
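To make “similar enough in meaning” concrete, here is cosine similarity on toy 3-dimensional vectors. Real embedding models produce hundreds to thousands of dimensions; the numbers below are illustrative, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = np.array([0.90, 0.10, 0.20])  # "How many sick days do I get?"
q2 = np.array([0.88, 0.15, 0.22])  # "What is our sick leave allowance?"
q3 = np.array([0.10, 0.90, 0.30])  # "How do I reset my password?"

print(cosine_similarity(q1, q2))   # close to 1.0 -> cache hit at threshold 0.92
print(cosine_similarity(q1, q3))   # well below the threshold -> cache miss
```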

The Problem: Why RAG Pipelines Waste Money at Scale

A standard RAG pipeline does four things every time a user sends a query:

  • Embed the query: convert the text to a vector using an embedding model
  • Vector search: find the most relevant chunks in your document store
  • Assemble context: build a prompt from the retrieved chunks and the user query
  • LLM invocation: send the prompt to the model and pay per token

Each step adds latency and cost. An LLM call alone typically adds 1–4 seconds and costs money on every single request. The problem is that in production, often over 40% of queries are near paraphrases of questions already answered. Without caching, the system repeats all four steps for every one of them.

10,000 queries per day with 40% duplicates = 4,000 unnecessary LLM calls. At $0.002 per call that is $8 wasted daily, or $2,920 per year, before accounting for latency degradation during peak load.
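As a quick sanity check on that arithmetic:

```python
queries_per_day = 10_000
duplicate_rate = 0.40        # share of near-paraphrase queries
cost_per_call = 0.002        # USD per LLM call (illustrative figure)

duplicate_calls = int(queries_per_day * duplicate_rate)   # 4,000 wasted calls
daily_waste = duplicate_calls * cost_per_call

print(round(daily_waste, 2))        # 8.0   USD per day
print(round(daily_waste * 365, 2))  # 2920.0 USD per year
```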

The solution is to intercept duplicate queries before they reach step 2, the vector search. A semantic cache sits at the front of the pipeline. If a semantically equivalent query has been answered before, its answer is returned immediately and the rest of the pipeline (retrieval, context assembly, LLM call) is bypassed.

The Solution: How Semantic Caching Works

Every cached entry is a triple: the query embedding vector, the stored response, and a timestamp for expiry management.

📄 Cache entry structure
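A minimal sketch of that triple as a dataclass; the class and field names are illustrative, not from any particular library:

```python
from dataclasses import dataclass
import time

import numpy as np

@dataclass
class CacheEntry:
    embedding: np.ndarray   # query embedding vector v_q
    response: str           # the stored pipeline/LLM response
    created_at: float       # insertion timestamp, used for TTL expiry

entry = CacheEntry(
    embedding=np.zeros(384),   # a real entry holds the embedding model's output
    response="Our sick leave policy allows 15 sick days per year.",
    created_at=time.time(),
)
```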

At inference time the flow is:

  • Embed the incoming query → vector v_q
  • Search the cache: find the stored vector with highest cosine similarity to v_q
  • If max similarity ≥ threshold θ → return the cached response y_i immediately
  • Otherwise → run the full RAG pipeline, then store the new (v_q, response, timestamp) in the cache
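The flow above can be sketched as a minimal in-memory cache. A brute-force scan over a Python list stands in for a real vector index, which is fine for illustration but not for production scale:

```python
import time
from typing import List, Optional, Tuple

import numpy as np

class InMemorySemanticCache:
    """Toy semantic cache: linear scan + cosine similarity + threshold."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: List[Tuple[np.ndarray, str, float]] = []

    def lookup(self, v_q: np.ndarray) -> Optional[str]:
        """Return the best-matching response if similarity >= threshold."""
        best_score, best_response = -1.0, None
        for v_i, response, _ts in self.entries:
            score = float(np.dot(v_q, v_i)
                          / (np.linalg.norm(v_q) * np.linalg.norm(v_i)))
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def insert(self, v_q: np.ndarray, response: str) -> None:
        """Store (embedding, response, timestamp) for future lookups."""
        self.entries.append((v_q, response, time.time()))
```

On a miss the caller runs the full pipeline and then calls insert(); a production backend (Qdrant, FAISS) replaces the linear scan with an approximate nearest-neighbour index.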

The Three Cache Layers

Semantic caching can be applied at three points in the pipeline, each with different hit rates and trade-offs.

Semantic Caching in the AI Agentic World

A standard RAG pipeline answers one question per request. An AI agent does something more ambitious: it plans, decides which tools to call, executes those tools, reasons over the results, and repeats, sometimes across many turns, to complete a task.

This makes the cost and latency problem significantly worse. Where a RAG system makes one LLM call per query, an agent may make 5–15. Where a RAG system makes one tool call, an agent may make the same tool call repeatedly across different user sessions, fetching the same product info, the same knowledge article, the same company record.

🤖 In an agentic system, semantic caching is not just about saving one LLM call; it is about short-circuiting entire reasoning chains. A cached tool result prevents a retrieval step, which prevents a reasoning step, which may prevent two further tool calls downstream.

Where Caching Fits in an Agent Loop

An agent loop has two natural places to insert a cache:

Before the first LLM call: If the user’s intent has been handled before in a similar session, return the full cached final answer immediately. Bypasses the entire loop.

Before each tool execution: Before calling an external tool (database lookup, API call, knowledge base search), check whether the same call or a semantically equivalent one was made recently. Return the cached tool result instead of executing.
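A simplified sketch of those two checkpoints. Here embed(), llm_plan(), execute_tool(), and the dict-based caches are hypothetical stand-ins; a real system would use semantic (embedding) lookups rather than exact keys:

```python
# Minimal sketch of an agent loop with its two cache checkpoints.
def run_agent(user_query, answer_cache, tool_cache, embed, llm_plan, execute_tool):
    v_q = embed(user_query)

    # Checkpoint 1: before the first LLM call -> bypass the whole loop on a hit
    final = answer_cache.get(v_q)
    if final is not None:
        return final

    parts = []
    for tool, arg in llm_plan(user_query):     # e.g. [("get_policy", "sick leave")]
        v_t = embed(f"{tool}:{arg}")

        # Checkpoint 2: before each tool execution -> reuse an equivalent call
        result = tool_cache.get(v_t)
        if result is None:
            result = execute_tool(tool, arg)   # the only real external call
            tool_cache[v_t] = result
        parts.append(result)

    answer = " ".join(parts)
    answer_cache[v_q] = answer                 # store the final answer
    return answer
```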

Real World Use Cases

Use Case 1: HR Policy Bot

Internal HR chatbots are one of the highest-value deployments for semantic caching. Employees ask the same questions constantly: sick leave, parental leave, expense claims, performance review timelines, phrased differently by every person.

A 2,000-person company. Employees ask ~3 HR questions each per year = 6,000 annual queries. Analysis shows 60% are near paraphrases of existing questions, so semantic caching eliminates ~3,600 LLM calls per year. HR policy changes at most quarterly, so a stale cache is rarely a risk.

Example queries that resolve to the same cached answer:

  • “How many sick days do I get?”
  • “What is our sick leave allowance?”
  • “Can I take a sick day without a doctor’s note?”
  • “What’s the policy on calling in sick?”

All four embed to vectors with cosine similarity > 0.93 against the same cached entry. One LLM call. Four employees served.

⚠️ Never cache personal HR queries. “How many sick days do I have left?” is a personal balance query; it must bypass the cache and hit the HRIS system directly. Detect these with entity patterns before the cache lookup.
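A sketch of that entity-pattern gate; the regex patterns are illustrative and would need tuning against your real query logs and HRIS conventions:

```python
import re

# Hypothetical patterns: personal *balance* queries mention remaining amounts,
# while general policy questions ("how many sick days do I get?") do not.
PERSONAL_BALANCE = re.compile(r"\b(left|remaining|balance|accrued)\b", re.I)

def is_personal_hr_query(query: str) -> bool:
    """True -> bypass the cache and query the HRIS directly."""
    return bool(PERSONAL_BALANCE.search(query))

print(is_personal_hr_query("How many sick days do I have left?"))  # True
print(is_personal_hr_query("How many sick days do I get?"))        # False
```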

Use Case 2: Customer Support Services

Customer support is the highest-volume use case. Password resets, billing queries, refund policies, order tracking questions: a small set of issues accounts for the vast majority of ticket volume. The same problem, asked by thousands of different customers, in thousands of different ways.

An e-commerce platform with 50,000 support queries per day. 12 issue categories account for 73% of volume. Semantic caching at the query level reduces average response time from 4.2s to 0.3s for cache hits, a 93% latency improvement for nearly three quarters of all users.

Example queries resolving to the same cached response:

  • “I can’t log in”
  • “How do I reset my password?”
  • “I’m locked out of my account”
  • “Forgot my password, what do I do?”

Identical reset flow, identical cached answer. Cache hit rate for this category alone: 85%+.

⚠️ Order-specific queries must bypass the cache. Any query containing an order number, transaction ID, or account reference is personal; it gets routed to a live data lookup, never a cached response.
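The same pre-cache gate works for support traffic; the ID formats below are hypothetical and should be adjusted to your platform's actual order and transaction schemes:

```python
import re

# Hypothetical reference formats: ORD-1234567, TXN 987654, #123456, ...
ORDER_REF = re.compile(r"\b(?:ORD|TXN)[-\s]?\d{5,}\b|#\d{4,}", re.I)

def must_bypass_cache(query: str) -> bool:
    """Queries carrying an order/transaction reference go to a live lookup."""
    return bool(ORDER_REF.search(query))

print(must_bypass_cache("Where is my order ORD-1234567?"))  # True
print(must_bypass_cache("How do I reset my password?"))     # False
```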

When NOT to Use Semantic Caching

A cache that returns a wrong, stale, or contextually mismatched answer is worse than no cache. Applied in the wrong contexts, semantic caching silently degrades quality, and in regulated environments it creates compliance liability.

🚫 The cache returns a past response to a new user. If that response was wrong, personalized, or time sensitive, the cache amplifies the mistake at scale. One bad cached entry can poison thousands of downstream responses.

Quick Decision Checklist

Before adding any query or tool call to the cache, run through these six checks:

  • Is the answer the same regardless of who asks it? If no → skip cache.
  • Could the answer change within your TTL window? If yes → shorten TTL or skip cache.
  • Does the query contain personal identifiers? If yes → bypass cache unconditionally.
  • Is the domain regulated (medical, legal, financial)? If yes → get explicit policy approval before caching.
  • Is the expected hit rate above ~15%? If no → cache overhead likely exceeds the savings.
  • Has the response passed output validation? If no → never cache unvalidated LLM outputs.
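The six checks above can be collapsed into a single gate that runs before anything is inserted into the cache. This is an illustrative sketch; it assumes the boolean inputs and the hit-rate estimate are computed upstream:

```python
def should_cache(user_independent: bool,
                 may_change_within_ttl: bool,
                 has_personal_identifiers: bool,
                 regulated_domain_approved: bool,
                 expected_hit_rate: float,
                 passed_output_validation: bool) -> bool:
    """Gate a (query, response) pair before cache insertion."""
    return (user_independent                  # same answer for every asker
            and not may_change_within_ttl     # stable within the TTL window
            and not has_personal_identifiers  # no PII -> safe to share
            and regulated_domain_approved     # policy sign-off where required
            and expected_hit_rate >= 0.15     # otherwise overhead > savings
            and passed_output_validation)     # never cache unvalidated output
```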

Implementation: Qdrant (Docker) + Repository Pattern

Qdrant is a purpose-built, open source vector database written in Rust. It is an ideal default backend for a semantic cache: it stores vectors and response payloads together, supports payload filtering for TTL-style expiry, and runs in Docker in under 30 seconds.

The Repository Pattern wraps the backend behind a clean interface. Your RAG pipeline and agent loop only ever talk to that interface; they never import Qdrant or FAISS directly. Swapping backends is one environment variable.

Qdrant runs as a Docker container (here via Docker Desktop).
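For reference, a typical way to start Qdrant locally (image name and ports per the official Docker distribution; add volume mounts for persistence as needed):

```shell
# Start Qdrant in the background; 6333 = HTTP API, 6334 = gRPC API
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant
```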

Repository Pattern — Semantic Caching Implementation

This project applies the Repository Pattern to decouple semantic cache storage from application logic. Every consumer (the agent loop, RAG pipeline, HR pipeline, and tool cache) depends only on the CacheRepository abstract interface, never on a concrete backend such as Qdrant or FAISS. Backends are swappable via configuration.

The payoff: swap backends by changing a single environment variable. Qdrant runs in production, FAISS runs locally for zero Docker development, and adding Redis or Pinecone means creating one file and one factory registration. No application code changes.

DESIGN PRINCIPLE

Application code never imports QdrantCache or FaissCache directly. Everything flows through get_cache(), which returns a CacheRepository. Storage concerns are fully isolated from business logic.

Python · src/cache_repository.py

from abc import ABC, abstractmethod
from typing import Optional, Tuple
import numpy as np

class CacheRepository(ABC):
    """Abstract base class for semantic cache backends."""

    @abstractmethod
    def lookup(self, v_q: np.ndarray, threshold: float = 0.92,
               namespace: Optional[str] = None) -> Optional[str]:
        """Return cached response if similarity >= threshold, else None."""
        ...

    @abstractmethod
    def lookup_with_score(self, v_q: np.ndarray, threshold: float = 0.92,
                          namespace: Optional[str] = None
                         ) -> Tuple[Optional[str], float]:
        """Same as lookup but also returns the similarity score."""
        ...

    @abstractmethod
    def insert(self, v_q: np.ndarray, response: str,
               namespace: Optional[str] = None,
               ttl_hours: Optional[int] = None,
               query: Optional[str] = None) -> None:
        """Store query embedding and response in the cache."""
        ...

    @abstractmethod
    def invalidate(self, max_age_hours: int = 24) -> int:
        """Remove entries older than max_age_hours. Returns count removed."""
        ...

WHY LOOKUP_WITH_SCORE?

lookup_with_score exists because consumers occasionally need the raw similarity score, not just a hit/miss boolean. The tool cache uses it to log near misses (high score, below threshold) for threshold tuning. The agent loop uses it to decide whether to show a cache-hit indicator in the UI.

Factory: cache_factory.get_cache()

src/cache_factory.py
The factory reads CACHE_BACKEND and creates the appropriate concrete implementation, storing it as a module-level singleton. Every subsequent call returns the same instance, avoiding reconnections to Qdrant or repeated FAISS index construction.

Python · src/cache_factory.py
import os
from .cache_repository import CacheRepository

_cache: CacheRepository | None = None

def get_cache() -> CacheRepository:
    """Return singleton backend based on CACHE_BACKEND env var."""
    global _cache
    if _cache is not None:
        return _cache

    backend = os.getenv("CACHE_BACKEND", "qdrant")
    if backend == "qdrant":
        from .qdrant_cache import QdrantCache
        _cache = QdrantCache(
            url=os.getenv("QDRANT_URL", "http://localhost:6333"),
            collection_name=os.getenv("QDRANT_COLLECTION", "semantic_cache")
        )
    elif backend == "faiss":
        from .faiss_cache import FaissCache
        _cache = FaissCache()
    else:
        raise ValueError(f"Unknown CACHE_BACKEND: {backend}")

    return _cache


Standard consumer pattern

from src.cache_factory import get_cache

def answer(user_query: str, v_q) -> str:
    cache = get_cache()          # always returns a CacheRepository

    # Step 1: check the semantic cache before running the pipeline
    cached = cache.lookup(v_q, threshold=0.85)
    if cached:
        return cached            # LLM is never called on a hit

    # Step 2: cache miss -> run the full RAG pipeline
    response = run_pipeline(user_query)

    # Step 3: store for future paraphrased queries
    cache.insert(v_q, response, query=user_query)
    return response

Adding a New Backend

Adding a backend (Redis, Pinecone, Azure Cache) requires exactly three steps. No changes are needed in agent_loop.py, tool_cache.py, hr_pipeline.py, or main.py.

Step 1 — Implement the interface

Python · src/redis_cache.py  (new file)
from .cache_repository import CacheRepository
from typing import Optional, Tuple
import numpy as np

class RedisCache(CacheRepository):
    def __init__(self, url: str = 'redis://localhost:6379'):
        import redis
        self.client = redis.from_url(url)

    def lookup(self, v_q, threshold=0.92, namespace=None) -> Optional[str]:
        response, _ = self.lookup_with_score(v_q, threshold, namespace)
        return response

    def lookup_with_score(self, v_q, threshold=0.92,
                          namespace=None) -> Tuple[Optional[str], float]:
        # implement vector similarity search via Redis VSIM or custom hashing
        ...

    def insert(self, v_q, response, namespace=None,
               ttl_hours=None, query=None) -> None: ...

    def invalidate(self, max_age_hours=24) -> int: ...


Step 2 — Register in the factory

Python · src/cache_factory.py  (add one elif)
    elif backend == "redis":
        from .redis_cache import RedisCache
        _cache = RedisCache(url=os.getenv("REDIS_URL", "redis://localhost:6379"))


Step 3 — Set the environment variable

.env
CACHE_BACKEND=redis
REDIS_URL=redis://localhost:6379


ZERO APPLICATION CHANGES

After these three steps every consumer (agent_loop, tool_cache, hr_pipeline, main) automatically uses Redis. No imports to update, no signatures to change. This is the Repository Pattern payoff.

The Repository Pattern separates what the cache does (the interface) from how it stores data (the backend). This is the property that makes the system testable with FAISS, production-ready with Qdrant, and extensible without touching application code.

Request flow

Output/Testing

SEMANTIC CACHING — ALL QUERIES + TOOL CACHE EXAMPLES

TOOL CACHE TEST (same tool, same param — Call 1 then Call 2):
get_company_policy(topic='remote work') — Call 1: MISS, Call 2: HIT

Query #1
PROMPT: What is our sick leave policy?
RESPONSE: Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.
QUERY CACHE: MISS | TOOL CACHE: get_company_policy(topic='sick leave'): MISS | TIME: 2.48s

Query #2
PROMPT: Tell me about our sick leave policy
RESPONSE: Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.
QUERY CACHE: HIT | TOOL CACHE: skipped (query cached) | TIME: 0.17s

Query #3
PROMPT: What is our sick leave policy?
RESPONSE: Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.
QUERY CACHE: HIT | TOOL CACHE: skipped (query cached) | TIME: 0.16s

Query #4
PROMPT: Tell me about vacation days
RESPONSE: Employees are entitled to 20 days of paid time off (PTO) per year for vacation. Requests for vacation days must be submitted at least 2 weeks in advance.
QUERY CACHE: MISS | TOOL CACHE: get_company_policy(topic='vacation days'): MISS | TIME: 2.59s

Query #5
PROMPT: Tell me about vacation
RESPONSE: Employees are entitled to 20 days of paid time off (PTO) per year for vacation. Requests for vacation days must be submitted at least 2 weeks in advance.
QUERY CACHE: HIT | TOOL CACHE: skipped (query cached) | TIME: 0.22s

Query #6
PROMPT: How many PTO days for vacation?
RESPONSE: You have 20 PTO days for vacation per year. Please note that you must request your time off at least 2 weeks in advance.
QUERY CACHE: MISS | TOOL CACHE: get_company_policy(topic='PTO days for vacation'): HIT | TIME: 2.21s

=============================================================

TOOL CACHE EXAMPLES (tool call → result, from this run):

  1. get_company_policy(topic='remote work') → Call 1: MISS, Call 2: HIT
     Result: Hybrid: 2 days in office required. Equipment provided.

  2. get_company_policy(topic='sick leave') → MISS
     Result: Employees get 15 sick days per year. Doctor's note required after 3 consecutive days.

  3. get_company_policy(topic='vacation days') → MISS
     Result: 20 days PTO per year. Must request 2 weeks in advance.

  4. get_company_policy(topic='PTO days for vacation') → HIT
     Result: 20 days PTO per year. Must request 2 weeks in advance.

==============================================================

Total: 7.83s for 6 queries

Conclusion

Semantic caching turns repeated or similar questions into instant answers instead of expensive LLM and tool calls. By using embeddings and vector similarity instead of exact string matching, you can cache responses for "What is our sick leave policy?" and serve them when users ask "Tell me about sick leave", "How many sick days do we get?", or other paraphrased variants.

The Repository Pattern keeps the implementation clean and flexible: you can run with FAISS for local development and switch to Qdrant for production without changing application code. The two-layer approach (query cache for full responses, tool cache for tool results) gives you fine-grained control over what gets cached and when.

Key takeaways:

Meaning over exact match: Embeddings capture intent, so paraphrases and synonyms hit the cache.

Two layers: Cache full responses and tool results separately; each has its own threshold and namespace.

Swappable backends: Use Qdrant for persistence or FAISS for fast, in-memory testing.

Know when to skip: Avoid caching personal, entity-specific, or rapidly changing data.

Thanks
Sreeni Ramadorai
