Ashwin Hariharan for Redis

Posted on • Originally published at ashwinhariharan.com

Semantic Caching in Agentic AI: Determining Cache Eligibility and Invalidation

A few years ago, "AI in your app" meant a chatbot that answered FAQs. Today, it means an agent that can search, filter, compare, book, and transact - all while remembering what you said three messages ago.

Amazon's Rufus has already fielded tens of millions of questions from shoppers - product comparisons like "What's the difference between OLED and QLED TVs?", recommendations ("best wireless outdoor speakers"), and specific use cases ("lawn games for kids' birthday parties"). Booking.com's AI Trip Support does the same for travelers - a guest with a car asks "Is parking available at the hotel?" and the agent pulls property information and responds in seconds.

At any given moment, thousands of those questions are just variations of something that's already been asked and answered.

Every one of those repeated questions triggers an expensive, high-latency LLM call, and often database lookups too. And these assistants aren't just answering questions - they search, filter, compare, book, and carry context across an entire conversation. That one distinction changes everything.

This is exactly the problem semantic caching was built to solve. For a standard Retrieval-Augmented Generation (RAG) application, it works really well - convert commonly-asked questions into vector embeddings and store the question-response pairs in a cache. The next time someone asks something semantically similar, you simply return the cached answer.
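As a toy sketch of that idea: the cache stores (embedding, response) pairs and returns a stored answer whenever a new query's embedding is close enough to one it has seen. The `embed` function below is a bag-of-words stand-in for a real embedding model, just to keep the example self-contained - in practice you'd use a proper model and a vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToySemanticCache:
    def __init__(self, threshold: float = 0.7):
        self.entries = []  # list of (embedding, response) pairs
        self.threshold = threshold

    def store(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def check(self, query: str):
        # Return the cached response for the closest stored query,
        # but only if it's similar enough
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None

cache = ToySemanticCache()
cache.store(
    "what is the difference between OLED and QLED TVs",
    "OLED pixels emit their own light; QLED panels use a backlight...",
)
print(cache.check("difference between OLED and QLED TVs"))  # cache hit
print(cache.check("best lawn games for kids"))              # None (miss)
```

A rephrased question lands close to the stored one in embedding space and gets the cached answer; an unrelated question misses and would fall through to the LLM.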

But when an AI agent can take actions and produce responses that depend on who's asking and what they've already done, you can't cache everything the same way. Cache the wrong response, and you're serving someone else's cart. Cache too aggressively, and your users get stale prices and out-of-stock recommendations.

So the question isn't just "can we cache this?" - it's "should we, and for how long?" That's a much harder problem.

Table of Contents

  1. Caching in Agentic AI
  2. Strategies for Semantic Caching
  3. Handling Multi-Turn Conversations
  4. Production Considerations

Caching in Agentic AI

Agentic AI introduces caching challenges that a conventional RAG pipeline never has to deal with. With AI agents, state is maintained across turns, tools are called, multi-step workflows are executed, and responses depend on context that keeps changing. That changes the caching problem considerably. Let's explore what that looks like in practice!

To see how these challenges play out, I'll use an AI-powered e-commerce application as a running example. Think of a shopping assistant that helps users search for products, manage their cart, and answer product-related questions. For such an application, here's what the workflow might look like:

  1. User query comes in
  2. Check semantic cache first
  3. If cache hit → return cached response (fast path)
  4. If cache miss → invoke AI agent with tools
  5. Agent processes query, calls tools as needed
  6. Store the agent's response to a semantic cache with appropriate TTL
  7. Return response to user

Here's a basic implementation in LangGraph (assume `agent`, `check_semantic_cache`, and `save_to_semantic_cache` are defined elsewhere):

from langgraph.graph import StateGraph, START, END
from typing import TypedDict, List

class AgentState(TypedDict):
    messages: List[dict]
    session_id: str
    cache_status: str
    result: str
    tools_used: List[str]

# Node 1: Check semantic cache
async def query_cache_check(state: AgentState) -> AgentState:
    query = state["messages"][-1]["content"]

    # Check if we have a semantically similar cached response
    cached_result = await check_semantic_cache(query)

    if cached_result:
        return {
            **state,
            "cache_status": "hit",
            "result": cached_result
        }

    return {
        **state,
        "cache_status": "miss"
    }

# Node 2: AI Agent with tools
async def agent_node(state: AgentState) -> AgentState:
    # Invoke LLM with tools
    result = await agent.invoke(state["messages"])

    return {
        **state,
        "result": result["output"],
        "tools_used": result["tools_called"]
    }

# Node 3: Cache the result
async def cache_result_node(state: AgentState) -> AgentState:
    query = state["messages"][-1]["content"]

    # Picking the right TTL is the subject of the next section;
    # use a fixed default for now
    ttl = 6 * 60 * 60  # 6 hours
    await save_to_semantic_cache(query, state["result"], ttl)

    return state

# Build the graph
graph = StateGraph(AgentState)

# Add nodes
graph.add_node("cache_check", query_cache_check)
graph.add_node("agent", agent_node)
graph.add_node("cache_result", cache_result_node)

# Define edges
graph.add_edge(START, "cache_check")

# If cache hit, go straight to END; if miss, run the agent
def should_invoke_agent(state: AgentState) -> str:
    return END if state["cache_status"] == "hit" else "agent"

graph.add_conditional_edges("cache_check", should_invoke_agent)

graph.add_edge("agent", "cache_result")
graph.add_edge("cache_result", END)

workflow = graph.compile()

This gets us a working agentic pipeline with semantic caching baked in. But there's a bigger question lurking underneath.

The hardest part of semantic caching is deciding what to cache and for how long.

CommitStrip Comic

"There are only two hard things in Computer Science: cache invalidation and naming things."

— Phil Karlton

Strategies for Semantic Caching

Not all agent operations have the same caching requirements. Cache invalidation was already hard - agentic AI just made it harder.

For example, for a shopping AI agent:

  • Product searches depend on inventory (changes hourly)
  • Product information is relatively static (changes rarely)
  • Cart operations are personal (shouldn't be cached)
  • General shopping advice can be timeless (can use a much longer TTL)

How do you make this decision programmatically?

Let's explore four approaches.

Approach 1: String-Based Pattern Matching

The most straightforward approach is to scan the user's query for keywords.

def determine_cache_ttl_by_string(query: str) -> int:
    query = query.lower()  # keyword matching should be case-insensitive

    # Long TTL for product information (relatively static)
    product_info_keywords = ['what is', 'tell me about', 'specs', 'features']
    if any(keyword in query for keyword in product_info_keywords):
        return 24 * 60 * 60  # 24 hours

    # Short TTL for product searches (inventory changes)
    search_keywords = ['find', 'search', 'show me', 'looking for']
    if any(keyword in query for keyword in search_keywords):
        return 2 * 60 * 60  # 2 hours

    # Don't cache personal operations
    personal_keywords = ['my cart', 'add to cart', 'my order', 'checkout']
    if any(keyword in query for keyword in personal_keywords):
        return 0  # Don't cache

    # Default
    return 6 * 60 * 60  # 6 hours

Usage:

query = "What are the features of the MacBook Pro?"
ttl = determine_cache_ttl_by_string(query)  # Returns 86400 (24 hours)

await save_to_semantic_cache(query, response, ttl)

Now, this approach is a decent starting point to understand the problem, but I wouldn't ship this in production. The upside is that it's dead simple to implement, fast (no LLM calls involved), and easy to debug. The patterns are explicit, so you always know why a decision was made.

But the downside is that it's fragile. "Show my cart" and "View my cart" might behave differently depending on which keywords you've defined. It has no understanding of context or intent, and every time your app evolves, someone has to go back and update the keyword list.

Approach 2: LLM-Based Decision Making

Instead of hardcoding rules, why not delegate the decision to the LLM?

from langchain_openai import ChatOpenAI

async def determine_cache_ttl_by_llm(query: str) -> int:
    """
    Use an LLM to intelligently determine cache TTL
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    prompt = f"""Analyze this user query and determine the appropriate cache TTL.

Query: "{query}"

Categories:
- PERSONAL: User-specific operations (cart, orders, profile). Return: 0 (don't cache)
- PRODUCT_INFO: Product details, specs, features. Return: 86400 (24 hours)
- ADVICE: Shopping recommendations, buying guides. Return: 43200 (12 hours)
- SEARCH: Product searches, availability checks. Return: 7200 (2 hours)
- DEFAULT: Anything else. Return: 21600 (6 hours)

Return ONLY the TTL number (in seconds), nothing else."""

    response = await llm.ainvoke(prompt)

    # Guard the parse: fall back to the default TTL if the model
    # returns anything other than a bare number
    try:
        return int(response.content.strip())
    except ValueError:
        return 21600  # 6 hours

Usage:

query = "Can you add this laptop to my cart?"
ttl = await determine_cache_ttl_by_llm(query)  # Returns 0 (personal operation)

query = "What are the specs of the iPhone 15?"
ttl = await determine_cache_ttl_by_llm(query)  # Returns 86400 (product info)

The upsides are real, with some tradeoffs as well.

The big win here is that it actually understands what the query means. Ambiguous or oddly phrased queries are handled gracefully, there's no keyword list to maintain, and it can reason about edge cases that a string matcher would simply miss.

But every cache decision now requires an LLM call - adding 200–500ms of latency before you've even answered the user's question. It also costs money on every single request, even if small models are relatively cheap. Because LLMs are non-deterministic, the same query might get a different TTL on different days, making behavior harder to predict. If your prompt isn't well-crafted, you'll get inconsistent decisions that are difficult to debug. And it's hard to write meaningful unit tests for LLM decisions.

So this approach is a good fit if you can stomach the latency and cost in exchange for smarter decisions.
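One way to soften the non-determinism is to validate the model's reply before trusting it. Here's a hedged sketch - `parse_ttl_reply` is a hypothetical helper, and the allowlist mirrors the categories from the prompt above:

```python
# TTLs the prompt actually asks for; anything else falls back to the default
ALLOWED_TTLS = {0, 7200, 21600, 43200, 86400}
DEFAULT_TTL = 21600  # 6 hours

def parse_ttl_reply(raw: str) -> int:
    """Parse the LLM's raw reply into a TTL, defaulting on anything suspect."""
    try:
        ttl = int(raw.strip().split()[0])
    except (ValueError, IndexError):
        return DEFAULT_TTL  # model returned prose, not a number
    return ttl if ttl in ALLOWED_TTLS else DEFAULT_TTL

print(parse_ttl_reply("86400"))          # 86400
print(parse_ttl_reply("maybe 24 hours")) # 21600 (fallback)
print(parse_ttl_reply("12345"))          # 21600 (not an allowed TTL)
```

This keeps a misbehaving model from writing a year-long TTL into your cache.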

Approach 3: Tool-Based Decision Making

In agentic AI, the tools an agent calls can reveal the nature of an operation better than the query text alone. So instead of analyzing what the user asked, look at what tools the agent used.

def determine_tool_based_cache_ttl(tools_used: List[str]) -> int:

    # Don't cache personal/dynamic operations
    personal_tools = [
        'add_to_cart',
        'view_cart',
        'clear_cart',
        'checkout',
        'view_order_history'
    ]

    # Long TTL for product information (changes rarely)
    product_info_tools = [
        'get_product_details',
        'get_product_specs',
        'get_product_reviews'
    ]

    # Medium TTL for shopping advice (relatively stable content)
    advice_tools = [
        'get_recommendations',
        'compare_products',
        'answer_general_question'
    ]

    # Short TTL for product searches (inventory changes)
    search_tools = [
        'search_products',
        'check_availability',
        'filter_by_category'
    ]

    if any(tool in tools_used for tool in personal_tools): # Personal operations detected
        return 0  # Don't cache

    if any(tool in tools_used for tool in product_info_tools):
        return 24 * 60 * 60  # 24 hours

    if any(tool in tools_used for tool in advice_tools): # Shopping advice detected
        return 12 * 60 * 60  # 12 hours

    if any(tool in tools_used for tool in search_tools):
        return 2 * 60 * 60  # 2 hours

    # Default for direct responses or unknown tools
    return 6 * 60 * 60  # 6 hours

Advantages of Tool-Based Caching

In practice, this approach adds up to some real advantages:

1. No Additional LLM Calls:
The agent already executed - you're just reading which tools it used. No extra latency or cost.

2. Deterministic and Testable:

Because it's working from what the agent actually did - not what the user said - the decisions tend to be reliable.

def test_cache_ttl_decisions():
    # Personal operations shouldn't cache
    assert determine_tool_based_cache_ttl(['add_to_cart']) == 0
    assert determine_tool_based_cache_ttl(['view_cart', 'search_products']) == 0

    # Product info gets long TTL
    assert determine_tool_based_cache_ttl(['get_product_details']) == 86400

    # Product searches get short TTL
    assert determine_tool_based_cache_ttl(['search_products']) == 7200

Tests are simple, fast, and reliable.

3. Intent-Based, Not Text-Based:

The query "show me options" could mean:

  • Show cart → don't cache (uses view_cart tool)
  • Show products → cache for 2 hours (uses search_products tool)

Tools reveal actual intent.

4. Easy to Extend:

Adding a new tool? Just update the mapping:

# New tool for checking order status
personal_tools = [
    'add_to_cart',
    'view_cart',
    'clear_cart',
    'check_order_status'  # ← New tool
]

5. Works Across Different Query Phrasings:

However the user phrases it, if the same tool gets called, the same caching decision gets made. For example, these queries use the same tool → same caching behavior:

  • "Add milk to my cart"
  • "Put milk in cart"
  • "I want to add milk"

In each case, the agent calls add_to_cart, and that's enough to know: TTL = 0, don't cache.

Check out an implementation on GitHub

That said, there's one thing worth keeping in mind. Since the caching decision is made after the agent runs, the very first request always goes to the LLM - there's no way to short-circuit that. And because the same query can sometimes invoke different tools depending on context or agent state, caching behavior can occasionally be inconsistent in ways that are hard to predict.
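One nice side effect of tool-derived categories is that they give you an invalidation handle: if each cache entry is tagged with the category that produced it, you can drop a whole class of entries the moment the underlying data changes. A minimal in-memory sketch (the store itself is hypothetical - in production the entries would live in Redis):

```python
import time

class TaggedCache:
    """In-memory sketch of a cache whose entries carry the category that
    produced them, so a whole class of entries can be invalidated at once."""

    def __init__(self):
        self.entries = {}  # query -> (response, category, expires_at)

    def store(self, query, response, category, ttl):
        if ttl > 0:  # TTL of 0 means "don't cache"
            self.entries[query] = (response, category, time.time() + ttl)

    def get(self, query):
        entry = self.entries.get(query)
        if entry and entry[2] > time.time():
            return entry[0]
        return None

    def invalidate_category(self, category):
        # Drop every entry produced by this category of tools
        self.entries = {q: e for q, e in self.entries.items() if e[1] != category}

cache = TaggedCache()
cache.store("find wireless headphones", "...search results...", "search", 7200)
cache.store("iphone 15 specs", "...spec sheet...", "product_info", 86400)

# Inventory just changed: drop every cached search, keep product info
cache.invalidate_category("search")
print(cache.get("find wireless headphones"))  # None
print(cache.get("iphone 15 specs"))           # ...spec sheet...
```

Without the tags, your only options when inventory changes are to flush everything or let stale search results expire on their own.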

Approach 4: Semantic Routing

Semantic routing uses vector embeddings to classify user queries into pre-defined categories or labels based on their semantic meaning. This enables some really useful patterns:

Instead of doing this:

User query → LLM → analyze tools → determine cache TTL

You can do this:

User query → vector classifier → category → determine cache TTL

How it works with RedisVL:

RedisVL provides a SemanticRouter that simplifies semantic routing using Redis for vector storage.

from redisvl.extensions.router import SemanticRouter
from redisvl.extensions.router import Route
from redisvl.utils.vectorize import HFTextVectorizer

# Define routes with examples and TTLs

personal_operations = Route(
    name="personal",
    references=[
        "show my cart",
        "add to cart",
        "view my orders",
        "checkout",
        "my purchase history"
    ],
    metadata={"ttl": 0}  # Don't cache
)

product_info = Route(
    name="product_info",
    references=[
        "what are the specs",
        "tell me about this product",
        "product features",
        "product details"
    ],
    metadata={"ttl": 86400}  # 24 hours
)

shopping_advice = Route(
    name="shopping_advice",
    references=[
        "what should I buy",
        "recommend a laptop",
        "best headphones",
        "which one is better"
    ],
    metadata={"ttl": 43200}  # 12 hours
)

# Create semantic router
router = SemanticRouter(
    name="cache_router",
    routes=[personal_operations, product_info, shopping_advice],
    vectorizer=HFTextVectorizer(),
    redis_url="redis://localhost:6379"
)

async def classify_query_semantically(query: str) -> tuple[str, int]:
    """
    Classify query using RedisVL semantic router
    Returns (category_name, cache_ttl)
    """
    # Route the query to the best matching category
    result = router(query)

    if result.name is None:
        return "default", 21600  # no route matched; fall back

    # Pull the TTL from the matched route's metadata
    route = next(r for r in router.routes if r.name == result.name)
    ttl = route.metadata["ttl"]

    print(f"🎯 Classified '{query}' as '{result.name}' (distance: {result.distance:.3f})")

    return result.name, ttl

Then in your langgraph code, you can do this:

# Integration with workflow
async def determine_cache_ttl_by_routing(state: AgentState) -> AgentState:
    query = state["messages"][-1]["content"]

    # First, classify the query semantically
    category, ttl = await classify_query_semantically(query)

    # If it's a personal query, skip cache entirely
    if category == "personal":
        return {
            **state,
            "cache_status": "skip",
            "category": category
        }

    # Check semantic cache for non-personal queries
    cached_result = await check_semantic_cache(query)

    if cached_result:
        return {
            **state,
            "cache_status": "hit",
            "result": cached_result,
            "category": category
        }

    return {
        **state,
        "cache_status": "miss",
        "category": category,
        "planned_ttl": ttl  # Pass TTL to caching node
    }

Usage example:

# User asks
query = "what are the specs of the MacBook Pro?"

# Semantic router classifies
category, ttl = await classify_query_semantically(query)
# Output: 🎯 Classified 'what are the specs of the MacBook Pro?' as 'product_info' (distance: 0.088)
# Returns: ("product_info", 86400)

# Different phrasing, same classification
query = "tell me about the MacBook Pro features"
category, ttl = await classify_query_semantically(query)
# Output: 🎯 Classified 'tell me about the MacBook Pro features' as 'product_info' (distance: 0.111)
# Returns: ("product_info", 86400)

Advantages of Semantic Routing

Using vector embeddings to classify intent turns out to be a surprisingly versatile idea:

1. Works Anywhere in the Pipeline:

Most caching approaches are tied to a specific point in the request lifecycle. Semantic routing, however, can fit wherever you need it:

  • Before cache check: Skip cache entirely for personal queries
  • Before agent execution: Route to specialized agents per category
  • After agent execution: Validate caching decisions
  • For query preprocessing: Clean or enrich queries before processing

2. No LLM Calls:

The classification is done entirely with an embedding model and a vector similarity lookup - no LLM involved. That keeps it fast (20–50ms) and cheap (a fraction of a cent per call), compared to 200–500ms and ~$0.001 for an LLM-based decision.

3. Deterministic and Testable:

Because classification is based on vector similarity against fixed reference examples, the behavior is consistent. You can write unit tests for it:

async def test_semantic_routing():
    # Personal queries
    category, ttl = await classify_query_semantically("show my cart")
    assert category == "personal" and ttl == 0

    category, ttl = await classify_query_semantically("add laptop to cart")
    assert category == "personal" and ttl == 0

    # Product info queries
    category, ttl = await classify_query_semantically("what are the specs of iPhone 15")
    assert category == "product_info" and ttl == 86400

    category, ttl = await classify_query_semantically("tell me about the MacBook features")
    assert category == "product_info" and ttl == 86400

4. Easy to Tune:

Want better classification? Add more reference examples to your routes:

router.add_route_references(
    route_name="product_info",
    references=[
        "technical specifications",
        "product description"
    ]
)

RedisVL automatically updates the route embeddings when you add new references.

5. Multi-Category Support:

RedisVL's semantic router can return confidence scores for multiple routes, allowing you to handle edge cases:

# Get top N routes instead of just the best match
result = router(query, return_top_k=3)

# Can inspect multiple categories and their scores
# Then combine TTLs or route to multiple agents based on confidence

One limitation worth noting: when a query is genuinely ambiguous - with no additional context to anchor it - the router will still pick the closest matching route rather than flagging uncertainty. So, this technique works best when queries are reasonably self-contained.
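A simple mitigation is to trust a match only when its vector distance is below a cutoff, and fall back to a conservative default otherwise. Here's an illustrative sketch - the threshold is a placeholder you'd tune on real traffic, and RedisVL routes can also be configured with per-route distance thresholds to similar effect:

```python
DEFAULT_CATEGORY = "default"
DEFAULT_TTL = 21600  # 6 hours

def resolve_route(match_name, match_distance, ttl_by_category, max_distance=0.3):
    """Trust a route match only when it's close enough; otherwise fall back
    to a conservative default instead of guessing."""
    if match_name is None or match_distance > max_distance:
        return DEFAULT_CATEGORY, DEFAULT_TTL  # too ambiguous to trust
    return match_name, ttl_by_category[match_name]

ttls = {"personal": 0, "product_info": 86400, "shopping_advice": 43200}
print(resolve_route("product_info", 0.12, ttls))  # ('product_info', 86400)
print(resolve_route("product_info", 0.55, ttls))  # ('default', 21600)
```

The default TTL is deliberately short: when the router isn't sure, a shorter cache lifetime limits the blast radius of a wrong classification.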

Handling Multi-Turn Conversations

For stateless Q&A, where every query stands on its own without requiring session state or prior context, semantic caching works very well with the above approaches - one query, one caching decision. But in a conversational AI application, that's rarely the case.

Consider this exchange:

  1. Turn 1: "Show me wireless headphones under $100"
  2. Turn 2: "What about Sony ones?"
  3. Turn 3: "Are any of them waterproof?"

Turn 2 and Turn 3 are meaningless on their own. And this creates problems at two points in the caching pipeline.

  • On the cache check step: If you embed the raw message "What about Sony ones?" and search the cache, it could match all sorts of unrelated conversations - another user asking about Sony TVs, Sony cameras, anything Sony. You'd serve the wrong cached response, and the user would never know why the answer felt off.
  • On the cache storage step: Caching a raw message like "What about Sony ones?" without any context is just as problematic. That string could mean anything depending on the conversation, so you'd either over-match (returning wrong results) or under-match (never finding a valid hit).

The most practical solution is query rewriting. Before checking the cache, use an LLM to rewrite the current message into a self-contained, standalone question. This involves context selection and context compression techniques to pick the relevant turns from the conversation history and distill them into what's actually needed:

  • "What about Sony ones?" → "Show me Sony wireless headphones under $100"
  • "Are any of them waterproof?" → "Are any Sony wireless headphones under $100 waterproof?"

The rewritten query is what gets embedded and checked against the cache - and what gets stored, if there's a miss. This way, it doesn't matter how the user phrased it or how many turns deep they are. What gets cached is the intent, not the message.

The tradeoff is an extra LLM call per turn. But if you're already using an LLM for TTL decisions, you can combine both into a single call - rewrite the query and decide the TTL in one shot.
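As a sketch of that combined call, here's a hypothetical prompt builder - the category/TTL scheme mirrors the earlier examples, and the JSON response shape is just one reasonable convention:

```python
def build_rewrite_and_ttl_prompt(history, current_message):
    """Assemble one prompt that asks the LLM to (1) rewrite the message into
    a standalone query and (2) pick a cache TTL - two decisions, one call."""
    context = "\n".join(f"- {turn}" for turn in history[-5:])  # last 5 turns
    return f"""Conversation so far:
{context}

Current message: "{current_message}"

1. Rewrite the current message as a fully self-contained question.
2. Choose a cache TTL in seconds: 0 (personal), 86400 (product info),
   43200 (advice), 7200 (search), 21600 (anything else).

Respond as JSON: {{"query": "...", "ttl": <number>}}"""

prompt = build_rewrite_and_ttl_prompt(
    ["Show me wireless headphones under $100", "What about Sony ones?"],
    "Are any of them waterproof?",
)
print(prompt)
```

The rewritten `query` field is what you embed, check, and store; the `ttl` field feeds straight into the caching node.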

Production Considerations

For simple applications, a single TTL configured for the entire cache is often good enough. Agentic AI changes that - when your system takes actions and produces responses that depend on user state, a one-size-fits-all TTL becomes a liability. That's precisely the problem the approaches in this article are designed to solve.

But production semantic caching is messier than demos make it look.

The first thing to reconsider is the assumption that every LLM response should be automatically written back to the cache. In practice, that's risky. It's often preferable - especially for FAQ bots and customer support tools - to maintain a human-curated cache: a verified set of question-response pairs that are periodically reviewed and expanded based on analytics.

When you do auto-cache LLM responses, build an audit process around it. Have a human or another LLM periodically review recently cached entries for factual accuracy before they keep getting served to users.

Relatedly, LLMs don't always give clean, cacheable answers. Sometimes they ask a clarifying question. Sometimes they say they don't have enough information. Caching those responses and serving them to future users is its own category of problem. If your workflow supports these kinds of responses, add a classification step - have the LLM categorize its own output before deciding whether to cache it at all.
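Before reaching for an LLM classification step, cheap string heuristics can catch the obvious cases. A hedged sketch - the marker list is illustrative, not exhaustive:

```python
# Phrases that usually signal a non-answer (illustrative, not exhaustive)
NON_CACHEABLE_MARKERS = [
    "could you clarify",
    "i don't have enough information",
    "which one do you mean",
    "i'm not sure",
]

def looks_cacheable(response: str) -> bool:
    """Cheap pre-filter: reject clarifying questions and non-answers
    before they ever reach the cache."""
    lowered = response.lower()
    if response.rstrip().endswith("?"):  # likely a clarifying question
        return False
    return not any(marker in lowered for marker in NON_CACHEABLE_MARKERS)

print(looks_cacheable("The MacBook Pro has an M3 chip and 18GB of RAM."))  # True
print(looks_cacheable("Could you clarify which model you mean?"))          # False
```

Anything this filter rejects skips the cache outright; borderline cases can still go to an LLM self-classification pass.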

And then there's PII. If users are asking questions that contain personally identifiable information, it needs to be scrubbed before it hits the cache, or be user-scoped. It's non-negotiable if you're operating under GDPR or similar regulations.
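As an illustration only - real deployments should use a dedicated PII detection library rather than hand-rolled regexes - here's what a minimal scrub before caching might look like:

```python
import re

# Illustrative patterns only; real PII detection needs a proper library.
# Order matters: the card pattern must run before the looser phone pattern.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "<CARD>"),
    (re.compile(r"\b\+?\d[\d\s-]{8,}\d\b"), "<PHONE>"),
]

def scrub_pii(text: str) -> str:
    """Replace recognizable PII with placeholders before the text is cached."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(scrub_pii("Ship to jane@example.com, call 555-123-4567 if needed"))
# Ship to <EMAIL>, call <PHONE> if needed
```

Scrubbing handles the query text itself; user-scoping the cache keys handles the rest.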

If you want to see what this looks like in practice, here's an example of a dedicated compliance step built into an agent workflow.

Finally, the safest production deployments tend to be the ones with narrow scopes - an FAQ bot for account info, a separate one for offers, another for product details. The broader the scope, the harder it is to reason about what's safe to cache and for how long. If you're building a general-purpose agent, expect to invest significantly more in guardrails.


Tools like Redis can make implementing semantic caching much simpler and more reliable. If you'd like to dive deeper, the Redis and RedisVL documentation are a great place to start.