Debby McKinney
How to Set Up Semantic Caching in Your LLM Gateway

If you're running an LLM application with any volume of repeated or similar queries, you're paying for a lot of redundant API calls.

Exact-match caching catches identical requests. But users don't ask questions identically. "What are your business hours?" and "When are you open?" are the same question. Exact-match caching treats them as completely different.

Semantic caching fixes this. Here's how to set it up in Bifrost and where the gotchas are.

maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start


Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…


How Semantic Caching Actually Works

Bifrost doesn't replace exact-match caching with semantic search. It layers them.

Every incoming request first checks for an exact hash match. If that misses, it falls through to a vector similarity search against previously cached responses. The similarity search uses embeddings — by default, OpenAI's text-embedding-3-small — to find semantically equivalent requests.

If the similarity score exceeds your configured threshold, the cached response is returned. No API call to the LLM provider.

This two-step approach matters for performance. Exact hash lookups are sub-microsecond. Vector similarity searches take longer. Running everything through the vector store when most of your traffic has exact matches would add unnecessary latency.
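
To make the flow concrete, here's a toy Go sketch of that two-tier lookup. It is not Bifrost's internal code: the embed() stub, the in-memory caches, and the brute-force similarity scan are placeholders for illustration only.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math"
)

type entry struct {
	embedding []float64
	response  string
}

var (
	exactCache  = map[string]string{} // request hash -> cached response
	vectorCache []entry               // embeddings of previously cached requests
	threshold   = 0.8                 // similarity cutoff (Bifrost's default)
)

// embed stands in for a real embedding call (e.g. text-embedding-3-small).
func embed(prompt string) []float64 {
	vec := make([]float64, 8)
	for i, r := range prompt {
		vec[i%8] += float64(r)
	}
	return vec
}

func hashRequest(prompt string) string {
	sum := sha256.Sum256([]byte(prompt))
	return hex.EncodeToString(sum[:])
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup tries the cheap exact-hash tier first, then falls back to a
// vector similarity search. Returns (cachedResponse, hit).
func lookup(prompt string) (string, bool) {
	if resp, ok := exactCache[hashRequest(prompt)]; ok {
		return resp, true // "direct" hit
	}
	vec := embed(prompt)
	for _, e := range vectorCache {
		if cosine(vec, e.embedding) >= threshold {
			return e.response, true // "semantic" hit
		}
	}
	return "", false // miss: call the provider, then store both tiers
}

func store(prompt, response string) {
	exactCache[hashRequest(prompt)] = response
	vectorCache = append(vectorCache, entry{embed(prompt), response})
}

func main() {
	store("What are your business hours?", "We're open 9-5, Monday to Friday.")
	fmt.Println(lookup("What are your business hours?")) // exact hash hit
	fmt.Println(lookup("When are you open?"))            // hits only if similarity clears the threshold
}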

Semantic Caching - Bifrost

Intelligent response caching based on semantic similarity. Reduce costs and latency by serving cached responses for semantically similar requests.

docs.getbifrost.ai

Setting Up Your First Semantic Cache

You need two things in place before semantic caching works: a configured vector store and the semantic cache plugin enabled.

Step 1: Set up your vector store.

Semantic caching currently requires Weaviate. The broader Bifrost vector store framework supports Weaviate, Redis, and Qdrant, but the semantic cache plugin only works with Weaviate right now.

{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "localhost:8080",
      "scheme": "http"
    }
  }
}

For Weaviate Cloud, swap in your cluster URL and API key.
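
A cloud config would look roughly like this. Note that the api_key field name is my assumption, not something confirmed above, so check the Bifrost vector store docs for the exact key your version expects:

{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "your-cluster.weaviate.cloud",
      "scheme": "https",
      "api_key": "YOUR_WEAVIATE_API_KEY"
    }
  }
}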

Step 2: Enable the semantic cache plugin.

{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "ttl": "5m",
        "threshold": 0.8,
        "conversation_history_threshold": 3,
        "exclude_system_prompt": false,
        "cache_by_model": true,
        "cache_by_provider": true
      }
    }
  ]
}

Step 3: Add a cache key to your requests.

This is the part most people miss. Semantic caching doesn't activate without a cache key. No key, no caching.

# This request gets cached
curl -H "x-bf-cache-key: session-123" \
     -H "Content-Type: application/json" \
     -X POST http://localhost:8080/v1/chat/completions \
     -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

# This request bypasses caching entirely
curl -X POST http://localhost:8080/v1/chat/completions \
     -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

If you're using the Go SDK, set the cache key in the request context instead:

ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
response, err := client.ChatCompletionRequest(ctx, request)

Tuning Your Similarity Threshold

The threshold controls how similar two requests need to be before the cache considers them a match. Default is 0.8.

Here's what the ranges mean in practice:

0.8–0.85 gives you more cache hits but risks serving responses that aren't quite right. Good for FAQ-style applications where questions are genuinely interchangeable.

0.9–0.95 is stricter. Fewer false positives, fewer hits. Better for applications where subtle wording differences actually change the expected response.

You can override this per request if different parts of your application need different precision:

curl -H "x-bf-cache-key: session-123" \
     -H "x-bf-cache-threshold: 0.92" \
     ...

The Conversation History Threshold: Why Long Conversations Skip Caching

If you're building a chatbot or any multi-turn conversation, this setting is important.

Bifrost skips caching entirely when a conversation exceeds a certain number of messages. The default is 3.

Why? Two reasons. First, long conversations accumulate enough topic overlap that they semantically match unrelated conversations. A conversation about billing that later touches account setup looks similar to a completely different conversation that also covered both topics. The similarity scores are high. The cached responses are wrong.

Second, long conversations almost never produce exact hash matches anyway. The context shifts with every message. Running them through the vector store adds cost without improving hit rates.

{
  "conversation_history_threshold": 5
}

Raise it if your conversations stay focused on a single topic even as they grow longer. Lower it if you're seeing incorrect cached responses in multi-turn flows.
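
For a sense of when the skip kicks in: with the default of 3, the conversation below is already well past the limit, so the request bypasses both cache tiers no matter how similar a cached entry is.

curl -H "x-bf-cache-key: session-123" \
     -H "Content-Type: application/json" \
     -X POST http://localhost:8080/v1/chat/completions \
     -d '{
       "model": "gpt-4o-mini",
       "messages": [
         {"role": "user", "content": "How do I update my billing address?"},
         {"role": "assistant", "content": "Go to Settings > Billing and edit the address on file."},
         {"role": "user", "content": "Got it. How do I add a teammate?"},
         {"role": "assistant", "content": "Invite them from Settings > Team."},
         {"role": "user", "content": "Can they see billing info too?"},
         {"role": "assistant", "content": "Only if you give them the billing role."},
         {"role": "user", "content": "And how do I download past invoices?"}
       ]
     }'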


Per-Request TTL Overrides

Not every query in your application has the same freshness requirements.

A query about your product's current pricing needs a short TTL or no caching at all. A query about how to reset a password can cache for hours.

Override TTL per request instead of setting one global value:

# Short TTL for dynamic content
curl -H "x-bf-cache-key: pricing-lookup" \
     -H "x-bf-cache-ttl: 30s" \
     ...

# Longer TTL for stable content
curl -H "x-bf-cache-key: docs-query" \
     -H "x-bf-cache-ttl: 1h" \
     ...

Reading the Cache Debug Metadata

Every response from Bifrost includes cache debug information in extra_fields. This is how you know whether a response came from cache and what kind of match it was.

A direct (exact hash) hit looks like this:

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "direct",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001"
    }
  }
}

A semantic hit includes the similarity score and the embedding model used:

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": true,
      "hit_type": "semantic",
      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
      "threshold": 0.8,
      "similarity": 0.95,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 100
    }
  }
}

A miss still includes debug metadata showing the provider and model used and the input token count:

{
  "extra_fields": {
    "cache_debug": {
      "cache_hit": false,
      "provider_used": "openai",
      "model_used": "gpt-4o-mini",
      "input_tokens": 20
    }
  }
}

Use the cache_id from hits to clear specific entries if you ever need to invalidate a cached response:

curl -X DELETE http://localhost:8080/api/cache/clear/550e8500-e29b-41d4-a725-446655440001
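
If you'd rather script that, the cache_id can be pulled straight out of a hit's debug metadata. Here's a quick sketch with jq, assuming extra_fields sits at the top level of the response body as in the examples above:

# Grab the cache_id from a cached response, then invalidate that entry
CACHE_ID=$(curl -s -H "x-bf-cache-key: session-123" \
     -H "Content-Type: application/json" \
     -X POST http://localhost:8080/v1/chat/completions \
     -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.extra_fields.cache_debug.cache_id')

curl -X DELETE "http://localhost:8080/api/cache/clear/$CACHE_ID"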

Common Mistakes to Avoid

Forgetting the cache key. The most common one. If you're not seeing any cache hits, check whether you're sending x-bf-cache-key on your requests. Without it, caching is completely inactive.

Setting the threshold too low. A threshold of 0.7 or below will match requests that are semantically in the same neighborhood but aren't actually the same question. Your cache hit rate goes up. So do incorrect responses.

Using one cache key for everything. Cache keys provide isolation. If you use the same key across different users or sessions, you risk serving one user's cached response to another. Scope your cache keys to the unit that makes sense — session, user, or application context (see the example after this list).

Ignoring the conversation history threshold in chatbots. If you're building multi-turn conversations and seeing wrong cached responses, this is likely why. Lower the threshold or disable caching for long conversations.

Changing the embedding dimension without clearing the namespace. If you update the dimension config, old cached entries with the previous dimension will still be in the vector store. This causes retrieval failures. Either use a different namespace or set cleanup_on_shutdown: true before restarting.
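
On the scoping point above, a simple pattern is to build the key from whatever isolation unit you need. The key values here are just illustrative:

# Per-user isolation: user A's cached answers never leak to user B
curl -H "x-bf-cache-key: support-bot:user-42" \
     ...

# Per-context isolation: the docs assistant and the billing assistant keep
# separate cache namespaces even for identical questions
curl -H "x-bf-cache-key: docs-assistant" \
     ...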


System Prompts and Cache Keys

By default, system prompts are included in the cache key. Two requests with the same user message but different system prompts will not match.

This is the right default for most applications. A customer support prompt and a coding assistant prompt should never share cached responses.

If you have multiple system prompt variations for the same use case and want caching to focus purely on user message similarity, set exclude_system_prompt to true.
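
That's a single flag in the plugin config from earlier (other options omitted here for brevity):

{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "exclude_system_prompt": true
      }
    }
  ]
}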


Why This Matters for Your Application

Semantic caching is most effective for applications with repetitive query patterns — support bots, documentation Q&A, internal knowledge bases. The same questions get asked in dozens of different ways.

For applications where every query is unique — open-ended generation, creative writing, real-time data lookups — semantic caching adds overhead without payoff. Use the per-request controls to opt those traffic patterns out.

The dual-layer design means you get the speed of exact matching where it applies and the intelligence of semantic matching where it doesn't. The configuration granularity lets you tune both independently per use case.
