Exact-match caching is simple. Hash the request. Check the cache. Serve or miss.
It also breaks the moment someone rephrases the same question.
Bifrost's semantic cache solves this with vector similarity search. The implementation involved three design decisions that weren't obvious. Here's what they were and why.
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is running, with a web interface for visual configuration and real-time monitoring.
Decision 1: Two Layers, Not One
The first instinct is to replace exact-match caching with semantic caching entirely. We didn't.
Bifrost uses a dual-layer approach. Every request checks exact hash matching first. If that misses, it falls through to semantic similarity search.
The math is straightforward: exact hash lookups are sub-microsecond. Semantic similarity search requires a vector store query. Running every request through the vector store first adds latency that exact matching avoids entirely.
Implementation detail: exact hash hits return "hit_type": "direct". Semantic hits return "hit_type": "semantic". Both are visible in the response metadata.
{
"extra_fields": {
"cache_debug": {
"cache_hit": true,
"hit_type": "direct",
"cache_id": "550e8500-e29b-41d4-a725-446655440001"
}
}
}
Trade-off: the dual layer means one more code path to maintain. That maintenance cost is lower than the latency cost of pushing every request through vector search.
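To make the flow concrete, here is a minimal sketch of the read path. The types and names are hypothetical, not Bifrost's actual plugin code; the point is the ordering and the hit_type each layer reports.

package cache

import "context"

// Hypothetical types for illustration; Bifrost's actual plugin interfaces differ.
type cachedResponse struct {
	Body []byte
}

type exactStore interface {
	Get(requestHash string) (*cachedResponse, bool)
}

type vectorStore interface {
	// Nearest returns the closest stored entry and its similarity score.
	Nearest(ctx context.Context, embedding []float32) (*cachedResponse, float64, bool)
}

// lookup checks the exact-hash layer first and only queries the vector store
// on a miss. The string return mirrors the hit_type reported in cache_debug.
func lookup(ctx context.Context, direct exactStore, semantic vectorStore,
	requestHash string, embedding []float32, threshold float64) (*cachedResponse, string, bool) {
	if resp, ok := direct.Get(requestHash); ok {
		return resp, "direct", true // sub-microsecond hash lookup
	}
	if resp, score, ok := semantic.Nearest(ctx, embedding); ok && score >= threshold {
		return resp, "semantic", true // vector similarity hit
	}
	return nil, "", false // miss: call the provider, then write both layers
}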
Decision 2: Long Conversations Skip Caching
This one surprised us.
The ConversationHistoryThreshold setting tells the cache to skip caching entirely when a conversation exceeds a certain number of messages. Default is 3.
Why skip? Two reasons.
First: semantic false positives. A 10-message conversation about billing that later touches on account setup will semantically match a different conversation that also touched both topics. The similarity score will be high. The cached response will be wrong.
Second: direct cache inefficiency. Long conversations almost never produce exact hash matches. The conversation context changes with every message. Running them through the vector store adds cost without meaningfully improving hit rates.
{
"conversation_history_threshold": 3
}
Recommended values from the docs: 1-2 is very conservative. 3-5 is the balanced range. 10+ increases false positive risk.
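The gate itself is deliberately simple: a message-count check before either cache layer is consulted. A hypothetical sketch, not Bifrost's actual implementation:

package cache

type chatMessage struct {
	Role    string
	Content string
}

// skipCache sketches the gate: once a conversation grows past
// conversation_history_threshold, the request bypasses both the exact-hash
// layer and the vector store and goes straight to the provider.
func skipCache(messages []chatMessage, conversationHistoryThreshold int) bool {
	return len(messages) > conversationHistoryThreshold
}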
The threshold is a blunt instrument. It doesn't distinguish between a 4-message conversation that's tightly focused and one that has already drifted across topics. Both get the same treatment based purely on message count.
Decision 3: Cache Keys Are Mandatory
Semantic caching only activates when a cache key is provided. No cache key, no caching. The request goes straight to the provider.
This was a deliberate choice, not a limitation.
Without a mandatory cache key, every request becomes a potential cache write. In a multi-tenant gateway, that means one tenant's responses pollute the cache for another. Cache key isolation is the enforcement mechanism.
// This request WILL be cached
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
response, err := client.ChatCompletionRequest(ctx, request)
// This request will NOT be cached
response, err := client.ChatCompletionRequest(context.Background(), request)
The HTTP equivalent uses the x-bf-cache-key header.
Trade-off: developers have to opt in explicitly. The alternative — auto-caching everything — would require automatic isolation logic that's harder to reason about in production.
Per-Request Overrides
Both TTL and similarity threshold can be overridden per request. This matters because different use cases have different cache requirements on the same gateway.
Static documentation queries can use long TTLs and lower thresholds. Real-time data queries need short TTLs or no caching at all.
curl -H "x-bf-cache-key: session-123" \
-H "x-bf-cache-ttl: 30s" \
-H "x-bf-cache-threshold: 0.9" \
...
You can also force the cache to use only one layer:
# Direct hash matching only
curl -H "x-bf-cache-key: session-123" \
-H "x-bf-cache-type: direct" \
...
# Semantic similarity only
curl -H "x-bf-cache-key: session-123" \
-H "x-bf-cache-type: semantic" \
...
And there's a no-store mode: read from cache, but don't write the response back. Useful for read-only lookups where you don't want to pollute the cache.
System Prompts in Cache Keys
Whether to include system prompts in the cache key is configurable.
Include them (default) when the system prompt meaningfully changes the response. A customer support prompt and a coding assistant prompt should never share cached responses.
Exclude them when you have multiple system prompt variations for the same use case and want caching to focus on user message similarity.
{
"exclude_system_prompt": false
}
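One way to picture the setting: the system prompt either is or is not part of the text that gets hashed and embedded. A hypothetical sketch of that key construction follows; the names are illustrative, not Bifrost's.

package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// hashKeyInput builds the text that feeds the exact hash (and the embedding).
// With exclude_system_prompt set to true, two requests that differ only in
// their system prompt can share a cache entry.
func hashKeyInput(systemPrompt string, userMessages []string, excludeSystemPrompt bool) string {
	parts := userMessages
	if !excludeSystemPrompt {
		parts = append([]string{systemPrompt}, parts...)
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\n")))
	return hex.EncodeToString(sum[:])
}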
Streaming Support
The cache supports streaming responses end to end. A streaming accumulator assembles the complete response from the deltas before the cache write happens, and cached streaming responses are replayed with chunk ordering preserved.
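A rough sketch of what that accumulation could look like (hypothetical type; the real plugin hooks into Bifrost's streaming pipeline):

package cache

import "strings"

// streamAccumulator collects delta chunks in arrival order so the complete
// response can be written to the cache once the stream finishes, and so a
// later cache hit can replay the same chunks in the same order.
type streamAccumulator struct {
	chunks []string
}

// Add appends one streamed delta in the order it arrived.
func (a *streamAccumulator) Add(delta string) {
	a.chunks = append(a.chunks, delta)
}

// Chunks returns the ordered deltas for replaying a cached stream.
func (a *streamAccumulator) Chunks() []string {
	return a.chunks
}

// Complete joins the deltas into the full response body for the cache write.
func (a *streamAccumulator) Complete() string {
	return strings.Join(a.chunks, "")
}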
Vector Store
Semantic caching currently requires Weaviate as the vector store backend. The broader Bifrost vector store framework supports Weaviate, Redis, and Qdrant, but the semantic cache plugin itself works only with Weaviate for now.
Implementation detail: changing the embedding dimension in config without clearing the namespace results in mixed-dimension data. Either use a different namespace or set cleanup_on_shutdown: true before restarting.
What We'd Build Differently
The conversation history threshold is what we'd revisit first. A cutoff of 3 messages works for most cases, but as noted in Decision 2, it's a blunt instrument: a short conversation that's tightly focused and one that has already drifted across topics get exactly the same treatment, based purely on message count.
Cache key isolation is correct but puts the burden on the developer. The trade-off: explicit opt-in is easier to reason about in multi-tenant production environments than automatic isolation.
