TL;DR
Every GPT API call costs money and takes time. If your app sends the same (or very similar) prompts repeatedly, you are paying full price each time for answers you already have. Bifrost, an open-source LLM gateway, ships with a semantic caching plugin that uses dual-layer caching: exact hash matching plus vector similarity search. Exact cache hits cost zero; semantic matches cost only the embedding lookup. This post walks you through how it works and how to set it up.
The cost problem with GPT API calls
If you are building anything production-grade with GPT-4, GPT-4o, or any OpenAI model, you already know that API costs add up fast. Token-based pricing means every request burns through your budget, whether it is a fresh question or something your system answered three minutes ago.
Here is the thing: in most real applications, a significant portion of requests are either identical or semantically similar to previous ones. Think about it. Customer support bots get asked the same questions in slightly different words. Code assistants receive near-identical prompts from different users. RAG pipelines retrieve similar context and ask similar follow-ups.
Without caching, you pay full model cost for every single one of those requests. You also wait for the full round-trip to the provider each time, adding latency that your users notice.
The obvious fix is caching. But traditional exact-match caching has a big limitation: it only works when the prompt is character-for-character identical. Change one word, add a comma, rephrase slightly, and you get a cache miss. That is where semantic caching changes the game.
What semantic caching is and how it differs from exact-match caching
Exact-match caching hashes the entire request and looks up that hash. If the hash matches a stored response, you get a cache hit. If even one character is different, it is a miss. This works well for automated pipelines where prompts are templated and predictable. It falls apart for user-facing applications where people phrase things differently.
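To see why exact matching is so brittle, here is a minimal sketch of request hashing in Go. The `cacheKey` function is a hypothetical illustration, not Bifrost's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes the full request body. Any byte-level change
// produces a completely different key, which is why exact-match
// caching misses on even tiny rephrasings.
func cacheKey(requestBody string) string {
	sum := sha256.Sum256([]byte(requestBody))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := cacheKey(`{"model":"gpt-4","prompt":"How do I reset my password?"}`)
	b := cacheKey(`{"model":"gpt-4","prompt":"How do I reset my password?"}`)
	c := cacheKey(`{"model":"gpt-4","prompt":"How do I reset my password"}`) // one character dropped

	fmt.Println(a == b) // identical bodies share a key
	fmt.Println(a == c) // a single missing character is a miss
}
```

The second comparison fails even though a human would call the two prompts identical in intent; that gap is exactly what the semantic layer fills.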
Semantic caching converts the request into a vector embedding and searches for similar embeddings in a vector store. If a stored request is semantically similar enough (above a configurable threshold), the cached response is returned. This means "How do I reset my password?" and "What are the steps to change my password?" can both hit the same cache entry.
Bifrost combines both approaches in a dual-layer architecture, giving you the speed of exact matching with the intelligence of semantic similarity as a fallback.
How Bifrost implements dual-layer caching
Bifrost's semantic cache plugin uses a two-step lookup process for every request that has a cache key:
Layer 1: Exact hash match. The plugin hashes the request and checks for a direct match. This is the fastest path. If it hits, you get the cached response with zero additional cost. No embedding generation, no vector search, no provider call.
Layer 2: Semantic similarity search. If the exact match misses, Bifrost generates an embedding for the request and searches the vector store for semantically similar entries. If a match is found above the similarity threshold (default 0.8), the cached response is returned. The only cost here is the embedding generation.
If both layers miss, the request goes to the LLM provider as normal. The response is then stored in the vector store with its embedding for future lookups.
You can also control which layer to use per request. If you know your use case only needs exact matching (templated prompts), you can skip the semantic layer entirely. If you want semantic-only, that is an option too. The default is both, with direct matching first and semantic as fallback.
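To make the lookup order concrete, here is a minimal in-memory sketch of the dual-layer flow in Go. All the names here are illustrative, and the linear scan stands in for a real vector store (Bifrost uses Weaviate); it shows only the direct-first, semantic-fallback order:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math"
)

// entry pairs a cached response with the embedding of the request
// that produced it.
type entry struct {
	embedding []float64
	response  string
}

type cache struct {
	exact     map[[32]byte]string // layer 1: hash -> response
	semantic  []entry             // layer 2: scanned for similarity
	threshold float64             // e.g. 0.8, the default
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup tries the exact layer first (free), then the semantic
// layer (costs one embedding call).
func (c *cache) lookup(request string, embed func(string) []float64) (string, bool) {
	key := sha256.Sum256([]byte(request))
	if resp, ok := c.exact[key]; ok {
		return resp, true // layer 1 hit: zero cost
	}
	vec := embed(request)
	for _, e := range c.semantic {
		if cosine(vec, e.embedding) >= c.threshold {
			return e.response, true // layer 2 hit: embedding cost only
		}
	}
	return "", false // both layers missed; call the provider, then store
}

func (c *cache) store(request string, vec []float64, response string) {
	c.exact[sha256.Sum256([]byte(request))] = response
	c.semantic = append(c.semantic, entry{embedding: vec, response: response})
}

func main() {
	// Toy embedder with fixed 2-d vectors; real embeddings have
	// hundreds or thousands of dimensions.
	embed := func(s string) []float64 {
		if s == "What are the steps to change my password?" {
			return []float64{0.95, 0.05}
		}
		return []float64{1, 0}
	}
	c := &cache{exact: map[[32]byte]string{}, threshold: 0.8}
	c.store("How do I reset my password?", embed("How do I reset my password?"), "Go to Settings > Security.")

	_, hit := c.lookup("How do I reset my password?", embed)
	fmt.Println("exact hit:", hit)
	_, hit = c.lookup("What are the steps to change my password?", embed)
	fmt.Println("semantic hit:", hit)
}
```

The ordering matters: the hash check is a map lookup, so the expensive embedding call only happens when the cheap path fails.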
Here is how the cost breaks down:
| Scenario | LLM API Cost | Embedding Cost | Total Cost |
|---|---|---|---|
| Exact cache hit | Zero | Zero | Zero |
| Semantic cache hit | Zero | Embedding only | Minimal |
| Cache miss | Full model cost | Embedding generation | Full + embedding |
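The table can be expressed as a tiny cost function. The prices below are purely illustrative assumptions for the arithmetic, not published rates:

```go
package main

import "fmt"

// Illustrative per-request prices (assumptions, not real rates):
// $0.02 per LLM call, $0.00002 per embedding lookup.
const (
	llmCost   = 0.02
	embedCost = 0.00002
)

// requestCost maps the three scenarios from the table to a cost.
func requestCost(exactHit, semanticHit bool) float64 {
	switch {
	case exactHit:
		return 0 // layer 1 hit: no embedding, no provider call
	case semanticHit:
		return embedCost // layer 2 hit: embedding lookup only
	default:
		return llmCost + embedCost // miss: full call, plus embedding the request for future lookups
	}
}

func main() {
	fmt.Printf("exact hit:    $%.5f\n", requestCost(true, false))
	fmt.Printf("semantic hit: $%.5f\n", requestCost(false, true))
	fmt.Printf("miss:         $%.5f\n", requestCost(false, false))
}
```

With prices in this ballpark, a semantic hit costs roughly three orders of magnitude less than a miss.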
Bifrost also handles cost calculation natively through CalculateCostWithCacheDebug, which automatically accounts for cache hits, semantic matches, and misses in your cost tracking. All pricing data is cached in memory for O(1) lookup, so the cost calculation itself adds no overhead.
Check out the full Bifrost documentation for the complete API reference.
Setting it up
Follow the setup guide to get Bifrost running, then configure two things: a vector store and the semantic cache plugin.
Step 1: Configure the vector store
Bifrost uses Weaviate as its vector store. You can run Weaviate locally with Docker or use Weaviate Cloud.
Local setup with Docker:
```bash
docker run -d \
  -p 8080:8080 \
  -p 50051:50051 \
  -e PERSISTENCE_DATA_PATH='/var/lib/weaviate' \
  semitechnologies/weaviate:latest
```
config.json (local Weaviate):
```json
{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "localhost:8080",
      "scheme": "http"
    }
  }
}
```
config.json (Weaviate Cloud):
```json
{
  "vector_store": {
    "enabled": true,
    "type": "weaviate",
    "config": {
      "host": "your-cluster.weaviate.network",
      "scheme": "https",
      "api_key": "your-weaviate-api-key"
    }
  }
}
```
Step 2: Configure the semantic cache plugin
Add the plugin to your Bifrost config:
```json
{
  "plugins": [
    {
      "enabled": true,
      "name": "semantic_cache",
      "config": {
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "ttl": "5m",
        "threshold": 0.8,
        "conversation_history_threshold": 3,
        "exclude_system_prompt": false,
        "cache_by_model": true,
        "cache_by_provider": true,
        "cleanup_on_shutdown": true
      }
    }
  ]
}
```
A few things to note about these settings:
- `threshold`: The similarity score (0 to 1) required for a semantic match. 0.8 is a good starting point; higher means stricter matching and fewer false positives, but more cache misses.
- `conversation_history_threshold`: Defaults to 3. If a conversation has more messages than this, caching is skipped. Long conversations have a high probability of false-positive semantic matches due to topic overlap, and they rarely produce exact hash matches anyway.
- `ttl`: How long cached responses stay valid. Accepts duration strings like `"30s"`, `"5m"`, `"1h"`, or numeric seconds.
- `cache_by_model` and `cache_by_provider`: When true, cache entries are isolated per model and provider combination. A GPT-4 response will not be returned for a GPT-3.5-turbo request.
Step 3: Trigger caching per request
Caching is opt-in per request. You need to set a cache key, either via the Go SDK or HTTP headers:
HTTP API:
```bash
# This request WILL be cached
curl -H "x-bf-cache-key: session-123" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "What is semantic caching?"}]}' \
  http://localhost:8080/v1/chat/completions
```
Go SDK:
```go
ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
response, err := client.ChatCompletionRequest(ctx, request)
```
Without the cache key, requests bypass caching entirely. This gives you fine-grained control over what gets cached and what does not.
Per-request overrides (HTTP):
```bash
curl -H "x-bf-cache-key: session-123" \
  -H "x-bf-cache-ttl: 30s" \
  -H "x-bf-cache-threshold: 0.9" \
  http://localhost:8080/v1/chat/completions
```
Cache type control:
```bash
# Direct hash matching only (fastest, no embedding cost)
curl -H "x-bf-cache-key: session-123" \
  -H "x-bf-cache-type: direct" ...

# Semantic similarity search only
curl -H "x-bf-cache-key: session-123" \
  -H "x-bf-cache-type: semantic" ...

# Default: both (direct first, semantic fallback)
curl -H "x-bf-cache-key: session-123" ...
```
You can also use no-store mode to read from cache without storing the response:
```bash
curl -H "x-bf-cache-key: session-123" \
  -H "x-bf-cache-no-store: true" ...
```
When semantic caching helps vs when it does not
Semantic caching is not a universal solution. Here is where it works well and where it does not.
Good fit:
- Customer support bots where users ask the same questions in different words
- FAQ-style applications with predictable query patterns
- RAG pipelines where similar contexts produce similar queries
- Internal tools where multiple team members ask overlapping questions
- Any high-volume application with repetitive prompt patterns
Not a good fit:
- Conversations that are heavily context-dependent and unique every time
- Long multi-turn conversations (the `conversation_history_threshold` setting exists for this reason, as longer conversations create false-positive matches)
- Applications where responses must reflect real-time data that changes frequently
- Creative generation tasks where you want varied outputs for similar inputs
The key insight is that semantic caching works best when your application naturally produces clusters of similar requests. If every request is genuinely unique, caching of any kind will not help much.
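One way to judge fit is to estimate the expected per-request cost from your hit rates. Here is a back-of-the-envelope sketch; the rates and prices are illustrative assumptions, not measurements:

```go
package main

import "fmt"

// expectedCost blends the three scenarios from the cost table:
// exact hits are free, semantic hits pay only for an embedding,
// and misses pay the full call plus an embedding.
func expectedCost(exactRate, semanticRate, llmCost, embedCost float64) float64 {
	missRate := 1 - exactRate - semanticRate
	return semanticRate*embedCost + missRate*(llmCost+embedCost)
}

func main() {
	llm, embed := 0.02, 0.00002 // assumed per-request prices

	baseline := llm // no caching: every request hits the provider
	// Assumed hit rates for a support-bot-style workload:
	// 30% exact hits, 25% semantic hits.
	cached := expectedCost(0.30, 0.25, llm, embed)

	fmt.Printf("baseline: $%.5f/request\n", baseline)
	fmt.Printf("cached:   $%.5f/request\n", cached)
	fmt.Printf("savings:  %.0f%%\n", 100*(1-cached/baseline))
}
```

If your real traffic has near-zero hit rates, the formula shows the cache adds a small embedding overhead on every miss, which is the quantitative version of "genuinely unique requests do not benefit."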
Other performance details worth knowing
Beyond semantic caching, Bifrost caches aggressively at multiple levels:
- Tool discovery is cached after the first request, bringing subsequent lookups down to roughly 100-500 microseconds.
- Health check results are cached, with lookups taking approximately 50 nanoseconds.
- All pricing data is cached in memory for O(1) lookups during cost calculations.
Cache entries use namespace isolation. Each Bifrost instance gets its own vector store namespace to prevent conflicts. When the Bifrost client shuts down (with cleanup_on_shutdown set to true), all cache entries and the namespace itself are cleaned up. You can also programmatically clear cache by key or clear cache by request ID via the API.
Cache metadata is automatically added to responses via response.ExtraFields.CacheDebug, so you can inspect whether a response came from direct cache, semantic match, or a fresh provider call. You can also use the log statistics API for deeper observability into your cache performance.
Wrapping up
If your GPT-powered application handles any volume of requests, there is a good chance a meaningful portion of those requests are semantically similar. Paying full API cost for every one of them does not make sense.
Bifrost's semantic cache plugin gives you dual-layer caching with exact matching and vector similarity search, opt-in per request, configurable thresholds, and built-in cost tracking. It is open source, written in Go, and designed for production workloads.
Check out the GitHub repo to get started, read the docs for the full configuration reference, or visit the Bifrost website to learn more about the gateway.