TL;DR: Most LLM caching is exact-match — same input string, same output. But users rarely phrase the same question identically. Semantic caching matches by meaning, serving cached responses for queries that are similar but not identical. Bifrost (open-source, Go) implements dual-layer caching (exact hash plus vector similarity) with sub-millisecond retrieval. Here's how to set it up and what kind of savings to expect.
The Problem with Exact-Match Caching
If you're running LLM API calls in production, you've probably thought about caching. The idea is simple — if someone asks the same question, serve the cached response instead of making another API call.
Here's the catch: users almost never ask the exact same question.
User A: "What's the return policy?"
User B: "How do I return something?"
User C: "Can I get a refund?"
All three questions are asking the same thing. An exact-match cache treats them as three separate, uncached requests. Three API calls. Three sets of tokens billed.
Now multiply this across your entire user base. If you're handling lakhs of queries per day, the wasted spend on semantically identical questions is enormous.
What Semantic Caching Actually Does
Semantic caching converts each query into a vector embedding — a numerical representation of its meaning — and searches for similar cached queries using vector similarity.
Here's what this means for you:
- Query comes in → Bifrost generates an embedding for it
- Similarity search → Checks the vector store for cached queries above a similarity threshold (default: 0.8)
- Cache hit → If a similar query exists, serve the cached response. No API call.
- Cache miss → Forward to the LLM provider, cache the response with its embedding for future matches
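Under the hood, "above a similarity threshold" typically means cosine similarity between embedding vectors. Here's a minimal Python sketch of that check, using toy 3-dimensional vectors (real embedding models produce vectors with hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.8  # the default threshold mentioned in the steps above

# Toy vectors standing in for real embeddings
cached = [0.9, 0.2, 0.1]
incoming = [0.85, 0.25, 0.15]

score = cosine_similarity(cached, incoming)
print("hit" if score >= THRESHOLD else "miss", round(score, 3))
```

Two queries with the same meaning land close together in embedding space, so their cosine similarity clears the threshold even though the strings differ.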
The difference from exact-match:
| Scenario | Exact-Match Cache | Semantic Cache |
|---|---|---|
| "What's the return policy?" → "How do I return something?" | Miss ❌ | Hit ✅ |
| Different phrasing, same intent | Miss ❌ | Hit ✅ |
| Typos and variations | Miss ❌ | Hit ✅ |
| Completely different question | Miss ❌ | Miss ❌ |
Bifrost implements both layers. Exact hash matching runs first (near-zero overhead). If that misses, semantic similarity search kicks in. You get the speed of exact-match for identical queries plus the intelligence of semantic matching for everything else.
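The dual-layer idea can be sketched in a few lines of Python. The `DualLayerCache` class, `toy_embed` function, and linear scan below are illustrative stand-ins, not Bifrost's implementation (Bifrost delegates the vector search to Weaviate):

```python
import hashlib
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class DualLayerCache:
    """Exact-hash lookup first, semantic fallback second (illustrative only)."""

    def __init__(self, embed_fn, threshold=0.8):
        self.exact = {}        # sha256(query) -> response
        self.semantic = []     # (embedding, response) pairs
        self.embed = embed_fn
        self.threshold = threshold

    def _key(self, query):
        return hashlib.sha256(query.encode()).hexdigest()

    def get(self, query):
        # Layer 1: exact hash match, near-zero overhead
        hit = self.exact.get(self._key(query))
        if hit is not None:
            return hit
        # Layer 2: linear scan stands in for a real vector-store search
        q = self.embed(query)
        for emb, response in self.semantic:
            if cosine(q, emb) >= self.threshold:
                return response
        return None

    def put(self, query, response):
        self.exact[self._key(query)] = response
        self.semantic.append((self.embed(query), response))

# Toy embedder: a real system calls an embedding model here
def toy_embed(text):
    t = text.lower()
    return [1.0 if ("return" in t or "refund" in t) else 0.0,
            1.0 if "ship" in t else 0.0,
            0.1]

cache = DualLayerCache(toy_embed)
cache.put("What's the return policy?", "Returns accepted within 30 days.")
print(cache.get("What's the return policy?"))   # exact-hash hit
print(cache.get("How do I return something?"))  # semantic hit
print(cache.get("When does my order ship?"))    # miss -> None
```

The ordering is the point: identical strings never pay the embedding cost, and only genuine misses pay for an LLM call.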
Setting Up Semantic Caching in Bifrost
Step 1: Get Bifrost Running
npx -y @maximhq/bifrost
# Or Docker:
docker pull maximhq/bifrost
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost
Open http://localhost:8080 — you'll see the Web UI.
Step 2: Set Up a Vector Store
Semantic caching needs a vector database to store and search embeddings. Bifrost currently supports Weaviate as the vector store for semantic caching.
Option A: Weaviate Cloud
Sign up at cloud.weaviate.io, create a cluster, and grab your API key and cluster URL.
Option B: Local Weaviate (Docker)
docker run -d \
--name weaviate \
-p 8081:8080 \
-e QUERY_DEFAULTS_LIMIT=25 \
-e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' \
-e PERSISTENCE_DATA_PATH='/var/lib/weaviate' \
semitechnologies/weaviate
Step 3: Configure Caching in Bifrost
In the Bifrost Web UI, navigate to the Cache settings and configure:
- Vector store: Weaviate (with your connection details)
- Similarity threshold: 0.8 (default — raise it for stricter matching, lower for broader)
- Embedding model: Configure an embedding provider (OpenAI's text-embedding-3-small works well)
Alternatively, if you're using file-based configuration, add the cache and vector store config to your config.json.
Step 4: Test It
# First request — cache miss, hits the LLM provider
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "What is the return policy?"}]}'
# Second request — different wording, same intent — semantic cache hit
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "How do I return a product?"}]}'
The second request should return a cached response in under a millisecond instead of the typical 1-3 seconds for an LLM API call.
The Cost Math
Here's what this means in practice. Let's say you're running a customer support bot handling 50,000 queries per day.
Without caching:
- 50,000 API calls/day
- Average cost per call (GPT-4o-mini): ~₹0.50
- Daily cost: ~₹25,000
- Monthly cost: ~₹7,50,000
With exact-match caching (typical 15-20% hit rate):
- ~40,000 API calls/day
- Monthly cost: ~₹6,00,000
- Savings: ~₹1,50,000/month
With semantic caching (typical 50-65% hit rate):
- ~20,000 API calls/day
- Monthly cost: ~₹3,00,000
- Savings: ~₹4,50,000/month
The semantic cache embedding cost is minimal — embedding models are orders of magnitude cheaper than chat completion models. You're spending paise per embedding to save rupees per API call.
Your actual hit rate depends on your use case. Support bots and FAQ-style applications see the highest rates (60%+). Creative/generative tasks see lower rates (20-30%). But even a 30% semantic cache hit rate dramatically reduces costs compared to exact-match only.
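The arithmetic above generalises to a small function you can adapt to your own traffic. The function name and the flat 30-day month are my own simplifications, and embedding costs are ignored since they're negligible next to completion costs:

```python
def monthly_cost(queries_per_day, cost_per_call, hit_rate=0.0, days=30):
    """Monthly LLM spend after removing cached calls.
    Simplification: flat 30-day month, embedding cost ignored."""
    return queries_per_day * cost_per_call * days * (1 - hit_rate)

baseline = monthly_cost(50_000, 0.50)                 # no caching
semantic = monthly_cost(50_000, 0.50, hit_rate=0.60)  # 60% semantic hit rate
print(f"baseline: Rs {baseline:,.0f}/month")
print(f"semantic: Rs {semantic:,.0f}/month, saving Rs {baseline - semantic:,.0f}")
```

Plug in your own volume, per-call cost, and measured hit rate to estimate the payoff before deploying.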
Things to Know Before You Deploy
Similarity threshold tuning matters. Too low (0.5) and you'll serve irrelevant cached responses. Too high (0.95) and you're barely better than exact-match. Start at 0.8 and adjust based on your use case. Monitor a sample of cache hits to check quality.
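One practical way to pick a threshold: log the similarity score between each incoming query and its nearest cached neighbour, then sweep candidate thresholds over that sample. The scores below are made up for illustration:

```python
# Nearest-neighbour similarity scores for a sample of incoming queries
# (illustrative numbers -- in practice, log these from your gateway)
scores = [0.97, 0.91, 0.86, 0.82, 0.78, 0.71, 0.64, 0.55]

for threshold in (0.5, 0.8, 0.95):
    hits = sum(s >= threshold for s in scores)
    print(f"threshold {threshold:.2f}: {hits}/{len(scores)} would be cache hits")
```

Pair the sweep with a manual quality check on the borderline matches: the right threshold is the highest hit rate at which sampled hits are still genuinely the same question.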
Streaming works. Bifrost caches streaming responses and replays them with proper chunk ordering. Your users get the same streaming experience whether the response is live or cached.
Cache invalidation. If your underlying data changes (product catalog update, policy change), you'll want to clear relevant cache entries. Bifrost supports namespace management — organise cached data into collections so you can invalidate specific categories without nuking the entire cache.
Conversation context matters. For multi-turn conversations, Bifrost uses a configurable conversation history threshold (default: 3 turns) for cache key generation. This means two conversations that diverge early won't return cached responses from each other.
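To make the turn threshold concrete, here's one plausible way such a key could be derived — a sketch of the general technique, not Bifrost's actual key-generation scheme:

```python
import hashlib
import json

HISTORY_THRESHOLD = 3  # turns considered, mirroring the default above

def conversation_cache_key(messages, n=HISTORY_THRESHOLD):
    """Key derived from only the last n turns (sketch, not Bifrost's scheme)."""
    payload = json.dumps(messages[-n:], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

a = [{"role": "user", "content": "Hi"},
     {"role": "assistant", "content": "Hello! How can I help?"},
     {"role": "user", "content": "What's the return policy?"}]
b = [{"role": "user", "content": "Hey there"},  # diverges within the window
     {"role": "assistant", "content": "Hello! How can I help?"},
     {"role": "user", "content": "What's the return policy?"}]

print(conversation_cache_key(a) != conversation_cache_key(b))  # True: keys differ
```

Because the key covers the recent turns rather than just the latest message, conversations that differ inside the window get distinct cache entries.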
When Semantic Caching Doesn't Help
If every query to your LLM is genuinely unique — say, code generation with highly specific context — semantic caching won't give you much. The embeddings will all be too dissimilar.
Similarly, if your application needs guaranteed fresh responses (real-time data queries, live recommendations), caching of any kind is the wrong approach.
But for the vast majority of production LLM use cases — customer support, documentation Q&A, content summarisation, classification — semantic caching is the single highest-ROI optimisation you can make.
Get Started
npx -y @maximhq/bifrost
# Open http://localhost:8080 → Configure cache settings
GitHub: git.new/bifrost | Docs: getmax.im/bifrostdocs | Website: getmax.im/bifrost-home
The setup takes about 10 minutes. The cost savings start with the first semantically matched query. Worth trying?