Kuldeep Paul
Semantic Caching Cut Our LLM Costs by 40%

The Problem

Our agent answers questions about product documentation. Users ask the same questions differently:

  • "How do I reset my password?"
  • "What's the password reset process?"
  • "I forgot my password, how do I change it?"

All three hit the LLM. Same answer, three separate API calls. Three separate charges.

Exact-match caching doesn't help because the queries aren't identical. We needed something smarter.


Enter Semantic Caching

Instead of matching exact strings, semantic caching matches meaning.

How it works:

  1. Generate embedding for incoming query
  2. Search for similar queries in cache (cosine similarity)
  3. If similarity > threshold (e.g., 0.95), return cached response
  4. Otherwise, call LLM and cache the result

Example:

```
Query 1: "How do I reset my password?"
→ No cache hit
→ Call LLM ($0.002)
→ Cache: embedding + response

Query 2: "What's the password reset process?"
→ Similarity: 0.97 (cache hit!)
→ Return cached response ($0.000)
→ Saved: $0.002, 800ms latency

Query 3: "I forgot my password, help?"
→ Similarity: 0.96 (cache hit!)
→ Return cached response ($0.000)
→ Saved: $0.002, 800ms latency
```
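In code, the lookup flow looks roughly like this. It's a minimal sketch of the idea, not Bifrost's implementation; `embed` and `call_llm` are placeholders for whatever embedding and chat-completion calls you already make:

```python
import time
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, embed, call_llm, threshold=0.95, ttl_seconds=3600):
        self.embed = embed            # str -> np.ndarray
        self.call_llm = call_llm      # str -> str
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.entries = []             # list of (embedding, response, cached_at)

    def query(self, text: str) -> str:
        vec = self.embed(text)
        now = time.time()
        # Drop expired entries, then find the closest cached query
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        scored = [(cosine_similarity(vec, emb), resp) for emb, resp, _ in self.entries]
        if scored:
            best_score, best_resp = max(scored, key=lambda s: s[0])
            if best_score >= self.threshold:
                return best_resp            # cache hit: skip the LLM call
        response = self.call_llm(text)      # cache miss: pay for the call
        self.entries.append((vec, response, now))
        return response
```

Bifrost does the same thing at the gateway layer, so none of this has to live in your application code.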


Real Results

Our production numbers (30 days):

```
Total requests: 45,000
Cache hits: 18,000 (40%)
Cost saved: $76
Latency saved: ~14,400 seconds

Average response time:
├─ Cache miss: 1.2s
└─ Cache hit: 0.05s (24x faster)
```

A 40% cache hit rate might not sound impressive, but it means 40% of requests are near-instant and effectively free.


When Semantic Caching Works

Great for:

  • Documentation Q&A (same questions, different wording)
  • Customer support (common issues asked repeatedly)
  • Code explanation (similar code patterns)
  • Translation tasks (same phrases)

Not great for:

  • Queries requiring current data ("today's weather")
  • Personalized responses (user-specific context)
  • Creative generation (want variety, not cached outputs)
  • Low-traffic endpoints (not enough queries to benefit)

Implementation in Bifrost

Bifrost has semantic caching built in. Here’s how to enable it.

1. Configure caching

```json
{
  "cache": {
    "enabled": true,
    "similarity_threshold": 0.95,
    "ttl_seconds": 3600,
    "embedding_model": "text-embedding-3-small"
  }
}
```

2. That’s it

The gateway automatically:

  • Generates embeddings for incoming queries
  • Searches the cache for semantically similar queries
  • Returns cached responses when similarity exceeds the threshold
  • Updates the cache with new responses on cache misses
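Because the cache lives in the gateway, the application keeps calling the same chat-completions endpoint it already uses. Here's a rough sketch assuming you point an OpenAI-compatible client at Bifrost on localhost:8080; the base URL, path, and model name are assumptions, so match them to your own deployment:

```python
# Illustrative only: the app code doesn't change, it just points at the gateway.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # gateway endpoint instead of the provider
    api_key="your-provider-key",
)

for question in [
    "How do I reset my password?",
    "What's the password reset process?",  # semantically similar → should hit the cache
]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    print(resp.choices[0].message.content[:80])
```

The second question should come back from the cache, which is what shows up as a hit in the dashboard.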

Dashboard visibility

The Bifrost dashboard shows:

  • Cache hit rate
  • Cost savings
  • Latency improvements

The Tradeoffs

Pros

  • Massive cost savings (~40% in our case)
  • Much faster responses (~24× faster in our case)
  • Zero application code changes required

Cons

  • Embedding generation adds ~50 ms latency on cache misses
  • Cache storage costs (minimal — embeddings are small)
  • Potential for stale responses if underlying data changes frequently

Cost Breakdown

```
Cost calculation:
├─ Embedding: $0.00002 per query
├─ LLM call: $0.00200 per query
└─ Savings per cache hit: $0.00198
```


At a 40% cache hit rate (18,000 hits):

18,000 × $0.00198 = $35.64 saved


Embedding costs for all queries (45,000 total):

45,000 × $0.00002 = $0.90


Net savings:

$35.64 − $0.90 = $34.74 saved per 45k queries


Even accounting for embedding costs, the savings are substantial.
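If you want to sanity-check the math for your own traffic, here's the same breakdown as a quick calculation. The per-query prices are the ones above; swap in your own rates:

```python
# Back-of-the-envelope savings estimate, mirroring the breakdown above.
def net_savings(total_requests, hit_rate, llm_cost=0.002, embedding_cost=0.00002):
    hits = total_requests * hit_rate
    saved_on_hits = hits * (llm_cost - embedding_cost)    # $0.00198 saved per hit
    embedding_overhead = total_requests * embedding_cost  # every query gets embedded
    return saved_on_hits - embedding_overhead

print(f"${net_savings(45_000, 0.40):.2f}")  # $34.74 for the numbers in this post
```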


Tuning the Similarity Threshold

Similarity threshold selection is critical:

0.90 → High hit rate, higher risk of incorrect cache hits
0.95 → Balanced (our recommended default)
0.98 → Safer, but lower hit rate


Our test results

0.90 → 52% hit rate, ~3% incorrect cache hits
0.95 → 40% hit rate, ~0.5% incorrect cache hits
0.98 → 28% hit rate, ~0.1% incorrect cache hits


Recommendation:

Start at 0.95, then tune based on your accuracy and freshness requirements.
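One way to pick the value is to replay a labeled set of query pairs through your embedding model and measure hit rate versus incorrect hits at each candidate threshold. A rough sketch; `embed` and the labeled pairs are placeholders you'd supply from your own traffic:

```python
# Sweep candidate thresholds over labeled pairs: (query_a, query_b, same_answer: bool)
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep(pairs, embed, thresholds=(0.90, 0.95, 0.98)):
    sims = [(cosine(embed(a), embed(b)), same) for a, b, same in pairs]
    for t in thresholds:
        hits = [(s, same) for s, same in sims if s >= t]
        hit_rate = len(hits) / len(sims)
        wrong = sum(1 for _, same in hits if not same)  # cache hit, but answers differ
        wrong_rate = wrong / len(hits) if hits else 0.0
        print(f"threshold {t}: hit rate {hit_rate:.0%}, incorrect hits {wrong_rate:.1%}")
```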

Cache Invalidation

Time-based (TTL):
Set an expiration time, e.g. 3600 seconds (1 hour). Good for data that changes predictably.

```json
{
  "ttl_seconds": 3600
}
```

Manual invalidation:
Clear cache when you update documentation or data sources.

```bash
curl -X POST http://localhost:8080/cache/clear
```

Selective clearing:
Tag cache entries by topic, clear specific topics when updated.
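If you're rolling your own cache layer, topic tags make selective invalidation simple. An illustrative sketch (not a Bifrost API; the tag scheme is made up):

```python
# Tag each cached entry with the docs section it came from, then clear by tag.
from collections import defaultdict

class TaggedCache:
    def __init__(self):
        self.by_tag = defaultdict(dict)   # tag -> {query: response}

    def put(self, tag, query, response):
        self.by_tag[tag][query] = response

    def invalidate(self, tag):
        # Removes every cached entry under that tag,
        # e.g. invalidate("password-reset") after updating those docs.
        self.by_tag.pop(tag, None)
```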


Try It Yourself

Bifrost is open source and MIT licensed:

```bash
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
```

Enable semantic caching in the UI settings. Monitor the dashboard to see cache hit rates and cost savings in real time.

Full implementation details in the GitHub repo.


The Bottom Line

Semantic caching is the easiest optimization we've implemented:

  • Zero code changes
  • 40% cost reduction
  • 24x faster responses on cache hits

If you're making repeated LLM calls with similar queries, semantic caching pays for itself immediately.


Built by the team at Maxim AI. We also build evaluation and observability tools for AI agents.
