Vinay Kumar Reddy Budideti
I Built a Semantic Cache That Cuts LLM API Costs by 72% - What Actually Worked and What Didn't

The Results First

100 real Anthropic API calls. Three architectures tested. One that actually worked.

V3 Hybrid Engine — 100-query live benchmark:

| Metric | Value |
| --- | --- |
| Cache hit rate | 87.5% |
| Total cost | $0.24 (vs $0.87 without cache) |
| Cost savings | 71.8% |
| Zero-cost direct hits | 54 queries |
| Adapted (cheap model) | 35 queries |
| Full misses | 9 queries |
| Tokens saved | 179,445 |

The warm-up curve is the real story. The cache starts cold at 42.9% hit rate on the first 10 queries. By query 20: 90%. By query 31: every single query hits cache. Queries 31–40 cost $0.00 — not approximately zero, literally zero dollars.

The system is called Intent Atoms. It sits between your application and any LLM API, using FAISS vector search and MPNet embeddings to match incoming queries against cached responses. When it finds a match, it returns the cached response in ~97ms instead of waiting 8–25 seconds for a fresh generation.
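Stripped of the real embedding model and index, the core lookup is nearest-neighbor search over normalized vectors with a similarity threshold. A minimal sketch of that idea, with a numpy brute-force search standing in for FAISS and toy vectors standing in for MPNet embeddings (the class and method names here are illustrative, not the actual codebase):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize so the inner product equals cosine similarity."""
    return v / np.linalg.norm(v)

class SemanticCache:
    """Toy semantic cache: brute-force inner-product search over
    normalized embeddings, standing in for a FAISS IndexFlatIP."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def add(self, embedding: np.ndarray, response: str) -> None:
        self.vectors.append(normalize(embedding))
        self.responses.append(response)

    def lookup(self, embedding: np.ndarray):
        if not self.vectors:
            return None
        q = normalize(embedding)
        sims = np.stack(self.vectors) @ q   # cosine similarities
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.responses[best]     # cache hit: ~milliseconds
        return None                          # miss: call the LLM
```

The speed difference in the article (~97ms vs 8–25s) comes from the fact that a hit is a local vector search plus a dictionary read, with no model in the loop.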

But the 87.5% number is the end of the story. The beginning was much uglier.


V1: The Elegant Idea That Cost 3x More

My original hypothesis: most LLM queries are compound. "How do I deploy a React app with Docker on AWS?" is really three questions — React builds, Docker containerization, AWS deployment. If I decompose queries into these atomic intents, cache each fragment, and recompose them for new queries, I could reuse fragments across completely different questions.

I built the full pipeline: decompose with Haiku, embed atoms, match via FAISS, generate missing atoms with Sonnet, compose fragments into a response.
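The per-query overhead is easiest to see in skeleton form. In the sketch below the stub functions only count invocations (the real calls go to Haiku and Sonnet, and all names are mine, not the repo's), but the call pattern is the point: every query pays for a decompose call and a compose call on top of any generation.

```python
# Stub "model calls" that just record that they happened.
calls = []

def haiku_decompose(query):           # LLM call 1: split into atoms
    calls.append("decompose")
    return [part.strip() for part in query.split(",")]

def sonnet_generate(atom):            # LLM call(s): answer missing atoms
    calls.append("generate")
    return f"answer({atom})"

def haiku_compose(query, fragments):  # LLM call: stitch fragments together
    calls.append("compose")
    return " ".join(fragments)

def answer_v1(query, cache):
    atoms = haiku_decompose(query)
    cached = [cache[a] for a in atoms if a in cache]
    missing = [a for a in atoms if a not in cache]
    fresh = [sonnet_generate(a) for a in missing]
    for atom, frag in zip(missing, fresh):
        cache[atom] = frag            # atoms are reusable across queries
    return haiku_compose(query, cached + fresh)
```

Run two overlapping queries through it and you see both effects at once: the Docker/AWS atoms get reused, but decompose and compose fire every single time, which is exactly the overhead that made V1's savings negative.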

10-query live benchmark results:

| Metric | V1 (Decompose) |
| --- | --- |
| Hit rate | 28.6% |
| Cost savings | -24.3% |
| Total time | 112 seconds |

Negative cost savings. The decomposition overhead — an extra Haiku call to break the query apart, another to compose the response — exceeded whatever the cache saved. Every query was paying for three LLM calls instead of one.

The decomposer did catch overlaps that simpler systems missed. Query 5 ("Deploy Flask with Docker") matched atoms from Query 1 ("Deploy React with Docker") — the Docker and AWS atoms were reused. But the overhead of decomposing and composing ate the savings alive.


V2: Simple FAISS — Fast but Blind

After V1 failed, I read five research papers: GPTCache, MeanCache, GPT Semantic Cache, and two 2025 papers on domain-specific embeddings and cache eviction. The universal finding: every successful system caches at the full-query level with FAISS vector search. None uses sub-query decomposition.

V2 was radically simple: embed the full query with MPNet, search FAISS, return cached response if similarity > 0.83, otherwise generate and cache.

10-query live benchmark results:

| Metric | V2 (FAISS) |
| --- | --- |
| Hit rate | 10.0% |
| Cost savings | 11.5% |
| Total time | 124 seconds |

Only one hit in 10 queries — Query 10, an exact repeat of Query 1 (similarity = 1.000, cost = $0.000, time = 58ms). Every other query missed because the semantic variations were too different for the threshold to catch.

V2 proved the mechanism worked — that one hit was free and instant. But a 10% hit rate isn't useful.


V3: The Hybrid That Actually Works

The answer was combining both approaches. Use fast full-query matching as the primary strategy, and only fall back to expensive atom decomposition when the query is genuinely novel.

Three tiers, cheapest first:

  • Tier 1 — Direct hit (similarity > 0.85): Return cached response. Zero cost. ~97ms. This caught 54 of 100 queries.
  • Tier 2 — Adapt (similarity 0.70–0.85): Take the closest cached response and use Haiku to tweak it for the new query. ~$0.002. This caught 35 queries.
  • Tier 3 — Full miss (similarity < 0.70): Fall through to atom-level decomposition. Only 9 queries reached this tier.
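The routing itself is a few lines once you have the best-match similarity. A sketch with the thresholds above (the function name is illustrative):

```python
def route(similarity: float) -> str:
    """Map a best-match cosine similarity to one of the three V3 tiers.
    Thresholds from the benchmark: >0.85 direct, 0.70-0.85 adapt."""
    if similarity > 0.85:
        return "direct"   # Tier 1: serve cached response, zero cost
    if similarity >= 0.70:
        return "adapt"    # Tier 2: cheap Haiku rewrite of nearest hit
    return "miss"         # Tier 3: full decomposition + generation
```

The ordering matters: the cheapest check runs first, and the expensive atom machinery only wakes up for genuinely novel queries.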

The adaptation tier is where the real savings come from. When someone asks "How to deploy a Flask app with Docker on AWS?" and I already have a cached response for "How to deploy a React app with Docker on AWS?", the similarity is 0.739 — too low for a direct hit, but the answer is 80% the same. A cheap Haiku call adapts it for $0.002 instead of $0.015 for full Sonnet generation. That's an 87% cost reduction on that single query.
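The adapt step is just a cheap-model rewrite prompt built from the nearest cached entry. A hypothetical version of that prompt construction (the actual wording in the repo may differ):

```python
def build_adapt_prompt(cached_query: str, cached_response: str,
                       new_query: str) -> str:
    """Illustrative Tier 2 prompt: ask a cheap model (Haiku) to rewrite
    an existing cached answer so it addresses the new question."""
    return (
        "A previous question and its answer are below.\n"
        f"Previous question: {cached_query}\n"
        f"Previous answer:\n{cached_response}\n\n"
        "Rewrite the answer so it addresses this new question, "
        "changing only what actually differs:\n"
        f"New question: {new_query}"
    )
```

Because the cached answer carries most of the content, the rewrite needs far fewer output tokens than a from-scratch generation, which is where the roughly 87% per-query reduction comes from.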

10-query live comparison (all three engines, same queries):

| Engine | Hit Rate | Total Cost | Total Time |
| --- | --- | --- | --- |
| V1 (Decompose) | 28.6% | $0.128 | 112s |
| V2 (FAISS) | 10.0% | $0.120 | 124s |
| V3 (Hybrid) | 27.3% | $0.074 | 78s |

On 10 queries, V3 is cheapest and fastest. But the real proof is at scale.


The 100-Query Benchmark

10 topics, 10 paraphrases each, shuffled randomly, real Anthropic API calls with real money.

Cache warm-up curve (cost and hit rate per 10-query block):

| Block | Cost | Hit Rate |
| --- | --- | --- |
| 1–10 | $0.108 | 42.9% |
| 11–20 | $0.028 | 90.0% |
| 21–30 | $0.034 | 70.0% |
| 31–40 | $0.000 | 100% |
| 41–50 | $0.019 | 90.0% |
| 51–60 | $0.014 | 100% |
| 61–70 | $0.016 | 100% |
| 71–80 | $0.003 | 100% |
| 81–90 | $0.019 | 100% |
| 91–100 | $0.003 | 100% |

Block 31–40 is the highlight. Ten queries, ten direct cache hits, zero dollars. By this point the cache has seen enough variations of each topic that every new paraphrase lands above the 0.85 similarity threshold.

The dip at block 41–50 (90% hit rate, $0.019) is real — one query introduced a new enough phrasing to trigger an adaptation instead of a direct hit. But even that adapted response costs only $0.002–0.004 vs $0.007–0.028 for a full generation.

Tier breakdown across all 100 queries:

| Tier | Count | Percentage |
| --- | --- | --- |
| Direct hit ($0) | 54 | 54.0% |
| Adapted (~$0.002) | 35 | 35.0% |
| Atom hit (varies) | 2 | 2.0% |
| Full miss ($0.007–0.028) | 9 | 9.0% |

91 out of 100 queries served from cache. Total: $0.244 vs $0.866 without caching.
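The headline savings figure is straightforward to verify from those two totals:

```python
# Totals from the 100-query benchmark run
cost_with_cache = 0.244   # dollars
cost_without    = 0.866
savings_pct = (1 - cost_with_cache / cost_without) * 100  # ~71.8
```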


What the Numbers Don't Tell You

The benchmark is favorable. 10 topics × 10 paraphrases means high overlap by design. Real-world query distributions have a long tail — 60–70% of queries are unique in most production systems. The research shows that semantic caching works best on narrow, repetitive domains: customer support bots, educational platforms, internal knowledge bases.

An EdTech study achieved 45.1% hit rate on real student queries — not 87%. That's a more realistic number for production.

Atom-level matching barely fires. Only 2 of 100 queries reached the atom decomposition layer. The full-query matching with adaptation handles almost everything. Layer 2 adds complexity without proportional value in this benchmark.

The decomposer is inconsistent. Haiku sometimes produces different atom breakdowns for similar queries, which reduces atom-level cache hits. This was the core problem with V1 and it persists as a fallback limitation in V3.


The Technical Stack

  • Embeddings: sentence-transformers/all-mpnet-base-v2 (768-dim, runs locally — no API cost for embedding)
  • Vector search: FAISS IndexFlatIP (cosine similarity via inner product on normalized vectors)
  • LLM providers: Anthropic Claude — Haiku for cheap operations (decompose, adapt, compose), Sonnet for generation
  • API: FastAPI with async support
  • Dashboard: React + Recharts, deployable to Vercel
  • Persistence: JSON metadata + binary FAISS index files
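The reason a pure inner-product index (IndexFlatIP) suffices for cosine-similarity search: once vectors are L2-normalized, the inner product and the cosine are the same number. A quick numpy check of that identity (faiss itself isn't required for the math):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)   # same dimensionality as MPNet embeddings
b = rng.normal(size=768)

# Normalize, as the cache does before adding vectors to the index
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

inner = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# After normalization the two agree to floating-point precision,
# so inner-product search is cosine-similarity search.
```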

The embedding model choice matters enormously. SHA-256 hashes (my V1 mistake) gave 0% hit rate. MPNet gives 87.5%. Same architecture, completely different results.
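Why hashing fails is easy to demonstrate: an exact-match key changes completely under any paraphrase, so it can only ever hit on byte-identical repeats.

```python
import hashlib

def exact_key(query: str) -> str:
    """Exact-match cache key: any wording change yields an unrelated key."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()

q1 = "How do I deploy a React app with Docker on AWS?"
q2 = "Deploying a React application to AWS using Docker?"
# Same intent, unrelated keys: an exact-match cache always misses on
# paraphrases, hence the 0% hit rate with hashing.
```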


What I'd Build Differently

Skip atom decomposition entirely. The 2-hit contribution from Layer 2 doesn't justify the code complexity. A two-tier system (direct hit + adaptation) would achieve nearly identical results with half the codebase.

Add conversation context. The current system treats each query independently. Follow-up questions like "What about using Kubernetes instead?" require the prior context to make sense. MeanCache addresses this with context chains — worth implementing for any production deployment.

Fine-tune the embedding model. The 2025 domain-specific embeddings paper showed that general-purpose models miss domain paraphrases. "Containerize a React frontend" and "Put React in a Docker image" are the same intent but look different to MPNet. Domain-specific fine-tuning would push the similarity scores higher and catch more near-misses.


Try It

The code is open source under MIT license.

GitHub: vinaybudideti/intent-atoms

Sub-query level semantic caching for LLM APIs — 3-tier hybrid engine with FAISS vector search. 87.5% cache hit rate, 71.8% cost savings on 100 real API calls.

Intent Atoms

Sub-query level semantic caching for LLM APIs with FAISS vector search.

Reduce API costs by up to 71.8% with a hybrid 3-tier caching engine that matches at the full-query, adapted, and atomic intent levels.

Tested on 100 real Anthropic API calls: 87.5% cache hit rate, $0.24 vs $0.87 without cache, 54 zero-cost direct hits.


Benchmark Results (Live API — 100 Queries)

| Metric | Value |
| --- | --- |
| Cache Hit Rate | 87.5% |
| Cost Savings | 71.8% ($0.24 vs $0.87) |
| Direct Hits (zero cost) | 54 queries |
| Adapted (cheap Haiku call) | 35 queries |
| Atom Hits (partial cache) | 2 queries |
| Full Misses | 9 queries |
| Tokens Saved | 179,445 |

Cache Warm-Up Curve

The cache starts cold and improves with every query. By query 30, hit rate reaches 100% per block:

| Block | Cost | Hit Rate |
| --- | --- | --- |
| 1-10 | $0.090010 | 42.9% |
| 11-20 | $0.020117 | 90.0% |
| 21-30 | $0.034376 | 70.0% |
| 31-40 | $0.000000 | 100.0% |
| 41-50 | $0.014142 | 100.0% |
| 51-60 | $0.004077 | 100.0% |
| 61-70 | $0.002554 | 100.0% |

Live dashboard: intent-atoms.vercel.app

Includes all three engine versions, the complete benchmark suite with terminal screenshots, a FastAPI REST API, and a React analytics dashboard.

```shell
git clone https://github.com/vinaybudideti/intent-atoms.git
cd intent-atoms
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python tests/benchmark_100.py  # run the 100-query benchmark yourself
```
