Jalil B.

Your LLM Wrapper is Leaking Money: The Architecture of Semantic Caching

In the rush to deploy GenAI features, most engineering teams hit the same three hurdles: the 504 Gateway Timeout, the Hallucination Loop, and—the most painful one—the Wallet Burner.

I’ve seen production logs where a startup was burning $5,000/month on OpenAI bills, simply because they treated LLM APIs like standard REST endpoints. They implemented caching, but they implemented it wrong.

When you are working with Large Language Models, key-value caching is dead. You need Semantic Caching.

Here is how you stop paying for the same query twice, even when users phrase it differently.

The Bad Pattern: Exact Match Caching

Most backend engineers start by wrapping the API call in a simple Redis check. The logic is straightforward: hash the user's prompt, check if it exists, return the value.

It looks like this:

# The Naive Approach
def get_ai_response(user_query, mock_llm, cache):
    # PROBLEM: Only checks exact matches.
    if user_query in cache:
        return cache[user_query]

    # Cache miss: pay for a fresh generation and store it under the raw string key
    response = mock_llm.generate(user_query)
    cache[user_query] = response
    return response

Why this fails in production

Human beings are inconsistent.

  • User A asks: "What is your pricing?"
  • User B asks: "How much does it cost?"
  • User C asks: "Price list please"

To a standard key-value store, these are three distinct keys. You will pay for the generation three times, even though the intent—and the answer—is identical. In high-traffic apps, this redundancy can account for 40-60% of total token usage. Let's do the math.

Example Cost Breakdown:

  • 1,000 requests/day
  • Without semantic caching: $150/month (assuming 1,000 input tokens avg, $0.005/1K tokens, 30 days)
  • With 50% cache hit rate: $75/month
  • Embedding cost overhead: ~$2/month (text-embedding-3-small is $0.00002/1K tokens)

Net savings: $73/month → $876/year
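
If you want to sanity-check those figures, the arithmetic fits in a few lines. Every constant below is just the assumption from the bullet list above, not measured data:

# Back-of-the-envelope check of the cost breakdown above
REQUESTS_PER_DAY = 1_000
AVG_INPUT_TOKENS = 1_000
DAYS_PER_MONTH = 30
LLM_PRICE_PER_1K = 0.005        # assumed LLM input price per 1K tokens
EMBED_PRICE_PER_1K = 0.00002    # text-embedding-3-small

monthly_tokens = REQUESTS_PER_DAY * AVG_INPUT_TOKENS * DAYS_PER_MONTH

baseline = monthly_tokens / 1_000 * LLM_PRICE_PER_1K          # $150.00
with_cache = baseline * 0.5                                    # 50% hit rate -> $75.00
embedding_overhead = monthly_tokens / 1_000 * EMBED_PRICE_PER_1K  # ~$0.60; the ~$2 above is a generous buffer

net_savings = baseline - with_cache - embedding_overhead
print(f"Baseline: ${baseline:.2f}, with cache: ${with_cache:.2f}, net savings: ~${net_savings:.0f}/month")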

The Good Pattern: Semantic Caching

To fix this, we need to move from lexical equality (do the strings match?) to semantic similarity (do the meanings match?).

We achieve this using Vector Embeddings.

The Architecture

  1. Embed: Convert the incoming user query into a vector (a list of floating-point numbers) using a cheap model like text-embedding-3-small.
  2. Search: Compare this vector against your cache of previous query vectors.
  3. Threshold: Calculate the Cosine Similarity. If the similarity score is above a threshold (e.g., 0.9), return the cached response.

The Implementation

Here is the Python logic that handles the vector math. The mathematical intuition is best understood in code:

import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. The Math: Cosine Similarity
# Calculates the angle between two vectors.
# 1.0 = Identical direction (Same meaning)
# 0.0 = Orthogonal (Unrelated)
def cosine_similarity(v1, v2):
    dot_product = sum(a*b for a, b in zip(v1, v2))
    norm_a = math.sqrt(sum(a*a for a in v1))
    norm_b = math.sqrt(sum(b*b for b in v2))
    return dot_product / (norm_a * norm_b)

def get_ai_response_semantic(user_query, llm, cache):
    # 2. Embed the current query
    emb_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_query
    )
    query_embedding = emb_response.data[0].embedding

    # 3. Define a strict threshold
    # Too low = wrong answers. Too high = missed savings.
    threshold = 0.9

    best_sim = -1
    best_response = None

    # 4. Iterate / Search Vector DB
    for cached_query, data in cache.items():
        cached_embedding = data['embedding']
        sim = cosine_similarity(query_embedding, cached_embedding)

        if sim > best_sim:
            best_sim = sim
            best_response = data['response']

    # 5. The Decision Logic
    if best_sim > threshold:
        print(f"Cache Hit! Similarity: {best_sim:.4f}")
        return best_response

    # 6. Cache Miss: Pay the "Token Tax"
    response = llm.generate(user_query)

    # Store response AND the vector for future matching
    cache[user_query] = {
        'response': response,
        'embedding': query_embedding
    }
    return response
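
A quick usage sketch. MockLLM is a hypothetical stand-in for whatever generation client you wrap; the OpenAI client above still handles the embeddings:

# Hypothetical usage: the second, reworded query should hit the cache
class MockLLM:
    def generate(self, prompt):
        return f"Generated answer for: {prompt}"

cache = {}
mock_llm = MockLLM()

print(get_ai_response_semantic("What is your pricing?", mock_llm, cache))   # miss: pays for generation
print(get_ai_response_semantic("How much does it cost?", mock_llm, cache))  # should log "Cache Hit!" if sim > 0.9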

The loop-based search above is for learning only. Once the cache grows past a few hundred entries, swap the Python loop for a vector database with ANN (Approximate Nearest Neighbor) indexing. Options: pgvector (Postgres), Pinecone, Weaviate, or Qdrant.
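
As an illustration, here is roughly what the lookup becomes with pgvector. This is a sketch, not production code: it assumes a semantic_cache table with a vector(1536) column (text-embedding-3-small's dimension), the psycopg2 driver, and a placeholder connection string:

# Sketch: semantic lookup with pgvector (assumes something like:
#   CREATE EXTENSION vector;
#   CREATE TABLE semantic_cache (query text, response text, embedding vector(1536));
#   CREATE INDEX ON semantic_cache USING hnsw (embedding vector_cosine_ops);
import psycopg2

def pgvector_lookup(query_embedding, threshold=0.9):
    conn = psycopg2.connect("dbname=app")  # hypothetical connection string
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn, conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator, so similarity = 1 - distance
        cur.execute(
            """
            SELECT response, 1 - (embedding <=> %s::vector) AS similarity
            FROM semantic_cache
            ORDER BY embedding <=> %s::vector
            LIMIT 1
            """,
            (vec, vec),
        )
        row = cur.fetchone()
    if row and row[1] > threshold:
        return row[0]   # cache hit
    return None         # cache miss: generate, then INSERT the new row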

The Danger Zone: False Positives

There is a catch. If you set your threshold too low (e.g., 0.7), you risk a False Positive Cache Hit.

  • Query: "Can I delete my account?"
  • Cached: "Can I delete my post?"
  • Similarity: 0.85

If you return the cached instructions for deleting a post when the user wants to wipe their account, you have a UX disaster.

Production Tip: For sensitive actions, use a Re-ranker. Once you find a cache hit, perform a quick second check with a specialized Cross-Encoder model to verify the two queries actually entail the same output.
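
A minimal sketch of that second check, assuming the sentence-transformers library and an off-the-shelf cross-encoder checkpoint. The model name and the 0.8 cut-off are illustrative, not tuned values:

# Sketch: cross-encoder re-rank before trusting a cache hit
# (stsb cross-encoders output a similarity score roughly in the 0-1 range)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/stsb-roberta-base")

def confirm_cache_hit(new_query, cached_query, cutoff=0.8):
    # The cross-encoder reads both sentences together, so it separates
    # "delete my account" from "delete my post" better than raw embeddings.
    score = reranker.predict([(new_query, cached_query)])[0]
    return score >= cutoff

if confirm_cache_hit("Can I delete my account?", "Can I delete my post?"):
    ...  # serve the cached response
else:
    ...  # treat it as a miss and call the LLM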

Summary

Building AI apps is easy. Building profitable AI apps requires systems engineering.

  • Exact Match: Easy to build, expensive to run.
  • Semantic Cache: Harder to build, cuts API bills by ~40%.

Where to Practice This

Understanding semantic caching conceptually is one thing. Debugging it under production constraints—where you're balancing threshold tuning, false positive rates, and embedding costs in real time—is what separates theory from mastery.

I built TENTROPY specifically to simulate these failure modes. The "Wallet Burner" challenge drops you into a live codebase with a burning API bill and asks you to implement the vector logic above to stop the bleed. It's closer to an on-call scenario than a LeetCode problem.

Try the Semantic Caching Challenge (The Wallet Burner)
