Stop Burning Money: Implementing Semantic Caching for LLMs with Redis & Cosine Similarity

#ai

I’m tired of seeing "Hello World" RAG tutorials that pipe every single user query directly to OpenAI’s API. It’s lazy architecture.

If you are building an LLM feature for production, you quickly realize two things:

  1. Latency is the UX killer. Waiting 3 seconds for a response is an eternity.
  2. Token costs scale linearly. Your bill grows as fast as your user base.

Most developers try to solve this with simple key-value caching (searching for an exact string match). But users don't type the same thing twice.

  • User A: "How do I reset my password?"
  • User B: "I forgot my password, help me change it."

A standard Redis GET/SET sees these as different keys. A Semantic Cache knows they are the same intent and serves the cached response from User A to User B. Zero API cost. 50ms latency.
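
A quick way to see the difference (a minimal sketch, assuming `sentence-transformers` is installed; it uses the same model as the rest of this post):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

a = "How do I reset my password?"
b = "I forgot my password, help me change it."

# Exact-match caching compares raw strings: different text, different key, guaranteed miss
print(a == b)  # False

# A semantic cache compares embeddings instead; paraphrases land far closer
# to 1.0 than unrelated queries do
emb = model.encode([a, b], normalize_embeddings=True)
print(float(util.cos_sim(emb[0], emb[1])))
```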

Here is how I implemented a production-grade Semantic Cache layer using Python, Redis (VSS), and Sentence Transformers.

The Architecture

We aren't just matching strings; we are matching vectors.

  1. Incoming Query: Encode the query text into a vector embedding.
  2. Vector Search: Query Redis for vectors within a specific similarity threshold (e.g., 0.9 cosine similarity).
  3. Hit: Return cached JSON.
  4. Miss: Call LLM -> Store Result + Vector in Redis -> Return.

The Stack

  • Python 3.11
  • Redis Stack Server (ships with RediSearch and RedisJSON; the vector search below only needs RediSearch. `docker run -d -p 6379:6379 redis/redis-stack-server:latest` gets you running locally)
  • Sentence-Transformers (all-MiniLM-L6-v2 for its speed/quality balance)

Step 1: The Embedding Service

Don't use OpenAI for embeddings here. It adds network latency. Run a small model locally or in a sidecar container. all-MiniLM-L6-v2 is 80MB and runs on CPU in milliseconds.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model once at startup (singleton pattern recommended in prod)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_embedding(text: str) -> bytes:
    # Encode, normalize for cosine similarity, and serialize to the
    # float32 byte string Redis expects for vector fields
    embedding = model.encode(text, normalize_embeddings=True)
    return embedding.astype(np.float32).tobytes()
```

Step 2: The Vector Index in Redis

Next, create a RediSearch index over every hash stored under the `cache:` prefix: a text field for the cached LLM response and an HNSW vector field for the query embedding.

```python
import redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host='localhost', port=6379, decode_responses=False)
INDEX_NAME = "llm_cache_idx"
VECTOR_DIM = 384  # Output dimension of all-MiniLM-L6-v2

def create_index():
    try:
        r.ft(INDEX_NAME).info()
        print("Index already exists")
    except redis.ResponseError:
        schema = (
            TextField("response"),  # The LLM output we want to retrieve
            VectorField("embedding",
                "HNSW", {
                    "TYPE": "FLOAT32",
                    "DIM": VECTOR_DIM,
                    "DISTANCE_METRIC": "COSINE"
                }
            )
        )
        definition = IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
        r.ft(INDEX_NAME).create_index(schema, definition=definition)
        print("Index created")
```


Step 3: The Semantic Lookup

At request time, embed the incoming query and run a vector range query against the index. One gotcha: RediSearch reports cosine distance (1 minus similarity), so a threshold of 0.1 corresponds to the 0.9 similarity cut-off from the architecture section.

```python
from redis.commands.search.query import Query

def semantic_search(user_query: str, threshold: float = 0.1):
    query_vector = get_embedding(user_query)

    # Range query: return cached entries whose embedding lies within
    # `threshold` cosine distance of the incoming query
    q = Query(f"(@embedding:[VECTOR_RANGE {threshold} $blob])=>{{$yield_distance_as: score}}")\
        .return_fields("response", "score")\
        .sort_by("score")\
        .dialect(2)

    params = {"blob": query_vector}
    results = r.ft(INDEX_NAME).search(q, query_params=params)

    if results.docs:
        best_match = results.docs[0]  # Lowest distance = closest match
        print(f"Cache HIT! (Score: {best_match.score})")
        return best_match.response

    print("Cache MISS.")
    return None
```

Step 4: Writing to the Cache

On a miss, call the LLM and write the response back alongside its embedding. Key the entry on a stable digest of the query text (Python's built-in hash() is salted per process, so it won't survive a restart).

```python
import hashlib

def cache_response(user_query: str, llm_response: str):
    embedding = get_embedding(user_query)
    # Deterministic key derived from the query text
    key = f"cache:{hashlib.sha256(user_query.encode()).hexdigest()}"

    r.hset(key, mapping={
        "embedding": embedding,
        "response": llm_response
    })
    # Set a TTL! Don't let your cache grow forever. 24h is usually good.
    r.expire(key, 86400)
```
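
Putting it together, the read path is a classic cache-aside pattern. A minimal sketch, assuming the functions defined above; `call_llm` is a hypothetical placeholder for whatever LLM client you actually use:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for your real LLM client (OpenAI, local model, etc.)
    raise NotImplementedError

def answer(user_query: str) -> str:
    # 1. Check the semantic cache first
    cached = semantic_search(user_query)
    if cached is not None:
        return cached

    # 2. On a miss, pay for the LLM call...
    llm_response = call_llm(user_query)

    # 3. ...then store it so the next paraphrase of this question is free
    cache_response(user_query, llm_response)
    return llm_response
```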

Benchmarks & Reality Check

I ran this on a dataset of 10,000 customer support queries.

Without Cache: 10,000 API calls. Cost: ~$30. Avg Latency: 2.1s.

With Semantic Cache: 3,800 API calls (62% Hit Rate). Cost: ~$11. Avg Latency on hits: 45ms.

The code above is simplified, but the impact is real. When you deploy this, you aren't just "optimizing code"; you are directly impacting the unit economics of your application.

A Note on "Vibe Coding" vs Engineering

There is a tendency in AI dev right now to just chain API calls and hope for the best. That works for demos. It fails at scale.

If you are serious about building AI systems that don't bankrupt your company, you need to think about data logistics—caching, evaluation pipelines, and token economy. This mindset of engineering for value rather than novelty is what we focus on at ROI Hacking.

Next Steps

Hybrid Cache: Combine this with exact string matching (O(1)) before hitting the vector search (O(log n)) for even more speed; see the sketch after these next steps.

Edge Implementation: Move the embedding generation to the edge (Cloudflare Workers via ONNX) to offload your main server.
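
For the hybrid cache, here is a rough sketch. It assumes something the code above does not yet do: `cache_response` would also need to write an `exact:<sha256>` string key next to the vector entry, so that byte-identical queries skip the embedding step entirely.

```python
import hashlib

def hybrid_lookup(user_query: str):
    # Layer 1: O(1) exact-match lookup on the normalized query text
    digest = hashlib.sha256(user_query.strip().lower().encode()).hexdigest()
    exact_hit = r.get(f"exact:{digest}")  # assumes cache_response also SETs this key
    if exact_hit is not None:
        return exact_hit.decode()

    # Layer 2: fall back to the vector range search (embedding + HNSW lookup)
    return semantic_search(user_query)
```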

Stop calling the LLM for "What is your return policy?" a thousand times a day. Cache it.
[ROI Hacking](https://roihacking.ai/)

Top comments (1)

leob

Brilliant ... !