You ship a RAG system. The first cost report lands. Embedding spend is a third of your bill, and most of it is paying to re-embed the same chunks you already embedded yesterday. The corpus barely moves. The user query distribution is heavy on the same handful of intents. Every webhook that touches a document re-embeds it from scratch. Every nightly index rebuild ignores the previous run's vectors and asks the vendor for new ones.
The fix is obvious: cache the embeddings. Look up by content, return the stored vector if it exists, call the API only when it does not. Three lines of pseudocode.
Then the questions show up. What if the source document changes? What if the embedding model is swapped out from under you? And on day zero, when the cache is empty and your producer is firing at the very rate the cache is supposed to absorb, what then? An embedding cache that does not answer these three questions is worse than no cache, because it serves stale vectors confidently.
The cache key is the whole design
The single most load-bearing decision is the key. Get it wrong and the cache is broken in ways that look correct from the outside.
Use the tuple (model_name, model_version, doc_hash). Each piece is non-negotiable.
model_name covers the case where you have more than one embedder live at once. A search index running text-embedding-3-small and a relevance reranker running text-embedding-3-large cannot share a cache row. Their vector spaces are unrelated. Mixing them returns garbage.
model_version covers silent vendor updates and your own self-hosted model rollouts. OpenAI, Cohere, and Voyage have all shipped versioned embedding models; OpenAI's embeddings docs list text-embedding-3-small and text-embedding-3-large as the current third-generation models alongside the older text-embedding-ada-002. The version in the key is what lets you ship a new model behind the cache without pretending the old vectors are still valid.
doc_hash is a SHA-256 of the normalized chunk text. Normalize whitespace, strip control characters, lowercase if your embedder is case-insensitive (most are not, so usually leave case alone). The hash is the content fingerprint. If the chunk text is byte-identical, you reuse the vector. If a single character changed, you recompute.
import hashlib


def normalize(text: str) -> str:
    return " ".join(text.split()).strip()


def cache_key(
    model_name: str,
    model_version: str,
    chunk_text: str,
) -> str:
    h = hashlib.sha256(
        normalize(chunk_text).encode("utf-8")
    ).hexdigest()
    return f"emb:{model_name}:{model_version}:{h}"
A common mistake is keying on document ID instead of content hash. Then when the document is edited, you serve the stale vector for the new text. Keying on content hash sidesteps the problem: edited text produces a new hash, which produces a cache miss, which triggers a fresh embedding.
Invalidation: TTL is the wrong default
A TTL on embeddings is almost never what you want. Embeddings do not go stale on a clock. They go stale when the source text changes or the model changes, both of which the cache key already handles. A 7-day TTL on a vector you derived from immutable text is throwing money at the vendor for no reason.
Two cases justify a TTL.
The first is a memory ceiling on a hot cache (Redis or an in-process LRU). You want eviction, not expiration. Configure an LRU eviction policy, or use EXPIRE with a long horizon (30+ days) as a memory bound, not as a freshness signal.
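A sketch of that first case in Redis, where the cap and the key are placeholders, not recommendations:

import redis

r = redis.Redis()

# Cap memory and let Redis evict least-recently-used keys when the cap is hit.
# The 2gb figure is illustrative; size it to your own box.
r.config_set("maxmemory", "2gb")
r.config_set("maxmemory-policy", "allkeys-lru")

# If you set an expiry at all, treat it as a coarse memory bound, not a
# freshness signal: 30+ days, far past any notion of "staleness".
r.set("emb:text-embedding-3-small:v1:deadbeef", b"\x00" * 4, ex=60 * 60 * 24 * 30)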
The second is when your "documents" are actually short-lived: chat-history snippets that you never need again after the session, or per-user state where retention is governed by data-deletion policy. There a TTL is doing the right job, and the right job is data lifecycle, not embedding freshness.
For everything else, invalidation is event-driven. A document update fires an event. The worker that consumes it computes the new chunk hashes and writes new rows. Old rows can stay for as long as your storage budget allows; their keys will never be looked up again because no chunk will ever hash to them.
from dataclasses import dataclass


@dataclass
class ChunkEvent:
    doc_id: str
    chunk_id: str
    text: str
    op: str  # "upsert" or "delete"


def handle_event(ev: ChunkEvent, store, embed_fn, model: str, ver: str):
    if ev.op == "delete":
        store.delete_by_doc(ev.doc_id)
        return
    key = cache_key(model, ver, ev.text)
    if store.exists(key):
        # Identical text seen before: reuse the row, just repoint the binding.
        store.bind(ev.chunk_id, key)
        return
    vec = embed_fn([ev.text])[0]
    store.put(key, vec)
    store.bind(ev.chunk_id, key)
bind(chunk_id, key) writes to a small indirection table that maps your application's chunk identity to the content-addressed cache row. When a chunk is rewritten, you update the binding to point at a new key. The store ends up with two tables: the content-addressed embedding rows (huge, append-mostly) and the chunk-to-key map (small, mutable).
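A minimal sketch of that two-table shape, here in SQLite; the table and column names are invented for illustration, not taken from any particular library:

import sqlite3

conn = sqlite3.connect("embeddings.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS embedding (      -- content-addressed, append-mostly
    cache_key TEXT PRIMARY KEY,             -- emb:{model}:{version}:{hash}
    vector    BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS chunk_binding (  -- small, mutable
    chunk_id  TEXT PRIMARY KEY,
    cache_key TEXT NOT NULL REFERENCES embedding(cache_key)
);
""")


def bind(chunk_id: str, key: str) -> None:
    # Rewriting a chunk just repoints its binding at a new content-addressed row.
    conn.execute(
        "INSERT INTO chunk_binding (chunk_id, cache_key) VALUES (?, ?) "
        "ON CONFLICT(chunk_id) DO UPDATE SET cache_key = excluded.cache_key",
        (chunk_id, key),
    )
    conn.commit()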
Versioning: model swaps invalidate the whole cache
Every embedding model swap invalidates the cache, by definition. The vector you stored against text-embedding-3-small@v1 is meaningless to a reader who is using text-embedding-3-large. Mixing the two in the same retrieval call returns nonsense distances.
Two strategies, depending on how much downtime you can take.
Stop-the-world re-embed. Pick a maintenance window, run a worker that rehashes every chunk under the new (model_name, model_version) and embeds them, swap the index pointer when done. This is the simplest path. For a corpus under a million chunks, it is usually fast enough. Cost depends on vendor pricing; OpenAI's pricing page is the source of truth at the time you plan it.
Dual-write with a flag. During the migration, every cache miss writes both the old-model vector and the new-model vector. The retrieval path is gated by a flag. When the new index has enough coverage, flip the flag, decommission the old rows. Reads stay green throughout. Writes pay double until you flip.
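A sketch of what the dual-write miss path can look like; the OLD and NEW pairs, the READ_FROM_NEW flag, and the store interface are illustrative stand-ins for your own config and storage:

# Illustrative model identifiers and flag; wire these to your own config.
OLD = ("text-embedding-3-small", "v1")
NEW = ("text-embedding-3-large", "v1")
READ_FROM_NEW = False  # flip once the new index has enough coverage


def embed_on_miss(text, store, embed_old, embed_new):
    # During the migration every miss pays for both models, so reads stay
    # valid on either side of the flag.
    for (name, ver), fn in ((OLD, embed_old), (NEW, embed_new)):
        key = cache_key(name, ver, text)
        if not store.exists(key):
            store.put(key, fn([text])[0])
    name, ver = NEW if READ_FROM_NEW else OLD
    return store.get(cache_key(name, ver, text))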
The cache key being a tuple makes both strategies cheap. There is no schema migration. New rows live alongside old rows. The lookup either hits the new key or falls through to a fresh embed call.
The cold-start problem
Day zero, your cache is empty. The first hour of traffic is 100% misses. If your producer fires at 200 chunks per second and the embedding API is the slow path, you have either melted your vendor budget or hit a rate limit and watched the queue back up.
Three patterns work.
Backfill before you cut over. Run an offline backfill job that walks your corpus and populates the cache before any traffic hits it. The job is bounded by your vendor's bulk-rate ceiling; OpenAI's Batch API accepts large request files (the docs list a per-file cap and a turnaround window; check the page for the current numbers) and runs asynchronously at a discounted rate. Whichever vendor you use, check their bulk endpoint and its quotas before you start. The whole point of backfilling is to eat the cost on a discounted, slower lane instead of the live one.
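Whatever lane you use, the job itself has a simple shape. A sketch, with iter_chunks, embed_batch, and RateLimitError as stand-ins for your corpus reader, your vendor's batched embed call, and whatever exception your client actually raises on a rate limit:

import time

BATCH_SIZE = 128        # size to your vendor's per-request limits
BACKOFF_SECONDS = 30    # crude; follow the vendor's retry guidance if it gives one


class RateLimitError(Exception):
    """Placeholder; substitute the exception your client raises."""


def backfill(iter_chunks, store, embed_batch, model, ver):
    batch, keys = [], []
    for text in iter_chunks():
        key = cache_key(model, ver, text)
        if store.exists(key):
            continue                 # already paid for in a previous run
        batch.append(text)
        keys.append(key)
        if len(batch) == BATCH_SIZE:
            _flush(batch, keys, store, embed_batch)
            batch, keys = [], []
    if batch:
        _flush(batch, keys, store, embed_batch)


def _flush(batch, keys, store, embed_batch):
    while True:
        try:
            vectors = embed_batch(batch)
            break
        except RateLimitError:
            time.sleep(BACKOFF_SECONDS)
    for key, vec in zip(keys, vectors):
        store.put(key, vec)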
Throttle on the read path. When a request misses, do not stampede. Insert an in-flight set: if another request for the same key is already mid-embed, wait on its result instead of issuing a duplicate API call. This is the single-flight pattern. Without it, a popular new chunk fires N concurrent embed calls for the same text on its first appearance.
Soft-fail to a smaller local model. If the vendor returns a rate-limit error and the request is user-facing, you can fall back to a self-hosted small model (all-MiniLM-L6-v2, BGE-small, etc.) for that single request and tag the row as "fallback" so it gets rewritten with the canonical vector later. The fallback model lives in a different vector space, so the row is only useful until the canonical embedding lands.
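A sketch of that soft-fail path; the tag keyword on store.put is an assumed extension of the store interface, and the broad except is where your client's real rate-limit error would go:

from sentence_transformers import SentenceTransformer

# Local fallback model; it lives in a different vector space than the
# canonical embedder, so anything it produces is marked for later rewrite.
_fallback = SentenceTransformer("all-MiniLM-L6-v2")


def embed_with_fallback(text, model, ver, store, embed_fn):
    key = cache_key(model, ver, text)
    try:
        vec = embed_fn([text])[0]
        store.put(key, vec)
    except Exception:  # narrow this to your client's rate-limit error
        vec = _fallback.encode([text])[0]
        # Tagged row: a later sweep finds it and replaces it with the
        # canonical vector once the vendor is reachable again.
        store.put(key, vec, tag="fallback")
    return vec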
A single-flight cache over Redis:
import threading

import numpy as np
import redis

R = redis.Redis(decode_responses=False)

# Keys currently being embedded by some thread in this process.
INFLIGHT: dict[bytes, threading.Event] = {}
INFLIGHT_LOCK = threading.Lock()


def encode(vec: np.ndarray) -> bytes:
    return vec.astype(np.float32).tobytes()


def decode(buf: bytes, dim: int) -> np.ndarray:
    return np.frombuffer(buf, dtype=np.float32, count=dim)


def get_or_compute(
    text: str,
    model: str,
    version: str,
    dim: int,
    embed_fn,
):
    key = cache_key(model, version, text).encode()
    cached = R.get(key)
    if cached is not None:
        return decode(cached, dim)

    # Single-flight: the first thread to miss owns the embed call;
    # everyone else waits on its Event and then rereads the cache.
    with INFLIGHT_LOCK:
        ev = INFLIGHT.get(key)
        if ev is None:
            ev = threading.Event()
            INFLIGHT[key] = ev
            owner = True
        else:
            owner = False

    if not owner:
        ev.wait(timeout=30)
        cached = R.get(key)
        if cached is not None:
            return decode(cached, dim)
        raise TimeoutError("inflight owner did not publish in time")

    try:
        vec = embed_fn([text])[0]
        R.set(key, encode(vec))
        return vec
    finally:
        with INFLIGHT_LOCK:
            INFLIGHT.pop(key, None)
        ev.set()
The INFLIGHT map collapses concurrent misses on the same key into one API call. The other waiters block on a threading.Event, then read the row the owner just wrote. The non-owner branch raises on timeout rather than re-issuing its own API call. The whole point of single-flight is to avoid the duplicate request, so a timeout is a degraded path the caller should observe, not paper over.
For a multi-process deployment, replace the in-process map with a Redis lock (SET NX PX ...) on a lock:<key> sentinel. The shape is similar, but you will need a lock TTL and an owner token to release safely, plus a story for what happens when the lock holder dies mid-embed.
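One way that can look, reusing R, encode, decode, and cache_key from the single-flight example above; the lock TTL, poll interval, and wait deadline are illustrative:

import time
import uuid

# Atomic check-and-delete: release the lock only if the token still matches,
# so an expired lock picked up by another process is never deleted by us.
_release = R.register_script("""
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
""")


def get_or_compute_distributed(text: str, model: str, version: str, dim: int, embed_fn):
    key = cache_key(model, version, text).encode()
    cached = R.get(key)
    if cached is not None:
        return decode(cached, dim)

    lock_key = b"lock:" + key
    token = uuid.uuid4().bytes
    # NX: only one process wins the embed; PX: the lock self-expires if the
    # holder dies mid-embed, so waiters are not stuck forever.
    if R.set(lock_key, token, nx=True, px=60_000):
        try:
            vec = embed_fn([text])[0]
            R.set(key, encode(vec))
            return vec
        finally:
            _release(keys=[lock_key], args=[token])

    # Lost the race: poll for the winner's row instead of issuing a duplicate call.
    deadline = time.monotonic() + 30
    while time.monotonic() < deadline:
        cached = R.get(key)
        if cached is not None:
            return decode(cached, dim)
        time.sleep(0.1)
    raise TimeoutError("lock holder did not publish in time")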
What this does to your cost graph
Three numbers move once the cache is in place. The cache hit rate climbs from 0 to whatever your corpus reuse pattern supports; on a stable corpus with predictable user intents, the hit rate can climb high enough that vendor calls become the exception. Vendor cost falls in proportion: the only API calls are misses. P99 retrieval latency drops too — the slow path is gone for the hot keys.
The number that does not move is retrieval quality. The cache is content-addressed, so a hit returns the same vector the API would have returned. As long as the key includes model_version, the cache cannot serve a vector from the wrong model. The thing it changes is the bill.
What to build first
If you have no cache today, the smallest useful thing is a (model, version, hash) -> vector table in whatever store you already run, plus the single-flight wrapper around your embed call. That alone cuts the second-day cost. The event-driven invalidator and the dual-write migration story can wait until the corpus updates start landing.
The wrong order is to build a TTL-based cache first. You will spend a week tuning the TTL, then realize on day eight that a 7-day TTL threw away a perfectly correct vector and paid the vendor to recompute it. Skip that step.
If this was useful
The RAG Pocket Guide covers embedding-pipeline economics end to end: where to cache, how to key it, when to dual-write, and how to back-fill without melting the API. If you are about to swap embedders or backfill a fresh corpus, the chapters on versioning and migration are the ones to read first.
