Wrote up the design behind embcache, a GPU-native two-tier cache for embeddings and KV states.
The problem it solves: embedding caches that key on content hash alone silently return stale vectors after a model upgrade or tokenizer change. The cache looks healthy. The vectors are wrong.
The fix is a composite EmbeddingFingerprint covering model_id, tokenizer hash, chunking strategy, normalization version, prompt template, and dataset version. No partial matches, so no path to a stale hit from a pipeline change.
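The idea can be sketched in a few lines of Python. This is a minimal illustration based on the description above, not the repo's actual API; field names, hashing, and the key format are my assumptions:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class EmbeddingFingerprint:
    # Hypothetical field names mirroring the axes listed in the post.
    model_id: str
    tokenizer_hash: str
    chunking_strategy: str
    normalization_version: str
    prompt_template: str
    dataset_version: str

    def digest(self) -> str:
        # Hash all fields together: changing any one of them yields a
        # different digest, so there are no partial matches.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

def cache_key(fp: EmbeddingFingerprint, content_hash: str) -> str:
    # Key on (fingerprint, content) rather than content alone, so a
    # pipeline change can never resolve to a pre-upgrade vector.
    return f"{fp.digest()}:{content_hash}"
```

With this shape, a tokenizer upgrade changes `tokenizer_hash`, which changes every key, which forces a clean miss instead of a silent stale hit.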
Full writeup with benchmarks (98.3% hit rate, 400-450x speedup on KV cache hits) on Medium: https://bh3r1th.medium.com/the-vector-embedding-cache-bug-that-costs-nothing-and-corrupts-everything-157be6c575e8
Repo: https://github.com/bh3r1th/embcache
Not on PyPI yet. Looking for feedback, especially on whether the fingerprint schema covers all the axes that could cause a stale hit in your pipeline.