Wrote up the design behind embcache, a GPU-native two-tier cache for embeddings and KV states.
The problem it solves: embedding caches that key on content hash alone silently return stale vectors after a model upgrade or tokenizer change. The cache looks healthy. The vectors are wrong.
The fix is a composite EmbeddingFingerprint covering model_id, tokenizer hash, chunking strategy, normalization version, prompt template, and dataset version. No partial matches, so no path to a stale hit from a pipeline change.
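The idea can be sketched in a few lines of Python. This is a minimal illustration based on the description above, not the repo's actual API; field names, hashing, and the key format are my assumptions:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class EmbeddingFingerprint:
    # Hypothetical field names mirroring the axes listed in the post.
    model_id: str
    tokenizer_hash: str
    chunking_strategy: str
    normalization_version: str
    prompt_template: str
    dataset_version: str

    def digest(self) -> str:
        # Hash all fields together: changing any one of them yields a
        # different digest, so there are no partial matches.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

def cache_key(fp: EmbeddingFingerprint, content_hash: str) -> str:
    # Key on (fingerprint, content) rather than content alone, so a
    # pipeline change can never resolve to a pre-upgrade vector.
    return f"{fp.digest()}:{content_hash}"
```

With this shape, a tokenizer upgrade changes `tokenizer_hash`, which changes every key, which forces a clean miss instead of a silent stale hit.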
Full writeup with benchmarks (98.3% hit rate, 400-450x speedup on KV cache hits) on Medium: https://bh3r1th.medium.com/the-vector-embedding-cache-bug-that-costs-nothing-and-corrupts-everything-157be6c575e8
Repo: https://github.com/bh3r1th/embcache
Not on PyPI yet. Looking for feedback, especially on whether the fingerprint schema covers all the axes that could cause a stale hit in your pipeline.