KV Marketplace: Peer-to-Peer Cross-GPU Prefix Caching for LLMs
I've been experimenting with something I haven't really seen explored in the open-source LLM inference world: GPU-to-GPU reuse of prefix KV caches.
Most inference engines (vLLM, TensorRT-LLM, FasterTransformer, etc.) implement prefix caching: if multiple requests begin with the same sequence of tokens, you can reuse the prefill attention states (the K/V tensors) instead of recomputing them. But this caching is local to a single process.
What KV Marketplace Does
KV Marketplace is a small research prototype that:
- Computes a compact hash of each request's completed prefix
- Exports its full KV tensors (all layers) into a lightweight registry
- Lets other processes import these tensors directly over RDMA or NVLink
- Lets importers skip redundant prefill compute and jump straight to decode
Think of it like "memcached for transformer attention states," but GPU-native.
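To make that concrete, here's a rough sketch of the kind of registry interface involved. The names (`PrefixRegistry`, `KVEntry`, `export_kv`, `import_kv`) are illustrative stand-ins, not the actual API in the repo:

```python
# Illustrative sketch only -- names and signatures are hypothetical,
# not the actual kv-marketplace API.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class KVEntry:
    """KV tensors for one prefix: one (K, V) pair per layer, resident on a GPU."""
    device: torch.device
    layers: list[tuple[torch.Tensor, torch.Tensor]]


class PrefixRegistry:
    """Process-local map from prefix hash -> exported KV tensors."""

    def __init__(self) -> None:
        self._entries: dict[str, KVEntry] = {}

    def export_kv(self, prefix_hash: str, entry: KVEntry) -> None:
        # Publish the prefill's KV tensors so peers can adopt them.
        self._entries[prefix_hash] = entry

    def import_kv(self, prefix_hash: str, dst_device: torch.device) -> Optional[KVEntry]:
        # On a hit, copy each layer's K/V to the requesting GPU. With peer
        # access enabled, .to() between CUDA devices is a direct GPU-to-GPU
        # copy (NVLink/PCIe) rather than a round trip through host memory.
        src = self._entries.get(prefix_hash)
        if src is None:
            return None
        layers = [(k.to(dst_device, non_blocking=True),
                   v.to(dst_device, non_blocking=True))
                  for k, v in src.layers]
        return KVEntry(device=dst_device, layers=layers)
```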
Why This Matters
On multi-GPU inference servers, shared prefixes happen constantly in:
- Chatbot deployments with common system prompts
- Multi-tenant apps using the same templates
- Agents, tool-use loops, and memory-augmented workloads
- Benchmarking pipelines that rerun similar sequences
Yet most systems recompute the same prefix every time, per replica.
With KV Marketplace, under optimistic conditions you can expect:
- ~15% lower end-to-end latency
- Higher throughput at the same batch size
- Zero changes to model weights or architecture
- Zero CPU load increase
These are early results (the prototype is intentionally minimal), but the direction is promising.
How It Works (High Level)
When a request completes its prefill phase:
- Token sequence → hashed (xxh3_64 + model version)
- KV tensors → exported into a GPU registry
- Another process with the same prefix:
  - Looks up the hash
  - Pulls the KV tensors via P2P
  - Adopts them before decode
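A minimal sketch of the hashing step, assuming the Python `xxhash` package; the exact key layout in the repo may differ:

```python
# Sketch of the prefix-hashing idea, assuming the `xxhash` package.
# The real key layout in kv-marketplace may differ.
import xxhash


def prefix_key(token_ids: list[int], model_version: str) -> str:
    """Hash the completed prefix together with the model version so caches
    from different weights/tokenizers can never collide."""
    h = xxhash.xxh3_64()
    h.update(model_version.encode("utf-8"))
    h.update(b"\x00")  # separator so version bytes and token bytes can't blur
    for t in token_ids:
        h.update(t.to_bytes(4, "little", signed=False))
    return h.hexdigest()


# Two processes serving the same model and the same system prompt compute
# the same key and can share the exported KV tensors. Token IDs here are
# arbitrary example values.
key = prefix_key([151644, 8948, 1699], model_version="llama-3-8b@fp16")
```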
The prototype supports:
- A pluggable hashing scheme
- Type/shape checking between peers
- vLLM integration hooks for before/after prefill
- CUDA P2P paths (NVLink or RDMA, depending on hardware)
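Putting the hooks together, roughly. The function below builds on the registry and hashing sketches above; `run_prefill`, `adopt_kv`, and `run_decode` are hypothetical stand-ins for the engine's own code paths, not actual vLLM or kv-marketplace functions:

```python
# Sketch of where the before/after-prefill hooks sit, built on the
# PrefixRegistry / prefix_key sketches above. The callables are
# hypothetical stand-ins for the engine's own prefill/decode paths.
from typing import Callable

import torch


def run_with_prefix_reuse(
    token_ids: list[int],
    model_version: str,
    registry: "PrefixRegistry",
    device: torch.device,
    run_prefill: Callable[[list[int]], "KVEntry"],
    adopt_kv: Callable[["KVEntry"], None],
    run_decode: Callable[[], str],
) -> str:
    key = prefix_key(token_ids, model_version)

    # Before-prefill hook: did a peer already export this prefix?
    entry = registry.import_kv(key, dst_device=device)
    if entry is not None:
        # Shape/dtype checks against the local cache layout belong here.
        adopt_kv(entry)                 # write K/V into this worker's cache
    else:
        kv = run_prefill(token_ids)     # pay the prefill cost once...
        registry.export_kv(key, kv)     # after-prefill hook: publish for peers

    return run_decode()                 # decode proceeds identically either way
```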
Limitations (for now)
This is not production-ready. It does not include:
- Distributed registry (currently process-local)
- Eviction / TTL
- Version negotiation across heterogeneous clusters
- CPU/disk tiers
- Fault tolerance
- Security / isolation
It's deliberately simple. The goal is to provide a minimal, hackable prototype of fast-path P2P prefix reuse.
If you're building on top of vLLM or working on distributed inference systems, this might spark ideas or serve as a baseline to build a smarter tiered cache.
Early Results
With perfect prefix hits on multi-GPU setups:
- ~15% latency reduction in end-to-end generation
- Helps both long-prefill and short-prefill workloads
- Overhead dominated by a single P2P memcpy
In practice, real-world gains depend on template reuse and batch composition.
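If you want a rough feel for what that single P2P copy costs on your own hardware, a quick PyTorch micro-benchmark along these lines works; the tensor shapes are arbitrary examples, not the repo's actual layout:

```python
# Rough micro-benchmark of a cross-GPU KV copy on the fast path.
# Shapes are arbitrary examples (32 layers, K+V, 8 heads, 2048 tokens,
# head_dim 128, fp16 ~= 256 MiB total), not the repo's layout.
import time

import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

kv = [torch.randn(2, 8, 2048, 128, dtype=torch.float16, device="cuda:0")
      for _ in range(32)]

torch.cuda.synchronize(0)
t0 = time.perf_counter()
moved = [t.to("cuda:1") for t in kv]   # device-to-device copy; P2P if enabled
torch.cuda.synchronize(1)
elapsed_ms = (time.perf_counter() - t0) * 1e3

total_mib = sum(t.nbytes for t in kv) / 2**20
print(f"copied {total_mib:.0f} MiB in {elapsed_ms:.2f} ms")
```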
Try It Out
The repo includes:
- vLLM integration hooks
- Simple CUDA P2P copy utilities
- Minimal test harness
- Logging utilities to observe prefix reuse behavior
Repo: https://github.com/neelsomani/kv-marketplace
Why I Built This
As LLM applications become more agentic and multi-tenant, prefix locality increases. We're going to need better caching primitives - fast ones, distributed ones, and eventually learned ones.
KV Marketplace is one small piece of that future. If you're working on inference engines, caching systems, or GPU transport layers, I'd love to hear your thoughts or collaborate.