KV Marketplace: Peer-to-Peer Cross-GPU Prefix Caching for LLMs
I've been experimenting with something I haven't really seen explored in the open-source LLM inference world: GPU-to-GPU reuse of prefix KV caches.
Most inference engines (vLLM, TensorRT-LLM, FasterTransformer, etc.) implement prefix caching: if multiple requests begin with the same sequence of tokens, you can reuse the prefill attention states (the K/V tensors) instead of recomputing them. But this caching is local to a single process.
What KV Marketplace Does
KV Marketplace is a small research prototype that:
- Computes a compact hash of each request's completed prefix
- Exports its full KV tensors (all layers) into a lightweight registry
- Lets other processes import these tensors directly over RDMA or NVLink
- Lets importers skip redundant prefill compute and jump straight to decode
Think of it like "memcached for transformer attention states," but GPU-native.
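To make that concrete, here's a rough sketch of the kind of registry interface involved. The names (`PrefixRegistry`, `KVEntry`, `export_kv`, `import_kv`) are illustrative stand-ins, not the actual API in the repo:

```python
# Illustrative sketch only -- names and signatures are hypothetical,
# not the actual kv-marketplace API.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class KVEntry:
    """KV tensors for one prefix: one (K, V) pair per layer, resident on a GPU."""
    device: torch.device
    layers: list[tuple[torch.Tensor, torch.Tensor]]


class PrefixRegistry:
    """Process-local map from prefix hash -> exported KV tensors."""

    def __init__(self) -> None:
        self._entries: dict[str, KVEntry] = {}

    def export_kv(self, prefix_hash: str, entry: KVEntry) -> None:
        # Publish the prefill's KV tensors so peers can adopt them.
        self._entries[prefix_hash] = entry

    def import_kv(self, prefix_hash: str, dst_device: torch.device) -> Optional[KVEntry]:
        # On a hit, copy each layer's K/V to the requesting GPU. With peer
        # access enabled, .to() between CUDA devices is a direct GPU-to-GPU
        # copy (NVLink/PCIe) rather than a round trip through host memory.
        src = self._entries.get(prefix_hash)
        if src is None:
            return None
        layers = [(k.to(dst_device, non_blocking=True),
                   v.to(dst_device, non_blocking=True))
                  for k, v in src.layers]
        return KVEntry(device=dst_device, layers=layers)
```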
Why This Matters
On multi-GPU inference servers, shared prefixes happen constantly in:
- Chatbot deployments with common system prompts
- Multi-tenant apps using the same templates
- Agents, tool-use loops, and memory-augmented workloads
- Benchmarking pipelines that rerun similar sequences
Yet most systems recompute the same prefix every time, per replica.
With KV Marketplace, under optimistic conditions you can expect:
- ~15% lower end-to-end latency
- Higher throughput at the same batch size
- Zero changes to model weights or architecture
- Zero CPU load increase
These are early results (the prototype is intentionally minimal), but the direction is promising.
How It Works (High Level)
When a request completes its prefill phase:
- Token sequence → hashed (xxh3_64 + model version)
- KV tensors → exported into a GPU registry
- Another process with the same prefix:
  - Looks up the hash
  - Pulls the KV tensors via P2P
  - Adopts them before decode
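A minimal sketch of the hashing step, assuming the Python `xxhash` package; the exact key layout in the repo may differ:

```python
# Sketch of the prefix-hashing idea, assuming the `xxhash` package.
# The real key layout in kv-marketplace may differ.
import xxhash


def prefix_key(token_ids: list[int], model_version: str) -> str:
    """Hash the completed prefix together with the model version so caches
    from different weights/tokenizers can never collide."""
    h = xxhash.xxh3_64()
    h.update(model_version.encode("utf-8"))
    h.update(b"\x00")  # separator so version bytes and token bytes can't blur
    for t in token_ids:
        h.update(t.to_bytes(4, "little", signed=False))
    return h.hexdigest()


# Two processes serving the same model and the same system prompt compute
# the same key and can share the exported KV tensors. Token IDs here are
# arbitrary example values.
key = prefix_key([151644, 8948, 1699], model_version="llama-3-8b@fp16")
```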
The prototype supports:
- A pluggable hashing scheme
- Type/shape checking between peers
- vLLM integration hooks for before/after prefill
- CUDA P2P paths (NVLink or RDMA, depending on hardware)
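Putting the hooks together, roughly. The function below builds on the registry and hashing sketches above; `run_prefill`, `adopt_kv`, and `run_decode` are hypothetical stand-ins for the engine's own code paths, not actual vLLM or kv-marketplace functions:

```python
# Sketch of where the before/after-prefill hooks sit, built on the
# PrefixRegistry / prefix_key sketches above. The callables are
# hypothetical stand-ins for the engine's own prefill/decode paths.
from typing import Callable

import torch


def run_with_prefix_reuse(
    token_ids: list[int],
    model_version: str,
    registry: "PrefixRegistry",
    device: torch.device,
    run_prefill: Callable[[list[int]], "KVEntry"],
    adopt_kv: Callable[["KVEntry"], None],
    run_decode: Callable[[], str],
) -> str:
    key = prefix_key(token_ids, model_version)

    # Before-prefill hook: did a peer already export this prefix?
    entry = registry.import_kv(key, dst_device=device)
    if entry is not None:
        # Shape/dtype checks against the local cache layout belong here.
        adopt_kv(entry)                 # write K/V into this worker's cache
    else:
        kv = run_prefill(token_ids)     # pay the prefill cost once...
        registry.export_kv(key, kv)     # after-prefill hook: publish for peers

    return run_decode()                 # decode proceeds identically either way
```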
Limitations (for now)
This is not production-ready. It does not include:
- Distributed registry (currently process-local)
- Eviction / TTL
- Version negotiation across heterogeneous clusters
- CPU/disk tiers
- Fault tolerance
- Security / isolation
It's deliberately simple. The goal is to provide a minimal, hackable prototype of fast-path P2P prefix reuse.
If you're building on top of vLLM or working on distributed inference systems, this might spark ideas or serve as a baseline to build a smarter tiered cache.
Early Results
With perfect prefix hits on multi-GPU setups:
- ~15% latency reduction in end-to-end generation
- Helps both long-prefill and short-prefill workloads
- Overhead dominated by a single P2P memcpy
In practice, real-world gains depend on template reuse and batch composition.
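If you want a rough feel for what that single P2P copy costs on your own hardware, a quick PyTorch micro-benchmark along these lines works; the tensor shapes are arbitrary examples, not the repo's actual layout:

```python
# Rough micro-benchmark of a cross-GPU KV copy on the fast path.
# Shapes are arbitrary examples (32 layers, K+V, 8 heads, 2048 tokens,
# head_dim 128, fp16 ~= 256 MiB total), not the repo's layout.
import time

import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

kv = [torch.randn(2, 8, 2048, 128, dtype=torch.float16, device="cuda:0")
      for _ in range(32)]

torch.cuda.synchronize(0)
t0 = time.perf_counter()
moved = [t.to("cuda:1") for t in kv]   # device-to-device copy; P2P if enabled
torch.cuda.synchronize(1)
elapsed_ms = (time.perf_counter() - t0) * 1e3

total_mib = sum(t.nbytes for t in kv) / 2**20
print(f"copied {total_mib:.0f} MiB in {elapsed_ms:.2f} ms")
```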
Try It Out
The repo includes:
- vLLM integration hooks
- Simple CUDA P2P copy utilities
- Minimal test harness
- Logging utilities to observe prefix reuse behavior
Repo: https://github.com/neelsomani/kv-marketplace
Why I Built This
As LLM applications become more agentic and multi-tenant, prefix locality increases. We're going to need better caching primitives - fast ones, distributed ones, and eventually learned ones.
KV Marketplace is one small piece of that future. If you're working on inference engines, caching systems, or GPU transport layers, I'd love to hear your thoughts or collaborate.