Neel Somani

KV Marketplace: Peer-to-Peer Cross-GPU Prefix Caching for LLMs

I've been experimenting with something I haven't really seen explored in the open-source LLM inference world: GPU-to-GPU reuse of prefix KV caches.

Most inference engines (vLLM, TensorRT-LLM, FasterTransformer, etc.) implement prefix caching: if multiple requests begin with the same sequence of tokens, you can reuse the prefill attention states (the K/V tensors) instead of recomputing them. But this caching is local to a single process.

What KV Marketplace Does

KV Marketplace is a small research prototype that:

  • Computes a compact hash of each request's completed prefix
  • Exports its full KV tensors (all layers) into a lightweight registry
  • Lets other processes import these tensors directly over RDMA or NVLink
  • Skips redundant prefill compute and jumps straight into decode

Think of it like "memcached for transformer attention states," but GPU-native.
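
To make that concrete, here's a rough sketch of what the export/import flow could look like. The names (`KVRegistry`, `export_prefix`, `import_prefix`) and the tensor layout are illustrative assumptions, not the repo's actual API:

```python
# Illustrative sketch only; names and shapes are hypothetical, not the
# actual kv-marketplace API.
from dataclasses import dataclass
from typing import Dict, List, Optional

import torch


@dataclass
class KVEntry:
    device: int                # GPU that owns the exported tensors
    kv: List[torch.Tensor]     # one stacked K/V tensor per layer


class KVRegistry:
    """Maps prefix hashes to KV tensors exported by some GPU."""

    def __init__(self) -> None:
        self._entries: Dict[str, KVEntry] = {}

    def export_prefix(self, prefix_hash: str, kv: List[torch.Tensor], device: int) -> None:
        # Publisher side: after prefill completes, register the per-layer KV tensors.
        self._entries[prefix_hash] = KVEntry(device=device, kv=kv)

    def import_prefix(self, prefix_hash: str, target_device: int) -> Optional[List[torch.Tensor]]:
        # Consumer side: on a hit, copy each layer's KV onto the requesting GPU.
        # With CUDA peer access enabled this is a direct GPU-to-GPU copy
        # (NVLink/PCIe P2P) rather than a round trip through host memory.
        entry = self._entries.get(prefix_hash)
        if entry is None:
            return None
        return [t.to(f"cuda:{target_device}", non_blocking=True) for t in entry.kv]
```

A real implementation also needs the dtype/shape checks and eviction policy discussed later in the post.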

Why This Matters

On multi-GPU inference servers, shared prefixes happen constantly in:

  • Chatbot deployments with common system prompts
  • Multi-tenant apps using the same templates
  • Agents, tool-use loops, and memory-augmented workloads
  • Benchmarking pipelines that rerun similar sequences

Yet most systems recompute the same prefix every time, per replica.

With KV Marketplace, under optimistic conditions (near-perfect prefix hit rates) you get:

  • ~15% lower end-to-end latency
  • Higher throughput at the same batch size
  • Zero changes to model weights or architecture
  • Zero CPU load increase

These are early results (the prototype is intentionally minimal), but the direction is promising.

How It Works (High Level)

When a request completes its prefill phase:

  1. Token sequence → hashed (xxh3_64 + model version); a hashing sketch follows this list
  2. KV tensors → exported into a GPU registry
  3. Another process with the same prefix:
  • Looks up the hash
  • Pulls KV tensors via P2P
  • Adopts them before decode
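
As a concrete illustration of step 1, a prefix key can be built from the token IDs plus a model/version tag using the `xxhash` package, which exposes xxh3_64. The exact fields kv-marketplace folds into the hash may differ; this is a sketch of the idea:

```python
# Sketch of step 1: derive a prefix key from token IDs + model version.
# The token IDs and version tag below are arbitrary examples.
from typing import Sequence

import xxhash


def prefix_hash(token_ids: Sequence[int], model_version: str) -> str:
    h = xxhash.xxh3_64()
    h.update(model_version.encode("utf-8"))
    for tid in token_ids:
        h.update(tid.to_bytes(4, "little"))  # fixed-width encoding keeps the key unambiguous
    return h.hexdigest()


# Two requests that share the same prompt prefix produce the same key,
# so the second one can look up (and import) the first one's KV tensors.
key_a = prefix_hash([1, 15043, 3186, 29889], "example-model@fp16")
key_b = prefix_hash([1, 15043, 3186, 29889], "example-model@fp16")
assert key_a == key_b
```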

The prototype supports:

  • A pluggable hashing scheme
  • Type/shape checking between peers
  • vLLM integration hooks for before/after prefill
  • CUDA P2P paths (NVLink and RDMA, depending on hardware); a minimal copy sketch follows this list
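
On the transport side, the NVLink path boils down to an ordinary cross-device copy once CUDA peer access is available. Here's a minimal timing sketch in PyTorch; it's not the repo's copy utility, and it assumes two visible GPUs plus an example KV block shape:

```python
# Minimal sketch of a GPU-to-GPU KV copy; not the repo's copy utility.
import time

import torch

src_dev, dst_dev = 0, 1
assert torch.cuda.device_count() >= 2

# Direct device-to-device copies need peer access between the two GPUs
# (NVLink or PCIe P2P); without it the copy is staged through host memory.
print("P2P access:", torch.cuda.can_device_access_peer(dst_dev, src_dev))

# Stand-in for one layer's exported KV block: K/V stacked, 32 heads,
# a 4096-token prefix, head_dim 128, fp16 (example shape only).
kv_block = torch.randn(2, 32, 4096, 128, dtype=torch.float16, device=f"cuda:{src_dev}")

torch.cuda.synchronize(src_dev)
t0 = time.perf_counter()
kv_on_peer = kv_block.to(f"cuda:{dst_dev}")
torch.cuda.synchronize(src_dev)
torch.cuda.synchronize(dst_dev)

mib = kv_block.numel() * kv_block.element_size() / 2**20
print(f"Copied {mib:.0f} MiB in {(time.perf_counter() - t0) * 1e3:.2f} ms")
```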

Limitations (for now)

This is not production-ready. It does not include:

  • Distributed registry (currently process-local)
  • Eviction / TTL
  • Version negotiation across heterogeneous clusters
  • CPU/disk tiers
  • Fault tolerance
  • Security / isolation

It's deliberately simple. The goal is to provide a minimal, hackable prototype of fast-path P2P prefix reuse.

If you're building on top of vLLM or working on distributed inference systems, this might spark ideas or serve as a baseline to build a smarter tiered cache.

Early Results

With perfect prefix hits on multi-GPU setups:

  • ~15% latency reduction in end-to-end generation
  • Helps both long-prefill and short-prefill models
  • Overhead dominated by a single P2P memcpy

In practice, real-world gains depend on template reuse and batch composition; the back-of-envelope sketch below gives a feel for the transfer cost.
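
For intuition on why that memcpy is cheap relative to recomputing prefill, here's a rough sizing. The model shape (32 layers, 32 KV heads, head_dim 128, fp16) and the ~300 GB/s effective NVLink bandwidth are illustrative assumptions:

```python
# Back-of-envelope: size of a prefix KV cache and its P2P transfer time.
# All numbers below are example assumptions, not measurements.
layers, kv_heads, head_dim = 32, 32, 128
dtype_bytes = 2            # fp16
prefix_tokens = 1000

# K and V for every token in the prefix, across all layers.
bytes_per_token = 2 * kv_heads * head_dim * dtype_bytes * layers
total_bytes = bytes_per_token * prefix_tokens

nvlink_bytes_per_s = 300e9  # assumed effective NVLink bandwidth
print(f"KV cache: {total_bytes / 2**20:.0f} MiB, "
      f"~{total_bytes / nvlink_bytes_per_s * 1e3:.2f} ms to copy")
# -> about 500 MiB and ~2 ms, which is small next to recomputing the
#    prefill for that prefix.
```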

Try It Out

The repo includes:

  • vLLM integration hooks
  • Simple CUDA P2P copy utilities
  • Minimal test harness
  • Logging utilities to observe prefix reuse behavior

Repo: https://github.com/neelsomani/kv-marketplace

Why I Built This

As LLM applications become more agentic and multi-tenant, prefix locality increases. We're going to need better caching primitives: fast ones, distributed ones, and eventually learned ones.

KV Marketplace is one small piece of that future. If you're working on inference engines, caching systems, or GPU transport layers, I'd love to hear your thoughts or collaborate.
