<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neel Somani</title>
    <description>The latest articles on DEV Community by Neel Somani (@neeljsomani).</description>
    <link>https://dev.to/neeljsomani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F66488%2F2069488a-8688-435c-8d04-359885c7df32.jpg</url>
      <title>DEV Community: Neel Somani</title>
      <link>https://dev.to/neeljsomani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neeljsomani"/>
    <language>en</language>
    <item>
      <title>KV Marketplace: A Cross-GPU KV Cache</title>
      <dc:creator>Neel Somani</dc:creator>
      <pubDate>Wed, 12 Nov 2025 22:12:00 +0000</pubDate>
      <link>https://dev.to/neeljsomani/kv-marketplace-a-cross-gpu-kv-cache-262b</link>
      <guid>https://dev.to/neeljsomani/kv-marketplace-a-cross-gpu-kv-cache-262b</guid>
      <description>&lt;h1&gt;
  
  
  KV Marketplace: Peer-to-Peer Cross-GPU Prefix Caching for LLMs
&lt;/h1&gt;

&lt;p&gt;I've been experimenting with something I haven't really seen explored in the open-source LLM inference world: GPU-to-GPU reuse of prefix KV caches.&lt;/p&gt;

&lt;p&gt;Most inference engines (vLLM, TensorRT-LLM, FasterTransformer, etc.) implement prefix caching: if multiple requests begin with the same sequence of tokens, you can reuse the prefill attention states (the K/V tensors) instead of recomputing them. But this caching is local to a single process.&lt;/p&gt;

&lt;h2&gt;What KV Marketplace Does&lt;/h2&gt;

&lt;p&gt;KV Marketplace is a small research prototype that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computes a compact hash of each request's completed prefix&lt;/li&gt;
&lt;li&gt;Exports its full KV tensors (all layers) into a lightweight registry&lt;/li&gt;
&lt;li&gt;Lets other processes import these tensors directly over RDMA or NVLink&lt;/li&gt;
&lt;li&gt;Skips redundant prefill compute and jumps straight into decode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as "memcached for transformer attention states," but GPU-native.&lt;/p&gt;
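&lt;p&gt;The export/lookup flow above can be sketched as a small process-local registry. This is a hypothetical illustration, not the prototype's actual code: <code>blake2b</code> stands in for xxh3_64, and plain Python objects stand in for GPU tensor handles.&lt;/p&gt;

```python
import hashlib

class KVRegistry:
    """Minimal, process-local sketch of a prefix KV registry.

    Hypothetical: the real prototype keys entries with xxh3_64 and stores
    GPU tensor handles; here blake2b and plain Python objects stand in.
    """

    def __init__(self):
        self._entries = {}

    @staticmethod
    def prefix_key(token_ids, model_version):
        # Hash the completed prefix together with the model version so that
        # caches produced by different model builds can never collide.
        payload = f"{model_version}:{','.join(map(str, token_ids))}".encode()
        return hashlib.blake2b(payload, digest_size=8).hexdigest()

    def export(self, token_ids, model_version, kv_handle):
        """Publish the KV tensors for a completed prefill."""
        self._entries[self.prefix_key(token_ids, model_version)] = kv_handle

    def lookup(self, token_ids, model_version):
        """Return the cached handle on a hit, or None to force local prefill."""
        return self._entries.get(self.prefix_key(token_ids, model_version))
```

&lt;p&gt;A miss simply returns &lt;code&gt;None&lt;/code&gt;, so the caller can fall back to computing prefill locally.&lt;/p&gt;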

&lt;h2&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;On multi-GPU inference servers, shared prefixes happen constantly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbot deployments with common system prompts&lt;/li&gt;
&lt;li&gt;Multi-tenant apps using the same templates&lt;/li&gt;
&lt;li&gt;Agents, tool-use loops, and memory-augmented workloads&lt;/li&gt;
&lt;li&gt;Benchmarking pipelines that rerun similar sequences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet most systems recompute the same prefix every time, per replica.&lt;/p&gt;

&lt;p&gt;Under optimistic conditions, KV Marketplace offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~15% lower end-to-end latency&lt;/li&gt;
&lt;li&gt;Higher throughput at the same batch size&lt;/li&gt;
&lt;li&gt;Zero changes to model weights or architecture&lt;/li&gt;
&lt;li&gt;Zero CPU load increase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are early results (the prototype is intentionally minimal), but the direction is promising.&lt;/p&gt;

&lt;h2&gt;How It Works (High Level)&lt;/h2&gt;

&lt;p&gt;When a request completes its prefill phase:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Token sequence → hashed (xxh3_64 + model version)&lt;/li&gt;
&lt;li&gt;KV tensors → exported into a GPU registry&lt;/li&gt;
&lt;li&gt;Another process with the same prefix:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Looks up the hash&lt;/li&gt;
&lt;li&gt;Pulls KV tensors via P2P&lt;/li&gt;
&lt;li&gt;Adopts them before decode&lt;/li&gt;
&lt;/ul&gt;
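&lt;p&gt;The import path above boils down to one decision per request: adopt a peer's tensors on a hit, or compute and publish on a miss. Here is a minimal sketch; &lt;code&gt;p2p_copy&lt;/code&gt; and &lt;code&gt;run_prefill&lt;/code&gt; are hypothetical stand-ins for the CUDA P2P transfer utility and the engine's prefill kernel, not real vLLM APIs.&lt;/p&gt;

```python
def adopt_or_prefill(registry, token_ids, p2p_copy, run_prefill):
    """Fast path: adopt a peer's KV tensors on a prefix hit, else prefill.

    `registry` maps prefix keys to exported KV handles. `p2p_copy` and
    `run_prefill` are injected stand-ins (assumptions, not real APIs).
    """
    key = tuple(token_ids)           # stand-in for the xxh3_64 prefix hash
    handle = registry.get(key)
    if handle is not None:
        return p2p_copy(handle)      # pull KV tensors from the owning GPU
    kv = run_prefill(token_ids)      # miss: compute prefill locally...
    registry[key] = kv               # ...and publish it for other peers
    return kv
```

&lt;p&gt;With this shape, the hot path adds only a hash lookup plus a single P2P copy on a hit.&lt;/p&gt;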

&lt;p&gt;The prototype supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pluggable hashing scheme&lt;/li&gt;
&lt;li&gt;Type/shape checking between peers&lt;/li&gt;
&lt;li&gt;vLLM integration hooks for before/after prefill&lt;/li&gt;
&lt;li&gt;CUDA P2P paths (NVLink and RDMA depending on hardware)&lt;/li&gt;
&lt;/ul&gt;
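&lt;p&gt;The type/shape checking mentioned above can be as simple as comparing a small metadata record that each peer advertises alongside its exported tensors. The schema here is an assumption for illustration, not the prototype's actual wire format.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVMeta:
    """Metadata a peer advertises with exported KV tensors (hypothetical schema)."""
    num_layers: int
    num_heads: int
    head_dim: int
    dtype: str           # e.g. "float16"
    model_version: str

def compatible(local: KVMeta, remote: KVMeta) -> bool:
    # Adopting remote KV tensors only makes sense when both peers run the
    # same model build with identical attention geometry and precision;
    # on any mismatch the importer falls back to computing prefill locally.
    return local == remote
```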

&lt;h2&gt;Limitations (for now)&lt;/h2&gt;

&lt;p&gt;This is not production-ready. It does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed registry (currently process-local)&lt;/li&gt;
&lt;li&gt;Eviction / TTL&lt;/li&gt;
&lt;li&gt;Version negotiation across heterogeneous clusters&lt;/li&gt;
&lt;li&gt;CPU/disk tiers&lt;/li&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;li&gt;Security / isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's deliberately simple. The goal is to provide a minimal, hackable prototype of fast-path P2P prefix reuse.&lt;/p&gt;

&lt;p&gt;If you're building on top of vLLM or working on distributed inference systems, this might spark ideas or serve as a baseline to build a smarter tiered cache.&lt;/p&gt;

&lt;h2&gt;Early Results&lt;/h2&gt;

&lt;p&gt;With perfect prefix hits on multi-GPU setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~15% latency reduction in end-to-end generation&lt;/li&gt;
&lt;li&gt;Helps both long-prefill and short-prefill models&lt;/li&gt;
&lt;li&gt;Overhead dominated by a single P2P memcpy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, real-world gains depend on template reuse and batch composition.&lt;/p&gt;

&lt;h2&gt;Try It Out&lt;/h2&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vLLM integration hooks&lt;/li&gt;
&lt;li&gt;Simple CUDA P2P copy utilities&lt;/li&gt;
&lt;li&gt;Minimal test harness&lt;/li&gt;
&lt;li&gt;Logging utilities to observe prefix reuse behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/neelsomani/kv-marketplace" rel="noopener noreferrer"&gt;https://github.com/neelsomani/kv-marketplace&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why I Built This&lt;/h2&gt;

&lt;p&gt;As LLM applications become more agentic and multi-tenant, prefix locality increases. We're going to need better caching primitives: fast ones, distributed ones, and eventually learned ones.&lt;/p&gt;

&lt;p&gt;KV Marketplace is one small piece of that future. If you're working on inference engines, caching systems, or GPU transport layers, I'd love to hear your thoughts or collaborate.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>inference</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Verifying CUDA Kernels in Coq with Rust MIR (Introducing cuq)</title>
      <dc:creator>Neel Somani</dc:creator>
      <pubDate>Thu, 23 Oct 2025 01:51:00 +0000</pubDate>
      <link>https://dev.to/neeljsomani/verifying-cuda-kernels-in-coq-with-rust-mir-introducing-cuq-2b21</link>
      <guid>https://dev.to/neeljsomani/verifying-cuda-kernels-in-coq-with-rust-mir-introducing-cuq-2b21</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;GPU kernels are notoriously difficult to reason about formally. While prior work such as Lustig et al. (ASPLOS 2019) defined a Coq model of the PTX memory semantics, there has not been a practical toolchain connecting high-level kernel code to verifiable proofs.&lt;/p&gt;

&lt;p&gt;cuq is an experimental Rust-to-Coq pipeline that lets you write CUDA kernels in safe Rust and automatically generate Coq representations of their execution traces. Each MIR instruction is translated into Coq terms compatible with the PTX memory model, allowing proofs of correctness and safety properties.&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MIR Extraction&lt;/strong&gt; — cuq operates on Rust's Mid-level Intermediate Representation (MIR), extracted from the Rust compiler.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIR to Coq Translation&lt;/strong&gt; — the compiler pass encodes MIR statements into Coq syntax.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with PTX Model&lt;/strong&gt; — generated Coq files import the formal PTX semantics from Lustig et al.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt; — developers can reason about kernel safety and memory consistency directly in Coq.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Significance&lt;/h2&gt;

&lt;p&gt;By connecting Rust's ownership and type guarantees with Coq's formal proof system, cuq provides a foundation for provably safe GPU programming. It demonstrates that performance-oriented languages and proof-based verification can coexist in a single toolchain.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/neelsomani/cuq" rel="noopener noreferrer"&gt;https://github.com/neelsomani/cuq&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>coq</category>
      <category>cuda</category>
      <category>formalverification</category>
    </item>
  </channel>
</rss>
