<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neel Somani</title>
    <description>The latest articles on DEV Community by Neel Somani (@neeljsomani).</description>
    <link>https://dev.to/neeljsomani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F66488%2F2069488a-8688-435c-8d04-359885c7df32.jpg</url>
      <title>DEV Community: Neel Somani</title>
      <link>https://dev.to/neeljsomani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neeljsomani"/>
    <language>en</language>
    <item>
      <title>KV Marketplace: A Cross-GPU KV Cache</title>
      <dc:creator>Neel Somani</dc:creator>
      <pubDate>Wed, 12 Nov 2025 22:12:00 +0000</pubDate>
      <link>https://dev.to/neeljsomani/kv-marketplace-a-cross-gpu-kv-cache-262b</link>
      <guid>https://dev.to/neeljsomani/kv-marketplace-a-cross-gpu-kv-cache-262b</guid>
      <description>&lt;h1&gt;
  
  
  KV Marketplace: Peer-to-Peer Cross-GPU Prefix Caching for LLMs
&lt;/h1&gt;

&lt;p&gt;I've been experimenting with something I haven't really seen explored in the open-source LLM inference world: GPU-to-GPU reuse of prefix KV caches.&lt;/p&gt;

&lt;p&gt;Most inference engines (vLLM, TensorRT-LLM, FasterTransformer, etc.) implement prefix caching: if multiple requests begin with the same sequence of tokens, you can reuse the prefill attention states (the K/V tensors) instead of recomputing them. But this caching is local to a single process.&lt;/p&gt;

&lt;h2&gt;What KV Marketplace Does&lt;/h2&gt;

&lt;p&gt;KV Marketplace is a small research prototype that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computes a compact hash of each request's completed prefix&lt;/li&gt;
&lt;li&gt;Exports its full KV tensors (all layers) into a lightweight registry&lt;/li&gt;
&lt;li&gt;Lets other processes import these tensors directly over RDMA or NVLink&lt;/li&gt;
&lt;li&gt;Skips redundant prefill compute and jumps straight into decode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as "memcached for transformer attention states," but GPU-native.&lt;/p&gt;
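&lt;p&gt;The export/lookup flow above can be sketched as a small process-local registry. This is a hypothetical illustration, not the prototype's actual code: <code>blake2b</code> stands in for xxh3_64, and plain Python objects stand in for GPU tensor handles.&lt;/p&gt;

```python
import hashlib

class KVRegistry:
    """Minimal, process-local sketch of a prefix KV registry.

    Hypothetical: the real prototype keys entries with xxh3_64 and stores
    GPU tensor handles; here blake2b and plain Python objects stand in.
    """

    def __init__(self):
        self._entries = {}

    @staticmethod
    def prefix_key(token_ids, model_version):
        # Hash the completed prefix together with the model version so that
        # caches produced by different model builds can never collide.
        payload = f"{model_version}:{','.join(map(str, token_ids))}".encode()
        return hashlib.blake2b(payload, digest_size=8).hexdigest()

    def export(self, token_ids, model_version, kv_handle):
        """Publish the KV tensors for a completed prefill."""
        self._entries[self.prefix_key(token_ids, model_version)] = kv_handle

    def lookup(self, token_ids, model_version):
        """Return the cached handle on a hit, or None to force local prefill."""
        return self._entries.get(self.prefix_key(token_ids, model_version))
```

&lt;p&gt;A miss simply returns &lt;code&gt;None&lt;/code&gt;, so the caller can fall back to computing prefill locally.&lt;/p&gt;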

&lt;h2&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;On multi-GPU inference servers, shared prefixes happen constantly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbot deployments with common system prompts&lt;/li&gt;
&lt;li&gt;Multi-tenant apps using the same templates&lt;/li&gt;
&lt;li&gt;Agents, tool-use loops, and memory-augmented workloads&lt;/li&gt;
&lt;li&gt;Benchmarking pipelines that rerun similar sequences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet most systems recompute the same prefix every time, per replica.&lt;/p&gt;

&lt;p&gt;Under optimistic conditions, KV Marketplace offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~15% lower end-to-end latency&lt;/li&gt;
&lt;li&gt;Higher throughput at the same batch size&lt;/li&gt;
&lt;li&gt;Zero changes to model weights or architecture&lt;/li&gt;
&lt;li&gt;Zero CPU load increase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are early results (the prototype is intentionally minimal), but the direction is promising.&lt;/p&gt;

&lt;h2&gt;How It Works (High Level)&lt;/h2&gt;

&lt;p&gt;When a request completes its prefill phase:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Token sequence → hashed (xxh3_64 + model version)&lt;/li&gt;
&lt;li&gt;KV tensors → exported into a GPU registry&lt;/li&gt;
&lt;li&gt;Another process with the same prefix:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Looks up the hash&lt;/li&gt;
&lt;li&gt;Pulls KV tensors via P2P&lt;/li&gt;
&lt;li&gt;Adopts them before decode&lt;/li&gt;
&lt;/ul&gt;
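&lt;p&gt;The import path above boils down to one decision per request: adopt a peer's tensors on a hit, or compute and publish on a miss. Here is a minimal sketch; &lt;code&gt;p2p_copy&lt;/code&gt; and &lt;code&gt;run_prefill&lt;/code&gt; are hypothetical stand-ins for the CUDA P2P transfer utility and the engine's prefill kernel, not real vLLM APIs.&lt;/p&gt;

```python
def adopt_or_prefill(registry, token_ids, p2p_copy, run_prefill):
    """Fast path: adopt a peer's KV tensors on a prefix hit, else prefill.

    `registry` maps prefix keys to exported KV handles. `p2p_copy` and
    `run_prefill` are injected stand-ins (assumptions, not real APIs).
    """
    key = tuple(token_ids)           # stand-in for the xxh3_64 prefix hash
    handle = registry.get(key)
    if handle is not None:
        return p2p_copy(handle)      # pull KV tensors from the owning GPU
    kv = run_prefill(token_ids)      # miss: compute prefill locally...
    registry[key] = kv               # ...and publish it for other peers
    return kv
```

&lt;p&gt;With this shape, the hot path adds only a hash lookup plus a single P2P copy on a hit.&lt;/p&gt;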

&lt;p&gt;The prototype supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pluggable hashing scheme&lt;/li&gt;
&lt;li&gt;Type/shape checking between peers&lt;/li&gt;
&lt;li&gt;vLLM integration hooks for before/after prefill&lt;/li&gt;
&lt;li&gt;CUDA P2P paths (NVLink and RDMA depending on hardware)&lt;/li&gt;
&lt;/ul&gt;
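&lt;p&gt;The type/shape checking mentioned above can be as simple as comparing a small metadata record that each peer advertises alongside its exported tensors. The schema here is an assumption for illustration, not the prototype's actual wire format.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVMeta:
    """Metadata a peer advertises with exported KV tensors (hypothetical schema)."""
    num_layers: int
    num_heads: int
    head_dim: int
    dtype: str           # e.g. "float16"
    model_version: str

def compatible(local: KVMeta, remote: KVMeta) -> bool:
    # Adopting remote KV tensors only makes sense when both peers run the
    # same model build with identical attention geometry and precision;
    # on any mismatch the importer falls back to computing prefill locally.
    return local == remote
```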

&lt;h2&gt;Limitations (for now)&lt;/h2&gt;

&lt;p&gt;This is not production-ready. It does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed registry (currently process-local)&lt;/li&gt;
&lt;li&gt;Eviction / TTL&lt;/li&gt;
&lt;li&gt;Version negotiation across heterogeneous clusters&lt;/li&gt;
&lt;li&gt;CPU/disk tiers&lt;/li&gt;
&lt;li&gt;Fault tolerance&lt;/li&gt;
&lt;li&gt;Security / isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's deliberately simple. The goal is to provide a minimal, hackable prototype of fast-path P2P prefix reuse.&lt;/p&gt;

&lt;p&gt;If you're building on top of vLLM or working on distributed inference systems, this might spark ideas or serve as a baseline to build a smarter tiered cache.&lt;/p&gt;

&lt;h2&gt;Early Results&lt;/h2&gt;

&lt;p&gt;With perfect prefix hits on multi-GPU setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~15% latency reduction in end-to-end generation&lt;/li&gt;
&lt;li&gt;Helps both long-prefill and short-prefill models&lt;/li&gt;
&lt;li&gt;Overhead dominated by a single P2P memcpy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, real-world gains depend on template reuse and batch composition.&lt;/p&gt;

&lt;h2&gt;Try It Out&lt;/h2&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vLLM integration hooks&lt;/li&gt;
&lt;li&gt;Simple CUDA P2P copy utilities&lt;/li&gt;
&lt;li&gt;Minimal test harness&lt;/li&gt;
&lt;li&gt;Logging utilities to observe prefix reuse behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/neelsomani/kv-marketplace" rel="noopener noreferrer"&gt;https://github.com/neelsomani/kv-marketplace&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why I Built This&lt;/h2&gt;

&lt;p&gt;As LLM applications become more agentic and multi-tenant, prefix locality increases. We're going to need better caching primitives: fast ones, distributed ones, and eventually learned ones.&lt;/p&gt;

&lt;p&gt;KV Marketplace is one small piece of that future. If you're working on inference engines, caching systems, or GPU transport layers, I'd love to hear your thoughts or collaborate.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>inference</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Verifying CUDA Kernels in Coq with Rust MIR (Introducing cuq)</title>
      <dc:creator>Neel Somani</dc:creator>
      <pubDate>Thu, 23 Oct 2025 01:51:00 +0000</pubDate>
      <link>https://dev.to/neeljsomani/verifying-cuda-kernels-in-coq-with-rust-mir-introducing-cuq-2b21</link>
      <guid>https://dev.to/neeljsomani/verifying-cuda-kernels-in-coq-with-rust-mir-introducing-cuq-2b21</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;GPU kernels are notoriously difficult to reason about formally. While prior work such as Lustig et al. (ASPLOS 2019) defined a Coq model of the PTX memory semantics, there has not been a practical toolchain connecting high-level kernel code to verifiable proofs.&lt;/p&gt;

&lt;p&gt;cuq is an experimental Rust-to-Coq pipeline that lets you write CUDA kernels in safe Rust and automatically generate Coq representations of their execution traces. Each MIR instruction is translated into Coq terms compatible with the PTX memory model, allowing proofs of correctness and safety properties.&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MIR Extraction&lt;/strong&gt; — cuq operates on Rust's Mid-level Intermediate Representation (MIR), extracted from the Rust compiler.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIR to Coq Translation&lt;/strong&gt; — the compiler pass encodes MIR statements into Coq syntax.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with PTX Model&lt;/strong&gt; — generated Coq files import the formal PTX semantics from Lustig et al.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt; — developers can reason about kernel safety and memory consistency directly in Coq.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Significance&lt;/h2&gt;

&lt;p&gt;By connecting Rust's ownership and type guarantees with Coq's formal proof system, cuq provides a foundation for provably safe GPU programming. It demonstrates that performance-oriented languages and proof-based verification can coexist in a single toolchain.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/neelsomani/cuq" rel="noopener noreferrer"&gt;https://github.com/neelsomani/cuq&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>coq</category>
      <category>cuda</category>
      <category>formalverification</category>
    </item>
  </channel>
</rss>
