<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prashanth Velidandi</title>
    <description>The latest articles on DEV Community by Prashanth Velidandi (@pmv_inferx).</description>
    <link>https://dev.to/pmv_inferx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3382940%2F7c22cb83-8e48-49dd-8ec0-2aa7fb44b3e5.jpg</url>
      <title>DEV Community: Prashanth Velidandi</title>
      <link>https://dev.to/pmv_inferx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pmv_inferx"/>
    <language>en</language>
    <item>
      <title>RAG became the default answer for private knowledge access. We asked a different question: what if context didn’t need to be repeatedly retrieved at all? Persistent KV cache changed the economics completely.</title>
      <dc:creator>Prashanth Velidandi</dc:creator>
      <pubDate>Tue, 26 May 2026 12:21:50 +0000</pubDate>
      <link>https://dev.to/pmv_inferx/rag-became-the-default-answer-for-private-knowledge-access-we-asked-a-different-question-what-if-45m9</link>
      <guid>https://dev.to/pmv_inferx/rag-became-the-default-answer-for-private-knowledge-access-we-asked-a-different-question-what-if-45m9</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/pmv_inferx/we-replaced-our-rag-pipeline-with-persistent-kv-cache-heres-what-we-found-7cl" class="crayons-story__hidden-navigation-link"&gt;We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/pmv_inferx" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3382940%2F7c22cb83-8e48-49dd-8ec0-2aa7fb44b3e5.jpg" alt="pmv_inferx profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/pmv_inferx" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Prashanth Velidandi
            &lt;/a&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/pmv_inferx/we-replaced-our-rag-pipeline-with-persistent-kv-cache-heres-what-we-found-7cl" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 23&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/pmv_inferx/we-replaced-our-rag-pipeline-with-persistent-kv-cache-heres-what-we-found-7cl" id="article-link-3731263"&gt;
          We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rag"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rag&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/serverless"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;serverless&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/webdev"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;webdev&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.</title>
      <dc:creator>Prashanth Velidandi</dc:creator>
      <pubDate>Sat, 23 May 2026 08:34:13 +0000</pubDate>
      <link>https://dev.to/pmv_inferx/we-replaced-our-rag-pipeline-with-persistent-kv-cache-heres-what-we-found-7cl</link>
      <guid>https://dev.to/pmv_inferx/we-replaced-our-rag-pipeline-with-persistent-kv-cache-heres-what-we-found-7cl</guid>
      <description>&lt;p&gt;RAG has become the default answer for giving LLMs access to private knowledge. And for good reason — it works. But after running it in production we kept hitting the same wall. Not retrieval accuracy. The operational tax.&lt;/p&gt;

&lt;p&gt;Re-embedding on data changes. Chunking drift. Retrieval misses on edge cases. Pipeline failures at 2am. The vector database that needs babysitting.&lt;/p&gt;

&lt;p&gt;So we ran an experiment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hypothesis&lt;/strong&gt;&lt;br&gt;
What if instead of chunking, embedding, and retrieving — we just loaded the full document into the LLM context, cached the KV state persistently, and reused it across every query?&lt;/p&gt;

&lt;p&gt;No retrieval step. No embedding pipeline. No vector database. Just the model with full document context, warm and ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;br&gt;
The core idea is simple. When an LLM processes a prompt it generates a key-value attention cache — the internal representation of everything it has read. Normally this cache is transient. It lives in VRAM during the request and disappears after.&lt;br&gt;
We persist it.&lt;br&gt;
The initialization prompt — your document — gets processed once. The resulting KV cache gets stored externally and indexed to that document. Every subsequent query retrieves that cached state and appends the user query. The model never recomputes the document. Ever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The math:&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Once per document:
KV_init = LLM.prefill(document)
KV_store[document_id] = KV_init

# On every query:
KV_full = KV_store[document_id] + LLM.prefill(query)
output = LLM.decode(KV_full)
&lt;/code&gt;&lt;/pre&gt;
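
&lt;p&gt;The flow above can be sketched with a toy stand-in for the model. Everything named here (ToyModel, PersistentKVStore, answer, the lease-001 id) is hypothetical, the cached entries are placeholder pairs rather than real attention tensors, and pickle on local disk stands in for whatever external store you use. The sketch only illustrates the accounting: the document is prefilled once, and each query pays for its own tokens.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import pickle
import tempfile
from pathlib import Path


class ToyModel:
    """Stand-in for an LLM; counts how many tokens it has had to prefill."""

    def __init__(self):
        self.tokens_prefilled = 0

    def prefill(self, tokens, past=None):
        # A real model computes per-layer attention keys and values;
        # here each token just yields a placeholder (key, value) pair.
        self.tokens_prefilled += len(tokens)
        new_entries = [(sum(map(ord, t)) % 997, len(t)) for t in tokens]
        return list(past or []) + new_entries


class PersistentKVStore:
    """Hypothetical on-disk store, one file per document id."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, document_id, kv):
        (self.root / f"{document_id}.kv").write_bytes(pickle.dumps(kv))

    def get(self, document_id):
        return pickle.loads((self.root / f"{document_id}.kv").read_bytes())


def answer(model, store, document_id, query_tokens):
    kv_doc = store.get(document_id)                  # restore cached document state
    return model.prefill(query_tokens, past=kv_doc)  # only the query tokens are new


model = ToyModel()
store = PersistentKVStore(tempfile.mkdtemp())

document = "the lease term is twelve months".split()
store.put("lease-001", model.prefill(document))      # one-time document prefill
after_init = model.tokens_prefilled

kv = answer(model, store, "lease-001", "how long is the term".split())
query_cost = model.tokens_prefilled - after_init
print(after_init, query_cost, len(kv))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this toy run the six document tokens are prefilled once, and the five-token query adds only five more, however many queries follow.&lt;/p&gt;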

&lt;p&gt;&lt;strong&gt;What We Found&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer quality improved.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No retrieval misses are possible when the full document is in context. The model has read everything. It doesn't guess which chunks are relevant; it has the whole document. For complex multi-part questions that span different sections this is a significant improvement over chunked retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates became trivial.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document changes? Re-run the prefill, store the new KV cache. Minutes not hours. No re-embedding pipeline. No re-indexing. No retrieval regression testing. Just regenerate and deploy.&lt;/p&gt;
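
&lt;p&gt;One way to decide when to re-run the prefill is to fingerprint the document and regenerate only when the fingerprint changes. This is a minimal sketch under that assumption, not a description of our pipeline: refresh_if_changed, the in-memory dict, and fake_prefill are hypothetical stand-ins for a real store and a real model call.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import hashlib


def fingerprint(text):
    """Content hash used to detect that a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh_if_changed(cache, document_id, text, prefill):
    """Re-run prefill only when the document's content hash has changed."""
    digest = fingerprint(text)
    entry = cache.get(document_id)
    if entry is not None and entry["digest"] == digest:
        return False                      # cached KV state is still valid
    cache[document_id] = {"digest": digest, "kv": prefill(text)}
    return True


calls = []


def fake_prefill(text):
    # Placeholder for the expensive model call; records each invocation.
    calls.append(text)
    return ["kv-for", len(text.split())]


cache = {}
first = refresh_if_changed(cache, "policy", "v1 of the policy", fake_prefill)
second = refresh_if_changed(cache, "policy", "v1 of the policy", fake_prefill)
third = refresh_if_changed(cache, "policy", "v2 of the policy", fake_prefill)
print(first, second, third, len(calls))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The unchanged second call skips the prefill, so the model is only invoked for the two distinct versions.&lt;/p&gt;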

&lt;p&gt;&lt;strong&gt;Operational complexity dropped.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No embedding model to maintain. No vector database to monitor. No chunking strategy to tune. No retrieval quality metrics to track. The surface area for things to break quietly got dramatically smaller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency on warm cache is minimal.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the KV state is already loaded, only the query tokens are prefilled before generation starts. No retrieval hop, no context injection latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Honest Tradeoffs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window is the ceiling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Current limit is around 120k tokens — roughly 200-300 pages. Works well for focused documents. For large corpora you need a routing layer to select the right cache per query. You've pushed the retrieval problem up one level — instead of retrieving chunks you're selecting a cache. Simpler problem but not zero.&lt;/p&gt;
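
&lt;p&gt;To make the routing layer concrete, here is a deliberately naive sketch that picks a cache by word overlap between the query and a short per-document summary. The route function, the summaries, and the document ids are all hypothetical, and word overlap is only a placeholder scoring rule; a real router would likely need something stronger.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
def route(query, summaries):
    """Return the document id whose summary shares the most words with the query."""
    query_terms = set(query.lower().split())

    def overlap(item):
        _, summary = item
        return len(query_terms.intersection(summary.lower().split()))

    best_id, _ = max(summaries.items(), key=overlap)
    return best_id


# Hypothetical one-line summaries, one per cached document.
summaries = {
    "lease-001": "office lease rent term renewal landlord",
    "policy-007": "data retention compliance policy audit",
}

chosen = route("When does the lease renewal term start", summaries)
other = route("what is the data retention policy", summaries)
print(chosen, other)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The selected id is then used to look up the matching KV cache. Swapping the overlap score for embeddings would bring part of the retrieval stack back, which is the tradeoff described above.&lt;/p&gt;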

&lt;p&gt;&lt;strong&gt;Cold cache restore adds latency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first query after a cache restore pays a latency cost. For strict SLA requirements this matters. Warm cache is instant. Cold restore depends on your infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial prefill costs more than embedding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running a full forward pass on a large document costs more compute than embedding it. The economics work when query volume is high enough to amortize that cost. Low query, high update frequency — RAG still wins.&lt;/p&gt;
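
&lt;p&gt;The amortization argument can be written as simple break-even arithmetic. This sketch assumes a cached query costs less than a RAG query, which you should measure rather than take for granted, and every number in it is made up for illustration in arbitrary cost units. The function name break_even_queries and its parameters are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import math


def break_even_queries(prefill_cost, embed_cost, rag_query_cost, cached_query_cost):
    """Queries per document update needed before the cached path is cheaper."""
    extra_upfront = prefill_cost - embed_cost
    saving_per_query = rag_query_cost - cached_query_cost
    if not saving_per_query > 0:
        return None                       # the cached path never pays back
    return max(0, math.ceil(extra_upfront / saving_per_query))


# Illustrative inputs only, not measurements.
n = break_even_queries(prefill_cost=100.0, embed_cost=10.0,
                       rag_query_cost=3.0, cached_query_cost=1.0)
never = break_even_queries(prefill_cost=100.0, embed_cost=10.0,
                           rag_query_cost=1.0, cached_query_cost=1.0)
print(n, never)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With these made-up inputs the extra 90 units of upfront cost are recovered after 45 queries. If the document is updated before that many queries arrive, the cost comparison favors RAG.&lt;/p&gt;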

&lt;p&gt;&lt;strong&gt;Where This Wins&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This approach is clearly better when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a focused, structured document: legal contract, compliance policy, product manual, technical spec&lt;/li&gt;
&lt;li&gt;Query volume is high relative to update frequency&lt;/li&gt;
&lt;li&gt;Full context comprehension matters more than breadth&lt;/li&gt;
&lt;li&gt;You want to eliminate pipeline maintenance entirely&lt;/li&gt;
&lt;li&gt;Privacy matters: no document chunks sent to embedding APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where RAG Still Wins&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very large document collections where context limits apply&lt;/li&gt;
&lt;li&gt;Highly dynamic data that changes multiple times per day&lt;/li&gt;
&lt;li&gt;When you genuinely don't know which document is relevant at query time&lt;/li&gt;
&lt;li&gt;Low query volume where prefill cost doesn't amortize&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What We're Building&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We've been running this in production at InferX as part of our Sovereign Endpoints™ infrastructure. The persistent KV cache layer sits on top of our GPU snapshotting architecture — which is what makes the cold cache restore fast enough to be practical.&lt;br&gt;
We're now opening a limited beta for teams who want to test this on real workloads. Particularly interested in legal, compliance, finance, and developer tooling use cases.&lt;br&gt;
If you're running RAG in production and want to run a head-to-head comparison — we'd love to work with you.&lt;/p&gt;

&lt;p&gt;🎬 Demo dropping in 2 days — follow to see it first.&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://inferx.net/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;inferx.net&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>rag</category>
      <category>serverless</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
