A few days ago, Han Xiao (VP AI @ Elastic) shared an experiment on LinkedIn that asked a provocative question: what happens if you stop treating retrieval as a separate system and instead use the model’s own KV cache as the document store?
The setup was ambitious:
- The Qwen3.5-35B-A3B LLM,
- A 1M token context,
- A single 24 GB L4 GPU, and
- A pipeline that avoids embeddings, vector databases, and chunking entirely.
The core idea was simple. Prefill the document once, save the KV cache to disk, restore it on demand, and answer queries with the full document already resident in context.
We wanted to see whether NEO, a fully autonomous AI engineering agent, could take that idea and turn it into a working implementation on its own.
So we gave NEO the research direction and let it run.
In about 30 minutes, it autonomously produced a working Cache-Augmented Generation system that implements the same core pattern: ingest a document once, prefill the entire document into the model’s KV cache, persist the cache as a .bin file, restore it before each query, and answer against full-document context without re-embedding or re-chunking anything.
The resulting GitHub repo also documents that the full implementation, debugging, GPU validation, and documentation were done autonomously by NEO, including fixing 9 bugs across CUDA, Python, and shell code, and running 11 GPU validation tests end to end.
The original idea
Traditional RAG pipelines split documents into chunks, embed them, store those embeddings in a vector index, and retrieve a subset of chunks at query time. That architecture is practical and scalable, but it comes with tradeoffs. The model only sees selected fragments, retrieval quality becomes a separate engineering problem, and there is always some risk that the right information was chunked poorly or never retrieved at all. The repo’s own README summarizes that contrast directly: RAG gives the model chunked fragments, while CAG aims to keep the full document active for every query.
Han’s experiment pushed that idea hard. His post describes loading a 1.2 million word novel into KV cache, pre-filling 905K tokens on a single L4 24 GB GPU, and relying on several optimizations to make that feasible, including YaRN scaling, Q3_K_M quantization, compressed KV cache, slot save and restore, and custom patches to support the model architecture. He also reported a key caveat that matters a lot: the system worked mechanically, but retrieval quality degraded badly in the middle of the context window, which is the classic lost-in-the-middle problem.
That was the interesting part for us.
Not because “RAG is dead” is the right conclusion. It probably is not. But because the experiment is a good stress test for whether an AI agent can reproduce a non-trivial systems idea from a public technical post and turn it into runnable software.
What NEO built
We used NEO's extension in VS Code and prompted it to build a cache-augmented generation system: a full-document QA stack built around llama-server and a persistent KV-slot workflow.
The flow is straightforward:
- A document is wrapped into a structured prompt and sent to the model for a one-time prefill.
- The resulting KV cache is saved to disk as a slot file.
- For every future query, that slot file is restored into llama-server.
- The user’s question is appended to the restored state.
- The model answers with the entire document already present in active context.
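llama.cpp's server exposes a slot save/restore API for exactly this pattern. Here is a rough sketch of the flow above (the port, slot file name, and request fields like `id_slot` and `cache_prompt` are assumptions based on llama-server's API, not details taken from the repo; the server is assumed to have been started with `--slot-save-path`):

```python
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed llama-server address

def slot_url(slot_id: int, action: str, base: str = BASE) -> str:
    # llama-server exposes POST /slots/{id}?action=save|restore|erase
    return f"{base}/slots/{slot_id}?action={action}"

def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def ingest(document: str, slot_id: int = 0) -> None:
    # One-time prefill: request 0 new tokens so the server only fills the KV cache.
    post_json(f"{BASE}/completion",
              {"prompt": document, "n_predict": 0,
               "id_slot": slot_id, "cache_prompt": True})
    # Persist the slot's KV state under <slot-save-path>/my_doc.bin
    post_json(slot_url(slot_id, "save"), {"filename": "my_doc.bin"})

def query(question: str, slot_id: int = 0) -> str:
    # Restore the saved attention state, then append only the question.
    post_json(slot_url(slot_id, "restore"), {"filename": "my_doc.bin"})
    out = post_json(f"{BASE}/completion",
                    {"prompt": question, "id_slot": slot_id,
                     "cache_prompt": True, "n_predict": 256})
    return out["content"]
```

The key property is that `ingest` pays the prefill cost once; every `query` starts from the restored attention state instead of re-reading the document.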
That sounds small in one paragraph, but there is a lot packed into it.
The repo includes:
- a setup script that builds the required inference stack and downloads the model
- a server launch script
- a FastAPI application for ingestion, querying, corpus management, and health checks
- CLI scripts for document ingest and querying
- a demo path
- Docker artifacts
- validation docs and a GPU testing checklist
Access the Cache-Augmented Generation GitHub repo
The API surface is also clean enough to use like a real system, not just a one-off experiment.
There are endpoints for /ingest, /status/{job_id}, /query, /corpora, /corpora/{id}, and /health, with ingestion running asynchronously and status polled through a job state transition.
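That job lifecycle is easy to picture as a small state machine: `/ingest` creates a job, and `/status/{job_id}` is polled until the prefill completes. A minimal sketch (the state names and fields here are illustrative, not the repo's actual values):

```python
import uuid
from dataclasses import dataclass, field

# Illustrative happy-path lifecycle for an async ingestion job.
STATES = ("queued", "prefilling", "saving_cache", "ready", "failed")

@dataclass
class IngestJob:
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: str = "queued"

    def advance(self) -> str:
        # Move to the next state on the happy path; "ready" and "failed" are terminal.
        if self.state not in ("ready", "failed"):
            self.state = STATES[STATES.index(self.state) + 1]
        return self.state

# What a /status poller effectively waits for:
job = IngestJob()
while job.state != "ready":
    job.advance()
```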
That matters because replication is not just “it ran once on my machine.” A credible reproduction needs to be shaped into something other people can actually use, inspect, and test.
Why the implementation is technically interesting
The most important architectural shift here is that retrieval moves from an external index into the model runtime itself.
In standard RAG:
- storage is in a vector database
- retrieval happens before generation
- the model sees only the retrieved subset
In this CAG-style system:
- storage is effectively the saved KV state
- retrieval is replaced by restoring a prior attention state
- the model answers after the full document context is already loaded
That changes both latency and operational behavior.
The expensive part becomes the first prefill pass. After that, repeated queries are cheap because the cache is restored instead of recomputed. The repo reports that after ingestion, the cache lives in kv_slots/my_doc.bin, and future queries restore it instantly while surviving server restarts.
This is a very different tradeoff from RAG. You pay a large one-time setup cost per document or corpus, then reuse that precomputed attention state repeatedly.
For some workloads, that is extremely attractive.
If you have a relatively fixed corpus and many follow-up queries, the economics can make sense. If your corpus changes constantly, or if you need many documents active concurrently, the tradeoff looks worse.
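A quick back-of-envelope using the repo's reported figures (a 24.3-minute one-time prefill and a 1.2-second per-query restore) makes the break-even concrete:

```python
# Amortizing the repo's reported setup costs over repeated queries.
PREFILL_S = 24.3 * 60   # one-time prefill cost per document, in seconds
RESTORE_S = 1.2         # per-query cost to reload the saved KV state

def amortized_setup_per_query(n_queries: int) -> float:
    # Setup seconds attributed to each query (decode time excluded).
    return PREFILL_S / n_queries + RESTORE_S

for n in (1, 10, 100, 1000):
    print(n, round(amortized_setup_per_query(n), 1))
```

At a single query the prefill dominates (about 24 minutes of setup for one answer); by a thousand queries against the same document, setup cost per query falls under three seconds. That is the workload shape where this architecture pays off.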
Reported results from the replication
According to the repo, all 11 GPU tests were run on an NVIDIA RTX A6000 with Qwen3.5-35B-A3B Q3_K_M at a 1,048,576 token context window. The README reports:
- 24.3 minute cold prefill for War and Peace at 922K tokens
- 1.2 second KV slot restore from disk
- roughly 100 tokens per second decode speed at 1M context
- 4 GB KV cache size at 1M context versus 23 GB in f16
- about 43% VRAM usage on the A6000 in that configuration
It also lists successful validation for:
- TurboQuant cache types
- KV compression
- YaRN context extension from 262K to 1,048,576
- slot save and restore timing
- VRAM profiling
- Flash Attention
- end-to-end document QA demos
- concurrent query handling
- stress testing on War and Peace
- API key authentication
- persistence across server restarts
There is also a smaller demo run using Alice in Wonderland and Peter Pan where the repo reports 2 out of 2 documents ingested, 6 out of 6 queries answered correctly, average decode speed around 103 tok/s, and no OOM errors.
Those numbers are useful for two reasons.
First, they show the system is not just conceptually aligned with the original post. It is instrumented and benchmarked.
Second, they make it easier to reason about where this architecture is actually viable.
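For example, the README's 23 GB f16 KV-cache figure is the kind of number you can sanity-check with the standard KV sizing formula, since cache size grows linearly in context length, layer count, and KV-head width. The hyperparameters below are illustrative placeholders, not the published Qwen3.5-35B-A3B config:

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2) -> float:
    # Two tensors (K and V) per layer, each n_ctx x (n_kv_heads * head_dim)
    # elements; f16 is 2 bytes per element, quantized KV types are less.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Illustrative GQA config (NOT the real Qwen3.5-35B-A3B hyperparameters):
size = kv_cache_bytes(n_ctx=1_048_576, n_layers=24, n_kv_heads=2, head_dim=128)
print(round(size / 1024**3, 1))  # 24.0 GiB at f16 for this toy config
```

Swapping in a quantized KV type simply scales `bytes_per_elem` down, which is where the reported drop from 23 GB to 4 GB comes from.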
The engineering constraints are the real story
One thing I like about both the original post and the replication is that neither pretends this is magic.
The constraints are real.
The replicated system is explicitly Linux- and NVIDIA-only. The large-model path requires 24 GB or more of VRAM for the full 1M-token configuration; smaller VRAM tiers fall back to smaller Qwen variants and much shorter context windows. First-time setup takes about 35 minutes to build the CUDA kernels, and the full Qwen3.5-35B path also requires a Hugging Face token.
There are also architectural limitations:
- the long initial prefill still costs about 24 minutes on the A6000 for a very large document
- only one active corpus is supported in the current single-slot setup
- switching corpora means restoring a different slot
- the lost-in-the-middle problem remains real at extreme context depth
That last point is the big one.
Han’s own comment on the original post says the system could generate readable answers, but hallucinated badly in the middle of the 905K-token context and mainly attended to the start and end of the document. The replicated repo reports a similar caveat in its sample War and Peace results, where one ending-related question is marked only partial because of lost-in-the-middle behavior.
So no, this does not prove that traditional RAG is obsolete.
What it proves is that KV-cache-centric document serving is increasingly practical as a systems pattern, and that the bottleneck is moving from “can we load this much context” toward “can the model actually use it reliably.”
The part that matters most to me
The technical implementation is interesting.
But the more important story is how it got built.
The repo states that NEO handled setup, debugging, validation, and documentation autonomously. That means this was not just a code generation exercise. It involved:
- navigating an unfamiliar architecture
- getting the inference stack working
- dealing with CUDA and shell issues
- validating runtime behavior on GPU
- wrapping the result in a usable API and CLI
- writing documentation that explains how the system works and where it breaks
That is much closer to real engineering work than most “AI built X” demos.
The valuable question is no longer whether an agent can produce a toy script from a prompt.
The better question is whether it can take a new technical idea, explore the dependency stack, adapt it to real hardware constraints, instrument the result, debug its mistakes, and leave behind something another engineer can inspect and run.
This replication is a good example of that threshold being crossed.
What I would take away from this
I do not think the lesson is “replace RAG with giant context everywhere.”
I think the real lessons are:
- Persistent KV state is becoming a usable systems primitive. It is not just an internal optimization anymore. It can be treated as part of application architecture.
- Long-context serving changes the shape of the stack. You can move work from retrieval infrastructure into the model runtime, but only for some workloads.
- The hard part is now quality, not just capacity. Getting 1M context to fit is impressive. Getting the model to attend well across that full range is the deeper challenge.
- Autonomous agents are becoming useful for reproducing research systems. Not in a magical “push button, get product” sense, but in a practical engineering sense where they can compress a lot of setup, debugging, and validation work into one session.
That last point is the reason we cared enough to run this experiment in the first place.
A lot of technical posts die as inspiration. They get bookmarked, maybe discussed, and then disappear.
This one turned into a runnable system in about 30 minutes.
That is a meaningful change in what an AI engineering agent can do.


