
Nilofer πŸš€
Cache-Augmented Generation (CAG): A RAG-less Approach to Document QA

Most document QA systems today rely on Retrieval-Augmented Generation (RAG). The standard pipeline is familiar: chunk the document, generate embeddings, store them in a vector database, and retrieve relevant chunks at query time.

This works, but it comes with trade-offs. The model only sees fragments of the document, retrieval adds latency, and the system becomes more complex with multiple moving parts.

Cache-Augmented Generation (CAG) explores a different approach, where the document is processed once and reused across queries instead of being retrieved repeatedly.

What is Cache-Augmented Generation?

Cache-Augmented Generation (CAG) approaches document QA by reusing the model’s internal state instead of retrieving context for every query.

During ingestion, the entire document is processed in a single pass. In this step, the model builds its KV (key-value) cache, which represents the document’s context.

This KV cache is then saved to disk.

When a query is made, the cache is restored and the query is appended, allowing the model to generate responses using the previously processed document.
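As a sketch, the save/restore cycle maps onto llama-server's slot endpoints (available when the server is started with `--slot-save-path`). The URL shape below follows llama.cpp's server API; the authoritative flow for this project lives in `src/ingest.py` and `src/query.py`:

```python
import json
import urllib.request

LLAMA_URL = "http://localhost:8080"  # assumed llama-server address

def slot_url(base: str, slot_id: int, action: str) -> str:
    # llama-server exposes save/restore as POST /slots/{id}?action=...
    # (the server must be launched with --slot-save-path <dir>)
    return f"{base}/slots/{slot_id}?action={action}"

def slot_request(slot_id: int, action: str, filename: str) -> dict:
    # POST {"filename": ...}; the server writes or reads the .bin file
    # under the directory given to --slot-save-path.
    req = urllib.request.Request(
        slot_url(LLAMA_URL, slot_id, action),
        data=json.dumps({"filename": filename}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with a running server):
#   slot_request(0, "save", "my_doc.bin")     # after the ingest prefill
#   slot_request(0, "restore", "my_doc.bin")  # before each query
```

Saving is cheap relative to the prefill itself, which is why the one-time ingest cost amortizes so well across queries.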

How It Works

Step 1: Ingest (done once per document)
The document is wrapped in a structured prompt and sent to llama-server. The model runs a full prefill pass, loading every token into the KV cache. This takes time proportional to document size, but it only happens once. The KV cache is then saved to a .bin file on disk.

Step 2: Query (instant, repeatable)
Before each query, the saved .bin file is restored into llama-server's KV cache in ~1 second. The user's question is appended and the model generates an answer with full document context active. No re-reading, no re-embedding.

Step 3: Persistence
KV slots survive server restarts. Kill the server, restart it, and your next query restores the cache from disk just as fast. The 24-minute prefill for War and Peace only needs to happen once ever.
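The query step can be sketched as a single completion call against the restored slot. The `id_slot` and `cache_prompt` fields come from llama-server's /completion API; the exact prompt framing here is an assumption, and `src/query.py` is the real implementation:

```python
import json
import urllib.request

def build_query_prompt(question: str) -> str:
    # Hypothetical framing: the document tokens are already in the
    # restored KV cache, so only the new question needs prefilling.
    return f"\n\nQuestion: {question}\nAnswer:"

def ask(question: str, base: str = "http://localhost:8080") -> str:
    # Assumes the corpus KV cache was restored into slot 0 beforehand.
    payload = {
        "prompt": build_query_prompt(question),
        "cache_prompt": True,   # reuse the prefilled document context
        "id_slot": 0,
        "n_predict": 256,
    }
    req = urllib.request.Request(
        f"{base}/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Because the restore is a disk read rather than a forward pass, switching documents costs roughly the same ~1 second as restoring after a restart.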

Validated Results

All 11 GPU tests were run on an NVIDIA RTX A6000 (48 GB VRAM) with Qwen3.5-35B-A3B Q3_K_M at 1,048,576 token context.

Performance

Example Output
β€œWho is Pierre Bezukhov?” β†’ Correct, detailed answer
β€œWhat happened at the Battle of Borodino?” β†’ Correct, detailed answer

Quick Start

Prerequisites: Linux, NVIDIA GPU (8 GB+ VRAM), Python 3.8+

# 1. Build llama.cpp + download model (one-time, ~35 min)
./setup.sh

# 2. Start the LLM server
./start_server.sh

# 3. Start the API server
python3 src/api_server.py

# 4. Ingest a document
python3 src/ingest.py my_document.txt --corpus-id my_doc

# 5. Query it
python3 src/query.py my_doc "What is this document about?"

That's it. After step 4, the KV cache is saved to kv_slots/my_doc.bin. Every future query restores it instantly, and it survives server restarts.

Model Selection

setup.sh auto-detects GPU VRAM and picks an appropriate model.

The 24 GB+ path uses unsloth/Qwen3.5-35B-A3B-GGUF from Hugging Face and requires a free HF account and an access token.
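The selection logic can be imagined as a small tier function. The tiers below are hypothetical (only the 24 GB+ path is documented here); the real thresholds and smaller-model names live in setup.sh:

```python
def pick_model(vram_gb: float) -> str:
    # Hypothetical sketch of setup.sh's VRAM gating; consult the
    # script for the actual thresholds and model choices.
    if vram_gb >= 24:
        # Gated on Hugging Face: needs a free account + access token.
        return "unsloth/Qwen3.5-35B-A3B-GGUF"
    # Lower-VRAM systems fall back to a smaller model with a
    # shorter context window (name elided; see setup.sh).
    return "smaller-gguf-model (see setup.sh)"
```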

REST API

Start the API server with python3 src/api_server.py --port 8000 (optionally set the CAG_API_KEY environment variable to enable key authentication).

Full API docs available at http://localhost:8000/docs when the server is running.
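A client call might look like the sketch below. The `/query` route, its request fields, and the `X-API-Key` header name are all assumptions; check http://localhost:8000/docs for the routes `src/api_server.py` actually exposes:

```python
import json
import os
import urllib.request

API_URL = "http://localhost:8000"  # default api_server.py address

def build_headers(api_key):
    # Header name is an assumption; consult /docs for the real scheme.
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key
    return headers

def query_corpus(corpus_id: str, question: str) -> dict:
    # Hypothetical endpoint and payload shape for illustration only.
    body = json.dumps({"corpus_id": corpus_id, "question": question})
    req = urllib.request.Request(
        f"{API_URL}/query",
        data=body.encode(),
        headers=build_headers(os.environ.get("CAG_API_KEY")),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```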

Directory Structure

.
β”œβ”€β”€ setup.sh              # Builds llama.cpp, downloads model
β”œβ”€β”€ start_server.sh       # Launches llama-server with CAG flags
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ api_server.py     # FastAPI REST API
β”‚   β”œβ”€β”€ ingest.py         # CLI: ingest a document
β”‚   β”œβ”€β”€ query.py          # CLI: query a corpus
β”‚   └── demo.py           # End-to-end demo
β”œβ”€β”€ docker/
β”‚   β”œβ”€β”€ Dockerfile
β”‚   └── docker-compose.yml
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ REPORT.md         # Full GPU validation report with all 11 test results
β”‚   └── GPU_TESTING.md    # GPU test checklist
β”œβ”€β”€ models/               # GGUF weights (not committed)
β”œβ”€β”€ kv_slots/             # Saved KV cache .bin files (not committed)
└── logs/                 # Runtime logs (not committed)

Limitations

Linux + NVIDIA only: TurboQuant CUDA kernels require Linux and NVIDIA GPUs (no Windows, macOS, or AMD).

Long initial prefill: ~900K tokens can take ~24 minutes on an A6000. This is a one-time cost; subsequent queries restore in ~1 second.

VRAM gating: Systems with lower VRAM use smaller models with shorter context.

Single active corpus: Uses a single llama.cpp slot (slot 0). Switching corpora requires restoring a different KV cache (~1 second).

Long-context limitations: YaRN extrapolation biases attention toward the start and end of documents, so mid-document content can be missed at very large context sizes.

Build time: Initial setup (./setup.sh) can take ~35 minutes to compile CUDA kernels.

Model access requirements: Large models (e.g., Qwen3.5-35B) require a Hugging Face account and access token.

How I Built This Using NEO

This project was built using NEO, a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The system was defined at a high level, describing a document QA workflow that avoids RAG by loading full documents into an LLM, saving the KV cache, and restoring it for repeated queries.

Based on this, NEO generated the implementation, handled debugging across CUDA, Python, and shell components, and validated the system through a series of GPU tests.

This included fixing multiple issues during development and running end-to-end validation to ensure ingestion, cache restoration, and query flows worked reliably.

How to Extend This Further with NEO

The system can be extended in several ways:

  • supporting multiple KV cache slots
  • improving handling of long-context attention limitations
  • optimizing cache storage and compression
  • exploring hybrid approaches combining CAG with retrieval
  • extending API capabilities

These extensions would require changes to the current implementation and can be explored based on system requirements.

Final Notes

Cache-Augmented Generation is an alternative way to approach document QA.

Instead of retrieving context at query time, it shifts the cost to a one-time preprocessing step and reuses the model’s KV cache.

This makes repeated queries faster and makes the document context available to the model through the KV cache, while introducing trade-offs in setup time and hardware requirements.

The code is at https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System
You can also build with NEO in your IDE using the VS Code extension or Cursor.
