RAG answered 70% of my questions with zero LLM tokens — here's the ingestion trick that made it possible

#ai #rag #llm #python

70% of queries. Zero LLM tokens. $0.00.
That's what happens when you fix ingestion instead of obsessing over retrieval.
I ran 25 questions against a 500-page nutrition textbook. 24/25 correct (96%). Most tutorials stop there. What they don't show: 17 of those 25 questions (68%) never touched an LLM at all. They were answered by BM25 + cosine similarity in under 20ms.
Here's why that's possible, and what I built to make it work.

The ingestion problem nobody admits
Every RAG tutorial shows the same pipeline:
PDF → extract text → split every 512 tokens → embed → store → query
It works fine for blog posts. It falls apart completely for anything structured.
Take a financial report with this revenue table:
| Region | Q2 Revenue | Q3 Revenue | Change |
| --- | --- | --- | --- |
| Europe | 38.1% | 45.2% | +7.1pp |
| Asia | 29.3% | 41.7% | +12.4pp |
| Americas | n/a | 52.1% | n/a |
After blind chunking at 512 tokens, your LLM receives:
"45.2% Q3 Europe 38.1% Q2 Europe 41.7% Q3 Asia 29.3%"
Numbers with no column headers. No caption. No context.
Ask "which region grew the most?" and you get an approximate guess — not an answer. The LLM isn't hallucinating because it's dumb. It's working with garbage input.
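Here is a toy reproduction of that failure. Whitespace-separated words stand in for tokens and the window is 6 instead of 512; both are simplifications for illustration, and `chunk_fixed` is a made-up helper, not part of any library.

```python
# Toy illustration: whitespace "tokens" and a tiny window stand in for
# a real tokenizer and a 512-token chunk size.

def chunk_fixed(tokens, size):
    """Split a token list into consecutive windows of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# The revenue table as a naive text extractor might flatten it, row by row.
flattened = (
    "Region Q2_Revenue Q3_Revenue Change "
    "Europe 38.1% 45.2% +7.1pp "
    "Asia 29.3% 41.7% +12.4pp "
    "Americas n/a 52.1% n/a"
)

chunks = chunk_fixed(flattened.split(), size=6)
for i, chunk in enumerate(chunks):
    print(i, " ".join(chunk))

# Structure-aware alternative: keep the headers attached to every row.
headers = ["Region", "Q2 Revenue", "Q3 Revenue", "Change"]
rows = [
    ["Europe", "38.1%", "45.2%", "+7.1pp"],
    ["Asia", "29.3%", "41.7%", "+12.4pp"],
]
records = [dict(zip(headers, row)) for row in rows]
print(records[1])
```

The second chunk starts at `45.2%` and contains no header at all, so nothing in it says which quarter or which column a number belongs to. The `records` list carries the column name with every value, which is the property a structure-aware ingester needs to preserve.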
The same silent failure happens with:

Legal contracts — clause split mid-sentence, both halves meaningless alone
API docs — code example separated from its description
Research papers — figure caption disconnected from its analysis

This is not a retrieval problem. It's an ingestion problem. And almost no one fixes it at the source.

What I built: DocNest
I spent the last few months building DocNest — a document normalization engine that reads structure before touching content.
Instead of chunks, every heading becomes a navigable §section with its own ID. Every table is preserved as structured JSON. Every section gets a one-sentence LLM summary and a BM25 keyword index — computed once at ingest, never again.
The output is a .udf file (Unified Document Format) — a self-contained, portable knowledge base. Share it by email, copy it to S3, open it in the VSCode extension.
```python
from docnest.parsers.pymupdf_pdf import PyMuPDFParser
from docnest.normalizer import SectionNormaliser
from docnest.writer import UDFWriter
from docnest.reader import UDFIndex

# Parse → normalise → save
# No API key needed for this step
raw = PyMuPDFParser().parse("report.pdf")
doc = SectionNormaliser().normalise(raw)
UDFWriter().write(doc, "report.udf")

# Query
idx = UDFIndex.load("report.udf")
result = idx.query(
    "Which region had the highest Q3 growth?",
    llm_provider="groq",
    llm_model="llama-3.3-70b-versatile",
    llm_api_key="gsk_...",  # free tier at console.groq.com
)

print(result.answer)       # "Asia grew the most, up +12.4pp"
print(result.layer_used)   # 1: answered from index, 0 LLM tokens
print(result.tokens_used)  # 0
```
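To make "every heading becomes a navigable §section" concrete, here is a minimal sketch of heading-based splitting on Markdown-style input. It is not DocNest's implementation: the `Section` class, the `§N` ID scheme, and `split_by_headings` are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    section_id: str          # hypothetical ID scheme: "§1", "§2", ...
    title: str
    level: int               # heading depth: 1 for "#", 2 for "##", ...
    lines: list = field(default_factory=list)

    @property
    def text(self):
        return "\n".join(self.lines).strip()


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Start a new Section at every heading; body lines attach to the latest one."""
    sections = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(
                Section(
                    section_id=f"§{len(sections) + 1}",
                    title=match.group(2).strip(),
                    level=len(match.group(1)),
                )
            )
        elif sections:
            sections[-1].lines.append(line)
        # Text before the first heading is ignored in this sketch.
    return sections


doc_text = """# Annual report
Overview text.

## Revenue by region
Asia grew fastest.

## Risks
Currency exposure.
"""

sections = split_by_headings(doc_text)
for s in sections:
    print(s.section_id, s.level, s.title)
```

Because a section boundary always falls on a heading, a paragraph or table is never cut in the middle the way a fixed token window cuts it. A real parser has to infer headings from PDF layout rather than from `#` markers, which is the hard part this sketch skips.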

The five-layer query engine
This is the part that makes the zero-token results possible.
Instead of sending the full document to an LLM, queries escalate through 5 layers — stopping as soon as one can answer confidently:
| Layer | Mechanism | Tokens | Typical latency |
| --- | --- | --- | --- |
| 0 | Pre-computed summary + key numbers | 0 | < 1ms |
| 1 | BM25 + cosine → navigate to exact §section | 0 | < 20ms |
| 2 | Section-scoped LLM call | ~300 | 1–3s |
| 3 | Multi-section synthesis | ~900 | 2–5s |
| 4 | Full document fallback | ~4000+ | 5–15s |
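For a sense of what the keyword half of Layer 1 involves, here is a small, self-contained Okapi BM25 scorer over section texts. It is a generic textbook-style sketch, not DocNest's code: `BM25Index`, `tokenize`, and the sample sections are assumptions, and the cosine-similarity half (which needs embeddings) is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Okapi BM25 over a dict of {section_id: text}, built once up front."""

    def __init__(self, sections, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = {sid: Counter(tokenize(text)) for sid, text in sections.items()}
        self.length = {sid: sum(c.values()) for sid, c in self.tf.items()}
        self.avg_len = sum(self.length.values()) / len(self.length)
        # Document frequency: each section contributes a term at most once.
        self.df = Counter(term for c in self.tf.values() for term in c)

    def idf(self, term):
        n, total = self.df.get(term, 0), len(self.tf)
        # "+ 1" inside the log keeps the weight non-negative for common terms.
        return math.log((total - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, sid):
        counts, norm = self.tf[sid], self.length[sid] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            f = counts.get(term, 0)
            if f:
                total += self.idf(term) * f * (self.k1 + 1) / (
                    f + self.k1 * (1 - self.b + self.b * norm)
                )
        return total

    def best(self, query):
        """Return (section_id, score) for the highest-scoring section."""
        return max(((sid, self.score(query, sid)) for sid in self.tf), key=lambda p: p[1])


index = BM25Index({
    "§1": "Overview of the annual report and company strategy",
    "§2": "Revenue by region: Asia grew fastest in Q3, Europe followed",
    "§3": "Risks include currency exposure and supply chain delays",
})
best_id, best_score = index.best("Which region grew fastest in Q3?")
print(best_id)
```

Everything in `__init__` can be computed at ingest and stored, so answering a query is a handful of dictionary lookups and no model call. That is why a layer like this can plausibly stay in the millisecond range.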
In practice on real documents: Layers 0 and 1 handle ~70% of questions — the factual ones, the number lookups, the "what does section 3 say about X" type queries. You only pay for LLM compute when the question genuinely requires reasoning.
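The escalation logic itself can be sketched in a few lines. This is an illustration of the pattern only: `LayerResult`, `run_cascade`, the 0.8 threshold, and the three stub layers are hypothetical, and how DocNest actually estimates confidence is not covered here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LayerResult:
    answer: str
    confidence: float   # 0.0 to 1.0, however the layer chooses to estimate it
    tokens_used: int


# A layer takes the question and returns a result, or None if it cannot answer.
Layer = Callable[[str], Optional[LayerResult]]


def run_cascade(question, layers, threshold=0.8):
    """Try layers in order; return (layer_number, result) for the first confident one."""
    for number, layer in enumerate(layers):
        result = layer(question)
        if result is not None and result.confidence >= threshold:
            return number, result
    return None, None


# Stub layers for demonstration only.
def summary_lookup(question):
    facts = {"how many pages?": "500"}
    answer = facts.get(question.lower())
    return LayerResult(answer, 1.0, 0) if answer else None


def keyword_lookup(question):
    if "region" in question.lower():
        return LayerResult("See §2: Asia, +12.4pp", 0.9, 0)
    return None


def llm_fallback(question):
    # A real layer would call an LLM here; this stub only mimics the cost.
    return LayerResult("(LLM-generated answer)", 0.85, 300)


layers = [summary_lookup, keyword_lookup, llm_fallback]
layer_used, result = run_cascade("Which region grew the most?", layers)
print(layer_used, result.tokens_used)
```

The cheap layers run first and the expensive ones only run when everything before them declines, so the token cost of a query is set by the hardest layer it actually reaches.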

Handling large PDFs without running out of RAM
Standard Docling (the ML-quality PDF parser) loads the full document into RAM. A 600-page PDF can exhaust most machines.
DocNest solves this with automatic page chunking:
```python
from docnest.parsers.pdf import DoclingPDFParser

# Auto-detects large PDFs and chunks automatically
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune explicitly for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # high throughput
```
PyMuPDF splits the PDF into N-page temp files. Docling processes each at full ML quality, and the resulting sections are merged. The output is identical to processing everything at once, but peak RAM depends on the chunk size rather than the document size.
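The splitting step can be approximated with PyMuPDF directly. This is a sketch of the general approach, not DocNest's internals: `page_ranges` and `split_pdf` are hypothetical helpers, the output file naming is made up, and `split_pdf` assumes PyMuPDF is installed.

```python
import os


def page_ranges(total_pages, chunk_pages):
    """Return inclusive, zero-based (first, last) page ranges covering the document."""
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        (start, min(start + chunk_pages, total_pages) - 1)
        for start in range(0, total_pages, chunk_pages)
    ]


def split_pdf(path, out_dir, chunk_pages=10):
    """Write each page range to its own PDF and return the file paths."""
    import fitz  # PyMuPDF; imported here so page_ranges works without it

    paths = []
    with fitz.open(path) as src:
        for first, last in page_ranges(src.page_count, chunk_pages):
            part = fitz.open()  # new empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            out_path = os.path.join(out_dir, f"pages_{first:05d}_{last:05d}.pdf")
            part.save(out_path)
            part.close()
            paths.append(out_path)
    return paths


print(page_ranges(25, 10))  # [(0, 9), (10, 19), (20, 24)]
```

Each file written by `split_pdf` can then be handed to the heavy parser one at a time, so only one chunk's worth of pages is being parsed at any moment. A smaller `chunk_pages` lowers the memory ceiling at the cost of more per-chunk overhead.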

Real accuracy numbers
I tested against a 500-page open-source nutrition textbook, 25 questions, using PyMuPDF + Groq free tier:
| Question type | Score |
| --- | --- |
| Basic facts (calories, macros) | 5/5 |
| Detailed nutrition (fiber, glycemic index) | 5/5 |
| Micronutrients (vitamins, minerals) | 4/5 |
| Hard synthesis (BMR, omega-3, antioxidants) | 5/5 |
| Edge cases (hallucination traps, tables, out-of-scope) | 5/5 |
| Total | 24/25 (96%) |
The one failure: a table-only page where PyMuPDF extracted no text content. Fix: use DoclingPDFParser for documents where tables are the primary information carrier.

Try it
```bash
pip install docnest-ai
```
Supported formats: PDF (Docling ML + PyMuPDF), DOCX, XLSX, HTML, Markdown
LLM providers: Groq (free tier works), OpenAI, Ollama (fully local), Anthropic, Google, Mistral, Cohere, Bedrock, Together
Vector backends: numpy (zero deps), FAISS, ChromaDB
CLI:
```bash
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key risks mentioned?"
docnest view report.udf  # opens structured HTML viewer in browser
```
→ GitHub: https://github.com/tailorgunjan93/docnest
→ PyPI: https://pypi.org/project/docnest-ai
→ Format spec: https://github.com/tailorgunjan93/udf-spec

What's broken, what's coming
Current version is 0.4.0a2 — alpha, but works on real documents.
Open for contributions:

PPTX parser (PowerPoint slides → §sections)
Qdrant / Weaviate vector backends
SharePoint + Confluence connectors
EPUB parser for ebook indexing

If you've hit the table-structure problem in your own RAG pipeline — where the LLM gets numbers without context — I'd genuinely like to hear what document type caused it. Drop it in the comments.

Built in the open. Issues and PRs welcome.