I showed my RAG app to a friend.
He asked: "which region grew the most last quarter?"
It said Europe. The answer was Asia. By a lot.
I spent two days debugging embeddings, chunk sizes, temperature settings.
The bug was none of those things.
The table had been turned into this:
"45.2% Q3 Europe 38.1% Q2 Asia 41.7%..."
Numbers with no headers. No caption. No context.
The LLM wasn't hallucinating. It was working with garbage.

🛠️ So I built the thing I wished existed
Meet DocNest: not another chunker.
A document normalization engine that reads structure before touching content.
Every heading → a navigable §section with its own ID
Every table → preserved as { caption, headers, rows[] } JSON
Every section → one-sentence LLM summary + BM25 keyword index
All of it → packed into a portable .udf file
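To make the table representation concrete, here is a minimal sketch of what a { caption, headers, rows[] } record could look like and why it keeps the question answerable. The key names follow the list above; the caption, the figures, and the helper names `column` and `largest_increase` are invented for illustration and are not taken from the .udf spec.

```python
import json

# Hypothetical table record. The keys mirror the { caption, headers, rows[] }
# shape described above; the caption and figures are made up for illustration.
table = {
    "caption": "Revenue growth by region",
    "headers": ["Region", "Q2", "Q3"],
    "rows": [
        ["Europe", 38.1, 45.2],
        ["Asia", 41.7, 54.1],
    ],
}


def column(table, name):
    """Return {row label: cell value} for the named column."""
    idx = table["headers"].index(name)
    return {row[0]: row[idx] for row in table["rows"]}


def largest_increase(table, start, end):
    """Return the row label whose value rose the most between two columns."""
    before, after = column(table, start), column(table, end)
    return max(before, key=lambda label: after[label] - before[label])


print(json.dumps(table, indent=2))
print(largest_increase(table, "Q2", "Q3"))
```

Because every number stays attached to its header and row label, the comparison is a dictionary lookup rather than a guess over a flattened string.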
python
from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex
# Convert: runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",         # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",  # local, no API key needed
)
pipeline.convert("report.pdf")   # → report.udf ✅
# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")
print(result.answer) # "Asia grew the most, up +12.4pp"
print(result.layer_used) # 1
print(result.tokens_used) # 0 (yes, really: zero)
✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.
⚡ The 5-layer query engine
Instead of dumping the full document into an LLM, queries escalate through layers, stopping the moment one can answer confidently.
| Layer | What it does | Tokens | Speed |
|---|---|---|---|
| 0 | Pre-computed summary + key numbers | 0 | < 1ms |
| 1 | BM25 + cosine → lands on exact §section | 0 | < 20ms |
| 2 | Section-scoped LLM call | ~300 | 1–3s |
| 3 | Multi-section synthesis | ~900 | 2–5s |
| 4 | Full document fallback | ~4000+ | 5–15s |
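The escalation idea can be sketched in a few lines. This is not DocNest's implementation: the `LayerResult` type, the `escalate` function, the 0.8 confidence threshold, and the three toy layers are all assumptions made up to show the control flow of "try the cheapest layer first, stop when one is confident".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LayerResult:
    answer: str
    confidence: float  # 0.0 to 1.0
    tokens_used: int


# A layer takes the query and returns a result, or None if it cannot answer.
Layer = Callable[[str], Optional[LayerResult]]


def escalate(query: str, layers: List[Layer],
             threshold: float = 0.8) -> Optional[Tuple[int, LayerResult]]:
    """Try layers in order; return (layer index, result) for the first
    result whose confidence reaches the threshold."""
    last = None
    for i, layer in enumerate(layers):
        result = layer(query)
        if result is None:
            continue
        last = (i, result)
        if result.confidence >= threshold:
            return last
    return last  # nothing was confident: fall back to the last attempt


def summary_layer(query):
    # Layer 0 stand-in: only answers questions about the overall summary.
    if "summary" in query.lower():
        return LayerResult("Pre-computed summary text", 0.95, 0)
    return None


def index_layer(query):
    # Layer 1 stand-in: pretends the index found a matching section.
    if "region" in query.lower():
        return LayerResult("Asia", 0.9, 0)
    return LayerResult("", 0.1, 0)


def llm_layer(query):
    # Layer 2 stand-in: a real engine would call an LLM here.
    return LayerResult("LLM answer", 0.85, 300)


layers = [summary_layer, index_layer, llm_layer]
print(escalate("Which region grew the most?", layers))
print(escalate("Explain the risk factors", layers))
```

The first query stops at the index layer with zero tokens; the second falls through to the LLM stand-in, which is the only point where tokens are spent.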
I expected layers 2–4 to do most of the work.
🤯 Layers 0 and 1 handle roughly 70% of real-world questions at zero token cost.
Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.
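For a feel of how a structured index can land on a section without any LLM call, here is a small sketch of Okapi BM25 blended with cosine similarity. It rests on assumptions: DocNest's actual scoring is not described in this post, the cosine here runs over plain term counts as a stand-in for real embeddings, and the `alpha` blend, the section IDs, and the section texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(text_a, text_b):
    """Cosine similarity of two term-count vectors (embedding stand-in)."""
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def best_section(query, sections, alpha=0.5):
    """Return the section ID with the highest blended score."""
    ids = list(sections)
    texts = [sections[i] for i in ids]
    bm25 = bm25_scores(query, texts)
    top = max(bm25) or 1.0  # avoid dividing by zero when nothing matches
    blended = [alpha * s / top + (1 - alpha) * cosine(query, t)
               for s, t in zip(bm25, texts)]
    return ids[blended.index(max(blended))]


# Hypothetical section index: ID -> section text.
sections = {
    "s1": "company overview and mission statement",
    "s2": "revenue growth by region for q3",
    "s3": "risk factors and legal notices",
}
print(best_section("q3 growth by region", sections))
```

Everything here is arithmetic over a prebuilt index, which is why this kind of lookup costs no tokens.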
Real numbers. Not vibes.
25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.
| Question type | Score |
|---|---|
| Basic facts (calories, macros) | ✅ 5/5 |
| Detailed nutrition (fiber, glycemic index) | ✅ 5/5 |
| Micronutrients (vitamins, minerals) | ✅ 4/5 |
| Hard synthesis (BMR, omega-3, antioxidants) | ✅ 5/5 |
| Edge cases + hallucination traps | ✅ 5/5 |
| Total | 24/25 (96%) |
The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.
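The headline figure is easy to re-derive from the per-category scores. The snippet below only redoes that arithmetic; the dictionary layout is an illustrative choice, not part of any DocNest evaluation harness.

```python
# Per-category results as (correct, asked), copied from the scores above.
results = {
    "Basic facts": (5, 5),
    "Detailed nutrition": (5, 5),
    "Micronutrients": (4, 5),
    "Hard synthesis": (5, 5),
    "Edge cases + hallucination traps": (5, 5),
}

correct = sum(c for c, _ in results.values())
asked = sum(n for _, n in results.values())
accuracy = correct / asked

print(f"{correct}/{asked} = {accuracy:.0%}")
```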
🧠 Handles 600-page PDFs without exploding your RAM
Standard Docling loads the full document into memory. 600 pages on a normal laptop = out of memory.
DocNest splits the PDF into page chunks automatically, processes each chunk at full ML quality, and merges the output. Peak RAM is bounded by the chunk size rather than the document size.
python
from docnest.parsers.pdf import DoclingPDFParser
# Just works: auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")
# Or tune for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf") # 💻 low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf") # 🚅 speed mode
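The chunk-then-merge pattern behind `chunk_pages` can be illustrated without any PDF library. This is a sketch of the general technique, not DocNest's code: `page_ranges`, `parse_in_chunks`, and the `fake_parse` stand-in are hypothetical names, and a real parser would return structured content rather than strings.

```python
def page_ranges(total_pages, chunk_pages):
    """Yield (start, end) page ranges, end exclusive."""
    for start in range(0, total_pages, chunk_pages):
        yield start, min(start + chunk_pages, total_pages)


def parse_in_chunks(total_pages, chunk_pages, parse_range):
    """Parse one page range at a time and merge the per-range outputs."""
    merged = []
    for start, end in page_ranges(total_pages, chunk_pages):
        # Only this range's working set needs to be in memory while parsing.
        merged.extend(parse_range(start, end))
    return merged


def fake_parse(start, end):
    # Stand-in for an expensive ML parser working on pages [start, end).
    return [f"page {n}" for n in range(start, end)]


pages = parse_in_chunks(600, 50, fake_parse)
print(len(pages))
print(list(page_ranges(25, 10)))
```

A smaller `chunk_pages` lowers the peak working set at the cost of more parser invocations, which matches the low-RAM versus speed trade-off in the snippet above.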
Try it
bash
pip install docnest-ai
Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown
LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere
Vector backends: numpy (zero deps) · FAISS · ChromaDB
bash
# CLI, because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf # structured HTML viewer in browser
GitHub repo (star it if this solved a problem you've had):
tailorgunjan93/docnest
The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.
The Problem with RAG Today
Every RAG pipeline ingests documents the same broken way:
PDF → extract text → split every 512 chars → embed → store → hope
What gets silently destroyed:
| Source | What blind chunking loses |
|---|---|
| Financial report | Table row `45.2% \| Q3 \| Europe` has no column headers |
| Legal contract | Clause split mid-sentence across two chunks |
| API documentation | Code example separated from its description |
| Research paper | Figure caption disconnected from its analysis |
The LLM receives noise and returns approximate answers. This is not a retrieval problem; it is an ingestion problem.
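A few lines of Python show how fixed-size splitting orphans table rows. This is an illustration only: the table text and its figures are invented, and the chunk size is shrunk from 512 to 24 characters so the effect is visible on a three-line table.

```python
def split_fixed(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Invented example table, flattened to text the way an extractor might emit it.
table_text = (
    "Region | Q2 | Q3\n"
    "Europe | 38.1% | 45.2%\n"
    "Asia | 41.7% | 54.1%\n"
)

chunks = split_fixed(table_text, 24)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))

# Chunks that hold numbers but not the header row have lost their column names.
orphaned = [c for c in chunks if "%" in c and "Region" not in c]
print(len(orphaned), "chunk(s) carry numbers without headers")
```

Once such a chunk is embedded and retrieved on its own, nothing in it says which column a percentage belongs to.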
See the difference
Take a financial report with a revenue table…
PyPI: https://pypi.org/project/docnest-ai
Format spec: https://github.com/tailorgunjan93/udf-spec
GitHub: github.com/tailorgunjan93/docnest
PyPI: pypi.org/project/docnest-ai
Spec: github.com/tailorgunjan93/udf-spec
The .udf format is an open spec: build on it, extend it, contribute to it.
Stars and contributions genuinely appreciated.
#rag #llm #python #ai #opensource #machinelearning
#nlp #documentai #vectorsearch #buildinpublic