Picture this.
It's a client demo. They're watching. I type:
"Which region had the highest revenue growth last quarter?"
My RAG app (three weeks of work, carefully tuned embeddings, clever prompts) responds instantly.
The client nods. Writes it down.
The answer was wrong. By almost double.
I spent three days debugging the wrong things.
Chunk size? Tried 256, 512, 1024. Nothing.
Temperature? 0.0, 0.3, 0.7. Still wrong.
Embeddings model? Swapped three of them. Nope.
Prompt engineering? Added "think step by step", "be precise", "do not hallucinate". Nothing.
The LLM wasn't hallucinating. It was doing its best with this:
"45.2% Q3 Europe 38.1% Q2 Europe 41.7% Q3 Asia 29.3%"
Orphaned numbers. No column headers. No caption. No context.
The original table had all of that. My chunker ate it silently.
⚠️ The bug was never in retrieval. It was in ingestion. And I never thought to look there.
🔥 The dirty secret of RAG tutorials
Every tutorial shows you this pipeline:
PDF → extract text → chunk at 512 tokens → embed → store → retrieve → answer
Clean. Simple. Completely wrong for structured documents.
Here's what blind chunking silently destroys:
| Document | What you had | What the LLM gets |
|---|---|---|
| Financial report | Revenue table with headers | Orphaned numbers, zero context |
| Legal contract | 3-page clause | Split mid-sentence, both halves useless |
| API docs | Function + code example | Code separated from its description |
| Research paper | Figure with caption | Caption on chunk 7, analysis on chunk 12 |
🗑️ You're feeding the LLM garbage and expecting gold. The model isn't dumb; it's working with broken input.
🛠️ So I built the thing I wished existed
Meet DocNest: not another chunker.
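To make the failure concrete, here is a minimal sketch of a structure-blind chunker. The table text, the `chunk_fixed` helper, and the window size are all hypothetical (it splits on characters rather than tokens to stay dependency-free), but the effect is the same: the data rows land in chunks that no longer contain the header row.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical extracted text: a caption, a header row, then data rows.
extracted = (
    "Table 2: Revenue growth by region\n"
    "Region | Quarter | Growth\n"
    "Europe | Q2 | 38.1%\n"
    "Europe | Q3 | 45.2%\n"
    "Asia | Q2 | 29.3%\n"
    "Asia | Q3 | 41.7%\n"
)

chunks = chunk_fixed(extracted, size=60)

# Chunks that carry percentages but not the header row that explains them.
orphaned = [c for c in chunks if "%" in c and "Region" not in c]

for chunk in orphaned:
    print(repr(chunk))
```

Every chunk in `orphaned` is what a retriever would hand the LLM: numbers with no column names attached.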
A document normalization engine that reads structure before touching content.
- Every heading → a navigable §section with its own ID
- Every table → preserved as { caption, headers, rows[] } JSON
- Every section → one-sentence LLM summary + BM25 keyword index
- All of it → packed into a portable .udf file
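As an illustration of the { caption, headers, rows[] } shape, here is a sketch of how the revenue table from the demo might look once preserved. The caption, the column names, and the pairing of each percentage with a region and quarter are assumptions read off the orphaned string above; the actual .udf schema may differ.

```python
import json

# Hypothetical preserved table; field names follow the { caption, headers, rows[] } shape.
table = {
    "caption": "Revenue growth by region",  # assumed caption
    "headers": ["Region", "Quarter", "Growth"],  # assumed column names
    "rows": [
        ["Europe", "Q2", "38.1%"],
        ["Europe", "Q3", "45.2%"],
        ["Asia", "Q2", "29.3%"],
        ["Asia", "Q3", "41.7%"],
    ],
}


def table_to_records(tbl: dict) -> list[dict]:
    """Re-attach the headers to every row so no value is ever orphaned."""
    return [dict(zip(tbl["headers"], row)) for row in tbl["rows"]]


records = table_to_records(table)
serialized = json.dumps(table, indent=2)
print(records[0])
```

Because each row can be rejoined with its headers, a value such as 45.2% always travels with "Europe" and "Q3".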
from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex
# Convert: runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",          # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",   # local, no API key needed
)
pipeline.convert("report.pdf")  # → report.udf
# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")
print(result.answer) # "Asia grew the most, up +12.4pp"
print(result.layer_used) # 1
print(result.tokens_used) # 0 (yes, really: zero)
✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.
⚡ The 5-layer query engine
Instead of dumping the full document into an LLM, queries escalate through layers, stopping the moment one can answer confidently.
| Layer | What it does | Tokens | Speed |
|---|---|---|---|
| 0 | Pre-computed summary + key numbers | 0 | < 1ms |
| 1 | BM25 + cosine → lands on exact §section | 0 | < 20ms |
| 2 | Section-scoped LLM call | ~300 | 1–3s |
| 3 | Multi-section synthesis | ~900 | 2–5s |
| 4 | Full document fallback | ~4000+ | 5–15s |
I expected layers 2–4 to do most of the work.
🤯 Layers 0 and 1 handle roughly 70% of real-world questions, at zero token cost.
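The escalation pattern in the table can be sketched roughly as follows. This is not DocNest's implementation: `LayerResult`, `escalate`, the confidence threshold, and the three stub layers are hypothetical names used only to show the control flow of "try the cheapest layer first, stop at the first confident answer".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LayerResult:
    answer: str
    confidence: float  # hypothetical score in [0, 1]
    tokens_used: int


Layer = Callable[[str], Optional[LayerResult]]
calls: list[str] = []  # records which layers actually ran


def summary_layer(question: str) -> Optional[LayerResult]:
    calls.append("summary")
    return None  # nothing pre-computed for this question


def index_layer(question: str) -> Optional[LayerResult]:
    calls.append("index")
    return LayerResult("Asia grew the most", confidence=0.9, tokens_used=0)


def llm_layer(question: str) -> Optional[LayerResult]:
    calls.append("llm")
    return LayerResult("(LLM answer)", confidence=0.95, tokens_used=300)


def escalate(question: str, layers: list[Layer], threshold: float = 0.8):
    """Run layers in order; return (layer_index, result) for the first confident answer."""
    last = None
    for index, layer in enumerate(layers):
        result = layer(question)
        if result is None:
            continue
        last = (index, result)
        if result.confidence >= threshold:
            return last
    if last is None:
        raise LookupError("no layer produced an answer")
    return last  # fall back to the most expensive answer seen


layer_used, result = escalate(
    "Which region had the highest Q3 growth?",
    [summary_layer, index_layer, llm_layer],
)
print(layer_used, result.tokens_used, calls)
```

In this run the LLM stub is never invoked, which is the whole point: the expensive layers cost nothing unless the cheap ones fail.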
Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.
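For readers unfamiliar with BM25, here is a small pure-Python sketch of keyword scoring over sections. It covers only the BM25 half of layer 1 (no cosine similarity), and the section IDs, section text, and whitespace tokenizer are made up for the example; DocNest's own index may be built differently.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer; real systems usually strip punctuation and stem."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sections keyed by section ID.
sections = {
    "s1": "company overview and mission statement",
    "s2": "revenue growth by region for q2 and q3",
    "s3": "legal notices and risk factors",
}
section_ids = list(sections)
scores = bm25_scores("revenue growth by region q3", list(sections.values()))
best_id = section_ids[scores.index(max(scores))]
print(best_id)
```

No model call is involved anywhere in this lookup, which is why a layer built on it can report zero tokens.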
Real numbers. Not vibes.
25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.
| Question type | Score |
|---|---|
| Basic facts (calories, macros) | 5/5 |
| Detailed nutrition (fiber, glycemic index) | 5/5 |
| Micronutrients (vitamins, minerals) | 4/5 |
| Hard synthesis (BMR, omega-3, antioxidants) | 5/5 |
| Edge cases + hallucination traps | 5/5 |
| Total | 24/25 (96%) |
The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.
🧠 Handles 600-page PDFs without exploding your RAM
Standard Docling loads the full document into memory. 600 pages on a normal laptop = out of memory.
DocNest splits the PDF into page chunks automatically, processes each at full ML quality, and merges the output. Peak RAM stays constant regardless of document size.
from docnest.parsers.pdf import DoclingPDFParser
# Just works: auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")
# Or tune for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf") # low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf") # speed mode
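The chunk-then-merge idea behind chunk_pages can be sketched generically as below. This is an illustration, not DoclingPDFParser's internals: `page_batches`, `parse_in_batches`, and the stub parser are hypothetical, and the stub stands in for whatever heavy ML parsing happens per batch.

```python
from typing import Callable, Iterator


def page_batches(total_pages: int, chunk_pages: int) -> Iterator[range]:
    """Yield consecutive zero-based page ranges of at most `chunk_pages` pages."""
    if chunk_pages <= 0:
        raise ValueError("chunk_pages must be positive")
    for start in range(0, total_pages, chunk_pages):
        yield range(start, min(start + chunk_pages, total_pages))


def parse_in_batches(
    total_pages: int,
    chunk_pages: int,
    parse_batch: Callable[[range], list[str]],
) -> list[str]:
    """Parse one batch at a time and merge the lightweight results in page order."""
    merged: list[str] = []
    for batch in page_batches(total_pages, chunk_pages):
        # Only this batch's working set is held by the parser at any moment.
        merged.extend(parse_batch(batch))
    return merged


def stub_parse(batch: range) -> list[str]:
    """Stand-in for an expensive parser; returns one label per page."""
    return [f"page {p + 1}" for p in batch]


batches = list(page_batches(600, 50))
parsed = parse_in_batches(25, 10, stub_parse)
print(len(batches), parsed[0], parsed[-1])
```

A smaller batch size lowers the parser's peak working memory at the cost of more per-batch overhead, which matches the low-RAM versus speed trade-off in the two calls above. The merged output itself still grows with the document.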
Try it
pip install docnest-ai
Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown
LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere
Vector backends: numpy (zero deps) · FAISS · ChromaDB
# CLI, because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf # structured HTML viewer in browser
GitHub repo (star it if this solved a problem you've had): github.com/tailorgunjan93/docnest
The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.
PyPI: https://pypi.org/project/docnest-ai
Format spec: https://github.com/tailorgunjan93/udf-spec
🚨 Honesty tax
🚧 This is 0.4.0a2, an alpha. It works on real documents, but the PPTX parser isn't built yet, Qdrant/Weaviate backends are on the roadmap, and SharePoint/Confluence connectors are planned.
If any of those sound like something you want to build, good first issues are labeled and waiting.
💬 One question for you
Most RAG infrastructure assumes text extraction is a solved problem.
It isn't. Not for tables. Not for anything where position and relationship carry meaning.
💬 What document type has caused you the most RAG pain?
For me it was financial tables. Drop yours in the comments. If it's a format DocNest doesn't handle yet, that's probably the next parser I build.
Building in the open at github.com/tailorgunjan93/docnest. Stars, issues, and brutal feedback all welcome.