DEV Community: Gunjan Tailor

Your .NET RAG stack hides a Python sidecar. I built the engine that removes it.

Gunjan Tailor — Tue, 16 Jun 2026 05:09:08 +0000

TL;DR — Every .NET RAG project quietly ships a Python sidecar to do one job: chunk documents. I got rid of mine. DocNest .NET is an idiomatic C# / .NET 8 port of my DocNest engine — embeddings run locally (ONNX MiniLM, no key, offline), the LLM is optional (factual questions answered at zero tokens), and the .udf knowledge base it writes is byte-compatible with the Python version. Ingest in Python, query in C#. It's on NuGet today. Repo · NuGet.

The compromise every .NET dev quietly accepts

You're building on .NET. The product needs to answer questions over a pile of PDFs, contracts, spreadsheets — real retrieval-augmented generation. So you go looking for tooling, and you find the same thing I did:

It's all Python.

LangChain, LlamaIndex, every RAG tutorial worth reading — Python, Python, Python. So you do the thing nobody admits to in the architecture review: you stand up a little Python service on the side. A second runtime to containerize, deploy, version, monitor, and wake up to at 3 a.m. when it OOMs. All so it can split a document into chunks and hand them back to your actual app.

A whole extra language in production to chop up a PDF. I stared at that diagram one too many times and decided it had to go.

So I ported DocNest to C#. Not a wrapper shelling out to python.exe — a real, idiomatic .NET port. async/await end to end, every dependency behind an interface, shipped as proper NuGet packages. Nothing Python left in the runtime.

But to explain why DocNest is worth porting, I have to tell you about the bug that started the whole thing.

The 3-day bug that started all of this

A RAG app I'd built gave a client a confidently wrong number. Not "I don't know" — a clean, specific, wrong answer, delivered with total confidence. I spent three days assuming my retrieval ranking was off, tuning embeddings and k values and similarity thresholds.

The ranking was fine. The problem happened before any of that — at ingestion. Here's how almost every pipeline reads a document:

PDF → extract text → split every 512 chars → embed → store → hope

Watch what that does to a revenue table:

chunk_1: "45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3"
chunk_2: "Asia   29.3%  Q2  Asia  Americas  52.1%  Q3  Ame"

The headers are gone. The rows are shredded across a chunk boundary. The model receives a bag of loose numbers with no idea which is revenue, which is a quarter, which region they belong to — and fills the gap with a confident guess. That's not a model problem or a retrieval problem. It's an ingestion problem. You destroyed the meaning before the model ever saw the data.
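
To see the failure in miniature, here is a small sketch of fixed-size chunking applied to a flattened table. The helper name `chunk_fixed` and the 32-character window are illustrative assumptions (a stand-in for the 512-character split above), not code from any particular pipeline.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# A revenue table flattened to plain text, roughly as a naive extractor might emit it.
table_text = (
    "Region  Q2     Q3\n"
    "Europe  38.1%  45.2%\n"
    "Asia    29.3%  41.7%\n"
)

# 32 characters is a hypothetical stand-in for the 512-character window.
chunks = chunk_fixed(table_text, size=32)

for number, chunk in enumerate(chunks, start=1):
    print(f"chunk_{number}: {chunk!r}")
```

The second chunk holds Europe's Q3 value, but neither the word "Europe" nor the header row travels with it, which is exactly the context a model would need.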

Reading the document like a human would

A person doesn't read a report as one long character stream. They see headings, sections, a table with columns. DocNest does the same: it reads the document's structure first. Every heading becomes a navigable §section. Every table is preserved as structured data — never flattened:

{
  "section": "§4.2 Revenue by Region",
  "table": {
    "headers": ["Region", "Q2", "Q3", "Change"],
    "rows": [
      ["Europe", "38.1%", "45.2%", "+7.1pp"],
      ["Asia",   "29.3%", "41.7%", "+12.4pp"]
    ]
  }
}

Same numbers, same model, same question — but now the answer is right, and it comes with a citation. The document is normalised once into a portable .udf file: a self-contained ZIP holding the section index, key numbers, keywords, section text, and quantised embeddings. Parse once, query forever.
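
As a rough illustration of why the structured form helps, the sketch below answers "which region grew the most?" directly from the table JSON shown above, with no model involved. The function names are hypothetical and this is not DocNest's actual query code.

```python
# The same section/table structure shown above.
section = {
    "section": "§4.2 Revenue by Region",
    "table": {
        "headers": ["Region", "Q2", "Q3", "Change"],
        "rows": [
            ["Europe", "38.1%", "45.2%", "+7.1pp"],
            ["Asia", "29.3%", "41.7%", "+12.4pp"],
        ],
    },
}

def pct(cell: str) -> float:
    """Convert a cell such as '41.7%' to the float 41.7."""
    return float(cell.rstrip("%"))

def largest_growth(sec: dict) -> tuple[str, float, str]:
    """Return (region, growth in percentage points, citation) for the biggest Q2 to Q3 rise."""
    table = sec["table"]
    col = {name: index for index, name in enumerate(table["headers"])}

    def growth(row: list[str]) -> float:
        return pct(row[col["Q3"]]) - pct(row[col["Q2"]])

    best = max(table["rows"], key=growth)
    return best[col["Region"]], round(growth(best), 1), sec["section"]

region, points, citation = largest_growth(section)
print(f"{region} grew the most, up +{points}pp (source: {citation})")
```

Because the headers stay attached to the rows, the lookup is plain column arithmetic, and the section label doubles as the citation.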

The trick that makes this a real port, not a rewrite

Here's the part I'm proud of. The .udf format is an open spec, and the .NET writer produces files that are byte-compatible with the Python engine. That one constraint unlocks something genuinely useful:

Ingest a 600-page annual report in a nightly Python batch job, then
Ship the small .udf to your C# desktop app or ASP.NET service and query it offline — with no Python anywhere in the runtime.

One ingestion ecosystem, two languages, the same artifact moving between them. Nothing in the codebase is allowed to break that cross-ecosystem contract — it's the whole point.

Two knobs people always confuse

When I describe this, two questions come back every time. They're actually two independent choices:

1. Embeddings run locally. A small ONNX MiniLM model (~90 MB) downloads once and caches. No API key, fully offline. There's an optional ONNX cross-encoder reranker for dense PDFs.

2. The LLM is optional. Answer Layers 0–1 resolve factual questions deterministically — zero tokens, no key. You only bring an LLM for synthesis, and when you do, "OpenAI" means the answer model, not embeddings. The two never get coupled.

Try it in 60 seconds — no key, no internet

dotnet add package DocNest.Core
dotnet add package DocNest.Parsers
dotnet add package DocNest.Retrieval
dotnet add package DocNest.Query

using DocNest;
using DocNest.Parsers;
using DocNest.Pipeline;
using DocNest.Query;
using DocNest.Retrieval;
using DocNest.Udf;

// Parse → normalise → write a portable .udf
var raw = await new ParserFactory().Get("report.pdf").ParseAsync("report.pdf");
var doc = new DocNestPipeline().Process(raw);
await new UdfWriter().WriteAsync(doc, "report.udf");

// Load it back and ask — deterministic layers, no LLM
var document = (await UdfReader.LoadAsync("report.udf")).ToDocument();

using var retriever = new HybridRetriever(".docnest_cache");
var engine = new DocNestQueryEngine(retriever);   // no LLM → Layers 0–1 only
var result = await engine.AnswerAsync(document, "What was Q3 revenue?", allowLlm: false);

Console.WriteLine(result.Answer);     // "Q3 revenue: $38M (source: §3.1)"
Console.WriteLine(result.TokensUsed); // 0

Prefer the terminal?

dotnet tool install -g DocNest.Cli
docnest convert report.pdf -o report.udf
docnest query report.udf "What was Q3 revenue?"

When you actually need an LLM

OpenAiCompatibleLlmProvider talks to OpenAI, Groq, Cerebras, Together, OpenRouter and local servers (Ollama, LM Studio) — change the base URL and model. Anthropic has its own provider.

ILlmProvider llm = new OpenAiCompatibleLlmProvider(
    apiKey:  Environment.GetEnvironmentVariable("GROQ_API_KEY")!,
    model:   "llama-3.3-70b-versatile",
    baseUrl: "https://api.groq.com/openai/v1");

var engine = new DocNestQueryEngine(retriever, llm);
var result = await engine.AnswerAsync(document, "Summarise the key risks.", allowLlm: true);
Console.WriteLine(string.Join(", ", result.Citations));  // ["§5.2", "§5.3"]

Under the hood: five layers, escalate only when needed

file  → IParser → DocNestPipeline (normalise · key-numbers · keywords) → Document → .udf
query → HybridRetriever (BM25 + dense + cross-encoder rerank + RRF + 1-hop graph) → top-k
      → DocNestQueryEngine (5 layers) → answer + citations + tokens + confidence

Layer	Mechanism	Tokens
0	Pre-computed key-numbers / summary	0
1	Extractive from the top section	0
2	Single-section LLM	~300
3	Multi-section synthesis (reranked context)	~900
4	Broad fallback over retrieved sections	~1,500

The engine climbs this ladder only when a cheaper rung isn't confident. Layers 0–1 handle a surprising share of real factual questions at zero cost — you pay tokens only for genuine synthesis.
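
The escalation idea can be sketched in a few lines. Everything here is an assumption made for illustration: the `LayerResult` shape, the 0.7 confidence threshold, and the stub layers are hypothetical, and the real engine's internals may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LayerResult:
    answer: str
    confidence: float  # 0.0 to 1.0, however the layer chooses to estimate it
    tokens: int        # LLM tokens spent producing the answer

# A layer takes a question and returns a result, or None if it cannot answer.
Layer = Callable[[str], Optional[LayerResult]]

def climb(question: str, layers: list[Layer], threshold: float = 0.7,
          allow_llm: bool = True, first_llm_layer: int = 2):
    """Try layers cheapest-first and stop at the first confident answer."""
    for index, layer in enumerate(layers):
        if not allow_llm and index >= first_llm_layer:
            break  # never spend tokens when the caller forbids it
        result = layer(question)
        if result is not None and result.confidence >= threshold:
            return index, result
    return None  # nothing was confident enough

# Hypothetical stub layers standing in for the real ones.
KEY_NUMBERS = {"q3 revenue": "Q3 revenue: $38M (source: §3.1)"}

def layer0(question: str) -> Optional[LayerResult]:
    for key, value in KEY_NUMBERS.items():
        if key in question.lower():
            return LayerResult(value, confidence=0.95, tokens=0)
    return None

def layer1(question: str) -> Optional[LayerResult]:
    return None  # pretend extraction found nothing usable

def layer2(question: str) -> Optional[LayerResult]:
    return LayerResult("(single-section LLM answer)", confidence=0.8, tokens=300)

LAYERS = [layer0, layer1, layer2]
print(climb("What was Q3 revenue?", LAYERS))
```

With `allow_llm=False` the loop stops before the first token-spending layer, which mirrors the `allowLlm: false` call in the quick-start above.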

The benchmark I almost didn't publish

A multi-format eval — 10 documents, 88 questions, 5 formats (the same set as the Python reference), dense + cross-encoder rerank, gpt-oss-120b narrator, qwen2.5 judge:

Format	Score	Hit-rate (≥7)
XLSX	8.7 / 10	93%
MD	8.7 / 10	100%
DOCX	7.0 / 10	79%
HTML	4.8 / 10	50%
PDF	6.8 / 10	70%
Overall	~7.1 / 10	~78%

The Python reference sits at 8.5/10. This .NET port is at 7.1 and closing the gap slice by slice — the cross-encoder reranker alone dragged PDFs from 5.1 → 6.8 (hit-rate 47% → 70%). HTML is clearly my weakest format right now, and it's the next thing I'm fixing.

I could have cherry-picked a kinder run and quoted a bigger number. I'd rather ship the reproducible one with the eval harness sitting right next to it in the repo. If you don't trust a benchmark you can't re-run, neither do I.

What ships in the box

Package	Role
`DocNest.Abstractions`	Domain records + wrapper interfaces
`DocNest.Core`	Pipeline, normaliser, `.udf` reader/writer, quantizer
`DocNest.Parsers`	md / html / csv / docx / xlsx / pdf
`DocNest.Embeddings`	ONNX MiniLM embedder + ms-marco cross-encoder reranker
`DocNest.Retrieval`	Hybrid retriever (FTS5 BM25 + dense + rerank + RRF + graph)
`DocNest.Query`	5-layer answer engine + LLM providers
`DocNest.Storage`	`.udf` ZIP storage backend
`DocNest.Cli`	`docnest` dotnet tool

Parsers cover PDF (PdfPig), DOCX/XLSX (OpenXML), HTML (AngleSharp), CSV/TSV and Markdown. Every external dependency lives behind a DocNest interface, so swapping any of them is a one-line change.

Where it stands — honestly

This is pre-1.0, built slice-by-slice under a gated protocol: understand → plan → design + ADR → tests-first → full suite green → sign-off, per phase. The core pipeline, hybrid retrieval, cross-encoder reranking and the 5-layer engine are implemented and tested. Cloud embedding providers (OpenAI embeddings and friends) exist in the Python engine but aren't ported yet — embeddings here are local-only by design.

Try it

dotnet add package DocNest.Core
# or
dotnet tool install -g DocNest.Cli

Repo: https://github.com/tailorgunjan93/docnest-net
NuGet: https://www.nuget.org/profiles/GunjanTailor
Python original: https://github.com/tailorgunjan93/docnest (pip install docnest-ai)
.udf spec: https://github.com/tailorgunjan93/udf-spec

If you've ever stood up a Python sidecar just to chunk a PDF for a .NET app, I'd genuinely like to know whether this kills that step for you — tell me in the comments. And if it does, a star on the repo helps other .NET folks find it.

I Built a Local-First AI Desktop Knowledge Base — Here's What I Learned

Gunjan Tailor — Sat, 30 May 2026 10:17:30 +0000

After building docnest-ai — a hybrid RAG engine for Python — the next logical question was: what does a great end-user app built on top of it actually look like?

That question led me to build Knovex: a local-first, AI-powered desktop knowledge base that runs entirely on your machine. No cloud uploads. No subscriptions. No data leakage. Just drop in your documents, ask questions, and learn.

This post covers the architecture decisions, the problems I hit, and the interesting technical bits. If you want to skip straight to the app: tailorgunjan93.github.io/knovex

Why build a desktop app in 2026?

Every AI knowledge tool I tried had the same deal: your documents leave your machine. Legal contracts, research notes, personal journals — all uploaded to some company's inference server. The privacy trade-off felt wrong.

The local-first principle changes the threat model entirely:

Your files never leave your machine unless you choose to enable cloud features
The app works fully offline (use Ollama for a zero-network setup)
API keys are encrypted at rest with Fernet (AES-128-CBC plus HMAC-SHA256), readable only by your OS account

The constraint also forced better engineering. When you can't lean on a cloud backend, you have to make the local stack actually fast.

Architecture overview

Knovex is a fully decoupled tri-layer app:

┌─────────────────────────────────────────┐
│  Electron 33 (desktop shell)            │
│  ┌─────────────────────────────────┐    │
│  │  React 18 + MUI v6 + TypeScript │    │
│  │  TanStack Query v5 + Zustand    │    │
│  └──────────────┬──────────────────┘    │
└─────────────────│───────────────────────┘
                  │  REST + SSE  (localhost:8765)
┌─────────────────▼───────────────────────┐
│  FastAPI + Python 3.11                  │
│  docnest-ai (hybrid RAG engine)         │
│  SQLite WAL + FTS5                      │
│  LiteLLM (multi-provider LLM bridge)    │
└─────────────────────────────────────────┘

The frontend is a pure API consumer — it knows nothing about RAG, embeddings, or LLMs. All intelligence lives in the Python backend. This made it very easy to swap out components independently.

Why Electron?

Electron gets a bad reputation, but for a privacy-first desktop app it's the right call:

Single installer ships backend binary (PyInstaller) + frontend + Electron in one .exe/.dmg/.AppImage
The backend process is spawned as a child process, communicates over localhost
Window state, tray, native OS file dialogs — all handled properly
Cross-platform with one codebase

The binary is ~85-92 MB depending on platform. Not tiny, but users get zero setup — no Python, no Node, no CLI gymnastics.

The RAG engine: docnest-ai

Rather than naive chunking (split every 512 chars → embed → hope), docnest-ai runs a 6-stage normalization pipeline:

1. Structure extraction: reads heading hierarchy, tables, lists (Docling or PyMuPDF)
2. Section assignment: every heading becomes a navigable §section
3. Table normalization: { caption, headers, rows[] } JSON, never loses column context
4. Section summarization: LLM called once per document
5. Document intelligence: summary, key numbers, insights
6. Embedding + quantize: BM25 keywords + float16 vectors

Stages 1–3 and 6 run locally at zero LLM cost. Stages 4–5 call an LLM once per document at ingest time. Every future query benefits from that upfront investment for free.
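
The float16 step is easy to picture with the standard library alone. This is a minimal sketch of half-precision packing; the function names are hypothetical and the byte layout (little-endian, no header) is an assumption, since the actual on-disk layout is whatever the .udf spec defines.

```python
import struct

def quantise_f16(vector: list[float]) -> bytes:
    """Pack floats as little-endian IEEE 754 half precision (2 bytes per value)."""
    return struct.pack(f"<{len(vector)}e", *vector)

def dequantise_f16(blob: bytes) -> list[float]:
    """Unpack half-precision bytes back into Python floats."""
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))

embedding = [0.12345678, -0.5, 0.001953125, 0.75]
blob = quantise_f16(embedding)
restored = dequantise_f16(blob)

print(len(blob), "bytes instead of", 4 * len(embedding), "as float32")
print(max(abs(a - b) for a, b in zip(embedding, restored)))
```

The vector takes half the space of float32, at the cost of a small rounding error per component, which is usually tolerable for cosine-similarity ranking.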

Query resolution: five layers

The query engine tries cheaper layers first before escalating:

Layer	Mechanism	Tokens	Latency
L0	Pre-computed summary/insights	0	< 1ms
L1	BM25 + cosine → navigate to §section	0	< 20ms
L2	Section-scoped LLM	~300	1–3s
L3	Multi-section synthesis	~900	2–5s
L4	Full-document fallback	~4000+	5–15s

In practice, L0+L1 answer ~70% of real-world questions at zero LLM cost. You only pay when you genuinely need the model.

Semantic search (v0.7.0+)

For Knovex v0.7.0 I added hybrid semantic search on top:

# ONNX-based local embeddings (all-MiniLM-L6-v2, ~45 MB, one-time download)
# OR OpenAI text-embedding-3-small via API

# Results fused with Reciprocal Rank Fusion (RRF):
# score = 1/(k + rank_fts5) + 1/(k + rank_ann)

RRF handles the case where BM25 ranks one document high on keyword match while the semantic model ranks a different one high on conceptual similarity. The fused ranking tends to beat either retriever on its own.
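
The formula in the comment above translates almost directly into code. This is a minimal sketch: the function name `rrf_fuse` is hypothetical, and `k = 60` is an assumption (a commonly used default), since the post does not state which value Knovex uses.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to every item it contains."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

fts5_ranking = ["§4.2", "§1.1", "§3.1"]  # keyword (BM25) order
ann_ranking = ["§3.1", "§4.2", "§5.2"]   # semantic (vector) order

fused = rrf_fuse([fts5_ranking, ann_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Sections that appear in both lists rise to the top even when neither retriever ranked them first in both, and RRF needs only ranks, so the two retrievers' raw scores never have to be put on a common scale.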

Average query latency on a typical KB is still sub-millisecond for the FTS5 path and ~0.9s end-to-end including the LLM call on an M-series Mac.

Learn Mode: turning documents into learning sessions

This was the most fun feature to build. The idea: instead of just answering questions, the app can generate structured learning content from any document or topic.

Nine formats, all streaming via SSE:

Quiz — interactive MCQ with XP rewards per question
Flashcards — spaced repetition with interval scheduling
Mind Map — collapsible JSON tree rendered with D3
Timeline — chronological events extracted from the text
Guided — step-by-step walkthrough via GuidedViewer
Story — narrative markdown retelling of the content
ELI5 — explain like I'm five
Brainstorm — creative connections and lateral ideas
Speed Learn — bullet-point summary for fast review

The JSON formats (Quiz, Flashcards, Mind Map, Timeline) use a two-phase approach: LLM generates structured JSON → parse → re-stream the parsed results. Text formats (Story, ELI5, etc.) stream in real-time token by token.
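
The two-phase approach for the JSON formats might look roughly like this. The event names, the `questions` key, and the helper names are all hypothetical; the sketch only shows the buffer, parse, then re-stream shape described above.

```python
import json
from typing import Iterable, Iterator

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events message."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def restream_quiz(llm_chunks: Iterable[str]) -> Iterator[str]:
    # Phase 1: buffer the whole response, because partial JSON cannot be parsed safely.
    raw = "".join(llm_chunks)
    try:
        questions = json.loads(raw)["questions"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        yield sse_event("error", {"message": str(exc)})
        return
    # Phase 2: emit one parsed item per event so the UI can render incrementally.
    for number, question in enumerate(questions, start=1):
        yield sse_event("question", {"index": number, **question})
    yield sse_event("done", {"count": len(questions)})

# Simulated model output arriving in two fragments.
fake_chunks = [
    '{"questions": [{"q": "2 + 2?", ',
    '"options": ["3", "4"], "answer": 1}]}',
]
events = list(restream_quiz(fake_chunks))
print("".join(events))
```

The client never sees half-formed JSON: it either receives well-formed items or a single error event.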

Gamification

I added XP, level progression (10 tiers), daily streaks, and achievement badges. This was partly experimental — does adding game mechanics to a local productivity tool actually improve usage? Anecdotally yes: the streak counter creates a small daily habit pull.

The Progress Page (v0.8.0) shows:

26-week activity heatmap (sessions per day, colour-coded)
Learning velocity chart (sessions/week + active days/week dual-axis)
XP level with badge
Week-over-week session delta

Design patterns used throughout

Adapter pattern (anti-corruption layer)

Every third-party dependency sits behind a swappable interface:

# backend/adapters/llm_client.py
from typing import AsyncIterator, Protocol

class ILLMClient(Protocol):
    async def complete(self, messages: list[dict], **kwargs) -> str: ...
    async def stream(self, messages: list[dict], **kwargs) -> AsyncIterator[str]: ...

class LiteLLMAdapter(ILLMClient):
    """Wraps litellm — the only place litellm is imported"""
    ...

class StubLLMClient(ILLMClient):
    """Used in tests — zero network calls"""
    ...

Same pattern for: HTTP client (httpx), PDF parser (PyMuPDF / Docling), web search (DuckDuckGo / Serper / Brave), paragraph parser (python-docx).

This made testing painless — all 61 E2E tests mock at the adapter boundary.

Strategy + plugin registration for parsers

_PARSERS: dict[str, type[IFileParser]] = {}

def register_parser(ext: str):
    def decorator(cls):
        _PARSERS[ext] = cls
        return cls
    return decorator

@register_parser(".pdf")
class PDFParser(IFileParser): ...

@register_parser(".docx")
class DocxParser(IFileParser): ...

Adding a new file format means writing one class and adding one decorator. Zero changes to the orchestration layer.
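
For completeness, here is a self-contained sketch of how an orchestration layer could look up a parser from that registry. The `get_parser` helper, the `parse` signature, and the `.txt` parser are assumptions added for illustration; only the decorator registration comes from the snippet above.

```python
from pathlib import Path
from typing import Protocol

class IFileParser(Protocol):
    def parse(self, path: str) -> str: ...

_PARSERS: dict[str, type] = {}

def register_parser(ext: str):
    """Class decorator that records a parser class under a file extension."""
    def decorator(cls):
        _PARSERS[ext.lower()] = cls
        return cls
    return decorator

def get_parser(path: str) -> IFileParser:
    """Pick a parser by extension; the caller never names a concrete class."""
    ext = Path(path).suffix.lower()
    try:
        return _PARSERS[ext]()
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None

@register_parser(".txt")
class TextParser:
    def parse(self, path: str) -> str:
        return Path(path).read_text(encoding="utf-8")

print(type(get_parser("notes.TXT")).__name__)
```

The dispatch function only ever touches the dictionary, which is why a new format needs no change outside its own class.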

EventBus for decoupled notifications

# In-process typed EventBus — no external dependencies
bus = EventBus()

@dataclass
class FileIngested:
    file_id: str
    kb_id: str
    chunk_count: int

bus.emit_typed(FileIngested(file_id=..., kb_id=..., chunk_count=42))

The watcher service (which detects stale/missing files) communicates with the KB service through events rather than direct calls. This kept the service layer clean.
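
A minimal in-process bus along these lines fits in a dozen lines. This is a sketch under assumptions: `emit_typed` and `FileIngested` come from the snippet above, while `subscribe` and the dispatch-by-exact-type rule are hypothetical choices, not necessarily how Knovex implements it.

```python
from dataclasses import dataclass
from typing import Any, Callable

class EventBus:
    """Synchronous, in-process bus that dispatches on the event's exact type."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable[[Any], None]]] = {}

    def subscribe(self, event_type: type, handler: Callable[[Any], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def emit_typed(self, event: Any) -> None:
        # Copy the list so handlers may subscribe or unsubscribe while we iterate.
        for handler in list(self._handlers.get(type(event), ())):
            handler(event)

@dataclass
class FileIngested:
    file_id: str
    kb_id: str
    chunk_count: int

bus = EventBus()
seen: list[FileIngested] = []
bus.subscribe(FileIngested, seen.append)  # e.g. the KB service refreshing its stats
bus.emit_typed(FileIngested(file_id="f1", kb_id="kb1", chunk_count=42))
print(seen)
```

The emitter knows only the event class, never the subscriber, which is what keeps the watcher and the KB service from importing each other.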

Challenges worth noting

SQLite WAL mode + concurrent async writes — FastAPI runs async, and SQLite's WAL mode handles readers well but writers queue. I had to add retry logic with exponential backoff for the ingestion pipeline, which can run as a background task while chat is active.
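
The retry logic can be sketched as follows. This is a synchronous simplification of what an async backend would do, and the helper name, retry count, delays, and jitter are all assumed values rather than the ones Knovex uses.

```python
import random
import sqlite3
import time

def execute_with_retry(conn, sql, params=(), retries=5, base_delay=0.05, sleep=time.sleep):
    """Run one write, retrying with exponential backoff while the database is locked."""
    for attempt in range(retries + 1):
        try:
            with conn:  # commits on success, rolls back if the statement raises
                return conn.execute(sql, params)
        except sqlite3.OperationalError as exc:
            message = str(exc)
            contended = "locked" in message or "busy" in message
            if not contended or attempt == retries:
                raise  # a real error, or we ran out of attempts
            delay = base_delay * (2 ** attempt)           # 0.05, 0.1, 0.2, ...
            sleep(delay + random.uniform(0, delay / 2))   # jitter spreads out competing writers

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT)")
execute_with_retry(conn, "INSERT INTO chunks (body) VALUES (?)", ("hello",))
print(conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])
```

Passing `sleep` as a parameter is a small design choice that makes the backoff testable without real waiting.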

PyInstaller + Python 3.11 + ONNX — packaging the ONNX runtime into a PyInstaller binary was the most painful part of the v0.7.0 release. The model weights need to be bundled correctly, paths resolved at runtime via sys._MEIPASS. Worth documenting if you're going down this path.
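
For anyone going down that path, the runtime lookup usually reduces to something like the sketch below. The helper name, the `dev_root` parameter, and the model file path are hypothetical; `sys._MEIPASS` is the attribute PyInstaller sets in a bundled app to point at its unpacked files.

```python
import sys
from pathlib import Path
from typing import Optional

def resource_path(relative: str, dev_root: Optional[Path] = None) -> Path:
    """Resolve a bundled file both inside a PyInstaller build and in a source checkout."""
    bundle_dir = getattr(sys, "_MEIPASS", None)  # only present in a frozen build
    if bundle_dir:
        return Path(bundle_dir) / relative
    # Development fallback: an explicit project root, else the working directory.
    return (dev_root or Path.cwd()) / relative

# Hypothetical location of the bundled model weights.
model_path = resource_path("models/all-MiniLM-L6-v2.onnx")
print(model_path)
```

The model files themselves still have to be declared as data files in the PyInstaller build configuration; this helper only covers finding them at runtime.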

SSE streaming through Electron's IPC — Electron's fetch API handles SSE properly, but the preload script needed explicit keep-alive handling to prevent the renderer from killing long-running streams during Learn Mode generation (which can take 10–30 seconds for complex documents).

Windows SmartScreen — unsigned NSIS installers get flagged. Adding instructions to the download page for "More info → Run anyway" reduced support questions significantly.

What's next

Phase 2 of Knovex moves toward cloud + organisation features:

Cloud Portal — web admin for org key management and user management
3 deployment modes — Personal (own keys) / Organisation (portal-managed) / Self-hosted (Docker)
LangGraph agent orchestration — beyond single-turn Q&A
Visual workflow builder — chain operations on your KB
Mobile app — React Native, same backend API
Plugin/connector marketplace — Notion, Confluence, GitHub, etc.

Try it

App: tailorgunjan93.github.io/knovex — free one-click installer for Windows, macOS, Linux

GitHub: github.com/tailorgunjan93/knovex

RAG engine: pip install docnest-ai

MIT licensed. v0.10.0 is stable with 61 E2E tests passing.

Happy to answer questions about any part of the stack in the comments.

I was embarrassed by my RAG demo. Turns out the bug was never in my code.

Gunjan Tailor — Thu, 21 May 2026 17:08:33 +0000

I showed my RAG app to a friend.

He asked: "which region grew the most last quarter?"

It said Europe. The answer was Asia. By a lot.

I spent two days debugging embeddings, chunk sizes, temperature settings.
The bug was none of those things.

The table had been turned into this:

"45.2% Q3 Europe 38.1% Q2 Asia 41.7%..."

Numbers with no headers. No caption. No context.
The LLM wasn't hallucinating. It was working with garbage.

🛠️ So I built the thing I wished existed
Meet DocNest — not another chunker.
A document normalization engine that reads structure before touching content.

Every heading → a navigable §section with its own ID
Every table → preserved as { caption, headers, rows[] } JSON
Every section → one-sentence LLM summary + BM25 keyword index
All of it → packed into a portable .udf file

from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex

# Convert — runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",           # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",    # local, no API key needed
)
pipeline.convert("report.pdf")    # → report.udf ✓

# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")

print(result.answer)       # "Asia grew the most, up +12.4pp"
print(result.layer_used)   # 1
print(result.tokens_used)  # 0  ← yes, really. zero.

✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.

⚡ The 5-layer query engine
Instead of dumping the full document into an LLM, queries escalate through layers, stopping the moment one can answer confidently.

Layer	What it does	Tokens	Speed
0	Pre-computed summary + key numbers	0	< 1ms
1	BM25 + cosine → lands on exact §section	0	< 20ms
2	Section-scoped LLM call	~300	1–3s
3	Multi-section synthesis	~900	2–5s
4	Full document fallback	~4000+	5–15s

I expected layers 2–4 to do most of the work.

🤯 Layers 0 and 1 handle roughly 70% of real-world questions — at zero token cost.
Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.

📊 Real numbers. Not vibes.
25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.

Question type	Score
Basic facts (calories, macros)	✅ 5/5
Detailed nutrition (fiber, glycemic index)	✅ 5/5
Micronutrients (vitamins, minerals)	✅ 4/5
Hard synthesis (BMR, omega-3, antioxidants)	✅ 5/5
Edge cases + hallucination traps	✅ 5/5
Total	24/25 (96%)

The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.

🧠 Handles 600-page PDFs without exploding your RAM
Standard Docling loads the full document into memory. 600 pages on a normal laptop = 💀 out of memory.
DocNest chunks automatically, processes each at full ML quality, merges the output. Peak RAM stays constant regardless of document size.

from docnest.parsers.pdf import DoclingPDFParser

# Just works — auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # 💻 low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # 🚀 speed mode

🚀 Try it

pip install docnest-ai

Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown

LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere

Vector backends: numpy (zero deps) · FAISS · ChromaDB

# CLI: because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf     # structured HTML viewer in browser

GitHub repo (star it if this solved a problem you've had):

tailorgunjan93 / docnest

The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.

DOCNEST

Secure · Fast · Reliable · Cost-Effective

The document normalization engine RAG has always needed.

The Problem with RAG Today

Every RAG pipeline ingests documents the same broken way:

PDF → extract text → split every 512 chars → embed → store → hope

What gets silently destroyed:

Source	What blind chunking loses
Financial report	Table row `45.2% \| Q3 \| Europe` has no column headers
Legal contract	Clause split mid-sentence across two chunks
API documentation	Code example separated from its description
Research paper	Figure caption disconnected from its analysis

The LLM receives noise and returns approximate answers. This is not a retrieval problem — it is an ingestion problem.

See the difference

Take a financial report with a revenue table. Here is what each approach…

PyPI: https://pypi.org/project/docnest-ai

Format spec: https://github.com/tailorgunjan93/udf-spec

My RAG app confidently told my client the wrong answer. I spent 3 days debugging the wrong thing.

Gunjan Tailor — Mon, 18 May 2026 13:35:15 +0000

Picture this.

It's a client demo. They're watching. I type:

"Which region had the highest revenue growth last quarter?"

My RAG app — three weeks of work, carefully tuned embeddings, clever prompts — responds instantly.

The client nods. Writes it down.

The answer was wrong. By almost double.

I spent three days debugging the wrong things.

Chunk size? Tried 256, 512, 1024. Nothing.
Temperature? 0.0, 0.3, 0.7. Still wrong.
Embeddings model? Swapped three of them. Nope.
Prompt engineering? Added "think step by step", "be precise", "do not hallucinate". 😭

The LLM wasn't hallucinating. It was doing its best with this:

"45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3  Asia   29.3%"

Orphaned numbers. No column headers. No caption. No context.

The original table had all of that. My chunker ate it silently.

⚠️ The bug was never in retrieval. It was in ingestion. And I never thought to look there.

🔥 The dirty secret of RAG tutorials

Every tutorial shows you this pipeline:

PDF → extract text → chunk at 512 tokens → embed → store → retrieve → answer

Clean. Simple. Completely wrong for structured documents.

Here's what blind chunking silently destroys:

Document	What you had	What the LLM gets
Financial report	Revenue table with headers	Orphaned numbers, zero context
Legal contract	3-page clause	Split mid-sentence, both halves useless
API docs	Function + code example	Code separated from its description
Research paper	Figure with caption	Caption on chunk 7, analysis on chunk 12

🗑️ You're feeding the LLM garbage and expecting gold. The model isn't dumb — it's working with broken input.

🛠️ So I built the thing I wished existed

Meet DocNest — not another chunker.

A document normalization engine that reads structure before touching content.

Every heading → a navigable §section with its own ID
Every table → preserved as { caption, headers, rows[] } JSON
Every section → one-sentence LLM summary + BM25 keyword index
All of it → packed into a portable .udf file

from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex

# Convert — runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",           # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",    # local, no API key needed
)
pipeline.convert("report.pdf")    # → report.udf ✓

# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")

print(result.answer)       # "Asia grew the most, up +12.4pp"
print(result.layer_used)   # 1
print(result.tokens_used)  # 0  ← yes, really. zero.

✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.

⚡ The 5-layer query engine

Instead of dumping the full document into an LLM, queries escalate through layers — stopping the moment one can answer confidently.

Layer	What it does	Tokens	Speed
0	Pre-computed summary + key numbers	0	< 1ms
1	BM25 + cosine → lands on exact §section	0	< 20ms
2	Section-scoped LLM call	~300	1–3s
3	Multi-section synthesis	~900	2–5s
4	Full document fallback	~4000+	5–15s

I expected layers 2–4 to do most of the work.

🤯 Layers 0 and 1 handle roughly 70% of real-world questions — at zero token cost.

Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.

📊 Real numbers. Not vibes.

25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.

Question type	Score
Basic facts (calories, macros)	✅ 5/5
Detailed nutrition (fiber, glycemic index)	✅ 5/5
Micronutrients (vitamins, minerals)	✅ 4/5
Hard synthesis (BMR, omega-3, antioxidants)	✅ 5/5
Edge cases + hallucination traps	✅ 5/5
Total	24/25 — 96%

The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.

🧠 Handles 600-page PDFs without exploding your RAM

Standard Docling loads the full document into memory. 600 pages on a normal laptop = 💀 out of memory.

DocNest chunks automatically, processes each at full ML quality, merges the output. Peak RAM stays constant regardless of document size.

from docnest.parsers.pdf import DoclingPDFParser

# Just works — auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # 💻 low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # 🚀 speed mode
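
The batching idea behind `chunk_pages` can be sketched independently of any parser. The function names and the `parse_range` callback are hypothetical, and this is an illustration of the general technique rather than DocNest's internal code.

```python
from typing import Callable, Iterator

def page_batches(total_pages: int, chunk_pages: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive, 1-based (first, last) page ranges of at most chunk_pages pages."""
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    for first in range(1, total_pages + 1, chunk_pages):
        yield first, min(first + chunk_pages - 1, total_pages)

def parse_in_batches(total_pages: int, chunk_pages: int,
                     parse_range: Callable[[int, int], list[str]]) -> list[str]:
    """Parse one page range at a time and merge the extracted sections in order."""
    sections: list[str] = []
    for first, last in page_batches(total_pages, chunk_pages):
        # Only this batch's pages are loaded by the parser at any moment.
        sections.extend(parse_range(first, last))
    return sections

batches = list(page_batches(600, 50))
print(len(batches), batches[0], batches[-1])

# A stand-in parser that just labels each range.
merged = parse_in_batches(25, 10, lambda first, last: [f"pages {first}-{last}"])
print(merged)
```

The parser's working set is bounded by the batch size rather than the document length; the merged, structured output is far smaller than the rendered pages, so it can be accumulated safely.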

🚀 Try it

pip install docnest-ai

Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown

LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere

Vector backends: numpy (zero deps) · FAISS · ChromaDB

# CLI — because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf     # structured HTML viewer in browser

GitHub repo — star it if this solved a problem you've had:

tailorgunjan93 / docnest

The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.

DOCNEST

The document normalization engine RAG has always needed.

Parse any document. Understand its structure. Build RAG that actually works.

The Problem with RAG Today

Every RAG pipeline ingests documents the same broken way:

PDF → extract text → split every 512 chars → embed → store → hope

What gets silently destroyed:

Source	What blind chunking loses
Financial report	Table row `45.2% \| Q3 \| Europe` has no column headers
Legal contract	Clause split mid-sentence across two chunks
API documentation	Code example separated from its description
Research paper	Figure caption disconnected from its analysis

The LLM receives noise and returns approximate answers. This is not a retrieval problem — it is an ingestion problem.

See the difference

Take a financial report with a revenue table…

PyPI: https://pypi.org/project/docnest-ai
Format spec: https://github.com/tailorgunjan93/udf-spec

🔨 Honesty tax

🚧 This is 0.4.0a2, an alpha release. It works on real documents, but the PPTX parser isn't built yet, Qdrant/Weaviate backends are on the roadmap, and SharePoint/Confluence connectors are planned.

If any of those sound like something you want to build — good first issues are labeled and waiting.

💬 One question for you

Most RAG infrastructure assumes text extraction is a solved problem.

It isn't. Not for tables. Not for anything where position and relationship carry meaning.

💬 What document type has caused you the most RAG pain?

For me it was financial tables. Drop it in the comments — if it's a format DocNest doesn't handle yet, that's probably the next parser I build.

Building in the open at github.com/tailorgunjan93/docnest. Stars, issues, and brutal feedback all welcome. 🙏