DEV Community

Gunjan Tailor


Your .NET RAG stack hides a Python sidecar. I built the engine that removes it.

TL;DR — Every .NET RAG project quietly ships a Python sidecar to do one job: chunk documents. I got rid of mine. DocNest .NET is an idiomatic C# / .NET 8 port of my DocNest engine — embeddings run locally (ONNX MiniLM, no key, offline), the LLM is optional (factual questions answered at zero tokens), and the .udf knowledge base it writes is byte-compatible with the Python version. Ingest in Python, query in C#. It's on NuGet today. Repo · NuGet.


The compromise every .NET dev quietly accepts

You're building on .NET. The product needs to answer questions over a pile of PDFs, contracts, spreadsheets — real retrieval-augmented generation. So you go looking for tooling, and you find the same thing I did:

It's all Python.

LangChain, LlamaIndex, every RAG tutorial worth reading — Python, Python, Python. So you do the thing nobody admits to in the architecture review: you stand up a little Python service on the side. A second runtime to containerize, deploy, version, monitor, and wake up to at 3 a.m. when it OOMs. All so it can split a document into chunks and hand them back to your actual app.

A whole extra language in production to chop up a PDF. I stared at that diagram one too many times and decided it had to go.

So I ported DocNest to C#. Not a wrapper shelling out to python.exe — a real, idiomatic .NET port. async/await end to end, every dependency behind an interface, shipped as proper NuGet packages. Nothing Python left in the runtime.

But to explain why DocNest is worth porting, I have to tell you about the bug that started the whole thing.

The 3-day bug that started all of this

A RAG app I'd built gave a client a confidently wrong number. Not "I don't know" — a clean, specific, wrong answer, delivered with total confidence. I spent three days assuming my retrieval ranking was off, tuning embeddings and k values and similarity thresholds.

The ranking was fine. The problem happened before any of that — at ingestion. Here's how almost every pipeline reads a document:

```
PDF → extract text → split every 512 chars → embed → store → hope
```

Watch what that does to a revenue table:

```
chunk_1: "45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3"
chunk_2: "Asia   29.3%  Q2  Asia  Americas  52.1%  Q3  Ame"
```

The headers are gone. The rows are shredded across a chunk boundary. The model receives a bag of loose numbers with no idea which is revenue, which is a quarter, which region they belong to — and fills the gap with a confident guess. That's not a model problem or a retrieval problem. It's an ingestion problem. You destroyed the meaning before the model ever saw the data.
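To make the failure concrete, here is a minimal C# sketch of the naive "split every N characters" step. `ChunkFixed` and the sample table are illustrative, not DocNest code, and a 40-character window stands in for 512 so the tear is visible on a three-row table.

```csharp
using System;
using System.Collections.Generic;

// Illustrative naive ingestion step (not DocNest code): cut extracted text every
// `size` characters, ignoring headings, table rows and sentence boundaries.
static List<string> ChunkFixed(string text, int size)
{
    var pieces = new List<string>();
    for (int start = 0; start < text.Length; start += size)
        pieces.Add(text.Substring(start, Math.Min(size, text.Length - start)));
    return pieces;
}

// A small table as plain-text extraction might emit it: one line per row.
string extracted =
    "Region  Q2     Q3     Change\n" +
    "Europe  38.1%  45.2%  +7.1pp\n" +
    "Asia    29.3%  41.7%  +12.4pp\n";

// 40 characters stands in for 512 so the tear shows up on a three-row table;
// a 512-character window does the same whenever a boundary lands in a table.
var chunks = ChunkFixed(extracted, 40);
for (int i = 0; i < chunks.Count; i++)
    Console.WriteLine($"chunk_{i + 1}: {chunks[i].Replace("\n", " | ")}");
```

The first boundary falls inside the Europe row, the second and third chunks carry Asia's numbers with no header row, and the `+12.4pp` change ends up alone in the last chunk.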

Reading the document like a human would

A person doesn't read a report as one long character stream. They see headings, sections, a table with columns. DocNest does the same: it reads the document's structure first. Every heading becomes a navigable §section. Every table is preserved as structured data — never flattened:

```json
{
  "section": "§4.2 Revenue by Region",
  "table": {
    "headers": ["Region", "Q2", "Q3", "Change"],
    "rows": [
      ["Europe", "38.1%", "45.2%", "+7.1pp"],
      ["Asia",   "29.3%", "41.7%", "+12.4pp"]
    ]
  }
}
```
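A minimal sketch of heading-first sectioning, assuming Markdown input with pipe tables. `SplitByHeadings` and `ParseTable` are hypothetical helpers, not DocNest APIs; DocNest's real parsers have to recover the same structure from much messier sources such as PDF and DOCX.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: split Markdown into sections at ATX headings ("#", "##").
// Every line under a heading, including every row of a pipe table, stays in that
// section, so no table is cut in half by a character limit.
static List<(string Heading, List<string> Lines)> SplitByHeadings(string markdown)
{
    var found = new List<(string Heading, List<string> Lines)>();
    var current = (Heading: "(preamble)", Lines: new List<string>());
    foreach (var rawLine in markdown.Split('\n'))
    {
        var line = rawLine.TrimEnd('\r');
        if (line.StartsWith("#"))
        {
            // Close the previous section unless it is an empty preamble.
            if (current.Lines.Count > 0 || current.Heading != "(preamble)")
                found.Add(current);
            current = (line.TrimStart('#').Trim(), new List<string>());
        }
        else if (line.Trim().Length > 0)
        {
            current.Lines.Add(line);
        }
    }
    found.Add(current);
    return found;
}

// Hypothetical helper: turn a section's pipe-table lines into headers + rows,
// dropping the "|---|---|" separator line.
static (string[] Headers, List<string[]> Rows) ParseTable(IEnumerable<string> lines)
{
    var cells = lines
        .Where(l => l.TrimStart().StartsWith("|"))
        .Select(l => l.Trim().Trim('|').Split('|').Select(c => c.Trim()).ToArray())
        .Where(r => !r.All(c => c.Length > 0 && c.All(ch => ch == '-' || ch == ':')))
        .ToList();
    return cells.Count == 0
        ? (Array.Empty<string>(), new List<string[]>())
        : (cells[0], cells.Skip(1).ToList());
}

var report = string.Join("\n",
    "# 4 Results",
    "Revenue grew in every region.",
    "## 4.2 Revenue by Region",
    "| Region | Q2 | Q3 | Change |",
    "|---|---|---|---|",
    "| Europe | 38.1% | 45.2% | +7.1pp |",
    "| Asia | 29.3% | 41.7% | +12.4pp |");

var sections = SplitByHeadings(report);
var revenue = sections.Single(s => s.Heading == "4.2 Revenue by Region");
var table = ParseTable(revenue.Lines);
Console.WriteLine($"§{revenue.Heading}: [{string.Join(", ", table.Headers)}], {table.Rows.Count} rows");
```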

Same numbers, same model, same question — but now the answer is right, and it comes with a citation. The document is normalised once into a portable .udf file: a self-contained ZIP holding the section index, key numbers, keywords, section text, and quantised embeddings. Parse once, query forever.
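The post does not say which quantisation scheme .udf uses. As an illustration only, here is a common symmetric int8 scheme: one float scale per vector and one signed byte per component, which stores a float32 embedding in roughly a quarter of the space. `Quantize` and `Dequantize` are hypothetical helpers.

```csharp
using System;
using System.Linq;

// Illustrative symmetric int8 quantisation (not necessarily what .udf uses):
// one float scale per vector, [-maxAbs, +maxAbs] mapped onto [-127, 127].
static (sbyte[] Codes, float Scale) Quantize(float[] vector)
{
    float maxAbs = vector.Max(x => Math.Abs(x));
    float step = maxAbs == 0f ? 1f : maxAbs / 127f;
    var quantized = vector.Select(x => (sbyte)Math.Round(x / step)).ToArray();
    return (quantized, step);
}

// Approximate reconstruction: each code times the shared scale.
static float[] Dequantize(sbyte[] codes, float scale) =>
    codes.Select(c => c * scale).ToArray();

// Toy 5-dimensional vector; common MiniLM sentence-embedding models output 384 dimensions.
float[] embedding = { 0.12f, -0.48f, 0.33f, 0.05f, -0.91f };
var (int8Codes, scaleFactor) = Quantize(embedding);
float[] restored = Dequantize(int8Codes, scaleFactor);
float maxError = embedding.Zip(restored, (a, b) => Math.Abs(a - b)).Max();
Console.WriteLine($"codes [{string.Join(", ", int8Codes)}], scale {scaleFactor:G4}, max error {maxError:G3}");
```

Each component is off by at most half a quantisation step, which is usually small enough not to change which sections rank highest.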

The trick that makes this a real port, not a rewrite

Here's the part I'm proud of. The .udf format is an open spec, and the .NET writer produces files that are byte-compatible with the Python engine. That one constraint unlocks something genuinely useful:

  • Ingest a 600-page annual report in a nightly Python batch job, then
  • Ship the small .udf to your C# desktop app or ASP.NET service and query it offline — with no Python anywhere in the runtime.

One ingestion ecosystem, two languages, the same artifact moving between them. Nothing in the codebase is allowed to break that cross-ecosystem contract — it's the whole point.
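The post doesn't describe how the .NET and Python writers stay in lockstep. One prerequisite any byte-level contract needs is determinism: writing the same document twice must yield identical bytes. A minimal sketch with `System.IO.Compression`, assuming hypothetical entry names (`sections.json`, `keywords.json`, not the real .udf layout): sort the entries, pin the timestamp, fix the compression level. Matching a second implementation byte for byte is stricter still, since both writers must also agree on ZIP headers and compressed output.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Write named text entries into a ZIP canonically: ordinal-sorted names, one pinned
// timestamp, one compression level, UTF-8 without BOM. With the same input and the
// same runtime, every run produces the same bytes.
static byte[] WriteCanonicalZip(IReadOnlyDictionary<string, string> contents)
{
    var pinnedTime = new DateTimeOffset(2000, 1, 1, 0, 0, 0, TimeSpan.Zero);
    using var buffer = new MemoryStream();
    using (var zip = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
    {
        foreach (var name in contents.Keys.OrderBy(n => n, StringComparer.Ordinal))
        {
            var entry = zip.CreateEntry(name, CompressionLevel.Optimal);
            entry.LastWriteTime = pinnedTime; // must be set before the entry is opened
            using var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false));
            writer.Write(contents[name]);
        }
    } // disposing the archive writes the central directory into `buffer`
    return buffer.ToArray();
}

// Hypothetical entry names and payloads; the real .udf layout comes from its spec.
var files = new Dictionary<string, string>
{
    ["sections.json"] = "{\"§4.2\": \"Revenue by Region\"}",
    ["keywords.json"] = "[\"revenue\", \"region\"]",
};

byte[] first = WriteCanonicalZip(files);
byte[] second = WriteCanonicalZip(files);
Console.WriteLine($"{first.Length} bytes, identical across runs: {first.SequenceEqual(second)}");
```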

Two knobs people always confuse

When I describe this, two questions come back every time. They're actually two independent choices:

1. Embeddings run locally. A small ONNX MiniLM model (~90 MB) downloads once and caches. No API key, fully offline. There's an optional ONNX cross-encoder reranker for dense PDFs.

2. The LLM is optional. Answer Layers 0–1 resolve factual questions deterministically — zero tokens, no key. You only bring an LLM for synthesis, and when you do, "OpenAI" means the answer model, not embeddings. The two never get coupled.
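Because the embedder runs locally, the dense half of retrieval is plain vector arithmetic with no network call. A minimal sketch of ranking sections by cosine similarity; the 3-dimensional vectors are made-up stand-ins for real embeddings, and `Cosine` and `CosineTopK` are hypothetical helpers, not DocNest's retriever.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity: dot product divided by the product of the vector norms.
static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return (na == 0 || nb == 0) ? 0 : dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Hypothetical helper: rank section ids by similarity to the query, keep the best k.
static List<(string Section, double Score)> CosineTopK(
    float[] query, IReadOnlyDictionary<string, float[]> sectionVectors, int k) =>
    sectionVectors
        .Select(s => (Section: s.Key, Score: Cosine(query, s.Value)))
        .OrderByDescending(s => s.Score)
        .Take(k)
        .ToList();

// Toy vectors standing in for locally computed embeddings.
var stored = new Dictionary<string, float[]>
{
    ["§3.1 Revenue"] = new[] { 0.9f, 0.1f, 0.0f },
    ["§5.2 Risks"]   = new[] { 0.1f, 0.9f, 0.2f },
    ["§4.2 Regions"] = new[] { 0.7f, 0.2f, 0.3f },
};
float[] queryVector = { 1.0f, 0.0f, 0.1f };

foreach (var (section, score) in CosineTopK(queryVector, stored, k: 2))
    Console.WriteLine($"{section}: {score:F3}");
```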

Try it in 60 seconds — no key, no internet

```bash
dotnet add package DocNest.Core
dotnet add package DocNest.Parsers
dotnet add package DocNest.Retrieval
dotnet add package DocNest.Query
```

```csharp
using DocNest;
using DocNest.Parsers;
using DocNest.Pipeline;
using DocNest.Query;
using DocNest.Retrieval;
using DocNest.Udf;

// Parse → normalise → write a portable .udf
var raw = await new ParserFactory().Get("report.pdf").ParseAsync("report.pdf");
var doc = new DocNestPipeline().Process(raw);
await new UdfWriter().WriteAsync(doc, "report.udf");

// Load it back and ask — deterministic layers, no LLM
var document = (await UdfReader.LoadAsync("report.udf")).ToDocument();

using var retriever = new HybridRetriever(".docnest_cache");
var engine = new DocNestQueryEngine(retriever);   // no LLM → Layers 0–1 only
var result = await engine.AnswerAsync(document, "What was Q3 revenue?", allowLlm: false);

Console.WriteLine(result.Answer);     // "Q3 revenue: $38M (source: §3.1)"
Console.WriteLine(result.TokensUsed); // 0
```

Prefer the terminal?

```bash
dotnet tool install -g DocNest.Cli
docnest convert report.pdf -o report.udf
docnest query report.udf "What was Q3 revenue?"
```

When you actually need an LLM

OpenAiCompatibleLlmProvider talks to OpenAI, Groq, Cerebras, Together, OpenRouter and local servers (Ollama, LM Studio) — change the base URL and model. Anthropic has its own provider.

```csharp
// Continues the quick start above: reuses `retriever` and `document`.
ILlmProvider llm = new OpenAiCompatibleLlmProvider(
    apiKey:  Environment.GetEnvironmentVariable("GROQ_API_KEY")!,
    model:   "llama-3.3-70b-versatile",
    baseUrl: "https://api.groq.com/openai/v1");

var llmEngine = new DocNestQueryEngine(retriever, llm);
var summary = await llmEngine.AnswerAsync(document, "Summarise the key risks.", allowLlm: true);
Console.WriteLine(string.Join(", ", summary.Citations));  // e.g. §5.2, §5.3
```
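"Change the base URL and model" works because these servers accept the same OpenAI-style chat-completions request. A minimal sketch of that request, built with `System.Text.Json` and `HttpRequestMessage` but never sent. This is the public wire format, not necessarily how `OpenAiCompatibleLlmProvider` is written internally; `BuildChatRequest`, the placeholder keys and the `llama3.2` Ollama model tag are illustrative assumptions.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

// Hypothetical helper: build (but do not send) an OpenAI-style chat-completions
// request. Only the base URL, model and key differ between compatible providers.
static HttpRequestMessage BuildChatRequest(string baseUrl, string model, string apiKey, string prompt)
{
    var body = JsonSerializer.Serialize(new
    {
        model,
        messages = new[] { new { role = "user", content = prompt } },
    });
    var request = new HttpRequestMessage(HttpMethod.Post, baseUrl.TrimEnd('/') + "/chat/completions")
    {
        Content = new StringContent(body, Encoding.UTF8, "application/json"),
    };
    request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
    return request;
}

// Same request shape, two providers: hosted Groq and a local Ollama server.
var groq = BuildChatRequest("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile",
    "placeholder-key", "Summarise the key risks.");
var ollama = BuildChatRequest("http://localhost:11434/v1", "llama3.2",
    "ollama", "Summarise the key risks.");

Console.WriteLine(groq.RequestUri);
Console.WriteLine(ollama.RequestUri);
```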

Under the hood: five layers, escalate only when needed

```
file  → IParser → DocNestPipeline (normalise · key-numbers · keywords) → Document → .udf
query → HybridRetriever (BM25 + dense + cross-encoder rerank + RRF + 1-hop graph) → top-k
      → DocNestQueryEngine (5 layers) → answer + citations + tokens + confidence
```
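The retrieval line above fuses several rankers with RRF. A minimal sketch of Reciprocal Rank Fusion over two toy rankings (keyword and dense). `ReciprocalRankFusion` is a hypothetical helper, and the post doesn't say where the cross-encoder and the graph hop sit relative to the fusion, so they are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal Rank Fusion: every ranked list adds 1 / (k + rank) to each item it
// contains (rank is 1-based). Only ranks are combined, so BM25 scores and cosine
// similarities never have to share a scale. k = 60 is the constant commonly used
// in the literature; DocNest's value is not stated in the post.
static List<(string Id, double Score)> ReciprocalRankFusion(
    IEnumerable<IReadOnlyList<string>> rankings, int k = 60)
{
    var totals = new Dictionary<string, double>();
    foreach (var ranking in rankings)
    {
        for (int rank = 1; rank <= ranking.Count; rank++)
        {
            var item = ranking[rank - 1];
            totals[item] = totals.GetValueOrDefault(item) + 1.0 / (k + rank);
        }
    }
    return totals
        .OrderByDescending(kv => kv.Value)
        .ThenBy(kv => kv.Key, StringComparer.Ordinal)
        .Select(kv => (Id: kv.Key, Score: kv.Value))
        .ToList();
}

// Toy rankings of section ids from two retrievers.
string[] bm25Ranking = { "§3.1", "§7.0", "§4.2" };
string[] denseRanking = { "§4.2", "§3.1", "§5.2" };

var fused = ReciprocalRankFusion(new IReadOnlyList<string>[] { bm25Ranking, denseRanking });
foreach (var (sectionId, fusedScore) in fused)
    Console.WriteLine($"{sectionId}: {fusedScore:F4}");
```

§3.1 wins because both retrievers rank it near the top, even though neither puts it first by a wide margin.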
The answer engine's five layers:

| Layer | Mechanism | Tokens |
|---|---|---|
| 0 | Pre-computed key-numbers / summary | 0 |
| 1 | Extractive from the top section | 0 |
| 2 | Single-section LLM | ~300 |
| 3 | Multi-section synthesis (reranked context) | ~900 |
| 4 | Broad fallback over retrieved sections | ~1,500 |

The engine climbs this ladder only when a cheaper rung isn't confident. Layers 0–1 handle a surprising share of real factual questions at zero cost — you pay tokens only for genuine synthesis.
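A minimal sketch of that escalation. The layer stubs, the 0.7 confidence threshold and the rule that failed attempts still count their tokens are assumptions for illustration; the post doesn't say how DocNest scores confidence. `Escalate` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical escalation loop: try layers cheapest-first, accumulate the tokens
// each attempt spends, and stop at the first answer that clears the threshold.
static (string Answer, int Layer, int TokensUsed) Escalate(
    IReadOnlyList<Func<string, (string Answer, double Confidence, int Tokens)>> layers,
    string question,
    double threshold)
{
    int spent = 0;
    for (int layer = 0; layer < layers.Count; layer++)
    {
        var (answer, confidence, tokens) = layers[layer](question);
        spent += tokens;
        if (answer.Length > 0 && confidence >= threshold)
            return (answer, layer, spent);
    }
    return ("", -1, spent); // no layer was confident enough
}

// Stub layers standing in for the real ones: 0 = key-number lookup,
// 1 = extractive (not confident here), 2 = single-section LLM (~300 tokens).
var ladder = new List<Func<string, (string Answer, double Confidence, int Tokens)>>
{
    q => q.Contains("Q3 revenue", StringComparison.OrdinalIgnoreCase)
        ? ("Q3 revenue: $38M (source: §3.1)", 1.0, 0)
        : ("", 0.0, 0),
    q => ("", 0.2, 0),
    q => ($"LLM answer for: {q}", 0.8, 300),
};

var factual = Escalate(ladder, "What was Q3 revenue?", threshold: 0.7);
var synthesis = Escalate(ladder, "Summarise the key risks.", threshold: 0.7);
Console.WriteLine($"layer {factual.Layer}, {factual.TokensUsed} tokens: {factual.Answer}");
Console.WriteLine($"layer {synthesis.Layer}, {synthesis.TokensUsed} tokens: {synthesis.Answer}");
```

The factual question stops at layer 0 for zero tokens; only the synthesis question reaches the paid rung.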

The benchmark I almost didn't publish

A multi-format eval — 10 documents, 88 questions, 5 formats (the same set as the Python reference), dense + cross-encoder rerank, gpt-oss-120b narrator, qwen2.5 judge:

| Format | Score | Hit-rate (score ≥ 7) |
|---|---|---|
| XLSX | 8.7 / 10 | 93% |
| MD | 8.7 / 10 | 100% |
| DOCX | 7.0 / 10 | 79% |
| HTML | 4.8 / 10 | 50% |
| PDF | 6.8 / 10 | 70% |
| Overall | ~7.1 / 10 | ~78% |

The Python reference sits at 8.5/10. This .NET port is at 7.1 and closing the gap slice by slice — the cross-encoder reranker alone dragged PDFs from 5.1 → 6.8 (hit-rate 47% → 70%). HTML is clearly my weakest format right now, and it's the next thing I'm fixing.
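The two columns measure different things. Reading the `≥7` header as "share of questions the judge scored 7 or higher", a minimal sketch of both metrics on made-up judge scores (illustration only, not data from this eval). `Summarize` is a hypothetical helper, and how the harness aggregates the Overall row isn't described here.

```csharp
using System;
using System.Linq;

// Mean judge score, and hit-rate read as the share of answers scored >= threshold.
static (double Mean, double HitRate) Summarize(int[] judgeScores, int threshold = 7) =>
    (judgeScores.Average(), judgeScores.Count(s => s >= threshold) / (double)judgeScores.Length);

// Made-up judge scores for eight questions (illustration only, not eval data).
int[] scores = { 9, 9, 2, 10, 7, 3, 8, 6 };
var (mean, hitRate) = Summarize(scores);
Console.WriteLine($"Score {mean:F2} / 10, hit-rate {hitRate:P1}");
```

Two very low scores pull the mean below 7 while five of the eight answers still clear the bar, which is why reporting both columns is more informative than either alone.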

I could have cherry-picked a kinder run and quoted a bigger number. I'd rather ship the reproducible one with the eval harness sitting right next to it in the repo. If you don't trust a benchmark you can't re-run, neither do I.

What ships in the box

| Package | Role |
|---|---|
| DocNest.Abstractions | Domain records + wrapper interfaces |
| DocNest.Core | Pipeline, normaliser, .udf reader/writer, quantizer |
| DocNest.Parsers | md / html / csv / docx / xlsx / pdf |
| DocNest.Embeddings | ONNX MiniLM embedder + ms-marco cross-encoder reranker |
| DocNest.Retrieval | Hybrid retriever (FTS5 BM25 + dense + rerank + RRF + graph) |
| DocNest.Query | 5-layer answer engine + LLM providers |
| DocNest.Storage | .udf ZIP storage backend |
| DocNest.Cli | `docnest` dotnet tool |

Parsers cover PDF (PdfPig), DOCX/XLSX (OpenXML), HTML (AngleSharp), CSV/TSV and Markdown. Every external dependency lives behind a DocNest interface, so swapping any of them is a one-line change.

Where it stands — honestly

This is pre-1.0, built slice-by-slice under a gated protocol: understand → plan → design + ADR → tests-first → full suite green → sign-off, per phase. The core pipeline, hybrid retrieval, cross-encoder reranking and the 5-layer engine are implemented and tested. Cloud embedding providers (OpenAI embeddings and friends) exist in the Python engine but aren't ported yet — embeddings here are local-only by design.

Try it

```bash
dotnet add package DocNest.Core
# or
dotnet tool install -g DocNest.Cli
```

If you've ever stood up a Python sidecar just to chunk a PDF for a .NET app, I'd genuinely like to know whether this kills that step for you — tell me in the comments. And if it does, a star on the repo helps other .NET folks find it.

Secure · Fast · Reliable · Cost-Effective
