DEV Community

Ian Cowley

Why Naive RAG is Dead: I built a zero-dependency C# Semantic Tree Parser

If you have built an AI Agent or a RAG (Retrieval-Augmented Generation) pipeline in the last year, you’ve almost certainly run into the exact same problem: Hallucinations caused by Naive Chunking.

The standard industry advice for feeding documents to an AI is to take a 50-page API manual, blindly chop it up into 500-token chunks, vectorize them, and throw them into a database.

This completely destroys the document's structure.

If you have a paragraph that says, "If initiated after 30 days, a 15% fee will be deducted", and it gets chunked away from its ## Enterprise Cancellations header, the AI has no idea who that rule applies to. When a user asks about Free Tier cancellations, the Vector DB might return that enterprise paragraph just because its embedding sits close to the query. Boom. Hallucination.
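To make the failure concrete, here is a minimal sketch of fixed-size chunking. It splits by characters rather than tokens to stay dependency-free, and the `NaiveChunker` and `NaiveChunkingDemo` names, along with the filler sentence about billing cycles, are made up for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class NaiveChunker
{
    // Fixed-size splitting with no awareness of headers or sentences.
    public static List<string> Chunk(string text, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        var chunks = new List<string>();
        for (int i = 0; i < text.Length; i += size)
            chunks.Add(text.Substring(i, Math.Min(size, text.Length - i)));
        return chunks;
    }
}

public static class NaiveChunkingDemo
{
    // The second line is invented filler; the fee rule is the one quoted above.
    public const string Doc =
        "## Enterprise Cancellations\n" +
        "Cancellations are processed within one billing cycle. " +
        "If initiated after 30 days, a 15% fee will be deducted.";

    // Returns the chunk holding the fee rule, which is what a retriever would hand to the LLM.
    public static string FeeChunk(int size) =>
        NaiveChunker.Chunk(Doc, size).Find(c => c.Contains("15% fee")) ?? "";
}
```

With a chunk size of 80, `NaiveChunkingDemo.FeeChunk(80)` comes back without the word "Enterprise" anywhere in it. The rule has been separated from the only line that says who it applies to.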

As a software engineer who hates bloat and relies on deterministic logic, I needed a better way. I didn't want to use massive Python libraries to solve this.

So, I built Glacier.DocTree.

It is a zero-dependency, bare-metal C# library that parses documents into a Semantic Tree instead of flattening them into dumb chunks.


The Fix: Hierarchical Parsing

Instead of chopping text blindly by token or character count, Glacier.DocTree reads Markdown and builds a strongly-typed parent-child object graph.

  • # (H1) becomes a root node.
  • ## (H2) becomes a child of the last active H1, and ### (H3) becomes a child of the last active H2.
  • Standard text and code blocks become children of their most recent header.
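The nesting rules above can be sketched with a stack of open headers. This is a minimal illustration, assuming .NET 6 or later, not Glacier.DocTree's actual implementation: `DocNode`, `NodeKind`, and `SketchParser` are hypothetical names, and it only handles `#`-style headers, blank-line-separated paragraphs, and triple-backtick fences.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical node types for illustration; not the library's real API.
public enum NodeKind { Root, Header, Paragraph, CodeBlock }

public sealed class DocNode
{
    public NodeKind Kind { get; }
    public int Level { get; }   // 0 for the root, 1-6 for headers, -1 for content nodes
    public string Text { get; }
    public DocNode? Parent { get; private set; }
    public List<DocNode> Children { get; } = new();

    public DocNode(NodeKind kind, int level, string text) { Kind = kind; Level = level; Text = text; }

    // Attaches a child and returns it.
    public DocNode Add(DocNode child) { child.Parent = this; Children.Add(child); return child; }
}

public static class SketchParser
{
    public static DocNode Parse(string markdown)
    {
        var root = new DocNode(NodeKind.Root, 0, "Document Root");
        var headers = new Stack<DocNode>();   // currently open headers, deepest on top
        headers.Push(root);
        var buffer = new List<string>();      // lines of the paragraph or code block in progress
        bool inFence = false;

        // Emits the buffered lines as one content node under the most recent header.
        void Flush(NodeKind kind)
        {
            if (buffer.Count == 0) return;
            headers.Peek().Add(new DocNode(kind, -1, string.Join("\n", buffer)));
            buffer.Clear();
        }

        foreach (var raw in markdown.Replace("\r\n", "\n").Split('\n'))
        {
            var line = raw.TrimEnd();
            if (line.TrimStart().StartsWith("```", StringComparison.Ordinal))
            {
                // Fence lines toggle code mode; the fence itself (and its language tag) is dropped.
                Flush(inFence ? NodeKind.CodeBlock : NodeKind.Paragraph);
                inFence = !inFence;
                continue;
            }
            if (inFence) { buffer.Add(raw); continue; }

            int level = HeaderLevel(line);
            if (level > 0)
            {
                Flush(NodeKind.Paragraph);
                // Close every open header at the same depth or deeper.
                while (headers.Peek().Level >= level) headers.Pop();
                var title = line.Substring(level + 1).Trim();
                headers.Push(headers.Peek().Add(new DocNode(NodeKind.Header, level, title)));
            }
            else if (line.Length == 0) Flush(NodeKind.Paragraph);
            else buffer.Add(line);
        }
        Flush(inFence ? NodeKind.CodeBlock : NodeKind.Paragraph);
        return root;
    }

    // Counts leading '#' characters; returns 0 unless they are followed by a space.
    static int HeaderLevel(string line)
    {
        int n = 0;
        while (n < line.Length && line[n] == '#') n++;
        return n >= 1 && n <= 6 && n < line.Length && line[n] == ' ' ? n : 0;
    }
}
```

The root sits at level 0, so the pop loop can never empty the stack, and content nodes are never pushed, so text always attaches to a header rather than to a sibling paragraph.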

When you feed it a messy API document, it instantly compiles this beautiful, queryable hierarchy in memory:

==========================================
 Glacier.DocTree | Semantic Parser Engine
==========================================

[1] Parsing Markdown into Semantic Tree...

[2] Visualizing the Document Structure:
└─ [Root] Document Root
  └─ [Header1] Glacier Enterprise API
    └─ [Paragraph] Welcome to the Glacier API. This docu...
    └─ [Header2] Authentication
      └─ [Paragraph] All requests to the API must be crypt...
      └─ [Header3] OAuth 2.0
        └─ [Paragraph] To authenticate via OAuth2, you must ...
        └─ [CodeBlock] ```json {    "Authorization": "Bearer...
    └─ [Header2] Usage Policies
      └─ [Paragraph] Please adhere to the following usage ...
      └─ [Header3] Rate Limits
        └─ [Paragraph] Free tier users are limited to 100 re...
      └─ [Header3] Acceptable Use
        └─ [Paragraph] Do not use the API to train competing...

[3] Simulating Agent Query: 'Extract Rate Limits context'

--- SEMANTIC CONTEXT ---
LOCATION: Document Root > Glacier Enterprise API > Usage Policies > Rate Limits
--- BEGIN TEXT ---
Rate Limits
Free tier users are limited to 100 requests per minute.
Enterprise users have unlimited access.
If you exceed the limit, you will receive an HTTP 429 status code.
--- END TEXT ---



Look at that LOCATION string.
An LLM that reads that has far less room to hallucinate. It can see exactly which document it is looking at, which section it is in, and who the policy applies to. The structure is the meaning.
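A breadcrumb like that LOCATION string falls out of the tree almost for free: walk the parent links up to the root and reverse the result. The sketch below assumes a hypothetical `SectionNode` type and `ContextBuilder` helper, which are not Glacier.DocTree's actual API, and it mirrors the output format shown above.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical minimal node for illustration; the library's real types may differ.
public sealed class SectionNode
{
    public string Title { get; }
    public string Body { get; }
    public SectionNode? Parent { get; private set; }

    public SectionNode(string title, string body = "") { Title = title; Body = body; }

    // Attaches a child and returns it so calls can be chained downward.
    public SectionNode Add(SectionNode child) { child.Parent = this; return child; }
}

public static class ContextBuilder
{
    // Collects titles from the node up to the root, then flips them to root-first order.
    public static string Location(SectionNode node)
    {
        var titles = new List<string>();
        for (SectionNode? n = node; n != null; n = n.Parent) titles.Add(n.Title);
        titles.Reverse();
        return string.Join(" > ", titles);
    }

    // Wraps a section's text with its breadcrumb before it is handed to the LLM.
    public static string Render(SectionNode node) =>
        string.Join("\n",
            "--- SEMANTIC CONTEXT ---",
            "LOCATION: " + Location(node),
            "--- BEGIN TEXT ---",
            node.Title,
            node.Body,
            "--- END TEXT ---");
}

public static class ContextDemo
{
    public static string Run()
    {
        var root = new SectionNode("Document Root");
        var limits = root
            .Add(new SectionNode("Glacier Enterprise API"))
            .Add(new SectionNode("Usage Policies"))
            .Add(new SectionNode("Rate Limits", "Free tier users are limited to 100 requests per minute."));
        // Returns "Document Root > Glacier Enterprise API > Usage Policies > Rate Limits"
        return ContextBuilder.Location(limits);
    }
}
```

Because the path is computed from the node itself, any paragraph retrieved from the tree can carry its full ancestry with it, regardless of how the retrieval step found it.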

Try it out

If you are a .NET developer building AI infrastructure, and you are tired of your Vector DB returning paragraphs completely devoid of context, you need a Semantic Layer.

GitHub: ian-cowley/Glacier.DocTree

It is purely native C#. No heavy frameworks, no Python interop, no API keys required. It just parses Markdown in memory, fast, and gives your agents the context they actually need.

Let's prove C# belongs in the modern AI ecosystem!
