If you've built an AI-powered documentation assistant, you've probably hit the same wall I did about six months ago. Your RAG pipeline works fine for simple questions, but the moment someone asks something that spans multiple pages — like "how do I set up authentication with custom middleware?" — the answers start falling apart.
I spent weeks tuning chunk sizes, overlap parameters, and reranking models before realizing the problem wasn't my implementation. It was the paradigm.
The Core Problem With RAG for Docs
RAG (Retrieval-Augmented Generation) treats your documentation like a bag of text chunks. You split everything into pieces, embed them into vectors, and retrieve the most semantically similar ones when a question comes in. For a lot of use cases, this works great.
But documentation isn't a bag of text. It's a tree.
Think about how docs are actually structured:
- Pages live in sections
- Sections have a deliberate ordering
- Pages reference other pages
- A "Getting Started" guide assumes you'll read pages in sequence
- API references are organized by resource, not by semantic similarity
When you chunk all of this into 512-token blocks and toss them into a vector database, you lose the structural relationships that make documentation navigable. The LLM gets fragments without context — like ripping pages out of a manual and shuffling them.
Here's what goes wrong concretely:
- Lost hierarchy: A chunk about `config.auth.provider` loses its relationship to the parent "Configuration" section
- Broken cross-references: "See the section above" means nothing when there's no "above"
- Missed multi-page answers: Questions that require synthesizing info from related pages only get fragments from one
- Redundant retrieval: You pull the same content repeatedly because multiple chunks score similarly
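The "lost hierarchy" failure is easy to demonstrate. Here's a minimal sketch (the document text and heading are invented for illustration): a naive fixed-size chunker, standing in for a 512-token splitter, drops the parent heading from every chunk after the first.

```python
def chunk_by_words(text: str, chunk_size: int = 30) -> list[str]:
    """Naive fixed-size chunker, standing in for a 512-token splitter."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

doc = (
    "# Configuration\n"
    "General settings live in config.yaml. "
    + "Filler sentence about unrelated options. " * 10
    + "The config.auth.provider key selects the OAuth backend."
)

chunks = chunk_by_words(doc, chunk_size=30)

# The chunk mentioning config.auth.provider no longer carries its
# parent "# Configuration" heading: the hierarchy is gone.
hit = [c for c in chunks if "config.auth.provider" in c][0]
assert "# Configuration" not in hit
```

The retriever can still find that chunk by similarity, but the LLM reading it has no way to know it belongs under "Configuration".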
The Virtual Filesystem Approach
The idea is deceptively simple: instead of embedding chunks, represent your documentation as a filesystem that the LLM can navigate. Give the model a directory listing and let it decide which files to read — just like a developer would browse docs.
# Instead of: query -> vector search -> chunks -> LLM
# Try: query -> filesystem map -> LLM picks files -> LLM reads files -> answer
doc_tree = {
    "getting-started/": {
        "_meta": {"title": "Getting Started", "order": 1},
        "installation.md": {"summary": "Install via npm/pip, system requirements"},
        "quickstart.md": {"summary": "Build your first app in 5 minutes"},
        "configuration.md": {"summary": "Config file options, env vars, auth setup"},
    },
    "api-reference/": {
        "_meta": {"title": "API Reference", "order": 2},
        "authentication.md": {"summary": "API keys, OAuth, token refresh"},
        "endpoints/": {
            "users.md": {"summary": "CRUD operations for user resources"},
            "projects.md": {"summary": "Project management endpoints"},
        },
    },
}
The key insight: you're trading vector similarity search for the LLM's own judgment about what's relevant. And honestly? LLMs are surprisingly good at navigating file trees when you give them decent summaries.
Building It Step by Step
Step 1: Generate the File Map
First, parse your documentation into a tree structure with short summaries for each node. These summaries are critical — they're what the LLM uses to decide which files to open.
import os
from pathlib import Path

def build_doc_tree(docs_dir: str) -> dict:
    """Walk the docs directory and build a navigable tree."""
    tree = {}
    for root, dirs, files in os.walk(docs_dir):
        rel_path = os.path.relpath(root, docs_dir)
        current = tree
        if rel_path != ".":
            for part in Path(rel_path).parts:
                current = current.setdefault(part + "/", {})
        for fname in sorted(files):
            if not fname.endswith(".md"):
                continue
            filepath = os.path.join(root, fname)
            with open(filepath, encoding="utf-8") as f:
                content = f.read()
            # Generate a 1-2 sentence summary
            # (use an LLM for this at build time; it's a one-time cost)
            summary = generate_summary(content)
            current[fname] = {
                "summary": summary,
                "path": filepath,
                "tokens": len(content.split()),  # rough token estimate
            }
    return tree
The summaries are the secret sauce here. Spend time making them good. I typically run each page through a smaller model with a prompt like: "Summarize this documentation page in one sentence, focusing on what specific topics and features it covers."
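If you want a zero-cost placeholder to get the pipeline running before wiring up that model call, here's a crude heuristic `generate_summary` (first heading plus the first body sentence). This is an assumption-laden stand-in, not a substitute for LLM-written summaries:

```python
import re

def generate_summary(content: str, max_len: int = 160) -> str:
    """Heuristic stand-in for an LLM-written summary:
    first markdown heading plus the first body sentence."""
    lines = [l.strip() for l in content.splitlines() if l.strip()]
    title = next((l.lstrip("# ") for l in lines if l.startswith("#")), "")
    body = " ".join(l for l in lines if not l.startswith("#"))
    first_sentence = re.split(r"(?<=[.!?])\s+", body)[0] if body else ""
    summary = f"{title}: {first_sentence}" if title else first_sentence
    return summary[:max_len]
```

Swap in the real summarizer once the navigation loop works end to end; the tree-building code doesn't care where summaries come from.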
Step 2: Let the LLM Navigate
This is where it gets interesting. Instead of one retrieval step, you give the LLM the tree and let it request files in a tool-use loop.
def answer_question(question: str, doc_tree: dict, llm_client) -> str:
    system_prompt = """You are a documentation assistant. You have access to
    a virtual filesystem of documentation. Use the available tools to:
    1. List directories to see what's available
    2. Read specific files to find answers
    3. Answer the user's question based on what you've read

    Always check the most relevant directories first. You can read
    multiple files if needed."""

    tools = [
        {
            "name": "list_directory",
            "description": "List contents of a directory with summaries",
            "parameters": {"path": "string"},
        },
        {
            "name": "read_file",
            "description": "Read the full contents of a documentation file",
            "parameters": {"path": "string"},
        },
    ]

    # Seed the conversation with the root directory listing
    initial_context = format_directory(doc_tree, "/")

    # Run the tool-use loop — the LLM decides what to read
    return llm_client.run_agent_loop(
        system=system_prompt,
        user_message=f"Directory listing:\n{initial_context}\n\nQuestion: {question}",
        tools=tools,
        tool_handlers={
            "list_directory": lambda p: format_directory(doc_tree, p),
            "read_file": lambda p: read_doc_file(doc_tree, p),
        },
    )
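The loop above leans on two helpers the snippet doesn't define. Here's one possible sketch, assuming the tree shape from Step 1 (directory keys end in `/`, leaves carry `summary` and `path`):

```python
from pathlib import Path

def resolve(tree: dict, path: str) -> dict:
    """Walk a '/'-separated path down the nested doc_tree."""
    node = tree
    for part in [p for p in path.strip("/").split("/") if p]:
        node = node.get(part + "/") or node[part]
    return node

def format_directory(tree: dict, path: str) -> str:
    """Render one directory level as a listing the LLM can read."""
    node = resolve(tree, path)
    lines = []
    for name, child in node.items():
        if name == "_meta":
            continue
        if name.endswith("/"):
            title = child.get("_meta", {}).get("title", name)
            lines.append(f"[dir]  {name}  ({title})")
        else:
            lines.append(f"[file] {name}  - {child.get('summary', '')}")
    return "\n".join(lines)

def read_doc_file(tree: dict, path: str) -> str:
    """Read the on-disk file behind a tree leaf."""
    node = resolve(tree, path)
    return Path(node["path"]).read_text(encoding="utf-8")
```

Listing one level at a time matters: it keeps each tool result small, and it forces the model to narrow down by section before spending tokens on file contents.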
Step 3: Add a Hybrid Fallback
Pure filesystem navigation can miss things. If the LLM's answer seems uncertain or the question is very specific (like searching for an exact config key), fall back to a traditional search.
def hybrid_answer(question: str, doc_tree: dict, search_index, llm_client):
    # Try filesystem navigation first. (This assumes the agent loop returns
    # a response object exposing a .confidence score, not a bare string.)
    result = answer_question(question, doc_tree, llm_client)

    # If confidence is low, supplement with keyword search
    if result.confidence < 0.7:
        # Simple BM25 or even grep works here — you don't need vectors
        search_hits = search_index.search(question, top_k=3)
        supplemental = "\n".join(hit.content for hit in search_hits)

        # Re-answer with additional context
        result = llm_client.complete(
            f"Based on the docs you browsed AND this additional context:"
            f"\n{supplemental}\n\nRevise your answer to: {question}"
        )
    return result
Notice I used BM25 (keyword search) for the fallback, not vector search. For documentation — where users often search for exact function names, config keys, or error messages — keyword matching frequently outperforms semantic similarity.
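If you'd rather not pull in a search dependency at all, a workable Okapi BM25 ranker fits in a few lines. This is a textbook sketch (whitespace tokenization, standard `k1`/`b` defaults), not a tuned engine:

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str],
              k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Rank document indices by Okapi BM25 score for a keyword query."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)

    # document frequency per term
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            )
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Exact-token matching is exactly what you want here: a query for `config.auth.provider` scores the one page that contains that literal key, which embedding similarity can easily dilute.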
When This Works (and When It Doesn't)
This approach shines when:
- Your docs have clear hierarchical structure
- Questions often require context from multiple related pages
- The documentation set is moderate-sized (under ~1000 pages)
- Users ask conceptual "how do I" questions
It struggles when:
- Your docs are flat with no meaningful structure
- The doc set is enormous (the file map itself becomes too large for context)
- Questions are hyper-specific keyword lookups (use search for these)
- You need sub-second response times (the multi-turn navigation adds latency)
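One cheap way to check the "file map too large" failure mode before committing to the approach: estimate the rendered map's token cost with the same rough words-as-tokens heuristic used in Step 1 (a real tokenizer will give somewhat different numbers):

```python
def estimate_map_tokens(tree: dict) -> int:
    """Rough token estimate for a rendered file map: count words in
    every file name, directory name, title, and summary."""
    total = 0
    for name, child in tree.items():
        total += len(name.split())
        if name == "_meta":
            total += len(str(child.get("title", "")).split())
        elif name.endswith("/"):
            total += estimate_map_tokens(child)
        else:
            total += len(child.get("summary", "").split())
    return total
```

If the estimate approaches a meaningful fraction of your model's context window, either shard the map by top-level section or fall back to search-first retrieval.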
Performance Observations
After running both approaches side-by-side on the same doc set for a few weeks, here's what I noticed:
- Multi-page questions: The filesystem approach produced noticeably better answers — it could pull from 3-4 related pages naturally
- Latency: Slower on average (2-4 LLM calls vs 1), but the hybrid approach kept simple questions fast
- Token usage: Higher per query, but fewer retries from users asking follow-ups because the first answer was incomplete
- Maintenance: Way simpler than tuning chunking strategies and reranking pipelines
Prevention Tips
If you're still early in building a doc assistant, save yourself some pain:
- Start with your doc structure, not your embedding model. If your docs are poorly organized, neither RAG nor filesystem navigation will save you. Fix the information architecture first.
- Generate summaries at build time. Don't try to summarize on the fly. Pre-compute good summaries for every page and directory, and refresh them when content changes.
- Keep a search index as backup. BM25 is trivially cheap to maintain alongside the filesystem approach. Use it for exact-match queries.
- Measure what users actually ask. I was surprised how many questions were multi-page — if yours aren't, vanilla RAG might be fine.
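"Refresh them when content changes" is easy to get wrong by hand; a content-hash cache makes it mechanical. A minimal sketch (the `summarize` callable is whatever summarizer you use, and the JSON cache file is an illustrative choice):

```python
import hashlib
import json
from pathlib import Path

def cached_summary(filepath: str, cache_path: str, summarize) -> str:
    """Return a cached summary, regenerating only when the
    file's content hash changes."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    content = Path(filepath).read_text(encoding="utf-8")
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()

    entry = cache.get(filepath)
    if entry and entry["hash"] == digest:
        return entry["summary"]   # unchanged: reuse the cached summary
    summary = summarize(content)  # new or changed: regenerate
    cache[filepath] = {"hash": digest, "summary": summary}
    cache_file.write_text(json.dumps(cache))
    return summary
```

Run this in the same build step that produces the tree, and a docs edit only costs you one summarization call per changed page.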
The filesystem approach isn't a silver bullet. But if you've been fighting chunking strategies and your answers still feel like they're missing context, it's worth trying a fundamentally different retrieval model. Sometimes the best search is no search at all — just let the LLM browse.