DEV Community

Arek Mazur

RAG finds chunks. TrailGraph finds answers. Here's the difference.

Imagine asking your AI assistant: "How does lead qualification work?"

RAG searches the vector store, pulls the top 5 chunks by cosine similarity, and hands them to the model. You might get three paragraphs about leads, one about opportunity stages, and one about case escalation — because they all share similar keywords. The model does its best to stitch them together. Sometimes it works. Sometimes you get a confidently wrong answer built from pieces that were never meant to be combined.

To be clear — RAG is a proven, powerful pattern. For unstructured documents, broad search, and fast retrieval, it's hard to beat. But when knowledge has clear hierarchy and multi-level relationships, retrieval by similarity alone can lose the structure that makes the answer meaningful.

So I built TrailGraph — an AI agent that navigates a knowledge graph step by step. It explores nodes, scores them for relevance, follows the most promising path, and only reads the full content when it's confident it found the right answer. No embeddings. No vector search. Just an LLM with a single tool and a graph of markdown files.


Where I ran into limits with standard retrieval

RAG was designed for retrieval. You embed documents, query a vector store, get back the top N chunks ranked by cosine similarity. It's fast, it scales, and for many use cases it's the right approach.

But working with Salesforce knowledge — which is inherently hierarchical — I kept running into situations where similarity-based retrieval felt like the wrong tool for the job:

Structure gets flattened. A Salesforce knowledge base is a tree: CRM → Sales → Lead Process → Lead Qualification. With chunked retrieval, that hierarchy disappears — every chunk is equally distant from every other chunk.

No feedback loop. Top-k gives you one shot. There's no way for the model to say "this chunk isn't quite right, show me something adjacent." You get your k chunks and that's it.

Context mixing. When retrieved chunks come from different branches of knowledge, the model has no way to know they're unrelated. This isn't a RAG-specific flaw — any retrieval method can surface unrelated content — but flat retrieval makes it harder to prevent.


The idea: what if the LLM navigated knowledge like a human?

When a domain expert answers a question, they don't search all documents at once. They start from a general area, narrow down, check if they're on the right track, and drill into the specifics.

TrailGraph gives the LLM this exact workflow:

  1. Pick an entry point based on the question (CRM, Security, Integrations)
  2. Explore — see the node's key points, children, and related nodes
  3. Score — assign a relevance score (0–100) and decide where to go next
  4. Drill down — follow the most promising path through children
  5. Focus — when confidence is high enough, retrieve the full content and answer

The LLM never sees the full graph. It only sees the local view of the current node — just like navigating a real knowledge base.
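This local view can be sketched as a small function. The field names mirror the node metadata shown in the next section, but the function itself (`exploration_view`) is a hypothetical illustration, not TrailGraph's actual code:

```python
# Hypothetical sketch of the "local view": the model only ever sees a
# node's summary, key points, and immediate neighbors -- never the full
# graph and never the full content during exploration.
def exploration_view(node):
    return {
        "node": node["path"],
        "summary": node["summary"],
        "key_points": node["key_points"],
        "children": node["children"],   # candidate next hops (deeper)
        "related": node["related"],     # lateral jumps
        # the full "content" field is deliberately omitted here
    }

node = {
    "path": "sales/Lead_qualification.md",
    "summary": "Detailed internal rules for lead qualification.",
    "key_points": ["Duplicate check is required before conversion"],
    "children": [],
    "related": ["sales/Opportunity_management.md"],
    "content": "Lead qualification consists of four main steps...",
}
view = exploration_view(node)
assert "content" not in view  # content stays hidden until focused view
```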


How it works

The knowledge graph

Each node is a markdown file with metadata:

# Lead Qualification

summary: Detailed internal rules, stages, criteria and responsibilities
         for lead qualification.
parent: sales/Sales.md
children: []
related: [sales/Opportunity_management.md]
key_points: ["Mandatory data must be complete before qualification starts",
             "Duplicate check is required before conversion",
             "Business fit and ownership must be confirmed"]

## Content

Lead qualification in our company consists of four main steps:
1. Verify that all mandatory lead data fields are complete.
2. Check whether the company and contact already exist in the system.
3. Assess business fit and confirm ownership of the lead.
4. Decide the outcome: convert, reject, or request more information.

The summary and key_points are what the LLM sees during exploration. The full ## Content section is only revealed when the model commits to this node as the answer.

This separation is intentional — it forces the model to navigate rather than skim.
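A minimal parser for this layout might look like the sketch below. It assumes one `key: value` metadata line per field and splits on the `## Content` heading; the real node format (multi-line values, list parsing) is richer than this:

```python
def parse_node(text):
    """Split a node file into metadata (shown during exploration)
    and full content (revealed only in focused view).
    Simplified sketch: values are kept as raw strings."""
    meta_part, _, content = text.partition("## Content")
    meta = {}
    for line in meta_part.splitlines():
        # skip the title line and blanks; keep simple key: value pairs
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, content.strip()

text = """# Lead Qualification

parent: sales/Sales.md
children: []

## Content

Lead qualification consists of four main steps."""

meta, content = parse_node(text)
```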

The graph structure

entry_points/
├── CRM.md ──────────┬── sales/Sales.md
│                     │      ├── Lead_process.md
│                     │      │      └── Lead_qualification.md
│                     │      └── Opportunity_management.md
│                     │             └── Opportunity_stages.md
│                     └── service/Service.md
│                            ├── Case_handling.md
│                            └── Escalation_process.md
├── Security.md
└── Integrations.md

The tool

The agent has one tool: get_knowledge_context. It accepts a node path, a view mode, a score, and a reason.

def run(self, node, view="exploration", score=0, reason=""):
    # Track the best-scoring node so far as the fallback target
    self.last_score = score
    if score > self.best_score:
        self.best_score = score
        self.best_node = node

    # High confidence forces the focused view, whatever was requested
    if score >= ANSWER_THRESHOLD:
        view = "focused"

    # Refuse to re-explore a node; nudge the model elsewhere
    if node in self.visited and view != "focused":
        return {"already_visited": True, "suggestion": "Explore other candidates."}

    self.visited.append(node)
    self.hop_count += 1

    # Focused view is terminal: disable the tool after this call
    if view == "focused":
        self.disabled = True

    result = build_node_info(node, view)
    # ...
    return result

Key behaviors:

  • Two views: exploration returns key points, children, and related nodes. focused returns the full content.
  • Score-driven transitions: when the model assigns a score >= 95, the tool automatically switches to focused view — no matter what the model requested.
  • Self-disabling: after returning focused content, the tool disables itself. The model has what it needs; no more graph navigation.
  • Dead end detection: if a node has no children and no related nodes, the tool flags it as a dead end and triggers a fallback.
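The score-driven transition and the dead-end rule can be condensed into a tiny decision function. This is a hypothetical restatement of the behaviors above, not the tool's actual internals:

```python
ANSWER_THRESHOLD = 95  # matches the threshold described in the post

def is_dead_end(node_info):
    # A node with no children and no related nodes leads nowhere
    return not node_info.get("children") and not node_info.get("related")

def next_action(node_info, score, answer_threshold=ANSWER_THRESHOLD):
    if score >= answer_threshold:
        return "focused"    # commit: reveal full content, disable tool
    if is_dead_end(node_info):
        return "fallback"   # retreat to the best-scoring node so far
    return "explore"        # keep navigating children / related nodes
```

For example, a leaf node scored 98 triggers the focused view, while a scored-70 node with nothing below it triggers the fallback.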

The agent loop

The agent itself is completely generic — it knows nothing about knowledge graphs:

class Agent:
    def __init__(self, tools, prompt_vars=None, verbose=False):
        self.client = OpenRouterClient(model=os.getenv("OPENROUTER_MODEL"))
        self.tools = tools
        self.tool_map = {tool.name: tool for tool in self.tools}
        # ...

    def _step(self, user_input):
        for tool in self.tools:
            tool.reset()

        for iteration in range(MAX_TOOL_ITERATIONS):
            # Only offer tools that haven't disabled themselves
            active_tools = [s for tool, s in zip(self.tools, self.tool_schemas)
                           if not tool.disabled]
            message = self.client.complete(self.messages, tools=active_tools or None)

            tool_calls = message.get("tool_calls") or []
            if not tool_calls:
                print(f"Agent: {message.get('content', '')}")
                return

            for tool_call in tool_calls:
                tool = self.tool_map[tool_call["function"]["name"]]
                tool_args = json.loads(tool_call["function"]["arguments"])
                result = tool.run(**tool_args)
                # ...
                fallback = tool.should_fallback()
                if fallback:
                    print(fallback)
                    return

All the intelligence lives in the tool. The agent just runs the loop, passes messages, and checks stop conditions. You could swap GetKnowledgeContext for a completely different tool and the agent would work the same way.
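The implied tool contract is small. A hypothetical base class (an assumption drawn from how the loop uses tools, not TrailGraph's actual code) needs only a name, a schema, `run()`, `reset()`, a `disabled` flag, and `should_fallback()`:

```python
# Hypothetical minimal interface a tool must satisfy for the loop above.
class Tool:
    name = "base_tool"
    disabled = False

    def reset(self):
        # Called at the start of every _step; re-enable the tool
        self.disabled = False

    def run(self, **kwargs):
        raise NotImplementedError

    def should_fallback(self):
        # Returning a string stops the loop and prints that message
        return None

# Any conforming tool can replace GetKnowledgeContext:
class EchoTool(Tool):
    name = "echo"

    def run(self, text=""):
        return {"echo": text}
```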

The flow

User: "How does lead qualification work?"
  │
  ├─→ LLM selects entry point: CRM.md
  │     → exploration view: sees Sales and Service as children
  │
  ├─→ LLM picks Sales.md (score: 70)
  │     → exploration view: sees Lead_process, Opportunity_management
  │
  ├─→ LLM picks Lead_process.md (score: 80)
  │     → exploration view: sees Lead_qualification as child
  │
  ├─→ LLM picks Lead_qualification.md (score: 98)
  │     → score >= 95 → automatic switch to focused view
  │     → full content returned, tool disables itself
  │
  └─→ LLM writes final answer based on focused content

4 hops. Each one narrowing the search space. Graph traversal guided by LLM reasoning.


Trade-offs: when does each approach make sense?

| | RAG | TrailGraph |
|---|---|---|
| Retrieval | Semantic similarity (top-k) | Graph traversal (step by step) |
| Structure awareness | Chunks are independent | Nodes have parents, children, relations |
| Context control | Fixed (top-k chunks) | Dynamic — model decides what to explore |
| Latency | Fast (1 query) | Slower (multiple LLM calls per question) |
| Scalability | Proven at scale with vector DBs | Depends on graph depth and structure |
| Setup complexity | Moderate (embeddings, vector store) | Low (just markdown files) |
| Best for | Unstructured docs, broad search | Structured domains, multi-level knowledge |

These aren't competing approaches — they solve different problems. RAG shines when you have large volumes of unstructured content and need fast, broad retrieval. Graph traversal shines when knowledge has explicit hierarchy and relationships. The roadmap for TrailGraph actually combines both: RAG for entry point selection, graph traversal for deep navigation.


What I learned building this

Scoring is everything

The system prompt includes a scoring guide that shapes how the model navigates:

  • 0–59: Not relevant — stop exploring this path
  • 60–94: Partially relevant — keep exploring children
  • 95–100: Highly relevant — switch to focused view

Getting these thresholds right took multiple iterations. Set the answer threshold too low (e.g. 85) and the model stops at intermediate nodes like "Sales" instead of drilling into "Lead Qualification." Set it too high and the model never commits.

The key insight: only assign high scores to leaf nodes. If a node has children, there's always a more specific answer deeper in the graph. This rule is baked into the system prompt.
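In TrailGraph this rule lives in the system prompt, but the same constraint could be sketched as a programmatic guard (an assumption, shown only to make the rule concrete): cap any non-leaf node's score just below the answer threshold, so the model is forced to keep drilling down:

```python
ANSWER_THRESHOLD = 95  # the threshold from the scoring guide above

def clamp_score(score, has_children):
    """Non-leaf nodes can never reach the answer threshold:
    if a node has children, a more specific answer exists deeper."""
    if has_children:
        return min(score, ANSWER_THRESHOLD - 1)
    return score
```

With this guard, even if the model scores "Sales" at 98, the clamped value (94) keeps it in exploration mode.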

The model will hallucinate paths

Early versions had the model inventing node paths like Integrations/Salesforce_SAP_Integration.md — files that don't exist. The fix was a hard rule in the system prompt:

Never invent node paths. Only navigate to nodes explicitly listed in children or related of a previous tool response.

Combined with dead end detection in the tool (no children + no related = fallback), this eliminated path hallucination entirely.
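The same rule could also be enforced on the tool side. Here is a hedged sketch (the class and its names are hypothetical; TrailGraph relies on the prompt rule plus dead-end detection): track every path the tool has surfaced, and reject anything outside that frontier.

```python
# Hypothetical server-side guard against path hallucination: only
# entry points and paths previously listed in children/related of a
# tool response are navigable.
class PathValidator:
    def __init__(self, entry_points):
        self.allowed = set(entry_points)

    def register(self, node_info):
        # Every exploration response extends the navigable frontier
        self.allowed.update(node_info.get("children", []))
        self.allowed.update(node_info.get("related", []))

    def check(self, path):
        return path in self.allowed

v = PathValidator(["CRM.md", "Security.md", "Integrations.md"])
v.register({"children": ["sales/Sales.md", "service/Service.md"]})
```

Now `v.check("sales/Sales.md")` passes, while an invented path like `Integrations/Salesforce_SAP_Integration.md` is rejected before any file I/O happens.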


What's next

TrailGraph is functional but minimal. Here's what's on the roadmap:

Near-term:

  • Multi-path exploration — instead of following a single path, explore the top N candidates in parallel (beam search)
  • Scoring module — extract scoring logic into a dedicated module to make strategies swappable

Longer-term:

  • Document parser — a tool that converts external documents (PDF, DOCX, Confluence) into the .md node format, making it easier to populate the graph
  • SQLite metadata store — store node metadata (summary, children, related, key_points) in a database while keeping full content in markdown files. Graph traversal happens against the DB; file I/O only on view=focused
  • RAG-based entry point selection — replace the fixed entry point list with a semantic search step. When a question arrives, RAG returns the top N most relevant nodes as candidate entry points. If the graph traversal from candidate i scores below threshold, automatically fall back to candidate i+1
  • Multi-query decomposition — split complex questions into sub-questions, run each through the full traversal independently, and consolidate into a single answer

The interesting part is that RAG and graph traversal aren't mutually exclusive. The roadmap leads to a hybrid: RAG for entry point selection, graph traversal for deep navigation.


Try it

TrailGraph is open source. The knowledge base covers Salesforce sales and service processes, but the architecture works for any structured domain.

git clone https://github.com/panhiszpandev/TrailGraph.git
cd TrailGraph
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add your OpenRouter API key
python main.py --verbose --task "How does lead qualification work?"

Watch the verbose output. You'll see the agent hop through the graph, scoring each node, narrowing down step by step — the way a human would navigate knowledge.


Acknowledgments

This project was inspired by ideas from the AI_devs course, which pushed me to think beyond standard RAG patterns and explore agent-based architectures.


TrailGraph is a side project. Feedback, ideas, and PRs are welcome.
GitHub
