<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Martin Arva</title>
    <description>The latest articles on DEV Community by Martin Arva (@martinarva).</description>
    <link>https://dev.to/martinarva</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859445%2Fbe6aa26b-90ff-45d3-a86e-7911b2b5e2c9.jpeg</url>
      <title>DEV Community: Martin Arva</title>
      <link>https://dev.to/martinarva</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martinarva"/>
    <language>en</language>
    <item>
      <title>We tested structured ontology vs Markdown+RAG for AI agents — "why?" recall was 0% vs 100%</title>
      <dc:creator>Martin Arva</dc:creator>
      <pubDate>Sat, 04 Apr 2026 12:17:14 +0000</pubDate>
      <link>https://dev.to/martinarva/we-tested-structured-ontology-vs-markdownrag-for-ai-agents-why-recall-was-0-vs-100-42p3</link>
      <guid>https://dev.to/martinarva/we-tested-structured-ontology-vs-markdownrag-for-ai-agents-why-recall-was-0-vs-100-42p3</guid>
      <description>&lt;p&gt;Our AI agent knew the company uses Provider A for identity verification. It could name the provider, list the integration specs, recite the timeline.&lt;/p&gt;

&lt;p&gt;Then we asked &lt;em&gt;why&lt;/em&gt; Provider A was chosen over Provider B.&lt;/p&gt;

&lt;p&gt;The agent couldn't answer. Not once across 24 attempts. Zero percent recall on reasoning questions.&lt;/p&gt;

&lt;p&gt;So we built the layer that was missing — and ran 48 controlled experiments to measure the difference.&lt;/p&gt;

&lt;h2&gt;The problem: AI agents can't answer "why?"&lt;/h2&gt;

&lt;p&gt;If you give an AI agent a folder of Markdown docs and let it use RAG to find answers, it handles factual questions well. &lt;em&gt;What modules exist? Who owns this component? When was this decision made?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But "why?" is different.&lt;/p&gt;

&lt;p&gt;Reasoning is rarely stored as a discrete fact. It's spread across meeting notes, scattered through Slack threads, buried in the third paragraph of a design doc written six months ago. The &lt;em&gt;connection&lt;/em&gt; between a strategic goal and an operational decision almost never appears as a single retrievable chunk.&lt;/p&gt;

&lt;p&gt;This means vector search finds the documents that &lt;em&gt;mention&lt;/em&gt; the decision, but not the reasoning chain that &lt;em&gt;justifies&lt;/em&gt; it. The agent knows what happened. It doesn't know why.&lt;/p&gt;

&lt;p&gt;This matters more than it sounds. An agent that doesn't understand &lt;em&gt;why&lt;/em&gt; a decision was made will make follow-up decisions that are technically correct but institutionally wrong — optimizing for the wrong goal, violating an unwritten constraint, repeating a mistake that was already analyzed and rejected.&lt;/p&gt;

&lt;h2&gt;Our approach: structured ontology as a navigation layer&lt;/h2&gt;

&lt;p&gt;We didn't replace the Markdown docs. We added a structured layer on top — a four-level ontology that maps business reasoning into queryable relationships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LORE       (foundational beliefs, worldview)
  ↓ interpreted_into
VISION     (goals, priorities, boundaries)
  ↓ operationalized_into
RULES      (policies, decision rules, constraints)
  ↓ applied_to
OPERATIONS (initiatives, decisions, tasks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every connection between layers carries an &lt;em&gt;assertion&lt;/em&gt; — an explicit explanation of why that relationship exists. This means an agent can trace from any operational decision back to the foundational beliefs that justify it.&lt;/p&gt;
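
&lt;p&gt;As a minimal sketch, such an assertion-carrying edge could be modeled like this (the field names, IDs, and relation labels are our illustration, not the project's actual schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative data model: every cross-layer edge carries an assertion.
# Field names, IDs, and relation labels here are hypothetical, not the
# real Right Reasons schema.
from dataclasses import dataclass

@dataclass
class Edge:
    child: str      # the lower-layer object, e.g. an OPERATIONS decision
    relation: str   # applied_to / operationalized_into / interpreted_into
    parent: str     # the object one layer up
    assertion: str  # the explicit WHY behind this relationship

edge = Edge(
    child="ex_ops_02",
    relation="applied_to",
    parent="rule_affordable_idp",
    assertion="Provider A was the cheapest OIDC-compatible option, "
              "so it satisfies the affordable-IdP rule.",
)
print(edge.parent, edge.assertion)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;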

&lt;p&gt;Here's what that looks like in practice. Ask: "Why did we choose Provider A for identity verification?"&lt;/p&gt;

&lt;p&gt;The agent traces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPERATIONS → Chose Provider A (affordable, OIDC-compatible)
  ← applied_to
RULES → Start with affordable identity provider, plan migration later
  ← operationalized_into
VISION → Build self-service tools for micro-entrepreneurs
  ← interpreted_into
LORE → Small business owners want to handle accounting themselves
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No vector search. No probabilistic retrieval. Just deterministic SQL queries over a versioned database.&lt;/p&gt;

&lt;p&gt;The backend is &lt;a href="https://github.com/dolthub/dolt" rel="noopener noreferrer"&gt;Dolt&lt;/a&gt; — a MySQL-compatible database with Git semantics: branch, commit, diff, merge, pull request. Every change to the ontology goes through human review before it becomes canonical.&lt;/p&gt;
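
&lt;p&gt;To make the query shape concrete, here is a self-contained sketch of that trace as a recursive SQL query. It uses Python's built-in sqlite3 as a stand-in (the real backend is Dolt, which speaks MySQL), and the table, column, and object names are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: sqlite3 stands in for Dolt so the example runs anywhere.
# Schema and object IDs are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE edges (child TEXT, relation TEXT, parent TEXT, assertion TEXT);
INSERT INTO edges VALUES
  ('ex_ops_02', 'applied_to', 'rule_affordable_idp',
   'affordable-IdP rule applies'),
  ('rule_affordable_idp', 'operationalized_into', 'vision_self_service',
   'low cost keeps self-service tooling viable'),
  ('vision_self_service', 'interpreted_into', 'lore_diy_accounting',
   'owners want to handle accounting themselves');
""")

# Walk the chain upward from one operational decision, deterministically.
rows = db.execute("""
WITH RECURSIVE chain(child, relation, parent, assertion) AS (
  SELECT child, relation, parent, assertion FROM edges
   WHERE child = 'ex_ops_02'
  UNION ALL
  SELECT e.child, e.relation, e.parent, e.assertion
    FROM edges AS e JOIN chain AS c ON e.child = c.parent
)
SELECT relation, parent, assertion FROM chain
""").fetchall()

for relation, parent, assertion in rows:
    print(relation, parent, assertion)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;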

&lt;p&gt;The interface is &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; (Model Context Protocol) — the de facto standard for connecting AI agents to external tools. Our server exposes 18 tools: 9 for querying, 4 for proposing changes, 3 for generating reasoning envelopes, and 2 for Dolt version control.&lt;/p&gt;

&lt;h2&gt;The experiment&lt;/h2&gt;

&lt;p&gt;We tested this on a real business domain — a SaaS company's market expansion project. Same knowledge base, same questions, two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode A&lt;/strong&gt;: Agent gets Markdown documentation + file search tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mode B&lt;/strong&gt;: Agent gets the same knowledge as a structured ontology + Dolt MCP tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;48 sessions. 8 task types. 3 runs per task per mode. Two independent LLM judges (GPT-5.4 and Claude Opus 4.5) evaluated every answer against ground truth.&lt;/p&gt;

&lt;h3&gt;Results&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Markdown + RAG&lt;/th&gt;
&lt;th&gt;Right Reasons&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Entity recall&lt;/td&gt;
&lt;td&gt;0.514&lt;/td&gt;
&lt;td&gt;0.976&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+90%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Why?" question recall&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0% → 100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning quality (1-5)&lt;/td&gt;
&lt;td&gt;1.96&lt;/td&gt;
&lt;td&gt;4.33&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+121%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stability (variance)&lt;/td&gt;
&lt;td&gt;1.457&lt;/td&gt;
&lt;td&gt;0.472&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3× more stable&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;284.6s&lt;/td&gt;
&lt;td&gt;183.8s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pairwise wins&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;(4 ties)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "why?" result is the headline: Mode A scored 0.0 entity recall across all 6 runs on reasoning questions. Not low — &lt;em&gt;zero&lt;/em&gt;. Mode B scored 1.0 across all 6 runs. This isn't statistical noise. It's a deterministic gap.&lt;/p&gt;

&lt;p&gt;The conventional assumption is that structured retrieval is a tradeoff — better recall but more overhead and higher latency. This experiment showed the opposite: the structured approach was simultaneously more accurate, faster, more stable, and more compact in its answers.&lt;/p&gt;

&lt;p&gt;Judge agreement was 83.3%. Average judge confidence was 0.927. The only disagreements were on impact analysis tasks where multiple valid reasoning paths existed.&lt;/p&gt;
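
&lt;p&gt;For the skeptical reader, the deltas in the table are plain relative changes and can be rechecked from the raw values (all numbers copied from the table above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rederiving the table's deltas from its raw values (sanity check).
entity_recall_delta = (0.976 / 0.514 - 1) * 100   # relative improvement
quality_delta = (4.33 / 1.96 - 1) * 100
stability_ratio = 1.457 / 0.472
latency_gain = (1 - 183.8 / 284.6) * 100

print(round(entity_recall_delta))   # 90
print(round(quality_delta))         # 121
print(round(stability_ratio))       # 3
print(round(latency_gain))          # 35
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;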

&lt;h3&gt;What we didn't prove (honestly)&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingest&lt;/strong&gt;: Getting business knowledge &lt;em&gt;into&lt;/em&gt; the ontology was manual. This is the hardest unsolved problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write path&lt;/strong&gt;: We only tested reading. The flow for agents proposing ontology changes is designed but not yet benchmarked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization&lt;/strong&gt;: Tested on one domain (dev planning). Other domains are next.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How knowledge enters the ontology: EPICAL&lt;/h2&gt;

&lt;p&gt;We're not expecting anyone to populate SQL tables by hand. The ingest pipeline we designed is called EPICAL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source docs → EXTRACT → PONDER → INTERROGATE → CALIBRATE → AUTHENTICATE → LOAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first two stages (Extract and Ponder) are agent-driven — the AI proposes candidate objects and relationships from source documents. Interrogate and Calibrate refine confidence. Authenticate is the human gate — a Dolt diff review, just like a code PR. Only after human approval does knowledge become canonical.&lt;/p&gt;

&lt;p&gt;The epistemic boundary is strict: an agent cannot bypass human validation. The &lt;code&gt;promote_candidate&lt;/code&gt; tool requires &lt;code&gt;authenticated&lt;/code&gt; status.&lt;/p&gt;
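
&lt;p&gt;A toy sketch of that boundary (the stage names follow EPICAL, but the status values, function body, and error type are illustrative, not the real tool):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative gate: knowledge flows through EPICAL stages, but only a
# human reviewer can set 'authenticated'. Status names are hypothetical.
STAGES = ["extracted", "pondered", "interrogated", "calibrated",
          "authenticated", "loaded"]

def promote_candidate(candidate):
    """Refuse to load any candidate a human has not signed off on."""
    if candidate["status"] != "authenticated":
        raise PermissionError("human review required before promotion")
    candidate["status"] = "loaded"
    return candidate

agent_proposal = {"id": "cand_01", "status": "calibrated"}
try:
    promote_candidate(agent_proposal)        # agent tries to skip review
except PermissionError as err:
    print("blocked:", err)

agent_proposal["status"] = "authenticated"   # set by a human via Dolt PR
print(promote_candidate(agent_proposal)["status"])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;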

&lt;h2&gt;OPS Contracts: reasoning envelopes for external work&lt;/h2&gt;

&lt;p&gt;One more concept worth mentioning. When work happens in external systems (Jira, GitHub, CI/CD), the agent can generate an &lt;em&gt;OPS Contract&lt;/em&gt; — a reasoning envelope that attaches institutional context to a work item:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;generate_ops_contract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;external_work_ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira://TASK-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prepare annual report for submission&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contract_kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;annual_reporting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The contract tells the executing agent &lt;em&gt;why&lt;/em&gt; this task matters, &lt;em&gt;what rules&lt;/em&gt; apply, and &lt;em&gt;which boundaries&lt;/em&gt; must not be crossed — without the agent needing to query the full ontology itself.&lt;/p&gt;
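
&lt;p&gt;The envelope might look something like this in practice (the field names and rendering helper are our illustration; only the why/rules/boundaries split comes from the description above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical shape of an OPS Contract envelope. Field names are
# illustrative; the why/rules/boundaries split is from the design.
ops_contract = {
    "external_work_ref": "jira://TASK-123",
    "why": "Annual reporting keeps micro-entrepreneurs compliant "
           "without an accountant (traces to LORE via VISION).",
    "rules": ["Use the approved reporting template",
              "Submit before the statutory deadline"],
    "boundaries": ["Never file without owner confirmation"],
}

def brief(contract):
    """Render the envelope as a short brief for the executing agent."""
    lines = ["WHY: " + contract["why"]]
    lines += ["RULE: " + r for r in contract["rules"]]
    lines += ["BOUNDARY: " + b for b in contract["boundaries"]]
    return "\n".join(lines)

print(brief(ops_contract))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;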

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;The full repo is open source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Right-Reasons/right-reasons
&lt;span class="nb"&gt;cd &lt;/span&gt;right-reasons
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;mcp-server &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect your agent, then ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why did we choose Provider A over Provider B for identity? Use the get_explanation_packet tool with object ID ex_ops_02."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent will trace the full reasoning chain across all four layers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;a href="https://github.com/Right-Reasons/right-reasons" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌐 &lt;a href="https://www.rightreasons.ai" rel="noopener noreferrer"&gt;Website&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📝 &lt;a href="https://medium.com/@kaspar.loit/missing-layer-in-ai-agent-systems-654b41962f37" rel="noopener noreferrer"&gt;Background article by Kaspar Loit&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📊 &lt;a href="https://github.com/Right-Reasons/right-reasons/blob/main/experiments/poc-results.md" rel="noopener noreferrer"&gt;Full experiment results&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Right Reasons is built by &lt;a href="https://www.rightreasons.ai" rel="noopener noreferrer"&gt;MindWorks Industries&lt;/a&gt;. We're looking for early users who want to give their AI agents actual institutional reasoning. Reach out at &lt;a href="mailto:hello@rightreasons.ai"&gt;hello@rightreasons.ai&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
