This article is for anyone who's discovered that making tools "helpful" to LLMs actually made them less useful.
TL;DR: I tried to make my MCP server's DOM exploration tools helpful by adding semantic interpretations to their output. The tools became brittle, task-specific, and harder for LLMs to use correctly. Stripping out all interpretation and returning only raw structural facts made the system more reliable, more composable, and paradoxically more useful. This article explains why LLM-facing tools need fundamentally different design principles than human-facing APIs—and why "dumber" tools create smarter agents.
Article Roadmap
Part 1: The Problem (2 min read)
- The helpful tool that made everything worse
- Why interpretation is context-dependent
- The insight: capability vs. interpretation
Part 2: Why This Matters (2 min read)
- The composition problem
- The human API trap
- When interpretation belongs in tools (the rare cases)
Part 3: Design Principles (2 min read)
- Four principles for fact-based tools
- The composability test
- How the knowledge layer handles interpretation
Part 4: What This Means for Builders (2 min read)
- Practical changes for tool builders
- The counterintuitive result
- The principle generalizes
Part 1: The Problem
1.1 The Helpful Tool That Made Everything Worse
When I first built resolve_container for Verdex, I wanted it to be helpful. The tool walks up the DOM tree from a target element and returns the ancestor chain. But I didn't just return tags and attributes—I added interpretation:
{
  "type": "product-card",          // Tool guesses semantic meaning
  "role": "list-item",             // Tool guesses structural purpose
  "confidence": 0.85,              // Tool evaluates its own guess
  "recommendation": "Use this as your container scope"
}
This seemed reasonable. I was helping the LLM by pre-analyzing the structure and making recommendations. The tool was doing the hard work of interpretation so the LLM didn't have to.
Then I hit production.
A page with user profile cards. The tool confidently labeled them as product cards (confidence: 0.85). The LLM trusted the tool's interpretation. It generated selectors scoped to the wrong pattern. Tests broke mysteriously when the heuristic failed on edge cases.
The problem wasn't that the interpretation was usually wrong—it was usually right. The problem was fundamental: interpretation is context-dependent, and the tool didn't have context.
1.2 The Core Problem: Interpretation Is Context-Dependent
Take a div with data-testid="product-card". What does this mean?
For selector authoring: That div is a stable container. Use getByTestId("product-card") as your scope.
For visual testing: It's a component boundary. Screenshot this entire card.
For web scraping: It's a data structure. Extract product information from descendants.
For accessibility auditing: It's a semantic grouping. Check for proper ARIA labels.
Same DOM element. Same structural facts. Four completely different interpretations.
When my tool chose one interpretation—"this is a product card, use it for selector scoping"—it made that decision for all tasks. But the tool couldn't know which interpretation was correct because it lacked the user's query, the application domain knowledge, and the task context.
Only the LLM had that information.
1.3 The Insight: Capability vs. Interpretation
This led to a fundamental separation in how I think about tool design:
Tools provide capability: Access to structural facts that would otherwise be hidden or expensive to retrieve.
LLMs provide interpretation: Deciding what those facts mean for this specific query in this specific context.
The tool's job is to traverse the DOM and return what it finds: tags, attributes, depth, relationships. Not to guess whether something is a product card. Not to evaluate whether it's a good container. Not to recommend what the LLM should do with it.
That's the LLM's job.
Here's what happened when I stripped out the interpretation:
Before (interpretation mixed in):
{
  "container": {
    "semanticType": "product-card",  // Tool is guessing
    "stability": "high",             // Tool is evaluating
    "recommended": true              // Tool is prescribing
  }
}
After (pure facts):
{
  "ancestors": [
    {
      "level": 1,
      "tagName": "div",
      "attributes": {"data-testid": "product-card"},
      "childElements": 5
    }
  ]
}
The second version seems less helpful. It is less helpful—for humans. But it's more useful for LLMs because it preserves optionality.
The LLM can interpret those facts differently based on what the user actually asked for:
- Selector authoring query: "That data-testid at level 1 is a stable container, I'll use it for scoping"
- Debugging query: "There are 12 elements with that testid, probably a component copied without updating IDs"
- Refactoring query: "This pattern appears in 47 test files, needs careful migration"
Same tool output. Different interpretations. The architecture works because the capability layer stayed interpretation-free.
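To make this concrete, here's roughly what a fact-only ancestor walk looks like. This is a simplified sketch, not Verdex's actual implementation; it assumes it runs in the page context with a reference to the target element, and the maxLevels cap is arbitrary.

interface AncestorFact {
  level: number;                        // 1 = direct parent of the target
  tagName: string;
  attributes: Record<string, string>;
  childElements: number;
}

function collectAncestors(target: Element, maxLevels = 10): AncestorFact[] {
  const ancestors: AncestorFact[] = [];
  let current = target.parentElement;
  let level = 1;
  while (current && level <= maxLevels) {
    const attributes: Record<string, string> = {};
    for (const attr of Array.from(current.attributes)) {
      attributes[attr.name] = attr.value;   // facts only: no guessing, no scoring
    }
    ancestors.push({
      level,
      tagName: current.tagName.toLowerCase(),
      attributes,
      childElements: current.childElementCount,
    });
    current = current.parentElement;
    level++;
  }
  return ancestors;
}

Nothing in that output says "product card" or "recommended." Every field is something the LLM can verify against the page and interpret for the task at hand.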
Part 2: Why This Matters
2.1 The Composition Problem
When tools return interpretations, they're making decisions that are hard to reverse.
The LLM has to either accept the interpretation or fight against it. And fighting against confident tool output is something LLMs struggle with—they tend to defer to explicit assertions in their context.
Here's a concrete example:
The tool says "type": "product-card" with "confidence": 0.85.
Now the user asks: "Find all user profile cards on this page."
The LLM sees the tool's interpretation. It has two options:
- Trust it (wrong): Generate selectors for product cards when the user wants profiles
- Fight it (awkward, token-expensive, unreliable): Explain why the tool's interpretation doesn't match the query
Compare this to raw facts. The tool returns "data-testid": "product-card".
When the user asks about user profiles, the LLM looks at the actual page structure and realizes this testid is misleading or the wrong target. The LLM can adapt because no interpretation was baked in.
The principle: Tools that return facts compose across different tasks. Tools that return interpretations optimize for one task and break others.
2.2 The Human API Trap
This is deeply counterintuitive because we've spent decades optimizing APIs for human developers.
Human-facing APIs should be high-level and opinionated. A method like page.selectDropdown("Country", "United States") is beautiful for humans because it handles all the fiddly details—finding the select element, opening it, locating the option, handling JavaScript events.
But LLMs work differently.
The LLM actually does better with raw primitives:
page.click('select[name="country"]')
page.click('option:has-text("United States")')
The LLM can adapt this pattern to dropdowns it's never seen before—custom components, weird frameworks, non-standard implementations. The high-level abstraction only works for exactly the cases it was designed to handle. The low-level primitives work for cases nobody anticipated.
Here's why: LLMs excel at pattern matching and adaptation. Give them low-level building blocks and they'll figure out how to compose them for novel situations. Give them high-level abstractions and they'll try to force-fit them into scenarios where they don't quite work.
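For example, the same two primitives adapt to a custom, div-based dropdown that has no native select element at all. The selectors below are hypothetical, just to illustrate the pattern:

page.click('[data-testid="country-dropdown-trigger"]')
page.click('[role="option"]:has-text("United States")')

No selectDropdown abstraction could have anticipated that markup, but the primitives compose around it.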
This is why resolve_container returns ancestor chains with raw attributes instead of "here's your recommended container." The LLM can look at those ancestors and decide which level makes sense for this specific selector on this specific page for this specific task.
2.3 When Interpretation Belongs in Tools (The Rare Cases)
I need to acknowledge that there are legitimate cases where tools should interpret.
Deterministic operations where there's one correct answer:
- Parsing structured data formats: A tool that parses JSON should return the parsed structure, not raw bytes
- Mathematical computations: A tool that calculates compound interest should return the final number, not intermediate steps
- Syntax validation: A tool that checks selector syntax can return true/false
Domain-specific validation where correctness is well-defined:
- Email validation: "valid email" has a specification
- Schema compliance: Checking if data matches a defined structure
- Type checking: Verifying if a value meets type constraints
The key distinction: these interpretations are task-independent.
Parsing JSON means the same thing regardless of what you're building. Validating an email address has the same definition across all contexts. Mathematical operations produce identical results regardless of the user's goal.
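Selector syntax checking falls in the same bucket: the answer is identical for every task, so it's safe to bake into a tool. A minimal sketch, assuming it runs somewhere a DOM is available:

function isValidSelector(selector: string): boolean {
  try {
    // querySelector throws a SyntaxError for invalid selector syntax
    document.createDocumentFragment().querySelector(selector);
    return true;
  } catch {
    return false;
  }
}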
Contrast with Verdex's primitives:
Whether a div is a "good container" depends entirely on context:
- What you're selecting
- Why you need stability
- What alternatives exist
- How the rest of the page is structured
- Whether you're authoring, debugging, or refactoring
That interpretation must stay with the LLM because only the LLM has the user's query, the task context, and the surrounding conversation.
Part 3: Design Principles
3.1 Four Principles for Fact-Based Tools
Return the Minimum Sufficient Facts
Don't return everything about an element—that's information overload. Return just enough for the LLM to make the next decision.
For ancestors, that means:
- Level (depth in tree)
- Tag name
- Attributes
- Child count
Not:
- All text content
- All descendants
- Computed styles
- Event listeners
Make Facts Actionable
Each piece of information should enable a decision. If you're returning a fact that the LLM can't use, you're creating noise.
The childElements count in resolve_container is actionable—it helps the LLM decide if this level makes sense as a container. A container with 1 child is probably too specific. A container with 50 children might be too broad.
Preserve Context for Interpretation
Include facts that affect interpretation even if they seem like metadata.
The level number in ancestor chains matters because "container at level 1" vs "container at level 5" suggests different stability characteristics. Level 1 is close to the target (specific, might break if structure changes). Level 5 is far from the target (stable, might be too broad).
Use Consistent, Predictable Formats
The LLM should never have to guess whether an attribute value will be a string, array, or object.
Bad:
// Sometimes attributes is an object
{"attributes": {"class": "card"}}

// Sometimes it's an array
{"attributes": [["class", "card"]]}

Good:

// Always an object
{"attributes": {"class": "card"}}
3.2 The Composability Test
Here's a practical test for whether your tool design is fact-based or interpretation-heavy:
Take your tool's output and ask: Can I use these facts for a task the tool wasn't designed for?
Let's walk through examples with Verdex. resolve_container returns ancestor chains with attributes. Original task: authoring selectors.
Novel tasks the same output supports:
- Visual testing: Screenshot elements at specific container depths
- Debugging: Understand why a selector stopped working after refactor
- Component analysis: Identify repeated patterns across the codebase
- Accessibility audit: Find containers missing proper ARIA structure
None of these tasks was on my mind when I designed resolve_container. But the tool's output supports all of them because it returns structural facts, not selector-authoring interpretations.
Compare to the interpreted version that returned "semanticType": "product-card". That interpretation optimizes for selector authoring but provides zero value for:
- Visual testing (what's a "product card" for screenshots?)
- Accessibility audits (semantic type doesn't indicate ARIA compliance)
- Component analysis (can't identify patterns without structural details)
The interpreted output is less composable because it baked in assumptions about the task.
3.3 The Knowledge Layer Handles Interpretation
This connects back to the two-layer architecture I wrote about in "Teaching LLMs to Compose."
Layer 1 (tools) provides raw structural facts—tags, attributes, relationships, positions.
Layer 2 (Skills) teaches interpretation—when this pattern appears, consider it a stable container. When these attributes exist, prioritize them for scoping.
The separation enables different interpretations for different tasks:
Verdex Skill for selector authoring teaches:
"When you see data-testid at level 1, that's your best container. Scope your selector there."
Hypothetical Skill for accessibility auditing teaches:
"When you see a div at level 1 without an ARIA role, that's a potential accessibility issue. Check if it needs role='region' or similar."
Same tool output. Different Skills. Different interpretations.
The architecture works because the capability layer stayed interpretation-free, allowing the knowledge layer to provide task-specific guidance.
Part 4: What This Means for Builders
4.1 Practical Changes for Tool Builders
If you're building LLM tools right now, here's what changes:
Stop Trying to Make Tools "Smarter"
Every time you add interpretation to tool output, you're reducing composability. Resist the urge to be helpful through interpretation. Be helpful through better facts.
Early Verdex tried to help by identifying "product cards" and "stable containers." This helped for exactly one task (selector authoring) and hurt everything else. Removing the helpfulness made the tool universally useful.
Test Tools on Unintended Tasks
If your tool output only makes sense for one specific workflow, you've probably baked in too much interpretation.
Building a file system exploration tool? Test it on:
- Code navigation
- Security auditing
- Dependency analysis
- Documentation generation
If the same output works across all four, you've got good primitives.
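For a sense of what those primitives might look like, here's a sketch of a fact-only directory listing in Node.js. It's purely illustrative, not a tool from this article: it returns paths, sizes, kinds, and timestamps, and deliberately offers no opinion about which files matter.

import { promises as fs } from "fs";
import * as path from "path";

interface FileFact {
  path: string;
  sizeBytes: number;
  kind: "file" | "directory" | "symlink" | "other";
  modifiedAt: string;                   // ISO timestamp; no "importance" score, no "project root" guess
}

async function listFacts(dir: string): Promise<FileFact[]> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const facts: FileFact[] = [];
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name);
    const stats = await fs.lstat(fullPath);
    facts.push({
      path: fullPath,
      sizeBytes: stats.size,
      kind: entry.isFile() ? "file"
        : entry.isDirectory() ? "directory"
        : entry.isSymbolicLink() ? "symlink"
        : "other",
      modifiedAt: stats.mtime.toISOString(),
    });
  }
  return facts;
}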
Move Strategic Knowledge to Skills
Don't try to teach best practices through tool behavior. Teach them through explicit, progressive knowledge layers.
The pattern that works: dumb tools that return facts, smart Skills that teach composition.
Trust the LLM to Interpret
This feels wrong at first. You know your domain better than the model. You've spent months understanding what makes a good container, what patterns are stable, what heuristics work.
But the model has something you don't: the user's actual query and the full task context.
When the user asks "find the button in the iPhone card," the LLM knows:
- They're looking for a button (not analyzing structure)
- It's in a specific card (needs unique identification)
- The card contains iPhone (text-based filtering)
Your tool can't know this because the tool runs before the LLM decides what to do. Give the LLM the facts. Let it compose them contextually.
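Given raw structural facts, one plausible composition the LLM might produce for that query looks like this, written with Playwright-style locators. The data-testid comes from the earlier example; the button label is hypothetical:

page.locator('[data-testid="product-card"]')
  .filter({ hasText: 'iPhone' })
  .getByRole('button', { name: 'Add to cart' })
  .click()

The card scoping, the text filter, and the role-based button lookup all come from the user's query plus the facts the tool returned, not from anything the tool decided in advance.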
4.2 The Counterintuitive Result
Making tools "dumber"—stripping out helpful interpretations, returning raw facts—makes LLM agents smarter.
The agents can adapt to novel scenarios, compose across tasks, and learn from failures because the tools didn't lock them into specific interpretations.
This is the opposite of how we've built developer tools for decades:
Human-facing APIs should abstract away details and provide high-level operations. Humans benefit from selectDropdown(field, value) because we don't want to think about DOM traversal and event handling.
LLM-facing tools should expose details and provide low-level primitives. LLMs benefit from raw click() and getText() because they can compose these primitives in ways we never anticipated.
4.3 The Principle Generalizes
This philosophy extends beyond Verdex and DOM exploration.
Whether you're building tools for:
- File systems (return paths, sizes, types—not "important files" or "project roots")
- Databases (return schemas, relationships, types—not "main tables" or "key entities")
- APIs (return endpoints, parameters, types—not "recommended flows" or "common patterns")
- Code analysis (return AST nodes, references, scopes—not "bad code" or "refactoring candidates")
The lesson holds: return facts, let Skills teach interpretation, and trust the model to compose.
Your tools should be dumb enough to be universally useful. The intelligence lives in how the LLM composes them—guided by Skills that understand context, informed by the user's query, and adapted to the specific task at hand.
Closing Thoughts
I spent weeks making Verdex's tools "smarter" before realizing I was solving the wrong problem. The tools didn't need to be smarter—they needed to be simpler.
Every interpretation I removed made the system more reliable. Every heuristic I deleted made the tools more composable. Every "helpful" feature I stripped out made the LLM better at its job.
The counterintuitive truth: dumb tools enable smart agents.