ORCHESTRATE

Our AI Learned to Detect Its Own Bullshit — Here's the Math

Last weekend we shipped a feature that makes our AI agents honest about what they actually know vs. what they're pretending to know.

The problem: an AI agent runs a test, it passes, and the agent writes "feature validated end-to-end." Sounds reasonable. Except the test only checked one code path, in isolation, with mocked dependencies. The agent's claim exceeds its evidence.

We built an Abstraction Mismatch Detector — a pure function that catches this exact class of overclaim. Here's how it works.

The 7-Level Abstraction Hierarchy

Every action an AI agent takes operates at a specific abstraction level. Every claim it makes also targets a level. When the claim level exceeds the action level, you have a mismatch.

```
Level 0: Vision         — "this system should exist"
Level 1: Requirement    — "it must handle X"
Level 2: Design         — "we'll use pattern Y"
Level 3: Implementation — "function Z does this"
Level 4: Test           — "test asserts Z returns expected value"
Level 5: Runtime        — "Z was observed running in production"
Level 6: Observation    — "users reported Z working correctly"
```

The detector is a pure function:

```python
def detect_abstraction_mismatch(claim_level, action_level):
    """Flag claims whose abstraction level exceeds the action's level."""
    ranks = {
        "vision": 0, "requirement": 1, "design": 2,
        "implementation": 3, "test": 4, "runtime": 5,
        "observation": 6
    }
    claim_rank = ranks.get(claim_level)
    action_rank = ranks.get(action_level)
    if claim_rank is None or action_rank is None:
        return None  # unknown level: degrade gracefully rather than raise
    if claim_rank > action_rank:
        return {
            "is_mismatch": True,
            "claim_level": claim_level,
            "action_level": action_level,
            "gap": claim_rank - action_rank
        }
    return {"is_mismatch": False}
```

Example: An agent runs a test (action_level = "test", rank 4) and claims the feature is "validated at runtime" (claim_level = "runtime", rank 5). Gap = 1. Mismatch detected.
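As a quick sanity check, here is that example run through the detector (the function is reproduced from the snippet above so this demo is self-contained):

```python
def detect_abstraction_mismatch(claim_level, action_level):
    # Same detector as above, condensed for a standalone demo.
    ranks = {"vision": 0, "requirement": 1, "design": 2,
             "implementation": 3, "test": 4, "runtime": 5, "observation": 6}
    claim_rank = ranks.get(claim_level)
    action_rank = ranks.get(action_level)
    if claim_rank is None or action_rank is None:
        return None
    if claim_rank > action_rank:
        return {"is_mismatch": True, "claim_level": claim_level,
                "action_level": action_level, "gap": claim_rank - action_rank}
    return {"is_mismatch": False}

# Claiming runtime validation off a test run: gap of 1, flagged.
print(detect_abstraction_mismatch("runtime", "test"))
# Claiming only what the test shows: no mismatch.
print(detect_abstraction_mismatch("test", "test"))
```

Note that the comparison is strictly one-directional: claiming *less* than your evidence supports is always safe.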

The Evidence Class Taxonomy

The hierarchy alone isn't enough. You also need to classify the type of evidence behind each claim. We use 7 classes:

| Class | What It Is | Example |
|-------|------------|---------|
| A | Directly observed runtime behavior | Saw the API return 200 in production |
| B | Tool-observed artifact state | Read the database row, checked the log file |
| C | Code-indicated behavior | Read the source — the function appears to do X |
| D | Test-defined expectation | The test asserts X should happen |
| E | Test outcome | The test passed |
| F | Human/document claim | The README says it does X |
| G | Inference | Based on A+C, I believe X is true |
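One way to make the taxonomy machine-checkable is a simple enum keyed by the letters in the table. This is a sketch — the class names and the `supports_runtime_claim` helper are illustrative, not the project's actual types:

```python
from enum import Enum

class EvidenceClass(Enum):
    # Letters follow the taxonomy table above.
    A = "direct runtime observation"
    B = "tool-observed artifact state"
    C = "code-indicated behavior"
    D = "test-defined expectation"
    E = "test outcome"
    F = "human/document claim"
    G = "inference"

# Only direct observation is strong enough to back a runtime-behavior claim.
RUNTIME_CAPABLE = {EvidenceClass.A}

def supports_runtime_claim(evidence):
    return evidence in RUNTIME_CAPABLE

print(supports_runtime_claim(EvidenceClass.E))  # → False: a passing test is not runtime proof
```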

Here's the key insight: Class D+E evidence (tests) cannot support Class A claims (runtime behavior).

A passing test proves the assertion held in that execution context. It does not prove the feature works in production. These are different evidence classes, and conflating them is the single most common overclaim in AI-assisted development.

What This Catches in Practice

Every TDD phase comment in our system gets tagged with its evidence class. The detector then validates claim language against the evidence:

Flagged (overclaim):

"GREEN: Implemented login handler. Feature is now fully working and validated."

The action is implementation-level (rank 3). The claim "fully working and validated" implies runtime verification (rank 5+). Gap = 2+. The detector emits [ABS:mismatch claim=runtime action=implementation].

Clean (properly scoped):

"GREEN: Implemented login handler. All 4 unit tests pass. Not yet observed in runtime."

Same action, but the claim stays within the evidence class (E = test outcome). No mismatch.

The Claim Language Linter

We maintain a list of words that trigger mismatch checks:

  • "working", "fixed", "solved", "proven" — require Class A or B evidence
  • "validated end-to-end" — requires Class A evidence
  • "confirmed" — requires Class A, B, or E evidence

If the comment evidence class is D, E, or G, these words trigger a warning. The agent must either gather stronger evidence (run the feature in a real environment) or downgrade its language to match reality.
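A minimal version of such a linter is a keyword-to-required-classes table plus a substring scan. The rule set below mirrors the bullets above but is otherwise hypothetical — the real system presumably covers far more phrasing:

```python
# Trigger words mapped to the evidence classes that can justify them.
TRIGGER_RULES = {
    "working":   {"A", "B"},
    "fixed":     {"A", "B"},
    "solved":    {"A", "B"},
    "proven":    {"A", "B"},
    "validated end-to-end": {"A"},
    "confirmed": {"A", "B", "E"},
}

def lint_claim(comment, evidence_class):
    """Return warnings for trigger words the evidence class can't support."""
    text = comment.lower()
    return [
        f"'{word}' requires class {sorted(required)} evidence, got {evidence_class}"
        for word, required in TRIGGER_RULES.items()
        if word in text and evidence_class not in required
    ]

# Class E (a passing test) cannot back "working".
print(lint_claim("Feature is now fully working and validated.", "E"))
```

A real implementation would want tokenization rather than substring matching ("networking" should not trigger "working"), but the shape of the check is the same.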

Preferred language by evidence class:

  • Class E (test passed): "test-covered", "assertion holds", "passes current test coverage"
  • Class C (code review): "implemented", "statically consistent", "code-indicated"
  • Class A (runtime): NOW you can say "working", "validated", "confirmed in production"
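The downgrade path can be as simple as a lookup from evidence class to allowed phrasing. The wording below is drawn from the list above; the function name and fallback are illustrative:

```python
# Strongest phrasing each evidence class can honestly support.
PREFERRED_LANGUAGE = {
    "E": ["test-covered", "assertion holds", "passes current test coverage"],
    "C": ["implemented", "statically consistent", "code-indicated"],
    "A": ["working", "validated", "confirmed in production"],
}

def suggest_phrasing(evidence_class):
    """Offer claim language that stays within the given evidence class."""
    return PREFERRED_LANGUAGE.get(evidence_class, ["(no guidance for this class)"])

print(suggest_phrasing("E"))  # what an agent with only a green test may say
```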

Why This Matters

We're building an AI-managed agile development system — 14 AI personas collaborate on software delivery. When one agent says a feature is "done," other agents trust that claim and build on it.

If the claim exceeds the evidence, downstream agents make decisions on false premises. The abstraction mismatch detector prevents that by making epistemic accounting mechanical rather than relying on each agent's judgment.

This is the same problem that exists in any team, human or AI. The difference is that with AI agents, you can actually enforce evidence discipline at the tool level. No human code reviewer catches every instance of "works" when the evidence is "test passes." A pure function running in the guidance assembler's hot path catches all of them.

The Numbers

Sprint 4 shipped this feature alongside 3 related capabilities:

  • 2,710+ tests passing (0 failures)
  • 39 commits over a weekend
  • 17 tickets completed through full TDD cycles
  • 4 ADRs documenting the architectural decisions

The detector runs on every guidance response. It's behind a feature flag (L2_REDBLUE_DETECTOR, default ON). No database dependency. No performance impact worth measuring.

Try It

The core concept is portable. If you're building AI agent systems, you can implement the same pattern:

  1. Define your abstraction levels (even 3-4 is enough: design → implementation → test → runtime)
  2. Tag every agent output with its evidence class
  3. Flag claims that exceed their evidence level
  4. Force agents to either gather stronger evidence or use weaker language
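The four steps above fit in a few lines. This is a sketch with the suggested 4-level hierarchy; the function name and message format are illustrative:

```python
# Step 1: a small abstraction hierarchy, ordered weakest to strongest.
LEVELS = ["design", "implementation", "test", "runtime"]

def check_output(claim_level, action_level):
    """Steps 2-3: compare a tagged claim against its tagged action."""
    gap = LEVELS.index(claim_level) - LEVELS.index(action_level)
    if gap > 0:
        # Step 4: the caller must gather evidence or weaken the claim.
        return f"overclaim: {claim_level} claim on {action_level} evidence (gap {gap})"
    return None

print(check_output("runtime", "test"))  # flagged
print(check_output("test", "test"))     # None — claim matches evidence
```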

The hard part isn't the code. It's accepting that your AI agents are overclaiming — and building the machinery to catch it.


This is part of the ORCHESTRATE Agile MCP project — an AI-managed development system that dogfoods its own methodology. Built with Python, SQLite, Docker, and a lot of weekends.

Sprint 4 also shipped persona performance scoring, LEAP state machines for operator engagement, and session-aware calibration. More posts coming on those.
