GEM² Inc.
Human in the loop doesn't scale. Human at the edge does.

This is Part 2 of our AI verification series. Part 1: We truth-filtered our own AI research →


AI is not unreliable. AI has a plausibility complex.

Stop blaming AI for hallucinating. Start asking why it happens.

AI doesn't fail because it's wrong. In our experience, it fails because it's optimized to sound right. Major LLMs are trained to produce responses that satisfy humans — fluent, confident, structured. That's plausibility. It's not the same as honesty.

We call this the plausibility complex: the tendency we've observed across Claude, ChatGPT, and Gemini to produce answers that satisfy rather than answers that prove themselves. If you want AI to become a reliable engineering partner, you need to free AI from this complex — not by changing how it generates, but by changing how it's held accountable.

After 20 months of building production systems with AI — shipping real code, generating real reports, running real analysis through Claude, ChatGPT, and Gemini — we've arrived at one conclusion:

AI often knows more than it reveals. But it's optimized to produce plausible answers, even when the evidence is weak.

All three providers exhibit this bias: confident responses even when the evidence is thin or absent. Ask for a market analysis and you get precise numbers. Ask for a forecast and you get confident projections. Ask for a technical assessment and you get authoritative claims.

The output looks right. Reads right. Feels right.

In our experiment, three AI providers wrote research reports about our own product. All three scored above 0.70 on logical consistency. All three scored below 0.30 on source attribution. The reasoning was coherent. The evidence was missing.


Hallucination is not a bug to fix

The industry treats hallucination as a defect — something to patch, filter, or suppress. We see it differently.

In our experience building long-running AI development workflows, the pattern that causes the most damage isn't random fabrication. It's context drift — what happens when:

  • Long context windows accumulate similar topics in different framings
  • Cross-session persistence forces repeated summarization, losing nuance each time
  • Dense context makes adjacent-but-different concepts blur together

We've tried every mitigation: RAG, CLAUDE.md configuration files, context caching, careful prompt engineering. Each helps. None solves it completely.

Why? Because we can't control what happens inside the model's reasoning process. We can shape the input. We can evaluate the output. But the inference itself is opaque.

This isn't a criticism — it's an observation. And it led us to a different question.


What if AI could flag its own uncertainty?

Here's what we discovered through months of experimentation:

When we explicitly asked AI to concentrate on epistemic reasoning — to classify each claim as grounded, inferred, or extrapolated — it did.

Not perfectly. Not consistently across sessions. But measurably better than when we didn't ask.

The evidence from our dogfooding experiment:

| Provider | Without epistemic constraints | With TPMN-grounded prompt |
|----------|-------------------------------|---------------------------|
| Claude   | 18% truth score               | 77% truth score           |
| ChatGPT  | 28% truth score               | ~48% truth score          |
| Gemini   | 12% truth score               | ~35% truth score          |

Same task. Same providers. The only difference: a formal specification that told the AI to tag its own confidence level and flag claims it couldn't trace to evidence.

The AI didn't become smarter. It became more honest about what it didn't know.

That's what freeing AI from the plausibility complex looks like in practice: not changing the model, but giving it a formal reason to be honest.
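
For illustration, here is a minimal sketch of what claim-level epistemic tagging can look like downstream. Everything in it is our own toy construction for this post: the label names, the `Claim` class, and `flag_untraceable` are hypothetical, not TPMN's actual format.

```python
from dataclasses import dataclass

# Hypothetical labels mirroring the grounded / inferred / extrapolated
# distinction described above; not the official TPMN vocabulary.
GROUNDED, INFERRED, EXTRAPOLATED = "grounded", "inferred", "extrapolated"

@dataclass
class Claim:
    text: str
    label: str            # one of the three epistemic labels
    evidence: list[str]   # sources the claim can be traced to

def flag_untraceable(claims: list[Claim]) -> list[Claim]:
    """Return claims that assert more than their evidence supports."""
    return [c for c in claims if c.label != GROUNDED and not c.evidence]

claims = [
    Claim("Revenue grew 12% in Q3", GROUNDED, ["quarterly-report.pdf"]),
    Claim("Growth will continue through 2026", EXTRAPOLATED, []),
]
print([c.text for c in flag_untraceable(claims)])
```

The point is not the data structure; it's that once claims carry labels, "confidence" becomes something you can query rather than something you infer from tone.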


But here's the catch: same AI, same session, limited honesty

An AI that generates an answer and then critiques that answer in the same session has a structural problem: it's trained to be plausible. Asking it to undermine its own plausibility is asking it to work against its training signal.

We observed this directly. When we asked AI to generate a report AND verify it in the same conversation, the verification was consistently softer than when a separate AI session performed the audit.

This is why TPMN Checker is a separate service, not a prompt technique.

Prompting tries to change AI's behavior. Verification changes AI's accountability. Different problem, different solution.

The checker runs as an isolated Sovereign AI Service — a dedicated AI agent with one job: audit other AI output against a formal specification. It doesn't know what the original AI "intended." It only sees the output and the contract. It judges the result, not the process.


The Kantian insight

We can't see inside the model. We don't know which weights fired, which attention heads activated, which training examples influenced a particular token. Even the service providers — Anthropic, OpenAI, Google — face this challenge with their own models.

But we don't need to see inside.

We can judge the output. We can compare claims against evidence. We can detect when reasoning exceeds its basis. We can flag patterns that indicate drift.

This is what philosophers call the phenomenal approach: judge what appears, not what causes it. We can't read AI's mind. But we can read its work. And we can hold it to a standard.

That standard is TPMN — a notation with three prohibited reasoning patterns and seven evaluation dimensions. Not a guess about what the model "should" do. A formal specification of what the output must demonstrate.


Human at the edge, not in the loop

If AI is becoming an agent — not just a tool that responds, but a system that acts — then we need an accountability structure that matches.

Human in the loop means: review every output. Approve every action. The human is the bottleneck.

AI generates → Human reviews → Human approves → Output ships

This worked when AI outputs were occasional. It doesn't work when AI agents produce hundreds of outputs per day. The math:

  • 200 outputs/day × 3 minutes each = 10 hours of review per agent
  • 10 agents = 100 hours of review per day (roughly a dozen full-time reviewers)
  • 50 agents = your "safety net" costs more than the automation saves
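
The arithmetic behind those bullets is easy to check. The figures are the post's working assumptions, not measurements; the eight-hour workday used to convert hours into headcount is ours.

```python
OUTPUTS_PER_AGENT_PER_DAY = 200
MINUTES_PER_REVIEW = 3
WORKDAY_HOURS = 8  # assumption for converting review hours to headcount

def review_hours(agents: int) -> float:
    """Total daily human review time if every output is reviewed."""
    return agents * OUTPUTS_PER_AGENT_PER_DAY * MINUTES_PER_REVIEW / 60

def reviewers_needed(agents: int) -> float:
    """Full-time reviewers required to keep up with the queue."""
    return review_hours(agents) / WORKDAY_HOURS

print(review_hours(1))        # 10.0 hours per agent per day
print(reviewers_needed(10))   # 12.5 full-time reviewers for 10 agents
```

The cost grows linearly with agent count, and the human is the term that doesn't scale.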

Human at the edge means: define the standard. Let AI enforce it. Review exceptions.

AI generates → AI verifies (TPMN) → Passes? → Ships
                                   → Fails?  → Human reviews

The human doesn't disappear. The human moves to where they're most effective: defining what "honest reasoning" looks like, not reading every report.
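
That routing can be sketched in a few lines. The threshold, the function names, and the stand-in verifier below are all our own placeholders, not how a real checker scores output.

```python
from typing import Callable

PASS_THRESHOLD = 0.70  # assumed cutoff; a real deployment would calibrate this

def route(output: str, verify: Callable[[str], float]) -> str:
    """Ship verified output; escalate everything else to a human queue."""
    score = verify(output)
    if score >= PASS_THRESHOLD:
        return "ship"
    return "human_review"

# Stand-in verifier: pretend outputs containing a citation pass.
fake_verify = lambda text: 0.9 if "[source:" in text else 0.2

print(route("Latency fell 8% [source: bench.log]", fake_verify))  # ship
print(route("Latency will keep falling forever", fake_verify))    # human_review
```

The human only ever sees the second branch, which is what makes the pattern scale.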


This pattern already exists

Software engineering: Code passes through automated tests that humans defined. CI/CD enforces at scale. Humans review when tests fail. But what about AI-generated code itself — before it reaches the test suite?

Financial compliance: Transactions pass through compliance rules that humans wrote. Automated systems flag exceptions. Humans investigate the flags.

Manufacturing: Quality control systems catch defects using standards that humans set. Humans review edge cases and update standards.

AI output is the next domain where this pattern applies. And for developers specifically, there's an emerging practice pattern that makes this concrete — we'll get to that shortly.


The three requirements

1. A formal specification

Not heuristics. Not "does this look right?" A structured notation and grammar for what constitutes honest reasoning.

Three layers, one verification stack:

  • TPMN (Truth-Provenance Markup Notation) — the notation. Defines five epistemic claim states (⊢ ⊨ ⊬ ⊥ ?) and three prohibited reasoning patterns (SPT: snapshot→trend, local→global, thin→broad). What we mark.

  • TPMN-PSL (Prompt Specification Language) — the grammar. Compiles natural language prompts into verifiable specifications (MANDATEs). Defines the three-phase protocol (pre-flight, inline, post-flight) and three modes (strict, refine, interpolate). How we structure and verify.

  • TPMN Checker — the implementation. A Sovereign AI Service that runs the TPMN-PSL pipeline. 12 MCP tools. 6 domains. Returns a truth_score. What you install and use.

Analogous to HTTP (notation) → RFC 2616 (specification) → nginx (implementation). TPMN defines the rules. TPMN-PSL structures the protocol. The Checker enforces them.
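
To make the notation concrete, here is one way the five claim states and three prohibited patterns might be represented in code. This mapping is ours, not part of the spec, and the English glosses follow standard logic notation rather than the spec's own definitions.

```python
from enum import Enum

class ClaimState(Enum):
    """The five TPMN epistemic claim states (symbols from the notation).
    English names are our glosses from standard logic, not the spec's."""
    PROVEN = "⊢"        # conventionally: derivable
    ENTAILED = "⊨"      # conventionally: semantically entailed
    NOT_PROVEN = "⊬"    # conventionally: not derivable
    CONTRADICTED = "⊥"  # conventionally: contradiction
    UNKNOWN = "?"       # epistemic status unresolved

# The three prohibited SPT reasoning patterns named in the notation.
SPT_PATTERNS = {
    "S→T": "snapshot presented as a trend",
    "L→G": "local observation generalized globally",
    "Δe→∫de": "thin evidence stretched into a broad claim",
}

print([s.value for s in ClaimState])
```

Representing the states as an enum is what makes them machine-checkable: a verifier can require that every claim in an output carries exactly one of these tags.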

Open. CC-BY 4.0. Anyone can implement it.

2. An isolated verification agent

Not a prompt. Not an inline check. A separate Sovereign AI Service whose only job is auditing.

TPMN Checker is the reference implementation of TPMN-PSL. It runs as an isolated MCP service — 12 tools, 6 domains, 7 evaluation dimensions. It judges output against contracts. It doesn't generate, advise, or assist. It audits.

3. Human calibration

If AI grades AI, the grading is circular. The system needs an external standard.

Human Ground Truth. When users disagree with a score, that disagreement becomes calibration data. Humans define what "honest reasoning" means. AI enforces it at scale.
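
One way to picture that feedback loop is as a stream of disagreement records. The schema below is hypothetical, our sketch of the idea rather than the product's data model.

```python
from dataclasses import dataclass

@dataclass
class CalibrationRecord:
    """A human disagreement with a machine score, kept as calibration signal."""
    claim: str
    machine_score: float
    human_score: float

    @property
    def disagreement(self) -> float:
        return abs(self.machine_score - self.human_score)

def worst_miscalibrations(records, n=1):
    """Surface the claims where human and machine diverge most."""
    return sorted(records, key=lambda r: r.disagreement, reverse=True)[:n]

records = [
    CalibrationRecord("AI verifies AI fairly", 0.8, 0.3),
    CalibrationRecord("Tests were run on 3 providers", 0.9, 0.85),
]
print(worst_miscalibrations(records)[0].claim)
```

The largest disagreements are exactly where the external standard needs updating, which keeps the grading anchored to humans rather than circling back to AI.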


Dogfooding: we verified the thesis behind this article

Before writing this post, we wrote down our raw thesis — the unfiltered thinking that drives everything above. Here's the core of it:

"All top-level AIs are trained to generate plausible results to satisfy humans. Hallucination is not a bug — it's a structural consequence of context drift. AI itself knows all the decision weights clearly. If we could make AI remind itself of the legitimate MANDATE area, AI could detect and fix results by itself. We validated this through various heuristic experiments over 20 months. No absolute truth score is possible. Human in the loop is nonsense."

Then we ran it through gem2_truth_filter.

Raw thesis: 18%. Our own tool scored our own thinking at the same level as unverified AI output. It caught three overclaims:

  • L→G: "All AIs are trained for plausibility" → universal claim without citing training documentation
  • S→T: "Hallucination is structural" → presented as permanent truth without distinguishing error types
  • Δe→∫de: "Validated through experiments" → claimed validation without methodology or data


Cross-provider verification of the raw thesis:

| Dimension           | Claude  | OpenAI  | Gemini  |
|---------------------|---------|---------|---------|
| Truth Score         | 18%     | 13%     | 25%     |
| Source Attribution  | 0.10 ❌ | 0.08 ❌ | 0.10 ❌ |
| Evidence Quality    | 0.30 ❌ | 0.18 ❌ | 0.20 ❌ |
| Claim Grounding     | 0.20 ❌ | 0.20 ❌ | 0.30 ❌ |
| Logical Consistency | 0.70 ⚠️ | 0.68 ⚠️ | 0.50 ⚠️ |
| Scope Accuracy      | 0.20 ❌ | 0.22 ❌ | 0.20 ❌ |
| Extrapolation Risk  | 70%     | 88%     | 95%     |
| SPT Violations      | 3       | 10      | 3       |

Three providers. All failed it. OpenAI was the harshest — 13% with 10 SPT violations. Gemini flagged 95% extrapolation risk.

We fixed each overclaim. Scoped the claims. Added evidence. Qualified the assertions.

Cross-provider verification of the fixed version:

| Dimension           | Claude  | OpenAI  | Gemini  |
|---------------------|---------|---------|---------|
| Truth Score         | 59%     | 40%     | 90%     |
| Source Attribution  | 0.90 ✅ | 0.28 ❌ | 0.85 ✅ |
| Evidence Quality    | 0.70 ⚠️ | 0.50 ⚠️ | 0.90 ✅ |
| Claim Grounding     | 0.60 ⚠️ | 0.58 ⚠️ | 0.95 ✅ |
| Logical Consistency | 0.80 ✅ | 0.82 ✅ | 0.95 ✅ |
| Scope Accuracy      | 0.50 ⚠️ | 0.47 ⚠️ | 0.85 ✅ |

Three providers. Three different scores. But all three agree: the fixed version is dramatically better.

Gemini — the harshest critic of our raw thesis (95% extrapolation risk) — scored the refined version at 90%. Its explanation: "This content demonstrates excellent epistemic hygiene. The author explicitly bounds their claims to their own experience."

The scores differ. The diagnostic direction converges. That's cross-provider consensus in action.
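
"Direction over magnitude" can be expressed as a simple check. The sketch below uses the truth scores from the two tables above; the function itself is our illustration, not part of any tool.

```python
def directions_converge(before: dict[str, float], after: dict[str, float]) -> bool:
    """True if every provider moved its score the same way after revision."""
    deltas = [after[p] - before[p] for p in before]
    return all(d > 0 for d in deltas) or all(d < 0 for d in deltas)

# Truth scores from the raw-thesis and fixed-version tables above.
raw   = {"Claude": 0.18, "OpenAI": 0.13, "Gemini": 0.25}
fixed = {"Claude": 0.59, "OpenAI": 0.40, "Gemini": 0.90}

# Scores disagree in magnitude, but all three providers agree on direction.
print(directions_converge(raw, fixed))  # True
```

Absolute scores vary by provider; the agreement worth trusting is that every provider moved the same way.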

Our raw thesis overclaimed — just like every unverified AI output. The tool caught it. We fixed it. This article is the refined version.

That's the loop: write → verify → fix → cross-verify → publish.
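
The loop can be sketched as a function. The scorer and the fix step below are deliberately silly stand-ins (scoring by presence of a scope qualifier) and the threshold is arbitrary; only the control flow mirrors the loop described above.

```python
def write_verify_publish(draft, verify_fns, fix, threshold=0.5):
    """write -> verify -> fix -> cross-verify -> publish, as described above."""
    # Verify with the first checker; fix if it fails.
    if verify_fns[0](draft) < threshold:
        draft = fix(draft)
    # Cross-verify: every provider must clear the threshold before publishing.
    if all(v(draft) >= threshold for v in verify_fns):
        return ("publish", draft)
    return ("revise", draft)

# Stand-ins: score by presence of a scope qualifier, fix by adding one.
scorer = lambda text: 0.9 if "in our experience" in text else 0.2
add_scope = lambda text: "in our experience, " + text

print(write_verify_publish("this always works", [scorer] * 3, add_scope))
```

The structural point is that publication is gated on cross-provider agreement, not on any single checker's score.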


Try it on your own output

Step 1. Paste any AI output into your conversation.

Step 2. Ask: "Verify this by gem2 truth filter."

Step 3. Read the score. See what's grounded, what's extrapolated.

Step 4. Ask: "Create a grounded replacement prompt using gem2 contract writer."

Step 5. Ask AI to proceed with the new prompt. Watch what you get.

Your AI picks the right tool from 12 available MCP tools automatically.

Try it for free.

Get started at gemsquared.ai


What's next: Contract Coding

If "human at the edge" is the philosophy, what does it look like in practice — for developers writing code every day?

Three common patterns in AI-assisted coding:

Prompt coding   → you guide the model
Vibe coding     → you hope it works
Contract coding → AI defines the spec, AI verifies the output

In our next post, we'll show how TPMN Checker's existing tools — tpmn_contract_writer, tpmn_p_check (SDLC domain), and tpmn_p_check_compose — already support a workflow where AI generates formal specifications, produces code against them, and truth-filters the result before you ship.

Not for plausibility. For epistemic traceability.

Next in the series: "Contract Coding: what comes after vibe coding" → (coming this week)


📺 Watch: Three AIs. Three Answers. None of them warned you.

📝 Read Post 1: We truth-filtered our own AI research

TPMN-PSL Specification (open, CC-BY 4.0)
GitHub
gemsquared.ai

TPMN-PSL is an open specification — not a product. If you believe AI outputs should be auditable, read the spec, open an issue, or submit a PR. The standard gets better when more people challenge it.
