Three AIs analyzed our product. None passed the truth filter.

What's hiding in your AI output? Now you can see it.

We asked three AI providers to research our own product.
Then we ran every output through our own truth filter.
The results surprised us.

📺 See how the truth filter works in practice: Three AIs. Three Answers. None of them warned you.

"TPMN Checker is not scoring writing quality. It is scoring epistemic traceability." — from the video at [0:40]

Korea AI 2027 forecast — what the three AIs reported

We asked each provider the same question: "Forecast Korea's AI industry for 2027."

| | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Market size (2027E) | ₩4.46T (≈$3.3B) | ₩4.46T (≈$3.3B) | $10–15B |
| CAGR | 14.3% | ~14% | >25% |
| Gov't AI investment | $71.5B | Ongoing ⚠️ | $7B |
| Tone | Data-heavy, source-cited | Balanced, explicitly hedged | Bullish, narrative-driven |

Three reports. All confident. Two even agree on the headline number. But agreeing on the answer doesn't mean agreeing on the truth.

(Note: Truth scores are not absolute judgments. They reflect the epistemic traceability ratio at the moment of evaluation — how much of the reasoning can be traced to evidence. That's why we're building the calibration standard together with users. Learn more about Human Ground Truth.)

Verification results from the GEM² truth filter

| Dimension | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Truth Score | 59% | 21% | 24% |
| Source Attribution | 0.85 ✅ | 0.10 ❌ | 0.10 ❌ |
| Evidence Quality | 0.70 ⚠️ | 0.15 ❌ | 0.30 ❌ |
| Claim Grounding | 0.75 ⚠️ | 0.20 ❌ | 0.20 ❌ |
| Logical Consistency | 0.80 ✅ | 0.70 ⚠️ | 0.60 ⚠️ |
| Scope Accuracy | 0.65 ⚠️ | 0.40 ❌ | 0.30 ❌ |
| Extrapolation Risk | 40% | 80% | 80% |
| SPT Violations | 2 | 3 | 3 |

Same question. Same filter. Three different levels of honesty.


The dogfooding experiment

We build TPMN Checker — a truth filter for AI reasoning. To prove the tool works, we pointed it at ourselves. Five rounds. Same task. Measurable improvement.

Task: "Write a comprehensive technical and market analysis of GEM²-AI and its TPMN Checker product."

Providers: Claude (Anthropic), ChatGPT (OpenAI), Gemini (Google)

Evaluation: Each output scored by gem2_truth_filter across seven dimensions (a minimal scoring sketch follows the table):

| Dimension | What it catches |
| --- | --- |
| Source Attribution | Claims with no traceable evidence |
| Evidence Quality | Thin or outdated supporting data |
| Claim Grounding | Assertions presented as fact without basis |
| Temporal Validity | Stale data treated as current |
| Scope Accuracy | Local findings overgeneralized |
| Logical Consistency | Internal contradictions |
| Prompt Alignment | Does the output match what was asked? |
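
To make "scored across seven dimensions" concrete, here is a minimal sketch of how per-dimension scores could roll up into a single Truth Score. The dimension names come from the table above; the equal weighting and the ✅/⚠️/❌ thresholds are simplifying assumptions for illustration, not the exact weighting used by gem2_truth_filter.

```typescript
// Minimal roll-up sketch. Equal weights and the 0.75 / 0.5 thresholds are
// illustrative assumptions; the production filter may weight and threshold
// each dimension differently.

type Dimension =
  | "sourceAttribution"
  | "evidenceQuality"
  | "claimGrounding"
  | "temporalValidity"
  | "scopeAccuracy"
  | "logicalConsistency"
  | "promptAlignment";

type DimensionScores = Record<Dimension, number>; // each score in [0, 1]

// Composite Truth Score as a percentage (simple unweighted mean).
function truthScore(scores: DimensionScores): number {
  const values = Object.values(scores);
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return Math.round(mean * 100);
}

// Per-dimension verdict, mirroring the ✅ / ⚠️ / ❌ marks in the tables below.
function dimensionVerdict(value: number): "✅" | "⚠️" | "❌" {
  if (value >= 0.75) return "✅";
  if (value >= 0.5) return "⚠️";
  return "❌";
}
```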

Round 1: What the AIs reported — standard prompt, no constraints

We gave each provider a straightforward research request with no special instructions about sourcing or evidence quality. Here's what they produced:

| | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Market size (TAM) | "~$0.45B in 2024" (cited IDC) | "~$0.45B in 2024" (cited "one report") | "$2.34B in 2024" (cited Grand View Research) |
| Growth rate | "~25% CAGR" | "~25% CAGR" | "21.6% CAGR to 2030" |
| Key differentiator | "genuinely novel position" | "formal verifiability" | "infrastructure for trustworthy AI ecosystem" |
| Competitor depth | Named 7 competitors with features | Named 8 competitors with pricing | Named 5 competitors with feature table |
| Risks identified | Solo founder, pre-revenue, academic skepticism | Early stage, niche complexity, unproven ROI | Early documentation, computational overhead |
| Uniqueness claim | "no commercial product today combines..." | "formal approach brings rigor unmatched by competitors" | "not just a debugging tool; infrastructure for resilient AI" |

All three reports looked professional. Well-structured. Authoritative. The kind of output you'd confidently share with a stakeholder.

But we didn't share them. We verified them.


Round 2: Verification — GEM² truth filter exposes the gaps

We ran each report through gem2_truth_filter. Same tool, same criteria, same seven dimensions. All outputs evaluated using identical scoring logic across all providers.

| Dimension | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Truth Score | 18% | 28% | 12% |
| Source Attribution | 0.30 ❌ | 0.20 ❌ | 0.10 ❌ |
| Evidence Quality | 0.40 ⚠️ | 0.40 ❌ | 0.10 ❌ |
| Claim Grounding | 0.20 ❌ | 0.30 ❌ | 0.20 ❌ |
| Logical Consistency | 0.70 ⚠️ | 0.80 ✅ | 0.70 ⚠️ |
| Scope Accuracy | 0.20 ❌ | 0.50 ⚠️ | 0.20 ❌ |
| Extrapolation Risk | 80% | 70% | 90% |
| SPT Violations | 3 | 3 | 3 |

Every provider failed. Not one scored above 30%.

What the filter caught

Invented precision. Market size figures like "$0.45B in 2024 with 25% CAGR to 2033" — attributed to "one analyst report" without naming the firm, methodology, or publication date.

Unsupported superlatives. "Genuinely novel," "genuinely unoccupied commercially," "the only product that..." — without exhaustive competitive evidence.

Snapshot-to-trend errors. Current market conditions presented as permanent structural realities.

The SPT taxonomy flagged three patterns across all providers (illustrative examples follow the list):

  • S→T (Snapshot → Trend): treating current state as permanent identity
  • L→G (Local → Global): one data point generalized to universal claim
  • Δe→∫de (Thin → Broad): sweeping assertion from sparse evidence
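
The example claims below are ours, written to show the shape of each pattern; they are paraphrases of the failure mode, not quotes from the three reports.

```typescript
// Illustrative (invented) example claims for each SPT pattern; not quotes
// from the evaluated reports.
const sptExamples = [
  {
    pattern: "S→T (Snapshot → Trend)",
    claim: "The market grew 25% last year, so it will keep growing 25% through 2033.",
  },
  {
    pattern: "L→G (Local → Global)",
    claim: "One pilot customer cut review time by 40%, so the product cuts review time by 40%.",
  },
  {
    pattern: "Δe→∫de (Thin → Broad)",
    claim: "Based on one analyst report, the entire category is undergoing structural change.",
  },
];
```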

These aren't hallucinations — the facts weren't always wrong. The reasoning was overclaimed. And no provider warned the reader.


Round 3: Improved prompt — generated by GEM² tools

Here's the key: we didn't write the improved prompt ourselves.

We simply asked: "Create a robust, grounded research prompt using gem2 tools."

That's it. We didn't engineer the prompt. The system did. No TPMN knowledge required. No specification reading. The AI picked the right tool from the 12 available gem2 MCP tools (tpmn_contract_writer) and generated a prompt that enforced epistemic rules automatically.

The generated prompt included rules like these (a sketch of compliant tagging follows the list):

  • Every quantitative claim must include source name, publication date, and URL
  • "One survey" or "one report" is not acceptable attribution
  • Claims must be tagged as grounded (⊢), inferred (⊨), or speculative (⊬)
  • Anti-patterns explicitly listed and prohibited
  • If data is unavailable, write "not available from verified sources" — don't invent
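
Here is a minimal sketch of what the tagging and missing-data rules produce when a provider actually follows them. The claims are placeholders written for illustration, not text from the Round 4 reports.

```typescript
// Placeholder claims showing the epistemic tagging rule in practice;
// not content from the Round 4 outputs.
const taggedClaims = [
  { tag: "⊢", meaning: "grounded", claim: "Figure with named source, publication date, and URL." },
  { tag: "⊨", meaning: "inferred", claim: "Projection derived explicitly from the grounded claims above." },
  { tag: "⊬", meaning: "speculative", claim: "Scenario with no supporting data, flagged as such." },
];

// The missing-data rule in practice: when a number cannot be verified,
// say so instead of inventing one.
const missingDataFallback = "not available from verified sources";
```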

We verified the prompt itself with gem2_truth_filter before using it. The prompt scored 85%. Then we ran it through all three providers.


Round 4: Re-research — what the AIs reported with the grounded prompt

Same task. Same providers. Different prompt. Different results.

| | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Market size (TAM) | "Specific data not available from verified sources" | "~$0.45B (one report)" ⚠️ | "$2.34B (Grand View Research, 2024)" |
| Growth rate | Not stated — insufficient evidence | "~25% CAGR" ⚠️ | "21.6% CAGR (Grand View Research)" |
| Key differentiator | "Four observable features" — listed with sources | "Formal verifiability unmatched" ⚠️ | "Granular truth state classification" |
| Claims tagged? | ✅ Every claim marked ⊢, ⊨, or ⊬ | ❌ No epistemic tagging | Partial — some sections tagged |
| Limitations section? | ✅ 7 specific gaps acknowledged | ❌ Generic methodology note | ✅ Listed 4 limitations |
| Unsourced numbers? | 0 — wrote "not available" instead | Multiple — "92% of Fortune 500" without source | Some — market figures cited, incident costs not |

The difference was visible immediately. One provider followed every rule. The others improved but couldn't fully resist the instinct to fill gaps with confident-sounding assertions.


Round 5: Re-verification — truth filter confirms the improvement

We ran all three re-researched outputs through the same truth filter.

| Dimension | Claude | ChatGPT | Gemini |
| --- | --- | --- | --- |
| Truth Score | 77% | ~48% | ~35% |
| Source Attribution | 0.90 ✅ | 0.10 ❌ | 0.60 ⚠️ |
| Evidence Quality | 0.85 ✅ | 0.20 ❌ | 0.30 ❌ |
| Claim Grounding | 0.90 ✅ | 0.30 ❌ | 0.40 ⚠️ |
| Logical Consistency | 0.90 ✅ | 0.90 ✅ | 0.80 ✅ |
| Scope Accuracy | 0.85 ✅ | 0.40 ⚠️ | 0.50 ⚠️ |
| SPT Violations | 0 | 3 | 4 |

The improvement, measured

| Provider | Round 2 (before) | Round 5 (after) | Improvement |
| --- | --- | --- | --- |
| Claude | 18% | 77% | +59 points |
| ChatGPT | 28% | ~48% | +20 points |
| Gemini | 12% | ~35% | +23 points |

Every provider improved. Structured epistemic instructions produce measurably more reliable output. This isn't theory — it's six verified data points from the same tool, same criteria, same task.


What the data shows

The prompt improved every provider — but couldn't fix the instinct

Even with explicit anti-patterns listed — "PROHIBITED: citing 'one report' without naming it" — two out of three providers did it anyway.

The generated prompt said: "If you cannot provide source name, date, and URL, write 'data not available from verified sources' instead."

One provider wrote "data not available." The other two invented attributions.

The prompt improved the scores. It couldn't fix the instinct to overclaim.

This isn't a writing quality score

This was one of the most important findings — and the core message of our video. All three providers produced well-written, logically coherent reports. Logical Consistency scored 0.70–0.90 across the board — even in the reports that scored 12% overall.

The reports that scored lowest were the best-written ones. Polished, authoritative, structured — and epistemically unreliable.

TPMN Checker measures something different: not whether the output sounds right, but whether the reasoning is traceable. Can the AI prove how it got there?

That's epistemic traceability. It's what separates trustworthy output from confident output.


Why this matters

Every person reading this has shipped AI-generated content — a report, a summary, an analysis, a PRD, a code review. Some of that content contained overclaims you didn't catch. Not because the facts were wrong, but because the reasoning exceeded the evidence.

That's the gap TPMN Checker fills.

It's not a hallucination detector (those check facts). It's not a grammar checker (those check writing). It's a reasoning traceability tool — it tells you which parts of your AI output are grounded, which are inferred, and which are extrapolated beyond the evidence.

AI audits AI. But the standard comes from humans.

The truth filter is powered by AI. It uses LLMs to evaluate LLM output. That creates a circular problem: who grades the grader?

Our answer — same as in the video at [1:03]: you do.

When you use TPMN Checker and disagree with a score, that disagreement is data. Collected with consent, aggregated across users, and analyzed for patterns — your evaluations become the ground truth that calibrates the system.

We call this Human Ground Truth. AI processes. AI suggests. But the standard for what counts as honest reasoning — that comes from humans.


Try it yourself

TPMN Checker runs today inside Claude, ChatGPT, Cursor, and any MCP-compatible environment.

Connect (once)

Claude.ai or ChatGPT — zero install:

  1. Go to your AI tool's connector/app settings
  2. Add custom connector: https://mcp-tpmn-checker.gemsquared.ai
  3. Complete OAuth login

CLI:

```bash
npx @gem_squared/setup
```

Standard use case — the pattern that works

You probably already have AI-generated content sitting in a doc right now — a research summary, a PRD, a financial analysis, a code review. Here's what to do with it:

Step 1. Paste your AI output into the conversation.

Step 2. Ask: "Verify this with the gem2 truth filter."

Step 3. Read the score. See which claims are grounded, which are extrapolated, which have no source.

Step 4. Ask: "Create a grounded replacement prompt using gem2 contract writer."

Step 5. Ask AI to proceed with the new prompt. Watch what you get.

That's the loop: verify → ground → regenerate. The same loop that took our research from 18% to 77%.
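
If you want to script that loop instead of running it conversationally, here is a conceptual sketch. callTool is a hypothetical stand-in for however your MCP client invokes the gem2_truth_filter and tpmn_contract_writer tools; the argument and response shapes, and the 75% threshold, are assumptions for illustration, not a published GEM² SDK.

```typescript
// Conceptual sketch of verify → ground → regenerate.
// callTool is a hypothetical wrapper around your MCP client; the tool argument
// and response shapes below are assumptions, not a documented API.

type ToolCall = (tool: string, args: Record<string, unknown>) => Promise<any>;

async function verifyGroundRegenerate(
  callTool: ToolCall,
  generate: (prompt: string) => Promise<string>, // your LLM call
  draft: string, // the AI output you want to check
  task: string,  // the original research task
): Promise<{ output: string; truthScore: number }> {
  // 1. Verify: score the existing output.
  const report = await callTool("gem2_truth_filter", { content: draft });
  if (report.truthScore >= 75) {
    return { output: draft, truthScore: report.truthScore };
  }

  // 2. Ground: generate an epistemically constrained prompt for the same task.
  const contract = await callTool("tpmn_contract_writer", { task });

  // 3. Regenerate with the grounded prompt, then re-verify.
  const revised = await generate(contract.prompt);
  const recheck = await callTool("gem2_truth_filter", { content: revised });
  return { output: revised, truthScore: recheck.truthScore };
}
```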

Try it for free.

Get started at gemsquared.ai


The specification is open

TPMN-PSL (Truth-Provenance Markup Notation — Prompt Specification Language) is the open specification behind the checker. It's released under CC-BY 4.0. Anyone can read it, implement it, or extend it.

The specification defines:

  • Five epistemic tags (⊢ grounded, ⊨ inferred, ⊬ extrapolated, ⊥ unknown, ? speculative)
  • Three prohibited reasoning patterns (SPT: snapshot→trend, local→global, thin→broad)
  • Three-phase verification protocol (pre-flight, inline, post-flight)
  • Three operational modes (strict, refine, interpolate)

Read the specification
GitHub repository


What we learned from dogfooding

Five rounds of testing our own tool on our own AI research taught us three things:

1. No provider in our test was inherently honest. Claude — the provider our tool runs on — scored 18% without epistemic constraints. Every provider overclaimed when unconstrained. The difference is the specification, not the model.

2. Structured prompts produce measurably better output. A 59-point improvement from the same provider on the same task, just by using a gem2-generated prompt. That's not marginal — that's transformational. And you don't need to understand the specification to use it — just ask your AI to create a grounded prompt with gem2 tools.

3. The instinct to overclaim persists. Even with explicit instructions to avoid unsupported claims, two out of three providers violated the rules. The prompt helps. The prompt isn't enough. That's why verification exists as a separate step — because you can't trust the AI to police itself, no matter how well you prompt it.


The question isn't whether the answer is right or wrong. It's whether the reasoning is honest.

As we say in the video at [0:57]: "So, who decides what's true? Not Claude. Not ChatGPT. Not Gemini. You do."

What's hiding in your AI output? Now you can see it.


TPMN Checker is in pre-GA. 12 MCP tools, 6 domains, 3 providers. Free to start.

Built by Inseok Seo (David) — GEM²-AI

gemsquared.ai
Watch: Three AIs. Three Answers.
TPMN-PSL Specification
GitHub

TPMN-PSL is an open specification — not a product. If you believe AI outputs should be auditable, read the spec, open an issue, or submit a PR. The standard gets better when more people challenge it.
